alt.hn

3/14/2026 at 11:46:29 AM

Please do not A/B test my workflow

https://backnotprop.com/blog/do-not-ab-test-my-workflow/

by ramoz

3/14/2026 at 12:16:50 PM

The framing of A/B testing as "silent experimentation on users" and invoking Meta is a little much. I don't believe A/B testing is inherently evil; you need to get the test design right, and that would be better framing for the post imo. That being said, vastly reducing an LLM's effectiveness as part of an A/B test isn't acceptable, which appears to be the case here.

by krisbolton

3/14/2026 at 12:35:31 PM

> I don't believe A/B testing is inherently evil; you need to get the test design right, and that would be better framing for the post imo.

I disagree in the case of LLMs.

AI already has a massive problem with reproducibility and reliability, and AI firms gleefully kick this problem down to the users. "Never trust its output".

It's already enough of a pain in the ass to constrain these systems without the companies silently changing things around.

And this also pretty much ruins any attempt to research Claude Code's long-term effectiveness in an organisation. Any negative result can now be thrown straight into the trash because of the chance that Anthropic put you on the wrong side of an A/B test.

> That being said, vastly reducing an LLM's effectiveness as part of an A/B test isn't acceptable, which appears to be the case here.

The open question here is whether or not they were doing similar things to their other products. Claude Code shitting out a bad function is annoying but should be caught in review.

People use LLMs for things like hiring. An undeclared A/B test there would be ethically horrendous and a legal nightmare for the client.

by SlinkyOnStairs

3/14/2026 at 10:49:11 PM

Would you have a problem with the following scheme?

Every client is free (and encouraged) to report back its financial health: profit for that hour/day/month/...

The AB(-X) test run by the LLM provider uses the correlation between a client's profit and its AB(-X) test arm, so that participating in the testing improves your profit, statistically speaking (sometimes up, sometimes down, but on average up).

You may say, what about that hiring decision? One thing is certain: when companies make more profit, they are more likely to seek and accept more employees.

by DoctorOetker

3/15/2026 at 12:49:13 PM

That sounds like a good way to get extreme short-term optimization.

Say a particular finetune prioritizes profits right now and makes recommendations like "cut down on maintenance, you can make up for it later with your increased profits and their interest". It produces more profit, and wins the A/B test. Later the chickens come home to roost.

You can reduce the problem by using long-term indicators, but then each A/B test is very slow.
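As a toy sketch of that dynamic (all numbers here are made-up assumptions, not anything measured): a "cut maintenance" policy can beat a balanced policy over the short window an A/B test actually observes, while losing badly over a longer horizon.

```python
def run_policy(maintain: bool, horizon: int) -> float:
    """Toy model: profit each period is 20% of capacity.
    Maintenance costs 10/period and keeps capacity at 100;
    skipping it lets capacity decay 5% per period."""
    capacity = 100.0
    total = 0.0
    for _ in range(horizon):
        profit = 0.2 * capacity
        if maintain:
            profit -= 10.0      # pay for maintenance
        else:
            capacity *= 0.95    # deferred maintenance erodes capacity
        total += profit
    return total

# Short evaluation window: the "cut maintenance" arm wins the test.
print(run_policy(False, 10), ">", run_policy(True, 10))
# Long window: the balanced arm wins, but the test already concluded.
print(run_policy(True, 60), ">", run_policy(False, 60))
```

Which arm "wins" depends entirely on the evaluation horizon, which is the point: a profit-correlation test with a short window systematically rewards the short-term policy.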

by 986aignan

3/14/2026 at 5:20:42 PM

Anyone who trusts LLMs to do anything has shit coming. You cannot trust them. If you do, that's on you. I don't care if you want to trust them to manage hiring; you can't. If you do anyway, then the ethical problems are squarely on you.

People keep complaining about LLMs taking jobs; meanwhile others complain that they can't take their jobs, and here I am just using them as a useful tool more powerful than a simple search engine, and it's great. No chance it'll replace me, but it sure helps me do my job better and faster.

by sfn42

3/14/2026 at 1:13:37 PM

I think you would be hard pushed to find any big tech company that doesn't do some kind of A/B testing. It's pretty much required if you want to build a great product.

by londons_explore

3/14/2026 at 1:33:12 PM

Yeah, that's why we didn't have anything anyone could possibly consider as a "great product" until A/B testing existed as a methodology.

Or, you could, you know, try to understand your users without experimenting on them, like countless others have managed to do before, and still shipped "great products".

by embedding-shape

3/14/2026 at 10:17:35 PM

I know this is a salty take, but reliance on A/B testing to design products is indicative of product decision-makers who don't know what they are doing and don't know what their product should be. It's like a chef saying "I want to make a pancake," then trying 50 different combinations of ingredients until one of them ends up being a pancake. If you have to test whether a product works / is good / is profitable, then you didn't know what you were doing in the first place.

Using A/B tests to safely deploy and test bug fixes and change requests? Totally different story.

by ryandrake

3/14/2026 at 2:36:49 PM

A responsible company develops an informed user group they can test new changes with and receive direct feedback they can take action on.

by wavefunction

3/14/2026 at 4:32:18 PM

A big tech company has ~10k experiments running at once. Some engineers will be kicking off a few experiments every day. Some of these experiments will be minor things like font sizes or the wording of buttons, whilst others will be entirely new features or changes in rules.

Focus groups have their place, but cannot collect nearly the same scale of information.

by londons_explore

3/14/2026 at 4:39:54 PM

I think a lot of people (myself included) would just like to not be constantly part of some sort of revenue optimization effort.

I don't care, at all, about the "scale of information" for the company's sake.

by rkomorn

3/14/2026 at 8:38:32 PM

Often the experiments are not for revenue; many of them will be optimizing user-experience metrics, e.g. load time or user drop-off rate.

They are clearly good for both user satisfaction and the company's bottom line.

by londons_explore

3/14/2026 at 9:35:56 PM

As someone who works in these orgs: only a small fraction are about user-experience metrics. 90+% are about extracting more short-term value, with unknown second-order effects on usability.

by jjj123