Lines of code that beat A/B testing (2012)

Pure, disinterested A/B testing, where the goal is just to find the best way to do it and there's enough leverage and traffic that funding the testing is worthwhile, is rare.

More frequently, A/B testing is a political technology that allows teams to move forward with changes to core, vital services of a site or app. By putting a new change behind an A/B test, the team technically derisks the change, by allowing it to be undone rapidly, and politically derisks the change, by tying its deployment to rigorous testing that proves it at least does no harm to the existing process before applying it to all users. The change was judged to be valuable when development effort went into it, whether for technical, branding or other reasons.

In short, not many people want to funnel users through N code paths with slightly different behaviors, because not many people have a ton of users, a ton of engineering capacity, and a ton of potential upside from marginal improvements. Two-path tests solve the more common problem of wanting to make major changes to critical workflows without killing the platform.

4 days agoawkward

> politically derisks the change, by tying its deployment to rigorous testing that proves it at least does no harm to the existing process before applying it to all users.

I just want to drop here the anecdata that I've worked for a total of about 10 years in startups that proudly call themselves "data-driven" and which worshipped "A/B testing." One of them hired a data science team which actually did some decently rigorous analysis on our tests and advised things like when we had achieved statistical significance, how many impressions we needed to have, etc. The other did not and just had someone looking at very simple comparisons in Optimizely.

In both cases, the influential management people who ultimately owned the decisions would simply rig every "test" to fit the story they already believed, by doing things like running the test until the results looked "positive" but not until they were statistically significant. Or by measuring several metrics and deciding later on to make the decision based on whichever one was positive [at the time]. Or by skipping testing entirely and saying we'd just "used a pre/post comparison" to prove it out. Or even by just dismissing a 'failure,' saying we would do it anyway because it's foundational to X, Y, and Z, which really will improve (insert metric). The funny part is that none of these people thought they were playing dirty; they believed that they were making their decisions scientifically!

Basically, I suspect a lot of small and medium companies say they do "A/B testing" and are "data-driven" when really they're just using slightly fancy feature flags and relying on some director's gut feelings.

4 days agoxp84
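
For readers who want to see why "running the test until the results looked positive" is a problem, here is a minimal simulation sketch. It assumes an A/A setup (both arms share the same true conversion rate), a two-proportion z-test, and a 1.96 cutoff. The function names, batch size, and rates are illustrative choices, not anything taken from the comment above.

```python
import math
import random


def z_stat(conv_a, n_a, conv_b, n_b):
    """Two-proportion z statistic using a pooled standard error."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    return (conv_b / n_b - conv_a / n_a) / se


def false_positive_rates(experiments=400, looks=20, batch=50, rate=0.1, seed=0):
    """Run A/A tests (no real difference) and compare two stopping rules.

    Returns (peeking_rate, fixed_rate): the share of experiments flagged as
    significant at ANY interim look, and at the single final look only.
    """
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(experiments):
        conv_a = conv_b = n = 0
        ever_significant = False
        for _ in range(looks):
            # Both arms draw from the same true rate, so any "win" is noise.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            if abs(z_stat(conv_a, n, conv_b, n)) > 1.96:
                ever_significant = True  # a peeker would stop here
        peeking_hits += ever_significant
        # Fixed horizon: judge only once, after all the data is in.
        fixed_hits += abs(z_stat(conv_a, n, conv_b, n)) > 1.96
    return peeking_hits / experiments, fixed_hits / experiments
```

Calling `false_positive_rates()` returns the false-positive share for the peeking rule and for a single look at the end. Under these assumptions the first number should come out well above the second, even though nothing about the two arms differs.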

At a small enough scale, gut feelings can be totally reasonable; taste is important and I'd rather follow an opinionated leader with good taste than someone who sits on their hands waiting for "the data". Anyway, your investors want you to move quickly because they're A/B testing you for survivability against everything else in their portfolio.

The worst is surely when management makes the investments in rigor but then still ignores the guidance and goes with the gut feelings that were available all along.

4 days agomikepurvis

Huge plus one to this. We undervalue when to bet on data and when to be comfortable with gut.

4 days agolegendofbrando

I think your management was acting more competently than you are giving them credit for.

If A/B testing data is weak or inconclusive, and you’re at a startup with time/financial pressure, I’m sure it’s almost always better to just make a decision and move on than to spend even more time on analysis and waiting to achieve some fixed level of statistical power. It would be a complete waste of time for a company with limited manpower that needs to grow 30% per year to chase after marginal improvements.

4 days agoweitendorf

One shouldn't claim to be "data-driven" when one doesn't have a clue what that means. Just admit that you will follow the leader's gut feeling at this company.

4 days agozelphirkalt

In all cases, data-driven means we establish context for our gut decisions. In the end, it's always a judgement call.

4 days agoclosewith

Robin Hanson recently related the story of a firm which actually made data-driven decisions, back in the early 80's: https://www.overcomingbias.com/p/hail-jeffrey-wernick

It went really well, and then nobody ever tried it again.

4 days agokhafra

That's a counter-example. The prediction markets were used to inform gut feelings, not controlling the company, and in the end, when he wanted to stop, he stopped following the markets and shut the company.

3 days agoclosewith

> Basically, I suspect a lot of small and medium companies say they do "A/B testing" and are "data-driven" when really they're just using slightly fancy feature flags and relying on some director's gut feelings.

See also Scrum and Agile. Or continuous deployment. Or anything else that's hard to do well, and easier to just cargo-cult some results on and call it done.

4 days agopetesergeant

I worked at an almost-medium-sized company and we did quite a lot of A/B testing. In most cases the data people would be like "no meaningful difference in user behaviour". Going by gut feeling and overall product metrics (like user churn) turns out to be pretty okay most of the time.

The one place that A/B testing seemed to have a huge impact was on the acquisition flow and onboarding, but not in the actual product per se.

4 days agoDanielHB

> In short, not many people want to funnel users through N code paths with slightly different behaviors, because not many people have a ton of users, a ton of engineering capacity, and a ton of potential upside from marginal improvements.

I’ve been in companies that have tried dozens if not hundreds of A/B tests with zero statistically significant results. I figure by the law of probabilities they would have gotten at least a single significant experiment but most products have such small user bases and make such large changes at a time that it’s completely pointless.

All my complaints fell on deaf ears until the PM in charge would get on someone’s bad side and then that metric would be used to push them out. I think they’re largely a political tool like all those management consultants that only come in to justify an executive’s predetermined goals.

4 days agothrowup238
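
The "law of probabilities" intuition in the comment above can be made concrete with a short calculation. This sketch assumes the tests are independent, that none of the changes has any real effect, and that the significance threshold is 5%; the function name is made up for illustration.

```python
def prob_at_least_one_significant(num_tests, alpha=0.05):
    """Chance that at least one of `num_tests` independent null experiments
    crosses the significance threshold purely by chance."""
    return 1 - (1 - alpha) ** num_tests


# "Dozens if not hundreds" of tests, each with a 5% false-positive rate.
for n in (12, 24, 100):
    print(n, round(prob_at_least_one_significant(n), 3))
```

Under those assumptions, a hundred tests with zero significant results would be very unlikely, which suggests the tests were either underpowered in some systematic way or analyzed with a stricter threshold.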

> I’ve been in companies that have tried dozens if not hundreds of A/B tests with zero statistically significant results.

What I've seen in practice is that some places trust their designers' decisions and only deploy A/B tests when competent people disagree, or there's no clear, sound reason to choose one design over another. Surprise surprise, those alternatives almost always test very close to each other!

Other places remove virtually all friction from A/B testing and then use it religiously for every pixel in their product, and they get results, but often it's things like "we discovered that pink doesn't work as well as red for a warning button," stuff they never would have tried if they didn't have to feed the A/B machine.

From all the evidence I've seen in places I've worked, the motivating stories of "we increased revenue 10% by a random change nobody thought would help" may only exist in blog posts.

4 days agodkarl

I think trusting your designers is probably the way to go for most teams. Good designers have solid intuitions and design principles for what will increase conversion rates. Many designers will still want a/b tests because they want to be able to justify their impact, but they should probably be denied. For really important projects designers should do small sample size research to validate their designs like we would do in the past.

I think a/b tests are still good for measuring stuff like system performance, which can be really hard to predict. Flipping a switch to completely change how you do caching can be scary.

4 days agozeroCalories

A/B tests on user interfaces are very annoying when you are on the phone trying to guide someone through a website. "Click the green button on the left" - "What do you mean? There is nothing green on the screen." - "Are you on xyz.com? Can you read out the address to me please?" ... Oh, so many hours wasted in tech support.

4 days agoMoru

Meta apps are like that, particularly around accessibility.

It's pretty common for one person to have an issue that no other people have, just because they fell for some feature flag.

4 days agomiki123211

Good designers generally optimize for taste but not for conversions. I have seen so many designs that were ugly as sin that won, as measured by testing. If you want to build a product that is tasteful, designers are the way to go. If you want to build a product optimized for a clear business metric like sales or upgrades or whatnot, experimentation works better.

It just depends on the goals of the business.

4 days agoedmundsauto

In paid B2B SaaS, A/B testing is usually a very good idea for the user acquisition flow and onboarding, but not in the actual product per se.

Once the user has committed to paying, they will probably put up with whatever annoyance you put in their way. Also, if they are paying and something is _really_ annoying, they often contact the SaaS people.

Most SaaS companies don't really care that much about "engagement" metrics (i.e. keeping users IN the product). These are the kind of metrics that are the easiest to see move.

In fact most people want a product they can get in and out ASAP and move on with their lives.

4 days agoDanielHB

Many SaaS companies care about engagement metrics, especially if they have to sell the product, like their revenue depends on salespeople convincing customers to renew or upgrade their licenses at a certain level for so many seats at $x/year.

For example, I worked on a new feature for a product, and the engagement metrics showed a big increase in engagement by several customers' users, and showed that their users were not only using our software more but also doing their work much faster than before. We used that to justify raising our prices -- customers were satisfied with the product before, at the previous rates, and we could prove that we had just made it significantly more useful.

I know of at least one case where we shared engagement data with a power user at a customer who didn't have purchase authority but was able to join it with their internal data to show that use of our software correlated with increased customer satisfaction scores. They took that data to their boss, who immediately bought more seats and scheduled user training for all of their workers who weren't using our software.

We also used engagement data to convince customers not to cancel. A lot of times people don't know what's going on in their own company. They want to cancel because they think nobody is using the software, and it's important to be able to tell them how many daily and hourly users they have on average. You can also give them a list of the most active users and encourage them to reach out and ask what the software does for them and what the impact would be of cancelling.

4 days agodkarl

> I’ve been in companies that have tried dozens if not hundreds of A/B tests with zero statistically significant results.

Well, at least it looks like they avoided p-hacking to show more significance than they had! That's ahead of much of science, alas.

4 days agoeru

> I’ve been in companies that have tried dozens if not hundreds of A/B tests with zero statistically significant results.

Yea, I've been here too. And in every analytics meeting everyone went "well, we know it's not statistically significant but we'll call it the winner anyway". Every. Single. Time.

Such a waste of resources.

4 days agorco8786

Is it a waste? You proved the change wasn't harmful.

4 days agohamandcheese

Statistically insignificant means you didn't prove anything by usual standards. I do agree that it's not a waste, as knowing that you have a 70% chance that you're going in the right direction is better than nothing. The 2 sigma crowd can be both too pessimistic and not pessimistic enough.

4 days agozeroCalories

Statistical significance depends on what you're trying to prove. Looking for substantial harm is a lot easier than figuring out which option is simply better, depending on what level of substantial you're looking for.

If your error bar for some change goes from negative 4 percent to positive 6 percent, it may or may not be better, but it's safe to switch to.

3 days agoDylan16807
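
The "safe to switch to" reasoning above amounts to a non-inferiority check: the interval only has to exclude harm beyond some tolerated margin, not exclude zero. Below is a minimal sketch, assuming a normal-approximation (Wald) interval on the absolute difference in conversion rates. Treating the comment's "negative 4 percent to positive 6 percent" as an absolute difference, and the margins used in the example, are assumptions made for illustration.

```python
import math


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, z=1.96):
    """Approximate 95% interval for (rate_b - rate_a), Wald style."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


def safe_to_switch(interval, harm_margin):
    """True if even the pessimistic end of the interval stays within the
    harm we are willing to tolerate (harm_margin is a positive number)."""
    lower, _upper = interval
    return lower >= -harm_margin


interval = (-0.04, 0.06)  # the error bar from the comment above
print(safe_to_switch(interval, harm_margin=0.05))  # tolerant margin
print(safe_to_switch(interval, harm_margin=0.02))  # strict margin
```

The same interval passes or fails depending on the margin, which is the point being made: "is it better?" and "is it substantially harmful?" are different questions with different data requirements.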

To prove that a change isn't harmful is still a hypothesis.

4 days agomenaerus

You can still enshittify something by degrees this way.

I think the disconnect here is some people thinking A/B testing is something you try once a month, and someplace like Amazon where you do it all the time and with hundreds of employees poking things.

3 days agohinkley

That tracks: I’ve primarily seen A/B tests used as a mechanism for gradual rollout rather than pure data-driven experimentation. Basically, expose functionality to internal users by default, then slowly expand it outwards to early adopters, and then increment it to 100% for GA.

It’s helpful in continuous delivery setups since you can test and deploy the functionality and move the bottleneck for releasing beyond that.

4 days agoljm

I wouldn’t call that A/B testing but rather a gradual roll-out.

4 days agobaxtr

If you roll it back upon seeing problems, then you're doing something meaningful, at least. IMO 90+% of the value of A/B testing comes from two things, a) forcing engineers to build everything behind flags, and b) making sure features don't crater your metrics before freezing them in and making them much more difficult to remove (both politically and technically).

Re: b), if you've ever gotten into a screaming match with a game designer angry over the removal of their pet feature, you will really appreciate the political cover that having numbers provides...

4 days agocornel_io

I think parent is confusing A/B testing with feature flags, which can be used for A/B tests but also for roll-outs.

4 days agoalex7o

Not the parent but some actual practitioners. A change is based on gut feeling, and it's usually correct, but internal politics require a demonstration of impartiality, so an "A/B test" is run to show that the change is "objectively better", whether the statistics show that or not.

4 days agonine_k

Feature flags tend to be all or nothing and/or A/B testing instrumentation can be used to roll out feature flags.

It’s complicated.

3 days agohinkley

I’m aware of the distinction. A/B testing is the killer app for feature flags from the perspective of business decision makers.

4 days agoawkward

I think gradual rollout can use the same mechanism, but for a different reason: avoiding pushing out a potentially buggy product to all users in one sweep.

It becomes an A/B test when you measure user activity to decide whether to roll out to more users.

4 days agoSomeone

Has my CPU use gone up? No.

Have my error logs gotten bigger? No.

Have my tech support calls gone up? No.

Okay then turn the dial farther.

3 days agohinkley

I feel like you are trying to say "sometimes people just need a feature flag". Which is of course true.

4 days agohamandcheese

A feature flag can still be fully on or fully off.

Why they might conflate A/B testing with gradual rollout is control over who gets the feature flag on and who doesn't.

In a sense, A/B testing is a variant of gradual rollout, where you've done it so you can see differences in feature "performance" (eg. funnel dashboards) vs just regular observability (app is not crashing yet).

Basically, a gradual rollout for the purposes of an A/B test.

4 days agonecovek

Derisking changes may not always work. For example, I don't use Spotify anymore because of their ridiculous A/B tests. In one month I saw three totally different designs of the home page and my favorite playlists page on my Android phone. When you only open Spotify as you start your car, it's ridiculous that you can't find anything while you are in a hurry. That was it for me: I am no longer a subscriber or a user of this shit service. Sometimes these tests are actually harmful. Maybe others are driving and trying to manage Spotify at the same time, and then people actually get killed because of this. Harmless indeed.

4 days agokavenkanum

Ever tried a support call helping someone using a website that has A/B testing on? It's a very frustrating experience where you start to think the user on the other side must have mistyped the url. A lot of time wasted on such calls. And yes, the worst is when these things only happen in things you use only when in a hurry.

4 days agoMoru

Why do you consider it political. Isn't it just a wise thing to do?

4 days agomewpmewp2

One of the assumptions of vanilla multi-armed bandits is that the underlying reward rates are fixed. It's not valid to assume that in a lot of cases, including e-commerce. The author is dismissive and hand-wavy about this, and having worked in e-commerce SaaS I'd be a bit more cautious.

Imagine that you are running MAB on a website with a control/treatment variant. After a bit you end up sampling the treatment a little more, say 60/40. You now start running a sale, and the conversion rate for both sides goes up equally. But since you are now sampling more from the treatment variant, its aggregate conversion rate goes up faster than the control's, so you start weighting even more towards that variant.

Fluctuating reward rates are everywhere in e-commerce, and they tend to destabilise MAB proportions, even on two identical variants; they can even cause it to lean towards the wrong one. There are more sophisticated MAB approaches that try to remove the fixed reward-rate assumption. They have to model a lot more uncertainty, and so optimise more conservatively.

4 days agosweezyjeezy

> ...the conversion rate for both sides goes up equally.

If the conversion rate "goes up equally", why did you not measure this and use that as a basis for your decisions?

> its aggregate conversion rate goes up faster than the control - you start weighting even more towards that variant.

This sounds simply like using bad math. Wouldn't this kill most experiments that start with 10% for the variant that do not provide 10x the improvement?

4 days agonecovek

No. This isn't just bad math.

The problem here is that the weighting of the alternatives changes over time and the thing you are measuring may also change. If you start by measuring the better option, but then bring in the worse option in a better general climate, you could easily conclude the worse option is better.

To give a concrete example, suppose you have two versions of your website, one in English and one in Japanese. Worldwide, Japanese speakers tend to be awake at different hours than English speakers. If you don't run your tests over full days, you may bias the results to one audience or the other. Even worse, weekend visitors may be much different than weekday visitors so you may need to slow down to full weeks for your tests.

Changing tests slowly may mean that you can only run a few tests unless you are looking at large effects which will show through the confounding effects.

And that leads back to the most prominent normal use which is progressive deployments. The goal there is to test whether the new version is catastrophically worse than the old one so that as soon as you have error bars that bound the new performance away from catastrophe, you are good to go.

3 days agoted_dunning

I mean, sure you could test over only part of the day, but if you do, that is, imho, bad math.

Eg. I could sum up 10 (decimal) and 010 (octal) as 20, but because they were the same digits in different numbering systems, you need to normalize the values first to the same base.

Or I could add up 5 GBP, 5 USD, 5 EUR and 5 JPY and claim I got 20 of "currency", but it doesn't really mean anything.

Otherwise, we are comparing incomparable values, and that's bad math.

Sure, percentages are what everybody gets wrong (hey, percentage points vs. percentages), but that does not make them not wrong. And knowing what is comparable when you simply talk in percentages, even more so (as per your examples).

3 days agonecovek

It is a universal truth that people fuck up statistical math.

    There are three kinds of lies: lies, damn lies, and statistics

If you aren’t testing at exactly 50/50 - and you can’t because my plan for visiting a site and for how long will never be equivalent to your plan, then any other factors that can affect conversion rate will cause one partition to go up faster than the other. You have to test at a level of Amazon to get statistical significance anyway.

And as many of us have told people until they’re blue in the face: we (you) are not a FAANG company and pretending to be one won’t work.

3 days agohinkley

One of the other comment threads has a link to a James LeDoux post about MAB with EG, UCB1, BUCB and EXP3, with EXP3, from what I've seen, marketed as an "adversarial" MAB method [0] [1].

I found a post [2] of doing some very rudimentary testing on EXP3 against UCB to see if it performs better in what could be considered an adversarial environment. From what I can tell, it didn't perform all that well.

Do you, or anyone else, have an actual use case for when EXP3 performs better than any of the standard alternatives (UCB, TS, EG)? Do you have experience with running MAB in adversarial environments? Have you found EXP3 performs well?

[0] https://news.ycombinator.com/item?id=42650954#42686404

[1] https://jamesrledoux.com/algorithms/bandit-algorithms-epsilo...

[2] https://www.jeremykun.com/2013/11/08/adversarial-bandits-and...

3 days agoabetusk

Motivations can vary on a diurnal basis too. Or based on location. It means something different if I’m using homedepot.com at home or standing in an aisle at the store.

And with physical retailers with online catalogs, an online sale of one item may cannibalize an in-store purchase of not only that item but three other incidental purchases.

But at the end of the day your 60/40 example is just another way of saying: you don’t try to compare two fractions with a different denominator. It’s a rookie mistake.

3 days agohinkley

Good point about fluctuating rates for e.g the sales period. But couldn't you then pick a metric that doesn't fluctuate?

Out of curiosity, where did you work? I'm in the same space as you.

4 days agoalex5207

I don't follow. In this case would sampling 50/50 always give better/unbiased results on the experiment?

4 days agoertdfgcvb

Sampling 50/50 will always give you the best chance of picking the best ultimate 'winner' in a fixed time horizon, at the cost of only sampling the winning variant 50% of the time. That's true if the reward rates are fixed or not. But some changes in reward rates will also cause MAB aggregate statistics to skew in a way that they shouldn't for a 50/50 split yeah.

4 days agosweezyjeezy

What do you think of using the epsilon-first approach then? We could explore for that fixed time horizon, then start choosing greedy after that. I feel like the only downside is that adding new arms becomes more complicated.

4 days agozeroCalories

What percent of companies using A/B testing do you think know what the Texas Sharpshooter is and how to identify it, let alone what epsilon is or what it means?

3 days agohinkley

Yes.

4 days agolern_too_spel

I agree that there's an exploration-exploitation tradeoff, but for what you specifically suggest wouldn't you presumably just normalize by sample size? You wouldn't allocate based off total conversions, but rather a percentage.

4 days agotjbai

Imagine a scenario where option B does 10x better than option A during the morning hours but -2x worse the rest of the day. If you start the multi armed bandit in the morning it could converge to option B quickly and dominate the rest of the day even though it performs worse then.

Or in the above scenario option B performs a lot better than option A but only with the sale going, otherwise option B performs worse.

4 days agojvans

One of the problems we caught only once or twice: mobile versus desktop shifting with time of day, and what works on mobile may work worse than on desktop.

We weren’t at the level of hacking our users, just looking at changes that affect response time and resource utilizations, and figuring out why a change actually seems to have made things worse instead of better. It’s easy for people to misread graphs. Especially if the graphs are using Lying with Statistics anti patterns.

3 days agohinkley

Yes, but here's an exaggerated version: say we were to sample for a week at 50/50 when the base conversion rate was at 4%, then sample at 25/75 for a week with the base conversion rate bumped up to 8% due to a sale.

The average base rate for the first variant is 5.3%, the second is 6.4%. Generally the favoured variant's average will shift faster because we are sampling it more.

4 days agosweezyjeezy
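
The 5.3% and 6.4% figures above can be reproduced with a few lines of arithmetic. This sketch assumes the same total traffic in both weeks and that the "first variant" is the one that drops to a 25% share in week two; the function and variable names are illustrative.

```python
def pooled_rate(periods):
    """Aggregate conversion rate for one variant.

    `periods` is a list of (traffic_share, conversion_rate) pairs, one per
    week, with equal total traffic assumed in every week.
    """
    conversions = sum(share * rate for share, rate in periods)
    samples = sum(share for share, _rate in periods)
    return conversions / samples


# Week 1: 50/50 split at a 4% base rate. Week 2: 25/75 split at 8% (sale).
first_variant = [(0.50, 0.04), (0.25, 0.08)]
second_variant = [(0.50, 0.04), (0.75, 0.08)]

print(f"{pooled_rate(first_variant):.1%}")
print(f"{pooled_rate(second_variant):.1%}")
```

Both variants convert identically within each week, yet the pooled rates differ because the second variant collected more of its samples during the sale.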

Uhm, this still sounds like just bad math.

While it's non-obvious this is the effect, anyone analyzing the results should be aware of it and should only compare weighted averages, or per distinct time periods.

And therein is the largest problem with A/B testing: it's mostly done by people not understanding the math subtleties, thus they will misinterpret results in either direction.

4 days agonecovek
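
One way to do the per-period comparison suggested above is to compute the lift inside each time period and only then combine. The sketch below assumes raw (conversions, samples) counts per week for each variant, with numbers chosen to mirror the 4% then 8% scenario from earlier in the thread. The helper names and the choice to weight each period by its total sample size are assumptions, not a prescribed method.

```python
def rate(conversions, samples):
    return conversions / samples


def pooled_lift(control, treatment):
    """Naive comparison: lump every period together before comparing."""
    c_conv = sum(c for c, _n in control)
    c_n = sum(n for _c, n in control)
    t_conv = sum(c for c, _n in treatment)
    t_n = sum(n for _c, n in treatment)
    return rate(t_conv, t_n) - rate(c_conv, c_n)


def per_period_lifts(control, treatment):
    """Treatment minus control, computed separately within each period."""
    return [rate(tc, tn) - rate(cc, cn)
            for (cc, cn), (tc, tn) in zip(control, treatment)]


def stratified_lift(control, treatment):
    """Average of the per-period lifts, weighted by period sample size."""
    lifts = per_period_lifts(control, treatment)
    weights = [cn + tn for (_cc, cn), (_tc, tn) in zip(control, treatment)]
    return sum(l * w for l, w in zip(lifts, weights)) / sum(weights)


# (conversions, samples) per week. Week 2 is the sale with a 25/75 split.
control = [(200, 5000), (200, 2500)]
treatment = [(200, 5000), (600, 7500)]

print(pooled_lift(control, treatment))      # positive, an artifact
print(stratified_lift(control, treatment))  # no difference within periods
```

The pooled number shows a lift that does not exist in either week; the per-period view removes it because each comparison is made under the same conditions.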

Agreed, and articles like this don't help. That's the only point I was trying to make really.

3 days agosweezyjeezy

The problem with this approach is that it requires the system doing randomization to be aware of the rewards. That doesn't make a lot of sense architecturally – the rewards you care about often relate to how the user engages with your product, and you would generally expect those to be collected via some offline analytics system that is disjoint from your online serving system.

Additionally, doing randomization on a per-request basis heavily limits the kinds of user behaviors you can observe. Often you want to consistently assign the same user to the same condition to observe long-term changes in user behavior.

This approach is pretty clever on paper but it's a poor fit for how experimentation works in practice and from a system design POV.

4 days agotaion

I don't know, all of these are pretty surmountable. We've done dynamic pricing with contextual multi-armed bandits, in which each context gets a single decision per time block and gross profit is summed up at the end of each block and used to reward the agent.

That being said, I agree that MABs are poor for experimentation (they produce biased estimates that depend on somewhat hard-to-quantify properties of your policy). But they're not for experimentation! They're for optimizing a target metric.

4 days agohruk

Surmountable, yes, but in practice it is often just too much hassle. If you are doing tons of these tests you can probably afford to invest in the infrastructure for this, but otherwise AB is just so much easier to deploy that it does not really matter to you that you will have a slightly ineffective algo out there for a few days. The interpretation of the results is also easier as you don't have to worry about time sensitivity of the collected data.

4 days agoempiko

You do know Amazon got sued and lost for showing different prices to different users? That kind of price discrimination is illegal in the US. Related to actual discrimination.

I think Uber gets away with it because it’s time and location based, not person based. Of course if someone starts pointing out that segregation by neighborhoods is still a thing, they might lose their shiny toys.

3 days agohinkley

Indeed, we are well aware.

3 days agohruk

You can do that, but now you have a runtime dependency on your analytics system, right? This can be reasonable for a one-off experimentation system but it's not likely you'll be able to do all of your experimentation this way.

3 days agotaion

No, you definitely have to pick your battles. Something that you want to continuously optimize over time makes a lot more sense than something where it's reasonable to test and then commit to a path forever.

3 days agohruk

Hey, I'd love to hear more about dynamic pricing with contextual multi-armed bandits. If you're willing to share your experience, you can find my email on my profile.

3 days agojacob019

You can assign multiarm bandit trials on a lazy per user basis.

So first time user touches feature A they are assigned to some trial arm T_A and then all subsequent interactions keep them in that trial arm until the trial finishes.

4 days agoivalm

The systems I’ve used pre-allocate users, effectively at random, to an arm by hashing their user ID or equivalent.

4 days agokridsdale1

To make sure user id U doesn’t always end up in eg control group it’s useful to concatenate the id with experiment uuid.

4 days agoivalm
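
A minimal sketch of the sticky assignment described in the last few comments: hash the user ID together with an experiment identifier so the same user always lands in the same arm within one experiment, but not in the same arm across experiments. The function name, the `experiment_id:user_id` key format, and the use of SHA-256 are illustrative assumptions.

```python
import hashlib


def assign_arm(user_id, experiment_id, arms=("control", "treatment")):
    """Deterministically map a user to an arm for a given experiment."""
    # Mixing in the experiment ID means a user who is in "control" for one
    # experiment is not automatically in "control" for every other one.
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes as an integer, then pick an arm.
    bucket = int.from_bytes(digest[:8], "big") % len(arms)
    return arms[bucket]


print(assign_arm("user-42", "checkout-button-test"))
print(assign_arm("user-42", "checkout-button-test"))  # same arm again
```

No state has to be stored for this to work, which is why it fits systems where the serving path cannot easily look up a per-user assignment.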

How do you handle different users having different numbers of trials when calculating the "click through rate" described in the article?

4 days agoryan-duve

careful when doing that though! i've seen some big eyes when people assumed IDs to be uniformly randomly distributed and suddenly their "test group" was 15% instead of the intended 1%. better to generate a truly random value using your language's favorite crypto functions and be able to work with it without fear of busting production

4 days agos1mplicissimus

The user ID is non uniform after hash and mod? How?

4 days agonp_tedious

in addition to the other excellent comments: they will become non-uniform once you start deleting records. that will break all hopes you might have had of modulo and percentages being reliable partitions, because the "holes" in your ID space could be maximally bad for whatever use case you thought up.

a day agos1mplicissimus

If you mod by anything other than a power of two, it won't be. https://lemire.me/blog/2019/06/06/nearly-divisionless-random...

4 days agolern_too_spel

That article is mostly about speed. The following seems like the one thing that might be relevant:

> Naively, you could take the random integer and compute the remainder of the division by the size of the interval. It works because the remainder of the division by D is always smaller than D. Yet it introduces a statistical bias

That's all it says. Is the point here just that 2^31 % 17 is not zero, so 1,2,3 are potentially happening slightly more than 15,16? If so, this is not terribly important

3 days agonp_tedious
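
The size of the modulo bias being debated here can be computed directly rather than argued about. This sketch counts how many integers in a range of a given size fall into each bucket under a plain `%`; the function name is made up, and the 2^31 range and 17 buckets come from the comment above.

```python
def residue_counts(range_size, buckets):
    """How many of the integers 0..range_size-1 land in each bucket
    when bucketed with `value % buckets`."""
    base, extra = divmod(range_size, buckets)
    # The first `extra` residues each receive one additional value.
    return [base + 1 if r < extra else base for r in range(buckets)]


counts = residue_counts(2**31, 17)
print(max(counts) - min(counts))  # difference between the most and least likely bucket
print(1 / min(counts))            # relative size of the bias
```

With a power-of-two bucket count the remainder is zero and every bucket gets the same number of values, which matches the claim that only non-power-of-two moduli are biased. Whether a bias of this size matters is the judgment call the two commenters disagree on.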

> If so, this is not terribly important

It is not uniformly random, which is the whole point.

> That article is mostly about speed

The article is about how to actually achieve uniform random at high speed. Just doing mod is faster but does not satisfy the uniform random requirement.

2 days agolern_too_spel

Just make sure you do the hash right so you don’t end up with cursed user IDs like EverQuest.

3 days agohinkley

As one of the comments below the article states, the probabilistic alternative to epsilon-greedy is worth exploring as well. Take the "bayesian bandit", which is not much more complex but a lot more powerful.

If you crave more bandits: https://jamesrledoux.com/algorithms/bandit-algorithms-epsilo...

4 days agoisoprophlex

Just a warning to those people who are potentially implementing it: it doesn't really matter. The blog author addresses this, obliquely (says that the simplest thing is best most of the time), but doesn't make it explicit.

In my experience, obsessing on the best decision strategy is the biggest honeypot for engineers implementing MAB. Epsilon-greedy is very easy to implement and you probably don't need anything more. Thompson sampling is a pain in the butt, for not much gain.

4 days agotimr

"Easy to implement" is a good reason to use bubble sort too.

In a normal universe, you just import a different library, so both are the same amount of work to implement.

Multiarmed bandit seems theoretically pretty, but it's rarely worth it. The complexity isn't the numerical algorithm but state management.

* Most AB tests can be as simple as a client-side random() and a log file.

* Multiarmed bandit means you need an immediate feedback loop, which involves things like adding database columns, worrying about performance (since each render requires another database read), etc. Keep in mind the database needs to now store AB test outcomes and use those for decision-making, and computing those is sometimes nontrivial (if it's anything beyond a click-through).

* Long-term outcomes matter more than short-term. "Did we retain a customer" is more important than "did we close one sale."

In most systems, the benefits aren't worth the complexity. Multiple AB tests also add testing complexity. You want to test three layouts? And three user flows? Now, you have nine cases which need to be tested. Add two color schemes? 18 cases. Add 3 font options? 54 cases. The exponential growth in testing is not fun. Fire-and-forget seems great, but in practice, it's fire-and-maintain-exponential complexity.
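
The multiplication above is easy to check by enumerating the combinations (the variant names are placeholders):

```python
from itertools import product

layouts = ["layout_a", "layout_b", "layout_c"]
flows = ["flow_a", "flow_b", "flow_c"]
color_schemes = ["light", "dark"]
fonts = ["font_a", "font_b", "font_c"]

# Every concurrent test multiplies the number of variants QA has to cover.
combinations = list(product(layouts, flows, color_schemes, fonts))
print(len(combinations))  # 3 * 3 * 2 * 3 = 54
```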

And those conversion differences are usually small enough that being on the wrong side of a single AB test isn't expensive.

Run the test. Analyze the data. Pick the outcome. Kill the other code path. Perhaps re-analyze the data a year later with different, longer-term metrics. Repeat. That's the right level of complexity most of the time.

If you step up to multiarm, importing a different library ain't bad.

4 days agoblagie

Multi-armed bandit approaches do not imply an immediate feedback loop. They do the best you can do with delayed feedback or with episodic adjustment as well.

So if you are doing A/B tests, it is quite reasonable to use Thompson sampling at fixed intervals to adjust the proportions. If your response variable is not time invariant, this is actually best practice.
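
One possible reading of "Thompson sampling at fixed intervals": at the end of each interval, estimate the posterior probability that each arm is best and use those probabilities as the traffic split for the next interval. A minimal sketch, assuming binary rewards and Beta(1, 1) priors; the function name and the counts are illustrative.

```python
import random


def allocation_from_posterior(successes, failures, draws=10_000, rng=None):
    """Estimate P(arm is best) per arm by sampling Beta(1 + s, 1 + f) posteriors.

    The returned proportions can serve as the traffic split for the next interval.
    """
    rng = rng if rng is not None else random.Random()
    wins = [0] * len(successes)
    for _ in range(draws):
        samples = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]


# Counts observed so far (made-up numbers): arm 0 converts at 5%, arm 1 at 8%.
split = allocation_from_posterior(successes=[50, 80], failures=[950, 920],
                                  rng=random.Random(42))
print(split)
```

Re-running this on a schedule with the cumulative counts gives episodic adjustment without a per-request feedback loop.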

3 days agoted_dunning

Having significant experience with bandits in production, I strongly recommend only using them for immediate feedback. If the rewards are at all disconnected from the action you likely won’t be happy with the results.

12 hours agoorasis

Sorry but bubble sort is a terrible example here. You implement a more difficult sorting algorithm, like quicksort, because the benefits of doing so, versus using bubble sort, are in many cases huge. I.e., the juice is worth the squeeze.

Whereas the comment you’re responding to is rightly pointing out that for most orgs, the marginal gains of using an approach more complex than Epsilon greedy probably aren’t worth it. I.e., the juice isn’t worth the squeeze.

4 days agobartread

You can use FFT, if you prefer. There's no reason to not use optimized numerical code, since it's just a different import.

The difference in performance is smaller and the difference in complexity is much greater. Optimized FFTs are... hairy. But now that someone wrote them, free.

4 days agoblagie

Bubble sort is a great example because it can outperform quicksort if the input is small.

4 days agozeroCalories

You've either missed the point of what I wrote, or you're arguing with someone else.

I'm talking about the difference between epsilon-greedy vs. a more complex optimization scheme within the context of implementing MAB. You're making arguments about A/B testing vs MAB.

3 days agotimr

Thompson Sampling is trivial to implement, especially with binary rewards. ChatGPT can do it reliably from scratch.
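
For reference, a Beta-Bernoulli version fits in a small class. This is a minimal sketch assuming a uniform Beta(1, 1) prior per arm; the class and method names are illustrative.

```python
import random


class BernoulliThompson:
    """Thompson sampling for binary rewards with a Beta(1, 1) prior per arm."""

    def __init__(self, n_arms, rng=None):
        self.successes = [0] * n_arms
        self.failures = [0] * n_arms
        self.rng = rng if rng is not None else random.Random()

    def choose(self):
        # Sample a plausible conversion rate per arm and play the largest sample.
        samples = [self.rng.betavariate(1 + s, 1 + f)
                   for s, f in zip(self.successes, self.failures)]
        return samples.index(max(samples))

    def update(self, arm, reward):
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


# Simulated usage: arm 1 has the higher true conversion rate.
rng = random.Random(0)
true_rates = [0.05, 0.10]
bandit = BernoulliThompson(n_arms=2, rng=rng)
for _ in range(5000):
    arm = bandit.choose()
    bandit.update(arm, rng.random() < true_rates[arm])
pulls = [s + f for s, f in zip(bandit.successes, bandit.failures)]
print(pulls)
```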

12 hours agoorasis

There's a good derivation of the EXP3 algorithm from standard multiplicative weights which is fairly intuitive. The transformation between the two is explained a bit in https://nerva.cs.uni-bonn.de/lib/exe/fetch.php/teaching/ws18.... Once you have the intuition, the actual choice of parameters is just cranking out the math.

4 days agokrackers

A lot of sites don't have enough traffic to get statistical significance with this in a reasonable amount of time and it's almost always testing a feature more complicated than button color where you aren't going to have more than the control and variant.

4 days agotracerbulletx

I’ve only implemented A/B/C tests at Facebook and Google, with hundreds of millions of DAU on the surfaces in question, and three groups is still often enough to dilute the measurement in question below stat-sig.

4 days agokridsdale1

> A lot of sites don't have enough traffic to get statistical significance with this in a reasonable amount of time

What's nice about AB testing is the decision can be made on point estimates, provided the two choices don't have different operational "costs". You don't need to know that A is better than B, you just need to pick one and the point estimate gives the best answer with the available data.

I don't know of a way to determine whether A is better than B with statistical significance without letting the experiment run, in practice, for way too long.

4 days agoryan-duve

If the effect size x site traffic is so small it's statistically insignificant, why are you doing all this work in the first place? Just choose the option that makes the PHB happy and move on.

(But, it's more likely that you don't know if there's a significant effect size)

4 days agowiml

The PHB wanted A/B testing! True story. I've spent two months convincing them that it made no sense with the volume of conversion events we had.

4 days agokoliber

Another option, "I'm already doing A/B testing, trust me."

4 days agowussboy

Yes wondering what the confidence intervals are.

4 days agodouglee650

No, multi-armed bandit doesn't "beat" A/B testing, nor does it beat it "every time".

Statistical significance is statistical significance, end of story. If you want to show that option B is better than A, then you need to test B enough times.

It doesn't matter if you test it half the time (in the simplest A/B) or 10% of the time (as suggested in the article). If you do it 10% of the time, it's just going to take you five times longer.

And A/B testing can handle multiple options just fine, contrary to the post. The name "A/B" suggests two, but you're free to use more, and this is extremely common. It's still called "A/B testing".

Generally speaking, you want to find the best option and then remove the other ones because they're suboptimal and code cruft. The author suggests always keeping 10% exploring other options. But if you already know they're worse, that's just making your product worse for those 10% of users.

4 days agocrazygringo

Multi-arm bandit does beat A/B testing in the sense that standard A/B testing does not seek to maximize reward during the testing period, MAB does. MAB also generalizes better to testing many things than A/B testing.

4 days agoLPisGood

This is a double-edged sword. There are often cases in real-world systems where the "reward" the MAB maximizes is biased by eligibility issues, system caching, bugs, etc. If this happens, your MAB has the potential to converge on the worst possible experience for your users, something a static treatment allocation won't do.

4 days agocle

I haven’t seen these particular shortcomings before, but I certainly agree that if your data is bad, this ML approach will also be bad.

Can you share some more details about your experiences with those particular types of failures?

4 days agoLPisGood

Sure! A really simple (and common) example would be a setup w/ treatment A and treatment B, where your code does "if session_assignment == A .... else .... B". In the else branch you do something that for whatever reason causes misbehavior (perhaps it sometimes crashes or throws an exception or uses a buffer that drops records under high load to protect availability). That's surprisingly common. Or perhaps you were hashing on the wrong key to generate session assignments, e.g. you accidentally used an ID that expires after 24 hours of inactivity... now only highly active people get correctly sampled.

Another common one I saw was due to different systems handling different treatments, and there being caching discrepancies between the two, like esp in a MAB where allocations are constantly changing, if one system has a much longer TTL than the other then you might see allocation lags for one treatment and not the other, biasing the data. Or perhaps one system deploys much more frequently and the load balancer draining doesn't wait for records to finish uploading before it kills the process.

The most subtle ones were eligibility biases, where one treatment might cause users to drop out of an experiment entirely. Like if you have a signup form and you want to measure long-term retention, and one treatment causes some cohorts to not complete the signup entirely.

There are definitely mitigations for these issues, like you can monitor the expected vs. actual allocations and alert if they go out-of-whack. That has its own set of problems and statistics though.
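
A minimal sketch of such a monitor for the two-arm case, using a chi-square goodness-of-fit test with one degree of freedom. The function name, the counts and the alert threshold are all assumptions for illustration.

```python
import math


def allocation_mismatch_p_value(observed_a, observed_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    total = observed_a + observed_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((observed_a - expected_a) ** 2 / expected_a
            + (observed_b - expected_b) ** 2 / expected_b)
    # Survival function of chi-square with 1 dof: P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Made-up numbers: a 50/50 test that logged 10,000 vs 9,500 sessions.
p = allocation_mismatch_p_value(10_000, 9_500)
if p < 0.001:  # threshold is an assumption; tune it to how noisy your alerting can be
    print(f"possible allocation bug, p = {p:.2g}")
```

With a bandit the expected share changes over time, so the check would have to be run per window against whatever allocation was in force during that window.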

4 days agocle

No -- you can't have your cake and eat it too.

You get zero benefits from MAB over A/B if you simply end your A/B test once you've achieved statistical significance and pick the best option. Which is what any efficient A/B test does -- there is no reason to have any fixed "testing period" beyond what is needed to achieve statistical significance.

While, to the contrary, the MAB described in the article does not maximize reward -- as I explained in my previous comment. Because the post's version runs indefinitely, it has worse long-term reward because it continues to test inferior options long after they've been proven worse. If you leave it running, you're harming yourself.

And I have no idea what you mean by MAB "generalizing" more. But it doesn't matter if it's worse to begin with.

(Also, it's a huge red flag that the post doesn't even mention statistical significance.)

4 days agocrazygringo

> you can't have your cake and eat it too

I disagree. There is a vast array of literature on solving the MAB problem that may as well be grouped into a bin called “how to optimally strike a balance between having one’s cake and eating it too.”

The optimization techniques to solve MAB problem seek to optimize reward by giving the right balance of exploration and exploitation. In other words, these techniques attempt to determine the optimal way to strike a balance between exploring if another option is better and exploiting the option currently predicted to be best.

There is a strong reason this literature doesn’t start and end with: “just do A/B testing, there is no better approach”

4 days agoLPisGood

I'm not talking about the literature -- I'm talking about the extremely simplistic and sub-optimal procedure described in the post.

If you want to get sophisticated, MAB properly done is essentially just A/B testing with optimal strategies for deciding when to end individual A/B tests, or balancing tests optimally for a limited number of trials. But again, it doesn't "beat" A/B testing -- it is A/B testing in that sense.

And that's what I mean. You can't magically increase your reward while simultaneously getting statistically significant results. Either your results are significant to a desired level or not, and there's no getting around the number of samples you need to achieve that.

4 days agocrazygringo

I am talking about the literature which solves MAB in a variety of ways, including the one in the post.

> MAB properly done is essentially just A/B testing

Words are only useful insofar as their meanings invoke ideas, and in my experience absolutely no one thinks of other MAB strategies when someone talks about A/B testing.

Sure, you can classify A/B testing as one extremely suboptimal approach to solving the MAB problem. This classification doesn’t help much though, because the other MAB techniques do “magically increase the rewards” compared to this simple technique.

4 days agoLPisGood

> Sure, you can classify A/B testing as one extremely suboptimal approach to solving MAB problem. This classification doesn’t help much though, because the other MAB techniques do “magically increase the rewards” compared this simple technique.

You are quite simply wrong. There is nothing suboptimal about an A/B test between two choices performed until desired statistical significance. There is nothing you can do to magically increase anything.

If you think there is, you'll have to describe something specific. Because nowhere in the academic MAB literature does anyone attempt to state the contrary. And which, again, is why this blog post is so flawed.

2 days agocrazygringo

Another way of seeing the situation: let your MAB solution run for a while. Orange has been tested 17 times and blue has been tested 12 times. This is exactly equivalent to running an A/B test where you show the orange button once to each of 17 people and the blue button once to each of 12 people.

The trick is to find the exact best number of tests for each color so that we have good statistical significance. MAB does not do that well, as you cannot easily force testing of an option that looked bad before it got enough trials for good statistical significance (imagine you have 10 colors and orange first scores 0/1. It will take a very long while before this color is re-tested to any significant degree: you first need to fall into the 10% exploration case, and then you still only have a ~10% chance of randomly picking this color rather than one of the others). With A/B testing, you can do a power analysis beforehand (or at any point during the test) to know when to stop.

The literature does not start with "just do A/B testing" because it is not the same problem. In MAB, your goal is not to demonstrate that one option is bad; it's to make your own decisions when faced with a fixed situation.

4 days agocauch

> The trick is to find the exact best number of test for each color so that we have good statistical significance

Yes, A/B testing will force through enough trials to get statistical significance (it is definitely an “exploration first” strategy), but in many cases you care about maximizing reward as well, in particular during testing. A/B testing does very poorly at balancing exploration with exploitation in general.

This is especially true if the situation is dynamic. Will you A/B test forever in case something has changed and give up that long term loss in reward value?

4 days agoLPisGood

But the proposed MAB system does not even propose a method to know when this system needs to be stopped (and remove all the choices except the best one).

With the A/B testing, you can do power analysis whenever you want, including in the middle of the experiment. It will just be an iterative adjustment that converges.
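
As a rough illustration of such a power analysis, here is the textbook normal-approximation sample size for comparing two conversion rates. It is a sketch using only the standard library; the default `alpha` and `power` values are conventional choices, not something from the post.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(p_control, p_variant, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_variant) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p_control * (1 - p_control)
                                  + p_variant * (1 - p_variant))) ** 2
    return ceil(numerator / (p_control - p_variant) ** 2)


n_small_gap = sample_size_per_arm(0.01, 0.02)  # 1% vs 2% conversion
n_large_gap = sample_size_per_arm(0.01, 0.03)  # 1% vs 3% conversion
print(n_small_gap, n_large_gap)
```

The required sample shrinks quickly as the gap between the two rates grows, which is why the stopping point depends so much on the effect size you expect.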

In fact, you can even run it on all possibilities in advance (if A gets 1% and B gets 1%, how many A and B do I need? If A gets 2% and B gets 1%? If A gets 3% and B gets 1%? ...) and it will give you the exact stopping boundaries for any configuration before even running the experiment. You then just have to stop trialing option A as soon as option A crosses the already decided significance threshold for A.

So, no, the A/B testing will never run forever. And A/B testing will always be better than the MAB solution, because you will have a better way to stop trying a bad solution as soon as you have crossed the threshold you decided is enough to consider it's a bad solution.

4 days agocauch

Isn't that the point of testing (to not maximize reward but rather wait and collect data)? It sounds like maximizing reward during the experiment period can bias the results

4 days agoertdfgcvb

The great thing is that you can do both.

4 days agoLPisGood

Multi-armed bandits make a big assumption that effectiveness is static over time. What can happen is that if they tip traffic slightly towards option B at a time when effectiveness is higher (maybe a sale just started) B will start to overwhelmingly look like a winner and get locked in that state.

You can solve this with propensity scores, but it is more complicated to implement and you need to log every interaction.

4 days agojbentley1

This objection is mentioned specifically in the post.

You can add a forgetting factor for older results.
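
One simple form of a forgetting factor is exponential discounting of each arm's counts. A sketch under that assumption; the class name and the decay value are illustrative, and here the decay is applied only when the arm itself is observed.

```python
class DecayedArm:
    """Running reward estimate in which older observations count for less."""

    def __init__(self, decay=0.99):
        self.decay = decay  # 1.0 means never forget
        self.weighted_trials = 0.0
        self.weighted_reward = 0.0

    def update(self, reward):
        # Shrink the old evidence, then add the new observation at full weight.
        self.weighted_trials = self.decay * self.weighted_trials + 1.0
        self.weighted_reward = self.decay * self.weighted_reward + reward

    def estimate(self):
        return self.weighted_reward / self.weighted_trials if self.weighted_trials else 0.0


arm = DecayedArm(decay=0.9)
plain = DecayedArm(decay=1.0)
for reward in [1] * 50 + [0] * 50:  # the arm stops converting halfway through
    arm.update(reward)
    plain.update(reward)
print(arm.estimate(), plain.estimate())
```

The decayed estimate ends up near zero, tracking the recent behaviour, while the undecayed one still reports 0.5.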

4 days agoLPisGood

This seems like a fudge factor though. Some things are changed bc you act on them! (e.g. recommendation systems that are biased towards more popular content). So having dynamic groups makes the data harder to analyze

4 days agorandomcatuser

A standard formulation of MAB problem assumes that acting will impact the rewards, and this forgetting factor approach is one which allows for that and still attempts to find the currently most exploitable lever.

4 days agoLPisGood

That's a different problem. In jbentley1's scenario, A could be better, but this algorithm will choose B.

4 days agolern_too_spel

I don't like how this dismisses the old approach as "statistics are hard for most people to understand." This algo beats A/B testing in terms of maximizing how many visitors get the best feature. But is that really a big enough concern IRL that people are interested in optimizing it every time? Every little dynamic lever adds complexity to a system.

4 days agoiforgot22

Indeed perhaps we should applaud people for choosing statistical tools that are relatively easy to use and interpret, rather than deride them for not stepping up to the lathe that they didn't really need and we admit has lots of sharp edges.

4 days agorecursivecaveat

I think you missed the point. It's not about which visitors get the best feature. It's about how to get people to PUSH THE BUTTON!!!!! Which is kind of the opposite of the best feature. The goal is to make people do something they don't want to do.

Figuring out best features is a completely different problem.

4 days agorerdavies

I didn't say it was the best for the user. Really the article misses this by comparing a new UI feature to a life-saving drug, but it doesn't matter. The point is, whatever metric you're targeting, do you use this algo or fixed group sizes?

4 days agoiforgot22

Yeah basically. The idea is that somehow this is the data-optimal way of determining which one is the best (rather than splitting your data 50/50 and wasting a lot of samples when you already know)

The caveats (perhaps not mentioned in the article) are:

- Perhaps you have many metrics you need to track/analyze (CTR, conversion, rates on different metrics), so you can't strictly do bandit!
- As someone mentioned below, sometimes the situation is dynamic (so having evenly sized groups helps with capturing this effect)
- Maybe some other ones I can't think of?

But you can imagine this kind of auto-testing being useful... imagine AI continually pushes new variants, and it just continually learns which one is the best

4 days agorandomcatuser

It still misses the biggest challenge though--defining "best", and ensuring you're actually measuring it and not something else.

It's useful as long as your definition is good enough and your measurements and randomizations aren't biased. Are you monitoring this over time to ensure that it continues to hold? If you don't, you risk your MAB converging on something very different from what you would consider "the best".

When it converges on the right thing, it's better. When it converges on the wrong thing, it's worse. Which will it do? What's the magnitude of the upside vs downside?

4 days agocle

Are you saying that it may do something like improve click-the-button conversion but lead to less sales overall?

4 days agodesert_rue

Facebook or YouTube might already be using an algo like this or AI to push variants, but for each billion user product, there are probably thousands of smaller products that don't need something this automated.

4 days agoiforgot22

        # for each lever, 
            # calculate the expectation of reward. 
            # This is the number of trials of the lever divided by the total reward 
            # given by that lever.
        # choose the lever with the greatest expectation of reward.
If I'm not mistaken, this pseudocode has a bug that will result in choosing the expected worst option rather than the expected best option. I believe it should read "total reward given by the lever divided by the number of trials of that lever".
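
With the division the right way round, the selection step could look like the sketch below. It follows the comments quoted above rather than the article's actual code, and treating an untried lever as having expectation 0.0 is an assumption.

```python
import random


def choose_lever(rewards, trials, epsilon=0.1, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(trials))  # explore: any lever, uniformly
    # Expected reward = total reward / number of trials (not the other way round).
    expectations = [r / t if t else 0.0 for r, t in zip(rewards, trials)]
    return expectations.index(max(expectations))  # exploit: best lever so far


rewards = [3.0, 12.0, 5.0]  # total reward per lever (made-up numbers)
trials = [100, 100, 100]    # number of times each lever was shown
best = choose_lever(rewards, trials, epsilon=0.0)
print(best)
```

With these numbers the inverted ratio (trials divided by reward) would be largest for lever 0, the worst performer, which is the bug described above.
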
4 days agoMichaelDickens

Correct, that's why I don't trust reading code comments

4 days agoreader9274

Nothing shows on this page without JavaScript except for the header and a grey background. A bit strange for a blog.

4 days agozahlman

Pre-CSS grid masonry layout. Author hides the content with CSS, and JS reveals it, to avoid a flash.

CSS to make it noscript friendly: `.main { visibility: visible !important; max-width: 710px; }`

4 days agolelandfe

Utterly unremarkable for 2012 though.

4 days agoforgetfreeman

I've been using a Bernoulli bandit for many years now. Like the original author, I am not quite sure why more people don't use it — for all practical purposes, if you have stuff to do, it is superior to simple A/B testing in every way. The "set it and forget it" feature is really nice.

Another thing that I noticed while writing the code (and took advantage of) is that it is insanely scalable in a distributed system, using HyperLogLog and Bloom filters. I may or may not have totally over-engineered my code to take advantage of that, even though my site gets ridiculously low click numbers :-)

4 days agojwr

lol. My first MAB implementation also used HyperLogLog for tracking unique conversions. We over engineer alike it seems.

12 hours agoorasis

So he compares users to slot machines. OK, I see why the internet sucks now 10 years later.

4 days agoemsign

After this:

> hundreds of the brightest minds of modern civilization have been hard at work not curing cancer. Instead, they have been refining techniques for getting you and me to click on banner ads

I was really hoping this would slowly develop into a statistical technique couched in terms of ad optimization but actually settling in on something you might call ATCG testing (e.g. the biostatistics methods that one would indeed use to cure cancer).

4 days ago__MatrixMan__

What an odd, self-contradictory post. "In recent years, hundreds of the brightest minds of modern civilization have been hard at work not curing cancer.", along with phrases like "defective by design" and the implied loss from not giving all people in a drug trial the new medicine, imply he thinks all of a thing needs to be allocated to what he perceives (incorrectly, trivially, shown below) as the best use. Then, with the multi-armed bandit, he recommends wasting (by the logic of his other statements) 10% of what is the best use on random other uses for exploration.

However all this fails. For optimal output (be it drug research, allocation of brains, how to run a life), putting all resources on the problem/thing that is the "most important" is sub-optimal use of resources. It's always better expected return to allocate resources to where that spent resource has the best return. If that place is apps, not cancer, then wishing for brains to work on cancer because some would view that as a more important problem may simply be a waste of brains.

So if cancer is going to be incredibly hard to solve, and mankind empirically gets utility from better apps, then a better use is to put those brains on apps - then they're not wasted on a probably unsolvable problem and are put to use making things that do increase value.

He also ignores that in real life the cost of having a zillion running experiments constantly flipping alternatives does not scale, so in no way can a company at scale replace A/B with multi-armed bandits. One reason is simple: at any time a large company is running 1000s to maybe 100k A/B tests, each running maybe 6 months, at which point a code path is selected, dead paths are removed, and this repeats continually. If that old code is not killed, and every feature from all time needs to be on/off randomly, then there is no way over time to move much of the app forward. It's not effective or feasible to build many new features if you must also allow interacting with those from 5-10 years ago.

A simple Google search shows tons more reasons, from math to practical, that this post is bad advice.

3 days agoSideQuark

Multi-armed bandits are fine, but they're limited to tests where it's OK to switch users between arms frequently and to tests that have more power.

4 days agoasdasdsddd

> where its ok to switch users between arms frequently

It's not hard to keep track of which arm any given user was exposed to in the first run, and then repeat it.
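
A minimal sketch of that bookkeeping, assuming an in-memory dict stands in for whatever store a real system would persist assignments in (`StickyAssignments` is a hypothetical name):

```python
import itertools


class StickyAssignments:
    """Remember the first arm each user saw and keep serving it."""

    def __init__(self, choose_arm):
        self.choose_arm = choose_arm  # callable returning an arm, e.g. a bandit's choice
        self.seen = {}                # user_id -> arm; a real system would persist this

    def arm_for(self, user_id):
        if user_id not in self.seen:
            self.seen[user_id] = self.choose_arm()
        return self.seen[user_id]


arms = itertools.cycle(["orange", "blue"])  # stand-in for a bandit's choice
sticky = StickyAssignments(lambda: next(arms))
first = sticky.arm_for("user-1")
other = sticky.arm_for("user-2")
again = sticky.arm_for("user-1")
```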

4 days agotantalor

There are often product limitations

4 days agoasdasdsddd

Check out improve.ai if you want to see this taken to the next level. We combined Thompson Sampling with XGBoost to build a multi-armed bandit that learns to choose the best arm across context. MIT license.

4 days agoorasis

Interesting post - certainly I can see myself wanting to tinker with Epsilon greedy - but the comments at the bottom are pretty off the chain, and many not in a good way. No auth commenting is certainly a brave decision in 2024.

4 days agobartread

I wish I could make this work without the mess though. Maybe an AI moderator could throw some of these away. Not that we need AI for everything, but I don’t have time to edit comments on my blog.

4 days agocodazoda

"People distrust things that they do not understand, and they especially distrust machine learning algorithms, even if they are simple."

How times have changed :)

4 days agonottorp

Just had to anthropomorphize machine learning.

4 days agosaintfire

More like, had to make it really good

4 days agoiforgot22

If your aim is to evaluate an effect size of your treatment because you want to know whether it’s significant, you can’t do what the article advises.

4 days agousgroup

I really like the multi-armed bandit approach, but it struggles with common scenarios involving delayed rewards or multiple success criteria, such as testing ecommerce search with number of orders and GMV guardrails.

For simple, immediate-feedback cases like button clicks, the specific implementation becomes less critical.

4 days agosarpdag

It’s best for immediate rewards. If you have delayed rewards there is a paper on sampling from the “delay distribution” that solves this.

12 hours agoorasis

This could be statistically valid only if you keep the rest of your site static while testing one variable at a time. Otherwise, if your flow changes somewhere else while this algorithm is running, it may mislead you into picking a color that then underperforms, because you changed something upstream before users reach this page.

4 days agom3kw9

The "20" is missing from the title.

4 days agoHeliumHydride

I believe hacker news automatically truncates the number at the beginning of titles

4 days agojerrygenser

This is fine as long as your users don't mind your site randomly changing all the time.

4 days agoIshKebab

That’s also a problem for AB testing and solvable (to a degree) by caching assignments

4 days agodata-ottawa

This can be addressed with some variant of

    random.seed(hash(user_id))
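
One caveat if this is taken literally in Python: the built-in `hash()` of a string is salted per process unless `PYTHONHASHSEED` is fixed, so the same `user_id` can land in a different variant after a restart or on another server. A sketch that uses a stable digest instead; `assign_variant` and the experiment name are illustrative.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, variants: list[str]) -> str:
    """Deterministically map a user to a variant, stable across processes and machines."""
    # Salting with the experiment name keeps assignments independent between experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


variant = assign_variant("user-123", "button-color", ["orange", "blue", "green"])
repeat = assign_variant("user-123", "button-color", ["orange", "blue", "green"])
```

This gives equal-sized buckets; if the proportions change over time, as they do with a bandit, some users will necessarily move between variants.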

I think the bigger problem is handling the fact that not all users click through the same number of times.
4 days agoryan-duve

Are there any component libraries that just do this for you? Like an angular or Blazor component you put your variations in and it handles storing results and handling the logic.

4 days agoshireboy

A/B testing is a learning tool, not an optimization tool.

4 days agoquaxi

How do you ensure the same user always gets the same treatment, even on subsequent visits to the site? You need the bucket sizes to be consistent for consistent hashing.

4 days agormacqueen

I'm a layman. Isn't MAB changing the experiment parameters while the experiment is still running? That sounds like an easy way towards biased experiment results

4 days agoertdfgcvb

From a purely technical definition of bias (difference in expected value of the estimator and the true value), MAB is not biased because "changing the experiment parameters" is just dynamically allocating a different sample size to each of the estimators, so the estimator still converges to the correct value.

You are correct that this setup can potentially mislead you, but this is because you might end up getting estimators with high variance. So, you might mistakenly see some early promising results for experiment group A and greedily assign all the requests to that group, even though it is not guaranteed that A is actually better than B.

This is the famous exploration-exploitation dilemma—should you maximize conversions by diverting everyone to group A or still try to collect more data from group B?

4 days agotjbai

That's a statistically valid approach. Technically correct, the best kind of correct.

Meanwhile, if your users get presented a different button whenever they come by, because the MAB is still pursuing its hill climbing, they'll rightfully accuse you of having extremely crappy UX. (And, sure, you can have MAB with user stickiness, but now you do need to talk about sampling bias)

And MAB hill climb doesn't work at all if you want to measure the long-term reward of a variation. You have no idea if the orange button has long-term retention impact. There are sure situations where you'd like to know.

Yes, it's a neat technique to have in your repertoire, but like any given technique, it's not the answer "every time".

4 days agogroby_b

A/B testing has the same problem unless you figure out how to treat the same user the same way each time. But yeah I generally don't give much credence to an assertion like this that isn't based on a real-world experience. It's not even like "this algo might help," it's "will beat A/B testing every time."

4 days agohot_gril

Why would I introduce randomness where I don't have to? I'd regret this as soon as I had to debug something related to it.

4 days agospmartin823

Regular A/B testing has the same kind of randomness. But this is more complex for other reasons.

4 days agohot_gril

> Statistics are hard for most people to understand.

True, but that's exactly what statistics helps with, though also hard to understand. :)

4 days agokazinator

People are hard for statistics to understand.

4 days agohot_gril

I've only read the first paragraph so bear with me but I'm not understanding the reasoning behind "A/B testing drugs is bad because only half of the sample can potentially benefit" when the whole point is to delineate the gots and got-nots ...

4 days agofitsumbelay

If the drug is effective and safe, then one half of the patients lost out on the benefit. You are intentionally "sacrificing" the control arm.

(Of course, the whole point is that the benefit and safety are not certain, so I think the term "sacrifice" used in the article is misleading.)

4 days agoatombender

And the control group is also sacrificed from potentially deadly side effects.

4 days agokridsdale1

My understanding is they usually do small trials early where they figure out if there are deadly side effects, and then do larger effectiveness trials once there’s determined to be minimal danger. So being in the control group is probably a negative thing on average.