The cost of not measuring

There’s something I believe that I can’t always defend in the moment: that you should measure before you act. So I went looking for the formal argument.

Probability theory has a precise model for what happens when you act without measuring: the multi-armed bandit problem.

Imagine you enter a casino. There are k slot machines, each paying out at a different rate. You don’t know which pays best, and you only have n rounds to play. You could pick the first machine that looks good and pull it forever. Or you could spend some rounds testing before you commit.

There are a few algorithms for this.

The simplest one is called Explore-Then-Commit (ETC). First you try each machine m times. Then you compute each machine’s average payout. Then you stick with the winner for all remaining rounds.

But m is a knob you have to get right. If m is too small, you might lock into the wrong machine (you committed on bad data). If m is too large, you burn rounds gathering information you didn’t need.
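The whole algorithm fits in a few lines. Here is a sketch, where `arms` is a made-up interface for this post (a list of zero-argument callables, each returning one payout), not any particular library’s API:

```python
def explore_then_commit(arms, m, n):
    """Explore-Then-Commit: pull each arm m times, then play the
    empirically best arm for every remaining round."""
    k = len(arms)
    totals = [0.0] * k
    rewards = []
    # Exploration phase: pull each arm m times and record what it pays.
    for i in range(k):
        for _ in range(m):
            r = arms[i]()
            totals[i] += r
            rewards.append(r)
    # Commit phase: every arm was pulled the same m times, so comparing
    # totals is the same as comparing averages. With m = 0 there is no
    # data at all, and "best" degenerates to an arbitrary pick.
    best = max(range(k), key=lambda i: totals[i]) if m > 0 else 0
    for _ in range(n - m * k):
        rewards.append(arms[best]())
    return rewards
```

Everything that follows about m = 0 is literally this function with the exploration loop skipped.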

Now consider the degenerate case: m = 0. No exploration at all. You walk into the room, you like the color blue, you see the blue machine, and you pull it forever.

Your regret (the formal term for the gap between what you got and what you could have gotten) grows linearly with time. Every round, you might be pulling the wrong lever. And you will never discover this. You have no mechanism to find out you are wrong.
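The arithmetic behind that linear growth is trivial but worth seeing. The payout numbers here are made up for illustration:

```python
# Hypothetical numbers: the best machine pays 1.0 per round on average,
# and the blue machine you happened to pick pays 0.5.
def instinct_regret(rounds, best_mean=1.0, chosen_mean=0.5):
    # Every round costs you the same fixed gap, forever:
    # regret(n) = n * (best - chosen), a straight line in n.
    return rounds * (best_mean - chosen_mean)
```

Double the horizon and the regret doubles with it. Nothing in the strategy ever bends that line.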

But ETC has a deeper flaw: once you commit, you stop learning. If the environment changes after your exploration phase, you are blind. ETC is a good proof that exploration beats ignorance, but it’s not the optimal algorithm.

There is another algorithm, called Upper Confidence Bound (UCB), which has no phases. Imagine you pull the first arm 200 times and the average payout is $10. Then you pull the third arm twice and the average payout is $5. You know the first machine better, because you tested it 200 times: you can be confident the next payout will be around $10. The third machine is less known; you can’t be sure about its next payout, because you tested it only twice.

UCB folds that uncertainty into the decision: each round, your next pick is the arm with the highest estimated average reward plus a bonus proportional to its uncertainty.

Arms you haven’t tried much have high uncertainty, so they get a large bonus. Arms you have tried many times have tight estimates, so the bonus shrinks and they only get selected if they are actually good. The algorithm balances exploration and exploitation on every round, without you having to decide when to stop exploring. You never stop exploring. You just explore less as you become more confident.
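One standard variant, usually called UCB1, makes the bonus concrete: the score of arm i at round t is its empirical mean plus sqrt(2 · ln(t) / pulls_i). A sketch, using the same made-up arm interface as above (a list of zero-argument payout callables):

```python
import math

def ucb(arms, n):
    """UCB1 sketch: every round, play the arm with the highest
    empirical mean + sqrt(2 * ln(t) / pulls). The square-root term
    is the uncertainty bonus: large for rarely tried arms, shrinking
    as an arm accumulates pulls."""
    k = len(arms)
    counts = [0] * k      # how many times each arm has been pulled
    means = [0.0] * k     # running average payout per arm
    rewards = []
    for t in range(1, n + 1):
        if t <= k:
            i = t - 1     # initialization: pull each arm once
        else:
            i = max(range(k),
                    key=lambda j: means[j] + math.sqrt(2 * math.log(t) / counts[j]))
        r = arms[i]()
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # incremental mean update
        rewards.append(r)
    return rewards
```

Notice there is no m to tune and no commit step: the bonus term alone decides when a neglected arm deserves another look.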

UCB has logarithmic regret, O(log n): the cumulative cost of being wrong still grows, but more and more slowly over time. ETC cannot match this in general.

In any case, ETC with m = 0 (the instinct-only strategy) is fundamentally inferior both to ETC with m > 0 and to UCB.

Try it.

[Interactive simulator: choose a strategy (m = 0, ETC, or UCB) and watch regret accumulate over 100 rounds. The true payout rates are hidden.]

m = 0 will not always perform worse. If instinct happens to pick the right machine, it can match or even beat ETC and UCB over 100 rounds. Run it a few times and you will see this.

The problem is not that instinct is always wrong, but that acting purely on instinct gives you no way to find out whether you are right. There’s no signal. The regret accumulates invisibly. Measurement, on the other hand, doesn’t guarantee you’re right, but it at least guarantees you can be shown wrong.

When there is no shared evidence, what fills the void? Authority, seniority, confidence. The loudest voice wins. A team running at m = 0 accumulates regret linearly and has no common language to discuss why.

A team that measures can still be wrong. But at least they are wrong about something real.