
We play twice, tails comes up twice, and you owe me $20. You'll probably chalk this up to bad luck; after all, there's a 25% chance a fair coin will produce this result. So you decide to play 8 more times and get 8 more tails. That's 10 tails out of 10 flips, you now owe me $100, and I'm grinning ear to ear… are you suspicious yet? You should be: the chance of this happening with a fair coin is less than 1 in a thousand (<0.1%).

Somewhere between 2 and 10 coin flips is a point where you should call bullshit. I recommend picking a high threshold so you don’t use foul words due to everyday bad luck. But you don’t want it to be too high because you’re not a sucker. I suggest you call me out if the outcome has a less than 1 in 20 chance of occurring (<5%). This means if you get 4 tails out of 4 (a 6% chance), you chalk it up to bad luck. If you get 5 tails out of 5 (a 3% chance), you decide you were cheated and call bullshit.
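The streak probabilities above are easy to verify. A minimal Python sketch of this decision rule:

```python
# The 5% rule from above: how unlikely is an all-tails streak with a fair coin?
for n in range(1, 11):
    p = 0.5 ** n  # chance of n tails in n flips of a fair coin
    verdict = "call bullshit (<5%)" if p < 0.05 else "chalk it up to bad luck"
    print(f"{n:>2} tails out of {n:>2}: {p:6.2%} -> {verdict}")
```

Four straight tails (6.25%) sits just above the 5% line; five straight tails (3.125%) falls below it.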

That's Frequentist hypothesis testing: you assumed the coin was fair (the null hypothesis), and only when the result fell below a reasonable threshold (5 tails out of 5 flips, <5%) did you call bullshit. You rejected the null hypothesis, meaning you accept the alternative hypothesis that the coin was biased.

Congrats! You've just learned hypothesis testing, and it only cost you $50.

Here is the most common misconception around p-values, confidence intervals, and hypothesis testing: that they tell us the probability we made the right decision. They don't; we simply don't know that probability. To know it would require information like: did the coin come from your pocket or mine? Was I just inside a magic shop? Do I have a large stack of money I've won from other people? While these answers should affect your estimate of the chance the coin is unfair, they're really hard to quantify objectively. Instead, hypothesis testing ONLY tells us that the result is odd when we assume the coin is fair.

This is directly applicable to AB testing… we don't know the probability that a test will work, and guessing only introduces bias. Instead, we assume there will be no effect, and only if we see an unlikely result do we make a big deal of it. The cool thing about hypothesis testing is that it's unbiased and doesn't require us to estimate the chance of success (which can be a highly subjective process).

We have the confusing definitions of p-values and significance to blame for this. A p-value of 0.05 means that the result (and anything as extreme) has a 5% chance of occurring under the null hypothesis. In our example, we're stating that the outcome would have a <5% chance of occurring IF the coin is fair. This threshold is also called the false positive rate; it's something we do know and can control, but it's not the same as knowing the chance we're wrong.
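To make "the result and anything as extreme" concrete, here is a small sketch (plain Python, no libraries) of a one-sided binomial p-value for counting tails:

```python
from math import comb

def p_value_tails(k: int, n: int) -> float:
    """One-sided p-value: chance of k or more tails in n fair-coin flips."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

print(p_value_tails(5, 5))   # 0.03125 -> below 0.05, reject the null
print(p_value_tails(8, 10))  # 0.0546875 -> 8 of 10 tails just misses significance
```

With 5 of 5 tails the p-value is 1/32 ≈ 3.1%, below the 5% threshold; 8 tails out of 10 comes to about 5.5%, just short of significance.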

We know the outcome is unlikely if the coin is fair, so we concluded it must not be fair. But we don't know how the coin truly behaves: does it have two tails? Or does it merely land tails 60% of the time? We were only able to reject the null hypothesis and conclude that the coin isn't fair. It's fairly standard practice to take the observed result (5 out of 5 = 100% tails), with some margin of error, as our best guess of the coin's behavior after rejecting the null hypothesis. But the truth is that many differently biased coins could easily have produced this result.
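To see this, the chance that a coin with a given tails probability produces 5 tails in a row is just that probability raised to the fifth power:

```python
# Each hypothetical bias (probability of tails) and its chance of 5/5 tails.
for bias in [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
    print(f"P(tails) = {bias:.0%} -> P(5 tails in a row) = {bias ** 5:.1%}")
```

A 60%-tails coin produces this streak about 7.8% of the time and an 80% coin about 32.8% of the time, so rejecting the null hypothesis doesn't tell us which of these coins we're holding.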

A related misconception is that you need a huge sample to run a valid experiment. It largely originates from AB testing leaders like Microsoft, Google, and Facebook, who talk a lot about experimentation on hundreds of millions of users. Larger samples do tend to give more sensitive tests. But statistical power is more than just sample size; it also depends on effect size. Small companies almost always see big effect sizes, giving them MORE statistical power than large companies (see You Don't Need Large Sample Sizes to Run A/B Tests). Many scientific studies are based on small sample sizes (< 20), and the coin-flip example required only 5 flips. The whole point of statistics is to identify which results can plausibly be attributed to signal rather than noise; a small sample size has already been accounted for.
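As an illustration, here's a rough power calculation using the normal approximation for a two-proportion test. The specific conversion rates and group sizes are made up for the sake of the example:

```python
from math import erf, sqrt

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power_two_proportions(p_a: float, p_b: float, n: int) -> float:
    """Rough power of a two-sided two-proportion z-test at alpha = 0.05,
    with n users per group (normal approximation; a sketch, not exact)."""
    z_crit = 1.96  # two-sided 5% critical value
    se = sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    return norm_cdf(abs(p_a - p_b) / se - z_crit)

# Hypothetical small company: big lift (10% -> 15%), only 1,000 users per group.
big_effect_small_n = power_two_proportions(0.10, 0.15, 1_000)
# Hypothetical large company: tiny lift (10% -> 10.1%), 100,000 users per group.
tiny_effect_huge_n = power_two_proportions(0.100, 0.101, 100_000)
print(f"10% -> 15%   with   1,000/group: power ~ {big_effect_small_n:.0%}")
print(f"10% -> 10.1% with 100,000/group: power ~ {tiny_effect_huge_n:.0%}")
```

Despite having 100× fewer users, the small company's test is far better powered, because the effect it's trying to detect is 50× larger.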

Some readers will call me out on the peeking problem, which I ignored for simplicity. In a nutshell: the more times you peek at or reevaluate your results, the more your statistics need to be adjusted. One correct approach is to pick a fixed number of flips before you start and only make a decision at the end (this is called a fixed-horizon test).
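A quick Monte Carlo sketch shows why peeking matters (hypothetical setup: a genuinely fair coin, up to 20 flips, one-sided binomial test). Testing after every flip inflates the false positive rate well past the nominal 5%, while the single fixed-horizon test stays under it:

```python
import random
from math import comb

def p_value_tails(k: int, n: int) -> float:
    """One-sided p-value: chance of k or more tails in n fair-coin flips."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

random.seed(42)
TRIALS, MAX_FLIPS, ALPHA = 10_000, 20, 0.05
peeking_fp = fixed_fp = 0

for _ in range(TRIALS):
    tails = 0
    peeked_reject = False
    for n in range(1, MAX_FLIPS + 1):
        tails += random.random() < 0.5  # flip a genuinely fair coin
        if p_value_tails(tails, n) < ALPHA:
            peeked_reject = True        # peeking: test after every single flip
    peeking_fp += peeked_reject
    fixed_fp += p_value_tails(tails, MAX_FLIPS) < ALPHA  # one test at the end

print(f"fixed-horizon false positive rate:   {fixed_fp / TRIALS:.1%}")
print(f"peek-every-flip false positive rate: {peeking_fp / TRIALS:.1%}")
```

Both procedures use the same 5% threshold on the same data; only the repeated checking differs, and it alone is enough to blow past the advertised false positive rate.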

The smart folks on the Netflix experimentation team wrote a more thorough and statistically rigorous explainer using coin flips in their blog post, Interpreting A/B test results: false positives and statistical significance. Be sure to check it out.

*Thanks to ZSun Fu on Unsplash for the photo!*
