
Power calculations explained: what is power and why is it important?

Naomi Smulders


29-11-2022 - 7 minutes reading time

In the CRO world, you hear a lot about power calculations. As a data scientist at Online Dialogue, I often get asked by our clients, ‘What exactly is this power?’ and ‘When do you do a power calculation?’ So I thought I'd write an article about it. In this blog, I'll explain what power is, how you calculate it, and what you use it for.

TL;DR

  • A priori power calculations (MDE) ensure that our experiments are sensitive enough to find an effect when it is actually present.
  • A posteriori power calculations are useless because of the 1:1 relationship between power and p-value. It is much more informative to look at the exact p-value and confidence interval.
  • A priori power is also necessary when we apply Bayesian statistics, to ensure that we collect sufficient evidence.

What is power?

Statistical power is the probability that a test or experiment will find an effect if it is actually present. In experiments, we assume that our random sample is representative of reality. But the result you find in this sample may not be the real result; in that case, you make a measurement error. To minimize the risk of such a measurement error, we apply statistics. Power calculations are the methods we use to limit one kind of measurement error: Type II errors (false negatives).

Power calculations are about the sensitivity of your experiment to detect an actual effect. The higher this sensitivity, the more likely the test is to find an effect in your sample when it is present in reality. With high sensitivity, we reduce the chance of missing a true effect in our test result (a false negative), and so make sure we can capitalize on possible improvements.
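To make this concrete, here is a minimal sketch of such a power calculation for a two-sided z-test comparing two conversion rates. The baseline rate, uplift and visitor numbers are hypothetical, and the normal approximation is used for simplicity:

```python
from statistics import NormalDist

def power_two_proportions(p_a, p_b, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two conversion rates."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the difference between the two sample proportions
    se = ((p_a * (1 - p_a) + p_b * (1 - p_b)) / n_per_arm) ** 0.5
    z_effect = abs(p_b - p_a) / se
    # Probability the observed z-score clears the critical value
    # (ignoring the negligible mass in the opposite tail)
    return 1 - NormalDist().cdf(z_alpha - z_effect)

# Hypothetical: 5% baseline conversion, hoped-for uplift to 6%,
# 5,000 visitors per arm
power = power_two_proportions(0.05, 0.06, 5000)  # ≈ 0.59
```

With these assumed numbers the test only has about 59% power, so roughly 4 in 10 true uplifts of this size would be missed; collecting more visitors per arm raises the sensitivity.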

The importance of a priori power analysis (or MDE)

We determine the power level, or sensitivity, of an experiment before conducting it. In science, a power calculation is used to determine the sample size needed to find an effect of interest. In online experimentation, we reverse this formula: we already know the sample size (namely, the number of visitors who visit the test page) and use it to determine the size of the effect (such as an uplift) that we can detect. We therefore calculate the Minimal Detectable Effect (MDE) for several possible run lengths in weeks and choose the best-fitting duration.

Many companies employ a power level of 80%; in the industry, this is considered the best trade-off between the probability of finding winners and the shortest possible run time. Using as high a power level as is practical reduces the chance that we miss real effects.
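The reversed formula described above can be sketched as follows. The baseline rate and the weekly traffic per arm are made-up numbers for illustration; the MDE follows from the standard sample-size formula solved for the effect size:

```python
from statistics import NormalDist

def mde_relative(baseline_rate, n_per_arm, alpha=0.05, power=0.80):
    """Smallest relative uplift detectable at the given power (normal approx.)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    se = (2 * baseline_rate * (1 - baseline_rate) / n_per_arm) ** 0.5
    return (z_alpha + z_power) * se / baseline_rate

# Hypothetical page: ~10,000 visitors per arm per week, 5% baseline conversion
for weeks in (1, 2, 3, 4):
    mde = mde_relative(0.05, weeks * 10_000)
    print(f"{weeks} week(s): MDE = {mde:.1%}")
# With these assumptions: 1 week → MDE ≈ 17.3%, 4 weeks → ≈ 8.6%
```

You then pick the shortest duration whose MDE is small enough to be realistic for the change you are testing.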

Minimal Effect of Interest (MEI) as an alternative to MDE

In his book and a recent blog post, Georgi Georgiev describes the MDE as an oversimplification of the statistical work needed to choose a good sample size or duration for an experiment. He argues that we should not only look at the effect to be detected, but also include a cost-benefit analysis that determines the minimum effect at which implementation becomes worthwhile. So we look not only at the minimal detectable effect (MDE) but at the minimal effect of interest (MEI). Cost-benefit analyses are usually done after the fact: for example, only once we see that a winner will generate approximately 50,000 euros in additional revenue do we start to look critically at the costs of building, implementing and maintaining the tested change. Within the MEI framework, this is determined in advance.

The nonsense of observed power

Power, then, is something you calculate before you run the test, based on the average number of visitors who saw the test page in the preceding period. But what if the number of visitors in your experiment ends up lower than you assumed in your calculation? Can you trust such an underpowered test?

The short answer is ‘Yes, provided you look closely at the p-value.’

There are several ways to calculate power after the test has run (‘post hoc’ or ‘observed’ power). The formulas are almost identical to those for the MDE, except that you now enter the actual sample size of your test. However, these calculations are not really necessary, because the p-value of a test has a 1:1 relationship with the power of that test (Hoenig & Heisey, 2001). In other words, when the power of your test turns out lower than calculated beforehand (because visitor numbers were disappointing), the p-value of your test will be higher. By critically examining your p-value, you already account for the additional uncertainty of the effect found. So instead of calculating post hoc power, it is more useful to look at the p-value of the test and the corresponding confidence interval.
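The 1:1 relationship is easy to demonstrate: observed power is a deterministic function of the p-value alone. This sketch follows the construction in Hoenig & Heisey (2001) for a two-sided z-test:

```python
from statistics import NormalDist

def observed_power(p_value, alpha=0.05):
    """'Post hoc' power implied by a two-sided p-value: a pure function of p."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_obs = NormalDist().inv_cdf(1 - p_value / 2)   # z-score the p-value implies
    return 1 - NormalDist().cdf(z_alpha - z_obs)

# A test that lands exactly on p = alpha always has 50% observed power:
print(observed_power(0.05))   # 0.5
# and higher p-values map one-to-one to lower observed power:
print(observed_power(0.30))
```

Since observed power adds nothing beyond what the p-value already says, reporting the exact p-value and confidence interval is strictly more informative.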

The importance of power in Bayesian Statistics

Everything above is described from a frequentist perspective, in which we use our statistical test to figure out whether or not we can reject the null hypothesis, with the p-value telling us how unlikely our sample result would be if the null hypothesis were true. With Bayesian statistics, we do not calculate whether we can reject the null hypothesis but rather how strong our ‘belief’ (probability) is that our stated hypothesis is correct. The Bayesian probability thus indicates how much evidence has been found for the effect under study. The weaker the evidence, the lower the Bayesian probability that the effect is actually there.

Is power then also required for Bayesian statistics?


It is sometimes argued that power analyses are not necessary in Bayesian tests, since low power gives weak evidence and thus a low Bayesian probability. In online experimentation practice, however, this is different. Bayesian statistics assumes that you are gathering evidence for an expectation about the outcome: before you start experimenting you already have a prior belief, during the experiment you see whether the data confirms that prior belief, and from the two you form your posterior belief. If your prior belief and your posterior belief are close to each other, you have a high Bayesian probability; if the expectation and the data contradict each other, your Bayesian probability decreases.

The problem is that in online experimentation we do not use an informative prior belief. In the vast majority of cases, we set a ‘non-informative prior’: we assume that we do not yet know anything about the effect of our experiment. This makes the Bayesian calculation very similar to the frequentist one, so power remains essential to ensure that we collect sufficient evidence. This is also demonstrated in this post by David Robinson at varianceexplained.org.

In conclusion

I hope I've given you new insights into power. Do you have any questions, comments or other issues? I'd love to hear them! Just email me at: analisten@onlinedialogue.com
