This blog post is a practical guide for analyzing AB test results easily and on the cheap. I want to emphasize on the word practical first. You will not find here in depth statistical explanations but more a practical view regarding methodologies you can use. On the cheap refers to the fact that I will be sharing with you tools, resources and code that you can use right now for free in Google Sheets and Apps Script.
You probably have to compute statistical significance all the time and buying a new AB test platform is not an option. It is either too expensive for your company or it will take time before you get the approval to invest in such a platform. So you have to do with what you have and if this sounds familiar, well welcome my friend! I have been there. I will share with you how you can dimension an experiment, calculate statistical significance, create a Bayesian AB test calculator just by leveraging Google Sheets.
As such this post is dedicated to Marketeers, UXers, Designers or any Data Driven person who wants to analyse the result of split tests without having to graduate from a PHD in Statistics first. Also, I believe this post can be invaluable to Data Analysts supporting Growth or Marketing departments. The Maths and Stats behind analysing split tests can sometimes also seem opaque to seasoned data practitioners.
I will be sharing my experience supporting various departments where AB testing is at the core of their inner way of working. I will go through the tools (built in Google Sheets for most of them but you will also find some Apps Script and Python) and techniques I use to answer the usual questions around dimensioning a test, analysing and reporting its results. In this post I will assume that you have a Data Engineering Team that can help you implement and track AB test data and all you need to do is analyse the results.
When no experimentation platform is available to you, it is primordial to abide by a process in terms of designing experiments. With no tool on your side the easiest way is to rely on frequentist statistics fixed term horizon methodology. It is quite a mouthful but also very simple to use. I am going to share with you how you can leverage this methodology to define how long you should be running your test for.
In designing an experiment there are a bunch of metrics that you will have to surface. Most of the time it requires the help of a Data Analyst that might be able to grab the data for you. If you are data savvy and have access to your data warehouse, a bit of SQL will do.
To understand how long you should run an AB test for you need to:
You can now calculate the sample size needed based on the abovementioned metrics and by comparing the sample size to your daily traffic you can then infer how many days you should be running the test for.
Here is a Google Sheets with the Power Analysis formula already set for you. Feel free to make a copy for your own use or even better create your own version.
Let’s use an example so you can familiarize yourself with the tool. Let’s say you are running an AB test on your landing page, and the goal is to make a CTA more visible to improve the click through rate.
We input these numbers into our power analysis tool:
The tool tells us that we need around 12,000 visitors per variant, which is 24,000 in totals. We know that the traffic on our landing page is 1,000 visitors per day and as a consequence we know that we need to run the AB test for 24,000 / 1,000 = 24 days.
Pro Tip: usually the historical values that you use for power analysis might differ in reality when you run the test. It is always a good idea to add a few more days to make sure that you are compensating for unforeseen traffic variability. In the current case instead of running the test for 24 days I would probably run it for a full month. On top of “absorbing” the variability in traffic and making sure your test is not underpowered, gathering more traffic than previously scheduled can also allow you to detect smaller MDE than the one you were aiming at.
When and only when your test has run for the predefined amount of time that you should stop your AB test and gather the data. Using SQL directly or asking for the help of your favorite Data Analyst, you should then:
The reality is that you never really know what Minimum Detectable Effect (or MDE in short) to select in order to dimension your experiment. However there are a few things you can do in terms of getting close to a number that might be realistic or helpful.
If you have joined an existing Team which has already run tons of experiments, you can piggyback on the knowledge that has already been amassed. Accessing or creating a document summarizing every experiment that was run, with some descriptions of the page the test was run on, the measured effect (or difference between the Control and Variant) and whether it was statistically significant etc… can really help you assess the MDE of future experiments. Maybe you are about to run an AB test that was done before on a similar webpage: in this case you can use the effect found from that previous experiment as your best guess.
Now if you are joining a brand new Team and there is no existing knowledge that you can leverage to make educated guesses about your MDE, then there are probably two methodologies you can follow. The first one is basically to disregard all that experiment dimensionality calculation and run the experiment for two business cycles only. This can be a valid way to look at things if you work on “top of the funnel” initiatives like acquisition for example. In this case you are probably dealing with a lot of traffic so you can run every single of your AB tests for two weeks while still getting enough samples to detect very small effects.
The second solution and by far my preferred one is to follow a more data driven approach by leveraging what we learned from the previous chapter. It goes as follows:
The boring and time consuming part here is having to compute the sample size for every single combination you are interested in. Like in the above example if you have a long list of different combinations (MDE, number of variations) you can spend a lot of time coming up with the sample size and test duration figures.
In this blog post I share some Apps Script code to automate all the calculations for you. Neat right? Now all you need to do follow the steps in the post and you are all set!
Calculating statistical significance is not complicated and does not necessitate an advanced solution. It can be done from Google Sheets quite simply. If we take the example of comparing two conversion rates (control vs variant) here is a template that you can use. If the difference is statistically significant (based on your choosen alpha) then the Google Sheets will tell you so on cell B14.
Let’s see what’s under the hood. First of all we are using the Chi Square Test (a mouthful I know). It is basically a test that will compare the data you have gathered against what your data would have looked like if the variant was similar to the control (in other words in the case where there is no effect at all due to the change you have implemented).
To work the Chi Square test needs you to transform your original dataset to compute the figures for the case where there is no effect whatsoever - which I called the Expected Values in the Google Sheets. The process would look something like:
Once you calculated your p-value you can directly compare it against the significance level alpha (do not overthink it and pick 0.05, but whatever you choose make sure it is the same value as the alpha you selected for dimensioning your experiment). If your p-value is less than alpha then you can conclude that the variant is better than the control.
The next natural question is by how much is it better, and we do need to be careful on how we report it. This is what we are going to tackle in a couple of chapters from here.
For the moment, if your want to know in details how to calculate statistical significance not only for conversion rates but also for continuous variables, using either Google Sheets or Python, here is below a list of ressources:
In the next chapter I am going to go through analysing the result of an AB test and move beyond the black and white interpretation that is often required when we work in a professional setting as opposed to an academic one.
All the methodologies and templates we have been through so far are all related to the frequentist world. Whithin such a setting, an AB test has implicitly two hypothesis:
When we calculate the p-value we are within the premise of the null hypothesis. We assume that the change implemented has no effect, and in this context we can interpret the p-value as a measure of “surprise”. It tells us how surprising the result is - in other words how surprising the difference between the variant and the control is - in a world where we assume that there should be no effect at all.
If the p-value is less than the significance level alpha we say that the result is so surprising that we have enough evidence to reject the null hypothesis and accept the alternative hypothesis. It does not necessarily mean that the alternative is true for sure. There still is a 5% chance (the significance level alpha) that we are facing a false positive: a case where we conclude that the Alternative Hypothesis is true when actually it is not.
If the p-value is more than the significance level alpha then we say that we do not have enough evidence to reject the null hypothesis. We can not confirm that there is an effect, nor can we say that the null hypothesis is true, we just can not conclude anything unfortunately. In such a case, one should not reject the logic behind implementing the test in the first place and take the rejection as an indication that the change tested was not bold enough and needs to be reviewed, improved or tweaked.
With frequentist statistics, we can only answer one question and one question only through an AB test: is the variant better than the control? Yes or No and that is it.
I have seen countless Data Analysts (I should include myself) and Product Managers making the same mistake which is to use the AB test result to estimate what the real uplift is or what the real difference in conversion rates between the two variations is.
The reality is that the uplift or the difference calculated from the test is likely to be over optimistic. Think about what would happen if you were to run the same AB test a hundred times. You would find some tests that failed to reach statistical significance, as well as a wide variety of numbers reported for the uplift from one test to another.
The goal is to set expectations right in our stakeholders’s mind and from my experience being cautious here is the right approach. There two ways to report on an AB test performance:
The frequentist approach unfortunately does not work well when we deal with small samples. More often than not, when computing how long an experiment should run for one might find out that due to the low amount of traffic coming to the page the test would have to run for an unrealistic amount of time.
In such a case the frequentist approach here does not offer any insights. Either we run the test for the required period of time and then check the p-value but as discussed this is no longer an option due to the low traffic. Or we decide to run the AB test for two business cycles and compute the p-value afterwards. In that latter case it is very likely that the p-value is going to be meaningless (automatically more than 5%) - unless a “once-in-a-lifetime” effect is detected.
In such a scenario I find the Bayesian approach to be most helpful. It provides a probability that the variant is better than the control, no matter how small the sample size. You can then define a level of risk you are willing to accept. For example if the probability that the variant beats the control is higher than, say 90%, then conclude that the variant is a “winner”.
I do not advocate for a specific cut off point here, it is for anyone to make up their minds on the level of risk they are ready to accept. However, due to the small sample size it is also very likely that the result reported by Bayesian statistics stays quite low and rarely reaches a high probability level (above the 90% threshold for example), unless the effect of the change is very impactful. Small sample size equates to high variability. Now because Bayesian statistics provides a number we still have “something” we can use to make an informed decision based on the level of risk we are happy to take.
Pro Tip: regardless of the statistical methodology we use it is important to remind ourselves that, when dealing with small sample sizes, we need to make sure we maximize our chances to capture the "signal" by:
If you do not have access to a modern experimentation platform but have the help of data engineers to technically set an experimentation up and track the data, measuring the result of an AB test can easily be done, on a cheap, leveraging Google Sheets, Apps Script or Python.
In a nutshell, the high level steps should be to: