A/B testing with Machine Learning

Euel Fantaye
7 min read · Aug 14, 2021



In this article, you will find the essentials of applying Machine Learning to A/B testing.

Table of contents:
1. Basics of A/B testing and its use cases
2. Limitations and challenges of classical A/B testing
3. Sequential A/B testing pros and cons
4. A/B testing formulation in Machine Learning context
5. Data review and ML A/B testing result
6. The advantage of using MLFlow and DVC in ML experimentation

1. Basics of A/B testing and its use cases

What is A/B testing?

Photo by Jason Dent on Unsplash

A/B testing (also known as bucket testing or split-run testing) is a user experience research methodology. An A/B test consists of a randomized experiment with two variants, A and B. It applies statistical hypothesis testing, or "two-sample hypothesis testing," as used in the field of statistics. A/B testing is a way to compare two versions of a single variable, typically by testing a subject's response to variant A against variant B and determining which of the two variants is more effective.

Essentially, A/B testing takes the guesswork out of optimization and enables experienced optimizers to make data-backed decisions. In A/B testing, A refers to the 'control', or the original testing variable, while B refers to the 'variation', a new version of the original.

The basic A/B testing process looks like this:

Let’s assume you want to know if the landing page design has an impact on the conversion rate.

1. Make a hypothesis about one or two changes you think will improve the page’s conversion rate.
2. Create a variation or variations of that page with one change per variation.
3. Divide incoming traffic equally between each variation and the original page.
4. Run the test as long as it takes to acquire statistically significant findings.
5. If a page variation produces a statistically significant increase in page conversions, use it to replace the original page.
6. Repeat
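Step 4's "statistically significant findings" usually come from a two-sample test on conversion counts. A minimal sketch of the standard two-proportion z-test (the traffic numbers below are invented for illustration):

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)         # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))          # standard normal CDF
    return z, 2 * (1 - phi)                          # two-sided p-value

# Invented traffic: 1,000 visitors per variant; A converts 100, B converts 130.
z, p = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these made-up numbers the lift is significant at the usual 5% level, so step 5 would apply; with a smaller gap or less traffic, the test keeps running.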

When Should We Use A/B Testing?
A/B testing works best when testing incremental changes, such as UX tweaks, new features, ranking changes, and page load times. You can then compare pre- and post-modification results to decide whether the changes are working as desired.

2. Limitations and challenges of classical A/B testing

Classical A/B testing works, but it isn't terribly efficient. You'll need to invest a significant amount of time and resources in your tests before you can gain any meaningful results.

A/B testing doesn’t work well when testing major changes, like new products, new branding, or completely new user experiences. In these cases, there may be effects that drive higher than normal engagement or emotional responses that may cause users to behave in a different manner.

Machine learning offers a way out: rather than testing A versus B, you can introduce C, D, E, F, and G into the equation as well and try out different combinations, such as header image A with headline B and CTA C.

Not only that, but instead of considering each visitor to be equal as in split A/B testing, ML can take into consideration factors such as demographics, customer status, and previous behavior to dynamically serve up different versions of your site to different groups of users.

The power of ML enables you to personalize and optimize your web properties from thousands of potential variations to display the single version that offers the best chance of conversion for each individual visitor.
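As a toy sketch of that per-visitor idea (the segments, variants, and conversion rates below are all invented), an epsilon-greedy policy can learn a different winning variant for each user segment:

```python
import random

random.seed(1)

variants = ["A", "B", "C"]
stats = {}  # (segment, variant) -> [conversions, impressions]

def observed_rate(segment, variant):
    conv, imp = stats.get((segment, variant), (0, 0))
    return conv / imp if imp else 0.0

def choose(segment, eps=0.1):
    # explore 10% of the time, otherwise exploit the best-looking variant
    if random.random() < eps:
        return random.choice(variants)
    return max(variants, key=lambda v: observed_rate(segment, v))

def record(segment, variant, converted):
    counts = stats.setdefault((segment, variant), [0, 0])
    counts[0] += converted
    counts[1] += 1

# Invented "true" conversion rates: the best variant differs per segment.
true_rate = {
    ("new", "A"): 0.05, ("new", "B"): 0.12, ("new", "C"): 0.07,
    ("returning", "A"): 0.15, ("returning", "B"): 0.06, ("returning", "C"): 0.08,
}

for _ in range(20000):
    seg = random.choice(["new", "returning"])
    v = choose(seg)
    record(seg, v, random.random() < true_rate[(seg, v)])

best = {seg: max(variants, key=lambda v: observed_rate(seg, v))
        for seg in ("new", "returning")}
print(best)
```

After enough traffic the policy serves variant B to new visitors and variant A to returning ones, something a single pooled A/B test would never reveal.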

3. Sequential A/B testing pros and cons

Sequential testing, as the name implies, involves performing interim analyses sequentially until the results are significant or a maximum number of interim analyses is reached.

Sequential analysis sounds appealing, especially since a trial may need far fewer subjects than a fixed-sample randomized trial where the sample size is calculated in advance.

Are there any pros and cons of sequential analysis?

Pros
* Optimizes the number of necessary observations (sample size)
* Reduces the likelihood of error
* Gives a chance to finish experiments earlier without increasing the possibility of false results
Cons
* From a frequentist perspective, if we are concerned with preserving type I errors, we need to recognize that we are doing multiple comparisons
* If we do 3 analyses of the data, then we have three non-independent chances to make a type I error
* For a fixed sample size and significance level, sequential testing ends up reducing power compared to waiting until all the data comes in.
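The multiple-comparisons problem above is easy to demonstrate by simulation: running A/A tests (no true difference between arms) and "peeking" at four interim looks against the fixed single-look 1.96 threshold inflates the false-positive rate well above the nominal 5%. A sketch with invented parameters:

```python
import random
from math import sqrt

random.seed(0)

def peeking_false_positive_rate(n_trials=2000, looks=(250, 500, 750, 1000)):
    """Simulate A/A tests and stop at the first interim look whose
    z statistic crosses the fixed single-look 1.96 threshold."""
    false_positives = 0
    for _ in range(n_trials):
        a = b = seen = 0
        for look in looks:
            while seen < look:
                a += random.random() < 0.10   # both arms truly convert at 10%
                b += random.random() < 0.10
                seen += 1
            pool = (a + b) / (2 * look)
            se = sqrt(pool * (1 - pool) * 2 / look) or 1e-9
            if abs(a - b) / look / se > 1.96:
                false_positives += 1          # a false positive: H0 is true
                break
    return false_positives / n_trials

rate = peeking_false_positive_rate()
print(rate)
```

This is why proper sequential designs (e.g. Pocock or O'Brien-Fleming boundaries) use stricter per-look thresholds than 1.96.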

Traditional split testing is very time-consuming, especially if you want to test out several different variables. You have to carry out a single test for each different element you’re experimenting with and wait for a fairly conclusive result before you can continue with the next test.
Because you need a substantial audience size for each test to obtain any kind of meaningful results, A/B testing can also be very resource-intensive, often requiring a team dedicated to the task or outsourcing to an expensive marketing company.

4. A/B testing formulation in Machine Learning context

Photo by Andrea De Santis on Unsplash

What is Machine learning?

Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.

What does it have to do with A/B testing?

Unlike classical statistical inference, Machine Learning algorithms enable us to model complex systems that include all of the ongoing events, user features, and more. There are a number of algorithms, each with its own strengths and weaknesses.

A major issue with traditional, statistical-inference approaches to A/B testing is that they compare only two groups, an experiment and a control, on a single outcome. The problem is that customer behavior is vastly more complex than this. Customers take different paths, spend different amounts of time on the site, come from different backgrounds (age, gender, interests), and more. This is where Machine Learning excels: generating insights from complex systems.

5. Data review and ML A/B testing result

Check out my GitHub repository for this project.

The data for this project consists of "Yes" and "No" responses of online users to the following question:

Q: Do you know the brand Lux?

O Yes

O No

Dataset column description:

* auction_id: the unique id of the online user who has been presented the BIO. In standard terminology this is called an impression id. The user may see the BIO questionnaire but choose not to respond; in that case both the yes and no columns are zero.
* experiment: which group the user belongs to (control or exposed).
* date: the date, in YYYY-MM-DD format.
* hour: the hour of the day, in HH format.
* device_make: the name of the type of device the user has, e.g. Samsung.
* platform_os: the id of the OS the user has.
* browser: the name of the browser the user uses to see the BIO questionnaire.
* yes: 1 if the user chooses the "Yes" radio button for the BIO questionnaire.
* no: 1 if the user chooses the "No" radio button for the BIO questionnaire.
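Per the column description, users who saw the BIO questionnaire but never responded have both yes and no equal to zero and need to be dropped before computing group-level response rates. A minimal sketch with made-up rows mimicking that schema:

```python
# Made-up rows mimicking the schema described above.
rows = [
    {"experiment": "control", "yes": 1, "no": 0},
    {"experiment": "control", "yes": 0, "no": 0},  # saw the BIO, no response
    {"experiment": "control", "yes": 0, "no": 1},
    {"experiment": "exposed", "yes": 1, "no": 0},
    {"experiment": "exposed", "yes": 1, "no": 0},
    {"experiment": "exposed", "yes": 0, "no": 1},
]

# Drop users who did not respond at all (both columns zero).
responded = [r for r in rows if r["yes"] or r["no"]]

rates = {}
for group in ("control", "exposed"):
    g = [r for r in responded if r["experiment"] == group]
    rates[group] = sum(r["yes"] for r in g) / len(g)

print(rates)
```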

Results

With A/B testing we compare only two variants, but with machine learning we can incorporate the complexity and dynamic nature of the data and draw richer insights.

With classical A/B testing there is no significant lift in brand awareness, but with machine learning we find that the hour of the day, the date, and the platform lead to more 'yes' responses.

Our experiment feature has a feature importance of 0. This implies that the 'experiment' feature is not the main driver in the Decision Tree model; it does not contribute much to awareness. The best predictor for the Decision Tree model is hour, with a feature importance of 0.45, followed by device_make. But this is the result of a decision tree with a max depth of 4; if we increase the max depth, we might observe a different feature-importance ranking.
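The shape of that analysis can be reproduced on synthetic data. In the sketch below the rates, encodings, and signal are all invented; only the model settings (a decision tree with max depth 4) match the article:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 2000

# Synthetic stand-in for the dataset: hour drives the response,
# while the experiment flag and device are pure noise.
hour = rng.integers(0, 24, n)
experiment = rng.integers(0, 2, n)      # 0 = control, 1 = exposed
device = rng.integers(0, 5, n)          # integer-encoded device_make
p_yes = np.where((hour >= 18) & (hour <= 22), 0.5, 0.2)  # evenings convert more
yes = rng.random(n) < p_yes

X = np.column_stack([experiment, hour, device])
model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, yes)

for name, imp in zip(["experiment", "hour", "device_make"],
                     model.feature_importances_):
    print(f"{name}: {imp:.2f}")
```

As in the article's result, the uninformative experiment flag receives (near-)zero importance while hour dominates, and a deeper tree could distribute importance differently.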

Conclusion

With classical A/B testing, we determined whether there was a significant lift in brand awareness, which is instrumental to smartAd in making its next move.

With Machine Learning, we discover that other features, such as the hour of the day and the date, determine the conversion in brand awareness.

There is a greater potential to achieve a significant lift in brand awareness.

The hours of the day and the dates count towards gaining more "yes" results.

6. The advantage of using MLFlow and DVC in ML experimentation

What is MLflow?

MLflow is an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.

DVC and MLflow are two widely adopted open-source projects, each with its own specialty. DVC excels at data versioning, while MLflow is multiple tools combined into one, used mainly for its experiment tracking and artifact logging capabilities.

MLflow also supplies automatic logging for most common high-level machine learning frameworks, including Scikit-learn, TensorFlow, Keras, XGBoost, FastAI, PyTorch, and more.

