4 Mistakes that I made while running A/B Tests with Firebase Remote Config

Ankesh Kumar Singh · Jun 28

I learnt about Firebase Remote Config and experiments about a year and a half ago and started using them in my current product, CricPlay, as soon as we hit a decent scale for meaningful A/B testing.
Firebase provides a robust and low cost (free to use, but developers do need to spend some time architecting the app for remote config) framework to test your hypothesis in the production app with actual users.
Depending on the level of code customisation, it can handle fairly simple (CTA color, copy) to complex (alternate business logic) use cases.
A dozen experiments later, I am listing down some of my mistakes (so you don’t make them).
The post assumes a beginner to intermediate understanding of the platform.
In case you are just starting out with Firebase for your A/B tests, you should go through this video series to get started: https://www.
Not targeting correctly

Firebase is a suite for mobile applications.
Handling a new config value would require making changes to the application.
If you redesigned the in-app store screen to read the color of that purchase button from remote config, an app update is required.
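In code, this usually means the read site always falls back to a compiled-in default, so app versions (or users) that never fetch the new value behave exactly like before. A minimal sketch, where the key name and colors are hypothetical and the map stands in for values fetched via the Firebase SDK:

```kotlin
// Sketch: resolving a Remote Config value with a hard-coded default.
// `fetchedConfig` stands in for values fetched via the Firebase SDK;
// the key "purchase_button_color" is a hypothetical example.
const val DEFAULT_BUTTON_COLOR = "#2196F3"

fun resolveButtonColor(fetchedConfig: Map<String, String>): String =
    fetchedConfig["purchase_button_color"] ?: DEFAULT_BUTTON_COLOR

fun main() {
    // A failed or absent fetch simply yields the default.
    println(resolveButtonColor(emptyMap()))
    println(resolveButtonColor(mapOf("purchase_button_color" to "#FF5722")))
}
```

Only app versions whose code actually contains a read site like this can react to the experiment, which is why targeting by app version matters.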
When running an experiment, it is important to target only those app versions that actually handle the config values being tested.
This is important because, in a total base of 100 users, only 20 may have updated to the latest version.
A % improvement (or deterioration) that is statistically significant among the 20 updated users can get diluted, and appear insignificant, when measured over the larger base that includes the 80 users the change cannot reach.
Statistical significance represents the likelihood that the difference between the conversion rates of 2 (or more) variants did not occur due to random chance.
In A/B testing, we are performing a null hypothesis testing.
A null hypothesis states that the change has no effect on conversion.
It is assumed to be true until evidence indicates otherwise.
A result is statistically significant if it cannot be explained by the null hypothesis and requires an alternate hypothesis.
In the above example, the alternative hypothesis can only hold for the 20 updated users; for the other 80, the null hypothesis is trivially true, since the changed config value has no effect on the previous version.
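To see the dilution numerically, the standard two-proportion z-statistic can be computed for the updated users alone versus the full base. All numbers below are hypothetical, chosen only to illustrate the effect:

```kotlin
import kotlin.math.sqrt

// Two-proportion z-statistic: how many pooled standard errors apart
// the two conversion rates are. |z| > 1.96 is roughly significant at 95%.
fun zStatistic(conv1: Int, n1: Int, conv2: Int, n2: Int): Double {
    val p1 = conv1.toDouble() / n1
    val p2 = conv2.toDouble() / n2
    val pooled = (conv1 + conv2).toDouble() / (n1 + n2)
    val se = sqrt(pooled * (1 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se
}

fun main() {
    // Hypothetical: 20% of each 100-user arm has updated. Among updated
    // users the variant converts 13/20 vs the control's 6/20; the 80
    // non-updated users per arm convert at the 30% baseline (24 each).
    val zUpdatedOnly = zStatistic(13, 20, 6, 20)           // ~2.2: significant
    val zFullBase = zStatistic(13 + 24, 100, 6 + 24, 100)  // ~1.0: diluted away
    println(zUpdatedOnly)
    println(zFullBase)
}
```

The same underlying effect clears the significance threshold on the updated cohort but disappears when the non-updated 80% are averaged in, which is exactly what version targeting prevents.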
Further, if adoption of a new update is slow, the experiment needs to run that much longer in order to reach a sample large enough to represent the overall base.
Moreover, existing users who are quick to update to the latest version may exhibit slightly different behavior from those who update a week or two later.
Apart from app version, Firebase supports several other targeting criteria.
Users can be targeted by audience segments, properties, geography and language.
If an experiment is designed to optimize the call to action copy in English for a multilingual app, it must target users with device language set to English.
To learn about all the available targeting capabilities, refer to the doc.
Not using an activation event

An activation event is an extension of targeting the right segment for an experiment.
Consider an experiment designed to optimize the onboarding experience of a game so as to get more users to play (captured by an event “Gameplay”).
For meaningful insights, the experiment should be restricted to new users.
Setting an activation event facilitates this.
Unlike targeting, it does not filter users at the time of sampling.
The test variants are served to both new as well as existing users in the test sample.
However, setting “Signup” as an activation event ensures that only new users (who performed “Signup” event) are analyzed for the experiment.
Not setting an activation event dilutes the result as the existing users are also triggering “Gameplay” events.
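Conceptually, the activation event acts as a filter on the analysed sample, not on who receives a variant. A sketch of that semantics, using the event names from the example above (the data structures are hypothetical, for illustration only):

```kotlin
data class User(val id: String, val events: Set<String>)

// Everyone in the sample is served a variant, but only users who fired
// the activation event ("Signup") are counted in the analysis.
fun analysedSample(sample: List<User>, activationEvent: String): List<User> =
    sample.filter { activationEvent in it.events }

fun gameplayRate(users: List<User>): Double =
    users.count { "Gameplay" in it.events }.toDouble() / users.size

fun main() {
    val sample = listOf(
        User("new-1", setOf("Signup", "Gameplay")), // new user who played
        User("new-2", setOf("Signup")),             // new user who didn't
        User("old-1", setOf("Gameplay")),           // existing user: no Signup
    )
    val activated = analysedSample(sample, "Signup")
    println(activated.map { it.id }) // only the new users remain
    println(gameplayRate(activated))
}
```

Without the filter, `old-1`'s "Gameplay" event would be attributed to the onboarding change it never saw, which is exactly the dilution described above.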
Relying solely on Firebase Analytics for analysing results

The Firebase console provides a lot of insight about a running experiment: a summary of the state of the experiment, an overview of the impact on the primary goal event as well as other experiment goals, and a detailed view of each variant's performance with respect to any of the goals.
For most simple experiments (UI changes, adding a new feature), the console results alone are sufficient to validate the hypothesis and roll out the best variant.
However, the complex ones may require additional analysis that may not be possible with the Firebase console.
One such experiment was to understand the impact of a “welcome bonus” on the game economy.
A welcome bonus enables a user to get some virtual currency on signup to experience the “premium” features.
High demand for the virtual currency drives monetization through IAPs and rewarded advertising.
A welcome bonus, therefore, should decrease the revenue from a new user in the short term, with an intent to get more users hooked on to premium features to boost long term monetization.
An economy with taps and sinks interspersed throughout the game can be tough to analyze in the Firebase console, especially if the event schema is not very well designed.
There are 2 solutions to this problem:

Using BigQuery for analysis

By linking BigQuery to Firebase, all analytics data related to A/B tests can be accessed.
For each variant, Firebase adds a user property firebase_exp_&lt;experiment number&gt; with value &lt;variant index&gt; that can be used to track the users exposed to each variant in BigQuery.
Passing experiment data to other analytics platforms / your database

This can be accomplished by creating a dummy config value (say firebaseExperimentId).
The default value is an empty string.
The application is configured to ignore this value if empty and to pass a non-empty value as a user profile update.
While configuring the experiment variants, this dummy value is used to pass the variant information to the application.
This user profile value can be used for creating funnels, analyzing screen flows and performing advanced segmentation for the variants in the analytics tool (CleverTap in my case).
It helps leverage the capabilities of your premium analytics tool (as opposed to free Firebase analytics).
As an extension, this dummy value can also be used to implement server-side experiments.
The application passes this value as an API header.
The API can then serve alternate business logic based on this remote config value.
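A sketch of how the dummy value can flow through, with firebaseExperimentId as in the text; the header name, the variant string, and the map standing in for fetched config are all hypothetical:

```kotlin
// The experiment sets firebaseExperimentId to e.g. "welcome_bonus_v2";
// the default (and the value for users outside any experiment) is "".
fun experimentHeaders(fetchedConfig: Map<String, String>): Map<String, String> {
    val experimentId = fetchedConfig["firebaseExperimentId"].orEmpty()
    // Ignore the value if empty: no API header, no profile update.
    return if (experimentId.isEmpty()) emptyMap()
    else mapOf("X-Experiment-Id" to experimentId) // hypothetical header name
}

fun main() {
    println(experimentHeaders(emptyMap())) // user not in any experiment
    println(experimentHeaders(mapOf("firebaseExperimentId" to "welcome_bonus_v2")))
}
```

The same non-empty string would also be sent as a user profile property to the analytics tool, so both the backend and the analytics platform can segment by variant.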
Managing re-installs and multiple devices

This is also one of the limitations of Firebase experiments.
If your app has a lot of re-installs and/or users accessing the same account from multiple devices, it is possible that the same user gets exposed to both test variants of an experiment.
A/B Testing uses an instance ID that the Firebase SDK generates to identify a unique device/user.
If a user uninstalls the app and installs it again, the SDK will generate a new instance ID and therefore treat the user as a new user.
This is also true if the same user logs into the app on a different device.
Again, for most simple experiments such as UI optimization or testing a new feature, it shouldn’t really matter.
However, for complex ones such as testing alternate business logic, it can influence the results or, worse, confuse some users.
Particularly in a market like India, a high re-install percentage is very common.
The solution is to either switch to another A/B testing solution or build a custom implementation of the same.
Both have their limitations: the cost of adding another tool and passing all the events to it to perform meaningful analysis, versus developing a statistical model to sample users and analyse experiment results.
There can also be a middle path (that I took).
It entails creating a copy of remote config values and experiments on the application’s backend and exposing the same via an API.
This can be used to essentially “lock” a remote config value against a logged-in user, even if Firebase provides a different value for the same user at a later stage (post re-install or access from another device).
It is important to build an unlocking mechanism for when the experiment ends and the winner is rolled out.
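The locking idea can be sketched as a small server-side store: the first value a logged-in user is served gets persisted, and later fetches return the locked value until the experiment is unlocked. All names are hypothetical, and an in-memory map stands in for real persistence:

```kotlin
class ConfigLockStore {
    // (userId, configKey) -> locked value; a real backend would persist this.
    private val locks = mutableMapOf<Pair<String, String>, String>()

    // Return the locked value if one exists; otherwise lock and return
    // the value Firebase just served for this user.
    fun resolve(userId: String, key: String, firebaseValue: String): String =
        locks.getOrPut(userId to key) { firebaseValue }

    // Called when the experiment ends and the winning variant is rolled out.
    fun unlock(key: String) {
        locks.keys.removeAll { it.second == key }
    }
}

fun main() {
    val store = ConfigLockStore()
    // First fetch locks variant "B" for user u1.
    println(store.resolve("u1", "welcomeBonus", "B")) // B
    // After a re-install, Firebase serves "A", but the lock wins.
    println(store.resolve("u1", "welcomeBonus", "A")) // B
    store.unlock("welcomeBonus")
    // Post rollout, the fresh value is served (and re-locked).
    println(store.resolve("u1", "welcomeBonus", "A")) // A
}
```

The unlock step is what lets the rolled-out winner finally reach users who were locked to the losing variant.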
It is important to note that, depending on the % of re-installs or multiple devices, this may lead to a significant discrepancy in experiment results on the Firebase console.
Firebase does not know that the application is overriding remote config values for these users.
Hence, the results need to be analysed in BigQuery or another analytics tool by passing the overridden config value.
Not the most optimal solution, but it works! If you are managing an analytics platform, consider this a feature request.
What are some of your learnings with Firebase experiments, or A/B tests in general? Let me know in the comments.