Udacity is considering online experiments to test potential improvements to their website. Two versions of the website are shown to different users, usually the existing website and a version containing a potential change. My goal is to design and analyze an A/B test and write up a recommendation on whether Udacity should introduce the new version of the website.

The project involves choosing and characterizing metrics to evaluate experiments, designing an experiment with enough statistical power, analyzing the results and drawing valid conclusions, and ensuring that the participants of the experiment are adequately protected.

Here is a screenshot of what the experiment looks like: the free trial screener.

Experiment Design

Metric Choice

For each metric, explain both why you did or did not use it as an invariant metric and why you did or did not use it as an evaluation metric. Also, state what results you will look for in your evaluation metrics in order to launch the experiment.

Invariant metrics

  • a) Number of cookies: Number of unique cookies to view the course overview page.
  • b) Number of clicks: Number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered).
  • c) Click-through-probability: Number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page.

Evaluation metrics

  • a) Gross conversion: Number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button.
  • b) Retention: Number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout.
  • c) Net conversion: Number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button.

Reasons behind metric selection

  • Number of cookies is a good population-sizing invariant because cookies are randomly assigned between the control and experiment groups.
  • Number of cookies, Number of clicks, and Click-through-probability are all measured before the free trial screener is triggered, so they should remain invariant between the groups.
  • Gross conversion, Retention, and Net conversion can be used as evaluation metrics because the numerator of each metric is measured after the free trial screener is triggered.
  • These are the metrics that Udacity is trying to track in this experiment. By adding the free trial screener after the "Start free trial" button, the experiment could affect the number of user-ids that complete checkout.
  • Number of user-ids is the number of users who enroll in the free trial. It cannot be used as an invariant metric or as an evaluation metric. User-ids are only tracked after a student enrolls in the free trial, so they will not necessarily be equally distributed between the control and experiment groups; hence they cannot serve as an invariant metric. Number of user-ids is essentially the raw count of enrollments, and a raw count cannot be adjusted for differences in group sizes. We are already using Gross conversion to measure the impact on enrollments, and Gross conversion is more robust because it is normalized by the unit of diversion. So I opted not to use Number of user-ids as an evaluation metric either.

Results to look for in order to launch the experiment
Our experimental goals are (1) to reduce enrollments by unprepared students (2) without significantly reducing the number of students who complete the free trial and make at least one payment.

  • Goal 1 is achieved by a decrease in Gross conversion. Since workload expectations are set up front, we expect the number of students completing checkout, and hence Gross conversion, to decrease.
  • Goal 2 can be achieved by an increase or no change in Retention and Net conversion.

Measuring Standard Deviation

For each of your evaluation metrics, indicate whether you think the analytic estimate would be comparable to the empirical variability, or whether you expect them to be different (in which case it might be worth doing an empirical estimate if there is time). Briefly give your reasoning in each case.

For a binomial distribution with probability $p$ and sample size $N$, the analytic standard deviation is $SD = \sqrt{\frac{p(1-p)}{N}}$.

Analytical Estimate of Standard Deviation given 5000 cookies per day

| Evaluation Metric | Standard Deviation |
|-------------------|--------------------|
| Gross Conversion  | 0.02023            |
| Retention         | 0.05495            |
| Net Conversion    | 0.01560            |

  • Analytically computed variability is likely to be close to empirically computed variability when the unit of diversion and the unit of analysis are the same, which is the cookie in this case. This holds for Gross conversion and Net conversion.
  • For the Retention metric, the unit of analysis is the user-id, which is not the same as the unit of diversion (the cookie). The empirical variability of this metric is likely to be higher than the analytic estimate, so it might be worth doing an empirical estimate of variability for this metric if there is time.
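
The short sketch below reproduces the analytic estimates in the table above. One input is not stated in this section and is assumed here: a baseline click-through probability of roughly 0.08, i.e. about 400 "Start free trial" clicks per 5,000 pageviews, which is consistent with the reported values.

```python
from math import sqrt

PAGEVIEWS = 5000
CLICKS = PAGEVIEWS * 0.08          # assumed click-through probability of ~0.08
ENROLLMENTS = CLICKS * 0.20625     # baseline gross conversion applied to clicks

def analytic_sd(p, n):
    """Analytic SD of a binomial proportion: sqrt(p * (1 - p) / n)."""
    return sqrt(p * (1 - p) / n)

print("Gross conversion:", round(analytic_sd(0.20625, CLICKS), 5))    # ~0.02023
print("Retention:       ", round(analytic_sd(0.53, ENROLLMENTS), 5))  # ~0.05495
print("Net conversion:  ", round(analytic_sd(0.10931, CLICKS), 5))    # ~0.01560
```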

Sizing

Number of Samples vs. Power

Using the analytic estimates of variance, how many pageviews total (across both groups) would you need to collect to adequately power the experiment? Use an alpha of 0.05 and a beta of 0.2. Make sure you have enough power for each metric.

I did not use the Bonferroni correction because we are measuring three highly correlated metrics. The sample sizes needed to adequately power the experiment were calculated with an online sample-size calculator.

| Evaluation Metric | Baseline Conversion Rate | d_min  | alpha | beta | Sample size (per group) | Pageviews (both groups) |
|-------------------|--------------------------|--------|-------|------|-------------------------|-------------------------|
| Retention         | 0.53                     | 0.01   | 0.05  | 0.2  | 39,115                  | 4,741,212               |
| Net Conversion    | 0.10931                  | 0.0075 | 0.05  | 0.2  | 27,413                  | 685,325                 |
| Gross Conversion  | 0.20625                  | 0.01   | 0.05  | 0.2  | 25,835                  | 645,875                 |
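
As a rough cross-check, the sample sizes can also be approximated in Python with statsmodels. The results differ slightly from the online calculator because statsmodels uses the arcsine-transformed effect size (Cohen's h). The click-through probability of ~0.08 used to convert sample sizes into pageviews is an assumption, not a figure stated in this report.

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

CTP = 0.08  # assumed clicks per pageview (not stated in this report)
metrics = {
    # name: (baseline rate, d_min, denominator units per pageview)
    "Gross conversion": (0.20625, 0.0100, CTP),
    "Retention":        (0.53000, 0.0100, CTP * 0.20625),
    "Net conversion":   (0.10931, 0.0075, CTP),
}

solver = NormalIndPower()
for name, (p, d_min, per_pageview) in metrics.items():
    effect_size = proportion_effectsize(p + d_min, p)  # Cohen's h
    n_per_group = solver.solve_power(effect_size=effect_size, alpha=0.05,
                                     power=0.8, ratio=1.0,
                                     alternative="two-sided")
    pageviews = 2 * n_per_group / per_pageview          # both groups combined
    print(f"{name:16s} n/group ~ {n_per_group:8.0f}  pageviews ~ {pageviews:10.0f}")
```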

Duration vs. Exposure

What percentage of Udacity's traffic would you divert to this experiment (assuming there were no other experiments you wanted to run simultaneously)? Is the change risky enough that you wouldn't want to run on all traffic?

Given the percentage you chose, how long would the experiment take to run, using the analytic estimates of variance?

| Evaluation Metric | Fraction of traffic diverted | Duration (days) |
|-------------------|------------------------------|-----------------|
| Retention         | 1.0                          | 119             |
| Net Conversion    | 1.0                          | 18              |
| Gross Conversion  | 1.0                          | 17              |

If we divert 100% of Udacity's traffic, the experiment would still have to run for 119 days to adequately power the Retention metric. This is an unreasonably long time. However, if we drop Retention and measure only Net conversion and Gross conversion, the experiment can be run in 18 days with enough pageviews to adequately power it.
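
The durations above follow from simple arithmetic, sketched below. The daily traffic figure of roughly 40,000 pageviews is an assumption; it is consistent with the durations in the table but is not stated in this report.

```python
import math

DAILY_PAGEVIEWS = 40_000  # assumed baseline traffic
pageviews_needed = {
    "Retention": 4_741_212,
    "Net conversion": 685_325,
    "Gross conversion": 645_875,
}

fraction_diverted = 1.0   # 100% of traffic
for metric, pv in pageviews_needed.items():
    days = math.ceil(pv / (DAILY_PAGEVIEWS * fraction_diverted))
    print(f"{metric:16s} -> {days:3d} days")   # 119, 18, 17
```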

The experiment poses minimal risk to both students and Udacity: (1) no one can be harmed by the change or by the duration of the experiment, and (2) we are not handling sensitive data such as political attitudes, personal medical history, or sexual preferences. The entire traffic can therefore be diverted to this experiment if no other experiments run in parallel. Even if we divert only 50% of Udacity's traffic, the experiment can be completed in 36 days.

Experiment Analysis

Sanity Checks

For each of your invariant metrics, give the 95% confidence interval for the value you expect to observe, the actual observed value, and whether the metric passes your sanity check.

| Invariant Metric                                 | Lower bound | Upper bound | Observed | Passes |
|--------------------------------------------------|-------------|-------------|----------|--------|
| Number of cookies                                | 0.4988      | 0.5012      | 0.5006   | Yes    |
| Number of clicks on “Start free trial”           | 0.4959      | 0.5041      | 0.5005   | Yes    |
| Click-through probability on “Start free trial”  | -0.0013     | 0.0013      | 0.0001   | Yes    |
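
A minimal sketch of the sanity check: under the null hypothesis each cookie (or click) lands in the control group with probability 0.5, so the observed control fraction should fall inside a 95% confidence interval around 0.5. The function below takes the group totals as inputs; the actual totals come from the experiment's daily results and are not reproduced here.

```python
from math import sqrt

def sanity_check(count_control, count_experiment, expected=0.5, z=1.96):
    """95% CI around the expected control fraction, and whether the
    observed fraction falls inside it."""
    n = count_control + count_experiment
    se = sqrt(expected * (1 - expected) / n)
    lower, upper = expected - z * se, expected + z * se
    observed = count_control / n
    return lower, upper, observed, lower <= observed <= upper

# usage (totals come from the experiment's daily results):
# lower, upper, observed, passes = sanity_check(pageviews_control, pageviews_experiment)
```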

Result Analysis

Effect Size Tests

For each of your evaluation metrics, give a 95% confidence interval around the difference between the experiment and control groups. Indicate whether each metric is statistically and practically significant.

| Evaluation Metric | Lower bound | Upper bound | Observed | Statistically significant | Practically significant |
|-------------------|-------------|-------------|----------|---------------------------|--------------------------|
| Net conversion    | -0.01160    | 0.001857    | -0.0049  | No                        | No                       |
| Gross conversion  | -0.02912    | -0.01199    | -0.02055 | Yes                       | Yes                      |

Statistical significance was determined by whether or not the confidence interval contains 0. Practical significance was determined using the minimum detectable effect ($d_{min}$) parameter: $d_{min}$ was set at 0.01 for Gross conversion and at 0.0075 for Net conversion. For Gross conversion, the observed decrease is more than twice the practical significance boundary, and the confidence interval does not include zero; hence Gross conversion is both statistically and practically significant. For Net conversion, the observed value is well within the practical significance boundaries and the confidence interval includes zero; hence Net conversion is neither statistically nor practically significant.
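
A minimal sketch of the effect-size test, using a pooled standard error for the difference in click-normalized proportions. The counts passed in would come from the days with complete enrollment and payment data; the function and variable names are illustrative.

```python
from math import sqrt

def diff_confidence_interval(x_cont, n_cont, x_exp, n_exp, z=1.96):
    """95% CI for the (experiment - control) difference in a proportion,
    using the pooled standard error. x = numerator counts (e.g. enrollments),
    n = denominator counts (e.g. clicks)."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    d_hat = x_exp / n_exp - x_cont / n_cont
    return d_hat, d_hat - z * se_pool, d_hat + z * se_pool

# usage (counts come from the days with complete enrollment/payment data):
# d, lo, hi = diff_confidence_interval(enroll_cont, clicks_cont, enroll_exp, clicks_exp)
# statistically significant if the interval excludes 0; practically significant
# if the interval also lies entirely beyond the d_min boundary (0.01 or 0.0075 here).
```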

Sign Tests

For each of your evaluation metrics, do a sign test using the day-by-day data, and report the p-value of the sign test and whether the result is statistically significant.

We performed the sign test using an online sign-test calculator.

Sign test: under the null hypothesis of no change, the probability of a positive difference on any given day is 0.5. Then:

| Evaluation Metric | Days with positive change | Total days | Two-tailed p-value | Statistically significant |
|-------------------|---------------------------|------------|--------------------|---------------------------|
| Net conversion    | 10                        | 23         | 0.6776             | No                        |
| Gross conversion  | 4                         | 23         | 0.0026             | Yes                       |
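
The p-values above can be reproduced with a two-tailed binomial test, for example with scipy:

```python
from scipy.stats import binomtest

for name, successes, days in [("Gross conversion", 4, 23),
                              ("Net conversion", 10, 23)]:
    result = binomtest(successes, n=days, p=0.5, alternative="two-sided")
    print(f"{name:16s} p-value = {result.pvalue:.4f}")  # 0.0026 and 0.6776
```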

Summary

State whether you used the Bonferroni correction, and explain why or why not. If there are any discrepancies between the effect size hypothesis tests and the sign tests, describe the discrepancy and why you think it arose.

We did not use the Bonferroni correction for the sign tests, for the following reasons:

  • To make our recommendation we need to consider both Net conversion and Gross conversion. We want both metrics to match our expectations (we look for a decrease in Gross conversion and for an increase, or at least no decrease, in Net conversion). We are in a situation where ALL metrics need to match our expectations in order to launch the change.
  • This is not the same as the case where ANY metric needs to match the expectations. In fact, it is the exact opposite.
  • False negatives have the greatest impact when ALL metrics must be satisfied to trigger launch, since a single false negative can govern the decision.
  • False positives have the greatest impact when ANY metrics satisfied can trigger launch, since a single false positive will govern the decision.
  • The Bonferroni correction controls for false positives at the expense of power, or increased false negatives.

For Net conversion, neither the effect size test nor the sign test is statistically significant, whereas for Gross conversion both tests are statistically significant.

The effect size test for Retention is statistically significant, whereas the sign test is not. There are two possible reasons: (1) we did not run the experiment long enough to collect adequate pageviews, and hence enough power, for this metric; (2) the sign test is non-parametric and has lower power than the effect size test.

Recommendation

My recommendation is not to launch the change.

We had to achieve two objectives in order to launch the change: (1) reduce the number of frustrated students who leave the free trial because they do not have enough time, (2) without significantly reducing the number of students who complete the free trial and make at least one payment.

1) There was a statistically and practically significant decrease in Gross conversion, which behaved as expected: the free trial screener decreased the number of students who completed checkout because workload expectations were set up front.

2) However, the change in Net conversion was neither statistically nor practically significant, so it did not behave as we needed it to in order to launch the change. Moreover, the lower bound of the confidence interval falls below the negative practical significance boundary, so it is possible that Net conversion decreased and reduced the number of students who complete the free trial and make at least one payment. This is not desirable.

Follow-Up Experiment

Give a high-level description of the follow up experiment you would run, what your hypothesis would be, what metrics you would want to measure, what your unit of diversion would be, and your reasoning for these choices.

Udacity can test a change where they add a “Book Welcome Videochat” button after enrollment in the course. If a student clicks “Start free trial” and subsequently checks out and enrolls, they will be required to attend a video chat appointment with a Udacity coach within the first 14 days of the free trial.

The hypothesis is that talking to a Udacity coach will give students a personal touch, boost student morale, set clear expectations about the course, and illustrate the benefits of completing the course with past examples. If the hypothesis holds true, the change will boost students' enthusiasm and provide momentum to complete the course.

The unit of diversion will be the user-id, because the experiment takes place after the student enrolls in the course. We want each student to have a consistent experience independent of platform and device, and the same user-id cannot book the video chat appointment twice.

The experiment takes place after enrollment, so cookie-based and click-based metrics are not relevant; we are already past that point in the funnel. Those metrics are also less stable than metrics based on user-ids, which should be used here.

Invariant metrics

  • Number of user-ids

User-ids are explicitly randomized between the control and experiment groups, so the number of user-ids will serve as a good population-sizing invariant.

Evaluation metrics

  • Retention

Retention is the metric that Udacity is trying to measure in this experiment. The experiment is expected to affect the number of user-ids that remain enrolled past the 14-day boundary.

