Analyze an A/B test and get numbers you can act on: a p-value, a z-score, a 95% confidence interval on the difference in conversion rates, and a quick power check, all from visitor and conversion counts you already have. No spreadsheet, no Python, no uploading data anywhere.
How it works
The calculator performs a two-proportion z-test, a standard fixed-sample method for comparing two conversion rates. Here is the exact sequence:
- Estimate each rate. p_A = conversions_A / visitors_A and p_B = conversions_B / visitors_B. Below, n_A and n_B are shorthand for visitors_A and visitors_B.
- Pool under the null. If the two variants are truly equal, the best estimate of the common rate is p_pool = (conversions_A + conversions_B) / (n_A + n_B).
- Compute the pooled standard error. SE = sqrt( p_pool * (1 - p_pool) * (1/n_A + 1/n_B) )
- Compute the z-score. z = (p_B - p_A) / SE
- Convert to a p-value. For a two-tailed test, the p-value is the area under the standard normal curve in both tails beyond ±|z|: p = 2 * (1 - Phi(|z|)), where Phi is the standard normal CDF. For a one-tailed test of B > A, it is the area above z: p = 1 - Phi(z).
- Build the confidence interval. The 95% CI on the absolute difference uses an unpooled standard error, SE_ci = sqrt( p_A*(1-p_A)/n_A + p_B*(1-p_B)/n_B ), so it reflects the uncertainty in each rate separately: CI = (p_B - p_A) ± 1.96 * SE_ci.
- Power check. Using Cohen’s formula for the minimum sample size needed for 80% power at alpha = 0.05 (two-tailed), the tool flags whether your current sample is likely too small to reliably detect the observed effect.
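The test and interval steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the calculator's own code: the function name `two_proportion_z_test`, its parameter names, and the returned dictionary keys are all assumptions made for the example, and the normal CDF here comes from the standard library's `math.erf` rather than from any approximation the tool itself may use.

```python
import math


def phi(x: float) -> float:
    """Standard normal CDF, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int,
                          z_crit: float = 1.96) -> dict:
    """Two-proportion z-test (pooled SE) plus a 95% CI (unpooled SE).

    Hypothetical helper for illustration; names are not from the tool.
    """
    # Estimate each rate.
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a

    # Pool under the null hypothesis of equal rates.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se_pooled == 0:
        raise ValueError("pooled standard error is zero (no variation in outcomes)")

    # z-score and p-values.
    z = diff / se_pooled
    p_two_tailed = 2 * (1 - phi(abs(z)))
    p_one_tailed = 1 - phi(z)  # one-tailed test of B > A

    # Confidence interval on the absolute difference, unpooled SE.
    se_ci = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - z_crit * se_ci, diff + z_crit * se_ci)

    return {
        "p_a": p_a,
        "p_b": p_b,
        "z": z,
        "p_two_tailed": p_two_tailed,
        "p_one_tailed": p_one_tailed,
        "ci_95": ci,
    }
```

Calling `two_proportion_z_test(250, 5000, 295, 5000)` returns the rates, z-score, both p-values, and the interval bounds as fractions rather than percentages. The zero-SE guard is a design choice for the sketch; it covers the degenerate case where neither variant has any conversions or every visitor converts.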
Worked example
Suppose your checkout button test ran for two weeks:
| Variant | Visitors | Conversions | Rate |
|---|---|---|---|
| Control (A) | 5,000 | 250 | 5.00% |
| Variant (B) | 5,000 | 295 | 5.90% |
- Relative lift: (5.90% - 5.00%) / 5.00% = +18.0%
- Pooled rate: (250 + 295) / 10,000 = 5.45%
- Pooled SE: sqrt(0.0545 * 0.9455 * (1/5000 + 1/5000)) = 0.004540
- z-score: (0.059 - 0.050) / 0.004540 = 1.982
- p-value (two-tailed): 2 * (1 - Phi(1.982)) = 0.0474
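- 95% CI on difference: unpooled SE = sqrt(0.050 * 0.950 / 5000 + 0.059 * 0.941 / 5000) = 0.004539, so 0.0090 ± 1.96 * 0.004539 = [+0.01, +1.79] percentage points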
The result is statistically significant at the 5% level (p = 0.0474 < 0.05). The confidence interval is entirely positive, which is consistent with B being better, although its lower bound sits barely above zero. The minimum sample size for 80% power at this effect size is roughly 9,983 per variant. The current sample of 5,000 per variant is below that, so collecting more data would strengthen confidence in the result.
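The sample-size formula itself is not spelled out on this page. As an illustration only, the sketch below uses the common normal-approximation formula with unpooled variances, n = (z_alpha/2 + z_beta)^2 * (p_A*(1-p_A) + p_B*(1-p_B)) / (p_B - p_A)^2. Treating this as the tool's formula is an assumption, although it does reproduce the figure of roughly 9,983 per variant for the example rates. The function name and default arguments are hypothetical.

```python
import math


def min_sample_size_per_variant(p_a: float, p_b: float,
                                z_alpha: float = 1.959964,
                                z_beta: float = 0.841621) -> int:
    """Visitors needed per variant to detect p_b vs p_a.

    Defaults correspond to alpha = 0.05 (two-tailed) and 80% power.
    Assumed formula: normal approximation with unpooled variances;
    the calculator's exact formula may differ.
    """
    if p_a == p_b:
        raise ValueError("rates are identical; no effect to detect")
    variance_sum = p_a * (1 - p_a) + p_b * (1 - p_b)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p_b - p_a) ** 2
    return math.ceil(n)  # round up to a whole visitor


required = min_sample_size_per_variant(0.050, 0.059)
underpowered = 5000 < required  # the flag the power check would raise
```

With the example rates of 5.00% and 5.90%, `required` comes out at 9983, so a sample of 5,000 per variant is flagged as underpowered.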
Formula note
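The normal CDF is computed from the error function, Phi(z) = 0.5 * (1 + erf(z / sqrt(2))), using the Abramowitz and Stegun rational approximation to erf (formula 7.1.26), which has a maximum absolute error of 1.5e-7, more than sufficient for practical A/B testing. The pooled standard error is used for the hypothesis test (matching the null hypothesis of equal rates), while the unpooled standard error is used for the confidence interval (reflecting the actual uncertainty in each arm). This split is standard practice for fixed-sample two-proportion tests. Platforms such as Optimizely and VWO may report different numbers, because their default statistics engines use sequential or Bayesian methods rather than a fixed-sample z-test.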
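For readers who want to see the approximation, here is a minimal Python sketch of Abramowitz and Stegun formula 7.1.26 and the normal CDF built on it. The constants are the published ones for that formula; the function names `erf_as` and `phi_as` are made up for this example, and the code is an illustration rather than the calculator's actual implementation.

```python
import math

# Published constants for Abramowitz and Stegun formula 7.1.26.
P = 0.3275911
A1, A2, A3, A4, A5 = (0.254829592, -0.284496736, 1.421413741,
                      -1.453152027, 1.061405429)


def erf_as(x: float) -> float:
    """Approximate erf(x); the formula is stated for x >= 0,
    so negative inputs use the symmetry erf(-x) = -erf(x)."""
    sign = -1.0 if x < 0 else 1.0
    x = abs(x)
    t = 1.0 / (1.0 + P * x)
    # Evaluate the degree-5 polynomial in t with Horner's scheme.
    poly = t * (A1 + t * (A2 + t * (A3 + t * (A4 + t * A5))))
    return sign * (1.0 - poly * math.exp(-x * x))


def phi_as(z: float) -> float:
    """Standard normal CDF built on the approximate erf."""
    return 0.5 * (1.0 + erf_as(z / math.sqrt(2.0)))
```

Because Phi is half of erf plus a constant, the error in `phi_as` is about half the error in `erf_as`, which is far smaller than the precision at which p-values are normally reported.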