What is a p-value in an A/B test?

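The p-value is the probability of observing a difference at least as large as the one you measured, assuming the two variants are actually identical. A p-value of 0.05 means that, if there were truly no difference, a gap this large would still show up in about 5% of experiments by chance alone. It is not the probability that the result is noise or that the variants are equal. Smaller p-values mean stronger evidence against "no difference"; most teams require p ≤ 0.05 before acting on a result.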

What is the difference between one-tailed and two-tailed tests?

A two-tailed test asks whether B is *different* from A (either better or worse). A one-tailed test asks only whether B is *better* than A. Choose one-tailed only if you decided before the experiment that a worse-performing B would be treated the same as no difference; otherwise two-tailed is the honest choice.
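To make the distinction concrete, here is a minimal sketch that turns one z-score into both kinds of p-value. The function name `p_values_from_z` is made up for illustration, and the standard normal CDF comes from Python's standard library rather than from the calculator's own code.

```python
from statistics import NormalDist


def p_values_from_z(z: float) -> dict:
    """Return one- and two-tailed p-values for a z-score computed as B minus A."""
    phi = NormalDist().cdf  # standard normal CDF
    return {
        # Two-tailed: is B different from A in either direction?
        "two_tailed": 2.0 * (1.0 - phi(abs(z))),
        # One-tailed: is B better than A?
        "one_tailed_b_better": 1.0 - phi(z),
    }


print(p_values_from_z(1.982))   # B ahead of A
print(p_values_from_z(-1.982))  # B behind A
```

When B is ahead (positive z), the one-tailed p-value is exactly half the two-tailed one, which is why switching to one-tailed after seeing the data inflates false positives. When B is behind, the one-tailed p-value is close to 1.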

How is the z-score calculated?

The test first estimates each rate (p_A = conversions_A / visitors_A, p_B = conversions_B / visitors_B), then pools them into a single estimate of the common rate under the null hypothesis: p_pool = (conversions_A + conversions_B) / (visitors_A + visitors_B). The pooled standard error is SE = sqrt(p_pool * (1 - p_pool) * (1/n_A + 1/n_B)), where n_A and n_B are the visitor counts. The z-score is (p_B - p_A) / SE, and the p-value follows from the standard normal distribution.

What is the 95% confidence interval showing?

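The CI is the range of plausible values for the *true* absolute difference in conversion rates (p_B - p_A), expressed in percentage points. Because the interval uses an unpooled standard error while the test uses a pooled one, the two can disagree near the threshold: if the interval spans zero, the direction of the effect is uncertain even if the p-value is below 0.05. Always check the CI, not just the p-value.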
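The interval can be sketched in a few lines. This is an illustration under the unpooled formula the page describes, not the calculator's actual source; the function name `diff_confidence_interval` and the `z_crit` parameter are assumptions.

```python
import math


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """95% CI (by default) on the absolute difference p_B - p_A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Unpooled standard error: each arm contributes its own variance.
    se_ci = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z_crit * se_ci, diff + z_crit * se_ci


# Counts taken from the worked example on this page.
low, high = diff_confidence_interval(250, 5000, 295, 5000)
spans_zero = low <= 0.0 <= high
print(f"[{low * 100:+.2f}, {high * 100:+.2f}] percentage points, spans zero: {spans_zero}")
```

The returned bounds are proportions, so multiply by 100 to read them as percentage points.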

How many visitors do I need for a valid A/B test?

The calculator shows the minimum sample size per variant needed for 80% statistical power at the observed effect size (alpha = 0.05, two-tailed). If your current sample is below that number, the test is underpowered — real effects can easily fail to reach significance.
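The page does not spell out the sample size formula, so the sketch below assumes the common normal-approximation formula for comparing two proportions: n = (z_alpha/2 + z_beta)^2 * (p_A(1-p_A) + p_B(1-p_B)) / (p_B - p_A)^2. The calculator may use a different variant. The function name `min_sample_size_per_variant` is hypothetical.

```python
import math
from statistics import NormalDist


def min_sample_size_per_variant(p_a, p_b, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant (two-tailed test, equal arm sizes)."""
    if p_a == p_b:
        raise ValueError("rates are identical; no effect size to detect")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance_sum = p_a * (1 - p_a) + p_b * (1 - p_b)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p_b - p_a) ** 2
    return math.ceil(n)


# Rates from the worked example on this page: 5.00% vs 5.90%.
print(min_sample_size_per_variant(0.05, 0.059))
```

Under this assumption the rates from the worked example give a figure that agrees with the roughly 9,983 per variant quoted there. Note that required sample size grows with the inverse square of the difference, so halving the effect you want to detect roughly quadruples the traffic you need.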

Is statistical significance the same as practical significance?

No. A very large experiment can detect a 0.01% lift as statistically significant, but that lift may be commercially meaningless. Always combine the p-value with the confidence interval and the relative lift to decide whether the effect is large enough to justify shipping.
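A quick illustration with invented numbers: a hypothetical experiment with 10 million visitors per arm and conversion rates of 5.00% versus 5.03%. The counts and the function name `significance_and_lift` are assumptions made up for this sketch.

```python
import math
from statistics import NormalDist


def significance_and_lift(conv_a, n_a, conv_b, n_b):
    """Return (two-tailed p-value, relative lift of B over A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    relative_lift = (p_b - p_a) / p_a
    return p_value, relative_lift


# Hypothetical counts: 5.00% vs 5.03% with 10 million visitors per arm.
p_value, lift = significance_and_lift(500_000, 10_000_000, 503_000, 10_000_000)
print(f"p = {p_value:.4f}, relative lift = {lift:.2%}")
```

Here the p-value lands around 0.002, comfortably significant, yet the relative lift is only 0.6%. Whether that is worth shipping is a business question the p-value cannot answer.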

A/B Test Significance Calculator

Run an A/B test and get a number you can act on: a p-value, a z-score, a 95% confidence interval on the difference in conversion rates, and a quick power check, all from visitor and conversion counts you already have. No spreadsheet, no Python, no uploading data anywhere.

How it works

The calculator performs a two-proportion z-test, the industry standard for comparing two conversion rates. Here is the exact sequence:

1. Estimate each rate. p_A = conversions_A / visitors_A and p_B = conversions_B / visitors_B.
2. Pool under the null. If the two variants are truly equal, the best estimate of the common rate is p_pool = (conversions_A + conversions_B) / (visitors_A + visitors_B).
3. Compute the pooled standard error. SE = sqrt( p_pool * (1 - p_pool) * (1/n_A + 1/n_B) ), where n_A = visitors_A and n_B = visitors_B.
4. Compute the z-score. z = (p_B - p_A) / SE
5. Convert to a p-value. The p-value is the area under the standard normal curve beyond the observed z-score, where Phi is the standard normal CDF. For a two-tailed test: p = 2 * (1 - Phi(|z|)). For a one-tailed test of B > A: p = 1 - Phi(z).
6. Build the confidence interval. The 95% CI on the absolute difference uses an unpooled standard error, SE_ci = sqrt( p_A*(1-p_A)/n_A + p_B*(1-p_B)/n_B ), so it reflects uncertainty about both rates independently: CI = (p_B - p_A) ± 1.96 * SE_ci.
7. Power check. Using Cohen’s formula for the minimum sample size needed for 80% power at alpha = 0.05 (two-tailed), the tool flags whether your current sample is likely too small to reliably detect the observed effect.
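The hypothesis-test part of the sequence above (from estimating the rates through converting to a p-value) can be sketched as follows. This is an illustration of the described procedure, not the calculator's actual code: the function name `two_proportion_z_test` and its `two_tailed` flag are assumptions, and Phi comes from Python's standard library.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b, two_tailed=True):
    """Pooled two-proportion z-test. Returns (z, p_value)."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("visitor counts must be positive")
    # Estimate each rate.
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pool under the null hypothesis of equal rates.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    # Pooled standard error.
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; the z-score is undefined")
    # z-score, then p-value from the standard normal CDF.
    z = (p_b - p_a) / se
    phi = NormalDist().cdf
    p_value = 2 * (1 - phi(abs(z))) if two_tailed else 1 - phi(z)
    return z, p_value


z, p = two_proportion_z_test(250, 5000, 295, 5000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

The guard on a zero standard error covers the edge case where neither variant has any conversions (or every visitor converts), in which case the test has nothing to work with.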

Worked example

Suppose your checkout button test ran for two weeks:

Variant	Visitors	Conversions	Rate
Control (A)	5,000	250	5.00%
Variant (B)	5,000	295	5.90%

Relative lift: +18.0%
Pooled rate: (250 + 295) / 10,000 = 5.45%
Pooled SE: sqrt(0.0545 * 0.9455 * (1/5000 + 1/5000)) = 0.004540
z-score: (0.059 - 0.050) / 0.004540 = 1.982
p-value (two-tailed): 2 * (1 - Phi(1.982)) = 0.0474
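Unpooled SE: sqrt(0.050 * 0.950 / 5000 + 0.059 * 0.941 / 5000) = 0.004539
95% CI on difference: 0.90 ± 1.96 * 0.4539 = [+0.01, +1.79] percentage points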

The result is significant (p ≤ 0.05). The confidence interval is entirely positive, which is consistent with B being better, although the lower bound sits only just above zero. The minimum sample size for 80% power at this effect size is roughly 9,983 per variant; the current sample of 5,000 per variant is below that, so collecting more data would strengthen confidence in the result.

Formula note

The normal CDF is computed from the error function, Phi(z) = 0.5 * (1 + erf(z / sqrt(2))), using the Abramowitz and Stegun approximation to erf (formula 7.1.26), which has a maximum absolute error of 1.5e-7, more than sufficient for practical A/B testing. The pooled standard error is used for the hypothesis test (matching the null hypothesis of equal rates), while the unpooled standard error is used for the confidence interval (reflecting the actual uncertainty in each arm). This is standard practice for fixed-horizon two-proportion tests.
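For readers who want to see the approximation itself, here is a minimal sketch of formula 7.1.26 using its published coefficients. The function names `erf_as` and `normal_cdf` are made up for this example, and the calculator's own implementation may be organised differently.

```python
import math

# Published coefficients of Abramowitz and Stegun formula 7.1.26.
_P = 0.3275911
_A = (0.254829592, -0.284496736, 1.421413741, -1.453152027, 1.061405429)


def erf_as(x: float) -> float:
    """Approximate erf(x); the formula is stated for x >= 0 and extended by symmetry."""
    sign = -1.0 if x < 0 else 1.0
    x = abs(x)
    t = 1.0 / (1.0 + _P * x)
    # Horner evaluation of a1*t + a2*t^2 + a3*t^3 + a4*t^4 + a5*t^5.
    poly = 0.0
    for a in reversed(_A):
        poly = (poly + a) * t
    return sign * (1.0 - poly * math.exp(-x * x))


def normal_cdf(z: float) -> float:
    """Standard normal CDF via Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    return 0.5 * (1.0 + erf_as(z / math.sqrt(2.0)))


# Compare against the library erf on a grid from -5 to 5.
worst = max(abs(erf_as(i / 100) - math.erf(i / 100)) for i in range(-500, 501))
print(f"max abs error vs math.erf: {worst:.2e}")
```

Because Phi halves the erf term, the error carried into the CDF is about half the erf error, far smaller than anything that would change a p-value at the precision shown on this page.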