Cold Email A/B Test Sample Size Calculator: The 2026 Guide

By Daniel Park, Editor, Comparisons · May 22, 2026 · 9 min read · Last reviewed May 22, 2026

Stop calling A/B tests at 200 sends. The exact sample sizes you need for cold email A/B tests in 2026, with worked examples and the calculator logic.

Why most cold email A/B tests are statistical noise

The most common mistake in cold outbound: send 300 emails per variant, declare a winner, and scale what may well be the losing version. A proper cold email A/B test sample size calculator tells you the truth: to detect a realistic 2 percentage point lift on a 5% reply rate (5% to 7%) at 95% confidence and 80% power, you need around 2,200 sends per variant. Most teams declare victory at about a tenth of that.

This guide gives you the formula, the worked examples, and the operator shortcuts so you stop fooling yourself.

The numbers you need before you calculate

Baseline conversion rate (current reply rate, current positive reply rate, etc.)
Minimum detectable effect (MDE) - smallest lift worth caring about
Statistical power (use 80%)
Significance level (use 5%, i.e. 95% confidence, two-tailed)

The formula behind a cold email A/B test sample size calculator

For two-proportion tests, the standard formula is:

n = (Z_alpha + Z_beta)^2 * (p1*(1-p1) + p2*(1-p2)) / (p1 - p2)^2

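Where n is the number of sends per variant, p1 is the baseline rate, p2 is the baseline plus the MDE, Z_alpha = 1.96 (95% confidence, two-tailed) and Z_beta = 0.84 (80% power). You do not need to memorize this; you need to know what it produces.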
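As a minimal sketch, the formula above translates almost line for line into Python. The function name and argument names are illustrative choices, not part of any particular tool, and the Z-scores are the rounded values quoted in the text.

```python
import math

# Z-scores from the text: 1.96 for 95% confidence (two-tailed), 0.84 for 80% power.
Z_ALPHA = 1.96
Z_BETA = 0.84


def sample_size_per_variant(baseline: float, mde: float) -> int:
    """Sends needed per variant to detect an absolute lift of `mde` over `baseline`.

    Both arguments are proportions: 0.05 means 5%, 0.02 means 2 percentage points.
    """
    if not 0 < baseline < 1 or mde <= 0 or baseline + mde >= 1:
        raise ValueError("expected 0 < baseline < baseline + mde < 1")
    p1 = baseline
    p2 = baseline + mde
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (Z_ALPHA + Z_BETA) ** 2 * variance_sum / (p1 - p2) ** 2
    return math.ceil(n)  # round up: you cannot send a fraction of an email


# The example from the introduction: 5% reply rate, 2 percentage point lift.
print(sample_size_per_variant(0.05, 0.02))  # about 2,200 sends per variant
```

Rounding up rather than to the nearest integer is a deliberate choice here: rounding down would leave the test very slightly short of the target power.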

Worked example 1: subject line test on reply rate

Baseline reply rate: 6%. You want to detect a lift to 7.5% (MDE of 1.5pp). Plugging in p1 = 0.06 and p2 = 0.075: n = 7.84 * 0.1258 / 0.000225, or roughly 4,400 per variant. So you need about 8,800 sends total before you can credibly call this test.

Worked example 2: opening line test on positive reply rate

Baseline positive reply rate: 1.2%. MDE: 0.4pp. n = roughly 13,500 per variant. This is why most opening-line tests on positive replies are essentially un-runnable for sub-scale teams.

Worked example 3: CTA test on meeting-booked rate

Baseline: 0.6%. MDE: 0.2pp. n = roughly 27,000 per variant. If you are sending 500 emails a day, that is more than three months of testing for one A/B. Most teams should not test CTAs on meeting rate; test on reply rate and trust the proxy.
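To recompute the three worked examples and turn each into a calendar estimate, a sketch along these lines may help. It applies the same two-proportion formula; `days_to_complete` and the 500-sends-a-day figure are illustrative, and it assumes every send goes to the test with volume split evenly across variants.

```python
import math


def sample_size_per_variant(baseline, mde, z_alpha=1.96, z_beta=0.84):
    """Two-proportion sample size per variant (95% confidence, 80% power by default)."""
    p1, p2 = baseline, baseline + mde
    n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
    return math.ceil(n)


def days_to_complete(baseline, mde, sends_per_day, variants=2):
    """Days of sending needed to fill every variant at a fixed daily volume."""
    total_sends = variants * sample_size_per_variant(baseline, mde)
    return math.ceil(total_sends / sends_per_day)


# (label, baseline, MDE) for the three worked examples above
examples = [
    ("subject line / reply rate", 0.06, 0.015),
    ("opening line / positive reply rate", 0.012, 0.004),
    ("CTA / meeting-booked rate", 0.006, 0.002),
]
for label, baseline, mde in examples:
    n = sample_size_per_variant(baseline, mde)
    days = days_to_complete(baseline, mde, sends_per_day=500)
    print(f"{label}: {n:,} per variant, {days} days at 500 sends/day")
```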

Operator shortcut: the 2,000-per-variant rule

If you do not want to do the math every time, here is the heuristic that fits 80% of cold email tests at typical B2B baselines (4-8% reply rate) when you are looking for a lift of about 2 percentage points: aim for 2,000 sends per variant for reply-rate tests; aim for 10,000+ for positive-reply or meeting-booked tests.

If you cannot reach those volumes, do not run the test. Pick the variant that is theoretically better and move on. You will learn more from shipping than from underpowered experiments.
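One way to decide whether a test is worth running is to flip the question: given the volume you actually have, what is the smallest lift you could detect? The sketch below inverts the same formula by bisection. `smallest_detectable_lift` is a hypothetical helper name, and it assumes the 95% confidence and 80% power settings used throughout.

```python
def required_n(baseline, mde, z_alpha=1.96, z_beta=0.84):
    """Sends per variant needed to detect an absolute lift of `mde` (not rounded)."""
    p1, p2 = baseline, baseline + mde
    return (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2


def smallest_detectable_lift(baseline, sends_per_variant):
    """Smallest absolute lift detectable with the given volume, found by bisection.

    Works because the required sample size shrinks as the lift grows.
    """
    lo, hi = 1e-6, 1 - baseline - 1e-6
    for _ in range(60):
        mid = (lo + hi) / 2
        if required_n(baseline, mid) > sends_per_variant:
            lo = mid  # this lift is too small to detect with the volume available
        else:
            hi = mid
    return hi


# 300 sends per variant at a 5% baseline:
print(round(smallest_detectable_lift(0.05, 300), 3))  # roughly 0.06, about 6 percentage points
```

If the answer is a lift you would never realistically expect from a copy change, that is the signal to skip the test and ship.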

How to design tests you can actually power

Test one variable at a time

Subject line OR opening line OR CTA - never all three. Multivariate testing requires geometric increases in sample size that no SDR team can afford.
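To put a rough number on that, the sketch below counts the cells in a full-factorial test and multiplies by the per-variant sample size. It is a simplification under stated assumptions: every combination gets its own fully powered cell, and no multiple-comparison correction is applied, which would push the total higher still. `full_factorial_sends` is an illustrative name, and `math.prod` requires Python 3.8 or later.

```python
import math


def sample_size_per_variant(baseline, mde, z_alpha=1.96, z_beta=0.84):
    """Two-proportion sample size per variant (95% confidence, 80% power by default)."""
    p1, p2 = baseline, baseline + mde
    n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
    return math.ceil(n)


def full_factorial_sends(baseline, mde, options_per_variable):
    """Total sends if every combination of options gets its own fully powered cell."""
    cells = math.prod(options_per_variable)  # e.g. [2, 2, 2] -> 8 combinations
    return cells * sample_size_per_variant(baseline, mde)


# 5% baseline, 2pp lift: one variable with two options vs. three variables with two each.
print(full_factorial_sends(0.05, 0.02, [2]))
print(full_factorial_sends(0.05, 0.02, [2, 2, 2]))
```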

Test high-impact variables first

Subject line and opening line move reply rate more than CTA or signature. Spend your statistical budget where the effect sizes are biggest.

Use sequenced rollouts, not split tests, for tiny lists

If your list is 800 contacts, do not split it. Send this week's 800 to variant A and next week's 800 to variant B. This is less rigorous, because week-to-week differences get mixed into the result, but it is more practical at small scale.

Tooling: where to run the calculation

Optimizely, Evan Miller's online calculator, or any stats package. Clay users sometimes wire a sample-size column into their tables to flag underpowered tests. Apollo and Smartlead show test results but do not flag significance - assume their "winner" labels are unreliable until you check the math.
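If your sequencer reports a "winner" without a significance figure, you can check the math yourself from the raw counts. The following is a minimal sketch of a pooled two-proportion z-test using only the standard library. The function name and the example counts are made up for illustration and do not reflect any tool's export format.

```python
import math


def two_proportion_p_value(replies_a, sends_a, replies_b, sends_b):
    """Two-tailed p-value for a pooled two-proportion z-test."""
    p_a = replies_a / sends_a
    p_b = replies_b / sends_b
    pooled = (replies_a + replies_b) / (sends_a + sends_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    if se == 0:
        return 1.0  # no replies (or all replies) in both arms: nothing to compare
    z = (p_b - p_a) / se
    # Two-tailed tail area of a standard normal via the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: 300 sends per variant, 15 replies (5%) vs. 24 replies (8%).
p = two_proportion_p_value(15, 300, 24, 300)
print(f"p-value: {p:.3f}")  # well above 0.05, so not a credible winner
```

A p-value below 0.05 only means something if the sample size was fixed in advance; it does not rescue a test that was stopped early.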

The mistake that kills cold email A/B tests

Peeking. Looking at the test on day 2, day 4, day 7, and stopping the moment "significance" appears. This inflates your false positive rate dramatically. Pre-commit to a sample size and do not check until you hit it.

The peeking penalty in numbers

If you peek 5 times during a test, your real false positive rate is closer to 15% than the nominal 5%. You will scale wrong variants more often than you think.
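The inflation is easy to see in a simulation. The sketch below runs A/A tests, where both variants share the same true reply rate so any "winner" is a false positive, and compares stopping at the first significant look against a single look at the end. All parameters (5% rate, 1,000 sends per arm, 5 evenly spaced looks, 1,000 trials) are assumptions chosen for speed, and the exact rates will vary with the seed.

```python
import math
import random


def is_significant(replies_a, replies_b, n, z_crit=1.96):
    """Pooled two-proportion z-test for two arms of equal size n."""
    pooled = (replies_a + replies_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return False
    return abs(replies_a - replies_b) / n / se > z_crit


def simulate(rate=0.05, n_per_arm=1000, looks=5, trials=1000, seed=0):
    """Return (false positive rate with peeking, false positive rate with one final look)."""
    rng = random.Random(seed)
    step = n_per_arm // looks
    peek_hits = final_hits = 0
    for _ in range(trials):
        a = b = 0
        any_look_significant = False
        for look in range(1, looks + 1):
            # Both arms use the same true rate, so every "win" is noise.
            a += sum(rng.random() < rate for _ in range(step))
            b += sum(rng.random() < rate for _ in range(step))
            significant = is_significant(a, b, look * step)
            any_look_significant = any_look_significant or significant
            if look == looks and significant:
                final_hits += 1
        peek_hits += any_look_significant
    return peek_hits / trials, final_hits / trials


peeking_rate, single_look_rate = simulate()
print(f"peeking: {peeking_rate:.1%}, single look: {single_look_rate:.1%}")
```

Because the final look is one of the peeks, the peeking rate can never come out lower than the single-look rate. The gap between the two is the penalty.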

What to test in 2026 (and what not to)

Worth testing: subject line patterns, opening line personalization depth, sequence length, send time of day, reply tone. Not worth testing at most scales: button vs link CTAs, signature format, email length differences under 30 words, day of week (use known data instead).

Connecting tests to deliverability

Every A/B test assumes equal inbox placement across variants. If variant B uses spammier words and lands in promotions, you are testing deliverability, not copy. Warm your domains first - see our warmup guide - and split inboxes evenly across variants.
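One way to split inboxes evenly is to randomize within each sending inbox, so that every inbox carries both variants in near-equal numbers. The sketch below assumes a hypothetical input of (contact_id, inbox) pairs; the data shape and the function name are illustrative, not any sequencer's API.

```python
import random
from collections import defaultdict


def assign_variants(contacts, variants=("A", "B"), seed=0):
    """contacts: iterable of (contact_id, inbox) pairs. Returns {contact_id: variant}."""
    rng = random.Random(seed)
    by_inbox = defaultdict(list)
    for contact_id, inbox in contacts:
        by_inbox[inbox].append(contact_id)

    assignment = {}
    for inbox, ids in by_inbox.items():
        rng.shuffle(ids)  # randomize so the variant is not tied to list order
        offset = rng.randrange(len(variants))  # vary which variant gets any odd contact
        for i, contact_id in enumerate(ids):
            assignment[contact_id] = variants[(i + offset) % len(variants)]
    return assignment


# Hypothetical list: 3 inboxes with 10 contacts each.
contacts = [(f"c{i}", f"inbox{i % 3}") for i in range(30)]
assignment = assign_variants(contacts)
print(sum(v == "A" for v in assignment.values()), "contacts on A out of", len(assignment))
```

Randomizing within each inbox means a single inbox with a poor sender reputation drags both variants down equally instead of biasing one.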

If you are choosing between native sequencer tests and dedicated tooling, our Apollo vs Cognism comparison covers the data-side tradeoffs that affect how clean your test cohorts are.

Puzzle Inbox and reply attribution

If you split A/B by inbox rather than by contact, attributing replies back to variant gets messy fast. A unified inbox like Puzzle Inbox keeps reply threads tied to the originating sequence variant, which makes post-test analysis honest.

Operator takeaway: Use 2,000 sends per variant as your floor for reply-rate tests. Pre-commit the sample size. Never peek. If you cannot power the test, ship the better-looking variant and move on.

Ready to start sending?

Puzzle Inbox provisions pre-warmed Google Workspace and Outlook 365 cold email inboxes ready to send within 24-72 hours. See the pricing page, the how-it-works walkthrough, or our process page for full details.