I split-tested personalized first lines against generic openers across 8,000 cold emails, and the result was not what I expected
personalize_test_leo · 2026-07-03 · 870 views
Everyone in this community says personalization is the answer. So I ran a real split test to check whether that is actually true for my ICP.
The setup. Same list of 8,000 B2B SaaS SDR leaders. Split 50/50. Same subject line for both variants. Variant A: a manually researched personalized first line for every prospect, drawn from a recent LinkedIn post, a company milestone, or something specific to their role. Variant B: a sharp, specific opener that named the ICP pain directly with no personalization. Same email body after the first line in both variants.
The result. Variant A (personalized): 3.4 percent reply rate. Variant B (specific generic): 2.9 percent reply rate. Personalization won, but only by 0.5 percentage points, which is less than the conventional wisdom suggests.
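A quick way to sanity-check a gap like this is a two-proportion z-test. The sketch below assumes 4,000 sends per variant (the 50/50 split of 8,000) and back-computes reply counts from the rounded rates, so the exact counts are an assumption rather than reported data. The function name `two_proportion_z` is illustrative.

```python
from math import sqrt, erfc

def two_proportion_z(replies_a, sends_a, replies_b, sends_b):
    """Two-sided two-proportion z-test using a pooled standard error."""
    p_a = replies_a / sends_a
    p_b = replies_b / sends_b
    pooled = (replies_a + replies_b) / (sends_a + sends_b)
    se = sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    z = (p_a - p_b) / se
    # Two-sided p-value under the normal approximation
    p_value = erfc(abs(z) / sqrt(2))
    return p_a - p_b, z, p_value

# Assumption: 4,000 sends per variant, counts derived from the rounded rates
sends = 4000
replies_a = round(0.034 * sends)  # personalized first line
replies_b = round(0.029 * sends)  # specific generic opener

diff, z, p = two_proportion_z(replies_a, sends, replies_b, sends)
relative_lift = diff / (replies_b / sends)

print(f"absolute lift: {diff * 100:.1f} percentage points")
print(f"relative lift: {relative_lift * 100:.0f}%")
print(f"z = {z:.2f}, two-sided p = {p:.2f}")
```

Under those assumptions the z-score comes out near 1.28 with a two-sided p-value around 0.20. The direction favors personalization, but a gap this size at 4,000 sends per arm could plausibly be noise, which is another reason not to overweight the raw difference.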
The part nobody talks about. Variant A took 6 times longer to build: the manual research, the LinkedIn scraping, the Clay enrichment and waterfall setup. When I account for time cost, the ROI on personalization at that scale is much lower than the raw reply rate difference implies.
For a high-value ICP where one meeting is worth $50,000 in pipeline, the 0.5 percentage point difference justifies the effort. For a volume-play ICP where you need hundreds of conversations to find deals, the math looks completely different.
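The time-cost argument can be sketched as a cost per extra reply. Only the reply rates, the 6x build time, and the $50,000 pipeline figure come from the test above. The 20 baseline build hours and the $50 hourly cost are hypothetical placeholders, so swap in your own numbers.

```python
def personalization_cost(sends, rate_a, rate_b, base_hours, time_multiplier, hourly_cost):
    """Extra replies and extra build cost of the personalized variant vs the generic one."""
    extra_replies = (rate_a - rate_b) * sends
    extra_hours = base_hours * (time_multiplier - 1)
    extra_cost = extra_hours * hourly_cost
    return extra_replies, extra_cost, extra_cost / extra_replies

def breakeven_meeting_rate(cost_per_extra_reply, pipeline_per_meeting):
    """Share of extra replies that must become meetings for pipeline to cover the cost."""
    return cost_per_extra_reply / pipeline_per_meeting

# From the test: 4,000 sends per variant, 3.4% vs 2.9% reply rate, 6x build time
# Hypothetical: 20 hours to build the generic variant, $50 per hour of SDR time
extra_replies, extra_cost, cost_per_reply = personalization_cost(
    sends=4000, rate_a=0.034, rate_b=0.029,
    base_hours=20, time_multiplier=6, hourly_cost=50,
)

# $50,000 is the high-value ICP pipeline-per-meeting figure from the post
needed = breakeven_meeting_rate(cost_per_reply, pipeline_per_meeting=50_000)

print(f"extra replies: {extra_replies:.0f}")
print(f"extra build cost: ${extra_cost:,.0f}")
print(f"cost per extra reply: ${cost_per_reply:,.0f}")
print(f"break-even reply-to-meeting rate: {needed:.2%}")
```

With these placeholder inputs the break-even rate is tiny for a $50,000-per-meeting ICP, and it climbs quickly as the value per meeting drops. Note that pipeline is not closed revenue, so a stricter version would also multiply by a close rate.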
What actually moved the needle more than personalization. The subject line. When I tested a question-format subject line against a statement subject line across both variants, the question format moved reply rates by 0.8 percentage points. More than the personalization difference.
Personalization is real. It is just not the highest-leverage variable in most campaigns. Get the subject line right, get the pain framing right, get your infrastructure clean. Then add personalization when you have the margin to do it well.
All 8,000 sends ran through PuzzleInbox Google Workspace inboxes on Instantly. Consistent infrastructure is the only way split test data actually means something.