How I automate cold email reply classification with GPT-4 and save 10 hours per week
automationdevs · 2026-04-19 · 1,280 views
I used to spend 2 hours per day manually classifying cold email replies into categories: interested, not interested, objection, unsubscribe, referral, out of office, miscellaneous. I built a GPT-4-based classifier that does it in real time and saves me roughly 10 hours per week. Sharing the setup.
The architecture. Sending platform (Instantly) webhook → cloud function (AWS Lambda or similar) → OpenAI API call → write classification to Airtable/database → route to appropriate team inbox or workflow.
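A minimal sketch of the cloud-function glue, assuming a hypothetical webhook payload shape (`reply_text`, `from_email`, `campaign_id` are my field names, not Instantly's documented schema); `classify()` and `save_record()` stand in for the OpenAI call and the Airtable write:

```python
import json

def extract_reply(event):
    """Pull reply fields out of the webhook event body (payload shape is assumed)."""
    body = json.loads(event["body"])
    return {
        "text": body["reply_text"],
        "sender": body["from_email"],
        "campaign": body.get("campaign_id"),
    }

def handler(event, context):
    """AWS Lambda entry point: webhook in, classified + routed reply out."""
    reply = extract_reply(event)
    # classify() wraps the GPT-4 call and save_record() writes to
    # Airtable -- both are defined elsewhere in this sketch.
    result = classify(reply["text"])
    save_record(reply, result)
    return {"statusCode": 200, "body": json.dumps(result)}
```

Keeping the parsing in its own function makes it easy to unit-test against captured webhook payloads before wiring up the live endpoint.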
The prompt. System prompt tells GPT-4 to categorize cold email replies into 7 categories with specific criteria for each. I give it 2-3 example replies per category. The prompt is roughly 800 tokens total.
User prompt is just the reply content. GPT-4 returns a JSON with category, confidence score, and a one-line summary.
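The call itself can look something like this, assuming the official `openai` Python SDK and a JSON-mode-capable model; the system prompt is abbreviated here — the real one carries the seven category definitions plus 2-3 examples each:

```python
import json

CATEGORIES = ["interested", "objection", "not_now", "referral",
              "unsubscribe", "out_of_office", "not_interested"]

SYSTEM_PROMPT = (
    "Classify the cold email reply into exactly one of: "
    + ", ".join(CATEGORIES)
    + '. Respond as JSON: {"category": ..., "confidence": 0-1, "summary": ...}'
)

def classify(reply_text: str) -> dict:
    from openai import OpenAI  # requires `pip install openai`
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # any model that supports JSON mode
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": reply_text},
        ],
    )
    return parse_result(resp.choices[0].message.content)

def parse_result(raw: str) -> dict:
    """Validate the model's JSON so bad output fails loudly, not silently."""
    result = json.loads(raw)
    if result.get("category") not in CATEGORIES:
        raise ValueError(f"unknown category: {result.get('category')!r}")
    return result
```

Validating the category against a fixed list matters: without it, an off-script model response would silently create a routing bucket nothing downstream knows about.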
The categories. 1) Interested — positive signal, wants to talk. Route to sales immediately. 2) Objection — interested but has concerns. Route to nurture sequence. 3) Not now — not interested currently but open later. Schedule follow-up in 60 days. 4) Referral — "wrong person, contact X instead". Capture referral data, start sequence to X. 5) Unsubscribe — wants off list. Auto-suppress across all campaigns. 6) Out of office — temporary. Reschedule next email for return date. 7) Not interested / negative — permanent no. Suppress from this campaign.
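The routing rules above can be expressed as plain data so the webhook handler stays a simple lookup; the action names here are illustrative, not a fixed API:

```python
# Category -> action table mirroring the seven routing rules.
ROUTING = {
    "interested":     {"action": "notify_sales"},
    "objection":      {"action": "nurture_sequence"},
    "not_now":        {"action": "schedule_followup", "delay_days": 60},
    "referral":       {"action": "capture_referral_and_start_sequence"},
    "unsubscribe":    {"action": "suppress_all_campaigns"},
    "out_of_office":  {"action": "reschedule_to_return_date"},
    "not_interested": {"action": "suppress_campaign"},
}

def route(category: str) -> dict:
    # Unknown categories fall back to manual review rather than guessing.
    return ROUTING.get(category, {"action": "manual_review"})
```

Keeping routing as data rather than an if/else chain means adding a category later is a one-line change.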
The accuracy. 94% classification accuracy against a manually labeled test set of 500 replies. The remaining 6% are mostly edge cases (ambiguous replies, ESL prospects with unusual phrasing). That's far better than rule-based keyword classifiers, which struggle with anything beyond matching "unsubscribe".
The cost. Roughly $0.02 per reply classified at GPT-4 pricing. I process ~500 replies per week. Total: $10/week, $40/month. Trivial compared to the 40 hours of labor saved per month.
Edge cases. Multi-intent replies are hard ("I'm interested but wrong person, try John"). I have GPT-4 return multi-intent classifications with a primary and secondary category. A human reviews these weekly.
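A sketch of the multi-intent handling: the model returns a primary and optional secondary category, and anything carrying a secondary category lands in the weekly review queue. Field names are my own, not a fixed schema:

```python
def needs_review(result: dict) -> bool:
    """Any reply with more than one detected intent gets a human look."""
    return result.get("secondary_category") is not None

# Example of what the model might return for the reply quoted above.
example = {
    "primary_category": "interested",
    "secondary_category": "referral",
    "summary": "Interested, but says John owns this; asks us to contact him.",
}
```

Routing on the primary category while queueing the secondary for review keeps the automation moving without losing the referral signal.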
The human layer. 6% error rate means 30 misclassified replies per week. I spot-check categorizations weekly — takes 30 minutes. Flag any obvious errors for retraining. Adjust the prompt quarterly based on error patterns.
What this enables. Real-time routing: interested replies hit a Slack channel within 60 seconds of landing. Sales team can respond while the prospect is still in their inbox. Dramatically improves meeting booking rate (replies responded to within 1 hour convert 3x higher than replies after 24 hours).
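The Slack alert is a standard incoming-webhook POST; this stdlib-only sketch assumes a placeholder webhook URL you'd create in your own workspace:

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def format_alert(reply: dict, result: dict) -> str:
    """Build the message sales sees in the channel."""
    return (f":fire: Interested reply from {reply['sender']}\n"
            f"> {result['summary']}")

def notify_slack(text: str) -> None:
    """POST to a Slack incoming webhook (payload is {"text": ...})."""
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

Calling `notify_slack(format_alert(reply, result))` from the webhook handler for the "interested" branch is all it takes to hit the 60-second window.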
Tools worth considering instead of DIY. Some sending platforms now have built-in AI classification (Instantly Unibox, Smartlead reply management). Quality is getting close to custom GPT-4 setups. If you are not technical, just use the platform feature. DIY makes sense if you need custom routing workflows or multi-platform unified classification.