The Prompt Optimization Playbook: Which Method Wins Where

By Rishav · March 8, 2026 · 14 min read

We win 4 out of 5 benchmarks. The previous state of the art isn't close.

We built two prompt optimization methods — PromptGrad (gradient-based) and ContraPrompt (contrastive learning) — and tested them head-to-head against GEPA, the previous state of the art, across five benchmarks. On every reasoning benchmark in the study, our methods beat GEPA. The only task where GEPA holds a marginal lead is MMLU-Pro — a knowledge-recall benchmark where no optimizer can help because you can't teach a model facts through prompting. The interesting part: PromptGrad and ContraPrompt win on completely different tasks with zero overlap. PromptGrad dominates competition math and diverse reasoning. ContraPrompt dominates multi-hop QA and graduate science. Neither touches the other's territory. There is no universal best prompt optimizer. The right choice depends on your task. Getting this decision right matters more than any single technique — because the wrong optimizer can actively hurt performance (ContraPrompt on AIME: -15%). This post gives you the framework to make that call.

Executive Summary
We win 4 out of 5 benchmarks against GEPA, the previous state of the art. PromptGrad at +18%, ContraPrompt at +14% — on every reasoning task in the study, our methods come out ahead

GEPA's only win is MMLU-Pro, a knowledge-recall task where no optimizer can help — you can't teach a model facts through prompting

PromptGrad and ContraPrompt dominate different tasks — PromptGrad where failures are systematic and low-variance, ContraPrompt where failures are high-variance with self-correction potential

Prompt optimization can't fix knowledge gaps, but it delivers 3–29% gains on reasoning tasks

The practical impact: optimized small models close 50–70% of the performance gap with models costing 30–60× more

The Full Results

Five benchmarks. Three optimizers. We win four.

Benchmark	Task Type	Baseline	GEPA	PromptGrad	ContraPrompt	Winner	Delta vs GEPA

HotPotQA	Multi-hop reasoning	27.0%	36.8%	45.8%	47.6%	ContraPrompt	+10.8 pts
AIME 2025	Competition math	30.0%	43.3%	46.7%	36.7%	PromptGrad	+3.4 pts

BBH	27 diverse tasks	26.1%	87.6%	89.8%	88.3%	PromptGrad	+2.2 pts
GPQA Diamond	Graduate science	58.1%	64.2%	66.9%	68.2%	ContraPrompt	+4.0 pts

MMLU-Pro	Professional knowledge	81.2%	80.2%	79.6%	80.0%	GEPA	-0.2 pts

Reading the deltas: HotPotQA's +10.8 points is a 29% relative improvement over what was previously the best method available. AIME's +3.4 points is on competition math where every point represents problems that stumped most human contestants. BBH's gap looks small until you realize GEPA already pushed this from a 26.1% baseline — we squeezed more out of an already-optimized benchmark. GPQA Diamond's +4 points is on PhD-qualifying science questions. Benchmark comparison across all five tasks — our methods beat GEPA on 4 out of 5, with winners starred

Benchmark comparison across all five tasks — our methods beat GEPA on 4 out of 5, with winners starred

Normalized Performance

Raw scores are misleading — a 3% gain where baseline is 87% means something very different from a 3% gain where baseline is 27%. Normalizing to a 0–1 scale per benchmark gives the fair comparison:

Method	Normalized Score	vs. GEPA

Baseline	0.200	—
GEPA (state of the art)	0.641	—

ContraPrompt	0.730	+14%
PromptGrad	0.755	+18%

Both methods significantly outperform the current state of the art. Together, they cover more ground than either alone.

Three Philosophies of Optimization

Each method embodies a fundamentally different theory of how to improve prompts. Understanding the philosophy tells you when to deploy it.

GEPA: The Explorer

"Generate many variants. Test them all. Keep the best." GEPA treats prompt optimization as evolution. It creates a population of prompt variants, evaluates them on training data, and selects survivors through Pareto optimization — balancing performance against prompt simplicity. A strong reflection model guides mutations. It's like A/B testing at massive scale. You try hundreds of options and pick the winner. You don't know why it won, but you know it did. Strength: Reliable, hard to catastrophically break. Good default when you don't understand your failure modes. Limitation: Black box. Can't discover targeted reasoning fixes. Hits a ceiling on tasks requiring deep, systematic reasoning improvements.

PromptGrad: The Diagnostician

"Analyze specific failures. Extract targeted fixes. Validate each one independently." PromptGrad treats optimization like evidence-based medicine. It examines symptoms (failures), proposes treatments (rules), runs a controlled trial on each treatment (per-rule validation on 15 held-out examples), and only prescribes what's statistically proven to work. It's like a doctor who tests every intervention before adding it to your treatment plan. Slower and more expensive, but you trust the results — and when something goes wrong, you know exactly which rule to adjust. Strength: Explainable results, statistical guarantees against overfitting, comprehensive coverage through stratified sampling. The most reliable choice when you need auditability. Limitation: Higher compute cost. May miss non-obvious improvements that broader evolutionary search would find. Requires diverse failures to extract good rules. The evidence: Disabling rule validation alone drops performance by 14%. This single mechanism — testing every proposed rule on 15 held-out examples at p<0.05 significance before accepting it — is what separates evidence-based optimization from evolutionary guesswork. On MATH-500 where the baseline was already 82%, the validation gate rejected 27 out of 29 proposed rules in the first epoch because none measurably helped. That's the system refusing to add noise to an already-good prompt.

ContraPrompt: The Learning Coach

"Watch the model fail, then succeed. Extract what changed." ContraPrompt is like a coach studying game film — not just the losses, but the specific moments where the team fell behind and came back. By comparing failure and success on the same problem, it isolates the exact reasoning strategies that were missing. It's like studying comeback wins. The patterns that turn losses into victories are the most valuable plays in the book. Strength: Discovers self-correction strategies no other method can find. Naturally focuses on the highest-leverage improvements. Soft validation captures compound effects. Limitation: Entirely dependent on retry success rate. If the model can't self-correct with a nudge, there's nothing to mine. Underperforms on tasks where failures are fundamental capability limits. The evidence: Removing contrastive mining entirely drops HotPotQA performance by 37%. No other component comes close. Progressive rule accumulation — letting rules build on each other instead of resetting — accounts for 9%. Soft validation (keeping rules unless they actively hurt, rather than demanding each independently prove itself) is worth 6 to 8 points. At our default threshold we keep 28 rules and score 47.6%. Strict validation keeps only 14 rules and scores 41.3%. Individually neutral rules help in combination, and soft validation captures that. ContraPrompt ablation study — contrastive mining is the single most important component by far

ContraPrompt ablation study — contrastive mining is the single most important component by far

Rules Compound — GEPA Flattens

The ablation data shows which components matter. The iteration data shows how they compound. PromptGrad on HotPotQA, iteration by iteration:

Iteration	Score	Rules	Notes

0 (start)	27.0%	0	Baseline — no optimization
1	38.2%	12	Already passes GEPA (36.8%)

2	42.8%	23	+4.6 pts from 11 new rules
3	45.1%	31	Gains tapering but still climbing

4	46.8%	35
5	47.6%	36	Final — 10.8 pts above GEPA

GEPA's final score on the same benchmark: 36.8%. We pass GEPA after a single iteration and keep climbing. By iteration 5, we've accumulated 10.8 points of advantage.

ContraPrompt tells the same story from the mining side: 46% of examples yield contrastive pairs in iteration 1, dropping to 36%, 28%, 18%, 10% by iteration 5. The optimizer exhausts its own signal because the model keeps getting better — there are fewer failures left to mine.

PromptGrad learning curve vs GEPA — passes GEPA after one iteration, shaded area shows accumulated advantage

The Decision Framework

Question 1: Is your task reasoning-bound or knowledge-bound?

This is the first and most important filter. Knowledge-bound tasks (factual QA, trivia, domain knowledge recall): Prompt optimization delivers minimal gains. Across all three methods, MMLU-Pro showed less than 1% movement. You can't teach a model facts through prompting. Invest in retrieval (RAG) instead. Reasoning-bound tasks (multi-step analysis, evidence synthesis, mathematical reasoning, complex Q&A): Prompt optimization delivers dramatic gains — 3% to 29% depending on method and task. Continue to Question 2.

Question 2: How variable are your model's failures?

This is the core diagnostic. The key insight: failure variance determines which optimizer can extract a useful learning signal. Across all five benchmarks, the correlation between retry success rate and ContraPrompt's improvement over GEPA is r=0.90, p<0.05. It's the single strongest predictor of which optimizer wins — turning this qualitative recommendation into a quantitative one. Run a quick retry probe — take 20 failed examples, re-run them with a nudge like "Think more carefully," and count successes.

Meaningful retry success (high-variance failures) → ContraPrompt. The model sometimes succeeds and sometimes fails on similar inputs. ContraPrompt leverages exactly this contrast — comparing failures and successes on the same problem to isolate the missing reasoning strategies. This is why it dominates on retrieval-augmented QA (HotPotQA), multi-label domain classification (GDPR-Bench), and expert-level reasoning (GPQA). Its grouped synthesis writes a targeted rule for each failure category.

Retry success near zero (low-variance, systematic failures) → PromptGrad. The model fails consistently and predictably regardless of retry. PromptGrad doesn't need successful retries — consistent failure patterns are sufficient for its gradient computation. Its strict per-rule validation acts as a strong regularizer, which is why it accepted only 2 of 54 candidate rules on MATH-500 while still improving performance.

Borderline → Test both. The cost of running both (~$1 total) is negligible compared to the deployment value.

Question 3: What does your baseline look like?

High baseline already (>75%) → PromptGrad. When the model is near its capability ceiling, systematic failures dominate and PromptGrad's strict validation prevents overfitting to noise.

Multiple distinct error categories → ContraPrompt. If you can categorize failures into groups (e.g., evidence retrieval errors vs. reasoning chain errors), ContraPrompt's grouped synthesis writes a targeted rule for each category — higher leverage than broad rule extraction.

Diverse task mix (like BBH's 27 subtasks) → PromptGrad. Its stratified sampling and comprehensive per-rule validation are designed for breadth and auditability.

Unknown failure modes → Start with GEPA. Its broad evolutionary search will surface the landscape before you commit to a targeted approach.

Quick Reference

Your Situation	Recommended Optimizer

High-variance failures, model self-corrects on retry	ContraPrompt
Low-variance systematic failures, high baseline (>75%)	PromptGrad

Multiple distinct, categorizable error types	ContraPrompt
Diverse task mix, need explainable & auditable improvements	PromptGrad

New to optimization, want safe & reliable defaults	GEPA
Maximum performance, budget for both	Run both, take best per task

Knowledge/factual QA tasks	Skip optimization — invest in RAG

The Strategic Argument: Small Models Are the New Frontier

Prompt optimization fundamentally changes the economics of LLM deployment.

+79% performance at the same price point

25× more for only 6% more performance

Starting Point	Best Optimized	Relative Gain

27.0% (HotPotQA baseline)	47.6%	+76%
30.0% (AIME baseline)	46.7%	+56%

26.1% (BBH baseline)	89.8%	+244%
58.1% (GPQA baseline)	68.2%	+17%

The optimization is a one-time cost — approximately $0.50–1.00 per task. The optimized prompt then runs at small-model prices indefinitely. For an enterprise processing 1M queries/month:

Without optimization: Use Opus for quality → ~$50K/month

With optimization: Use optimized Haiku → ~$2K/month, retaining 70–80% of Opus quality

That's not a marginal improvement. It's a different business model.

Cost vs performance — our optimization moves Haiku up +79% at the same cost, while Opus costs 25x more for 6% more performance

The economics get even better when you consider data efficiency. Our methods reach near-peak performance with minimal training data — you don't need thousands of examples to see the gains.

$Near-peak performance with almost no data — our methods match or beat GEPA's full-budget score using a fraction of the training data$

What the Rules Look Like

ContraPrompt on HotPotQA:

"If guessing, cite specific evidence from the provided context."

"Track document source for each intermediate fact."

"Do not jump to conclusions with partial evidence — state what you know and what's missing."

PromptGrad on BBH:

"For yes/no questions, output exactly Yes or No with no additional text."

"For sequence tracking problems, maintain an explicit ordered list of state changes."

PromptGrad on AIME:

"Use consistent LaTeX notation throughout."

"After computing a final answer, verify it satisfies all stated constraints."

"When modular arithmetic is involved, track remainders explicitly at each step."

Each rule was extracted from real failures and validated on held-out data. If a rule stops helping in production, you identify and remove it — because every rule has a name and a test record.

Side-by-side comparison — GEPA produces generic strategies with no validation, our methods produce specific IF-THEN rules each individually validated on held-out data

What We're Releasing

Artifact	What You Get

Optimized Prompt Libraries	Pre-optimized prompts for all 5 benchmarks × 3 methods. Copy-paste ready.
Benchmark Results	Full results tables, per-example scores, statistical analysis

Evaluation Harness	Scripts to reproduce every number in this series
Optimizer Implementations	PromptGrad and ContraPrompt source code

We're releasing artifacts and code for research and experimentation. Our production optimization service — with automated task analysis, cross-model optimization, and enterprise prompt management — is part of the VizopsAI platform.

The Bottom Line

14–18% over state of the art

under $1 per task

If you deploy LLMs at scale and you're not optimizing your prompts, you're leaving 20–75% of your model's capability on the table.

We built two different optimizers not because we couldn't pick one, but because the problem demanded it. Different tasks need different approaches. The right toolkit — not the right tool — is what closes the gap between small-model economics and big-model performance.

This is Part 3 of our Prompt Optimization series:

PromptGrad: What If You Could Apply Gradient Descent to English?

ContraPrompt: When AI Learns from Its Own Mistakes

The Prompt Optimization Playbook ← You are here

Rishav is a Founding AI Engineer at VizopsAI. Before VizopsAI, he researched reinforcement learning at Mila and built ML systems at Wells Fargo and Pixxel. He holds a B.Tech from BITS Pilani.

VizopsAI builds the secure runtime for enterprise AI applications. vizops.ai