We Achieved 29% Better Reasoning by Teaching LLMs to Learn from Their Own Failures
By Rishav · February 22, 2026 · 10 min read
This is Part 2 of our prompt optimization series. In Part 1, we introduced PromptGrad — a gradient-descent-inspired optimizer that beat the state of the art by 18% across five benchmarks. In this post, we tackle a different signal: when an AI stumbles on a problem but gets it right on a second try, something changed in its reasoning — and that difference is a goldmine for optimization.
Executive Summary
> Introducing VizopsAI ContraPrompt: a system that watches AI fail then succeed on the same problem, extracts what changed, and bakes those patterns into the prompt permanently. No fine-tuning required.

- +29% on multi-hop reasoning, +19% normalized across six benchmarks — by mining the gap between a model's first-pass failures and its self-corrections.
- Real-world compliance gains: +18% on GDPR detection — contrastive optimization transfers from academic benchmarks to regulatory tasks.
- Cost arbitrage: Optimized prompts on a $0.25/1M-token model approach the quality of models costing 30–60× more, with fully auditable, human-readable rules.
- Free diagnostic: Test 20 failed examples with a retry nudge. If >30% self-correct, ContraPrompt will deliver strong gains on your task.
Models know more than they show. We built a system to capture what they're holding back.
Ask Claude a multi-hop question — one that requires connecting facts across multiple documents — and it stumbles. It grabs the first relevant fact and jumps to a conclusion, skipping the verification step.
Now give it a second chance. Just say: "Your answer was incorrect. Think more carefully."
30–40% of the time, it nails it. Same model, same question, same context. The only thing that changed was a generic nudge. The knowledge was there the entire time — the model just didn't deploy the right reasoning strategy on the first try.
Every developer who's worked with LLMs has noticed this. What nobody had done — until now — was systematically mine that gap between failure and success, extract the patterns that explain the difference, and bake them permanently into the prompt.
We're calling it Contrastive Prompting (or ContraPrompt for short).
Let's walk through an example
Here's an actual HotPotQA problem from our experiments. The question requires connecting information across two Wikipedia paragraphs:
Question: "Are the directors of 'Jaws' and 'The Color Purple' the same person?"
Attempt 1 (Failed — score 0.2):
"Jaws was directed by Steven Spielberg. The Color Purple is a novel by Alice Walker that was adapted into a film. Based on the information provided, I cannot confirm they share the same director."
The model found one fact (Spielberg directed Jaws) but hedged on the second instead of verifying it in the context. It had the answer — Spielberg directed both — but didn't commit to checking.
Attempt 2 (After nudge — score 1.0):
"Jaws (1975) was directed by Steven Spielberg [Paragraph 1]. The Color Purple (1985) was also directed by Steven Spielberg [Paragraph 3]. Yes, both films were directed by the same person: Steven Spielberg."
Same model. Same context. The difference: on the second attempt, it verified each fact against a specific paragraph before combining them.
ContraPrompt's extracted rule:
"When answering comparison questions, explicitly locate and cite the relevant fact for EACH entity before stating the comparison. Do not assume or hedge — verify against the source."
That rule, extracted from watching the model fail then succeed, improved performance across hundreds of similar questions. One contrastive pair → one rule → systematic improvement.
How ContraPrompt Works
The method has three phases, each building on the last.
Phase 1: Multi-Attempt Solving
For each training example, ContraPrompt gives the model up to three shots:
- Attempt 1: Solve with the current prompt. No help.
- Attempt 2: If it failed — retry with generic feedback: "Your previous answer was incorrect. Think more carefully."
- Attempt 3: If it still failed — retry with specific feedback about the error type: "Your answer had a reasoning error in evidence synthesis. Address this."
The feedback is deliberately calibrated. Attempt 2's vague nudge forces the model to reconsider its entire approach. Attempt 3's targeted hint focuses it on the specific weakness.
This generates a rich dataset: for each problem, we have the model's worst attempt, its best attempt, and everything in between.
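In code, the attempt loop might look something like this. This is a minimal sketch, not the production implementation — `call_model`, `score`, and `classify_error` are placeholders for your own inference, grading, and error-tagging functions, and `example.question` is an assumed field:

```python
# Sketch of Phase 1: up to three attempts with escalating feedback.
# All callables and the `example.question` field are illustrative assumptions.
GENERIC_NUDGE = "Your previous answer was incorrect. Think more carefully."

def multi_attempt_solve(prompt, example, call_model, score, classify_error):
    """Collect up to three attempts per example, escalating feedback on failure."""
    attempts = []
    answer = call_model(prompt, example.question)  # Attempt 1: no help
    attempts.append((answer, score(answer, example)))

    if attempts[-1][1] < 1.0:  # Attempt 2: generic nudge
        answer = call_model(prompt, example.question, feedback=GENERIC_NUDGE)
        attempts.append((answer, score(answer, example)))

    if attempts[-1][1] < 1.0:  # Attempt 3: error-type-specific hint
        error_type = classify_error(attempts[-1][0], example)  # e.g. "evidence synthesis"
        hint = f"Your answer had a reasoning error in {error_type}. Address this."
        answer = call_model(prompt, example.question, feedback=hint)
        attempts.append((answer, score(answer, example)))

    return attempts  # worst-to-best trajectory for this example
```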
Phase 2: Contrastive Mining
Now the key step: ContraPrompt scans the attempts and identifies contrastive pairs: problems where the model failed initially but succeeded on retry.
These pairs are gold. When you compare a failed attempt and a successful attempt on the exact same problem from the exact same model, the only variable that changed is the reasoning strategy. The difference between attempts tells you precisely what was missing.
ContraPrompt mines at two levels:
- Error correction pairs (failed → succeeded): What fixed a total failure? These reveal missing reasoning steps — like the "verify each fact" rule above.
- Refinement pairs (partially correct → fully correct): What improved an already decent answer? These reveal subtle optimizations — like better output formatting or more precise entity matching.
A stronger analysis model (Sonnet-class) examines each pair and extracts structured rules describing the pattern. This separation — using a stronger model for analysis while keeping the task model fixed — is critical. You want the smartest analyst reviewing the failures, even if a cheaper model runs the actual task.
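A sketch of what the mining and extraction steps could look like, continuing from the attempt trajectories above. The 0.5 partial-credit cutoff and the extraction prompt are our illustrative assumptions, not ContraPrompt's production values:

```python
# Sketch of Phase 2: pair each example's worst attempt with its best,
# then ask a stronger analysis model to name what changed.
def mine_contrastive_pairs(attempts_by_example):
    """Build contrastive pairs from examples where retries improved the score."""
    pairs = []
    for example, attempts in attempts_by_example.items():
        scores = [s for _, s in attempts]
        worst, best = min(scores), max(scores)
        if best <= worst:
            continue  # no improvement across retries -> no contrastive signal
        failed = attempts[scores.index(worst)]
        succeeded = attempts[scores.index(best)]
        # Error-correction pair if the worst attempt failed outright,
        # refinement pair if it was only partially correct (threshold assumed).
        kind = "error_correction" if worst < 0.5 else "refinement"
        pairs.append((example, failed, succeeded, kind))
    return pairs

def extract_rule(pair, analysis_model):
    """Ask the (stronger) analysis model for one reusable rule per pair."""
    example, (bad_answer, _), (good_answer, _), kind = pair
    return analysis_model(
        "A model failed, then succeeded, on the same problem.\n"
        f"Question: {example.question}\n"
        f"Failed attempt: {bad_answer}\n"
        f"Successful attempt: {good_answer}\n"
        f"Pair type: {kind}\n"
        "State, as one reusable rule, what changed in the reasoning."
    )
```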
Phase 3: Soft Validation and Progressive Accumulation
Extracted rules are validated — but with a deliberately permissive threshold. ContraPrompt only rejects rules that actively hurt performance (delta < -2%). Rules that seem neutral? They stay.
This sounds reckless. It's not. Here's why.
Prompt rules aren't independent — they interact. A rule that does nothing on its own might combine with another rule to produce a significant improvement. Strict validation (requiring each rule to prove its worth individually) misses these synergies. Our ablation data is clear: soft validation outperforms strict validation by 7% on average.
The analogy: strict validation is like only hiring people who've already proven themselves in the exact role. Soft validation also hires people who haven't hurt anything — and some of them become stars once they're on the right team.
Validated rules accumulate in a progressive rule bank that persists across optimization iterations. Rules that stop helping are periodically pruned. The bank self-maintains, building compound improvements over time — each iteration's discoveries combining with previous ones.
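Here's a minimal sketch of how soft validation and the rule bank could fit together. The -2% rejection threshold comes from above; the data structures, delta bookkeeping, and pruning cadence are assumptions:

```python
# Sketch of Phase 3: permissive validation in, periodic pruning out.
HURT_THRESHOLD = -0.02  # reject only rules whose measured delta is below -2%

class RuleBank:
    """Progressive rule bank that persists across optimization iterations."""

    def __init__(self):
        self.rules = {}  # rule text -> last measured validation-set delta

    def consider(self, rule, delta):
        # Soft validation: a rule stays unless it actively hurts performance.
        if delta >= HURT_THRESHOLD:
            self.rules[rule] = delta

    def prune(self, fresh_deltas):
        # Periodically re-measure and drop rules that have stopped helping.
        for rule, delta in fresh_deltas.items():
            if rule in self.rules and delta < HURT_THRESHOLD:
                del self.rules[rule]

    def render(self):
        # Accumulated rules are appended to the task prompt as plain,
        # human-readable bullets — which is what keeps them auditable.
        return "\n".join(f"- {rule}" for rule in self.rules)
```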
Results: Dominant on Deep Reasoning
All results use Claude Haiku 4.5 — a small, cost-efficient model. That's deliberate: we're optimizing the model you can actually afford to run at scale.
| Benchmark | Task Type | GEPA | PromptGrad | ContraPrompt | vs. GEPA |
|---|---|---|---|---|---|
| HotPotQA | Multi-hop reasoning | 36.8% | 45.8% | 47.6% | +29% |
| GPQA Diamond | Graduate science | 64.2% | 66.9% | 68.2% | +6% |
| BBH | 27 reasoning tasks | 87.6% | 89.8% | 88.3% | +1% |
| MMLU-Pro | Professional knowledge | 80.2% | 79.6% | 80.0% | ~0% |
| AIME 2025 | Competition math | 43.3% | 46.7% | 36.7% | -15% |
| GDPR-Bench Android | GDPR compliance detection | ~8% | ~5% | ~26% | +18.2% |
Performance comparison across six benchmarks. ContraPrompt shows dominant gains on reasoning-heavy tasks (HotPotQA +29%, GDPR-Bench +18.2%).
Normalized: 0.741 vs. GEPA's 0.624 (0.741 / 0.624 ≈ 1.19) — a 19% relative improvement across six benchmarks.
+29% on Multi-Hop Reasoning
HotPotQA's multi-hop QA is exactly the kind of task where models self-correct on retry. The model has the facts. It just doesn't assemble them correctly on the first pass. A nudge triggers more systematic evidence gathering. ContraPrompt captures what "more systematic" looks like and makes it permanent.
-15% on Competition Math
AIME 2025 is the only benchmark where ContraPrompt underperforms.
Why it fails: Competition math is hard. When Haiku 4.5 can't solve an AIME problem on the first attempt, it almost never succeeds on retry either. The retry success rate is near zero, which means ContraPrompt has no contrastive pairs to mine. Without contrastive signal, the method falls back to basic failure analysis — which is weaker than GEPA's evolutionary search.
Why AIME is the exception, not the rule: Across every other benchmark — including GDPR-Bench Android, a real-world compliance task far removed from academic reasoning — ContraPrompt matches or dominates. The GDPR-Bench result is especially telling: contrastive mining generalizes beyond multi-hop QA to regulatory analysis, confirming that the method works wherever models can self-correct on retry.
The practical diagnostic: ContraPrompt's effectiveness maps directly to retry success rate. This gives you a free diagnostic before you invest in optimization:
| Retry Success Rate | ContraPrompt Likely Outcome |
|---|---|
| >30% | Strong improvement (+10–30%) |
| 15–30% | Moderate improvement (+3–10%) |
| <15% | Use PromptGrad or GEPA instead |
Test 20 failed examples with a generic nudge. Count how many succeed. You'll know in minutes whether ContraPrompt is the right tool.
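Here's a minimal version of that diagnostic as a script — again, `call_model`, `score`, and the `question` field are stand-ins for your own stack; the thresholds mirror the table:

```python
# Quick diagnostic: what fraction of previously failed examples
# self-correct when given a generic retry nudge?
GENERIC_NUDGE = "Your previous answer was incorrect. Think more carefully."

def retry_success_rate(failed_examples, prompt, call_model, score, n=20):
    """Fraction of failed examples that score full marks on a nudged retry."""
    sample = failed_examples[:n]
    recovered = sum(
        1 for ex in sample
        if score(call_model(prompt, ex.question, feedback=GENERIC_NUDGE), ex) >= 1.0
    )
    return recovered / len(sample)

# rate = retry_success_rate(failures, task_prompt, call_model, score)
# rate > 0.30      -> strong fit for ContraPrompt (+10-30% expected)
# 0.15 <= rate <= 0.30 -> moderate fit (+3-10%)
# rate < 0.15      -> use PromptGrad or GEPA instead
```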
Why This Matters
1. ContraPrompt Productizes a Universal Developer Intuition
"Just ask it again" is the most common workaround in LLM development. Everyone does it. Nobody had turned it into a systematic optimization pipeline.
ContraPrompt takes a manual, per-query hack and transforms it into an automated system that extracts, validates, and deploys self-correction patterns at scale. The result: models that get it right on the first pass, without runtime retries.
2. It Targets the Highest-Value Failure Mode
The tasks where ContraPrompt shines — multi-hop reasoning, graduate-level science, complex analytical work — are exactly the tasks enterprises pay the most for. Legal document review. Medical literature synthesis. Financial due diligence. Multi-source intelligence analysis.
The GDPR-Bench Android result (+18.2%) makes this concrete: detecting GDPR compliance violations in Android app privacy policies is a real-world regulatory task — not an academic exercise. ContraPrompt's contrastive mining discovers the same verification patterns that human compliance reviewers learn through experience. If your enterprise needs reliable regulatory, legal, or policy analysis, this is the method that generalizes from academic reasoning benchmarks to actual compliance workloads.
These are tasks where the model can do it but doesn't always do it. That inconsistency is the most expensive problem in production LLM deployment, and it's precisely what ContraPrompt eliminates.
ContraPrompt vs. PromptGrad: Complementary, Not Competing
If you read our PromptGrad post, the natural question is: which one do I use?
They solve different problems:
| | PromptGrad | ContraPrompt |
|---|---|---|
| Signal source | "What went wrong?" | "What changed when it went right?" |
| Best when | Diverse failure modes, systematic errors | Model can self-correct, deep reasoning |
| Biggest win | AIME +8%, BBH +3% | HotPotQA +29%, GDPR-Bench +18.2%, GPQA +6% |
| Validation style | Strict (rule must help) | Soft (rule must not hurt) |
| Fails on | Knowledge-bound tasks | Tasks where retries don't help |
They never win on the same benchmark. That's not a coincidence — it's evidence that they capture fundamentally different optimization signals. The full comparison and decision framework is in our final post in this series.
Rishav is a Founding AI Engineer at VizopsAI. He specializes in reinforcement learning and prompt optimization, with research experience at Mila, Wells Fargo, and Pixxel. He holds a B.Tech from BITS Pilani.
VizopsAI builds the secure runtime for enterprise AI applications. vizops.ai