We Achieved 29% Better Reasoning by Teaching LLMs to Learn from Their Own Failures

By Rishav · February 22, 2026 · 10 min read
This is Part 2 of our prompt optimization series. In Part 1, we introduced PromptGrad — a gradient-descent-inspired optimizer that beat the state of the art by 18% across five benchmarks. In this post, we tackle a different signal: when an AI stumbles on a problem but gets it right on a second try, something changed in its reasoning — and that difference is a goldmine for optimization.

Executive Summary

> Introducing VizopsAI ContraPrompt: a system that watches AI fail then succeed on the same problem, extracts what changed, and bakes those patterns into the prompt permanently. No fine-tuning required.
>
> - +29% on multi-hop reasoning, +19% normalized across six benchmarks, by mining the gap between a model's first-pass failures and its self-corrections.

Models know more than they show. We built a system to capture what they're holding back.

Ask Claude a multi-hop question, one that requires connecting facts across multiple documents, and it often stumbles: it grabs the first relevant fact and jumps to a conclusion, skipping the verification step. Now give it a second chance. Just say: "Your answer was incorrect. Think more carefully." Roughly 30–40% of the time, it nails it. Same model, same question, same context. The only thing that changed was a generic nudge. The knowledge was there the entire time; the model just didn't deploy the right reasoning strategy on the first try.

Every developer who's worked with LLMs has noticed this. What nobody had done until now was systematically mine that gap between failure and success, extract the patterns that explain the difference, and bake them permanently into the prompt. We're calling it Contrastive Prompting (or ContraPrompt for short).
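To make the retry mechanic concrete, here is a minimal sketch of the fail-then-nudge loop we're describing. The `call_model` and `score_answer` helpers are hypothetical stand-ins for whatever LLM client and grader you already use; they are not part of any released ContraPrompt API.

```python
# Minimal sketch of the fail-then-nudge loop described above.
# `call_model` and `score_answer` are hypothetical placeholders.

NUDGE = "Your answer was incorrect. Think more carefully."

def solve_with_nudge(question, context, call_model, score_answer, threshold=0.7):
    """Return (answer, attempts). Retries once with a generic nudge."""
    messages = [{"role": "user", "content": f"{context}\n\nQuestion: {question}"}]
    first = call_model(messages)
    if score_answer(first, question) >= threshold:
        return first, 1

    # Same model, same context; the only change is the generic nudge.
    messages += [
        {"role": "assistant", "content": first},
        {"role": "user", "content": NUDGE},
    ]
    second = call_model(messages)
    return second, 2
```

The pair (first, second) from a run where the retry succeeds is exactly the raw material the rest of the method mines.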

Let's walk through an example

Here's an actual HotPotQA problem from our experiments. The question requires connecting information across two Wikipedia paragraphs.

Question: "Are the directors of 'Jaws' and 'The Color Purple' the same person?"

Attempt 1 (Failed — score 0.2):
"Jaws was directed by Steven Spielberg. The Color Purple is a novel by Alice Walker that was adapted into a film. Based on the information provided, I cannot confirm they share the same director."

The model found one fact (Spielberg directed Jaws) but hedged on the second instead of verifying it in the context. It had the answer — Spielberg directed both — but didn't commit to checking.

Attempt 2 (After nudge — score 1.0):
"Jaws (1975) was directed by Steven Spielberg [Paragraph 1]. The Color Purple (1985) was also directed by Steven Spielberg [Paragraph 3]. Yes, both films were directed by the same person: Steven Spielberg."

Same model. Same context. The difference: on the second attempt, it verified each fact against a specific paragraph before combining them.

ContraPrompt's extracted rule:
"When answering comparison questions, explicitly locate and cite the relevant fact for EACH entity before stating the comparison. Do not assume or hedge — verify against the source."

That rule, extracted from watching the model fail then succeed, improved performance across hundreds of similar questions. One contrastive pair → one rule → systematic improvement.
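To give a feel for the mechanics, here is a minimal sketch of how a contrastive pair could be turned into a rule and folded into the prompt. The extraction prompt wording and the `call_model` helper are illustrative assumptions, not the exact ContraPrompt implementation.

```python
# Illustrative sketch: extract a reusable rule from one contrastive pair
# and append it to the system prompt. `call_model` is a hypothetical
# stand-in for your LLM client; the extraction prompt is an assumption.

EXTRACTION_PROMPT = """Below are two attempts at the same question.
The first failed; the second succeeded.
State, as a single imperative rule, what the second attempt did
differently that the first should have done.

Question: {question}
Failed attempt: {failed}
Successful attempt: {succeeded}

Rule:"""

def extract_rule(question, failed, succeeded, call_model):
    prompt = EXTRACTION_PROMPT.format(
        question=question, failed=failed, succeeded=succeeded
    )
    return call_model([{"role": "user", "content": prompt}]).strip()

def bake_into_prompt(system_prompt, rule):
    # Rules accumulate as a bulleted checklist at the end of the prompt.
    return f"{system_prompt}\n- {rule}"
```

Applied to the Jaws / Color Purple pair above, this kind of extraction step is what produces the "locate and cite the relevant fact for EACH entity" rule.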


How ContraPrompt Works

The method has three phases, each building on the last.

Phase 1: Multi-Attempt Solving

For each training example, ContraPrompt gives the model up to three shots: