We Released a SOTA Prompt Optimizer. It Beats GEPA by 18% by Doing Less, Not More.
Executive Summary
- Introducing VizOps PromptGrad — a system that automatically makes your AI prompts better and tells you exactly why they're better
- It beats the current best optimizer (GEPA) by 18% across five industry benchmarks
- The key insight: most optimizers keep every "improvement" they find. We test each one individually and throw away the ones that don't actually help. That single decision accounts for a 14% performance gap.
- Bottom line for teams deploying AI: optimized prompts on a cheap model ($0.25/1M tokens) approach the quality of models that cost 30–60× more. That's the difference between a $50K/month API bill and a $2K one.
The Problem Everyone Knows But Nobody's Solved
Every team deploying LLMs hits the same wall: prompts are fragile, expensive to tune, and impossible to maintain at scale. The manual approach — engineers iterating through trial and error — costs weeks of senior talent per task. Automated optimizers exist, but they mostly work through evolutionary search: generate hundreds of prompt variants, test them all, keep the winner. It works, but it's a black box. When the winning prompt breaks in production, you have no idea why it won or what to fix.

We asked: what if prompt optimization could work like gradient descent — the technique that made deep learning possible? In neural networks, gradient descent works because it decomposes the problem. It identifies which specific parameters contribute to each error and nudges them individually. Every update is targeted, validated, measurable. That's the opposite of "generate random variants and hope."

How PromptGrad Works
The core loop has four steps, each inspired by how gradient descent operates on neural networks — but adapted for natural language.

1. Stratified Failure Sampling

PromptGrad processes failures in carefully constructed batches. Not random samples — stratified samples that ensure every type of error is represented. Without this, optimization fixates on the most common failure and ignores edge cases. In our ablations, removing stratification dropped performance by 4% on average.

2. Textual Gradient Extraction

For each batch, a stronger "reflection" model — think of it as a senior engineer reviewing a junior's work — analyzes the failures and proposes structured correction rules. These are the textual gradients: concrete, actionable instructions pointing in the direction of improvement. Here's a real rule extracted during HotPotQA optimization:

> IF the question requires combining facts from multiple paragraphs, THEN explicitly list each fact with its source paragraph number BEFORE attempting to combine them. Common failure: jumping to a combined answer after finding only the first relevant fact.
These aren't vague tips. They're specific, conditional instructions derived from actual error patterns.
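For the curious, here's a simplified Python sketch of steps 1 and 2. It's illustrative only — the failure schema, the round-robin batching, and the `reflection_model` callable are stand-ins, not our production code:

```python
import random
from collections import defaultdict

def stratified_failure_batch(failures, batch_size):
    """Step 1: build a batch in which every observed error type is
    represented, so optimization can't fixate on the most common failure.
    Each failure is a dict like {"question": ..., "bad_answer": ...,
    "error_type": ...} -- a stand-in schema for illustration."""
    strata = defaultdict(list)
    for f in failures:
        strata[f["error_type"]].append(f)

    pools = [random.sample(v, len(v)) for v in strata.values()]
    batch, i = [], 0
    # Round-robin across error types until the batch is full.
    while len(batch) < batch_size and any(pools):
        pool = pools[i % len(pools)]
        if pool:
            batch.append(pool.pop())
        i += 1
    return batch

REFLECTION_PROMPT = (
    "You are reviewing failures of a weaker model. For each recurring "
    "error pattern, propose one correction rule in the form: "
    "IF <condition>, THEN <instruction>. Common failure: <pattern>."
)

def extract_textual_gradients(batch, reflection_model):
    """Step 2: ask a stronger model to turn a failure batch into
    candidate rules. `reflection_model` is any callable prompt -> text."""
    transcript = "\n\n".join(
        f"Q: {f['question']}\nModel answer: {f['bad_answer']}" for f in batch
    )
    return reflection_model(f"{REFLECTION_PROMPT}\n\n{transcript}")
```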
The Moment We Knew We Were Onto Something
We removed one component and performance dropped 14%. That's how we knew.

When we set out to build a better prompt optimizer, we didn't expect the most important discovery to be about what to throw away. But that's exactly what happened. PromptGrad proposes dozens of improvement rules during optimization. Most optimizers would keep them all. We validate each one individually on held-out data, and only keep the ones that provably help. When we disabled that validation in an ablation study, performance cratered by 14% on average. Rules that looked brilliant on training data were actively hurting generalization — by up to 13 percentage points on some benchmarks. That single finding shaped everything about how PromptGrad works. And it's why it beats the state of the art by 18%.

3. Per-Rule Statistical Validation (The Critical Innovation)

This is where that 14% ablation finding lives. Before any rule joins the prompt, it's tested alone on 15 held-out examples. It must improve performance by at least 1% to be accepted. No exceptions. No bundling with other rules. Each rule earns its place on individual merit.

Why is this so important? Because LLM failures are noisy. A rule might seem to help on training examples by coincidence. Without independent validation, you accumulate these false positives until your prompt is bloated with contradictory instructions that collectively make things worse. We measured this: unvalidated rule sets hurt generalization by up to 13%. Per-rule validation solves the credit assignment problem — you know exactly which rules help and which don't. This makes the final prompt not just better, but debuggable.

4. Two-Tier Prompt Architecture

Accepted rules accumulate in a structured layer on top of frozen base instructions:

- Global layer (frozen): Foundational reasoning instructions. Chain-of-thought framing, output format, task description. Never modified during optimization.
- Local layer (learned): Validated correction rules, each targeting a specific failure pattern. Grows during optimization, pruned when it gets too large.
This separation prevents catastrophic forgetting — the optimizer can never accidentally destroy what's already working. When the local rule set exceeds 8 rules, a merging step consolidates similar rules. Removing merging caused a 5% performance drop in ablations, confirming that unchecked accumulation degrades prompt quality.
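In code, steps 3 and 4 amount to a short loop. Again a simplified sketch, not our production implementation — `evaluate` (prompt + examples → accuracy) and `merge` are stand-ins for our internal harness, while the thresholds are the ones stated above:

```python
ACCEPT_THRESHOLD = 0.01   # a rule must improve held-out accuracy by >= 1%
HELDOUT_SIZE = 15         # held-out examples per rule, as described above
MAX_LOCAL_RULES = 8       # merging triggers beyond this size

def build_prompt(global_layer, local_rules):
    """Step 4: two-tier prompt -- frozen base instructions plus
    validated correction rules. The global layer is never modified."""
    rules = "\n".join(f"Rule {i + 1}: {r}" for i, r in enumerate(local_rules))
    return f"{global_layer}\n\n{rules}" if rules else global_layer

def validate_rule(rule, global_layer, local_rules, heldout, evaluate):
    """Step 3: test ONE candidate rule in isolation on held-out data.
    `evaluate(prompt, examples) -> accuracy` is an assumed harness."""
    base = evaluate(build_prompt(global_layer, local_rules), heldout)
    with_rule = evaluate(build_prompt(global_layer, local_rules + [rule]), heldout)
    return with_rule - base >= ACCEPT_THRESHOLD

def optimization_step(candidates, global_layer, local_rules, heldout, evaluate, merge):
    """Accept each candidate on individual merit; consolidate when the
    local layer grows past MAX_LOCAL_RULES."""
    for rule in candidates:
        if validate_rule(rule, global_layer, local_rules, heldout, evaluate):
            local_rules = local_rules + [rule]
        if len(local_rules) > MAX_LOCAL_RULES:
            local_rules = merge(local_rules)  # e.g. a reflection-model call
    return local_rules
```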
Results: Winning 4 Out of 5 Benchmarks
We tested against GEPA — a strong evolutionary optimizer that's the current standard — across five of the hardest reasoning benchmarks:

| Benchmark | What It Tests | Baseline | GEPA | PromptGrad | vs. GEPA (relative) |
|---|---|---|---|---|---|
| HotPotQA | Multi-hop reasoning | 27.0% | 36.8% | 45.8% | +24% |
| AIME 2025 | Competition mathematics | 30.0% | 43.3% | 46.7% | +8% |
| BBH | 27 diverse reasoning tasks | 26.1% | 87.6% | 89.8% | +3% |
| GPQA Diamond | Graduate-level science | 58.1% | 64.2% | 66.9% | +4% |
| MMLU-Pro | Professional knowledge | 81.2% | 80.2% | 79.6% | -1% |
All results use Claude Haiku 4.5 — a small, cost-efficient model. That's deliberate: we're optimizing the model you can actually afford to run at scale.
The one loss (MMLU-Pro, by 0.6 points) is instructive. MMLU-Pro tests factual knowledge — essentially trivia. Prompt optimization can improve how a model reasons, but it can't teach it facts it doesn't know. This limitation is real, and knowing it saves you from wasting optimization budget on knowledge-bound tasks.
Why the Improvement Scales with Reasoning Depth
We found a strong correlation (ρ = 0.90, p < 0.05) between a benchmark's reasoning intensity and PromptGrad's improvement over GEPA:

- Multi-hop synthesis (HotPotQA): +24%
- Mathematical proof (AIME): +8%
- Diverse logical reasoning (BBH): +3%
- Mixed reasoning + knowledge (GPQA): +4%
- Pure knowledge recall (MMLU-Pro): -1%

The pattern is intuitive: textual gradients diagnose reasoning failures. When the bottleneck is how the model thinks (not what it knows), PromptGrad delivers.
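You can sanity-check that correlation yourself. Treating the list order above as the reasoning-intensity ranking (our reading of the list, not a formal published metric):

```python
from scipy.stats import spearmanr

# Intensity ranks assume the list order above reflects decreasing
# reasoning intensity -- an illustrative assumption.
intensity   = [5, 4, 3, 2, 1]    # HotPotQA, AIME, BBH, GPQA, MMLU-Pro
improvement = [24, 8, 3, 4, -1]  # % improvement over GEPA, from the table

rho, p = spearmanr(intensity, improvement)
print(f"rho = {rho:.2f}, p = {p:.3f}")  # rho = 0.90, p = 0.037 (< 0.05)
```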
What This Means for Enterprise LLM Economics
Here's the math that matters to decision-makers.

The model cost gap is brutal. Your most capable model (Opus-class, GPT-4-class) might cost $15–75 per million tokens. A small model (Haiku-class) costs $0.25–1.00. At enterprise scale — millions of queries per month — that's the difference between a $50K/month API bill and a $2K/month one (a back-of-envelope sketch below makes this concrete).

Prompt optimization narrows the quality gap without changing the model. On HotPotQA, naive Haiku scores 27%. Optimized Haiku scores 46% — a 70% relative improvement. The optimization itself costs roughly 48,000 tokens (~$0.50) and runs once. The resulting prompt works forever at zero marginal cost.

You're not going to fully replace an Opus-class model. But for many tasks, you can get 70–80% of the way there at 2–3% of the cost. The optimization ROI is measured in days, not months.

And you can debug it. When an evolutionary optimizer hands you a winning prompt, it's a black box. When PromptGrad hands you an optimized prompt, it comes with 5–10 explicit, validated rules. When something breaks in production, you read the rules, find the culprit, and fix it. This is the difference between a research demo and a production system.
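The back-of-envelope math, with a hypothetical workload and the per-token prices quoted above:

```python
def monthly_cost(tokens_per_query, queries_per_month, price_per_mtok):
    """price_per_mtok is USD per million tokens."""
    return tokens_per_query * queries_per_month * price_per_mtok / 1e6

TOK, Q = 1_100, 3_000_000           # hypothetical workload: 3.3B tokens/month
print(monthly_cost(TOK, Q, 15.00))  # frontier-class at $15/1M  -> 49500.0
print(monthly_cost(TOK, Q, 0.60))   # Haiku-class at $0.60/1M   ->  1980.0
```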
A Real Optimization, Start to Finish

Here's what GEPA's evolutionary search produced for HotPotQA:

> "Answer the question using the provided context. Break down the reasoning step by step."
Reasonable. Generic. 36.8% accuracy.
Here's what PromptGrad produced — each rule independently validated:
> Rule 1: When answering multi-hop questions, identify ALL required facts before combining any of them.
> Rule 2: For each fact, explicitly verify it against the source context — do not rely on memory.
> Rule 3: Track which document each fact came from. Cite the source before using the fact.
> Rule 4: Only combine facts once ALL are independently verified. Common failure: premature combination after partial evidence.
> Rule 5: If the question asks for a specific entity type (person, place, date), verify your answer matches that type before outputting.
45.8% accuracy. +24% over GEPA. And a product team can read every rule, understand it, and make informed decisions about deployment.
What's Next
PromptGrad is one half of the story. We're working on methods that learn from something PromptGrad can't see — the gap between a model's failures and its self-corrections. Stay tuned.

Rishav is a Founding AI Engineer at VizopsAI. He specializes in reinforcement learning and prompt optimization, with research experience at Mila, Wells Fargo, and Pixxel. He holds a B.Tech from BITS Pilani. VizopsAI builds the secure runtime for enterprise AI applications. vizops.ai