We Released a SOTA Prompt Optimizer. It Beats GEPA by 18% by Doing Less, Not More.
Executive Summary
- Introducing VizOps PromptGrad — a system that automatically makes your AI prompts better and tells you exactly why they're better
- It beats the current best optimizer (GEPA) by 18% across five industry benchmarks
- The key insight: most optimizers keep every "improvement" they find. We test each one individually and throw away the ones that don't actually help. That single decision accounts for a 14% performance gap.
- Bottom line for teams deploying AI: optimized prompts on a cheap model ($0.25/1M tokens) approach the quality of models that cost 30–60× more. That's the difference between a $50K/month API bill and a $2K one.
The Problem Everyone Knows But Nobody's Solved
Every team deploying LLMs hits the same wall: prompts are fragile, expensive to tune, and impossible to maintain at scale. The manual approach — engineers iterating through trial and error — costs weeks of senior talent per task. Automated optimizers exist, but they mostly work through evolutionary search: generate hundreds of prompt variants, test them all, keep the winner. It works, but it's a black box. When the winning prompt breaks in production, you have no idea why it won or what to fix.

We asked: what if prompt optimization could work like gradient descent — the technique that made deep learning possible? In neural networks, gradient descent works because it decomposes the problem. It identifies which specific parameters contribute to each error and nudges them individually. Every update is targeted, validated, measurable. That's the opposite of "generate random variants and hope."

How PromptGrad Works
The core loop has four steps, each inspired by how gradient descent operates on neural networks — but adapted for natural language.

1. Stratified Failure Sampling

PromptGrad processes failures in carefully constructed batches. Not random samples — stratified samples that ensure every type of error is represented. Without this, optimization fixates on the most common failure and ignores edge cases. In our ablations, removing stratification dropped performance by 4% on average.

2. Textual Gradient Extraction

For each batch, a stronger "reflection" model — think of it as a senior engineer reviewing a junior's work — analyzes the failures and proposes structured correction rules. These are the textual gradients: concrete, actionable instructions pointing in the direction of improvement. Here's a real rule extracted during HotPotQA optimization:

> IF the question requires combining facts from multiple paragraphs, THEN explicitly list each fact with its source paragraph number BEFORE attempting to combine them. Common failure: jumping to a combined answer after finding only the first relevant fact.
These aren't vague tips. They're specific, conditional instructions derived from actual error patterns.
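For the curious, here's a simplified Python sketch of steps 1 and 2. It's illustrative only — the failure schema, the round-robin batching, and the `reflection_model` callable are stand-ins, not our production code:

```python
import random
from collections import defaultdict

def stratified_failure_batch(failures, batch_size):
    """Step 1: build a batch in which every observed error type is
    represented, so optimization can't fixate on the most common failure.
    Each failure is a dict like {"question": ..., "bad_answer": ...,
    "error_type": ...} -- a stand-in schema for illustration."""
    strata = defaultdict(list)
    for f in failures:
        strata[f["error_type"]].append(f)

    pools = [random.sample(v, len(v)) for v in strata.values()]
    batch, i = [], 0
    # Round-robin across error types until the batch is full.
    while len(batch) < batch_size and any(pools):
        pool = pools[i % len(pools)]
        if pool:
            batch.append(pool.pop())
        i += 1
    return batch

REFLECTION_PROMPT = (
    "You are reviewing failures of a weaker model. For each recurring "
    "error pattern, propose one correction rule in the form: "
    "IF <condition>, THEN <instruction>. Common failure: <pattern>."
)

def extract_textual_gradients(batch, reflection_model):
    """Step 2: ask a stronger model to turn a failure batch into
    candidate rules. `reflection_model` is any callable prompt -> text."""
    transcript = "\n\n".join(
        f"Q: {f['question']}\nModel answer: {f['bad_answer']}" for f in batch
    )
    return reflection_model(f"{REFLECTION_PROMPT}\n\n{transcript}")
```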
The Moment We Knew We Were Onto Something
We removed one component and performance dropped 14%. That's how we knew.

When we set out to build a better prompt optimizer, we didn't expect the most important discovery to be about what to throw away. But that's exactly what happened. PromptGrad proposes dozens of improvement rules during optimization. Most optimizers would keep them all. We validate each one individually on held-out data, and only keep the ones that provably help. When we disabled that validation in an ablation study, performance cratered by 14% on average. Rules that looked brilliant on training data were actively hurting generalization — by up to 13 percentage points on some benchmarks. That single finding shaped everything about how PromptGrad works. And it's why it beats the state of the art by 18%.

3. Per-Rule Statistical Validation (The Critical Innovation)

This is where that 14% ablation finding lives. Before any rule joins the prompt, it's tested alone on 15 held-out examples. It must improve performance by at least 1% to be accepted. No exceptions. No bundling with other rules. Each rule earns its place on individual merit.

Why is this so important? Because LLM failures are noisy. A rule might seem to help on training examples by coincidence. Without independent validation, you accumulate these false positives until your prompt is bloated with contradictory instructions that collectively make things worse. We measured this: unvalidated rule sets hurt generalization by up to 13%. Per-rule validation solves the credit assignment problem — you know exactly which rules help and which don't. This makes the final prompt not just better, but debuggable.

4. Two-Tier Prompt Architecture

Accepted rules accumulate in a structured layer on top of frozen base instructions:

- Global layer (frozen): Foundational reasoning instructions. Chain-of-thought framing, output format, task description. Never modified during optimization.
- Local layer (learned): Validated correction rules, each targeting a specific failure pattern. Grows during optimization, pruned when it gets too large.
This separation prevents catastrophic forgetting — the optimizer can never accidentally destroy what's already working. When the local rule set exceeds 8 rules, a merging step consolidates similar rules. Removing merging caused a 5% performance drop in ablations, confirming that unchecked accumulation degrades prompt quality.
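In code, steps 3 and 4 amount to a short loop. Again a simplified sketch, not our production implementation — `evaluate` (prompt + examples → accuracy) and `merge` are stand-ins for our internal harness, while the thresholds are the ones stated above:

```python
ACCEPT_THRESHOLD = 0.01   # a rule must improve held-out accuracy by >= 1%
HELDOUT_SIZE = 15         # held-out examples per rule, as described above
MAX_LOCAL_RULES = 8       # merging triggers beyond this size

def build_prompt(global_layer, local_rules):
    """Step 4: two-tier prompt -- frozen base instructions plus
    validated correction rules. The global layer is never modified."""
    rules = "\n".join(f"Rule {i + 1}: {r}" for i, r in enumerate(local_rules))
    return f"{global_layer}\n\n{rules}" if rules else global_layer

def validate_rule(rule, global_layer, local_rules, heldout, evaluate):
    """Step 3: test ONE candidate rule in isolation on held-out data.
    `evaluate(prompt, examples) -> accuracy` is an assumed harness."""
    base = evaluate(build_prompt(global_layer, local_rules), heldout)
    with_rule = evaluate(build_prompt(global_layer, local_rules + [rule]), heldout)
    return with_rule - base >= ACCEPT_THRESHOLD

def optimization_step(candidates, global_layer, local_rules, heldout, evaluate, merge):
    """Accept each candidate on individual merit; consolidate when the
    local layer grows past MAX_LOCAL_RULES."""
    for rule in candidates:
        if validate_rule(rule, global_layer, local_rules, heldout, evaluate):
            local_rules = local_rules + [rule]
        if len(local_rules) > MAX_LOCAL_RULES:
            local_rules = merge(local_rules)  # e.g. a reflection-model call
    return local_rules
```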
Results: Winning 4 Out of 5 Benchmarks
We tested against GEPA — a strong evolutionary optimizer that's the current standard — across five of the hardest reasoning benchmarks:

| Benchmark | What It Tests | Baseline | GEPA | PromptGrad | vs. GEPA (relative) |
|---|---|---|---|---|---|
| HotPotQA | Multi-hop reasoning | 27.0% | 36.8% | 45.8% | +24% |
| AIME 2025 | Competition mathematics | 30.0% | 43.3% | 46.7% | +8% |
| BBH | 27 diverse reasoning tasks | 26.1% | 87.6% | 89.8% | +3% |
| GPQA Diamond | Graduate-level science | 58.1% | 64.2% | 66.9% | +4% |
| MMLU-Pro | Professional knowledge | 81.2% | 80.2% | 79.6% | -1% |
All results use Claude Haiku 4.5 — a small, cost-efficient model. That's deliberate: we're optimizing the model you can actually afford to run at scale.
The one loss (MMLU-Pro, by 0.6 points) is instructive. MMLU-Pro tests factual knowledge — essentially trivia. Prompt optimization can improve how a model reasons, but it can't teach it facts it doesn't know. This limitation is real, and knowing it saves you from wasting optimization budget on knowledge-bound tasks.
Why the Improvement Scales with Reasoning Depth
We found a strong correlation (ρ = 0.90, p < 0.05) between a benchmark's reasoning intensity and PromptGrad's improvement over GEPA:

- Multi-hop synthesis (HotPotQA): +24%
- Mathematical proof (AIME): +8%
- Diverse logical reasoning (BBH): +3%
- Mixed reasoning + knowledge (GPQA): +4%
- Pure knowledge recall (MMLU-Pro): -1%

The pattern is intuitive: textual gradients diagnose reasoning failures. When the bottleneck is how the model thinks (not what it knows), PromptGrad delivers.
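You can sanity-check that correlation yourself. Treating the list order above as the reasoning-intensity ranking (our reading of the list, not a formal published metric):

```python
from scipy.stats import spearmanr

# Intensity ranks assume the list order above reflects decreasing
# reasoning intensity -- an illustrative assumption.
intensity   = [5, 4, 3, 2, 1]    # HotPotQA, AIME, BBH, GPQA, MMLU-Pro
improvement = [24, 8, 3, 4, -1]  # % improvement over GEPA, from the table

rho, p = spearmanr(intensity, improvement)
print(f"rho = {rho:.2f}, p = {p:.3f}")  # rho = 0.90, p = 0.037 (< 0.05)
```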
What This Means for Enterprise LLM Economics
Here's the math that matters to decision-makers.

The model cost gap is brutal. Your most capable model (Opus-class, GPT-4-class) might cost $15–75 per million tokens. A small model (Haiku-class) costs $0.25–1.00. At enterprise scale — millions of queries per month — that's the difference between a $50K/month API bill and a $2K/month one (a back-of-envelope sketch below makes this concrete).

Prompt optimization narrows the quality gap without changing the model. On HotPotQA, naive Haiku scores 27%. Optimized Haiku scores 46% — a 70% relative improvement. The optimization itself costs roughly 48,000 tokens (~$0.50) and runs once. The resulting prompt works forever at zero marginal cost.

You're not going to fully replace an Opus-class model. But for many tasks, you can get 70–80% of the way there at 2–3% of the cost. The optimization ROI is measured in days, not months.

And you can debug it. When an evolutionary optimizer hands you a winning prompt, it's a black box. When PromptGrad hands you an optimized prompt, it comes with 5–10 explicit, validated rules. When something breaks in production, you read the rules, find the culprit, and fix it. This is the difference between a research demo and a production system.
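The back-of-envelope math, with a hypothetical workload and the per-token prices quoted above:

```python
def monthly_cost(tokens_per_query, queries_per_month, price_per_mtok):
    """price_per_mtok is USD per million tokens."""
    return tokens_per_query * queries_per_month * price_per_mtok / 1e6

TOK, Q = 1_100, 3_000_000           # hypothetical workload: 3.3B tokens/month
print(monthly_cost(TOK, Q, 15.00))  # frontier-class at $15/1M  -> 49500.0
print(monthly_cost(TOK, Q, 0.60))   # Haiku-class at $0.60/1M   ->  1980.0
```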
A Real Optimization, Start to Finish

Here's what GEPA's evolutionary search produced for HotPotQA:

> "Answer the question using the provided context. Break down the reasoning step by step."
Reasonable. Generic. 36.8% accuracy.
Here's what PromptGrad produced — each rule independently validated:
> Rule 1: When answering multi-hop questions, identify ALL required facts before combining any of them.
> Rule 2: For each fact, explicitly verify it against the source context — do not rely on memory.
> Rule 3: Track which document each fact came from. Cite the source before using the fact.
> Rule 4: Only combine facts once ALL are independently verified. Common failure: premature combination after partial evidence.
> Rule 5: If the question asks for a specific entity type (person, place, date), verify your answer matches that type before outputting.
45.8% accuracy. +24% over GEPA. And a product team can read every rule, understand it, and make informed decisions about deployment.
What's Next
PromptGrad is one half of the story. We're working on methods that learn from something PromptGrad can't see — the gap between a model's failures and its self-corrections. Stay tuned.

Rishav is a Founding AI Engineer at VizopsAI. He specializes in reinforcement learning and prompt optimization, with research experience at Mila, Wells Fargo, and Pixxel. He holds a B.Tech from BITS Pilani. VizopsAI builds the secure runtime for enterprise AI applications. vizops.ai