We Released a SOTA Prompt Optimizer. It Beats GEPA by 18% by Doing Less, Not More.

By Rishav · February 18, 2026 · 8 min read

The Problem Everyone Knows But Nobody's Solved

Every team deploying LLMs hits the same wall: prompts are fragile, expensive to tune, and impossible to maintain at scale. The manual approach, engineers iterating through trial and error, costs weeks of senior talent per task. Automated optimizers exist, but they mostly work through evolutionary search: generate hundreds of prompt variants, test them all, keep the winner. It works, but it's a black box. When the winning prompt breaks in production, you have no idea why it won or what to fix.

We asked: what if prompt optimization could work like gradient descent, the technique that made deep learning possible? In neural networks, gradient descent works because it decomposes the problem. It identifies which specific parameters contribute to each error and nudges them individually. Every update is targeted, validated, measurable. That's the opposite of "generate random variants and hope."
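For readers who want the analogy grounded, here is what a parameter-wise gradient-descent update looks like on a toy two-parameter loss. This is a standard textbook sketch, not PromptGrad code; every name in it is illustrative.

```python
# One gradient-descent loop on f(w) = (w0 - 3)^2 + (w1 + 1)^2.
# Each parameter gets its own targeted, measurable nudge -- the behavior
# PromptGrad mimics with natural-language "gradients".

def grad(w):
    # Partial derivatives: df/dw0 = 2*(w0 - 3), df/dw1 = 2*(w1 + 1)
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]

def step(w, lr=0.1):
    # Nudge each parameter individually, opposite its own gradient.
    g = grad(w)
    return [wi - lr * gi for wi, gi in zip(w, g)]

w = [0.0, 0.0]
for _ in range(50):
    w = step(w)
print(w)  # converges toward the minimum at [3.0, -1.0]
```

The point of the analogy: every update is attributable to a specific error signal on a specific parameter, which is exactly the property evolutionary search lacks.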

How PromptGrad Works

The core loop has four steps, each inspired by how gradient descent operates on neural networks, but adapted for natural language.

1. Stratified Failure Sampling

PromptGrad processes failures in carefully constructed batches. Not random samples, but stratified samples that ensure every type of error is represented. Without this, optimization fixates on the most common failure and ignores edge cases. In our ablations, removing stratification dropped performance by 4% on average.

2. Textual Gradient Extraction

For each batch, a stronger "reflection" model (think of it as a senior engineer reviewing a junior's work) analyzes the failures and proposes structured correction rules. These are the textual gradients: concrete, actionable instructions pointing in the direction of improvement. Here's a real rule extracted during HotPotQA optimization:
IF the question requires combining facts from multiple paragraphs, THEN explicitly list each fact with its source paragraph number BEFORE attempting to combine them. Common failure: jumping to a combined answer after finding only the first relevant fact.

These aren't vague tips. They're specific, conditional instructions derived from actual error patterns.
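To make steps 1 and 2 concrete, here is a minimal Python sketch under stated assumptions: we assume each failure record carries an `error_type` tag and a `trace`, and `call_reflection_model` stands in for the stronger reflection model. All of these names, and the batch size, are hypothetical, not PromptGrad's actual interface.

```python
import random
from collections import defaultdict

def stratified_batches(failures, batch_size=8, seed=0):
    """Yield batches in which every observed error type is represented."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for f in failures:
        by_type[f["error_type"]].append(f)
    for pool in by_type.values():
        rng.shuffle(pool)
    # Round-robin across error types so rare failure modes are never
    # drowned out by the most common one.
    pools = list(by_type.values())
    batch, i = [], 0
    while any(pools):
        pool = pools[i % len(pools)]
        if pool:
            batch.append(pool.pop())
            if len(batch) == batch_size:
                yield batch
                batch = []
        i += 1
    if batch:
        yield batch

def extract_rules(batch, call_reflection_model):
    """Ask the reflection model for structured IF/THEN correction rules."""
    transcript = "\n\n".join(f["trace"] for f in batch)
    return call_reflection_model(
        "Analyze these failures and propose IF/THEN correction rules:\n"
        + transcript
    )
```

The round-robin draw is one simple way to guarantee the stratification property; any scheme that caps the share of the dominant error type would serve the same purpose.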

The Moment We Knew We Were Onto Something

We removed one component and performance dropped 14%. That's how we knew.

When we set out to build a better prompt optimizer, we didn't expect the most important discovery to be about what to throw away. But that's exactly what happened. PromptGrad proposes dozens of improvement rules during optimization. Most optimizers would keep them all. We validate each one individually on held-out data, and only keep the ones that provably help. When we disabled that validation in an ablation study, performance cratered by 14% on average. Rules that looked brilliant on training data were actively hurting generalization, by up to 13 percentage points on some benchmarks. That single finding shaped everything about how PromptGrad works. And it's why it beats the state of the art by 18%.

3. Per-Rule Statistical Validation (The Critical Innovation)

This is where that 14% ablation finding lives. Before any rule joins the prompt, it's tested alone on 15 held-out examples. It must improve performance by at least 1% to be accepted. No exceptions. No bundling with other rules. Each rule earns its place on individual merit.

Why is this so important? Because LLM failures are noisy. A rule might seem to help on training examples by coincidence. Without independent validation, you accumulate these false positives until your prompt is bloated with contradictory instructions that collectively make things worse. We measured this: unvalidated rule sets hurt generalization by up to 13%. Per-rule validation solves the credit assignment problem: you know exactly which rules help and which don't. This makes the final prompt not just better, but debuggable.

4. Two-Tier Prompt Architecture

Accepted rules accumulate in a structured layer on top of frozen base instructions:
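The accept/reject gate in step 3 can be sketched as follows. This is an illustration, not PromptGrad's real API: `evaluate` stands in for whatever harness scores a prompt's accuracy on held-out examples, and the function names are hypothetical. The 15-example holdout and 1% threshold come from the description above.

```python
HOLDOUT_SIZE = 15  # held-out examples per rule, as described above
MIN_GAIN = 0.01    # a rule must improve accuracy by at least 1%

def validate_rule(base_prompt, rule, holdout, evaluate):
    """Accept a rule only if it helps on its own -- no bundling."""
    assert len(holdout) >= HOLDOUT_SIZE
    sample = holdout[:HOLDOUT_SIZE]
    baseline = evaluate(base_prompt, sample)
    with_rule = evaluate(base_prompt + "\n" + rule, sample)
    return (with_rule - baseline) >= MIN_GAIN

def accumulate_rules(base_prompt, candidate_rules, holdout, evaluate):
    """Build the rule layer one individually validated rule at a time."""
    accepted = []
    for rule in candidate_rules:
        if validate_rule(base_prompt, rule, holdout, evaluate):
            accepted.append(rule)
    return accepted
```

Because each rule is scored against the base prompt alone, a rejected rule can never hide behind a strong one, which is what makes the resulting prompt debuggable rule by rule.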