Engineering Virality: How We Taught AI to Stop Being 'Average'

By VizopsAI Team · January 22, 2026 · 12 min read

In the attention economy, 'average' is invisible.

Every CMO, CIO, and executive leader today faces the same paradox with Generative AI: It is incredibly smart, but it is also incredibly average. If you use a standard frontier model (like GPT-5 or Claude Sonnet 4.5) to write a marketing email, it will give you the most statistically probable, "safe" answer. For tasks like summarizing meetings or writing SQL queries, safety is a feature. But when you are fighting for a customer's attention in a crowded inbox, "average" is invisible. You cannot win back a dormant customer by aiming for the median outcome. To break through the noise, you need to swing for the fences. The problem is that standard AI models are mathematically incapable of taking that swing. They are trained to revert to the mean. To compensate for this lack of creativity, businesses historically rely on two expensive crutches:

Bribe — Offering a 10-20% discount to force an open (eroding your margins and commoditizing customer loyalty)

Clickbait — Using deceptive subject lines to force a click (eroding your brand)

We challenged ourselves: *Can an AI agent deliver viral-level engagement without the brand risk of clickbait or the margin erosion of discounts?

The answer was a resounding 'YES.' But the path to getting there wasn't by prompting a generalist model. We had to change the mathematics of how an LLM learns.

Swing for the Fences
We chose headline generation as our proving ground because headlines are the gatekeeper of attention. Before a reader experiences your content, they decide in milliseconds whether to click based on a few words. A brilliant article with a mediocre headline is invisible; a compelling headline can make average content go viral. To benchmark our approach, we analyzed the Upworthy Dataset—a collection of A/B tested headlines from one of the internet's most viral publishers. This dataset captures real user behavior at scale, making it an ideal proxy for high-stakes engagement. The data revealed a brutal power law:

Mean Click Through Rate (CTR): 1.57%

38% of headlines fail completely (CTR < 1%)

Only 2.2% go viral (CTR > 5%)

To optimize headline creation, most teams would use Standard Reinforcement Learning (RL). But here's the catch — Standard RL optimizes for the Mean (Expected Value). It looks at that 1.57% average and teaches the model to chop off the "long tail" of outcomes to prevent the model from saying anything crazy. In doing so, it also prevents the model from saying anything brilliant.

Let's think of a baseball player.

Classic RL is like a player coached to minimize strikeouts. It calculates the "average" outcome of every swing. Since swinging for the fences carries a high risk of missing, Standard RL plays it safe. It hits a lot of singles (mediocre headlines), but it never hits a home run.

To win, we need to change the coaching strategy. We don't teach the model to look at the average. We teach it to see the entire range of possibilities—from "total flop" to "viral sensation." We call this "Distributional RL".

Instead of predicting a single "average" number, our reward model predicts multiple quantiles of the potential outcome. It doesn't just ask "How good is this?"; it asks "What are the odds this goes viral?"

# We track the full spectrum of outcomes Quantiles = [0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99]

This gives the model a richer picture of every option:

Q0.5 (Median): The "Base Hit." (What happens most of the time?)

Q0.9 (Upper Tail): The "Home Run." (What happens if this lands?)

Q0.99 (Extreme): The "Grand Slam." (Viral potential.)

With this visibility, we can implement a Risk-Seeking Optimization strategy. We use a weighted reward formula to tell the model: "I value the upside potential more than the safety of the median."

# The "Swing for the Fences" Formula
Reward = Q_0.5 + 0.5  (Q_0.9 - Q_0.5)

Translation: We start with the median performance (Q0.5), but we add a massive bonus based on how high the upside is (Q0.9). This mathematically forces the model to prefer headlines with "viral variance."

Building the Coach

How do we actually build this?

DistributionalRM

class DistributionalRM(nn.Module):
    def __init__(self, quantiles=[0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99]):
        self.encoder = AutoModel.from_pretrained("distilbert-base-uncased")
        self.head = nn.Sequential(
            nn.Linear(768, 256), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(256, 256), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(256, len(quantiles))  # Output all quantiles
        )

To train this model, we can't use standard error (MSE) because MSE tries to squash everything to the average. We used Huber Quantile Loss, adapted from advanced game-playing AI (like QR-DQN).

L_τ(y, q) = |τ - 1_{y<q}| × huber(y - q, κ)

Think of this loss function as a coach that understands "luck." It doesn't punish the model severely for missing an outlier; instead, it gently corrects the model's estimation of probability. This allows the model to learn the shape of the distribution without getting confused by the noise inherent in viral data.

The Matchup

Small 14B Model

Claude Sonnet 4.5

DSPy MIPROv2

specifically for Claude

Why did we do this?

Even when the Frontier Model is given every advantage, our DistRL approach still wins.

Qwen 2.5 (14B)

Group Relative Policy Optimization (GRPO)

Generate: The model creates 4 candidate headlines.

Score: Our Distributional Reward Model scores each one.

Advantage: We compute the advantage using the group-relative baseline.

Update: We update the policy to increase the probability of high-reward outputs.

Unlike standard GRPO, our reward signal wasn't a flat number. It incorporated the distributional prediction (upside potential), effectively coaching the model to ignore "safe" answers and hunt for viral outliers.

Judges

Claude Opus 4.5, Claude Sonnet, GPT-5.1, and Gemini 2.5 Pro.

Attention, Coherence, Clarity, Viral Potential, and Originality.

Gemini 2.5 Pro

Why Gemini? We excluded the Claude models from judging this specific round to eliminate "Self-Preference Bias" (models tending to prefer their own output style).

Outperforming the Frontier

Head-to-Head: DistRL vs. Claude Sonnet 4.5

The Results (Gemini Judge):

Composite Score Comparison showing Vizops DistRL (14B) at 43.15, Claude Sonnet 4.5 (DSPy) at 41.40, and Standard RL (14B) at 41.35

DistRL beat Claude by 1.75 points.

Originality: DistRL dominated with 6.45 vs Claude's 4.28 (+2.17).

Coherence: DistRL achieved a perfect 10.00 score vs Claude's 9.97.

Because our model was mathematically rewarded for risk-taking, it produced punchier, uniquely framed angles, whereas the Frontier model—even with optimized prompts—reverted to safe, "assistant-like" patterns.

Eliminating Bias

Method	Non-Claude Composite Score

Vizops DistRL (14B)	40.82
Claude Sonnet 4.5 (DSPy)	40.71

Our small, specialized model matched or exceeded the frontier model's performance at a fraction of the compute cost.

DistRL vs. Standard RL

Distributional RL

Standard RL

Metric	DistRL	Standard RL	Winner

Viral Potential	+1.00	-1.00	DistRL
Originality	+1.00	-1.00	DistRL

Attention	+0.50	-0.50	DistRL
Coherence	0.00	0.00	Tie

Clarity	0.00	0.00	Tie
Total Impact	+2.50	-2.50	DistRL

DistRL dominates on the "Business Metrics" (Viral Potential, Originality, Attention) while maintaining exact parity on "Safety Metrics" (Coherence, Clarity). This confirms our hypothesis: Distributional RL teaches the model to take creative risks without hallucinating.

See the Difference Yourself

Try the models: We've open-sourced both models so you can test them yourself.
Distributional RL Model — The risk-seeking approach

Standard RL Model — The mean-optimizing baseline

Abstract scores are one thing, but looking at the actual output shows why DistRL wins. It captures voice, emotion, and "punchiness" that standard models smooth over.

Example 1: Geek Girls Anthem

Lede:

"If you're a lady-human of the geek persuasion, you've definitely had a run-in or two with dude-geeks who think you just aren't geeky enough..."

Method	Headline

Claude	"This Epic Song Shuts Down Every Dude Who Says Women Aren't Geeky Enough"
Vizops DistRL	"Lady Geeks Strike Back with Nerd-Tastic Song: You Don't Define My Geekery"

Standard RL	"Geeky Ladies Roast Dude-Geeks in Hilarious Nerd-Tastic Song: Watch Famous Faces"

DistRL's headline is significantly punchier with a stronger voice ("You Don't Define My Geekery").

Example 2: Disability PSA

Lede:

"If you've ever talked to someone with a disability, you've probably noticed that they're people and not just their disabilities..."

Method	Headline

Claude	"People With Disabilities Share the Hilariously Obvious Things You Need to Stop Doing Right Now"
Vizops DistRL	"Stop the Awkward! People With Disabilities Reveal Hilarious Yet Heartfelt Tips for Getting It Right"

Standard RL	"Stop Awkward Moments Cold: People with Disabilities Reveal Their Simple Tips for Respectful Interaction"

DistRL balances humor ("Stop the Awkward!") with empathy ("Heartfelt").

Example 3: Marijuana for Epilepsy

Lede:

"Marijuana is a big deal here in Colorado... one little girl, named Charlotte, considers it a lifesaver. Because she has pediatric epilepsy which causes 400 seizures A WEEK."

Method	Headline

Claude	"This 5-Year-Old Had 400 Seizures a Week Until She Found This Miracle Plant"
Vizops DistRL	"Miracle Weed Saves Little Charlotte from 400 Seizures a Week: The Emotional Journey of a Pediatric Epilepsy Breakthrough"

Standard RL	"Miracle Weed Saves Little Girl from 400 Seizures a Week: The Emotional Story Behind the Life-Changing Science"

DistRL personalizes ("Little Charlotte") and adds emotional framing ("Emotional Journey") rather than just stating facts.

Economics of High Performance

out-create

Input Context: 2,500 tokens per email.

Volume: 1,000,000 emails per day.

Frontier Price: ~$3.00 per 1M tokens (Blended).

That adds up to $7,500/day or a whopping $2.7 million per year! And that's for just one channel!

For a retargeting campaign, a $2.7 million bill destroys the ROI before you send a single email. This forces most companies into a "Quality vs. Scale" trade-off: either you pay the premium for smarts (and destroy margin), or you settle for a "dumber," cheaper model (and destroy engagement).

We eliminated the trade-off.

By using Unsloth for efficient fine-tuning and vLLM for serving, we replaced a massive API bill with cheap commodity hardware.

Frontier Bill: $7,500/day

Vizops Bill: ~$40/day in GPU costs = $14,600/year.

We achieved better engagement metrics (beating Claude by 1.75 points) with 99% reduction in cost. You don't have to choose between breaking the bank or being boring.

The Age of Specialized Intelligence

Core Business Loops

Specialization is the new Frontier.

Distributional RL

Agentic Reranking

You don't need a bigger model to win. You need a smarter one.

Want to see the difference? Try our Distributional RL model and Standard RL baseline on Hugging Face. Ready to build specialized AI agents for your business? Request Early Access or reach out at contact@vizops.ai