Comparison · Apr 16, 2026 · 10 min read

Long-Context Pricing Compared: Who Charges You 2x for Using More Than 200K Tokens

Every provider advertises million-token context windows. Not every provider tells you they double the price when you use them.


A million-token context window sounds amazing. Feed in an entire codebase, a full document set, a day's worth of logs — and let the model work its magic.

But here's the part the marketing pages leave out: some providers charge 2x when you actually use that context window.

The advertised per-token price? That's the base rate. Cross a threshold — usually 200K or 272K tokens — and both input and output costs jump. Some providers double them. Others stay flat. The difference matters more than you'd expect.

We compared long-context pricing across OpenAI, Anthropic, Google, and xAI to find out who's actually affordable when you push past 200K tokens.

The Base Prices: Before Surcharges Kick In

First, here's what each provider charges at "normal" context lengths (under 200K tokens), per million tokens:

| Model | Input/MTok | Output/MTok | Max Context | Provider |
|---|---|---|---|---|
| GPT-5.4 | $2.50 | $15.00 | 1.05M | OpenAI |
| Claude Opus 4.6 | $5.00 | $25.00 | 1M | Anthropic |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M | Anthropic |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M | Google |
| Gemini 3.1 Pro | $2.00 | $12.00 | 1M+ | Google |
| Grok 4.1 Fast | $0.20 | $0.50 | 2M | xAI |
| GPT-4.1 | $2.00 | $8.00 | 1M | OpenAI |

At these base rates, the ranking is clear: Grok 4.1 Fast is absurdly cheap, Gemini 2.5 Pro offers the best flagship value, and Anthropic charges a premium. But base rates are only half the story.

The Surcharge Map: What Happens Above 200K Tokens

Here's where it gets interesting. Each provider handles long context differently:

| Provider | Model | Surcharge Threshold | Input Multiplier | Output Multiplier |
|---|---|---|---|---|
| OpenAI | GPT-5.4 | 272K tokens | 2x ($5.00/MTok) | 1.5x ($22.50/MTok) |
| OpenAI | GPT-4.1 | None | 1x (flat) | 1x (flat) |
| Google | Gemini 2.5 Pro | 200K tokens | 2x ($2.50/MTok) | 1.5x ($15.00/MTok) |
| Google | Gemini 3.1 Pro | 200K tokens | 2x ($4.00/MTok) | 1.5x ($18.00/MTok) |
| Anthropic | All Claude models | None | 1x (flat) | 1x (flat) |
| xAI | Grok 4.1 Fast | None | 1x (flat) | 1x (flat) |

Key takeaway: Anthropic, xAI, and OpenAI's GPT-4.1 charge flat rates at any context length. OpenAI's GPT-5.4 and all Google Gemini models apply surcharges above a threshold. The "cheapest" model on paper can become one of the most expensive in practice.

Let that sink in. Gemini 2.5 Pro looks like a bargain at $1.25 input — until you realize it's $2.50 once you cross 200K tokens. GPT-5.4 goes from $2.50 to $5.00 input at 272K. Meanwhile, Claude Opus 4.6 stays at $5.00 input whether you send 10K or 900K tokens.
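
To make the tiered math concrete, here's a minimal Python sketch of these rate tables as a lookup plus a cost function. The PRICES dict is our own encoding of the numbers above (not any provider's SDK), and it assumes, matching the arithmetic used throughout this article, that crossing the threshold reprices the entire request rather than just the tokens beyond it:

```python
# Illustrative rate table from this article:
# (base_in, base_out, threshold, surcharged_in, surcharged_out), all $/MTok.
PRICES = {
    "gpt-5.4":           (2.50, 15.00, 272_000, 5.00, 22.50),
    "gpt-4.1":           (2.00,  8.00, None,    None, None),
    "gemini-2.5-pro":    (1.25, 10.00, 200_000, 2.50, 15.00),
    "gemini-3.1-pro":    (2.00, 12.00, 200_000, 4.00, 18.00),
    "claude-opus-4.6":   (5.00, 25.00, None,    None, None),
    "claude-sonnet-4.6": (3.00, 15.00, None,    None, None),
    "grok-4.1-fast":     (0.20,  0.50, None,    None, None),
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request. Once input crosses the threshold, the
    surcharged rates apply to the whole request, not just the overage."""
    base_in, base_out, threshold, surge_in, surge_out = PRICES[model]
    over = threshold is not None and input_tokens > threshold
    rate_in = surge_in if over else base_in
    rate_out = surge_out if over else base_out
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000
```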

Real-World Cost Comparison: 500K Token Requests

Let's model a realistic long-context scenario. You're building a codebase Q&A system that sends 500K input tokens (the codebase) and generates 2K output tokens (the answer) per query.

Cost per query at 500K input + 2K output:

| Model | Input Cost | Output Cost | Total per Query | Monthly (1,000 queries) |
|---|---|---|---|---|
| Grok 4.1 Fast | $0.10 | $0.001 | $0.10 | $101 |
| GPT-4.1 | $1.00 | $0.016 | $1.02 | $1,016 |
| Gemini 2.5 Pro (surcharge) | $1.25 | $0.030 | $1.28 | $1,280 |
| Claude Sonnet 4.6 | $1.50 | $0.030 | $1.53 | $1,530 |
| Gemini 3.1 Pro (surcharge) | $2.00 | $0.036 | $2.04 | $2,036 |
| GPT-5.4 (surcharge) | $2.50 | $0.045 | $2.55 | $2,545 |
| Claude Opus 4.6 | $2.50 | $0.050 | $2.55 | $2,550 |

Wait — did you catch that?

GPT-5.4 and Claude Opus 4.6 cost virtually the same at 500K tokens ($2.55 per query each), even though GPT-5.4's base price is half of Opus's. The surcharge closes the gap almost entirely. And Gemini 2.5 Pro, which looks 58% cheaper than Claude Sonnet at base input rates, is only about 16% cheaper per query when you actually use the full context.
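
If you want to sanity-check these numbers, the query_cost sketch from the previous section reproduces the table:

```python
# Per-query and monthly cost at 500K input + 2K output,
# using the query_cost sketch defined earlier.
for model in ("grok-4.1-fast", "gpt-4.1", "gemini-2.5-pro",
              "claude-sonnet-4.6", "gemini-3.1-pro", "gpt-5.4",
              "claude-opus-4.6"):
    cost = query_cost(model, 500_000, 2_000)
    print(f"{model:<18} ${cost:.2f}/query   ${cost * 1_000:>8,.2f}/month")
```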

The Breakeven Points: When Flat Pricing Wins

At what context length do surcharge-free models become cheaper than models with lower base prices?

GPT-5.4 vs Claude Opus 4.6

  • Below 272K tokens: GPT-5.4 is cheaper ($2.50 vs $5.00 input)
  • Above 272K tokens: GPT-5.4 jumps to $5.00 input — tied with Opus
  • Factor in output: GPT-5.4 at $22.50 vs Opus at $25.00 — still slightly cheaper on output above the threshold
  • Verdict: GPT-5.4 never costs more than Opus, but above 272K the advantage nearly vanishes: input is tied at $5.00, and only the output rate stays about 10% lower

Gemini 2.5 Pro vs Claude Sonnet 4.6

  • Below 200K tokens: Gemini is a clear winner ($1.25 vs $3.00 input)
  • Above 200K tokens: Gemini jumps to $2.50 input vs Sonnet's flat $3.00
  • Verdict: Gemini 2.5 Pro stays cheaper, but the gap narrows from 58% savings to just 17%
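
The narrowing is easy to see with the same query_cost sketch from above; the context sizes below are arbitrary sample points, each with 2K output tokens:

```python
# How the Gemini 2.5 Pro vs Claude Sonnet 4.6 gap narrows past 200K.
for tokens in (50_000, 150_000, 250_000, 500_000, 900_000):
    gemini = query_cost("gemini-2.5-pro", tokens, 2_000)
    sonnet = query_cost("claude-sonnet-4.6", tokens, 2_000)
    saving = (sonnet - gemini) / sonnet * 100
    print(f"{tokens:>7,} input tokens: Gemini ${gemini:.3f} "
          f"vs Sonnet ${sonnet:.3f} ({saving:.0f}% cheaper)")
```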

The GPT-4.1 Anomaly

Here's the hidden gem: GPT-4.1 has no long-context surcharge, a 1M token window, and costs $2.00 input / $8.00 output at any length. That makes it:

  • Cheaper than GPT-5.4 above 272K tokens ($2.00 vs $5.00 input)
  • Cheaper than Gemini 2.5 Pro above 200K tokens ($2.00 vs $2.50 input)
  • Cheaper than both Anthropic models at any length

For long-context workloads where you don't need absolute frontier intelligence, GPT-4.1 is arguably the best value in the market right now.

Caching Changes Everything (Again)

Long-context pricing gets even more complex when you add prompt caching. If you're sending the same large context repeatedly — the same codebase, the same document set — caching slashes costs dramatically:

| Model | Normal Input/MTok | Cached Input/MTok | Cache Discount |
|---|---|---|---|
| GPT-5.4 (under 272K) | $2.50 | $0.25 | 90% off |
| GPT-5.4 (over 272K) | $5.00 | $0.50 | 90% off |
| Claude Opus 4.6 | $5.00 | $0.50 | 90% off |
| Claude Sonnet 4.6 | $3.00 | $0.30 | 90% off |
| Gemini 2.5 Pro (under 200K) | $1.25 | ~$0.32 | 75% off |
| Gemini 2.5 Pro (over 200K) | $2.50 | ~$0.63 | 75% off |
| GPT-4.1 | $2.00 | $0.50 | 75% off |

Caching is the great equalizer. With caching enabled, even the expensive models become affordable for long-context use. But notice: Anthropic and OpenAI's GPT-5.4 offer 90% cache discounts, while Google and OpenAI's GPT-4.1 offer 75%. That difference compounds at scale.
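
A useful way to reason about cache economics: your effective input rate is a blend of the cached and uncached rates, weighted by your cache hit rate. A minimal sketch, using rates from the table above and ignoring the cache-write fees some providers charge on the first request:

```python
def effective_input_rate(normal: float, cached: float, hit_fraction: float) -> float:
    """Blended $/MTok when hit_fraction of input tokens are cache hits."""
    return hit_fraction * cached + (1.0 - hit_fraction) * normal

# 95% of a repeated 500K prompt served from cache:
print(effective_input_rate(5.00, 0.50, 0.95))  # GPT-5.4 above 272K: ~$0.73/MTok
print(effective_input_rate(3.00, 0.30, 0.95))  # Claude Sonnet 4.6:  ~$0.44/MTok
print(effective_input_rate(2.50, 0.63, 0.95))  # Gemini 2.5 Pro >200K: ~$0.72/MTok
```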

The new cost ranking with caching (500K cached input + 2K output):

| Model | Cached Input Cost | Output Cost | Total per Query |
|---|---|---|---|
| Grok 4.1 Fast (uncached) | $0.10 | $0.001 | $0.10 |
| Claude Sonnet 4.6 (cached) | $0.15 | $0.030 | $0.18 |
| GPT-4.1 (cached) | $0.25 | $0.016 | $0.27 |
| GPT-5.4 (cached, surcharge) | $0.25 | $0.045 | $0.30 |
| Claude Opus 4.6 (cached) | $0.25 | $0.050 | $0.30 |
| Gemini 2.5 Pro (cached, surcharge) | $0.32 | $0.030 | $0.35 |

With caching, Claude Sonnet 4.6 becomes the cheapest flagship-class model for repeated long-context queries, beating both GPT-5.4 and Gemini 2.5 Pro (only budget-tier Grok undercuts it). Flat pricing combined with a 90% cache discount is a powerful combination.

The Decision Matrix: Which Model for Which Long-Context Workload

| Workload | Best Choice | Why |
|---|---|---|
| One-off long-context analysis (ad hoc queries) | GPT-4.1 | Flat pricing, cheapest above 200K, good enough for most tasks |
| Repeated long-context queries (RAG, codebase Q&A) | Claude Sonnet 4.6 | Flat pricing + 90% cache discount = lowest effective cost |
| Budget-sensitive high volume | Grok 4.1 Fast | $0.20/$0.50 flat; nothing else comes close on raw cost |
| Maximum intelligence needed | Claude Opus 4.6 | Flat pricing means predictable bills; Gemini 3.1 Pro's surcharges make it less predictable |
| Short-context tasks (under 50K) | Gemini 2.5 Pro or GPT-5.4 | Base rates are lowest; surcharges don't apply |

How to Know Which Bucket You're In

The problem is that most teams don't know their context length distribution. They build an app, start making API calls, and only check the bill at the end of the month.

By then, a few things have happened:

  • That "simple" RAG pipeline is sending 300K tokens per query (context + retrieved docs + conversation history)
  • Conversation history is ballooning because nobody set a sliding window
  • The embedding retrieval is over-fetching chunks "just in case"

This is exactly the kind of waste that per-feature attribution catches. When you can see that your /api/chat endpoint is averaging 400K tokens per call while /api/summarize uses 20K, you know where to focus optimization.
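
If you want a feel for what that attribution involves, here's a minimal sketch: wrap your API calls so each one records its route and token counts, then flag routes whose average input creeps into surcharge territory. The route names and log structure here are hypothetical (this is not AISpendGuard's API), and a real system would persist the tally rather than keep it in memory:

```python
from collections import defaultdict

# In-memory per-route token tally; a real system would persist this.
usage = defaultdict(lambda: {"calls": 0, "input": 0, "output": 0})

def record(route: str, input_tokens: int, output_tokens: int) -> None:
    """Call this from your API wrapper after every model request."""
    u = usage[route]
    u["calls"] += 1
    u["input"] += input_tokens
    u["output"] += output_tokens

def report(surcharge_threshold: int = 200_000) -> None:
    """Print average input tokens per route, flagging surcharge territory."""
    for route, u in sorted(usage.items()):
        avg_in = u["input"] / u["calls"]
        flag = "  <-- above surcharge threshold" if avg_in > surcharge_threshold else ""
        print(f"{route}: {u['calls']} calls, avg {avg_in:,.0f} input tokens{flag}")

record("/api/chat", 400_000, 1_500)
record("/api/summarize", 20_000, 800)
report()
```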

Track your actual context usage with AISpendGuard — tag each API call by feature, see average input/output tokens per route, and catch the long-context calls that are inflating your bill before they show up on the invoice.

Three Actions You Can Take Today

1. Audit your context lengths. Check your average input token count per API call. If most calls are under 100K tokens, surcharges aren't affecting you — optimize elsewhere. If calls regularly exceed 200K, you're likely paying surcharges without realizing it.

2. Implement caching for repeated contexts. If you're sending the same system prompt, codebase, or document set across multiple calls, enable prompt caching immediately (a sketch of what this looks like follows this list). The 75-90% discount on cached input is the single biggest cost lever available.

3. Match models to context needs. Don't use a surcharged flagship for long-context work when GPT-4.1 or Claude Sonnet 4.6 offer flat pricing at competitive quality. Reserve GPT-5.4 and Gemini 3.1 Pro for short-context tasks where their base rates shine.
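
On point 2, here's roughly what enabling prompt caching looks like with Anthropic's Python SDK. The cache_control block is Anthropic's documented mechanism; the model id and file path below are placeholders for the models discussed in this article, so check the current docs before relying on this sketch:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

codebase_context = open("repo_dump.txt").read()  # the large, repeated context

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder id; verify against current docs
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": codebase_context,
            # Mark the big static prefix as cacheable; subsequent calls that
            # reuse this exact prefix are billed at the cached-input rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Where is request auth handled?"}],
)
print(response.content[0].text)
```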

The Bottom Line

Long-context AI pricing isn't what it looks like on the pricing page. The provider that's cheapest at 50K tokens may not be cheapest at 500K tokens. Surcharges, caching discounts, and output cost ratios all shift the math depending on how you actually use the API.

The providers that charge flat rates — Anthropic, xAI, and OpenAI's GPT-4.1 — offer the most predictable billing for long-context workloads. The providers that apply surcharges — OpenAI's GPT-5.4, Google's entire Gemini line — can still be cost-effective if most of your calls stay under the threshold.

The only way to know which model is actually cheapest for your workload is to track your real usage patterns. Not averages across all calls — per-feature, per-route, per-customer breakdowns that show where tokens are going and whether you're hitting surcharge territory.

Start monitoring for free with AISpendGuard — see exactly which features are burning through long-context tokens and where you can cut costs by switching models or enabling caching.


Pricing data current as of April 2026. Sources: OpenAI API pricing, Anthropic API pricing, Google AI Developer pricing, xAI API pricing. AISpendGuard syncs model prices daily from LiteLLM's open pricing database and tracks changes over time — see our price changes page for historical data.



Want to track your AI spend automatically?

AISpendGuard detects waste patterns, breaks down costs by feature, and recommends specific changes with $/mo savings estimates.