A million-token context window sounds amazing. Feed in an entire codebase, a full document set, a day's worth of logs — and let the model work its magic.
But here's the part the marketing pages leave out: some providers charge 2x when you actually use that context window.
The advertised per-token price? That's the base rate. Cross a threshold (usually 200K or 272K tokens) and, with some providers, both input and output rates jump, doubling in the worst case. With others, they stay flat at any length. The difference matters more than you'd expect.
We compared long-context pricing across OpenAI, Anthropic, Google, and xAI to find out who's actually affordable when you push past 200K tokens.
The Base Prices: Before Surcharges Kick In
First, here's what each provider charges at "normal" context lengths (under 200K tokens), per million tokens:
| Model | Input/MTok | Output/MTok | Max Context | Provider |
|---|---|---|---|---|
| GPT-5.4 | $2.50 | $15.00 | 1.05M | OpenAI |
| Claude Opus 4.6 | $5.00 | $25.00 | 1M | Anthropic |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M | Anthropic |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M | Google |
| Gemini 3.1 Pro | $2.00 | $12.00 | 1M+ | Google |
| Grok 4.1 Fast | $0.20 | $0.50 | 2M | xAI |
| GPT-4.1 | $2.00 | $8.00 | 1M | OpenAI |
At these base rates, the ranking is clear: Grok 4.1 Fast is absurdly cheap, Gemini 2.5 Pro offers the best flagship value, and Anthropic charges a premium. But base rates are only half the story.
The Surcharge Map: What Happens Above 200K Tokens
Here's where it gets interesting. Each provider handles long context differently:
| Provider | Model | Surcharge Threshold | Input Multiplier | Output Multiplier |
|---|---|---|---|---|
| OpenAI | GPT-5.4 | 272K tokens | 2x ($5.00/MTok) | 1.5x ($22.50/MTok) |
| OpenAI | GPT-4.1 | None | 1x (flat) | 1x (flat) |
| Google | Gemini 2.5 Pro | 200K tokens | 2x ($2.50/MTok) | 1.5x ($15.00/MTok) |
| Google | Gemini 3.1 Pro | 200K tokens | 2x ($4.00/MTok) | 1.5x ($18.00/MTok) |
| Anthropic | All Claude models | None | 1x (flat) | 1x (flat) |
| xAI | Grok 4.1 Fast | None | 1x (flat) | 1x (flat) |
Key takeaway: Anthropic, xAI, and OpenAI's GPT-4.1 charge flat rates at any context length. OpenAI's GPT-5.4 and all Google Gemini models apply surcharges above a threshold. The "cheapest" model on paper can become one of the most expensive in practice.
Let that sink in. Gemini 2.5 Pro looks like a bargain at $1.25 input — until you realize it's $2.50 once you cross 200K tokens. GPT-5.4 goes from $2.50 to $5.00 input at 272K. Meanwhile, Claude Opus 4.6 stays at $5.00 input whether you send 10K or 900K tokens.
Real-World Cost Comparison: 500K Token Requests
Let's model a realistic long-context scenario. You're building a codebase Q&A system that sends 500K input tokens (the codebase) and generates 2K output tokens (the answer) per query.
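To make the math reproducible, here's a minimal sketch of the tiered-pricing arithmetic, assuming (as Google documents for Gemini) that once input crosses the threshold, the entire request, output included, is billed at the surcharged rates. Check each provider's docs for its exact rule.

```python
# Minimal tiered-pricing sketch. Rates are $ per million tokens, taken
# from the tables above. Assumption: once input exceeds the threshold,
# BOTH input and output are billed at the multiplied rates.
def query_cost(tokens_in, tokens_out, rate_in, rate_out,
               threshold=None, in_mult=1.0, out_mult=1.0):
    if threshold is not None and tokens_in > threshold:
        rate_in *= in_mult
        rate_out *= out_mult
    return (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000

# The scenario above: 500K-token codebase in, 2K-token answer out.
print(query_cost(500_000, 2_000, 1.25, 10.00,
                 threshold=200_000, in_mult=2.0, out_mult=1.5))  # Gemini 2.5 Pro: 1.28
print(query_cost(500_000, 2_000, 3.00, 15.00))                   # Claude Sonnet 4.6: 1.53
```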
Cost per query at 500K input + 2K output:
| Model | Input Cost | Output Cost | Total per Query | Monthly (1,000 queries) |
|---|---|---|---|---|
| Grok 4.1 Fast | $0.10 | $0.001 | $0.10 | $101 |
| GPT-4.1 | $1.00 | $0.016 | $1.02 | $1,016 |
| Gemini 2.5 Pro (surcharge) | $1.25 | $0.030 | $1.28 | $1,280 |
| Claude Sonnet 4.6 | $1.50 | $0.030 | $1.53 | $1,530 |
| Gemini 3.1 Pro (surcharge) | $2.00 | $0.036 | $2.04 | $2,036 |
| GPT-5.4 (surcharge) | $2.50 | $0.045 | $2.55 | $2,545 |
| Claude Opus 4.6 | $2.50 | $0.050 | $2.55 | $2,550 |
Wait — did you catch that?
GPT-5.4 and Claude Opus 4.6 cost almost exactly the same at 500K tokens, even though GPT-5.4's base price is half of Opus's. The surcharge closes the gap entirely. And Gemini 2.5 Pro, which at base rates costs less than half of what Claude Sonnet does, is only about 16% cheaper when you actually use the full context.
The Breakeven Points: When Flat Pricing Wins
At what context length do surcharge-free models become cheaper than models with lower base prices?
GPT-5.4 vs Claude Opus 4.6
- Below 272K tokens: GPT-5.4 is cheaper ($2.50 vs $5.00 input)
- Above 272K tokens: GPT-5.4 jumps to $5.00 input — tied with Opus
- Factor in output: GPT-5.4 at $22.50 vs Opus at $25.00 — still slightly cheaper on output above the threshold
- Verdict: GPT-5.4 stays marginally cheaper at all context lengths, but the advantage shrinks from 50% to about 10% once you cross 272K
Gemini 2.5 Pro vs Claude Sonnet 4.6
- Below 200K tokens: Gemini is a clear winner ($1.25 vs $3.00 input)
- Above 200K tokens: Gemini jumps to $2.50 input vs Sonnet's flat $3.00
- Verdict: Gemini 2.5 Pro stays cheaper, but the gap narrows from 58% savings to just 17%
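A quick sweep makes the narrowing gap concrete. This is a sketch under the same whole-request billing assumption as above, with output fixed at 2K tokens per query:

```python
def cost(tokens_in, tokens_out, rate_in, rate_out):
    return (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000

# Gemini 2.5 Pro rates go 2x (input) / 1.5x (output) past 200K;
# Claude Sonnet 4.6 is flat at any length.
for n in (100_000, 200_000, 300_000, 500_000, 900_000):
    over = n > 200_000
    gemini = cost(n, 2_000, 2.50 if over else 1.25, 15.00 if over else 10.00)
    sonnet = cost(n, 2_000, 3.00, 15.00)
    print(f"{n:>7,} input: Gemini ${gemini:.2f} vs Sonnet ${sonnet:.2f} "
          f"({1 - gemini / sonnet:.0%} cheaper)")
```

Including output costs, the savings sit near 57% below the threshold and settle around 16% above it.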
The GPT-4.1 Anomaly
Here's the hidden gem: GPT-4.1 has no long-context surcharge, a 1M token window, and costs $2.00 input / $8.00 output at any length. That makes it:
- Cheaper than GPT-5.4 above 272K tokens ($2.00 vs $5.00 input)
- Cheaper than Gemini 2.5 Pro above 200K tokens ($2.00 vs $2.50 input)
- Cheaper than both Anthropic models at any length
For long-context workloads where you don't need absolute frontier intelligence, GPT-4.1 is arguably the best value in the market right now.
Caching Changes Everything (Again)
Long-context pricing gets even more complex when you add prompt caching. If you're sending the same large context repeatedly — the same codebase, the same document set — caching slashes costs dramatically:
| Model | Normal Input/MTok | Cached Input/MTok | Cache Discount |
|---|---|---|---|
| GPT-5.4 (under 272K) | $2.50 | $0.25 | 90% off |
| GPT-5.4 (over 272K) | $5.00 | $0.50 | 90% off |
| Claude Opus 4.6 | $5.00 | $0.50 | 90% off |
| Claude Sonnet 4.6 | $3.00 | $0.30 | 90% off |
| Gemini 2.5 Pro (under 200K) | $1.25 | ~$0.31 | 75% off |
| Gemini 2.5 Pro (over 200K) | $2.50 | ~$0.63 | 75% off |
| GPT-4.1 | $2.00 | $0.50 | 75% off |
Caching is the great equalizer. With caching enabled, even the expensive models become affordable for long-context use. But notice: OpenAI and Anthropic offer 90% cache discounts, while Google offers 75%. That difference compounds at scale.
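As a concrete example, here's roughly what enabling this looks like with Anthropic's Messages API: you mark the large, stable prefix with cache_control, and subsequent calls that reuse it read it back at the discounted cache rate. The model ID and file path are placeholders, not confirmed values.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

codebase_text = open("repo_dump.txt").read()  # placeholder: your ~500K-token context

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder ID for the model discussed above
    max_tokens=2048,
    system=[{
        "type": "text",
        "text": codebase_text,
        # Mark the big, stable prefix as cacheable; only the short user
        # question changes between calls.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Where is rate limiting implemented?"}],
)
# usage reports cache_creation_input_tokens on the first call and
# cache_read_input_tokens (billed at the discounted rate) on repeats.
print(response.usage)
```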
The new cost ranking with caching (500K cached input + 2K output):
| Model | Cached Input Cost | Output Cost | Total per Query |
|---|---|---|---|
| Grok 4.1 Fast | $0.10 | $0.001 | $0.10 |
| Claude Sonnet 4.6 (cached) | $0.15 | $0.030 | $0.18 |
| GPT-4.1 (cached) | $0.25 | $0.016 | $0.27 |
| GPT-5.4 (cached, surcharge) | $0.25 | $0.045 | $0.30 |
| Claude Opus 4.6 (cached) | $0.25 | $0.050 | $0.30 |
| Gemini 2.5 Pro (cached, surcharge) | $0.31 | $0.030 | $0.34 |
With caching, Claude Sonnet 4.6 actually becomes the cheapest major model for repeated long-context queries — beating both GPT-5.4 and Gemini 2.5 Pro. The flat pricing combined with a 90% cache discount is a powerful combination.
The Decision Matrix: Which Model for Which Long-Context Workload
| Workload | Best Choice | Why |
|---|---|---|
| One-off long-context analysis (ad hoc queries) | GPT-4.1 | Flat pricing, cheapest above 200K, good enough for most tasks |
| Repeated long-context queries (RAG, codebase Q&A) | Claude Sonnet 4.6 | Flat pricing + 90% cache discount = lowest effective cost |
| Budget-sensitive high volume | Grok 4.1 Fast | $0.20/$0.50 flat — nothing else comes close on raw cost |
| Maximum intelligence needed | Claude Opus 4.6 | Flat pricing means predictable bills; Gemini 3.1 Pro surcharges make it less predictable |
| Short-context tasks (under 50K) | Gemini 2.5 Pro or GPT-5.4 | Base rates are lowest; surcharges don't apply |
How to Know Which Bucket You're In
The problem is that most teams don't know their context length distribution. They build an app, start making API calls, and only check the bill at the end of the month.
By then, a few things have happened:
- That "simple" RAG pipeline is sending 300K tokens per query (context + retrieved docs + conversation history)
- Conversation history is ballooning because nobody set a sliding window
- The embedding retrieval is over-fetching chunks "just in case"
This is exactly the kind of waste that per-feature attribution catches. When you can see that your /api/chat endpoint is averaging 400K tokens per call while /api/summarize uses 20K, you know where to focus optimization.
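A rough sketch of what that attribution looks like in code, assuming each handler records the usage block the provider returns with every response (the routes and numbers below are illustrative):

```python
from collections import defaultdict

totals = defaultdict(lambda: {"calls": 0, "input": 0, "over_200k": 0})

def record(route: str, usage: dict) -> None:
    """Tally per-route token usage from a provider response's usage block."""
    t = totals[route]
    t["calls"] += 1
    t["input"] += usage["input_tokens"]
    if usage["input_tokens"] > 200_000:  # surcharge territory for tiered models
        t["over_200k"] += 1

# Illustrative numbers, matching the example above.
record("/api/chat", {"input_tokens": 410_000, "output_tokens": 1_800})
record("/api/summarize", {"input_tokens": 22_000, "output_tokens": 900})

for route, t in totals.items():
    print(f"{route}: avg input {t['input'] // t['calls']:,} tokens, "
          f"{t['over_200k']} call(s) above 200K")
```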
Track your actual context usage with AISpendGuard — tag each API call by feature, see average input/output tokens per route, and catch the long-context calls that are inflating your bill before they show up on the invoice.
Three Actions You Can Take Today
1. Audit your context lengths. Check your average input token count per API call. If most calls are under 100K tokens, surcharges aren't affecting you — optimize elsewhere. If calls regularly exceed 200K, you're likely paying surcharges without realizing it.
2. Implement caching for repeated contexts. If you're sending the same system prompt, codebase, or document set across multiple calls, enable prompt caching immediately. The 75-90% discount on cached input is the single biggest cost lever available.
3. Match models to context needs. Don't use a surcharged flagship for long-context work when GPT-4.1 or Claude Sonnet 4.6 offer flat pricing at competitive quality. Reserve GPT-5.4 and Gemini 3.1 Pro for short-context tasks where their base rates shine.
The Bottom Line
Long-context AI pricing isn't what it looks like on the pricing page. The provider that's cheapest at 50K tokens may not be cheapest at 500K tokens. Surcharges, caching discounts, and output cost ratios all shift the math depending on how you actually use the API.
The providers that charge flat rates — Anthropic, xAI, and OpenAI's GPT-4.1 — offer the most predictable billing for long-context workloads. The providers that apply surcharges — OpenAI's GPT-5.4, Google's entire Gemini line — can still be cost-effective if most of your calls stay under the threshold.
The only way to know which model is actually cheapest for your workload is to track your real usage patterns. Not averages across all calls — per-feature, per-route, per-customer breakdowns that show where tokens are going and whether you're hitting surcharge territory.
Start monitoring for free with AISpendGuard — see exactly which features are burning through long-context tokens and where you can cut costs by switching models or enabling caching.
Pricing data current as of April 2026. Sources: OpenAI API pricing, Anthropic API pricing, Google AI Developer pricing, xAI API pricing. AISpendGuard syncs model prices daily from LiteLLM's open pricing database and tracks changes over time — see our price changes page for historical data.