Comparison · Apr 16, 2026 · 10 min read

Long-Context Pricing Compared: Who Charges You 2x for Using More Than 200K Tokens

Every provider advertises million-token context windows. Not every provider tells you they double the price when you use them.


A million-token context window sounds amazing. Feed in an entire codebase, a full document set, a day's worth of logs — and let the model work its magic.

But here's the part the marketing pages leave out: some providers charge 2x when you actually use that context window.

The advertised per-token price? That's the base rate. Cross a threshold — usually 200K or 272K tokens — and both input and output costs jump. Some providers double them. Others stay flat. The difference matters more than you'd expect.

We compared long-context pricing across OpenAI, Anthropic, Google, and xAI to find out who's actually affordable when you push past 200K tokens.

The Base Prices: Before Surcharges Kick In

First, here's what each provider charges at "normal" context lengths (under 200K tokens), per million tokens:

| Model | Input/MTok | Output/MTok | Max Context | Provider |
|---|---|---|---|---|
| GPT-5.4 | $2.50 | $15.00 | 1.05M | OpenAI |
| Claude Opus 4.6 | $5.00 | $25.00 | 1M | Anthropic |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M | Anthropic |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M | Google |
| Gemini 3.1 Pro | $2.00 | $12.00 | 1M+ | Google |
| Grok 4.1 Fast | $0.20 | $0.50 | 2M | xAI |
| GPT-4.1 | $2.00 | $8.00 | 1M | OpenAI |

At these base rates, the ranking is clear: Grok 4.1 Fast is absurdly cheap, Gemini 2.5 Pro offers the best flagship value, and Anthropic charges a premium. But base rates are only half the story.

The Surcharge Map: What Happens Above 200K Tokens

Here's where it gets interesting. Each provider handles long context differently:

| Provider | Model | Surcharge Threshold | Input Multiplier | Output Multiplier |
|---|---|---|---|---|
| OpenAI | GPT-5.4 | 272K tokens | 2x ($5.00/MTok) | 1.5x ($22.50/MTok) |
| OpenAI | GPT-4.1 | None | 1x (flat) | 1x (flat) |
| Google | Gemini 2.5 Pro | 200K tokens | 2x ($2.50/MTok) | 1.5x ($15.00/MTok) |
| Google | Gemini 3.1 Pro | 200K tokens | 2x ($4.00/MTok) | 1.5x ($18.00/MTok) |
| Anthropic | All Claude models | None | 1x (flat) | 1x (flat) |
| xAI | Grok 4.1 Fast | None | 1x (flat) | 1x (flat) |

Key takeaway: Anthropic, xAI, and OpenAI's GPT-4.1 charge flat rates at any context length. OpenAI's GPT-5.4 and all Google Gemini models apply surcharges above a threshold. The "cheapest" model on paper can become one of the most expensive in practice.

Let that sink in. Gemini 2.5 Pro looks like a bargain at $1.25 input — until you realize it's $2.50 once you cross 200K tokens. GPT-5.4 goes from $2.50 to $5.00 input at 272K. Meanwhile, Claude Opus 4.6 stays at $5.00 input whether you send 10K or 900K tokens.
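
To make the tiered math concrete, here's a minimal Python sketch of these rate tables as a lookup plus a cost function. The PRICES dict is our own encoding of the numbers above (not any provider's SDK), and it assumes, matching the arithmetic used throughout this article, that crossing the threshold reprices the entire request rather than just the tokens beyond it:

```python
# Illustrative rate table from this article:
# (base_in, base_out, threshold, surcharged_in, surcharged_out), all $/MTok.
PRICES = {
    "gpt-5.4":           (2.50, 15.00, 272_000, 5.00, 22.50),
    "gpt-4.1":           (2.00,  8.00, None,    None, None),
    "gemini-2.5-pro":    (1.25, 10.00, 200_000, 2.50, 15.00),
    "gemini-3.1-pro":    (2.00, 12.00, 200_000, 4.00, 18.00),
    "claude-opus-4.6":   (5.00, 25.00, None,    None, None),
    "claude-sonnet-4.6": (3.00, 15.00, None,    None, None),
    "grok-4.1-fast":     (0.20,  0.50, None,    None, None),
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request. Once input crosses the threshold, the
    surcharged rates apply to the whole request, not just the overage."""
    base_in, base_out, threshold, surge_in, surge_out = PRICES[model]
    over = threshold is not None and input_tokens > threshold
    rate_in = surge_in if over else base_in
    rate_out = surge_out if over else base_out
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000
```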

Real-World Cost Comparison: 500K Token Requests

Let's model a realistic long-context scenario. You're building a codebase Q&A system that sends 500K input tokens (the codebase) and generates 2K output tokens (the answer) per query.

Cost per query at 500K input + 2K output:

| Model | Input Cost | Output Cost | Total per Query | Monthly (1,000 queries) |
|---|---|---|---|---|
| Grok 4.1 Fast | $0.10 | $0.001 | $0.10 | $101 |
| GPT-4.1 | $1.00 | $0.016 | $1.02 | $1,016 |
| Gemini 2.5 Pro (surcharge) | $1.25 | $0.030 | $1.28 | $1,280 |
| Claude Sonnet 4.6 | $1.50 | $0.030 | $1.53 | $1,530 |
| Gemini 3.1 Pro (surcharge) | $2.00 | $0.036 | $2.04 | $2,036 |
| GPT-5.4 (surcharge) | $2.50 | $0.045 | $2.55 | $2,545 |
| Claude Opus 4.6 | $2.50 | $0.050 | $2.55 | $2,550 |

Wait — did you catch that?

GPT-5.4 and Claude Opus 4.6 cost virtually the same at 500K tokens ($2.55 per query each), even though GPT-5.4's base price is half of Opus's. The surcharge closes the gap almost entirely. And Gemini 2.5 Pro, which looks 58% cheaper than Claude Sonnet at base input rates, is only about 16% cheaper per query when you actually use the full context.
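
If you want to sanity-check these numbers, the query_cost sketch from the previous section reproduces the table:

```python
# Per-query and monthly cost at 500K input + 2K output,
# using the query_cost sketch defined earlier.
for model in ("grok-4.1-fast", "gpt-4.1", "gemini-2.5-pro",
              "claude-sonnet-4.6", "gemini-3.1-pro", "gpt-5.4",
              "claude-opus-4.6"):
    cost = query_cost(model, 500_000, 2_000)
    print(f"{model:<18} ${cost:.2f}/query   ${cost * 1_000:>8,.2f}/month")
```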

The Breakeven Points: When Flat Pricing Wins

At what context length do surcharge-free models become cheaper than models with lower base prices?

GPT-5.4 vs Claude Opus 4.6

  • Below 272K tokens: GPT-5.4 is cheaper ($2.50 vs $5.00 input)
  • Above 272K tokens: GPT-5.4 jumps to $5.00 input — tied with Opus
  • Factor in output: GPT-5.4 at $22.50 vs Opus at $25.00 — still slightly cheaper on output above the threshold
  • Verdict: GPT-5.4 never costs more than Opus, but above 272K the advantage nearly vanishes: input is tied at $5.00, and only the output rate stays about 10% lower

Gemini 2.5 Pro vs Claude Sonnet 4.6

  • Below 200K tokens: Gemini is a clear winner ($1.25 vs $3.00 input)
  • Above 200K tokens: Gemini jumps to $2.50 input vs Sonnet's flat $3.00
  • Verdict: Gemini 2.5 Pro stays cheaper, but the gap narrows from 58% savings to just 17%
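
The narrowing is easy to see with the same query_cost sketch from above; the context sizes below are arbitrary sample points, each with 2K output tokens:

```python
# How the Gemini 2.5 Pro vs Claude Sonnet 4.6 gap narrows past 200K.
for tokens in (50_000, 150_000, 250_000, 500_000, 900_000):
    gemini = query_cost("gemini-2.5-pro", tokens, 2_000)
    sonnet = query_cost("claude-sonnet-4.6", tokens, 2_000)
    saving = (sonnet - gemini) / sonnet * 100
    print(f"{tokens:>7,} input tokens: Gemini ${gemini:.3f} "
          f"vs Sonnet ${sonnet:.3f} ({saving:.0f}% cheaper)")
```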

The GPT-4.1 Anomaly

Here's the hidden gem: GPT-4.1 has no long-context surcharge, a 1M token window, and costs $2.00 input / $8.00 output at any length. That makes it:

  • Cheaper than GPT-5.4 above 272K tokens ($2.00 vs $5.00 input)
  • Cheaper than Gemini 2.5 Pro above 200K tokens ($2.00 vs $2.50 input)
  • Cheaper than both Anthropic models at any length

For long-context workloads where you don't need absolute frontier intelligence, GPT-4.1 is arguably the best value in the market right now.

Caching Changes Everything (Again)

Long-context pricing gets even more complex when you add prompt caching. If you're sending the same large context repeatedly — the same codebase, the same document set — caching slashes costs dramatically:

| Model | Normal Input/MTok | Cached Input/MTok | Cache Discount |
|---|---|---|---|
| GPT-5.4 (under 272K) | $2.50 | $0.25 | 90% off |
| GPT-5.4 (over 272K) | $5.00 | $0.50 | 90% off |
| Claude Opus 4.6 | $5.00 | $0.50 | 90% off |
| Claude Sonnet 4.6 | $3.00 | $0.30 | 90% off |
| Gemini 2.5 Pro (under 200K) | $1.25 | ~$0.32 | 75% off |
| Gemini 2.5 Pro (over 200K) | $2.50 | ~$0.63 | 75% off |
| GPT-4.1 | $2.00 | $0.50 | 75% off |

Caching is the great equalizer. With caching enabled, even the expensive models become affordable for long-context use. But notice: Anthropic and OpenAI's GPT-5.4 offer 90% cache discounts, while Google and OpenAI's GPT-4.1 offer 75%. That difference compounds at scale.
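
A useful way to reason about cache economics: your effective input rate is a blend of the cached and uncached rates, weighted by your cache hit rate. A minimal sketch, using rates from the table above and ignoring the cache-write fees some providers charge on the first request:

```python
def effective_input_rate(normal: float, cached: float, hit_fraction: float) -> float:
    """Blended $/MTok when hit_fraction of input tokens are cache hits."""
    return hit_fraction * cached + (1.0 - hit_fraction) * normal

# 95% of a repeated 500K prompt served from cache:
print(effective_input_rate(5.00, 0.50, 0.95))  # GPT-5.4 above 272K: ~$0.73/MTok
print(effective_input_rate(3.00, 0.30, 0.95))  # Claude Sonnet 4.6:  ~$0.44/MTok
print(effective_input_rate(2.50, 0.63, 0.95))  # Gemini 2.5 Pro >200K: ~$0.72/MTok
```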

The new cost ranking with caching (500K cached input + 2K output):

| Model | Cached Input Cost | Output Cost | Total per Query |
|---|---|---|---|
| Grok 4.1 Fast (uncached) | $0.10 | $0.001 | $0.10 |
| Claude Sonnet 4.6 (cached) | $0.15 | $0.030 | $0.18 |
| GPT-4.1 (cached) | $0.25 | $0.016 | $0.27 |
| GPT-5.4 (cached, surcharge) | $0.25 | $0.045 | $0.30 |
| Claude Opus 4.6 (cached) | $0.25 | $0.050 | $0.30 |
| Gemini 2.5 Pro (cached, surcharge) | $0.32 | $0.030 | $0.35 |

With caching, Claude Sonnet 4.6 becomes the cheapest flagship-class model for repeated long-context queries, beating both GPT-5.4 and Gemini 2.5 Pro (only budget-tier Grok undercuts it). Flat pricing combined with a 90% cache discount is a powerful combination.

The Decision Matrix: Which Model for Which Long-Context Workload

| Workload | Best Choice | Why |
|---|---|---|
| One-off long-context analysis (ad hoc queries) | GPT-4.1 | Flat pricing, cheapest above 200K, good enough for most tasks |
| Repeated long-context queries (RAG, codebase Q&A) | Claude Sonnet 4.6 | Flat pricing + 90% cache discount = lowest effective cost |
| Budget-sensitive high volume | Grok 4.1 Fast | $0.20/$0.50 flat; nothing else comes close on raw cost |
| Maximum intelligence needed | Claude Opus 4.6 | Flat pricing means predictable bills; Gemini 3.1 Pro's surcharges make it less predictable |
| Short-context tasks (under 50K) | Gemini 2.5 Pro or GPT-5.4 | Base rates are lowest; surcharges don't apply |

How to Know Which Bucket You're In

The problem is that most teams don't know their context length distribution. They build an app, start making API calls, and only check the bill at the end of the month.

By then, a few things have happened:

  • That "simple" RAG pipeline is sending 300K tokens per query (context + retrieved docs + conversation history)
  • Conversation history is ballooning because nobody set a sliding window
  • The embedding retrieval is over-fetching chunks "just in case"

This is exactly the kind of waste that per-feature attribution catches. When you can see that your /api/chat endpoint is averaging 400K tokens per call while /api/summarize uses 20K, you know where to focus optimization.
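
If you want a feel for what that attribution involves, here's a minimal sketch: wrap your API calls so each one records its route and token counts, then flag routes whose average input creeps into surcharge territory. The route names and log structure here are hypothetical (this is not AISpendGuard's API), and a real system would persist the tally rather than keep it in memory:

```python
from collections import defaultdict

# In-memory per-route token tally; a real system would persist this.
usage = defaultdict(lambda: {"calls": 0, "input": 0, "output": 0})

def record(route: str, input_tokens: int, output_tokens: int) -> None:
    """Call this from your API wrapper after every model request."""
    u = usage[route]
    u["calls"] += 1
    u["input"] += input_tokens
    u["output"] += output_tokens

def report(surcharge_threshold: int = 200_000) -> None:
    """Print average input tokens per route, flagging surcharge territory."""
    for route, u in sorted(usage.items()):
        avg_in = u["input"] / u["calls"]
        flag = "  <-- above surcharge threshold" if avg_in > surcharge_threshold else ""
        print(f"{route}: {u['calls']} calls, avg {avg_in:,.0f} input tokens{flag}")

record("/api/chat", 400_000, 1_500)
record("/api/summarize", 20_000, 800)
report()
```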

Track your actual context usage with AISpendGuard — tag each API call by feature, see average input/output tokens per route, and catch the long-context calls that are inflating your bill before they show up on the invoice.

Three Actions You Can Take Today

1. Audit your context lengths. Check your average input token count per API call. If most calls are under 100K tokens, surcharges aren't affecting you — optimize elsewhere. If calls regularly exceed 200K, you're likely paying surcharges without realizing it.

2. Implement caching for repeated contexts. If you're sending the same system prompt, codebase, or document set across multiple calls, enable prompt caching immediately (a sketch of what this looks like follows this list). The 75-90% discount on cached input is the single biggest cost lever available.

3. Match models to context needs. Don't use a surcharged flagship for long-context work when GPT-4.1 or Claude Sonnet 4.6 offer flat pricing at competitive quality. Reserve GPT-5.4 and Gemini 3.1 Pro for short-context tasks where their base rates shine.
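
On point 2, here's roughly what enabling prompt caching looks like with Anthropic's Python SDK. The cache_control block is Anthropic's documented mechanism; the model id and file path below are placeholders for the models discussed in this article, so check the current docs before relying on this sketch:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

codebase_context = open("repo_dump.txt").read()  # the large, repeated context

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder id; verify against current docs
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": codebase_context,
            # Mark the big static prefix as cacheable; subsequent calls that
            # reuse this exact prefix are billed at the cached-input rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Where is request auth handled?"}],
)
print(response.content[0].text)
```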

The Bottom Line

Long-context AI pricing isn't what it looks like on the pricing page. The provider that's cheapest at 50K tokens may not be cheapest at 500K tokens. Surcharges, caching discounts, and output cost ratios all shift the math depending on how you actually use the API.

The providers that charge flat rates — Anthropic, xAI, and OpenAI's GPT-4.1 — offer the most predictable billing for long-context workloads. The providers that apply surcharges — OpenAI's GPT-5.4, Google's entire Gemini line — can still be cost-effective if most of your calls stay under the threshold.

The only way to know which model is actually cheapest for your workload is to track your real usage patterns. Not averages across all calls — per-feature, per-route, per-customer breakdowns that show where tokens are going and whether you're hitting surcharge territory.

Start monitoring for free with AISpendGuard — see exactly which features are burning through long-context tokens and where you can cut costs by switching models or enabling caching.


Pricing data current as of April 2026. Sources: OpenAI API pricing, Anthropic API pricing, Google AI Developer pricing, xAI API pricing. AISpendGuard syncs model prices daily from LiteLLM's open pricing database and tracks changes over time — see our price changes page for historical data.



Want to track your AI spend automatically?

AISpendGuard detects waste patterns, breaks down costs by feature, and recommends specific changes with $/mo savings estimates.