You asked GPT-4o to classify a support ticket. The correct answer is one word: billing.
Instead, you got 347 tokens explaining why it's a billing issue, what billing means in context, and three follow-up suggestions you didn't ask for.
At $10 per million output tokens, those 346 extra tokens cost roughly 350x more than the one you needed. Multiply by 10,000 calls a day, and you're burning over $1,000/month on AI opinions nobody reads.
## The Pattern Nobody Talks About
We analyzed waste patterns across AISpendGuard workspaces and found the same problem everywhere: concise tasks produce verbose outputs.
| Task Type | Expected Output | Typical Output | Waste Factor |
|---|---|---|---|
| classify | 1-50 tokens | 200-500 tokens | 4-10x |
| route | 1-50 tokens | 150-400 tokens | 3-8x |
| eval | 10-100 tokens | 300-800 tokens | 3-8x |
| extract | 20-200 tokens | 500-2000 tokens | 2.5-10x |
| embed | 0 tokens (embedding only) | 50-200 tokens | ∞ |
The last row is the worst offender. Embedding tasks should produce zero text output — the value is in the vector, not the response. Yet many implementations generate text alongside the embedding, paying for tokens that go straight to /dev/null.
## Why This Happens
LLMs are trained to be helpful. When you ask "classify this ticket," the model wants to explain its reasoning. Without explicit constraints, it will:
- State the classification
- Explain why it chose that label
- Offer confidence scores you didn't ask for
- Suggest related categories
- Add a disclaimer about edge cases
Each of those steps costs output tokens — the most expensive tokens in every provider's pricing.
## What AISpendGuard Now Detects
We shipped Rule 9: Output Verbosity — a waste detection rule that flags when concise task types produce disproportionately verbose output.
Here's what it checks:
- classify and route tasks averaging more than 50 output tokens per call
- eval tasks averaging more than 100 output tokens
- extract tasks averaging more than 200 output tokens
- embed tasks producing any text output at all
When the rule fires, you get:
- Severity — how far above the threshold your outputs are
- Estimated savings — the dollar amount you'd save per month by constraining output
- Actionable fix — specific recommendations for your model and task type
- Deep-link filters — click straight to the affected events in your dashboard
No other cost monitoring tool does this. Helicone and Langfuse show you token counts. We tell you which counts are wrong — and what to do about it.
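The thresholds above reduce to a simple aggregation over logged events. Here's an illustrative sketch of the idea (not AISpendGuard's actual implementation; the event fields `task` and `output_tokens` are assumed for the example):

```python
from collections import defaultdict

# Per-call output-token limits for "concise" task types, per Rule 9.
# embed gets a limit of 0: any text output at all is waste.
THRESHOLDS = {"classify": 50, "route": 50, "eval": 100, "extract": 200, "embed": 0}

def verbosity_flags(events):
    """Flag task types whose average output tokens exceed their threshold.

    `events` is an iterable of dicts like {"task": "classify", "output_tokens": 340}.
    Returns {task: (avg_output_tokens, severity)} for every task over its limit.
    """
    totals = defaultdict(lambda: [0, 0])  # task -> [token_sum, call_count]
    for event in events:
        t = totals[event["task"]]
        t[0] += event["output_tokens"]
        t[1] += 1

    flagged = {}
    for task, (token_sum, calls) in totals.items():
        limit = THRESHOLDS.get(task)
        if limit is None:
            continue  # not a concise task type; skip
        avg = token_sum / calls
        if limit == 0:
            if token_sum > 0:  # embed tasks should produce no text at all
                flagged[task] = (avg, float("inf"))
        elif avg > limit:
            flagged[task] = (avg, avg / limit)  # severity = how far over the limit
    return flagged
```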
## How to Fix It (3 Approaches)
### 1. Set `max_tokens` explicitly

The simplest fix. If your classify task needs one word, tell the model:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Classify this ticket: {ticket}"}],
    max_tokens=10,  # one word plus a small safety margin
)
```
### 2. Use structured outputs

Force the model to return JSON with exactly the fields you need. (Note: OpenAI's JSON mode requires the word "JSON" to appear somewhere in your prompt, so bake the schema into the instruction.)

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f'Classify this ticket. Reply as JSON: {{"category": "<label>"}}. Ticket: {ticket}',
    }],
    response_format={"type": "json_object"},
    max_tokens=20,
)
# Returns: {"category": "billing"}, not a 347-token essay
```
### 3. Add response format constraints in the prompt

Spell out the constraint directly in the instructions:

```
Respond with ONLY the category label. No explanation. No reasoning.
Valid labels: billing, technical, account, feature_request, other
```
## The Numbers
A typical classify workload doing 10,000 calls/day on GPT-4o:
| Scenario | Avg Output Tokens | Monthly Output Cost |
|---|---|---|
| Unconstrained | 350 tokens | $1,050 |
| With max_tokens=50 | 15 tokens | $45 |
| **Savings** | – | $1,005/mo (96%) |
That's not a rounding error. That's your margin.
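The table's figures are straightforward arithmetic (assuming GPT-4o's $10 per million output tokens and a 30-day month):

```python
PRICE_PER_TOKEN = 10 / 1_000_000  # GPT-4o output pricing: $10 per 1M tokens
CALLS_PER_DAY = 10_000
DAYS_PER_MONTH = 30

def monthly_output_cost(avg_output_tokens):
    """Monthly output-token spend for a workload with the given average output size."""
    return avg_output_tokens * CALLS_PER_DAY * DAYS_PER_MONTH * PRICE_PER_TOKEN

unconstrained = monthly_output_cost(350)  # $1,050
constrained = monthly_output_cost(15)     # $45
savings = unconstrained - constrained     # $1,005, about 96% of the spend
```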
## Start Detecting Output Waste — Free
AISpendGuard's free tier (50,000 events/month) includes all 9 waste detection rules, including output verbosity. Set up the SDK in under 5 minutes, send your events, and we'll tell you exactly where your outputs are too verbose — and how much you'll save by fixing them.
No prompts stored. No model outputs recorded. Tags only.
AISpendGuard is the simplest way for dev teams to find and fix wasted AI API spend. Privacy-first, EUR pricing, EU-hosted. Free tier, no credit card required.