A four-agent pipeline. Three AI providers. One Slack alert at 3 AM: "Monthly AI spend exceeded $2,400."
That's the scenario a small SaaS team faced after deploying their AI-powered customer support automation. They'd built it with CrewAI — a research agent, a drafting agent, a review agent, and a routing agent — all working together to handle inbound tickets.
The product worked great. The bill didn't.
The Setup: Four Agents, Three Providers, Zero Visibility
Here's what their agent pipeline looked like:
| Agent | Task | Model | Provider |
|---|---|---|---|
| Router | Classify ticket priority & category | GPT-4o-mini | OpenAI |
| Researcher | Search knowledge base, gather context | Claude Sonnet 4.6 | Anthropic |
| Drafter | Write customer response | GPT-4o | OpenAI |
| Reviewer | Quality-check and approve/reject draft | Claude Opus 4.6 | Anthropic |
On paper, this looks reasonable. Fast classification with a cheap model, research with a capable mid-tier model, drafting with GPT-4o, and a final quality gate with Opus.
In practice, it was a money pit.
The Problem: Provider Dashboards Don't Show Agent-Level Spend
The team checked their OpenAI dashboard. It said: $1,100 this month.
They checked Anthropic's console. It said: $1,300 this month.
Total: $2,400. But where was the money going?
Provider dashboards show you total spend by model. They don't tell you which agent, which task type, or which customer workflow is driving the cost.
This is the fundamental gap. When you have four agents making hundreds of calls per day across two providers, the provider dashboard is useless for optimization. You know the total — you don't know the cause.
What Tag-Based Attribution Revealed
The team added AISpendGuard's SDK with per-agent tags. Each LLM call got tagged with agent_name, task_type, and ticket_priority. Here's what the data showed after one week:
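The mechanics are simple enough to sketch. The snippet below is a minimal, self-contained illustration of tag-based attribution, not the actual AISpendGuard SDK surface (the real API may differ): every call gets a cost record with arbitrary tags, and spend can then be rolled up along any tag dimension.

```python
from collections import defaultdict

# In-memory ledger; a real SDK would ship these records to a backend.
ledger = []

def log_call(cost_usd, **tags):
    """Attach attribution tags (agent_name, task_type, ...) to one call's cost."""
    ledger.append({"cost_usd": cost_usd, **tags})

def spend_by(dimension):
    """Roll up spend along any tag dimension."""
    totals = defaultdict(float)
    for rec in ledger:
        totals[rec[dimension]] += rec["cost_usd"]
    return dict(totals)

log_call(0.0068, agent_name="reviewer", task_type="quality_check", ticket_priority="low")
log_call(0.0003, agent_name="router", task_type="classify", ticket_priority="low")
spend_by("agent_name")  # reviewer: 0.0068, router: 0.0003
```

Because the tags are just key-value pairs, the same ledger answers "spend per agent", "spend per priority", and "spend per task type" without storing any prompt content.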
Finding #1: The Reviewer Was Burning 52% of Total Spend
| Agent | Monthly Spend | % of Total | Avg Calls/Day |
|---|---|---|---|
| Router | $38 | 1.6% | 420 |
| Researcher | $580 | 24.2% | 390 |
| Drafter | $530 | 22.1% | 385 |
| Reviewer | $1,252 | 52.1% | 1,140 |
Wait: the Reviewer was making nearly three times as many calls as any other agent?
It turned out the Reviewer had a retry loop. When it rejected a draft (which happened ~40% of the time), the Drafter would rewrite and the Reviewer would re-evaluate. Some tickets went through 4-5 revision cycles before approval. Each cycle meant another Claude Opus 4.6 call at $5.00/$25.00 per million tokens.
The team had no idea. The retry logic was buried in CrewAI's task delegation config — it looked like a single "review" step in the workflow definition.
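The cost math behind a retry loop is worth making explicit. As a simplified model (assuming one fresh review per rejection and a constant per-cycle rejection rate), the expected number of Reviewer calls per ticket is a geometric series, and capping retries truncates it:

```python
from typing import Optional

def expected_reviews(reject_rate: float, max_retries: Optional[int] = None) -> float:
    """Expected Reviewer calls per ticket: one initial review plus one
    extra review per rejection, up to max_retries extra rounds.
    Uncapped, this is the geometric series 1 + p + p^2 + ... = 1/(1-p)."""
    if max_retries is None:
        return 1 / (1 - reject_rate)
    return sum(reject_rate ** k for k in range(max_retries + 1))

expected_reviews(0.40, None)  # ~1.67 reviews per ticket, uncapped
expected_reviews(0.40, 2)     # 1 + 0.4 + 0.16 = 1.56, capped at 2 retries
```

The observed 3:1 ratio of Reviewer calls to Drafter calls suggests the effective per-cycle rejection rate on re-reviews was higher than the headline ~40%, which is exactly the kind of compounding a cap shuts down.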
Finding #2: Opus Was Overkill for 73% of Reviews
Not all tickets need a $25/M-output-token quality gate. The tag data revealed:
| Ticket Priority | Reviewer Spend | % of Reviews | Avg Output Tokens |
|---|---|---|---|
| Low | $312 | 41% | 180 |
| Medium | $245 | 32% | 220 |
| High | $480 | 19% | 410 |
| Critical | $215 | 8% | 680 |
73% of reviews were for low and medium priority tickets — simple questions like password resets and billing inquiries. These reviews generated fewer than 250 output tokens on average. Claude Opus 4.6 was being used to quality-check a two-sentence reply to "How do I update my credit card?"
AISpendGuard's waste detection flagged this: "Switch review_agent from claude-opus-4-6 to claude-haiku-4-5 for low/medium priority tickets. Estimated savings: $410/month."
Finding #3: The Researcher Was Sending Full Ticket History Every Call
The Researcher agent was supposed to search the knowledge base for relevant articles. But the prompt included the full ticket conversation history — every previous message, including the customer's original email, support agent replies, and internal notes.
For tickets with long threads, this meant 8,000-12,000 input tokens per call. The knowledge base search itself only needed the latest customer message (~200 tokens).
The tag data showed researcher calls averaging 9,400 input tokens — roughly 47x what the search actually required.
This is the hidden cost of conversation history playing out in an agent pipeline. Every agent in the chain was receiving context it didn't need.
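The fix is a few lines of prompt construction. This is a sketch with assumed field names (`messages`, `author`, `text`) standing in for whatever ticket schema the team actually used:

```python
def researcher_input(ticket: dict) -> str:
    """Build the Researcher prompt from the latest customer message plus
    ticket metadata, instead of replaying the entire thread."""
    latest = next(
        m["text"] for m in reversed(ticket["messages"])
        if m["author"] == "customer"
    )
    meta = f"priority={ticket['priority']} category={ticket['category']}"
    return f"[{meta}]\n{latest}"

ticket = {
    "priority": "low",
    "category": "billing",
    "messages": [
        {"author": "customer", "text": "How do I update my credit card?"},
        {"author": "support", "text": "You can do that under Settings."},
        {"author": "customer", "text": "Found it, but the save button errors."},
    ],
}
researcher_input(ticket)  # metadata line + latest customer message only
```

The knowledge-base search still gets everything it needs to find relevant articles; the 9,000-plus tokens of thread history simply never leave the application.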
The Fix: Three Changes, $1,800 in Monthly Savings
Armed with per-agent, per-task-type cost data, the team made three targeted changes:
Change 1: Cap Reviewer Retries at 2 Rounds
Before: Unlimited retries. Some tickets went through 5 revision cycles. After: Max 2 retries. If the draft fails twice, escalate to a human.
Impact: Reviewer calls dropped from 1,140/day to 480/day. Monthly Reviewer spend: $1,252 → $528.
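In code, the cap is a bounded loop with an explicit escalation path. `review_fn` and `redraft_fn` here are stand-ins for the real Reviewer and Drafter agent calls:

```python
def review_with_cap(draft: str, review_fn, redraft_fn, max_retries: int = 2):
    """Run the review loop at most max_retries extra rounds; if the draft
    still fails, hand the ticket to a human instead of looping forever."""
    for attempt in range(max_retries + 1):
        if review_fn(draft):
            return ("approved", draft)
        if attempt < max_retries:
            draft = redraft_fn(draft)
    return ("escalate_to_human", draft)
```

With the default cap of 2, a hopeless draft costs exactly three review calls and two redrafts before a human takes over — versus the unbounded 4-5 cycles the team was unknowingly paying for.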
Change 2: Route Low/Medium Reviews to Haiku
Before: All reviews used Claude Opus 4.6 ($5/$25 per 1M tokens). After: Low and medium priority tickets use Claude Haiku 4.5 ($1/$5 per 1M tokens). High and critical stay on Opus.
Current pricing comparison for this use case:
| Model | Input (per 1M) | Output (per 1M) | Review Cost (avg) |
|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | ~$0.0068 |
| Claude Haiku 4.5 | $1.00 | $5.00 | ~$0.0014 |
That's roughly a 5x cost reduction per review call for routine tickets.
Impact: Low/medium review spend dropped from $557 to ~$111. Monthly savings: $446.
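Priority-based routing reduces to a lookup table plus a cost function. The names below are illustrative, and the ~260 input tokens per review is an assumption (the article's table only reports output tokens), but it reproduces the per-call figures above:

```python
REVIEW_MODEL_BY_PRIORITY = {
    "low": "claude-haiku-4-5",
    "medium": "claude-haiku-4-5",
    "high": "claude-opus-4-6",
    "critical": "claude-opus-4-6",
}

PRICING = {  # (input, output) USD per 1M tokens, from the table above
    "claude-opus-4-6": (5.00, 25.00),
    "claude-haiku-4-5": (1.00, 5.00),
}

def review_cost(priority: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one review call under priority-based model routing."""
    model = REVIEW_MODEL_BY_PRIORITY[priority]
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

review_cost("low", 260, 220)   # ~$0.0014 on Haiku
review_cost("high", 260, 410)  # Opus, with the longer high-priority output
```

The routing decision piggybacks on a classification the Router agent already makes, so it adds no extra LLM calls.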
Change 3: Trim Researcher Input to Latest Message Only
Before: Full conversation history sent to Researcher (avg 9,400 input tokens). After: Only the latest customer message + ticket metadata (avg 350 input tokens).
Impact: Researcher spend dropped from $580 to ~$22/month. Monthly savings: $558.
Total Result
| Metric | Before | After | Change |
|---|---|---|---|
| Monthly AI spend | $2,400 | $608 | -74.7% |
| Reviewer calls/day | 1,140 | 480 | -58% |
| Avg researcher input tokens | 9,400 | 350 | -96% |
| Tickets requiring human escalation | 0 | ~12/day | +12/day |
$1,792 saved per month. The 12 daily escalations were tickets that genuinely needed human review — the team actually preferred this over the AI silently approving subpar responses.
Why This Only Works With Per-Call Attribution
Here's what each monitoring approach would have told this team:
Provider dashboards (OpenAI/Anthropic): "You spent $2,400 across two providers." No agent breakdown, no task-type split, no retry visibility.
Billing aggregators: Same total, maybe with daily trends. Still no per-call attribution.
Full observability platforms: Would show traces with prompt content — but this team handles customer PII in every ticket. Storing prompts was a non-starter for their privacy policy.
Tag-based attribution (AISpendGuard): Per-agent spend, per-priority breakdowns, retry patterns, input token distributions — all without ever seeing the prompt content. The tags (agent_name=reviewer, task_type=quality_check, ticket_priority=low) told the whole cost story.
The privacy angle matters even more this week. The March 24 LiteLLM supply chain attack slipped credential-stealing malware into a package pulled 3.4 million times a day, and it could do damage precisely because LiteLLM sits in the request path. Tools that route your traffic carry supply chain risk. Passive SDK ingestion doesn't.
The Agentic AI Cost Problem Is Getting Worse
This isn't an edge case. As teams adopt multi-agent frameworks — CrewAI, LangChain, OpenAI Agents SDK, AutoGen — the cost surface area multiplies:
- More agents = more calls. A single user action can trigger 5-15 LLM calls across agents.
- Retry loops compound. Agent orchestration frameworks often have built-in retry logic that's invisible in the workflow definition.
- Context bloat spreads. Each agent in a chain tends to receive the full context from previous agents, whether it needs it or not.
- Model selection is static. Teams pick a model during development and never revisit it, even when cheaper alternatives launch (like GPT-4.1 at $2/$8 replacing GPT-4o at $2.50/$10).
And with 114 AI models changing prices this month alone, the "set it and forget it" approach to model selection is actively costing you money.
What to Do Right Now
If you're running multi-agent AI workflows, here are three things you can do today:
1. Tag every agent call. At minimum: agent_name, task_type, and one business dimension (customer tier, priority, feature). This is the foundation for any cost optimization.
2. Check your retry logic. Open your agent framework config and look for retry/revision loops. Most frameworks default to generous retry limits. Cap them and add human escalation as a fallback.
3. Match model tier to task complexity. Not every agent call needs your most expensive model. Classification, routing, and simple reviews can run on Haiku-class models at 5-10x lower cost.
Track your multi-agent AI spend automatically with AISpendGuard — per-agent attribution, waste detection, and model recommendations. No prompts stored, no gateway required.
See exactly where your agent pipeline is burning money → Start monitoring for free