A four-agent pipeline. Three AI providers. One Slack alert at 3 AM: "Monthly AI spend exceeded $2,400."
That's the scenario a small SaaS team faced after deploying their AI-powered customer support automation. They'd built it with CrewAI — a research agent, a drafting agent, a review agent, and a routing agent — all working together to handle inbound tickets.
The product worked great. The bill didn't.
The Setup: Four Agents, Three Providers, Zero Visibility
Here's what their agent pipeline looked like:
| Agent | Task | Model | Provider |
|---|---|---|---|
| Router | Classify ticket priority & category | GPT-4o-mini | OpenAI |
| Researcher | Search knowledge base, gather context | Claude Sonnet 4.6 | Anthropic |
| Drafter | Write customer response | GPT-4o | OpenAI |
| Reviewer | Quality-check and approve/reject draft | Claude Opus 4.6 | Anthropic |
On paper, this looks reasonable. Fast classification with a cheap model, research with a capable mid-tier model, drafting with GPT-4o, and a final quality gate with Opus.
In practice, it was a money pit.
The Problem: Provider Dashboards Don't Show Agent-Level Spend
The team checked their OpenAI dashboard. It said: $1,100 this month.
They checked Anthropic's console. It said: $1,300 this month.
Total: $2,400. But where was the money going?
Provider dashboards show you total spend by model. They don't tell you which agent, which task type, or which customer workflow is driving the cost.
This is the fundamental gap. When you have four agents making hundreds of calls per day across two providers, the provider dashboard is useless for optimization. You know the total — you don't know the cause.
What Tag-Based Attribution Revealed
The team added AISpendGuard's SDK with per-agent tags. Each LLM call got tagged with agent_name, task_type, and ticket_priority. Here's what the data showed after one week:
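The mechanics are simple enough to sketch. The snippet below is a minimal, self-contained illustration of tag-based attribution, not the actual AISpendGuard SDK surface (the real API may differ): every call gets a cost record with arbitrary tags, and spend can then be rolled up along any tag dimension.

```python
from collections import defaultdict

# In-memory ledger; a real SDK would ship these records to a backend.
ledger = []

def log_call(cost_usd, **tags):
    """Attach attribution tags (agent_name, task_type, ...) to one call's cost."""
    ledger.append({"cost_usd": cost_usd, **tags})

def spend_by(dimension):
    """Roll up spend along any tag dimension."""
    totals = defaultdict(float)
    for rec in ledger:
        totals[rec[dimension]] += rec["cost_usd"]
    return dict(totals)

log_call(0.0068, agent_name="reviewer", task_type="quality_check", ticket_priority="low")
log_call(0.0003, agent_name="router", task_type="classify", ticket_priority="low")
spend_by("agent_name")  # reviewer: 0.0068, router: 0.0003
```

Because the tags are just key-value pairs, the same ledger answers "spend per agent", "spend per priority", and "spend per task type" without storing any prompt content.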
Finding #1: The Reviewer Was Burning 52% of Total Spend
| Agent | Monthly Spend | % of Total | Avg Calls/Day |
|---|---|---|---|
| Router | $38 | 1.6% | 420 |
| Researcher | $580 | 24.2% | 390 |
| Drafter | $530 | 22.1% | 385 |
| Reviewer | $1,252 | 52.1% | 1,140 |
Wait: the Reviewer was making nearly three times as many calls as any other agent?
It turned out the Reviewer had a retry loop. When it rejected a draft (which happened ~40% of the time), the Drafter would rewrite and the Reviewer would re-evaluate. Some tickets went through 4-5 revision cycles before approval. Each cycle meant another Claude Opus 4.6 call at $5.00/$25.00 per million tokens.
The team had no idea. The retry logic was buried in CrewAI's task delegation config — it looked like a single "review" step in the workflow definition.
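The cost math behind a retry loop is worth making explicit. As a simplified model (assuming one fresh review per rejection and a constant per-cycle rejection rate), the expected number of Reviewer calls per ticket is a geometric series, and capping retries truncates it:

```python
from typing import Optional

def expected_reviews(reject_rate: float, max_retries: Optional[int] = None) -> float:
    """Expected Reviewer calls per ticket: one initial review plus one
    extra review per rejection, up to max_retries extra rounds.
    Uncapped, this is the geometric series 1 + p + p^2 + ... = 1/(1-p)."""
    if max_retries is None:
        return 1 / (1 - reject_rate)
    return sum(reject_rate ** k for k in range(max_retries + 1))

expected_reviews(0.40, None)  # ~1.67 reviews per ticket, uncapped
expected_reviews(0.40, 2)     # 1 + 0.4 + 0.16 = 1.56, capped at 2 retries
```

The observed 3:1 ratio of Reviewer calls to Drafter calls suggests the effective per-cycle rejection rate on re-reviews was higher than the headline ~40%, which is exactly the kind of compounding a cap shuts down.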
Finding #2: Opus Was Overkill for 73% of Reviews
Not all tickets need a $25/M-output-token quality gate. The tag data revealed:
| Ticket Priority | Reviewer Spend | % of Reviews | Avg Output Tokens |
|---|---|---|---|
| Low | $312 | 41% | 180 |
| Medium | $245 | 32% | 220 |
| High | $480 | 19% | 410 |
| Critical | $215 | 8% | 680 |
73% of reviews were for low and medium priority tickets — simple questions like password resets and billing inquiries. These reviews generated fewer than 250 output tokens on average. Claude Opus 4.6 was being used to quality-check a two-sentence reply to "How do I update my credit card?"
AISpendGuard's waste detection flagged this: "Switch review_agent from claude-opus-4-6 to claude-haiku-4-5 for low/medium priority tickets. Estimated savings: $410/month."
Finding #3: The Researcher Was Sending Full Ticket History Every Call
The Researcher agent was supposed to search the knowledge base for relevant articles. But the prompt included the full ticket conversation history — every previous message, including the customer's original email, support agent replies, and internal notes.
For tickets with long threads, this meant 8,000-12,000 input tokens per call. The knowledge base search itself only needed the latest customer message (~200 tokens).
The tag data showed researcher calls averaging 9,400 input tokens — roughly 47x what the search actually required.
This is the hidden cost of conversation history playing out in an agent pipeline. Every agent in the chain was receiving context it didn't need.
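The fix is a few lines of prompt construction. This is a sketch with assumed field names (`messages`, `author`, `text`) standing in for whatever ticket schema the team actually used:

```python
def researcher_input(ticket: dict) -> str:
    """Build the Researcher prompt from the latest customer message plus
    ticket metadata, instead of replaying the entire thread."""
    latest = next(
        m["text"] for m in reversed(ticket["messages"])
        if m["author"] == "customer"
    )
    meta = f"priority={ticket['priority']} category={ticket['category']}"
    return f"[{meta}]\n{latest}"

ticket = {
    "priority": "low",
    "category": "billing",
    "messages": [
        {"author": "customer", "text": "How do I update my credit card?"},
        {"author": "support", "text": "You can do that under Settings."},
        {"author": "customer", "text": "Found it, but the save button errors."},
    ],
}
researcher_input(ticket)  # metadata line + latest customer message only
```

The knowledge-base search still gets everything it needs to find relevant articles; the 9,000-plus tokens of thread history simply never leave the application.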
The Fix: Three Changes, $1,800 in Monthly Savings
Armed with per-agent, per-task-type cost data, the team made three targeted changes:
Change 1: Cap Reviewer Retries at 2 Rounds
Before: Unlimited retries. Some tickets went through 5 revision cycles. After: Max 2 retries. If the draft fails twice, escalate to a human.
Impact: Reviewer calls dropped from 1,140/day to 480/day. Monthly Reviewer spend: $1,252 → $528.
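In code, the cap is a bounded loop with an explicit escalation path. `review_fn` and `redraft_fn` here are stand-ins for the real Reviewer and Drafter agent calls:

```python
def review_with_cap(draft: str, review_fn, redraft_fn, max_retries: int = 2):
    """Run the review loop at most max_retries extra rounds; if the draft
    still fails, hand the ticket to a human instead of looping forever."""
    for attempt in range(max_retries + 1):
        if review_fn(draft):
            return ("approved", draft)
        if attempt < max_retries:
            draft = redraft_fn(draft)
    return ("escalate_to_human", draft)
```

With the default cap of 2, a hopeless draft costs exactly three review calls and two redrafts before a human takes over — versus the unbounded 4-5 cycles the team was unknowingly paying for.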
Change 2: Route Low/Medium Reviews to Haiku
Before: All reviews used Claude Opus 4.6 ($5/$25 per 1M tokens). After: Low and medium priority tickets use Claude Haiku 4.5 ($1/$5 per 1M tokens). High and critical stay on Opus.
Current pricing comparison for this use case:
| Model | Input (per 1M) | Output (per 1M) | Review Cost (avg) |
|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | ~$0.0068 |
| Claude Haiku 4.5 | $1.00 | $5.00 | ~$0.0014 |
That's roughly a 5x cost reduction per review call for routine tickets.
Impact: Low/medium review spend dropped from $557 to ~$111. Monthly savings: $446.
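Priority-based routing reduces to a lookup table plus a cost function. The names below are illustrative, and the ~260 input tokens per review is an assumption (the article's table only reports output tokens), but it reproduces the per-call figures above:

```python
REVIEW_MODEL_BY_PRIORITY = {
    "low": "claude-haiku-4-5",
    "medium": "claude-haiku-4-5",
    "high": "claude-opus-4-6",
    "critical": "claude-opus-4-6",
}

PRICING = {  # (input, output) USD per 1M tokens, from the table above
    "claude-opus-4-6": (5.00, 25.00),
    "claude-haiku-4-5": (1.00, 5.00),
}

def review_cost(priority: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one review call under priority-based model routing."""
    model = REVIEW_MODEL_BY_PRIORITY[priority]
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

review_cost("low", 260, 220)   # ~$0.0014 on Haiku
review_cost("high", 260, 410)  # Opus, with the longer high-priority output
```

The routing decision piggybacks on a classification the Router agent already makes, so it adds no extra LLM calls.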
Change 3: Trim Researcher Input to Latest Message Only
Before: Full conversation history sent to Researcher (avg 9,400 input tokens). After: Only the latest customer message + ticket metadata (avg 350 input tokens).
Impact: Researcher spend dropped from $580 to ~$22/month. Monthly savings: $558.
Total Result
| Metric | Before | After | Change |
|---|---|---|---|
| Monthly AI spend | $2,400 | $608 | -74.7% |
| Reviewer calls/day | 1,140 | 480 | -58% |
| Avg researcher input tokens | 9,400 | 350 | -96% |
| Tickets requiring human escalation | 0 | ~12/day | +12/day |
$1,792 saved per month. The 12 daily escalations were tickets that genuinely needed human review — the team actually preferred this over the AI silently approving subpar responses.
Why This Only Works With Per-Call Attribution
Here's what each monitoring approach would have told this team:
Provider dashboards (OpenAI/Anthropic): "You spent $2,400 across two providers." No agent breakdown, no task-type split, no retry visibility.
Billing aggregators: Same total, maybe with daily trends. Still no per-call attribution.
Full observability platforms: Would show traces with prompt content — but this team handles customer PII in every ticket. Storing prompts was a non-starter for their privacy policy.
Tag-based attribution (AISpendGuard): Per-agent spend, per-priority breakdowns, retry patterns, input token distributions — all without ever seeing the prompt content. The tags (agent_name=reviewer, task_type=quality_check, ticket_priority=low) told the whole cost story.
The privacy angle matters even more this week. The March 24 LiteLLM supply chain attack slipped credential-stealing malware into a package pulled 3.4 million times a day, and it could do damage precisely because LiteLLM sits in the request path. Tools that route your traffic carry supply chain risk. Passive SDK ingestion doesn't.
The Agentic AI Cost Problem Is Getting Worse
This isn't an edge case. As teams adopt multi-agent frameworks — CrewAI, LangChain, OpenAI Agents SDK, AutoGen — the cost surface area multiplies:
- More agents = more calls. A single user action can trigger 5-15 LLM calls across agents.
- Retry loops compound. Agent orchestration frameworks often have built-in retry logic that's invisible in the workflow definition.
- Context bloat spreads. Each agent in a chain tends to receive the full context from previous agents, whether it needs it or not.
- Model selection is static. Teams pick a model during development and never revisit it, even when cheaper alternatives launch (like GPT-4.1 at $2/$8 replacing GPT-4o at $2.50/$10).
And with 114 AI models changing prices this month alone, the "set it and forget it" approach to model selection is actively costing you money.
What to Do Right Now
If you're running multi-agent AI workflows, here are three things you can do today:
1. Tag every agent call. At minimum: agent_name, task_type, and one business dimension (customer tier, priority, feature). This is the foundation for any cost optimization.
2. Check your retry logic. Open your agent framework config and look for retry/revision loops. Most frameworks default to generous retry limits. Cap them and add human escalation as a fallback.
3. Match model tier to task complexity. Not every agent call needs your most expensive model. Classification, routing, and simple reviews can run on Haiku-class models at 5-10x lower cost.
Track your multi-agent AI spend automatically with AISpendGuard — per-agent attribution, waste detection, and model recommendations. No prompts stored, no gateway required.
See exactly where your agent pipeline is burning money → Start monitoring for free