"Ship AI features fast." That was the mandate.
Over three months, a four-person SaaS team bolted AI onto five different product features: smart search, document summarization, ticket classification, a customer-facing chatbot, and automated report generation. By month four, their combined AI bill hit $2,800/month — more than their entire cloud infrastructure bill.
The problem wasn't the total number. The problem was they had no idea which features were responsible for which costs.
This is the story of how they tagged every AI call by feature, discovered that three of their five AI features were burning money for negligible value, and cut their bill to $940/month — without removing a single feature users actually cared about.
## The Setup: Five Features, Three Providers, Zero Visibility
Here's what the team built:
| Feature | Provider | Model | Purpose |
|---|---|---|---|
| Smart Search | OpenAI | GPT-4o | Semantic search across knowledge base |
| Doc Summarizer | Anthropic | Claude Sonnet 4.5 | One-click document summaries |
| Ticket Classifier | OpenAI | GPT-4o | Auto-categorize support tickets |
| Customer Chatbot | Anthropic | Claude Sonnet 4.5 | Answer product questions |
| Report Generator | OpenAI | GPT-4.1 | Weekly analytics summaries |
Each feature was built by a different developer. Each used a different SDK pattern. Each was deployed independently. The result: costs were scattered across two provider dashboards with no way to connect a dollar to a feature.
The founders knew the total spend. They had no idea where it went.
## Months 1-3: "AI Is Cheap, Ship Everything"
The early days felt fine. The team estimated each feature would cost roughly the same — maybe $200-400/month each. They budgeted $1,500/month total and moved on to building.
But AI costs don't scale linearly. They scale with usage patterns — and every feature has a different pattern:
- Smart Search fires on every keystroke after 3 characters (debounced to 300ms)
- Doc Summarizer runs once per document, but documents range from 2 pages to 200 pages
- Ticket Classifier runs on every inbound email, including spam
- Customer Chatbot conversations average 8 turns, each turn sending full history
- Report Generator runs weekly but processes every user's data in one batch
Without per-feature cost tracking, these differences were invisible. The monthly bill was just a number.
## The Wake-Up Call: $2,800 in Month 4
When the bill crossed $2,800 — nearly double the budget — the team finally paused to investigate. The OpenAI dashboard showed aggregate token usage. The Anthropic console showed aggregate token usage. Neither showed why.
They tried manual estimation:
- Count API calls per feature from application logs
- Multiply by estimated tokens per call
- Apply pricing rates
This took a full engineering day and produced numbers that were off from the actual bill by more than 40%. Token estimates were wrong. They forgot about retries. They didn't account for conversation history accumulation. Cached vs. uncached pricing wasn't factored in.
Manual cost attribution doesn't work. The gap between "estimated" and "actual" is where money disappears.
## Adding Feature-Level Attribution
The fix took 30 minutes. The team added a feature tag to every AI call using a lightweight SDK integration:
```javascript
// Before: naked API call
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: query }]
});
```

```javascript
// After: tagged with feature context
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: query }],
  // Cost tracking via AISpendGuard SDK
  metadata: { feature: "smart-search", plan: "pro" }
});
```
For LangChain and LiteLLM integrations, it was even simpler — callback handlers that automatically capture the tags without changing application code.
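Under the hood, such a wrapper doesn't need to be complicated. Here's a minimal sketch of the idea in plain JavaScript — `taggedCompletion` and `usageByFeature` are illustrative names, not AISpendGuard's actual API:

```javascript
// Hypothetical sketch of a feature-tagging wrapper (not the real SDK).
// Every call site declares its feature once; the wrapper attaches the tag
// and accumulates token usage so cost can be attributed per feature later.
const usageByFeature = {};

async function taggedCompletion(client, feature, params) {
  const response = await client.chat.completions.create({
    ...params,
    metadata: { ...(params.metadata || {}), feature },
  });
  // Tally input/output tokens under the feature name
  const u = usageByFeature[feature] || { input: 0, output: 0 };
  u.input += response.usage.prompt_tokens;
  u.output += response.usage.completion_tokens;
  usageByFeature[feature] = u;
  return response;
}
```

Multiply the accumulated token counts by each model's published rates and you get a live per-feature cost breakdown.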
Within 24 hours, they had their first per-feature cost breakdown. The numbers told a very different story than anyone expected.
## The Breakdown: Where $2,800 Actually Went
| Feature | Monthly Cost | % of Total | Calls/Month | Avg Cost/Call |
|---|---|---|---|---|
| Customer Chatbot | $1,240 | 44% | 3,100 | $0.40 |
| Smart Search | $680 | 24% | 48,000 | $0.014 |
| Report Generator | $520 | 19% | 52 | $10.00 |
| Doc Summarizer | $280 | 10% | 1,400 | $0.20 |
| Ticket Classifier | $80 | 3% | 6,200 | $0.013 |
Three things jumped out immediately.
### Finding 1: The Chatbot Was 44% of the Bill
The customer chatbot was the single biggest cost driver — and the team had estimated it at $300/month.
Why the 4x overshoot? Conversation history accumulation. Each chatbot turn sent the entire conversation history as input tokens. An 8-turn conversation didn't cost 8x a single turn — it cost 36x (1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 = 36 messages processed total). And Claude Sonnet 4.5 at $3.00/$15.00 per million tokens isn't cheap for high-volume conversational use.
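The multiplier is just the sum of the first n integers. A quick sketch:

```javascript
// Total messages processed across an n-turn conversation when each turn
// resends the full history: 1 + 2 + ... + n = n * (n + 1) / 2
function totalMessagesProcessed(turns) {
  return (turns * (turns + 1)) / 2;
}

totalMessagesProcessed(8); // 36 -- a 36x multiplier vs. a single turn
```

The growth is quadratic, which is why conversational features blow past linear cost estimates.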
Worse: analytics showed that 72% of chatbot conversations were asking questions already answered in the help docs. The chatbot wasn't adding unique value — it was an expensive alternative to a search bar.
### Finding 2: The Report Generator Cost $10 Per Run
Weekly report generation seemed efficient — only 52 runs per month. But each run processed every active user's data in one massive batch, generating detailed analytics summaries. At $2.00/$8.00 per million tokens for GPT-4.1, each run consumed roughly 1 million input tokens and 1 million output tokens — about $10 per run.
The kicker: only 12% of users ever opened the generated report. The team was spending $520/month generating reports for users who never read them.
### Finding 3: Ticket Classification Was the Bargain
At $80/month for 6,200 classifications, the ticket classifier was by far the most cost-efficient feature. It saved roughly 40 hours of manual triage per month — worth well over $2,000 in engineer time. The ROI was 25:1.
## The Fix: Three Interventions, 66% Cost Reduction
Armed with per-feature data, the team made three changes in a single sprint:
### Intervention 1: Swap the Chatbot Model ($1,240 → $310)
The chatbot didn't need Claude Sonnet 4.5's reasoning power for FAQ-style questions. They switched to Claude Haiku 4.5 ($1.00/$5.00 per million tokens) — a 3x reduction in input cost and 3x in output cost. They also implemented a sliding context window that kept only the last 4 turns instead of the full history.
Result: 75% cost reduction with no measurable change in user satisfaction scores.
| Metric | Before | After |
|---|---|---|
| Model | Claude Sonnet 4.5 | Claude Haiku 4.5 |
| Context window | Full history | Last 4 turns |
| Monthly cost | $1,240 | $310 |
| User satisfaction | 4.2/5 | 4.1/5 |
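The sliding context window is only a few lines of code. A minimal sketch, assuming a standard messages array and that any system prompt should always be kept:

```javascript
// Keep the system prompt (if any) plus only the last `maxTurns` exchanges,
// instead of resending the full conversation history on every call.
function slidingWindow(messages, maxTurns = 4) {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  // Each turn is one user message plus one assistant reply
  return [...system, ...rest.slice(-maxTurns * 2)];
}
```

With this in place, input tokens per turn are bounded instead of growing with conversation length.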
### Intervention 2: Generate Reports On-Demand ($520 → $60)
Instead of generating reports for every user weekly, they switched to on-demand generation — reports are created only when a user clicks "Generate Report." They also switched from GPT-4.1 to GPT-4.1-mini ($0.40/$1.60 per million tokens) for the generation step, since report formatting doesn't require frontier reasoning.
Result: 88% cost reduction. Only the 12% of users who actually read reports trigger generation, and each generation costs 80% less.
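The on-demand pattern is straightforward to sketch — `getReport` and the cache shape here are illustrative, not the team's actual implementation:

```javascript
// On-demand report generation: nothing runs until a user asks, and repeat
// requests for the same user and week reuse the cached result.
const reportCache = new Map();

async function getReport(userId, week, generateFn) {
  const key = `${userId}:${week}`;
  if (!reportCache.has(key)) {
    // Only users who click "Generate Report" ever trigger a model call
    reportCache.set(key, await generateFn(userId, week));
  }
  return reportCache.get(key);
}
```

The cache means a user who re-opens the same report pays for generation once, not on every view.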
### Intervention 3: Debounce Smart Search ($680 → $570)
Smart search was firing too aggressively. Increasing the debounce from 300ms to 800ms and adding a minimum query length of 5 characters reduced API calls by 16% with no user-visible impact. They kept GPT-4o because search quality directly affected user retention.
A smaller win, but free to implement.
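For reference, the gate-plus-debounce logic might look like this — a sketch assuming a browser-style `setTimeout`, with the thresholds from above:

```javascript
// Gate + debounce for search-as-you-type: skip queries under 5 characters,
// and wait 800ms of typing silence before hitting the API.
const MIN_QUERY_LENGTH = 5;
const DEBOUNCE_MS = 800;

function createDebouncedSearch(searchFn) {
  let timer = null;
  return function (query) {
    if (query.length < MIN_QUERY_LENGTH) return; // too short -- no API call
    clearTimeout(timer);
    timer = setTimeout(() => searchFn(query), DEBOUNCE_MS);
  };
}
```

Every keystroke that arrives within the window cancels the pending call, so only the final query in a typing burst reaches the model.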
## After: $2,800 → $940
| Feature | Before | After | Change |
|---|---|---|---|
| Customer Chatbot | $1,240 | $310 | -75% |
| Smart Search | $680 | $570 | -16% |
| Report Generator | $520 | $60 | -88% |
| Doc Summarizer | $280 | $280 | — |
| Ticket Classifier | $80 | $80 | — |
| Total | $2,800 | $1,300 | -54% |
After a second pass — adding prompt caching to the summarizer and batching the classifier — the total dropped further to $940/month. That's a 66% reduction from the original $2,800.
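Batching the classifier means folding several tickets into one prompt instead of making one call each. A sketch — the prompt format and category names here are illustrative:

```javascript
// Batch several tickets into a single classification prompt. One API call
// carries the instruction overhead once, instead of once per ticket.
function buildBatchPrompt(tickets) {
  const list = tickets
    .map((t, i) => `${i + 1}. ${t.subject}`)
    .join("\n");
  return (
    "Classify each support ticket as billing, bug, feature-request, or spam.\n" +
    "Respond with one category per line, in order.\n\n" +
    list
  );
}
```

Since the shared instructions dominate short ticket subjects, amortizing them across a batch cuts input tokens meaningfully.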
The team didn't cut features. They didn't downgrade quality. They just stopped spending money where it wasn't creating value.
## The Pattern: Most Teams Have a 60/40 Problem
This story isn't unusual. Across the teams we've talked to, a consistent pattern emerges:
- ~60% of AI spend goes to 1-2 features that are either over-provisioned (too powerful a model) or over-triggered (too many API calls)
- ~40% of AI spend is distributed across features that are appropriately sized
The problem is that without per-feature attribution, you can't tell which is which. Your AI bill is a single number. You optimize blind — or you don't optimize at all.
## What Made the Difference
Three things turned a $2,800 mystery into a $940 understood cost:
1. Tagging by feature — Every AI call tagged with feature, plan, and route. This is the foundation. Without it, you're guessing.
2. Cost per call visibility — Knowing the average cost per API call per feature reveals the outliers immediately. A $10/call report generator stands out when everything else is under $0.50.
3. Usage vs. value correlation — Matching AI spend against actual feature usage (report open rates, chatbot resolution rates, search click-through) shows where money creates value and where it doesn't.
## Try It With Your Own AI Features
If you're running AI in production across multiple features, you probably have a similar 60/40 split — you just can't see it yet.
Here's how to find out:
- Tag your AI calls by feature name — takes 10 minutes per integration with AISpendGuard's SDK
- Let it run for a week — you need real usage data, not estimates
- Check the attribution dashboard — sort by feature, find the outlier
- Apply the cheapest fix first — model swaps and debouncing are free; architecture changes aren't
Most teams find their first optimization in the first 48 hours of tagging. The median savings we see: 30-50% cost reduction without removing any user-facing capability.
Start monitoring for free → Sign up for AISpendGuard
Have a similar story? We'd love to feature real-world AI cost optimization wins. Reach out at hello@aispendguard.com.