Last quarter, our LLM costs hit $12,000. In a single month. We had no idea where the money was going. No tracking, no budgets, no alerts. That’s when I realized: cost optimization isn’t optional for AI workloads—it’s survival. Here’s how we cut costs by 65% without sacrificing quality.

The $12,000 Wake-Up Call
It started innocently. We launched a new AI feature, usage grew, and suddenly our monthly bill was $12,000. The problem? We had no visibility into:
- Which models were costing the most
- Which features were driving costs
- Which users were consuming resources
- What times of day were most expensive
- Whether we were using the right models for each task
We were flying blind. That’s when I built a comprehensive cost optimization system.
Understanding LLM Cost Structure
LLM costs have two main components:
- Input tokens: What you send to the model
- Output tokens: What the model generates
Costs vary dramatically by model:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Use Case |
|---|---|---|---|
| GPT-4 | $30 | $60 | Complex reasoning |
| GPT-3.5 Turbo | $1.50 | $2.00 | General tasks |
| Claude 3 Opus | $15 | $75 | Premium quality |
| Claude 3 Sonnet | $3 | $15 | Best value |
The gap between GPT-4 and GPT-3.5 Turbo is roughly 20x on input tokens and 30x on output tokens. Using the wrong model can cost you thousands.
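To make the table concrete, take an illustrative request with 2,000 input tokens and 500 output tokens. On GPT-4 that costs (2,000/1M × $30) + (500/1M × $60) = $0.09; on GPT-3.5 Turbo it costs (2,000/1M × $1.50) + (500/1M × $2.00) = $0.004. At 100,000 such requests a month, that's $9,000 versus roughly $400 for the same traffic.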
Cost Tracking System
You can’t optimize what you don’t measure. Here’s our cost tracking system:
from datetime import datetime

class CostTracker:
    def __init__(self):
        # Rates are dollars per 1K tokens ($0.03/1K input == $30/1M in the table above)
        self.model_costs = {
            'gpt-4': {'input': 0.03, 'output': 0.06},
            'gpt-3.5-turbo': {'input': 0.0015, 'output': 0.002},
            'claude-3-opus': {'input': 0.015, 'output': 0.075},
            'claude-3-sonnet': {'input': 0.003, 'output': 0.015}
        }
        self.daily_costs = {}
        self.user_costs = {}
        self.feature_costs = {}

    def calculate_cost(self, model, tokens_in, tokens_out):
        if model not in self.model_costs:
            return 0
        costs = self.model_costs[model]
        input_cost = (tokens_in / 1000) * costs['input']
        output_cost = (tokens_out / 1000) * costs['output']
        return input_cost + output_cost

    def track_request(self, model, tokens_in, tokens_out, user_id=None, feature=None):
        cost = self.calculate_cost(model, tokens_in, tokens_out)
        today = datetime.now().date().isoformat()

        # Daily tracking
        self.daily_costs[today] = self.daily_costs.get(today, 0) + cost

        # User tracking
        if user_id:
            self.user_costs[user_id] = self.user_costs.get(user_id, 0) + cost

        # Feature tracking
        if feature:
            self.feature_costs[feature] = self.feature_costs.get(feature, 0) + cost

        # Alert if daily cost exceeds threshold (send_alert is not shown here;
        # wire it to whatever alerting you use)
        if self.daily_costs[today] > 100:  # $100/day threshold
            self.send_alert(f"Daily cost exceeded: ${self.daily_costs[today]:.2f}")
        return cost
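Here's a minimal sketch of how the tracker plugs into a request path. The token counts come from whatever usage data your provider returns; the user ID and feature name below are illustrative.

tracker = CostTracker()

# After each completion, log the usage the provider reports back
cost = tracker.track_request(
    model='gpt-3.5-turbo',
    tokens_in=1200,           # prompt tokens from the response's usage data
    tokens_out=350,           # completion tokens from the response's usage data
    user_id='user_42',        # illustrative user ID
    feature='summarization'   # illustrative feature tag
)
print(f"This request cost ${cost:.4f}")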

Optimization Strategies
1. Model Selection
Use the right model for the task:
- Simple tasks: Use GPT-3.5 Turbo or Claude 3 Sonnet (20x cheaper)
- Complex reasoning: Use GPT-4 or Claude 3 Opus (only when needed)
- Embeddings: Use text-embedding-3-small (much cheaper than ada-002)
def select_model(task_complexity, quality_requirement):
    if task_complexity == 'simple' and quality_requirement == 'standard':
        return 'gpt-3.5-turbo'  # 20x cheaper
    elif task_complexity == 'complex' or quality_requirement == 'high':
        return 'gpt-4'  # More expensive but necessary
    else:
        return 'claude-3-sonnet'  # Good balance
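One simple way to use this is to tag each call site with a complexity and quality label instead of guessing at runtime; the call sites below are illustrative.

model_for_autocomplete = select_model('simple', 'standard')         # -> 'gpt-3.5-turbo'
model_for_legal_analysis = select_model('complex', 'high')          # -> 'gpt-4'
model_for_routine_summaries = select_model('moderate', 'standard')  # -> 'claude-3-sonnet'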
2. Prompt Optimization
Shorter prompts = lower costs. Optimize your prompts:
- Remove unnecessary context
- Use concise instructions
- Limit output length with max_tokens
- Cache repeated prompts
def optimize_prompt(original_prompt, context):
    # The helpers below are placeholders for whatever preprocessing you use
    # Remove redundant information
    optimized = remove_redundancy(original_prompt)
    # Use shorter, clearer instructions
    optimized = simplify_instructions(optimized)
    # Limit context to what's necessary
    optimized = trim_context(optimized, context, max_tokens=2000)
    return optimized

def limit_output(model_response, max_words=500):
    # Post-hoc truncation is only a display safeguard: these tokens were already
    # generated and billed. The real saving comes from setting max_tokens on the
    # request itself, as in the sketch below.
    words = model_response.split()
    if len(words) > max_words:
        return ' '.join(words[:max_words])
    return model_response
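For the output cap, here's a minimal sketch of setting max_tokens at request time with the OpenAI Python SDK (v1-style client); the prompt and the 150-token cap are illustrative.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model='gpt-3.5-turbo',
    messages=[{'role': 'user', 'content': 'Summarize this ticket in three bullets: ...'}],
    max_tokens=150,   # hard cap on billed output tokens
    temperature=0.2
)
print(response.choices[0].message.content)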
3. Caching
Cache responses for repeated queries:
import hashlib
from datetime import datetime, timedelta

class ResponseCache:
    def __init__(self, ttl=3600):  # 1 hour TTL
        self.cache = {}
        self.ttl = ttl

    def get_cache_key(self, prompt, model):
        # Create a stable hash of prompt + model
        key_string = f"{prompt}:{model}"
        return hashlib.md5(key_string.encode()).hexdigest()

    def get(self, prompt, model):
        key = self.get_cache_key(prompt, model)
        if key in self.cache:
            entry = self.cache[key]
            if datetime.now() - entry['timestamp'] < timedelta(seconds=self.ttl):
                return entry['response']
        return None

    def set(self, prompt, model, response):
        key = self.get_cache_key(prompt, model)
        self.cache[key] = {
            'response': response,
            'timestamp': datetime.now()
        }
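A sketch of wrapping calls with the cache; call_llm stands in for whatever client function you actually use.

cache = ResponseCache(ttl=3600)

def cached_completion(prompt, model='gpt-3.5-turbo'):
    # Serve repeated prompts from memory instead of paying for them again
    cached = cache.get(prompt, model)
    if cached is not None:
        return cached
    response = call_llm(prompt, model)  # hypothetical wrapper around your LLM client
    cache.set(prompt, model, response)
    return response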

4. Batch Processing
Process multiple requests together when possible:
def batch_requests(requests, batch_size=10):
    # Split the incoming requests into fixed-size batches
    batches = [requests[i:i+batch_size] for i in range(0, len(requests), batch_size)]
    for batch in batches:
        # Process batch together (process_batch wraps the actual API calls)
        responses = process_batch(batch)
        yield responses
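Batching pays off mainly when many small items can share one set of instructions, so the fixed prompt overhead is paid once per batch instead of once per item. A sketch of driving the generator, with process_batch and handle as stand-ins for your own functions:

pending = [f"Classify support ticket #{i}" for i in range(57)]  # illustrative workload

for responses in batch_requests(pending, batch_size=10):
    for response in responses:
        handle(response)  # hypothetical downstream handler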
5. Token Budgets
Set budgets per user, feature, or time period:
class TokenBudget:
    def __init__(self):
        self.budgets = {
            'daily': 1000000,      # 1M tokens per day
            'per_user': 100000,    # 100K tokens per user per day
            'per_feature': 500000  # 500K tokens per feature per day
        }

    def check_budget(self, user_id, feature, tokens_requested):
        # The get_*_usage helpers are not shown; they read from whatever store
        # backs your cost tracking.
        # Check daily budget
        daily_used = self.get_daily_usage()
        if daily_used + tokens_requested > self.budgets['daily']:
            return False, "Daily budget exceeded"

        # Check user budget
        user_used = self.get_user_usage(user_id)
        if user_used + tokens_requested > self.budgets['per_user']:
            return False, "User budget exceeded"

        # Check feature budget
        feature_used = self.get_feature_usage(feature)
        if feature_used + tokens_requested > self.budgets['per_feature']:
            return False, "Feature budget exceeded"

        return True, "Budget OK"
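A minimal sketch of gating a request with the budget check before the call is made; the IDs and token estimate are illustrative.

budget = TokenBudget()

allowed, reason = budget.check_budget(user_id='user_42', feature='chat',
                                      tokens_requested=4000)
if not allowed:
    raise RuntimeError(f"Request blocked: {reason}")
# Otherwise proceed with the LLM call and record the actual usage afterwards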
Real-World Results
After implementing these strategies, we achieved:
| Metric | Before | After | Savings |
|---|---|---|---|
| Monthly Cost | $12,000 | $4,200 | -65% |
| Cost per Request | $0.024 | $0.008 | -67% |
| GPT-4 Usage | 85% | 15% | -82% |
| Cache Hit Rate | 0% | 42% | +42% |
Best Practices
- Track everything: Model, tokens, costs, users, features
- Set budgets: Daily, per-user, per-feature limits
- Use the right model: Don't use GPT-4 for simple tasks
- Optimize prompts: Shorter = cheaper
- Cache aggressively: Cache repeated queries
- Monitor continuously: Set up alerts for cost spikes
- Review regularly: Weekly cost reviews to identify trends
🎯 Key Takeaway
Cost optimization for AI workloads isn't optional—it's essential. Track everything, use the right models, optimize prompts, cache responses, and set budgets. We cut costs by 65% without sacrificing quality. You can too.
Bottom Line
LLM costs can spiral out of control without proper tracking and optimization. Implement cost tracking from day one, use the right models for each task, optimize prompts, and cache responses. The savings will add up quickly.