TokenHub sits between your application and LLM providers, which gives it visibility into every request and the ability to route intelligently based on cost. This guide walks through four strategies to reduce your LLM spend, from using built-in routing headers to monitoring waste in the dashboard.Documentation Index
Fetch the complete documentation index at: https://docs.inferoute.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
Cost reference
Before optimizing, it helps to understand relative model costs. The figures below are approximate input token prices as of mid-2025. Output tokens are typically 3–4x more expensive.| Model | Input cost (per 1M tokens) | Best for |
|---|---|---|
openai/gpt-4o | ~$5.00 | Complex reasoning, high-stakes outputs |
anthropic/claude-3-5-sonnet | ~$3.00 | Long documents, nuanced writing |
openai/gpt-4o-mini | ~$0.15 | Most production tasks, high volume |
anthropic/claude-3-haiku | ~$0.25 | Fast responses, classification |
google/gemini-flash | ~$0.07 | Highest throughput, simple tasks |
openai/gpt-4o to openai/gpt-4o-mini is a 33x cost reduction with equivalent accuracy for simple tasks.
Cost optimization workflow
Audit your current model usage
Open the Usage dashboard and filter by model to see where your spend is concentrated. Sort by cost descending to find the highest-impact models.Look for requests going to premium models that could be handled by cheaper alternatives — classification, extraction, short summarization, and Q&A over structured data are common candidates.
Enable cost-optimized routing for eligible requests
Add the
X-Inferoute-Strategy: cost header to requests where output quality is not critically sensitive. TokenHub will select the cheapest model that can satisfy the request at that moment.python
Right-size models by task complexity
Not all tasks need the same model. Segment your request types and assign the cheapest model that meets your quality bar for each.
Set usage caps per API key
In the API Keys section of the dashboard, set a monthly spend limit per key. When the cap is reached, the key returns a
429 response instead of continuing to accrue charges.This is useful for:- Per-customer keys in multi-tenant applications (cap individual customer spend)
- Development and staging keys (prevent accidental large runs)
- Feature-specific keys (limit spend on a specific product feature)
Monitor and iterate
Return to the Usage dashboard weekly. Filter by:
- Model — confirm that cost-optimized routing is being applied to the right request types
- Provider — check whether any provider is disproportionately expensive
- Time of day — spot unexpected spikes that may indicate runaway processes or prompt injection
Use cost-optimized routing for the request
TheX-Inferoute-Strategy: cost header and the economy alias are the fastest way to reduce spend without changing your prompts or application logic.
Task complexity guide
Use this as a starting point for assigning models in your application. Adjust based on your own quality evaluations.| Task type | Recommended model | Rationale |
|---|---|---|
| Sentiment classification | google/gemini-flash or openai/gpt-4o-mini | Binary or n-class output, no nuance required |
| Entity extraction | openai/gpt-4o-mini | Structured output, well-defined task |
| Short summarization | anthropic/claude-3-haiku | Concise output, lower reasoning load |
| Customer support drafting | openai/gpt-4o-mini | Tone matters but complexity is low |
| Long document analysis | anthropic/claude-3-5-sonnet | Needs large context window and reasoning |
| Code review / generation | openai/gpt-4o | High accuracy requirement, complex logic |
| Legal / compliance drafting | openai/gpt-4o or anthropic/claude-3-5-sonnet | High stakes, accuracy critical |