TokenHub can automatically retry a failed request against one or more backup providers before returning an error to your application. This means a provider outage, timeout, or rate limit at the primary provider doesn’t have to cause a visible failure for your users. This guide explains how fallback routing works and how to configure it.Documentation Index
Fetch the complete documentation index at: https://docs.inferoute.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
Default behavior
By default, TokenHub does not apply fallback routing — if the targeted provider returns a5xx error or times out, the error is returned to your application immediately. To enable fallback, you configure a prioritized list of backup models at the request level using the X-Inferoute-Fallback header.
When a primary request fails, TokenHub:
- Detects the
5xxresponse or connection timeout from the primary provider - Selects the next model from your fallback list
- Replays the original request payload to the backup provider
- Returns the successful response to your application as if it came from the primary
Fallback does not change how you are billed. You are charged for the provider that actually served the request, at that provider’s token rate. If your primary provider fails after processing tokens, you are not charged for that failed attempt.
Configure fallback with the request header
Pass a comma-separated list of fallback models in theX-Inferoute-Fallback request header. TokenHub tries them in order until one succeeds.
Identify which provider served the request
Even when fallback is transparent, you may want to log which provider handled each request for cost attribution or debugging. Every response from TokenHub includes anX-Inferoute-Provider response header.
python
Fallback routing flow
Choosing your fallback models
When building your fallback list, consider:- Capability parity — if your request uses tools, vision, or structured outputs, all fallback models must support those features
- Context length — if your prompt is 50k tokens, every fallback model must support at least that context window
- Output consistency — models from different providers have different writing styles and behavior; test fallback models on your prompts before deploying
- Cost implications — a fallback to a more expensive model increases cost when triggered; factor this into your budget
Recommended fallback pairs
| Primary | Fallback |
|---|---|
openai/gpt-4o | anthropic/claude-3-5-sonnet,google/gemini-1.5-pro |
anthropic/claude-3-5-sonnet | openai/gpt-4o,google/gemini-1.5-pro |
openai/gpt-4o-mini | anthropic/claude-3-haiku,google/gemini-flash |
google/gemini-flash | openai/gpt-4o-mini,anthropic/claude-3-haiku |