TokenHub enforces rate limits on every API key to ensure fair usage across all customers and to protect system stability. Limits are measured in requests per minute (RPM) and tokens per minute (TPM). When you exceed a limit, the API returns a 429 Too Many Requests response until the window resets.

Default limit tiers

| Plan | Requests per minute (RPM) | Tokens per minute (TPM) |
| --- | --- | --- |
| Free | 60 | 100,000 |
| Pro | 600 | 1,000,000 |
| Enterprise | Custom | Custom |

Contact support@tokenhub.ai to request a limit increase for your plan.
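
To stay under the limits proactively instead of reacting to 429 responses, a client can track its own usage. The following is a minimal sketch, not part of TokenHub: `PLAN_LIMITS`, `MinuteWindow`, and `try_acquire` are hypothetical names, and the rolling 60-second window is an assumption, since this page does not specify how the server measures its window. A rolling window is the conservative choice, because it never allows more traffic in any 60-second span than the plan permits.

```python
import time
from collections import deque

# Limits copied from the tier table; adjust them if your plan differs.
PLAN_LIMITS = {
    "free": {"rpm": 60, "tpm": 100_000},
    "pro": {"rpm": 600, "tpm": 1_000_000},
}


class MinuteWindow:
    """Tracks requests and tokens sent in the last 60 seconds (hypothetical helper)."""

    def __init__(self, rpm, tpm, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.clock = clock      # injectable so the class can be tested without sleeping
        self.events = deque()   # (timestamp, tokens) for each recorded request

    def _prune(self, now):
        # Drop events that are 60 seconds old or older.
        while self.events and now - self.events[0][0] >= 60:
            self.events.popleft()

    def try_acquire(self, tokens):
        """Record a request and return True if it fits both budgets, else return False."""
        now = self.clock()
        self._prune(now)
        used_tokens = sum(t for _, t in self.events)
        if len(self.events) + 1 > self.rpm or used_tokens + tokens > self.tpm:
            return False
        self.events.append((now, tokens))
        return True
```

A caller would create one window per API key, for example `MinuteWindow(**PLAN_LIMITS["free"])`, and call `try_acquire(estimated_tokens)` before each request, waiting briefly and trying again when it returns `False`.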

Rate limit headers

Every API response includes headers that tell you your current limit status:

| Header | Description |
| --- | --- |
| X-RateLimit-Limit-Requests | Maximum number of requests allowed in the current window. |
| X-RateLimit-Remaining-Requests | Number of requests remaining before you hit the limit. |
| X-RateLimit-Reset-Requests | Unix timestamp (seconds) when the request window resets. |
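
As an illustration of how these headers might be consumed, the sketch below turns them into a small summary, including the number of seconds until the window resets. The function name `request_budget` and the shape of its return value are hypothetical. It assumes all three headers are present and hold integer values, as the descriptions above suggest.

```python
import time


def request_budget(headers, now=None):
    """Summarize the request-limit headers from one response.

    `headers` is any mapping of header name to string value.
    `now` is the current Unix time; it defaults to time.time().
    """
    now = time.time() if now is None else now
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    limit = int(lowered["x-ratelimit-limit-requests"])
    remaining = int(lowered["x-ratelimit-remaining-requests"])
    reset_at = int(lowered["x-ratelimit-reset-requests"])
    return {
        "limit": limit,
        "remaining": remaining,
        # Clamp at zero in case the reset time has already passed.
        "seconds_until_reset": max(0.0, float(reset_at) - now),
    }
```

With the `openai` Python SDK, response headers should be reachable through the raw-response wrapper, for example `client.chat.completions.with_raw_response.create(...)`, whose result exposes `.headers` and `.parse()`. A client could pause for `seconds_until_reset` whenever `remaining` reaches zero.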

Handling 429 errors

When you exceed your rate limit, the API responds with:
HTTP/1.1 429 Too Many Requests
Retry-After: 15
The Retry-After header gives the number of seconds to wait before retrying. A robust retry strategy is exponential backoff with jitter: wait progressively longer between retries, and add a small random component so that many clients do not retry at the same moment.
import time
import random
import openai

client = openai.OpenAI(
    base_url="https://api.tokenhub.ai/v1",
    api_key="YOUR_TOKENHUB_API_KEY",
)

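def chat_with_backoff(**kwargs):
    """Call the chat completions endpoint, retrying on 429 responses."""
    max_retries = 5
    base_delay = 1  # seconds

    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except openai.RateLimitError:
            # Out of attempts: re-raise so the caller sees the 429.
            if attempt == max_retries - 1:
                raise
            # Exponential backoff (1s, 2s, 4s, 8s) plus up to 1s of jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limit hit. Retrying in {delay:.2f}s...")
            time.sleep(delay)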

response = chat_with_backoff(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
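
The retry loop above computes its own delay and does not read the Retry-After value. One possible refinement, sketched here under assumptions, is to wait at least as long as the server asks. The function `retry_delay` and its parameters are hypothetical, and the 60-second cap on the computed backoff is an arbitrary choice rather than a TokenHub recommendation.

```python
import random


def retry_delay(attempt, retry_after=None, base_delay=1.0, max_delay=60.0):
    """Return the seconds to sleep before retry number `attempt` (0-based).

    `retry_after` is the raw Retry-After header value, or None if absent.
    """
    # Exponential backoff, capped so that late attempts do not wait unboundedly.
    backoff = min(max_delay, base_delay * (2 ** attempt))
    if retry_after is not None:
        try:
            # Never wait less than the server asked for.
            backoff = max(backoff, float(retry_after))
        except ValueError:
            pass  # Non-numeric value (for example an HTTP date): keep the backoff.
    # Up to one second of jitter to spread out simultaneous retries.
    return backoff + random.uniform(0, 1)
```

To use it in the loop, bind the exception with `except openai.RateLimitError as e` and pass `e.response.headers.get("retry-after")` as `retry_after`. The `response` attribute on the SDK's status errors is an `httpx.Response`, so header lookup is case-insensitive.
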
TokenHub rate limits are separate from any limits enforced by the underlying providers. Even when you are within your TokenHub limits, a provider may throttle the request on its end. TokenHub surfaces those errors with the same 429 status code and attempts automatic retries according to your routing configuration.