The chat completions endpoint is the primary way to interact with language models on TokenHub. You send a messages array representing a conversation, and the API returns the next assistant message. The request body is fully OpenAI-compatible — any payload that works with the OpenAI Chat Completions API works here.

Endpoint

POST https://api.tokenhub.ai/v1/chat/completions

Request parameters

Body

model
string
required
The model to use for this request. You can use the short name (gpt-4o) or the provider-prefixed name (openai/gpt-4o, anthropic/claude-3-5-sonnet). Retrieve the full list of available IDs from GET /v1/models.
messages
object[]
required
The conversation history as an array of message objects. Each object must include role (system, user, or assistant) and content (string).
max_tokens
integer
Maximum number of tokens to generate in the response. Defaults to the model’s maximum output.
temperature
number
default:"1"
Sampling temperature between 0 and 2. Lower values produce more deterministic output; higher values produce more varied output.
stream
boolean
default:"false"
When true, the response is streamed as server-sent events (SSE). Each event contains a partial delta object. The stream ends with data: [DONE].
top_p
number
default:"1"
Nucleus sampling parameter. The model considers only the tokens comprising the top top_p probability mass.
n
integer
default:"1"
Number of completion choices to generate. Each choice is an independent generation.
stop
string | string[]
One or more sequences where the model stops generating. The stop sequence itself is not included in the output.
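
The body parameters above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: `build_chat_request` is a hypothetical helper, and it enforces only the constraints stated on this page (required `model` and `messages`, the three listed roles, string `content`, and a temperature between 0 and 2). Rejecting an empty `messages` array is an assumption.

```python
from typing import Optional, Sequence, Union

ALLOWED_ROLES = {"system", "user", "assistant"}  # roles listed on this page


def build_chat_request(
    model: str,
    messages: Sequence[dict],
    *,
    max_tokens: Optional[int] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    n: Optional[int] = None,
    stop: Union[str, Sequence[str], None] = None,
    stream: Optional[bool] = None,
) -> dict:
    """Hypothetical helper: validate inputs and return a JSON-ready request body."""
    if not model:
        raise ValueError("model is required")
    if not messages:
        # Assumption: an empty conversation is treated as invalid.
        raise ValueError("messages must contain at least one message")
    for i, msg in enumerate(messages):
        if msg.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"messages[{i}].role must be one of {sorted(ALLOWED_ROLES)}")
        if not isinstance(msg.get("content"), str):
            raise ValueError(f"messages[{i}].content must be a string")
    if temperature is not None and not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")

    body: dict = {"model": model, "messages": list(messages)}
    # Leave unset optional fields out so the server-side defaults apply.
    optional = {
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "n": n,
        "stop": stop,
        "stream": stream,
    }
    body.update({key: value for key, value in optional.items() if value is not None})
    return body
```

The returned dictionary can be serialized with `json.dumps` and sent as the request body.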

Headers

X-Inferoute-Strategy
string
Routing strategy for this request. Accepted values: cost, latency, availability, round-robin. Defaults to your account’s configured strategy.
X-Inferoute-Fallback
string
Comma-separated list of fallback model IDs to try if the primary model is unavailable. Example: anthropic/claude-3-5-sonnet,openai/gpt-4-turbo.
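
As an illustration of how the two routing headers fit together, here is a small sketch that builds them from Python values. `routing_headers` is a hypothetical helper, not a library function; the accepted strategy values are the four listed above, and the check that model IDs contain no commas follows from the comma-separated format.

```python
from typing import Optional, Sequence

STRATEGIES = {"cost", "latency", "availability", "round-robin"}  # values listed on this page


def routing_headers(strategy: Optional[str] = None, fallbacks: Sequence[str] = ()) -> dict:
    """Hypothetical helper: build the optional routing headers for one request."""
    headers = {}
    if strategy is not None:
        if strategy not in STRATEGIES:
            raise ValueError(f"unknown strategy {strategy!r}")
        headers["X-Inferoute-Strategy"] = strategy
    if fallbacks:
        # The header value is comma-separated, so an ID must not contain a comma.
        if any("," in model_id or not model_id.strip() for model_id in fallbacks):
            raise ValueError("fallback model IDs must be non-empty and comma-free")
        headers["X-Inferoute-Fallback"] = ",".join(model_id.strip() for model_id in fallbacks)
    return headers
```

Omitting both arguments returns an empty dictionary, in which case the account's configured strategy applies. Merge the result with the `Authorization` and `Content-Type` headers before sending.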

Response fields

id
string
Unique identifier for this completion request.
object
string
Always "chat.completion".
model
string
The model that actually served the request. May differ from your requested model when a fallback was used.
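choices
object[]
The generated completions, one per requested choice (n). In the example response below, each item carries index, message (with role and content), and finish_reason.
usage
object
Token counts for the request. In the example response below, these are prompt_tokens, completion_tokens, and total_tokens.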

Response headers

| Header | Description |
| --- | --- |
| X-Inferoute-Provider | The provider that served the request (e.g., openai, anthropic) |
| X-Inferoute-Request-Id | Unique request ID for support and debugging |
| X-Inferoute-Latency-Ms | End-to-end request latency in milliseconds |
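
The response fields and headers can be combined into a single record for logging or debugging. The sketch below is illustrative only: `summarize_completion` is a hypothetical helper, it assumes the latency header holds a plain number, and it treats any difference between the requested and returned model ID as a sign that a fallback served the request. That comparison is naive; if you request a short name such as gpt-4o and the response reports a provider-prefixed name, it would also report a change.

```python
from typing import Mapping


def summarize_completion(requested_model: str, body: Mapping, headers: Mapping[str, str]) -> dict:
    """Hypothetical helper: pull the commonly needed fields out of one response."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lower = {name.lower(): value for name, value in headers.items()}
    latency = lower.get("x-inferoute-latency-ms")
    served_model = body["model"]
    return {
        "text": body["choices"][0]["message"]["content"],
        "served_model": served_model,
        # Assumption: a differing model ID means a fallback handled the request.
        "model_changed": served_model != requested_model,
        "provider": lower.get("x-inferoute-provider"),
        "request_id": lower.get("x-inferoute-request-id"),
        # Assumption: the header value is a plain number in milliseconds.
        "latency_ms": float(latency) if latency is not None else None,
        "total_tokens": body["usage"]["total_tokens"],
    }
```

Keeping the request ID alongside the served model makes it easier to reference a specific call when contacting support.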

Examples

Basic request and response

curl https://api.tokenhub.ai/v1/chat/completions \
  --request POST \
  --header "Authorization: Bearer YOUR_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "openai/gpt-4o",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
Response:
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "openai/gpt-4o",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 28,
    "completion_tokens": 9,
    "total_tokens": 37
  }
}

Routing strategy and fallback

curl
curl https://api.tokenhub.ai/v1/chat/completions \
  --request POST \
  --header "Authorization: Bearer YOUR_API_KEY" \
  --header "Content-Type: application/json" \
  --header "X-Inferoute-Strategy: latency" \
  --header "X-Inferoute-Fallback: anthropic/claude-3-5-sonnet,openai/gpt-4-turbo" \
  --data '{
    "model": "openai/gpt-4o",
    "messages": [{"role": "user", "content": "Summarize the Eiffel Tower in one sentence."}]
  }'

Streaming

Python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.tokenhub.ai/v1",
)

stream = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Write a short poem about the sea."}],
    stream=True,
)

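for chunk in stream:
    # Defensive guard: skip any chunk that arrives without choices.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # The final chunk may carry no content, so print only non-empty text.
    if delta.content:
        print(delta.content, end="", flush=True)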
Each streamed chunk looks like:
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion.chunk",
  "model": "openai/gpt-4o",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "The"
      },
      "finish_reason": null
    }
  ]
}
The stream terminates with data: [DONE].
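
If you consume the stream without the OpenAI client, the chunk format and the [DONE] sentinel described above can be handled with a few lines of parsing. This is a minimal sketch under stated assumptions: `iter_stream_text` is a hypothetical helper, it expects an iterable of already-decoded text lines, and it assumes each event's JSON payload fits on a single `data:` line.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments from an iterable of SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Usage with hand-written sample lines standing in for a real response.
sample_lines = [
    'data: {"choices": [{"index": 0, "delta": {"content": "The"}, "finish_reason": null}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": " sea"}, "finish_reason": null}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}',
    "",
    "data: [DONE]",
]
full_text = "".join(iter_stream_text(sample_lines))
```

With n greater than 1, fragments from different choices would be interleaved in this output, so group them by `index` in that case.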