POST /v1/chat/completions

The chat completions endpoint is the primary way to interact with language models on Inferoute. You send a messages array representing a conversation, and the API returns the next assistant message. The request body is fully OpenAI-compatible: any payload that works with the OpenAI Chat Completions API works here.

Endpoint

POST https://api.inferoute.ai/v1/chat/completions

Request parameters

Body

`model` (string, required)

The model to use for this request. You can use the short name (gpt-4o) or the provider-prefixed name (openai/gpt-4o, anthropic/claude-3-5-sonnet). Retrieve the full list of available IDs from GET /v1/models.

`messages` (object[], required)

The conversation history as an array of message objects. Each object must include role (system, user, or assistant) and content (string).
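As a rough illustration of the shape described above, the sketch below checks a messages array before it is sent and then extends it for a follow-up turn. The helper name `validate_messages` is hypothetical and not part of any Inferoute SDK; it only enforces the two constraints stated here (an allowed `role` and a string `content`).

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_messages(messages):
    """Raise ValueError if any message lacks an allowed role or string content."""
    for i, msg in enumerate(messages):
        if msg.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"messages[{i}]: unsupported role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            raise ValueError(f"messages[{i}]: content must be a string")
    return messages


history = validate_messages([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
])

# To continue the conversation, append the assistant reply and the next user turn,
# then send the whole history again.
history.append({"role": "assistant", "content": "The capital of France is Paris."})
history.append({"role": "user", "content": "And of Italy?"})
validate_messages(history)
```

Because the endpoint receives the full history on every request, the client is responsible for keeping this list between turns.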

`max_tokens` (integer)

Maximum number of tokens to generate in the response. Defaults to the model’s maximum output.

`temperature` (number, default: 1)

Sampling temperature between 0 and 2. Lower values produce more deterministic output; higher values produce more varied output.

`stream` (boolean, default: false)

When true, the response is streamed as server-sent events (SSE). Each event contains a partial delta object. The stream ends with data: [DONE].

`top_p` (number, default: 1)

Nucleus sampling parameter. The model considers only the tokens comprising the top top_p probability mass.
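To make "top top_p probability mass" concrete, here is a toy sketch of nucleus filtering over a four-token distribution. It is only an illustration: real models work over the full vocabulary and each provider may implement the cutoff slightly differently. The function name `nucleus_filter` and the token probabilities are invented for the example.

```python
def nucleus_filter(probs, top_p):
    """Keep the highest-probability tokens until their mass reaches top_p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # the nucleus is complete; remaining tokens are discarded
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical next-token distribution.
probs = {"Paris": 0.5, "Lyon": 0.25, "Nice": 0.125, "Lille": 0.125}
filtered = nucleus_filter(probs, top_p=0.75)
everything = nucleus_filter(probs, top_p=1.0)
```

With `top_p=0.75` only the two most likely tokens remain candidates; with the default of 1 nothing is filtered out.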

`n` (integer, default: 1)

Number of completion choices to generate. Each choice is an independent generation.

`stop` (string | string[])

One or more sequences at which the model stops generating. The stop sequence itself is not included in the output.
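The effect of a stop sequence can be emulated on the client to see what the output looks like. This is a sketch of the observable behavior only: the API applies the cutoff during generation, not after the fact, and `apply_stop` is a made-up helper name.

```python
def apply_stop(text, stop):
    """Cut text at the earliest occurrence of any stop sequence, excluding the sequence itself."""
    sequences = [stop] if isinstance(stop, str) else list(stop or [])
    cut = len(text)
    for seq in sequences:
        if not seq:
            continue  # ignore empty sequences, which would match at position 0
        idx = text.find(seq)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


single = apply_stop("1, 2, 3, 4", ", 3")
multiple = apply_stop("Hello\nWorld END", ["END", "\n"])
untouched = apply_stop("abc", None)
```

When several sequences are supplied, whichever appears first in the text determines where the output ends.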

Headers

`X-Inferoute-Strategy` (string)

Routing strategy for this request. Accepted values: cost, latency, availability, round-robin. Defaults to your account’s configured strategy.

`X-Inferoute-Fallback` (string)

Comma-separated list of fallback model IDs to try if the primary model is unavailable. Example: anthropic/claude-3-5-sonnet,openai/gpt-4-turbo.

Response fields

`id` (string)

Unique identifier for this completion request.

`object` (string)

Always "chat.completion".

`model` (string)

The model that actually served the request. May differ from your requested model when a fallback was used.
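One way to notice that a fallback served the request is to compare the requested ID with the returned one. The sketch below assumes the returned value may or may not carry the provider prefix and that it is otherwise the same ID string that was requested; both are assumptions, so adjust the comparison if the API returns versioned model names. `short_name` and `used_fallback` are hypothetical helpers.

```python
def short_name(model_id):
    """Strip an optional provider prefix: 'openai/gpt-4o' -> 'gpt-4o'."""
    return model_id.split("/", 1)[-1]


def used_fallback(requested, served):
    """True when the served model differs from the requested one, ignoring the provider prefix."""
    return short_name(requested) != short_name(served)


same = used_fallback("gpt-4o", "openai/gpt-4o")
switched = used_fallback("openai/gpt-4o", "anthropic/claude-3-5-sonnet")
```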

`choices` (object[])

`choices[].index` (integer)

Zero-based index of this choice.

`choices[].message` (object)

`choices[].message.role` (string)

Always "assistant".

`choices[].message.content` (string)

The generated text content.

`choices[].finish_reason` (string)

Why generation stopped: "stop", "length", "content_filter", or "tool_calls".
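A client will usually want to branch on this value. The sketch below maps each documented reason to a short description; the interpretations follow the usual OpenAI semantics (for example, "length" meaning the token limit was reached), which is an assumption based on the stated compatibility rather than something this page spells out. `describe_finish` is a hypothetical helper.

```python
def describe_finish(choice):
    """Map a choice's finish_reason to a short description of what happened."""
    reason = choice.get("finish_reason")
    if reason == "stop":
        return "complete"
    if reason == "length":
        return "truncated: consider raising max_tokens"
    if reason == "content_filter":
        return "filtered: content was withheld"
    if reason == "tool_calls":
        return "tool call requested"
    return f"unknown finish_reason: {reason!r}"


choice = {
    "index": 0,
    "message": {"role": "assistant", "content": "The capital of France is Paris."},
    "finish_reason": "stop",
}
status = describe_finish(choice)
```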

`usage` (object)

`usage.prompt_tokens` (integer)

Number of tokens in the input messages.

`usage.completion_tokens` (integer)

Number of tokens generated.

`usage.total_tokens` (integer)

Sum of prompt and completion tokens.
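For tracking consumption across several requests, the usage object can be accumulated on the client. The following is a minimal sketch; the `UsageTotals` class is hypothetical, and the second usage object is invented sample data.

```python
from dataclasses import dataclass


@dataclass
class UsageTotals:
    prompt_tokens: int = 0
    completion_tokens: int = 0

    @property
    def total_tokens(self):
        return self.prompt_tokens + self.completion_tokens

    def add(self, usage):
        """Accumulate one response's usage object after checking that its total is consistent."""
        expected = usage["prompt_tokens"] + usage["completion_tokens"]
        if usage["total_tokens"] != expected:
            raise ValueError("total_tokens does not equal prompt + completion")
        self.prompt_tokens += usage["prompt_tokens"]
        self.completion_tokens += usage["completion_tokens"]


totals = UsageTotals()
totals.add({"prompt_tokens": 28, "completion_tokens": 9, "total_tokens": 37})
totals.add({"prompt_tokens": 40, "completion_tokens": 12, "total_tokens": 52})
```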

Response headers

Header	Description
`X-Inferoute-Provider`	The provider that served the request (e.g., `openai`, `anthropic`)
`X-Inferoute-Request-Id`	Unique request ID for support and debugging
`X-Inferoute-Latency-Ms`	End-to-end request latency in milliseconds
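These headers can be collected into a small record for logging. The sketch below works on any header mapping and looks names up case-insensitively. In the OpenAI Python SDK (v1 and later), one way to obtain the headers is `client.chat.completions.with_raw_response.create(...)`, whose result exposes `.headers` and `.parse()`. The helper name `routing_metadata` and the sample values are made up, and parsing the latency as a number is an assumption about its format.

```python
def routing_metadata(headers):
    """Extract the Inferoute routing headers from a header mapping (case-insensitive)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    latency = lowered.get("x-inferoute-latency-ms")
    return {
        "provider": lowered.get("x-inferoute-provider"),
        "request_id": lowered.get("x-inferoute-request-id"),
        # Assumes the value is a plain number of milliseconds.
        "latency_ms": float(latency) if latency is not None else None,
    }


# Illustrative values only, not real API output.
meta = routing_metadata({
    "x-inferoute-provider": "openai",
    "X-Inferoute-Request-Id": "req_example",
    "X-Inferoute-Latency-Ms": "412",
})
```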

Examples

Basic request and response

curl https://api.inferoute.ai/v1/chat/completions \
  --request POST \
  --header "Authorization: Bearer YOUR_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "openai/gpt-4o",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.inferoute.ai/v1",
)

response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)

Response:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "openai/gpt-4o",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 28,
    "completion_tokens": 9,
    "total_tokens": 37
  }
}

Routing strategy and fallback

curl

curl https://api.inferoute.ai/v1/chat/completions \
  --request POST \
  --header "Authorization: Bearer YOUR_API_KEY" \
  --header "Content-Type: application/json" \
  --header "X-Inferoute-Strategy: latency" \
  --header "X-Inferoute-Fallback: anthropic/claude-3-5-sonnet,openai/gpt-4-turbo" \
  --data '{
    "model": "openai/gpt-4o",
    "messages": [{"role": "user", "content": "Summarize the Eiffel Tower in one sentence."}]
  }'
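The same request can likely be made through the OpenAI Python SDK by passing the routing headers via its per-request `extra_headers` argument. The wrapper below is a sketch under that assumption; `create_with_routing` is a hypothetical helper, and `client` is expected to be an `OpenAI` instance configured with the Inferoute `base_url`.

```python
def create_with_routing(client, *, model, messages, strategy=None, fallbacks=()):
    """Call chat.completions.create with the Inferoute routing headers attached."""
    headers = {}
    if strategy is not None:
        headers["X-Inferoute-Strategy"] = strategy
    if fallbacks:
        # The fallback header takes a comma-separated list of model IDs.
        headers["X-Inferoute-Fallback"] = ",".join(fallbacks)
    return client.chat.completions.create(
        model=model,
        messages=messages,
        extra_headers=headers,
    )
```

Calling it with `strategy="latency"` and `fallbacks=["anthropic/claude-3-5-sonnet", "openai/gpt-4-turbo"]` sends the same two headers as the curl command.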

Streaming

Python

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.inferoute.ai/v1",
)

stream = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Write a short poem about the sea."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)

Each streamed chunk looks like:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion.chunk",
  "model": "openai/gpt-4o",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "The"
      },
      "finish_reason": null
    }
  ]
}

The stream terminates with data: [DONE].
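For clients that read the raw event stream instead of using an SDK, the framing can be handled roughly as follows. This sketch assumes a single choice (`n` = 1) and one JSON payload per `data:` line; the sample lines, including the final chunk with an empty delta, are invented for illustration, and `iter_sse_content` is a hypothetical helper.

```python
import json


def iter_sse_content(lines):
    """Yield content fragments from SSE lines until the [DONE] sentinel is reached."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content


sample = [
    'data: {"choices": [{"index": 0, "delta": {"content": "The"}, "finish_reason": null}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": " sea"}, "finish_reason": null}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}',
    "",
    "data: [DONE]",
]
text = "".join(iter_sse_content(sample))
```

Concatenating the fragments in arrival order reconstructs the full assistant message.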
