Rerank

curl --request POST \
  --url https://api.zeroentropy.dev/v1/models/rerank \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "model": "<string>",
  "query": "<string>",
  "documents": [
    "<string>"
  ],
  "top_n": 123,
  "latency": "fast"
}
'

{
  "results": [
    {
      "index": 123,
      "relevance_score": 123
    }
  ],
  "total_bytes": 123,
  "total_tokens": 123,
  "actual_latency_mode": "fast",
  "e2e_latency": 123,
  "inference_latency": 123
}

Reranks the provided documents, according to the provided query.

The results are sorted in descending order of relevance. Each result contains an index and a score. The index is the position of the document in the documents array that was passed in. The score is the query-document relevance determined by the reranker model.
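Because each result carries only an index, the caller has to map it back to the original text. A minimal sketch of that lookup follows; the helper name `attach_documents` and the sample data (including the scores) are made up for illustration and are not part of the API.

```python
from typing import Any


def attach_documents(
    documents: list[str], response: dict[str, Any]
) -> list[tuple[str, float]]:
    """Pair each returned result with the document text it refers to."""
    ranked = []
    for result in response["results"]:
        # "index" points into the documents list that was sent in the request.
        ranked.append((documents[result["index"]], result["relevance_score"]))
    return ranked


# Hypothetical request documents and response body, for illustration only.
documents = ["apples are red", "the sky is blue", "bananas are yellow"]
sample_response = {
    "results": [
        {"index": 1, "relevance_score": 0.91},
        {"index": 0, "relevance_score": 0.12},
    ]
}

ranked = attach_documents(documents, sample_response)
for text, score in ranked:
    print(f"{score:.2f}  {text}")
```

The order of `ranked` follows the order of the results array, so no extra sorting is needed on the client side.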

By default, each organization has a ratelimit of 2,500,000 bytes-per-minute (BPM) and 1,000 requests-per-minute (RPM). Ratelimits are refreshed every 15 seconds. If the default ratelimit is exceeded, requests are throttled into latency: "slow" mode, up to 20,000,000 bytes-per-minute. If that limit is also exceeded, you will get a 429 error.
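The tiers above can be sketched as a simple lookup on bytes-per-minute. This is a rough illustration only: it ignores the RPM limit and the 15-second refresh, it assumes the limits are inclusive, and it assumes the latency field was left unspecified so that fallback to slow mode is allowed. The function name `expected_outcome` is hypothetical.

```python
# Default limits quoted in the text above.
FAST_BPM_LIMIT = 2_500_000    # bytes-per-minute served in fast mode
SLOW_BPM_LIMIT = 20_000_000   # bytes-per-minute ceiling for throttled slow mode


def expected_outcome(bytes_per_minute: int) -> str:
    """Return "fast", "slow", or "429" for a given sustained byte rate.

    Assumption: boundaries are treated as inclusive, which the text does
    not specify.
    """
    if bytes_per_minute <= FAST_BPM_LIMIT:
        return "fast"
    if bytes_per_minute <= SLOW_BPM_LIMIT:
        return "slow"
    return "429"


for rate in (1_000_000, 5_000_000, 25_000_000):
    print(rate, expected_outcome(rate))
```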

The “bytes” used by a request is calculated as sum(150 + len(query.encode('utf-8')) + len(d.encode('utf-8')) for d in documents). Note the baseline overhead of 150 bytes per document, and that the query bytes are counted once for each document, as rerankers are cross-encoders. The maximum per-request payload size is 5,000,000 bytes.
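To estimate the byte cost of a request before sending it, the formula can be wrapped in a small helper. This is a sketch of the calculation as described above; the function name `request_bytes` is an assumption, and the value reported by the API in total_bytes remains the authoritative figure.

```python
PER_DOCUMENT_OVERHEAD = 150  # baseline overhead per document, from the formula above


def request_bytes(query: str, documents: list[str]) -> int:
    """Estimate the ratelimit bytes consumed by one rerank request."""
    # The query is encoded once, but its length is counted for every document.
    query_len = len(query.encode("utf-8"))
    return sum(
        PER_DOCUMENT_OVERHEAD + query_len + len(d.encode("utf-8"))
        for d in documents
    )


# "héllo" is 6 bytes in UTF-8 because "é" takes 2 bytes.
cost = request_bytes("héllo", ["a", "bc"])
print(cost)  # (150 + 6 + 1) + (150 + 6 + 2) = 315
```

Because the query length is added per document, a long query with many short documents can cost far more than the raw payload size suggests.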

To request higher ratelimits, please contact founders@zeroentropy.dev or message us on Discord or Slack!

POST /models/rerank

Authorizations

Authorization

string

header

required

The Authorization header must be provided in the format Bearer <your-api-key>.

You can get your API Key at the Dashboard!
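For callers not using curl, the same request can be assembled with Python's standard library. The sketch below only builds the request object and does not send it; the helper name `build_rerank_request`, the placeholder key, and the sample query are illustrative assumptions.

```python
import json
import urllib.request

RERANK_URL = "https://api.zeroentropy.dev/v1/models/rerank"


def build_rerank_request(
    api_key: str, model: str, query: str, documents: list[str]
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the rerank endpoint."""
    body = json.dumps(
        {"model": model, "query": query, "documents": documents}
    ).encode("utf-8")
    return urllib.request.Request(
        RERANK_URL,
        data=body,
        method="POST",
        headers={
            # The header format is "Bearer <your-api-key>".
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# "YOUR_API_KEY" is a placeholder; substitute the key from the Dashboard.
request = build_rerank_request(
    "YOUR_API_KEY", "zerank-2", "what color is the sky?", ["the sky is blue"]
)
# To send it: urllib.request.urlopen(request), then json.load() the response.
```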

Body

application/json

model

string

required

The model ID to use for reranking. Options are: ["zerank-2", "zerank-1", "zerank-1-small"]

query

string

required

The query to rerank the documents by.

documents

string[]

required

The list of documents to rerank. Each document is a string.

top_n

integer | null

If provided, then only the top n documents will be returned in the results array. Otherwise, n will be the length of the provided documents array.

latency

enum<string> | null

Whether the call is run in "fast" or "slow" mode. Ratelimits for slow calls are orders of magnitude higher, but you can expect latency above 10 seconds. Fast calls are guaranteed subsecond latency, but their ratelimits are lower. If not specified, a "fast" call is attempted first; if you have exceeded your fast ratelimit, a slow call is executed instead. If explicitly set to "fast", a 429 is returned when the call cannot be executed in fast mode.

Available options: fast, slow
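The optional top_n and latency fields can be handled with a small payload builder. This is a sketch under one assumption not stated above: that omitting a field from the JSON body is treated the same as leaving it unspecified. The helper name `build_payload` is hypothetical.

```python
from typing import Any, Optional

ALLOWED_LATENCY = ("fast", "slow")


def build_payload(
    model: str,
    query: str,
    documents: list[str],
    top_n: Optional[int] = None,
    latency: Optional[str] = None,
) -> dict[str, Any]:
    """Build the JSON body, leaving out optional fields that were not set."""
    if latency is not None and latency not in ALLOWED_LATENCY:
        raise ValueError(f"latency must be one of {ALLOWED_LATENCY} or None")
    payload: dict[str, Any] = {
        "model": model,
        "query": query,
        "documents": documents,
    }
    if top_n is not None:
        payload["top_n"] = top_n
    if latency is not None:
        # Setting "fast" explicitly means a 429 instead of a slow fallback.
        payload["latency"] = latency
    return payload


default_payload = build_payload("zerank-2", "query text", ["doc one", "doc two"])
strict_payload = build_payload(
    "zerank-2", "query text", ["doc one", "doc two"], top_n=1, latency="fast"
)
```

Leaving latency unset keeps the fast-then-slow fallback described above, which is usually the safer default for batch workloads.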

Response

Successful Response

results

RerankResult · object[]

required

The results, ordered by descending order of relevance to the query.

total_bytes

integer

required

The total number of bytes in the request. This is used for ratelimiting.

total_tokens

integer

required

The total number of tokens in the request. This is used for billing.

actual_latency_mode

enum<string>

required

The latency mode actually used. If no latency mode was requested, fast is used by default, with slow as a fallback if your ratelimit is exceeded. Otherwise, this field is identical to the requested latency mode.

Available options: fast, slow
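A client that leaves latency unspecified can use this field to notice when a request was served in slow mode. A minimal sketch, with a hypothetical helper name and a made-up response fragment:

```python
from typing import Any, Optional


def fell_back_to_slow(
    requested_latency: Optional[str], response: dict[str, Any]
) -> bool:
    """True when no mode was requested and the call was served in slow mode."""
    return requested_latency is None and response["actual_latency_mode"] == "slow"


# Hypothetical response fragments, for illustration only.
throttled = fell_back_to_slow(None, {"actual_latency_mode": "slow"})
explicit = fell_back_to_slow("slow", {"actual_latency_mode": "slow"})
print(throttled, explicit)
```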

e2e_latency

number

required

The total time, in seconds, between the rerank request being received and the rerank response being returned. Client latency should equal e2e_latency + your ping to ZeroEntropy's API.

inference_latency

number

required

The time, in seconds, spent on model inference for the request. If this is significantly lower than e2e_latency, the difference is likely due to ratelimiting. To request a higher ratelimit, please contact founders@zeroentropy.dev or message us on Discord or Slack!
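The gap between the two latency fields can be computed directly from a response. In the sketch below the helper names and the 1-second threshold are arbitrary assumptions chosen for illustration; the API does not define what counts as "significantly lower".

```python
from typing import Any


def non_inference_seconds(response: dict[str, Any]) -> float:
    """Time spent outside model inference (for example, waiting in a queue)."""
    return response["e2e_latency"] - response["inference_latency"]


def looks_ratelimited(response: dict[str, Any], threshold_seconds: float = 1.0) -> bool:
    """Heuristic check; the default threshold is an arbitrary assumption."""
    return non_inference_seconds(response) > threshold_seconds


# Hypothetical latency values, for illustration only.
sample = {"e2e_latency": 1.5, "inference_latency": 0.25}
overhead = non_inference_seconds(sample)
print(overhead, looks_ratelimited(sample))
```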
