
zesearch

zesearch is ZeroEntropy’s end-to-end search engine. It abstracts away the whole data pipeline: OCR and chunking, embedding and storage, and querying and reranking.

Index

Add documents to a collection

When you add a document to a collection in zesearch, it goes through a fully managed ingestion pipeline:
  1. Parse: Binary files (PDF, DOCX, PPT, images, etc.) are OCR’d and converted to text. Plain text and CSV inputs skip this step.
  2. Chunk: The parsed text is split into chunks at multiple granularities: coarse (~2000 chars) and fine (~200 chars), optimized for retrieval.
  3. Embed: Each chunk is embedded using zembed-1, ZeroEntropy’s state-of-the-art multilingual embedding model, and stored in our vector index.
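To make the idea of chunking at two granularities concrete, here is a minimal sketch. It uses plain fixed-size character windows as a stand-in; this is an assumption for illustration only, since the actual splitting logic of the managed pipeline is not described here. The names `split_fixed` and `chunk_multi` are hypothetical.

```python
def split_fixed(text: str, size: int) -> list[tuple[int, int, str]]:
    """Split text into consecutive windows of at most `size` characters.

    Returns (start, end, chunk) tuples, with `end` exclusive.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [
        (start, min(start + size, len(text)), text[start:start + size])
        for start in range(0, len(text), size)
    ]


def chunk_multi(text: str, coarse: int = 2000, fine: int = 200) -> dict:
    """Chunk the same text at a coarse and a fine granularity.

    Fixed-size windows are an illustrative assumption, not the real pipeline.
    """
    return {
        "coarse": split_fixed(text, coarse),
        "fine": split_fixed(text, fine),
    }
```

Both granularities cover the same text, so a query can match a short, precise passage or a longer passage that carries more context.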
When you call add-document, the document is added to a collection under a unique path (similar to a file path). ZeroEntropy supports three content types:
  • text: Plain text content.
  • text-pages / text-pages-unordered: Pre-paginated text (array of strings). Use text-pages-unordered for data such as CSVs, where pages are independent.
  • auto: Binary files (PDF, DOCX, PPT, etc.) as base64. ZeroEntropy handles OCR and parsing automatically.
Set overwrite: true to upsert (atomically replace if the path already exists).
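The content types above can be sketched as small payload builders. The field names `text`, `pages`, and `base64_data`, and the `client.documents.add(...)` call, are assumptions about the SDK that this page does not confirm, so check them against the API reference before relying on them.

```python
import base64


def text_content(text: str) -> dict:
    """Payload for the `text` content type."""
    return {"type": "text", "text": text}


def pages_content(pages: list[str], ordered: bool = True) -> dict:
    """Payload for `text-pages` (ordered) or `text-pages-unordered`."""
    kind = "text-pages" if ordered else "text-pages-unordered"
    return {"type": kind, "pages": list(pages)}


def auto_content(raw: bytes) -> dict:
    """Payload for the `auto` content type: file bytes encoded as base64."""
    return {"type": "auto", "base64_data": base64.b64encode(raw).decode("ascii")}


def add_text_document(client, collection_name: str, path: str, text: str) -> None:
    """Upsert a plain-text document.

    `client` would be a ZeroEntropy() instance; the method name and
    keyword arguments are assumed, not taken from this page.
    """
    client.documents.add(
        collection_name=collection_name,
        path=path,
        content=text_content(text),
        overwrite=True,  # replace the document if the path already exists
    )
```

Keeping the payload builders separate from the network call makes them easy to check without an API key.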

Custom Chunking

If you want control over how your data is chunked, use the text-pages content type. Each string in the pages array becomes its own page in the index, letting you define chunk boundaries yourself. Use text-pages-unordered when pages are independent (e.g. CSV rows, FAQ entries). See examples for detailed walkthroughs of different ingestion strategies.
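As an illustration of defining chunk boundaries yourself, the sketch below turns each CSV row into one independent page. The helper names are hypothetical, and the `pages` field in the payload is an assumption about the request shape.

```python
import csv
import io


def csv_rows_to_pages(csv_text: str) -> list[str]:
    """Turn each CSV data row into one self-describing page string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    pages = []
    for row in reader:
        # Repeat the column names on every page so each row stands alone.
        pages.append("\n".join(f"{key}: {value}" for key, value in row.items()))
    return pages


def unordered_pages_content(pages: list[str]) -> dict:
    """Payload for text-pages-unordered (field name `pages` is assumed)."""
    return {"type": "text-pages-unordered", "pages": pages}
```

Repeating the header on each page is a design choice: because unordered pages are retrieved independently, a row without its column names would lose its meaning.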

Using zembed-1 as a standalone

You can also call zembed-1 directly via the embed endpoint and plug it into the vector database of your choice. See Models for more details.
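To show what plugging embeddings into your own store might look like, here is a toy in-memory vector store with cosine-similarity search. It is only a sketch: the vectors are placeholders that you would obtain from the embed endpoint (whose exact call is not shown on this page), and `TinyVectorStore` is a hypothetical stand-in for a real vector database.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """Brute-force vector store for illustration; not suitable for large data."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, doc_id: str, vector: list[float]) -> None:
        self._items.append((doc_id, list(vector)))

    def search(self, query_vector: list[float], k: int = 3) -> list[tuple[str, float]]:
        """Return the k stored ids most similar to the query, best first."""
        scored = [(cosine(query_vector, vec), doc_id) for doc_id, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(doc_id, score) for score, doc_id in scored[:k]]
```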

Query

There are three granularity levels for querying your indexed data: documents, pages, and snippets. All query endpoints accept a natural language query, a collection_name, and a k parameter controlling how many results to return. All query endpoints support metadata filtering via the optional filter parameter.

Top Documents

Returns the top K most relevant documents for a given query.
Useful when you want to identify which documents are relevant without needing sub-document granularity.
Note that top-documents only returns document paths, not contents. Document contents are accessible using the Get Document Info endpoint.
Use latency_mode: “high” if you need higher throughput at the cost of higher latency (default is “low”).
from zeroentropy import ZeroEntropy
zclient = ZeroEntropy()
response = zclient.queries.top_documents(
    collection_name="contracts",
    query="What are the payment terms?",
    k=5,
    include_metadata=True,
)

Top Pages

Returns the top K most relevant pages. Ideal for page-level retrieval over PDFs, DOCX, or documents ingested with text-pages content type.
Set include_content to true to return the full text of each page. A URL to an image of the page will also be provided in the results.
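A page-level query could look like the sketch below. It assumes that the SDK exposes `queries.top_pages` with the same keyword arguments as the other query calls on this page, and that the response has a `results` list; both are assumptions to verify against the API reference.

```python
def top_pages_with_content(client, collection_name: str, query: str, k: int = 5):
    """Fetch the top-k pages for a query, including each page's full text.

    `client` would be a ZeroEntropy() instance. The method name and the
    `results` attribute are assumed by analogy with the other query calls.
    """
    response = client.queries.top_pages(
        collection_name=collection_name,
        query=query,
        k=k,
        include_content=True,  # return the page text, not just its location
    )
    return response.results


# Example usage (requires the SDK and an API key):
# from zeroentropy import ZeroEntropy
# pages = top_pages_with_content(ZeroEntropy(), "contracts", "What are the payment terms?")
```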

Top Snippets

Returns the top K most relevant text snippets. This is the most granular query type.
Each snippet includes the exact character range (start_index, end_index) and page_span within the source document.
You can choose between coarse snippets (averaging ~2000 characters, default) and precise snippets (averaging ~200 characters) using the precise_responses parameter.
Pass a reranker, such as zerank-2, for even better ranking.
from zeroentropy import ZeroEntropy
zclient = ZeroEntropy()
response = zclient.queries.top_snippets(
    collection_name="pdfs",
    query="What is Retrieval Augmented Generation?",
    k=10,
    reranker="zerank-2",
    precise_responses=True,
)
for snippet in response.results:
    print(f"{snippet.path} [pages {snippet.page_span}] (score: {snippet.score})")
    print(snippet.content)
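The character range returned with each snippet can be used to show the match in its surrounding context. The sketch below assumes that start_index and end_index are 0-based offsets into the document's full text with an exclusive end; this page does not state the convention, so treat it as an assumption. The helper name is hypothetical.

```python
def snippet_in_context(document_text: str, start_index: int, end_index: int, margin: int = 40) -> str:
    """Return the snippet wrapped in [[ ]] with `margin` characters of context per side.

    Assumes 0-based offsets with an exclusive end_index.
    """
    if not 0 <= start_index <= end_index <= len(document_text):
        raise ValueError("indices fall outside the document text")
    left = max(0, start_index - margin)
    right = min(len(document_text), end_index + margin)
    return (
        document_text[left:start_index]
        + "[[" + document_text[start_index:end_index] + "]]"
        + document_text[end_index:right]
    )
```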

Data Management

zesearch organizes data into collections, each containing documents. Think of collections as databases and documents as records.

Collections

Create, list, and delete collections. Collection names are strings of up to 1024 UTF-8 bytes.
from zeroentropy import ZeroEntropy
zclient = ZeroEntropy()
# Create a collection
zclient.collections.add(collection_name="contracts")
# List all collections
response = zclient.collections.get_list()
print(response.collection_names)
# Delete a collection
zclient.collections.delete(collection_name="contracts")

Documents

After you add a document to a collection, it takes time to parse and index. Use the Get Document Info endpoint to track progress.
Each document response includes file_url for downloading the raw file, index_status for tracking processing state, the raw content, and num_pages (null if the document is still parsing or the file type is unsupported).
You can delete one or more documents by path. Batch deletion supports up to 64 paths at once.
from zeroentropy import ZeroEntropy
zclient = ZeroEntropy()
# Delete a single document
zclient.documents.delete(
    collection_name="contracts",
    path="contracts/acme-nda.txt",
)
# Batch delete
response = zclient.documents.delete(
    collection_name="contracts",
    path=["old/doc1.txt", "old/doc2.txt", "old/doc3.txt"],
)
print(response.deleted_paths)  # paths that were actually found and deleted
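Because a single batch deletion accepts at most 64 paths, longer lists need to be split. The sketch below does that with the `documents.delete` call and the `deleted_paths` field shown above; the helper names `batched` and `delete_many` are hypothetical, and `client` would be a ZeroEntropy() instance.

```python
def batched(paths: list[str], size: int = 64):
    """Yield consecutive slices of `paths` with at most `size` items each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(paths), size):
        yield paths[start:start + size]


def delete_many(client, collection_name: str, paths, batch_size: int = 64) -> list[str]:
    """Delete any number of paths, one request per batch of up to 64.

    Returns the paths that the service reports as actually deleted.
    """
    deleted: list[str] = []
    for batch in batched(list(paths), batch_size):
        response = client.documents.delete(collection_name=collection_name, path=batch)
        deleted.extend(response.deleted_paths)
    return deleted
```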
More examples can be found here.