zesearch
zesearch is ZeroEntropy’s end-to-end search engine. It abstracts away the entire data-processing pipeline: OCR and chunking, embedding and storage, and querying and reranking.
Index
Add documents to a collection
When you add a document to a collection in zesearch, it goes through a fully managed ingestion pipeline:
- Parse: Binary files (PDF, DOCX, PPT, images, etc.) are OCR’d and converted to text. Plain text and CSV inputs skip this step.
- Chunk: The parsed text is split into chunks at multiple granularities: coarse (~2000 chars) and fine (~200 chars), optimized for retrieval.
- Embed: Each chunk is embedded using zembed-1, ZeroEntropy’s state-of-the-art multilingual embedding model, and stored in our vector index.
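To make the idea of chunking at two granularities concrete, here is a rough sketch that cuts the same text into coarse (~2000 character) and fine (~200 character) windows. It is only an illustration: the fixed-size windowing and the function names `chunk_text` and `multi_granularity_chunks` are assumptions, and the managed pipeline's actual chunking strategy is not described here.

```python
def chunk_text(text: str, size: int) -> list[tuple[int, int, str]]:
    """Split text into consecutive windows of at most `size` characters.

    Returns (start, end, chunk) tuples, with `end` exclusive.
    Fixed-size windows are a stand-in, not the real chunking algorithm.
    """
    if size < 1:
        raise ValueError("size must be positive")
    return [
        (start, min(start + size, len(text)), text[start:start + size])
        for start in range(0, len(text), size)
    ]


def multi_granularity_chunks(
    text: str, coarse: int = 2000, fine: int = 200
) -> dict[str, list[tuple[int, int, str]]]:
    """Chunk the same text at a coarse and a fine granularity."""
    return {
        "coarse": chunk_text(text, coarse),
        "fine": chunk_text(text, fine),
    }


if __name__ == "__main__":
    chunks = multi_granularity_chunks("lorem ipsum " * 400)
    print(len(chunks["coarse"]), "coarse chunks;", len(chunks["fine"]), "fine chunks")
```

Keeping both granularities for the same text is what lets a query later trade off longer context against tighter matches.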
The content type determines how the document is supplied:
- text: Plain text content.
- text-pages / text-pages-unordered: Pre-paginated text (array of strings). Use unordered for data like CSVs where pages are independent.
- auto: Binary files (PDF, DOCX, PPT, etc.) as base64. ZeroEntropy handles OCR and parsing automatically.
- overwrite: Set to true to upsert (atomically replace the document if the path already exists).
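The content types above can be sketched as plain request payloads. This is a minimal sketch rather than the official client: the type names, `pages`, `path`, `collection_name`, and `overwrite` come from this page, while the JSON keys `type`, `text`, `base64_data`, and `content`, along with the overall request shape and the helper names, are assumptions made for illustration.

```python
import base64
from typing import Any


def text_content(text: str) -> dict[str, Any]:
    """Plain text content (key names are assumed)."""
    return {"type": "text", "text": text}


def pages_content(pages: list[str], ordered: bool = True) -> dict[str, Any]:
    """Pre-paginated text; use ordered=False when pages are independent."""
    kind = "text-pages" if ordered else "text-pages-unordered"
    return {"type": kind, "pages": list(pages)}


def auto_content(raw: bytes) -> dict[str, Any]:
    """Binary file sent as base64 ("base64_data" is an assumed key name)."""
    return {"type": "auto", "base64_data": base64.b64encode(raw).decode("ascii")}


def add_document_request(
    collection_name: str, path: str, content: dict[str, Any], overwrite: bool = False
) -> dict[str, Any]:
    """Assemble a hypothetical add-document request body."""
    return {
        "collection_name": collection_name,
        "path": path,
        "content": content,
        "overwrite": overwrite,  # True upserts if the path already exists
    }


if __name__ == "__main__":
    request = add_document_request(
        "handbook", "policies/leave.txt", text_content("Leave policy ..."), overwrite=True
    )
    print(request)
```

Building the payload separately from sending it keeps the example independent of any particular HTTP client or SDK.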
Custom Chunking
If you want control over how your data is chunked, use the text-pages content type. Each string in the pages array becomes its own page in the index, letting you define chunk boundaries yourself. Use text-pages-unordered when pages are independent (e.g. CSV rows, FAQ entries).
See examples for detailed walkthroughs of different ingestion strategies.
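As one possible custom chunking strategy, the sketch below splits a document on blank lines so that each paragraph becomes its own page, and turns FAQ entries into independent pages. The helper names and the `type`/`pages` payload keys are assumptions for illustration; only the text-pages and text-pages-unordered content types come from this page.

```python
import re
from typing import Any


def paragraphs_to_pages(text: str) -> list[str]:
    """Split on blank lines; each non-empty paragraph becomes one page."""
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]


def faq_to_pages(entries: list[tuple[str, str]]) -> list[str]:
    """Render each (question, answer) pair as one independent page."""
    return [f"Q: {question}\nA: {answer}" for question, answer in entries]


def ordered_pages_content(text: str) -> dict[str, Any]:
    """Paragraph order matters, so use the ordered content type."""
    return {"type": "text-pages", "pages": paragraphs_to_pages(text)}


def unordered_pages_content(entries: list[tuple[str, str]]) -> dict[str, Any]:
    """FAQ entries are independent, so use the unordered content type."""
    return {"type": "text-pages-unordered", "pages": faq_to_pages(entries)}


if __name__ == "__main__":
    print(ordered_pages_content("Intro paragraph.\n\nSecond paragraph."))
    print(unordered_pages_content([("What is zesearch?", "A search engine.")]))
```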
Using zembed-1 standalone
You can also call zembed-1 directly via the embed endpoint and plug it into the vector database of your choice. See Models for more details.
Query
There are three granularity levels for querying your indexed data: documents, pages, and snippets. All query endpoints accept a natural language query, a collection_name, and a k parameter controlling how many results to return. All query endpoints also support metadata filtering via the optional filter parameter.
Top Documents
Returns the top K most relevant documents for a given query. Useful when you want to identify which documents are relevant without needing sub-document granularity.
Note that top-documents only returns document paths, not contents. Document contents are accessible using the Get Document Info endpoint. Use latency_mode: "high" if you need higher throughput at the cost of higher latency (default is "low").
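A top-documents query can be sketched as a small payload builder. The parameter names `query`, `collection_name`, `k`, `filter`, and `latency_mode` are taken from this page; the function name, the validation rules, and the flat JSON shape are assumptions. The filter is passed through untouched because its syntax is not described here.

```python
from typing import Any, Optional


def top_documents_request(
    query: str,
    collection_name: str,
    k: int,
    metadata_filter: Optional[dict[str, Any]] = None,
    latency_mode: str = "low",
) -> dict[str, Any]:
    """Assemble a hypothetical top-documents request body."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if latency_mode not in ("low", "high"):
        raise ValueError('latency_mode must be "low" or "high"')
    payload: dict[str, Any] = {
        "query": query,
        "collection_name": collection_name,
        "k": k,
        "latency_mode": latency_mode,  # "high" favors throughput over latency
    }
    if metadata_filter is not None:
        payload["filter"] = metadata_filter  # opaque; syntax not shown on this page
    return payload


if __name__ == "__main__":
    print(top_documents_request("parental leave policy", "handbook", k=5))
```

Because the response contains only document paths, a typical follow-up is to fetch the contents of the top few paths through the Get Document Info endpoint.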
Top Pages
Returns the top K most relevant pages. Ideal for page-level retrieval over PDFs, DOCX, or documents ingested with the text-pages content type. Set include_content to true to return the full text of each page. A URL to an image of the page will also be provided in the results.
Top Snippets
Returns the top K most relevant text snippets. This is the most granular query type. Each snippet includes the exact character range (start_index, end_index) and page_span within the source document.
You can choose between coarse snippets (averaging ~2000 characters, default) and precise snippets (averaging ~200 characters) using the precise_responses parameter.
Pass a reranker, such as zerank-2, for even better ranking.
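The sketch below shows how a snippet query and its character range might be used together. `precise_responses`, `reranker`, `start_index`, and `end_index` are named on this page; the request shape, the helper names, and the assumption that `end_index` is exclusive (Python slice semantics) are not confirmed here and should be checked against the API reference.

```python
from typing import Any, Optional


def top_snippets_request(
    query: str,
    collection_name: str,
    k: int,
    precise_responses: bool = False,
    reranker: Optional[str] = None,
) -> dict[str, Any]:
    """Assemble a hypothetical top-snippets request body."""
    payload: dict[str, Any] = {
        "query": query,
        "collection_name": collection_name,
        "k": k,
        "precise_responses": precise_responses,  # True: ~200 chars, False: ~2000
    }
    if reranker is not None:
        payload["reranker"] = reranker  # e.g. "zerank-2"
    return payload


def snippet_text(document_text: str, start_index: int, end_index: int) -> str:
    """Recover a snippet from the source text, assuming an exclusive end_index."""
    if not 0 <= start_index <= end_index <= len(document_text):
        raise ValueError("snippet range falls outside the document")
    return document_text[start_index:end_index]


if __name__ == "__main__":
    print(top_snippets_request("refund window", "handbook", k=3,
                               precise_responses=True, reranker="zerank-2"))
    print(snippet_text("ZeroEntropy indexes documents.", 0, 11))
```

Slicing the original text with the returned range is handy for highlighting a match in context without storing snippet text separately.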