System Architecture

ZeroEntropy is built to bring advanced document intelligence to your knowledge base. We’ve designed our retrieval system to solve common failures found in native hybrid search implementations.

Ingestion Architecture

Query Architecture

Core Components

  1. Document Processing Pipeline

    • Handles document ingestion and parsing.
    • Supports a variety of document formats (PDF, DOCX, PPT, TXT, etc.).
    • Supports complex diagrams found in medicine, manufacturing, and deep tech.
    • Correctly parses the hierarchical structure found in legal, healthcare, and other industries.
    • Uses LLMs to tag the data, as if you had hired thousands of SEO engineers to manually annotate your corpus.
  2. Data Storage

    • Document raw data is stored in object storage, along with images for PDF/DOCX/PPT pages.
    • Document metadata is stored in PostgreSQL.
    • The document ingestion pipeline stores vector data in turbopuffer, keyword data in ParadeDB BM25 indices, and collection dictionaries in S3 using the BK-tree data structure.
  3. Query Processing Engine

    • Interprets natural language queries without any special syntax required.
    • Uses an LLM in the loop to generate candidate keyword searches and semantic searches automatically, then to review everything retrieved before deciding what is most important and relevant to your query.
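
To illustrate the BK-tree data structure mentioned under Data Storage, here is a minimal Python sketch. It is not ZeroEntropy's implementation. The `BKTree` class, the `levenshtein` helper, the sample terms, and the choice of edit distance as the metric are all assumptions made for illustration. One plausible use, also an assumption, is fuzzy lookup of query terms against a collection dictionary, so that a misspelled term can still be matched to a term that exists in the corpus.

```python
from typing import List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


class BKTree:
    """Hypothetical in-memory BK-tree over strings, keyed by edit distance."""

    def __init__(self) -> None:
        # A node is (term, children); children maps distance -> child node.
        self.root: Optional[Tuple[str, dict]] = None

    def add(self, term: str) -> None:
        if self.root is None:
            self.root = (term, {})
            return
        node = self.root
        while True:
            distance = levenshtein(term, node[0])
            if distance == 0:
                return  # term is already in the tree
            child = node[1].get(distance)
            if child is None:
                node[1][distance] = (term, {})
                return
            node = child

    def search(self, query: str, max_distance: int) -> List[Tuple[int, str]]:
        """Return (distance, term) pairs within max_distance of query."""
        if self.root is None:
            return []
        matches = []
        stack = [self.root]
        while stack:
            term, children = stack.pop()
            distance = levenshtein(query, term)
            if distance <= max_distance:
                matches.append((distance, term))
            # Triangle inequality: a match can only sit under an edge whose
            # label lies in [distance - max_distance, distance + max_distance].
            for edge, child in children.items():
                if distance - max_distance <= edge <= distance + max_distance:
                    stack.append(child)
        return sorted(matches)


# Illustrative dictionary terms (made up for this sketch).
tree = BKTree()
for word in ["retrieval", "retrieve", "reranker", "vector", "keyword"]:
    tree.add(word)

print(tree.search("vectr", 1))  # [(1, 'vector')]
```

The pruning step is what makes the structure useful for large dictionaries: subtrees whose edge label falls outside the allowed range are skipped without computing any distances inside them.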

Security & Performance

  • End-to-end encryption for data in transit and at rest.
  • On-premises deployment is available for enterprise users as easy-to-use Docker images.