Often, you will want to attach metadata to each document and then, when sending queries, filter documents based on that metadata. ZeroEntropy supports query-time metadata filtering via a metadata query language.

Metadata Specification

Document metadata must be of the type dict[str, str | list[str]]. For example, you could attach the following JSON object as document metadata:

{
    "timestamp": "2024-12-12T20:00:45",
    "author": "Nicholas Pipitone",
    "language": "en",
    "list:tags": ["Artificial Intelligence", "Technology", "Documentation"],
    "list:write-permissions": ["admin", "author"],
    "list:read-permissions": ["all"]
}
Note that attribute names must be alphanumeric (hyphens and underscores are also allowed). When an attribute is a list of strings, its name must be prefixed with list:.
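
The rules above can be checked on the client before uploading a document. The following is a minimal sketch, not part of the ZeroEntropy SDK: `validate_metadata` is a hypothetical helper, the regular expression is one reading of "alphanumeric plus hyphens and underscores", and treating a list: attribute with a non-list value as an error is an assumption.

```python
import re

# Hypothetical helper, not part of the ZeroEntropy SDK.
# Assumption: the part of the name after an optional "list:" prefix must be
# non-empty and contain only letters, digits, hyphens, and underscores.
_NAME_RE = re.compile(r"[A-Za-z0-9_-]+")


def validate_metadata(metadata: dict) -> list:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for key, value in metadata.items():
        if not isinstance(key, str):
            problems.append(f"{key!r}: attribute name must be a string")
            continue
        is_list_key = key.startswith("list:")
        name = key[len("list:"):] if is_list_key else key
        if not _NAME_RE.fullmatch(name):
            problems.append(f"{key!r}: invalid attribute name")
        if isinstance(value, list):
            if not is_list_key:
                problems.append(f"{key!r}: list values need the 'list:' prefix")
            if not all(isinstance(item, str) for item in value):
                problems.append(f"{key!r}: list items must all be strings")
        elif isinstance(value, str):
            if is_list_key:
                # Assumption: a "list:" attribute must hold a list.
                problems.append(f"{key!r}: 'list:' attributes must be lists")
        else:
            problems.append(f"{key!r}: value must be a str or a list of str")
    return problems


print(validate_metadata({"language": "en", "list:tags": ["blog", "tech"]}))
print(validate_metadata({"tags": ["blog", "tech"]}))
```

The first call reports no problems; the second reports that a list value was stored under a name without the list: prefix.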

Metadata Filtering

Basic Usage

To filter, you can use the operators $eq, $ne, $gt, $gte, $lt, and $lte. These operators represent “equals”, “not equals”, “greater than”, “greater than or equal to”, “less than”, and “less than or equal to”, respectively. Here are a few example filters:

# For "language" == "en"
results = await ze_client.top_snippets(
    collection_name="default",
    query="I'm looking for documents about apples",
    k=5,
    filter={
        "language": {
            "$eq": "en"
        }
    },
)
# For "timestamp" > (1 day ago)
from datetime import datetime, timedelta
results = await ze_client.top_snippets(
    collection_name="default",
    query="I'm looking for documents about apples",
    k=5,
    filter={
        "timestamp": {
            "$gt": (datetime.now() - timedelta(days=1)).isoformat()
        }
    },
)
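
Because metadata values are strings, a range filter on timestamp is presumably evaluated as a string comparison. That is an assumption; the text above does not state the comparison rule. Under that assumption, ISO 8601 timestamps order chronologically only when every value uses the same layout and timezone convention. A small sketch of producing a cutoff in the same second-resolution layout as the example metadata (`iso_cutoff` is a hypothetical helper):

```python
from datetime import datetime, timedelta
from typing import Optional


def iso_cutoff(days: int, now: Optional[datetime] = None) -> str:
    """Return an ISO 8601 string for `days` days before `now`, to the second."""
    now = now or datetime.now()
    return (now - timedelta(days=days)).isoformat(timespec="seconds")


stored = "2024-12-12T20:00:45"  # same layout as the example metadata
cutoff = iso_cutoff(1, now=datetime(2024, 12, 13, 12, 0, 0))

print(cutoff)           # 2024-12-12T12:00:00
print(stored > cutoff)  # True: plain string comparison agrees with time order
```

If stored timestamps mix layouts (for example, some with a timezone offset and some without), string ordering and chronological ordering can disagree, so it is worth normalizing values at upload time.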

You can check whether or not a string attribute matches a list using $in and $nin for “in” and “not in” operations.

# For "language" in ["en", "es"]
results = await ze_client.top_snippets(
    collection_name="default",
    query="I'm looking for documents about apples",
    k=5,
    filter={
        # `true` if "language" is set to either "en" OR "es"
        "language": {
            "$in": ["en", "es"]
        }
    },
)
If you provide a filter query and a document does not contain that attribute, then that attribute will be considered null for that document. In other words, $eq, $gt, $gte, $lt, and $lte will always evaluate to false, while $ne will always evaluate to true, because null is not equal to any string. Be careful not to have any typos in your query attribute name, or you may not match any documents!
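
The null behavior described above can be modeled locally. This is an illustrative sketch only: `matches` is a hypothetical function rather than a ZeroEntropy API, and it assumes that present values are compared as plain strings.

```python
import operator

# Comparison operators from the text, mapped to Python's comparisons.
# Assumption: values compare as plain strings (lexicographically).
_OPS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata: dict, attribute: str, op: str, operand: str) -> bool:
    """Local model of one comparison filter against one document's metadata."""
    value = metadata.get(attribute)  # None stands in for "null"
    if value is None:
        # A missing attribute only satisfies "not equals".
        return op == "$ne"
    return _OPS[op](value, operand)


doc = {"language": "en"}
print(matches(doc, "language", "$eq", "en"))  # True
print(matches(doc, "langauge", "$eq", "en"))  # False: typo, attribute is missing
print(matches(doc, "langauge", "$ne", "en"))  # True: null is not equal to "en"
```

The misspelled attribute name in the last two calls shows why a typo can silently match no documents with $eq, or every document with $ne.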

Lists of Strings

When using a “list of strings” metadata attribute, the attribute name must start with list:. For example, you can set list:tags to be a list of tags for a blog article document.

List-of-strings attributes can only be used with the operators $in and $nin. These operators perform “set intersection”: a $in b is true if and only if a and b have at least one element in common, and a $nin b is true if and only if they have none. Here are a few examples:

# Upload two blog posts, one about tech, and the other about food.
await ze_client.add_document(
    collection_name="default",
    document_path="ai_blog.txt",
    data={
        "type": "text",
        "text": "This is a blog post about artificial intelligence."
    },
    metadata={
        "list:tags": ["blog", "tech"]
    }
)
await ze_client.add_document(
    collection_name="default",
    document_path="food_blog.txt",
    data={
        "type": "text",
        "text": "This is a blog post about food."
    },
    metadata={
        "list:tags": ["blog", "food"]
    }
)
await ze_client.add_document(
    collection_name="default",
    document_path="empty.txt",
    data={
        "type": "text",
        "text": "This is an empty file with no tags."
    },
    metadata={} # Omission is equivalent to `list:tags` being an empty array
)
# This will only match `ai_blog.txt`
results = await ze_client.top_snippets(
    collection_name="default",
    query="I'm looking for documents about apples",
    k=5,
    filter={
        # Only `true` if "list:tags" contains EITHER "tech" OR "finance" (or both)
        "list:tags": {
            "$in": ["tech", "finance"]
        }
    },
)
# This will only match `empty.txt`
results = await ze_client.top_snippets(
    collection_name="default",
    query="I'm looking for documents about apples",
    k=5,
    filter={
        # Only `true` if "list:tags" contains NEITHER "tech" NOR "food"
        "list:tags": {
            "$nin": ["tech", "food"]
        }
    },
)
When sending query filters, do not forget that “list of strings” attributes must start with list:! If you accidentally query for tags, then you will not find any results. You must query for list:tags.
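
The set-intersection behavior of $in and $nin on list attributes can be reproduced locally with Python sets. This is a sketch for building intuition, not ZeroEntropy code: `list_in`, `list_nin`, and `select` are hypothetical names, and the three in-memory documents mirror the uploads shown above.

```python
def list_in(doc_values, query_values) -> bool:
    """$in on a list attribute: true if the lists share at least one element."""
    return not set(doc_values).isdisjoint(query_values)


def list_nin(doc_values, query_values) -> bool:
    """$nin on a list attribute: true if the lists share no element."""
    return set(doc_values).isdisjoint(query_values)


# Metadata of the three documents from the upload example.
docs = {
    "ai_blog.txt": {"list:tags": ["blog", "tech"]},
    "food_blog.txt": {"list:tags": ["blog", "food"]},
    "empty.txt": {},  # omission is treated as an empty list
}


def select(predicate, query_values):
    """Return the paths whose "list:tags" satisfy the predicate."""
    return [
        path
        for path, metadata in docs.items()
        if predicate(metadata.get("list:tags", []), query_values)
    ]


print(select(list_in, ["tech", "finance"]))  # ['ai_blog.txt']
print(select(list_nin, ["tech", "food"]))    # ['empty.txt']
```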

Boolean Operators

If you want to combine filters, you can use the boolean operators $and and $or. Each takes an array of filters, where every element is itself a complete filter object. They can also be nested recursively to create a tree of boolean logic.

# For "language" == "en" && "timestamp" > (1 day ago)
results = await ze_client.top_snippets(
    collection_name="default",
    query="I'm looking for documents about apples",
    k=5,
    filter={
        "$and": [
            # Each element of the array is its own filter object
            {
                "language": {
                    "$eq": "en"
                }
            },
            {
                "timestamp": {
                    "$gt": (datetime.now() - timedelta(days=1)).isoformat()
                }
            }
        ]
    },
)
# For
#   "author" == "Nicholas Pipitone"
#   or ("language" == "en" and "timestamp" > (1 day ago))
from datetime import datetime, timedelta
results = await ze_client.top_snippets(
    collection_name="default",
    query="I'm looking for documents about apples",
    k=5,
    filter={
        "$or": [
            {
                "author": {
                    "$eq": "Nicholas Pipitone"
                }
            },
            {
                "$and": [
                    {
                        "language": {
                            "$eq": "en"
                        }
                    },
                    {
                        "timestamp": {
                            "$gt": (datetime.now() - timedelta(days=1)).isoformat()
                        }
                    }
                ]
            }
        ]
    },
)
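
Deeply nested filter dictionaries are easy to get wrong by hand, especially forgetting to wrap each array element in its own braces. One option is to build them with small helper functions. The helpers below (`compare`, `all_of`, `any_of`) are hypothetical and not part of the ZeroEntropy SDK; they only construct plain dictionaries in the shape described above.

```python
from datetime import datetime, timedelta


def compare(attribute: str, op: str, value) -> dict:
    """Build a single-attribute filter such as {"language": {"$eq": "en"}}."""
    return {attribute: {op: value}}


def all_of(*filters: dict) -> dict:
    """Combine filters with $and."""
    return {"$and": list(filters)}


def any_of(*filters: dict) -> dict:
    """Combine filters with $or."""
    return {"$or": list(filters)}


one_day_ago = (datetime.now() - timedelta(days=1)).isoformat()

# "author" == "Nicholas Pipitone"
#   or ("language" == "en" and "timestamp" > (1 day ago))
author_or_recent_english = any_of(
    compare("author", "$eq", "Nicholas Pipitone"),
    all_of(
        compare("language", "$eq", "en"),
        compare("timestamp", "$gt", one_day_ago),
    ),
)
print(author_or_recent_english)
```

The resulting dictionary can be passed as the filter argument of top_snippets in the same way as the hand-written filters above.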