Writing a Pipeline Script

The fastest path is to copy scripts/nomic-chroma.py and adapt it for your stack — it's a complete working example. If you're building your own, this page shows the correct ingestion loop and the operational details you'll need. See Manifest Reference for field details.

The complete ingestion loop

import json, os, sys

path = sys.argv[1] if len(sys.argv) > 1 else os.environ.get("BRAISED_MANIFEST", "dist/manifest.jsonl")
chunks = []
delete_ids = []
with open(path, encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("_deleted"):
            delete_ids.append(record["id"])
        else:
            chunks.append(record)

# Delete before upserting: a changed page emits a deletion record for the old
# chunk AND a new chunk record, so order matters.
# delete_from_store() and embed_and_upsert() are placeholders for your own
# vector-store calls.
for old_id in delete_ids:
    delete_from_store(old_id)

for chunk in chunks:
    embed_and_upsert(chunk)

On the first build there are no deletion records — the same loop handles it without special-casing.
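To see why the delete-first order matters, here is a minimal sketch that runs the same loop against an in-memory dict standing in for a vector store. The apply_manifest helper, the dict store, and the sample records are hypothetical. The sketch also assumes a rebuilt chunk may reuse its old ID, which may not be true of your manifest.

```python
import json

def apply_manifest(lines, store):
    """Apply manifest lines to `store`, a plain dict standing in for a vector store."""
    chunks, delete_ids = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("_deleted"):
            delete_ids.append(record["id"])
        else:
            chunks.append(record)
    # Deletes first: if a new chunk reuses an old ID, deleting afterwards
    # would remove the chunk that was just written.
    for old_id in delete_ids:
        store.pop(old_id, None)  # tolerate IDs that are already gone
    for chunk in chunks:
        store[chunk["id"]] = chunk["content"]  # a real pipeline would embed here
    return store

# Hypothetical records: a first build, then a rebuild after the page changed.
first_build = ['{"id": "a#1", "content": "old text"}', ""]
rebuild = [
    '{"id": "a#1", "_deleted": true}',
    '{"id": "a#1", "content": "new text"}',
]

store = apply_manifest(first_build, {})
store = apply_manifest(rebuild, store)
print(store)
```

The first call sees no deletion records and simply upserts. The second call removes the stale entry before writing the new one, so the store ends up holding only the new text.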

Reference implementation

scripts/nomic-chroma.py is a complete working example against Ollama and Chroma, using Python stdlib only (no pip install). It covers reading, embedding, upsert, deletion handling, and a smoke-test query. Copy and adapt it for your own stack.

python3 scripts/nomic-chroma.py dist/manifest.jsonl

Embedding

Pass chunk["content"] directly to your embedding model — it's plain text with breadcrumb context pre-baked. The field works with any embedding endpoint:

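# OpenAI (third-party openai package)
from openai import OpenAI
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
embedding = openai_client.embeddings.create(
    model="text-embedding-3-small", input=chunk["content"]
).data[0].embedding

# Ollama (third-party requests package)
import requests
embedding = requests.post("http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": chunk["content"]}
).json()["embedding"]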
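If you want to stay on the standard library, as the reference implementation does, the Ollama call can be sketched with urllib instead of requests. This is an illustration under assumptions, not a copy of scripts/nomic-chroma.py: the helper names build_embedding_request and embed and the 30-second timeout are hypothetical, and the endpoint and model are the ones from the Ollama example above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # endpoint from the example above

def build_embedding_request(text, model="nomic-embed-text"):
    """Build (but do not send) the POST request for one chunk's text."""
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(text, timeout=30):
    """Send the request and return the embedding vector as a list of floats."""
    request = build_embedding_request(text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["embedding"]
```

Calling embed(chunk["content"]) requires a running Ollama server with the model pulled. Splitting request construction from sending keeps the payload easy to inspect without a network call.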

Storing metadata for retrieval results

Store url, title, heading, and breadcrumb alongside your vectors so query results are usable:

metadatas.append({
    "url":        chunk["url"],
    "title":      chunk["title"],
    "heading":    chunk["heading"],
    "breadcrumb": " > ".join(chunk["breadcrumb"]),
})
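Some vector stores accept only scalar metadata values, so it can help to flatten each record defensively before storing it. The sketch below assumes a field such as heading may be missing or null for some chunks; the Manifest Reference is the authority on which fields are always present. The chunk_metadata helper and the sample record are hypothetical.

```python
def chunk_metadata(chunk):
    """Flatten one manifest record into string-only metadata values."""
    return {
        "url": chunk.get("url") or "",
        "title": chunk.get("title") or "",
        "heading": chunk.get("heading") or "",
        # The breadcrumb is a list in the manifest; join it into one string.
        "breadcrumb": " > ".join(chunk.get("breadcrumb") or []),
    }

# Hypothetical record with a null heading.
sample = {
    "url": "/guide/install/",
    "title": "Install",
    "heading": None,
    "breadcrumb": ["Guide", "Install"],
}
print(chunk_metadata(sample))
```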

Using a webhook

If you're running a hosted ingest endpoint, use pipeline: http: instead of a script. Braised POSTs the manifest as application/x-ndjson — your endpoint receives the same stream a script would read from disk. See Configuration in the overview.
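A receiving endpoint can reuse the script's logic on the request body. The following is a minimal stdlib sketch, not a production server. It assumes the request carries a Content-Length header and that a 204 response is acceptable, neither of which is confirmed here. The names parse_ndjson, IngestHandler, and serve, and the port, are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_ndjson(body):
    """Split an NDJSON request body (bytes) into chunk records and deletion IDs."""
    chunks, delete_ids = [], []
    for line in body.split(b"\n"):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("_deleted"):
            delete_ids.append(record["id"])
        else:
            chunks.append(record)
    return chunks, delete_ids

class IngestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Assumes Content-Length is present (no chunked transfer encoding).
        length = int(self.headers.get("Content-Length", 0))
        chunks, delete_ids = parse_ndjson(self.rfile.read(length))
        # Apply delete_ids first, then embed and upsert chunks, as in the script loop.
        self.send_response(204)  # assumed status; adjust to what your setup expects
        self.end_headers()

def serve(port=8000):
    """Run the receiver until interrupted. The port is an arbitrary choice."""
    HTTPServer(("", port), IngestHandler).serve_forever()
```

Keeping parse_ndjson separate from the handler lets you exercise the parsing against a saved manifest file without starting a server.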

.braised/ and CI state

Braised stores source-file hashes in .braised/index-state.json to drive incremental manifests. Add .braised/ to .gitignore and cache it in CI to preserve incremental builds between runs. See Build Outputs for details.

Pages excluded after initial indexing

If a page gains llm_exclude: true after it was previously indexed, Braised emits deletion records for all of its old chunk IDs on the next build. The same deletion loop handles them automatically.