Cache Augmented Generation (CAG)

Information

Cache-Augmented Generation (CAG) is an alternative to Retrieval-Augmented Generation (RAG) in which the entire knowledge source is preprocessed and loaded into the LLM’s extended context window up front, rather than retrieved on demand from a vector database. The key-value (KV) cache computed for that context can then be reused across queries.

How it differs from RAG:

Aspect	RAG	CAG
Knowledge loading	Retrieved at query time (top-k chunks)	Loaded upfront into context window
Retrieval step	Required (embedding + vector search)	Not needed
Latency	Higher (retrieval adds round-trip)	Lower (no retrieval step)
Accuracy	Depends on retrieval quality and chunking	No retrieval or chunking errors, since the full source is in context; depends on the model’s long-context recall
Context limit	Can handle large knowledge bases	Bounded by the model’s context window size
Infrastructure	Requires vector DB (Chroma, Qdrant, Weaviate…)	No vector DB needed — simpler deployment
Token cost	Only relevant chunks sent per query	Full knowledge source is part of the context for every query (a KV cache avoids re-processing it)

When to use CAG:

The knowledge base fits within the model’s context window (e.g., 128k–1M tokens).
Low latency is more important than per-query token cost.
You want to avoid the complexity of a vector database and retrieval pipeline.
The knowledge source changes infrequently, so its KV cache can be pre-computed once and reused until the source changes.
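
The first criterion can be checked before committing to CAG. The following is a minimal sketch, assuming a rough ratio of about 4 characters per token for English text; the function names and the `reserved_tokens` default are hypothetical, and a real check should use the target model's own tokenizer.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; ~4 characters per token is only a rule of thumb."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(knowledge: str, context_window: int,
                    reserved_tokens: int = 4096) -> bool:
    """True if the knowledge text leaves `reserved_tokens` free for the
    instructions, the question, and the generated answer."""
    return estimate_tokens(knowledge) <= context_window - reserved_tokens


small = "x" * 400_000  # about 100k estimated tokens
large = "x" * 600_000  # about 150k estimated tokens
print(fits_in_context(small, context_window=128_000))  # True
print(fits_in_context(large, context_window=128_000))  # False
```

Reserving part of the window matters because the question and the generated answer share the same context budget as the knowledge source.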

Limitations:

Context window limits cap the knowledge base size — not suitable for very large document collections.
Higher token cost per query compared to RAG, where only the relevant chunks are sent.
Some models recall information less reliably from very long contexts, particularly from the middle of the context (the lost-in-the-middle problem).
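
The token-cost limitation can be made concrete with simple arithmetic. This is an illustrative sketch only: the chunk size, top-k value, and knowledge-base size are assumed numbers, not measurements, and actual billing depends on the provider (cached prefix tokens may be priced differently or not re-processed at all).

```python
def rag_input_tokens(top_k: int, chunk_tokens: int, query_tokens: int) -> int:
    """Input tokens per query when only the top-k retrieved chunks are sent."""
    return top_k * chunk_tokens + query_tokens


def cag_input_tokens(knowledge_tokens: int, query_tokens: int) -> int:
    """Input tokens per query when the whole knowledge source is in context."""
    return knowledge_tokens + query_tokens


# Assumed example values: 5 chunks of 500 tokens vs. a 100k-token knowledge base.
rag = rag_input_tokens(top_k=5, chunk_tokens=500, query_tokens=50)
cag = cag_input_tokens(knowledge_tokens=100_000, query_tokens=50)

print(rag)                   # 2550
print(cag)                   # 100050
print(f"{cag / rag:.1f}x")   # 39.2x
```

Under these assumptions CAG places roughly 39 times more input tokens in context per query, which is why KV cache reuse is central to making the approach practical.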

Usage, tips and tricks

Use the KV (key-value) cache or prefix-caching features of your inference server so that the knowledge context is processed once and reused, instead of being re-processed on every query; this is the “cache” in CAG.
Pre-process and structure the knowledge document for clarity — the model reads the whole thing, so headings and organization matter.
For models with large context windows (Gemini 1.5, Claude 3, GPT-4 Turbo), CAG is a practical choice for knowledge bases up to a few hundred pages.
Combine with system prompt instructions to guide how the model uses the cached knowledge.
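
The tips above can be sketched as a prompt layout that keeps the instructions and knowledge in a fixed prefix and appends only the question. Servers that offer prefix caching typically reuse cached work only when the prefix matches exactly, so the prefix should be byte-for-byte identical across queries. The function names, the `<knowledge>` delimiters, and the fingerprint helper below are hypothetical illustrations, not part of any particular server's API.

```python
import hashlib


def build_prefix(instructions: str, knowledge: str) -> str:
    """Static part of the prompt: instructions first, then the full knowledge
    source. Built once and reused unchanged for every query."""
    return (
        f"{instructions.strip()}\n\n"
        f"<knowledge>\n{knowledge.strip()}\n</knowledge>\n\n"
    )


def build_prompt(prefix: str, question: str) -> str:
    """Dynamic part: only the question is appended after the fixed prefix."""
    return f"{prefix}Question: {question.strip()}\nAnswer:"


def prefix_fingerprint(prefix: str) -> str:
    """Hash of the prefix; a changed hash means the cached context is stale."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


instructions = "Answer only from the knowledge section. Say so if the answer is not there."
knowledge = "# Returns\nItems can be returned within 30 days.\n\n# Shipping\nOrders ship in 2 days."

prefix = build_prefix(instructions, knowledge)
p1 = build_prompt(prefix, "What is the return window?")
p2 = build_prompt(prefix, "How fast do orders ship?")

# Both prompts share the exact same prefix, so a prefix cache can be reused.
print(p1.startswith(prefix) and p2.startswith(prefix))  # True
```

Comparing `prefix_fingerprint` values before and after a knowledge update is one simple way to decide when the cached context needs to be rebuilt.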
