Cache Augmented Generation (CAG)

Information

Cache Augmented Generation (CAG) is an alternative to Retrieval-Augmented Generation (RAG) in which the entire knowledge source is pre-processed and loaded directly into the LLM’s extended context window at inference time, rather than being retrieved on demand from a vector database.
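To make the definition concrete, here is a minimal sketch of how a CAG-style prompt might be assembled. The function names (`estimate_tokens`, `build_cag_prompt`), the prompt layout, and the 4-characters-per-token estimate are illustrative assumptions, not part of any particular library; a real setup would use the model's own tokenizer and whatever prompt format it expects.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate. Assumes about 4 characters per token,
    which is only a heuristic; use the model's tokenizer for real counts."""
    return max(1, len(text) // 4)


def build_cag_prompt(documents: list[str], question: str, context_window: int) -> str:
    """Put the whole knowledge source into the prompt, ahead of the question.

    There is no retrieval step: every document is included every time.
    """
    knowledge = "\n\n".join(documents)
    prompt = f"Knowledge:\n{knowledge}\n\nQuestion: {question}\nAnswer:"
    # CAG only works if the full source fits in the model's context window.
    if estimate_tokens(prompt) > context_window:
        raise ValueError("knowledge source does not fit in the context window")
    return prompt


# Hypothetical example documents, for illustration only.
docs = [
    "Refunds are accepted within 30 days.",
    "Shipping takes 3 to 5 business days.",
]
prompt = build_cag_prompt(docs, "How long does shipping take?", context_window=8000)
print(prompt)
```

The resulting string would then be passed to whichever LLM client is in use. Because the knowledge part of the prompt is identical across queries, it is the natural candidate for any pre-processing or caching the serving stack supports.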

How it differs from RAG:

| Aspect | RAG | CAG |
| --- | --- | --- |
| Knowledge loading | Retrieved at query time (top-k chunks) | Loaded upfront into context window |
| Retrieval step | Required (embedding + vector search) | Not needed |
| Latency | Higher (retrieval adds round-trip) | Lower (no retrieval step) |
| Accuracy | Depends on retrieval quality and chunking | No chunking errors; full source is available |
| Context limit | Can handle large knowledge bases | Bounded by the model’s context window size |
| Infrastructure | Requires vector DB (Chroma, Qdrant, Weaviate…) | No vector DB needed; simpler deployment |
| Token cost | Only relevant chunks sent per query | Full knowledge source sent with every query |
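The context-limit and token-cost rows of the comparison come down to simple arithmetic, sketched below. All numbers (knowledge size, chunk size, top-k, context windows, answer budget) are made-up values chosen for illustration, and the helper names are hypothetical; substitute real token counts from your own tokenizer and model.

```python
def cag_input_tokens(knowledge_tokens: int, question_tokens: int) -> int:
    """CAG sends the full knowledge source with every query."""
    return knowledge_tokens + question_tokens


def rag_input_tokens(chunk_tokens: int, top_k: int, question_tokens: int) -> int:
    """RAG sends only the top-k retrieved chunks with every query."""
    return chunk_tokens * top_k + question_tokens


def fits_context(knowledge_tokens: int, question_tokens: int,
                 context_window: int, reserved_for_answer: int) -> bool:
    """CAG is only an option if source, question, and answer budget all fit."""
    return knowledge_tokens + question_tokens + reserved_for_answer <= context_window


# Illustrative assumptions, not measurements.
KNOWLEDGE_TOKENS = 50_000   # size of the whole knowledge source
QUESTION_TOKENS = 30        # size of one user question
CHUNK_TOKENS = 500          # size of one retrieved chunk
TOP_K = 4                   # chunks retrieved per query

cag_cost = cag_input_tokens(KNOWLEDGE_TOKENS, QUESTION_TOKENS)
rag_cost = rag_input_tokens(CHUNK_TOKENS, TOP_K, QUESTION_TOKENS)

print(cag_cost)  # 50030 input tokens per query
print(rag_cost)  # 2030 input tokens per query

# The same source fits a 128k-token window but not a 32k-token one.
print(fits_context(KNOWLEDGE_TOKENS, QUESTION_TOKENS, 128_000, 1_000))  # True
print(fits_context(KNOWLEDGE_TOKENS, QUESTION_TOKENS, 32_000, 1_000))   # False
```

Under these assumed numbers, CAG sends roughly 25 times more input tokens per query than RAG, which is the trade-off for skipping the retrieval step.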

When to use CAG:

Limitations:

Usage, tips and tricks

See also