Skip to the content.

LLM

Information

Tokenizers

Simple stack (local) Ollama (LLM) sentence-transformers (embeddings) SQLite / DuckDB / Chroma / FAISS

IntelliJ Continue.dev Ollama + DeepSeek Coder AGENT.md + TASKS.md

DeepSeek Coder Specialized for code generation, strong for Java and Maven + JUnit, local CPU/GPU use Code Llama Meta model optimized for code generation, 7B–34B parameters Qwen-Coder Open-source model family for multiturn coding tasks, 0.5B–32B+ parameters StarCoder / StarCoderBase Trained heavily on GitHub code, 15B, GPU recommended

Model Size Strengths Llama 1B–405B General-purpose, large ecosystem Qwen 0.5B–235B+ Very strong for coding and multilingual tasks DeepSeek 1.5B–671B Good quality-to-cost ratio, strong reasoning Mistral 3B–141B Fast and efficient Gemma 1B–27B Smaller models for local usage Phi 1.5B–14B Very good with limited resources OLMo 1B–32B Fully open training data Falcon 1B–180B Still used in some enterprises

Model Notes all-MiniLM-L6-v2 Very fast and accurate for passage embeddings nomic-embed-text Open-source, well suited for document indexing bge-small / bge-large Very good for semantic search, open-source text-embedding-3-small / 3-large If you want a more OpenAI-style option, these are common reference models

Coding Agent → DeepSeek / Code Llama / StarCoder Review Agent → LLaMA 3 / Mistral / Falcon Knowledge embeddings → MiniLM / nomic-embed-text / bge Orchestrator → a simple daemon or Zeebe; no LLM required

Embedding / Vector DB, FAISS, Chroma, Qdrant

Agent Tokenization Cost
Cloud GPT / OpenAI automatic token-based
Local LLM (Ollama, LLaMA) local free except hardware
Cloud embedding services automatic token-based
Local embeddings (FAISS + transformers) local free

Chip Huyen book What is inference and what is retrieval

Local RAG PoC 1 LLM 1 vector DB 1 document

MCP

Criterion LISP JSON YAML TOML JS TOON
Ambiguity-free ⚠️ ⚠️ ⚠️
Intent clarity ⚠️ ⚠️ ⚠️
Hierarchy ⚠️ ⚠️ ⚠️
AI-friendly ⭐⭐⭐⭐⭐ ⭐⭐ ⭐⭐ ⭐⭐
Risk of storytelling ⚠️ ⚠️ ⚠️
Suitable as DSL ⚠️

Installation

Rocky Linux

Fedora

FreeBSD

OpenIndiana

Configuration

Usage, tips and tricks

Coding tips and tricks

Terms

Topic What it is Why it matters Practical insights / pro tips
— Model fundamentals —      
LLM (Large Language Model) Neural network that predicts next tokens in text Core AI intelligence layer Not human reasoning — output depends on prompt + context
Model Trained neural network Core engine behind AI systems Stateless unless external memory is added
Token Smallest unit of text Cost + context limitation unit More tokens = higher cost + slower responses
Context window Maximum text model can process at once Limits working memory Old info is dropped → use RAG or summarization
Inference Running the model to generate output Actual usage phase Every response is an inference call
Training Building a model from scratch Creates foundation models Done only by large labs
Pretraining First phase of training on general data Learns language + patterns Base capability layer
Fine-tuning Training on specific dataset Specializes model behavior Used for domain adaptation
Supervised fine-tuning (SFT) Fine-tuning on labeled prompt-response pairs Teaches desired task behavior Standard first alignment step before preference training
Instruction tuning Training on instruction-following examples Improves usability Better prompt adherence
Autonomous Learning (AL) Model improves through self-directed data or tasks Reduces manual supervision need Quality depends heavily on feedback loops and safeguards
ICM (Internal Coherence Maximization) Approach that pushes outputs toward internal harmony Improves consistency Useful concept for reasoning stability, but verify claims
UPFT (Unsupervised Prefix Fine-Tuning) Prefix-based adaptation without full supervised labels Lightweight specialization Useful when labels are limited and prompt prefixes matter
RLHF Human feedback training Alignment & safety Improves helpfulness and reduces harmful outputs
Reinforcement (RLHF/DPO) Preference-based optimization after SFT Aligns model behavior Often used to improve helpfulness, style, or policy fit
DPO (Direct Preference Optimization) Optimizes preferred answers directly from comparisons Simpler alignment alternative Often easier than full RLHF pipelines
State vs stateless LLM Whether memory exists internally Architecture constraint Most LLMs are stateless
Hallucination Model generates incorrect info Reliability risk Reduce with RAG, tools, validation
KV cache Cached attention keys/values during inference Performance optimization Avoids recomputing tokens already seen in context
Context compression / compaction Summarizing or pruning conversation history Extends effective context Prevents context overflow in long sessions
— Inference parameters —      
Temperature Controls randomness of output Output variability 0 = deterministic, 1+ = creative; use low for code
Top-p (nucleus sampling) Samples from smallest token set covering p% prob mass Balances diversity + coherence Usually 0.9–0.95; combine with temperature
Top-k Limits sampling to k most likely tokens Reduces incoherence Alternative to top-p; k=1 is greedy
Max tokens Maximum output length Cost + latency control Set per-request; affects truncation
Stop sequences Tokens that terminate generation Output boundary control Useful for structured output parsing
Streaming Returning tokens as generated Perceived responsiveness Essential for interactive UIs
— Model optimization —      
Quantization Reducing model weight precision (e.g. Q4, Q8) Smaller size, faster inference Trade-off: lower quality at extreme compression
GGUF File format for quantized local models Enables local LLM deployment Used by llama.cpp and Ollama
LoRA (Low-Rank Adaptation) Lightweight fine-tuning via rank decomposition Efficient specialization Trains <1% of parameters; composable
QLoRA Quantized LoRA fine-tuning Fine-tuning on consumer GPU Combines 4-bit quant with LoRA adapters
— Prompting —      
Prompt Input instructions to model Direct control interface Structure strongly affects output
Prompt engineering Designing effective prompts Improves output quality role → goal → constraints → examples
System prompt Hidden high-priority instructions Controls behavior Defines rules, tone, identity
Rules System constraints on behavior Safety + consistency layer Includes safety, tool, formatting rules
Skills Reusable prompt/tool capability units Modularity Composable behaviors triggered by user or system
Zero-shot prompting Task given with no examples Tests raw model capability Works well for simple tasks; fails on complex ones
Few-shot prompting Task given with a few examples in prompt Improves output format/accuracy 3–5 examples usually sufficient
Role prompting Assigning model a persona Shapes response style “You are a senior Java engineer…”
Chain-of-thought (CoT) Model reasons step-by-step before answering Improves complex reasoning Add “think step by step” or use structured scratchpad
Context engineering Crafting exactly what goes into the context window Output quality multiplier More impactful than prompt wording alone
Context stuffing Overloading prompt with irrelevant data Degrades performance Keep input structured and minimal
— Memory & retrieval —      
Context (short-term memory) Active input window Immediate reasoning Limited by token window
Memory (persistent) External stored knowledge Long-term continuity Usually vector DB + retrieval
Conversational memory Session-only memory Dialogue coherence Lost when context fills
Agent memory Structured long-term memory Enables agent intelligence Built via RAG + embeddings
Episodic memory Stored past interactions/events Continuity across sessions Indexed by time or event type
Semantic memory Stored facts and knowledge Domain knowledge layer Backed by vector DB or knowledge graph
Procedural memory Stored skills and workflows Behavior reuse Encoded as tool definitions or prompt templates
RAG (Retrieval-Augmented Generation) Retrieves external knowledge before answering Extends model knowledge Core modern AI architecture
Embeddings Vector representation of text meaning Enables semantic search Finds meaning, not keywords
Vector database Stores embeddings for retrieval Retrieval system backbone Powers RAG and memory; FAISS, Chroma, Qdrant, pgvector
Chunking Splitting data into pieces for indexing Enables retrieval Required for large documents
Semantic chunking Splitting by meaning boundaries, not fixed size Better retrieval accuracy Preferred over fixed-size chunks
Hybrid search Combining keyword (BM25) + semantic (vector) search Best retrieval recall Standard production RAG pattern
BM25 Classical keyword ranking algorithm Exact-match retrieval Complements vector search in hybrid setups
Reranking Secondary scoring of retrieved results Precision improvement Cross-encoder models re-score top-k candidates
RAG filtering Filtering retrieved data before injection Security + correctness Prevent malicious or irrelevant data
— Agentic patterns —      
Agent LLM + tools + memory + loop Turns model into actor Can execute tasks autonomously
Agent loop plan → act → observe → refine Core agent behavior Bad loops cause instability
ReAct (Reason + Act) Interleaves reasoning traces with tool actions Transparent decision-making Standard agentic pattern; reduces hallucination
Tree-of-thought (ToT) Explores multiple reasoning branches Better complex problem solving Higher token cost; use for hard planning tasks
Reflection / self-correction Agent critiques and revises its own output Quality improvement Add a critic step after executor
Human-in-the-loop (HITL) Human approval at key decision points Safety + oversight Required for irreversible or high-risk actions
Planner Creates step-by-step plan Structure Breaks tasks into actions
Executor Executes actions Action layer Uses tools, APIs, code
Critic / reviewer Evaluates outputs Quality control Enables self-correction
Router agent Dispatches tasks to specialized sub-agents Scalable multi-agent design Routes by intent, domain, or capability
Supervisor / worker pattern Central agent coordinates worker agents Parallel workload decomposition Worker agents are stateless; supervisor manages state
Sub-agent Specialized internal agent Modular architecture e.g. coder / tester / planner
Multi-agent system Multiple cooperating AI agents Parallelism + specialization Harder to debug; needs tracing
Agent-to-Agent (A2A) Protocol for direct agent-to-agent communication Interoperability standard Google-led standard; complements MCP
Feedback loop (agentic) Agent output becomes next input Iterative refinement Can diverge — add termination conditions
Looping failure modes Agent stuck in infinite or oscillating loop Stability risk Add step limits, progress checks, and human escalation
Tool use / function calling LLM calls external functions Real-world interaction Essential for agents
Structured output Model returns JSON/schema-constrained response Reliable parsing Use tool-calling or guided generation
Orchestration Coordination of agents/tools System control Manages workflows
Workflow (agentic) Structured AI process Production reliability Hybrid deterministic + AI
Planning vs execution separation Split reasoning and action Improves reliability Standard agent pattern
Autonomy level Degree of independence Capability measure chatbot → full autonomous system
Hooks Event-based triggers in systems Automation layer Trigger actions on events like file change or tool call
— MCP (Model Context Protocol) —      
MCP (Model Context Protocol) Open standard for AI model ↔ data source connections Universal integration layer Replaces per-service custom integrations
MCP Host AI application consuming MCP servers Integration entry point Claude Desktop, IDEs, custom agents
MCP Server Exposes tools/resources/prompts via MCP Capability provider One server per data source or API
MCP Client Protocol implementation inside the host Communication layer Handles discovery, calls, and responses
MCP Transport How host and server communicate Deployment flexibility stdio for local; HTTP+SSE for remote
MCP Tools Callable functions exposed by MCP server Action surface Model decides when to call them
MCP Resources URI-addressed data exposed by MCP server Data access layer Files, DB rows, API results
MCP Prompts Reusable prompt templates from MCP server Standardized instructions Server-side prompt management
MCP Sampling Server requests a completion from the host model Server-initiated reasoning Enables complex server-side workflows
— Security —      
Guardrails Constraints on outputs/actions Safety layer Prevent unsafe or invalid actions
Jailbreak Attempt to bypass safety rules Security risk Requires filtering and guardrails
Prompt injection Malicious input manipulation Major AI security risk Treat all external input as untrusted
Sandboxing Isolated execution environment Safe tool use Prevents system damage from agent actions
Least privilege Minimal permissions granted Risk reduction Core security principle for agents
Zero Trust Nothing trusted by default Security foundation Verify every action, every call
Zero Retention No data stored after processing Privacy protection Used in enterprise AI systems
Content filtering Blocking unsafe inputs or outputs Harm prevention Pre- and post-generation filtering
Red teaming Adversarial testing of AI systems Finds safety gaps Standard before production deployment
Model output validation Checking model response against schema/rules Correctness + safety Reject or retry on schema violations
PII Personal identifiable data Privacy risk Must be protected; redact before sending to model
Audit logs Action history Debugging + compliance Required for traceability in agentic systems
Secure tool use Safe function execution with input/output validation Prevents abuse Validate before and after every tool call
Policy engine Rule system for behavior Central control layer Separates logic from model
Encryption Securing data at rest and in transit Security baseline Required everywhere
Authentication Identity verification Access control User/system identity
Authorization Permission control Limits actions Critical for agents acting on behalf of users
Data retention policy Rules for storing data Compliance Defines storage duration
Data minimization Collect minimal data Privacy compliance Reduces risk surface
Redaction Removing sensitive data from logs or prompts Privacy protection Apply before storing or training
Rate limiting Limits request volume Abuse prevention API protection and cost control
Multi-tenancy isolation User data separation Data safety SaaS requirement
Model governance AI oversight policies and approvals Enterprise control Policies and change approvals
Compliance (GDPR etc.) Legal data rules Mandatory requirement Impacts storage, training, and data sharing
— Observability —      
Observability Monitoring system behavior Debugging Logs, metrics, traces
Tracing Step-by-step execution logs Agent debugging Critical for multi-agent systems
— Frameworks & tools —      
Ollama Local LLM serving runtime Run models without cloud Supports Llama, Mistral, DeepSeek, etc.
LangChain Framework for chaining LLM calls and tools Rapid agent prototyping Large ecosystem; can become complex
LangGraph Graph-based multi-agent workflow framework Stateful agent workflows Built on LangChain; supports cycles
LlamaIndex RAG and data ingestion framework Document-based AI systems Strong ingestion pipeline; many connectors
CrewAI Role-based multi-agent framework Structured team-of-agents Easy to define roles; opinionated
AutoGen Microsoft multi-agent conversation framework Automated agent collaboration Agents converse to solve tasks
Semantic Kernel Microsoft SDK for LLM + plugin integration Enterprise .NET/Python AI apps Strong Azure + OpenAI integration
— Agent state & task management —      
Scratchpad / working memory Temporary reasoning space within a single run Intermediate computation Invisible to user; used for CoT and planning steps
Agent state machine Explicit states + transitions for agent lifecycle Predictable behavior States: idle → planning → executing → verifying → done
Checkpoint / resume Saving agent progress for recovery or continuation Fault tolerance Required for long-running tasks exceeding context window
Task decomposition Breaking a goal into ordered sub-tasks Manageability Planner output; enables parallelism
Parallelism vs sequential execution Running agent tasks concurrently vs in order Performance vs correctness Use parallel for independent tasks; sequential for deps
Task queue Ordered list of pending agent tasks Load management Decouple producer from consumer; use priority queues
Dead letter queue (agent) Queue for failed/unresolvable tasks Failure isolation Prevents silent drops; enables human review
Priority queue Task ordering by urgency or importance Resource allocation High-priority tasks preempt lower ones
Token budget Maximum tokens allocated per task or session Cost + stability control Enforce per-request and per-session limits
Shared memory (multi-agent) Writable memory accessible by all agents in a system Coordination Needs locking or versioning to prevent conflicts
Blackboard pattern Shared data structure agents read/write to coordinate Decoupled multi-agent design Classic AI architecture; producer–consumer per agent
— Reliability & resilience —      
Retry strategy Re-attempting failed calls with rules Transient fault tolerance Use exponential backoff + jitter; cap max attempts
Exponential backoff Increasing delay between retries Prevents thundering herd Double delay each retry; add random jitter
Circuit breaker Stops calls to failing service after threshold Failure containment Open → half-open → closed state machine
Idempotency Same operation produces same result when repeated Safe retries Design all tool calls to be idempotent
Timeout handling Aborting calls that exceed a time limit Prevents infinite waits Set per-tool and per-agent-loop timeouts
Graceful degradation / fallback Reduced functionality when component fails Availability Fallback to simpler model or cached response
Compensation / rollback Undoing effects of a failed multi-step action Data consistency Required for agents that write to external systems
Saga pattern Distributed transaction via compensating steps Consistency without locking Each agent step has a compensating undo step
Error classification Distinguishing transient vs fatal errors Correct retry logic Transient: retry; fatal: escalate to human
Step limit / max iterations Hard cap on agent loop cycles Prevents runaway agents Always set; log and escalate when hit
— Evaluation & testing —      
Evals (evaluation framework) Systematic testing of LLM / agent output quality Measures real capability Define before building; run on every change
LLM-as-judge Using another LLM to score outputs Scalable quality assessment Use stronger model as judge; watch for self-serving bias
Golden dataset Curated input/expected-output pairs Regression test baseline Must be representative; update as system evolves
Hallucination detection Checking if output contradicts source or facts Reliability assurance Compare against retrieved context or known ground truth
Retrieval recall / precision Measures how well RAG finds the right chunks RAG quality Low recall → missed answers; low precision → noise
Regression testing (agents) Re-running evals after any model or prompt change Prevent quality degradation Automate in CI/CD pipeline
A/B testing (model versions) Comparing two model or prompt variants in production Data-driven decisions Route % of traffic; measure task success rate
Benchmarking Running standard tasks to compare models Model selection MMLU, HumanEval, SWE-bench are common benchmarks
Shadow mode testing Running new agent in parallel without acting Safe validation Compare shadow vs production outputs before cutover
— Cost & performance —      
Prompt caching Reusing computed prefix context across requests Significant cost reduction Supported by Anthropic, OpenAI; cache system prompts
Batch API / async inference Submitting many requests processed offline Lower cost, higher throughput 50% cost reduction typically; not for real-time
Model routing Directing requests to cheap vs powerful models Cost optimization Simple tasks → small model; complex → large model
Token cost tracking Accounting for input + output tokens per operation Budget management Track per agent, per user, per workflow
Latency SLA Maximum acceptable response time User experience First-token latency vs full-response latency differ
Caching (semantic) Reusing answers for semantically similar queries Speed + cost Use embedding similarity to detect cache hits
— Deployment & infrastructure —      
LLM proxy / model gateway Middleware routing and managing LLM API calls Centralized control Handles routing, auth, logging, fallback; e.g. LiteLLM
Agent registry / catalog Inventory of available agents and their capabilities Discoverability Enables dynamic routing and composition
Model versioning Tracking deployed model versions Rollback + reproducibility Pin model version per use-case; test before upgrade
Canary release (agents) Gradual rollout of new agent/model version Risk reduction Start at 1–5% traffic; monitor evals before full rollout
Blue/green deployment (models) Two live environments for instant model switchover Zero-downtime updates Keep old version hot during cutover
Sidecar agent Agent running alongside a service to assist it Transparent augmentation e.g. auto-generates docs, tests, or logs alongside service
Event-driven agent Agent triggered by events rather than polling Reactive architecture Subscribe to queues/topics; act on arrival
Message queue / event bus Async communication channel between agent components Decoupling RabbitMQ, Kafka, Zeebe; enables retry and dead-letter
— Advanced RAG & knowledge —      
Self-RAG Agent decides when and what to retrieve Adaptive retrieval Avoids unnecessary retrieval on simple queries
Corrective RAG (CRAG) Evaluates retrieved docs and corrects if poor quality Retrieval quality guard Falls back to web search if local retrieval scores low
GraphRAG RAG over knowledge graphs instead of flat vectors Relational knowledge retrieval Better for interconnected facts; higher setup cost
Knowledge graph Graph of entities and relationships Structured domain knowledge Enables reasoning over relationships, not just similarity
Document understanding Extracting structured info from PDFs, tables, images Ingestion quality Pre-process before chunking; use vision models for images
Data lineage Tracking data origin and transformations Auditability + compliance Know what source backs every generated fact
— Protocols & standards —      
JSON-RPC Lightweight RPC protocol using JSON MCP underlying transport Stateless; request/response + notification patterns
SSE (Server-Sent Events) Server pushes events to client over HTTP Real-time streaming Used by MCP HTTP transport; one-directional
OpenAPI / Swagger Standard for describing REST API schemas Tool definition format Agents can consume OpenAPI specs to discover tools
JSON Schema Vocabulary for describing JSON data structures Structured output validation Define expected output shape; validate before parsing
gRPC High-performance RPC framework using Protobuf Agent microservice comms Faster than REST; schema-first; streaming support
— Multimodal —      
Multimodal input Model processes text, images, audio, or video Richer context Vision models can read screenshots, diagrams, documents
Vision / image understanding Model analyzes image content Document + UI automation Enables agents to read screenshots, charts, OCR text
Audio transcription Converting speech to text Voice-driven agents Whisper and similar models; feed transcript to LLM
— AI-assisted development —      
CLAUDE.md / AGENT.md Project-level instructions file for AI agents Persistent agent context Defines rules, structure, conventions for the codebase
Codebase indexing Semantic project mapping Global code understanding Without it AI is file-local
AI pair programming Developer + AI collaborating on code in real time Speed + quality Best with short tasks and human review
Spec-driven development Build from formal specifications Reduces AI chaos Best practice for agentic coding projects
Vibe coding Intuitive AI-assisted coding without formal specs Fast prototyping Risk: technical debt accumulation
Iteration loop build → test → fix cycle with AI Core agentic dev cycle Iteration beats perfect prompting
Claude Code / agent mode Agentic coding system by Anthropic Repo-level automation Reads, edits, runs, and verifies code end-to-end
Cursor AI-first IDE Codebase-aware AI Strong refactoring + multi-file context

See also