AI
Information
This page is an overview of the modern AI landscape with links to subtopics. Artificial intelligence in this context primarily refers to systems built on Large Language Models (LLMs) and related neural architectures.
Key categories:
- LLM (Large Language Model): Text-based models trained on large corpora — the foundation of modern AI assistants, coding tools, and reasoning systems. Examples: GPT-4, Claude, Gemma, Llama, Mistral.
- VLM (Vision-Language Model): Models that process both images and text — used for image understanding, OCR, visual question-answering, and multimodal assistants.
- Audio models: Speech-to-text (STT), text-to-speech (TTS), music generation, and ambient sound understanding.
- Multimodal models: Combine multiple input and output types (text + image + audio) in a single model.
- AI Agents: Autonomous systems that use LLMs with tool calling to plan and execute multi-step tasks. See agent.md.
AI Interaction Modalities
AI models have evolved from simple text interfaces to complex multimodal systems:
- Text (LLM): Large Language Models process and generate text — the foundation of modern AI.
- Vision (VLM): Vision-Language Models can interpret images and video content.
- Audio: Models can transcribe speech (STT), generate voices (TTS), and understand music and ambient sounds.
- Multimodal: Combines multiple input types (e.g., text + image) and generates multiple output types.
- Structured Data: Models can generate and parse structured formats like JSON, CSV, and code.