AI Interaction Modalities
AI models have evolved from simple text interfaces to complex multimodal systems:
- Text (LLM): Large Language Models process and generate text; they are the foundation of modern AI.
- Vision (VLM): Vision-Language Models can “see” and interpret images and video content.
- Audio: Models can transcribe speech (STT), generate realistic voices (TTS), and even understand music and ambient sounds.
- Multimodal: The ability to combine multiple input types (e.g., text + image) and generate multiple output types (e.g., text + audio).
- Structured Data: Models can generate and parse structured formats like JSON, CSV, and code, enabling programmatic interaction.
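
To make the structured-data point concrete, here is a minimal sketch of how a program might consume a model's JSON output. The `Caption` schema, the `parse_caption` helper, and the hard-coded `reply` string are all hypothetical and chosen only for illustration; no real model or vendor API is called. It uses only Python's standard `json` module.

```python
import json
from dataclasses import dataclass


@dataclass
class Caption:
    """Hypothetical schema we ask the model to follow."""
    text: str
    tags: list[str]


def parse_caption(raw: str) -> Caption:
    """Parse a model reply and validate it against the Caption schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Model output is not guaranteed to be valid JSON.
        raise ValueError(f"reply is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    text = data.get("text")
    tags = data.get("tags")
    if not isinstance(text, str):
        raise ValueError("'text' must be a string")
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        raise ValueError("'tags' must be a list of strings")

    return Caption(text=text, tags=tags)


# Hard-coded stand-in for a model reply (assumption: the model was
# instructed to answer with exactly this JSON shape).
reply = '{"text": "A dog on a beach", "tags": ["dog", "beach"]}'
caption = parse_caption(reply)
print(caption.tags)
```

Validating before use matters because a model can return malformed or off-schema output; failing loudly with `ValueError` lets the caller retry or fall back instead of passing bad data downstream.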