stack-profile:voice-ai-agent
Voice AI Agent Stack (Whisper, TTS, WebSocket, FastAPI, React) overview
An end-to-end voice-powered AI agent architecture for building conversational interfaces with speech input and output. OpenAI Whisper (or whisper.cpp) handles automatic speech recognition, converting audio streams to text. A text-to-speech engine synthesizes agent responses back to audio. WebSocket connections enable full-duplex, low-latency audio streaming between client and server. FastAPI serves as the async backend, coordinating ASR, LLM inference, and TTS in a streaming pipeline. React powers the frontend with audio capture, playback, and visual feedback. Python handles all server-side logic including audio preprocessing and LLM integration. This stack suits voice assistants, call center copilots, and accessibility-first applications. The main tradeoff is latency — the ASR-to-TTS round trip must stay under 1-2 seconds for natural conversation flow.
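The streaming pipeline described above can be sketched as a single conversational turn: audio bytes in, ASR, LLM, TTS, audio bytes out, with the round-trip latency measured. This is a minimal sketch with the Whisper, LLM, and TTS stages stubbed out as placeholder coroutines (the function names `transcribe`, `generate_reply`, and `synthesize` are illustrative, not a fixed API); in the actual FastAPI service each turn would run inside a WebSocket endpoint, reading client audio with `websocket.receive_bytes()` and streaming synthesized audio back with `websocket.send_bytes()`.

```python
import asyncio
import time

# Placeholder stages: in a real deployment these would call Whisper (ASR),
# an LLM, and a TTS engine. They are stubbed here so the pipeline shape
# is visible without any model dependencies.
async def transcribe(audio: bytes) -> str:
    await asyncio.sleep(0)  # Whisper inference would happen here
    return "hello agent"

async def generate_reply(text: str) -> str:
    await asyncio.sleep(0)  # LLM inference would happen here
    return f"you said: {text}"

async def synthesize(text: str) -> bytes:
    await asyncio.sleep(0)  # TTS synthesis would happen here
    return text.encode()

async def handle_turn(audio: bytes) -> tuple[bytes, float]:
    """One conversational turn: ASR -> LLM -> TTS, with latency measured."""
    start = time.perf_counter()
    text = await transcribe(audio)
    reply = await generate_reply(text)
    speech = await synthesize(reply)
    return speech, time.perf_counter() - start

if __name__ == "__main__":
    speech, latency = asyncio.run(handle_turn(b"\x00" * 320))
    print(speech)
    # The 1-2 second budget from the overview is the ceiling for this
    # whole turn; with stubs it completes in microseconds.
    print(f"turn latency: {latency:.6f}s")
```

Because the stages are awaited sequentially, the turn latency is the sum of ASR, LLM, and TTS time; real implementations usually stream partial ASR text into the LLM and partial LLM text into TTS to keep the perceived latency inside the budget.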
Attributes
Outgoing edges
- domain:ml-ai · Domain · ML/AI
- domain:frontend · Domain · Frontend
- framework:fastapi · Framework · FastAPI
- framework:react · Framework · React
- language:python · Language · Python
- language:typescript · Language · TypeScript
- library:websockets · Library · websockets
- tool:docker · Tool · Docker
- library:uvicorn · Library · Uvicorn
- workflow:prompt-engineering-iteration · Workflow · Prompt Engineering Iteration
- workflow:agent-evaluation-cycle · Workflow · Agent Evaluation Cycle
- skill-area:audio-processing · SkillArea · Audio Processing Libraries and Services
- skill-area:streaming-realtime-processing · SkillArea · Streaming and Real-time Processing
- skill-area:websocket-design · SkillArea · WebSocket Protocol Design
- skill-area:natural-language-processing · SkillArea · Natural Language Processing
- skill-area:model-serving-deployment · SkillArea · Model Serving and Deployment
- role:ml-engineer · Role · Machine Learning Engineer
- role:fullstack-engineer · Role · Fullstack Engineer
- role:frontend-engineer · Role · Frontend Engineer