Agentic AI Atlas

Wiki article

library/ai-agents-conversational

Reading · 22 min

AI Agents and Conversational AI Specialization (Library) reference

AI Agents and Conversational AI specialization focuses on designing, building, and deploying intelligent conversational systems powered by large language models (LLMs), natural language understanding (NLU), and dialogue management. This discipline encompasses chatbots, virtual assistants, autonomous agents, multi-agent systems, and sophisticated conversational interfaces that can understand context, maintain state, and execute complex tasks through natural language interaction.

Page nodewiki/library/ai-agents-conversational.mdNearby pages · 111Documents · 1

Continue reading

Nearby pages in the same section.

Documented graph nodes

Records linked directly from this page’s Page node.

specialization:ai-agents-conversational · Specialization

AI Agents and Conversational AI Specialization

Overview

Roles and Responsibilities

Conversational AI Engineer

A conversational AI engineer designs and implements chatbots, virtual assistants, and conversational interfaces.

**Core responsibilities:**

Design and implement dialogue flows and conversation logic
Integrate NLU/NLP models for intent recognition and entity extraction
Build conversation state management and context tracking
Implement multi-turn conversation handling
Develop fallback strategies and error recovery mechanisms
Integrate with backend services and APIs
Create training data and fine-tune language models
Monitor conversation quality and user satisfaction metrics

**Skills required:**

Natural language processing and understanding
Machine learning and deep learning fundamentals
Python, JavaScript/TypeScript for bot frameworks
Dialogue management systems and state machines
API integration and webhook development
Prompt engineering and LLM fine-tuning
Analytics and conversation flow optimization

AI Agent Developer

An AI agent developer creates autonomous agents that can perceive, reason, plan, and act to achieve goals.

**Core responsibilities:**

Design agent architectures (ReAct, ReWOO, Plan-and-Execute, LangGraph)
Implement tool use and function calling capabilities
Build memory systems (short-term, long-term, semantic)
Create planning and reasoning modules
Develop multi-agent coordination and communication
Implement safety guardrails and content filtering
Optimize agent performance and token efficiency
Design evaluation frameworks for agent behaviors

**Skills required:**

LLM APIs and prompt engineering
Agent frameworks (LangChain, LlamaIndex, AutoGPT, CrewAI)
Vector databases and semantic search
Retrieval-Augmented Generation (RAG)
Function calling and tool integration
Reinforcement learning from human feedback (RLHF)
Software engineering and system design

Conversation Designer

A conversation designer crafts the user experience of conversational interfaces, focusing on natural, intuitive interactions.

**Core responsibilities:**

Design conversation flows and dialogue trees
Write chatbot scripts and response templates
Create persona and brand voice guidelines
Design error handling and repair strategies
Map user journeys and conversation paths
Conduct user research and usability testing
Collaborate with linguists and UX designers
Create conversation design documentation

**Skills required:**

UX design and user research methodologies
Linguistics and pragmatics
Storytelling and copywriting
Flowchart and dialogue mapping tools
Understanding of NLU capabilities and limitations
Human-computer interaction principles
Empathy and user-centered design thinking

Other Related Roles

**NLP Engineer** - Specializes in natural language understanding models, intent classification, entity recognition
**LLM Ops Engineer** - Manages deployment, monitoring, and optimization of large language model systems
**Prompt Engineer** - Designs and optimizes prompts for LLM-based systems
**AI Safety Researcher** - Ensures AI agents behave safely, ethically, and aligned with human values
**Multi-Agent Systems Architect** - Designs complex systems with multiple coordinating agents
**Voice AI Developer** - Specializes in speech-to-text, text-to-speech, and voice-based interactions

Agent Architectures

ReAct (Reasoning and Acting)

An agent architecture that interleaves reasoning traces with action execution.

**Components:**

**Thought** - Agent reasons about the current state and what to do next
**Action** - Agent executes a tool or function call
**Observation** - Agent receives the result of the action
**Loop** - Repeat thought-action-observation until task completion

**Characteristics:**

Explicit reasoning traces improve interpretability
Self-reflection and error correction capabilities
Works well for multi-step tasks requiring planning
Handles complex tool use scenarios
Can recover from mistakes through reasoning

**Use cases:**

Research and information gathering agents
Complex problem-solving tasks
Debugging and troubleshooting assistants
Mathematical reasoning and code generation

ReWOO (Reasoning WithOut Observation)

An agent architecture that plans all actions upfront before executing them, reducing LLM calls.

**Components:**

**Planning phase** - Generate complete action plan with dependencies
**Execution phase** - Execute all tools in parallel where possible
**Response phase** - Synthesize results into final answer

**Characteristics:**

More efficient than ReAct (fewer LLM calls)
Enables parallel tool execution
Better for predictable workflows
Less adaptive to unexpected results
Lower latency for multi-step tasks

**Use cases:**

Data analysis pipelines
Batch processing tasks
Workflows with clear dependencies
Cost-sensitive applications

Plan-and-Execute

An agent architecture that separates high-level planning from low-level execution.

**Components:**

**Planner** - Creates high-level task decomposition
**Executor** - Carries out individual steps
**Replanner** - Updates plan based on execution results
**Monitor** - Tracks progress and detects failures

**Characteristics:**

Hierarchical task decomposition
Can handle long-horizon tasks
Adapts plan based on intermediate results
Better resource allocation
Clear separation of concerns

**Use cases:**

Project management and task automation
Complex workflow orchestration
Long-running research tasks
Multi-step creative projects

LangGraph (Stateful Multi-Agent Workflows)

A framework for building cyclic graphs of agents with explicit state management.

**Components:**

**Nodes** - Individual agents or processing steps
**Edges** - Transitions between nodes (conditional or unconditional)
**State** - Shared state accessible to all nodes
**Checkpoints** - Persistence and resumability

**Characteristics:**

Explicit control flow with cycles and conditionals
Built-in state management and persistence
Human-in-the-loop integration points
Time travel and debugging capabilities
Composable and reusable agent graphs

**Use cases:**

Complex decision-making workflows
Multi-agent collaboration systems
Customer support automation
Agentic applications requiring human oversight

Multi-Agent Systems

Systems with multiple specialized agents working together to solve problems.

**Patterns:**

**Hierarchical** - Manager agent delegates to specialist agents
**Cooperative** - Agents collaborate and share information
**Competitive** - Agents debate or vote on solutions
**Sequential** - Agents process tasks in a pipeline
**Parallel** - Agents work independently and results are aggregated

**Characteristics:**

Specialization and division of labor
Scalability through parallel processing
Robustness through redundancy
Complex coordination requirements
Emergent behaviors from agent interactions

**Use cases:**

Software development teams (architect, developer, tester, reviewer)
Research and analysis (researcher, fact-checker, synthesizer)
Content creation (writer, editor, reviewer)
Complex simulations and game playing

Conversational AI Components

Natural Language Understanding (NLU)

**Intent Recognition:**

Classify user input into predefined intent categories
Handle multi-intent utterances
Confidence scoring and threshold tuning
Out-of-scope intent detection
Few-shot and zero-shot intent classification

**Entity Extraction:**

Named entity recognition (NER) for people, places, organizations
Custom entity types (products, dates, quantities)
Slot filling for dialogue state
Entity linking and normalization
Composite entity extraction

**Sentiment Analysis:**

Emotion detection (positive, negative, neutral)
Fine-grained sentiment scoring
Aspect-based sentiment analysis
Emotion classification (joy, anger, fear, surprise)

**NLU Approaches:**

Rule-based pattern matching (regex, keyword)
Traditional ML (SVM, CRF, logistic regression)
Deep learning (BERT, RoBERTa, transformer models)
LLM-based (GPT-4, Claude for zero-shot/few-shot)
Hybrid approaches combining multiple methods

Dialogue Management

**State Management:**

Conversation context tracking
Slot filling and form handling
User profile and preferences
Session management
Multi-turn context window

**Dialogue Policies:**

Rule-based (if-then conditions, finite state machines)
Retrieval-based (select from predefined responses)
Generative (LLM-generated responses)
Hybrid approaches
Reinforcement learning policies

**Conversation Flow:**

Happy path scenarios
Error handling and clarification requests
Context switching and digression handling
Conversation repair strategies
Proactive suggestions and nudges

**Dialogue State Tracking (DST):**

Track user goals and intents across turns
Update belief state with each user input
Handle slot value changes and corrections
Maintain conversation history
Support multi-domain conversations

Natural Language Generation (NLG)

**Template-Based Generation:**

Predefined response templates with variable slots
Simple, predictable, and fast
Limited flexibility and naturalness
Good for structured domains

**Retrieval-Based Generation:**

Select best response from corpus
Uses similarity matching
Requires large response database
Consistent quality

**Neural NLG:**

Seq2seq models with attention
Transformer-based generation (GPT, T5)
Fine-tuned language models
Controllable generation with attributes

**LLM-Based Generation:**

Zero-shot and few-shot prompting
Instruction-tuned models (GPT-4, Claude)
RAG for grounded generation
Chain-of-thought for reasoning
Constitutional AI for safe outputs

**NLG Considerations:**

Consistency with brand voice and persona
Contextual appropriateness
Diversity and avoiding repetition
Safety and content filtering
Hallucination mitigation

Memory Systems

**Short-Term Memory:**

Recent conversation history (last N turns)
Current dialogue state and context
Active goals and incomplete tasks
Session-based, cleared after conversation ends

**Long-Term Memory:**

User preferences and profile information
Historical interactions and patterns
Learned facts about the user
Cross-session persistence

**Semantic Memory:**

Vector database storage (Pinecone, Weaviate, Chroma)
Embedding-based retrieval
Relevant context injection into prompts
Knowledge base and documentation
Episodic memory of past interactions

**Memory Strategies:**

Sliding window with summarization
Hierarchical memory (short, medium, long-term)
Selective retention of important information
Memory consolidation and compression
Retrieval-augmented generation (RAG)

Chatbot Frameworks and Platforms

LLM Agent Frameworks

**LangChain:**

Comprehensive framework for LLM applications
Chains, agents, tools, and memory abstractions
LangGraph for stateful multi-agent workflows
LangSmith for observability and debugging
Extensive integrations with LLM providers and tools

**LlamaIndex:**

Data framework for LLM applications
Advanced RAG patterns and indexing strategies
Query engines and routers
Sub-question decomposition
Multi-document agents

**Semantic Kernel (Microsoft):**

SDK for integrating LLMs into applications
Planners and plugins architecture
Memory and skills abstractions
Native support for C#, Python, Java
Enterprise-ready with Azure integration

**Haystack:**

End-to-end NLP framework
Pipeline-based architecture
RAG and semantic search focus
Document processing and indexing
Production-ready with monitoring

**AutoGPT / BabyAGI:**

Autonomous agent frameworks
Self-directed task creation and execution
Goal-oriented behavior
Experimental and research-focused
Demonstrates emergent agent capabilities

**CrewAI:**

Multi-agent orchestration framework
Role-based agent design
Task delegation and coordination
Built on LangChain
Simpler API for multi-agent systems

Traditional Chatbot Platforms

**Rasa:**

Open source conversational AI platform
Custom NLU and dialogue management
Machine learning-based intent classification
Rule-based and ML dialogue policies
Self-hosted and full control over data

**Botpress:**

Open source chatbot building platform
Visual flow builder
NLU engine with slot filling
Multi-channel deployment
Extensible with custom code

**Microsoft Bot Framework:**

Enterprise bot development framework
Azure Bot Service integration
Adaptive dialogs and cards
Multi-channel support (Teams, Slack, web)
Language Understanding (LUIS) integration

**Dialogflow (Google):**

Conversational AI platform
Intent and entity recognition
Multi-turn conversations
Rich response types
Integration with Google Cloud

**Amazon Lex:**

AWS service for building conversational interfaces
Powered by Alexa technology
Automatic speech recognition (ASR)
Integration with AWS Lambda
Omnichannel deployment

**IBM Watson Assistant:**

Enterprise conversational AI platform
Intent classification and entity extraction
Dialog skill and search skill
Integration with Watson Discovery
Enterprise features (analytics, versioning)

Conversational UI Frameworks

**Voiceflow:**

Visual conversation design platform
No-code/low-code builder
Prototyping and production deployment
Collaboration features
Multi-channel publishing

**Landbot:**

No-code chatbot builder
Visual drag-and-drop interface
Conversational landing pages
WhatsApp and web chat support
Integration with popular tools

**ManyChat / Chatfuel:**

Facebook Messenger bot platforms
Visual flow builders
Marketing and e-commerce focus
Broadcasting and sequences
Limited to messaging platforms

Tool Use and Function Calling

Function Calling Patterns

**Single Function Call:**

Agent selects one tool to call
Waits for result before continuing
Simple and predictable
Good for sequential workflows

**Parallel Function Calling:**

Agent calls multiple tools simultaneously
Reduces latency for independent operations
Requires careful result aggregation
Supported by GPT-4, Claude 3+

**Nested Function Calling:**

Results of one function call used in subsequent calls
Forms a dependency graph
Requires planning or dynamic execution
Common in ReAct-style agents

**Conditional Function Calling:**

Branch logic based on function results
If-then-else patterns in agent reasoning
Handles different scenarios gracefully

Tool Integration Strategies

**API Integration:**

REST API calls to external services
Authentication and rate limiting
Error handling and retries
Response parsing and validation

**Database Queries:**

SQL or NoSQL database access
Semantic search over vector databases
Caching and query optimization
Security and access control

**Code Execution:**

Python REPL or Jupyter kernel
Sandboxed execution environments
Code validation and safety checks
Result capture and error handling

**Web Browsing:**

Web scraping and crawling
Headless browser automation (Playwright, Puppeteer)
Content extraction and parsing
Rate limiting and politeness policies

**Custom Tools:**

Domain-specific business logic
Internal system integrations
Workflow automation
Report generation

Function Calling Best Practices

**Tool Design:**

Clear, descriptive tool names and descriptions
Well-defined input schemas with examples
Consistent error handling and return formats
Minimal and focused tool functionality
Idempotent operations where possible

**Prompt Engineering:**

Provide clear tool documentation in prompts
Include examples of correct tool usage
Specify when tools should and shouldn't be used
Handle tool failures gracefully
Guide agent through multi-step tool use

**Error Handling:**

Validate tool inputs before execution
Provide informative error messages
Implement retry logic with backoff
Graceful degradation when tools fail
Log errors for debugging

**Security:**

Validate and sanitize all inputs
Implement authentication and authorization
Rate limit tool calls
Audit tool usage
Sandbox dangerous operations

Retrieval-Augmented Generation (RAG)

RAG Pipeline

**Indexing Phase:** 1. Document loading and preprocessing 2. Text chunking and splitting strategies 3. Embedding generation (OpenAI, Cohere, sentence-transformers) 4. Vector database storage 5. Metadata indexing for filtering

**Retrieval Phase:** 1. Query embedding generation 2. Similarity search (cosine, dot product, euclidean) 3. Metadata filtering and reranking 4. Result aggregation and deduplication 5. Context window management

**Generation Phase:** 1. Context injection into prompt 2. Query-aware context formatting 3. LLM generation with retrieved context 4. Citation and source attribution 5. Response validation and filtering

Advanced RAG Patterns

**Hierarchical Retrieval:**

Multi-level document structure (document, section, chunk)
Retrieve broader context then drill down
Parent-child chunk relationships

**Multi-Query RAG:**

Generate multiple query variations
Retrieve for each variation
Aggregate and deduplicate results

**Agentic RAG:**

Agent decides when to retrieve
Iterative retrieval and reasoning
Query decomposition and routing
Self-correction based on retrieval quality

**Hybrid Search:**

Combine semantic (vector) and keyword (BM25) search
Reciprocal rank fusion for result merging
Best of both retrieval paradigms

**Self-RAG:**

Agent critiques retrieval quality
Decides when retrieval is needed
Filters irrelevant retrieved passages
Reflection and self-correction

RAG Optimization

**Chunking Strategies:**

Fixed-size chunks with overlap
Semantic chunking (sentence, paragraph)
Recursive character text splitting
Document structure-aware splitting
Small-to-big retrieval (retrieve small, provide context from large)

**Embedding Models:**

OpenAI text-embedding-ada-002, text-embedding-3
Cohere embed-english-v3.0
Sentence-transformers (all-MiniLM-L6-v2)
Domain-specific fine-tuned embeddings
Multilingual embeddings

**Reranking:**

Cross-encoder models for reranking
Cohere Rerank API
Diversity-based reranking
Maximal marginal relevance (MMR)
LLM-based relevance scoring

**Query Optimization:**

Query expansion and rewriting
Hypothetical document embeddings (HyDE)
Query decomposition for complex questions
Step-back prompting for high-level context

Conversation Design Patterns

Onboarding and Discovery

**First-Time User Experience:**

Welcoming message and value proposition
Capability discovery (what can the bot do)
Example queries or quick replies
Personality and tone setting
Privacy and data handling disclosure

**Progressive Disclosure:**

Start with simple, core functionality
Gradually introduce advanced features
Contextual help and tips
Avoid overwhelming users upfront

Error Handling and Repair

**No Match / Out of Scope:**

Acknowledge confusion gracefully
Suggest alternative phrasings
Offer menu of available actions
Escalation to human agent

**Clarification Requests:**

Ask focused, closed-ended questions
Provide examples of expected input
Use confirmation prompts
Allow user to correct misunderstandings

**Conversation Repair:**

Detect and recover from errors
Allow undo and restart
Maintain conversation history for context
Learn from failed interactions

Personality and Tone

**Persona Design:**

Define character traits and background
Consistent voice across interactions
Match brand identity
Cultural and linguistic appropriateness

**Tone Adaptation:**

Formal vs. casual based on context
Empathetic responses to negative sentiment
Professional for business contexts
Playful for entertainment applications

**Anthropomorphization:**

Set appropriate expectations (bot vs. human)
Transparency about limitations
Avoid uncanny valley
Balance personality with utility

Proactive Engagement

**Contextual Suggestions:**

Predict user needs based on context
Offer relevant follow-up actions
Smart defaults and auto-complete
Progressive disclosure of features

**Notifications and Reminders:**

Push notifications for important events
Periodic check-ins and engagement
Personalized recommendations
Opt-in and preference management

Safety and Alignment

Content Filtering

**Input Filtering:**

Detect harmful, toxic, or inappropriate inputs
Block prompt injection attempts
Sanitize user inputs
Rate limiting and abuse detection

**Output Filtering:**

Safety classifiers for generated content
Toxicity detection (Perspective API)
PII and sensitive data redaction
Hallucination detection

Prompt Injection Defense

**Mitigation Strategies:**

Separate user content from system instructions
Use XML tags or clear delimiters
Instruction hierarchy (system > user)
Output validation and parsing
LLM-based injection detection

**System Prompts:**

Explicit role definition
Clear task boundaries
Safety guidelines and guardrails
Examples of correct behavior

Bias and Fairness

**Bias Mitigation:**

Diverse training data
Bias detection in outputs
Counterfactual fairness testing
Regular audits and monitoring
User feedback integration

**Ethical Considerations:**

Transparency about AI usage
Informed consent for data collection
Right to explanation for decisions
Appeal and correction mechanisms
Human oversight and control

RLHF and Alignment

**Reinforcement Learning from Human Feedback:**

Preference learning from human comparisons
Reward model training
Policy optimization (PPO, DPO)
Constitutional AI principles
Red teaming and adversarial testing

**Alignment Techniques:**

Instruction tuning for helpfulness
Constitutional AI for harmlessness
Value alignment frameworks
Recursive reward modeling
Debate and amplification

Evaluation and Testing

Conversation Quality Metrics

**Automatic Metrics:**

Intent classification accuracy
Entity extraction F1 score
Dialogue success rate
Average conversation length
Fallback rate and containment

**Human Evaluation:**

Conversation satisfaction (CSAT)
Task completion rate
Response appropriateness
Naturalness and fluency
Personality consistency

**LLM-as-Judge:**

Use GPT-4/Claude to score responses
Evaluate relevance, accuracy, helpfulness
Compare multiple agent responses
Scalable alternative to human evaluation

Agent Benchmarks

**General Capability:**

MMLU (Massive Multitask Language Understanding)
HellaSwag (commonsense reasoning)
TruthfulQA (factual accuracy)
HumanEval (code generation)

**Agent-Specific:**

WebArena (web navigation tasks)
AgentBench (multi-domain agent tasks)
Tool-using benchmarks (API-Bank)
Multi-agent coordination tasks
Long-horizon planning tasks

Testing Strategies

**Unit Testing:**

Test individual components (NLU, dialogue policy)
Mock external dependencies
Validate function calling
Test error handling

**Integration Testing:**

End-to-end conversation flows
Multi-turn dialogue scenarios
Tool integration testing
State persistence verification

**Regression Testing:**

Test suite of known good conversations
Track metrics over time
Catch performance degradation
Automated with CI/CD

**A/B Testing:**

Compare different prompts or models
Test conversation flow variations
Measure user engagement
Data-driven optimization

Deployment and Operations

Hosting Options

**Cloud Platforms:**

AWS (Lambda, ECS, API Gateway)
Google Cloud (Cloud Run, Cloud Functions)
Azure (Functions, App Service)
Vercel, Netlify (serverless)

**LLM Providers:**

OpenAI API (GPT-4, GPT-3.5)
Anthropic API (Claude 3 Opus, Sonnet, Haiku)
Google Gemini API
Mistral API, Cohere API
Azure OpenAI Service

**Self-Hosted:**

Ollama for local LLM inference
vLLM for high-throughput serving
TGI (Text Generation Inference)
Ray Serve for scalable deployment

Monitoring and Observability

**Metrics to Track:**

Request latency (p50, p95, p99)
Token usage and costs
Error rates and types
Conversation completion rates
User satisfaction scores

**Logging:**

Conversation history and context
Tool calls and results
Errors and exceptions
Model inputs and outputs
User feedback and ratings

**Tracing:**

LangSmith for LangChain applications
LangFuse for LLM observability
OpenTelemetry for distributed tracing
Custom instrumentation

**Alerting:**

High error rates or latency
Cost anomalies
Safety filter triggers
Unusual usage patterns

Optimization

**Latency Optimization:**

Streaming responses for better UX
Parallel function calling
Caching frequent queries
Smaller/faster models for simple tasks
Speculative execution

**Cost Optimization:**

Prompt compression and optimization
Use smaller models when possible
Cache embeddings and results
Batch processing
Token budget management

**Quality Optimization:**

Few-shot examples and demonstrations
Chain-of-thought prompting
Self-consistency and voting
Retrieval-augmented generation
Fine-tuning for domain adaptation

Best Practices Summary

1. **Start with clear use cases** - Define specific problems and success criteria 2. **Design conversation flows first** - Map out happy paths and error scenarios 3. **Choose the right architecture** - Match agent complexity to task requirements 4. **Invest in evaluation** - Automated metrics and human evaluation 5. **Iterate based on user feedback** - Continuous improvement cycle 6. **Implement robust error handling** - Graceful degradation and recovery 7. **Monitor and optimize continuously** - Track metrics and costs 8. **Prioritize safety and alignment** - Content filtering and guardrails 9. **Document agent behavior** - System prompts, tools, and decision logic 10. **Keep humans in the loop** - Critical decisions and edge cases

Success Metrics

**User Engagement:**

Daily/monthly active users
Conversation frequency and length
Return rate and retention
Feature adoption rates

**Performance Metrics:**

Task completion rate
First contact resolution
Average handling time
Escalation rate

**Quality Metrics:**

User satisfaction (CSAT, NPS)
Response accuracy
Intent recognition accuracy
Conversation coherence

**Business Metrics:**

Cost per conversation
Support ticket deflection
Conversion rates
ROI and cost savings

Career Development

Learning Path

1. **Foundation** - NLP fundamentals, Python programming, machine learning basics 2. **Conversational AI** - Dialogue systems, chatbot frameworks, conversation design 3. **LLM Mastery** - Prompt engineering, RAG, fine-tuning, agent frameworks 4. **Advanced Topics** - Multi-agent systems, RLHF, safety and alignment 5. **Specialization** - Voice AI, multi-modal agents, enterprise deployment

Resources

**Courses** - DeepLearning.AI LangChain courses, Hugging Face NLP course, fast.ai
**Books** - "Building LLMs for Production", "Designing Voice User Interfaces"
**Communities** - LangChain Discord, r/LangChain, AI Agent communities
**Conferences** - NeurIPS, ACL, EMNLP, Applied AI conferences

MCP Apps Sub-Specialization

Overview

MCP Apps extend the Model Context Protocol by enabling interactive UIs that render inline in MCP-enabled hosts such as Claude Desktop, ChatGPT, VS Code, Goose, Postman, and other compliant chat clients. The core architectural pattern is **Tool + Resource**: every MCP App requires a Tool (called by the LLM/host, returns data) and a Resource (serves bundled HTML UI that displays the data). The tool's _meta.ui.resourceUri references the resource's URI.

**Repository**: https://github.com/modelcontextprotocol/ext-apps **npm Package**: @modelcontextprotocol/ext-apps **Specification Version**: 2026-01-26 (stable)

Key Principles

1. **Reference-code-first development** -- always clone the SDK repo for examples and API docs rather than relying on memory 2. **Graceful degradation** -- always provide text fallbacks for non-UI hosts via the content array 3. **CSP-first security** -- all network requests must be declared in CSP since apps run in sandboxed iframes with no same-origin server 4. **Handler-before-connect pattern** -- register ALL handlers BEFORE calling app.connect() 5. **Single-file bundling** -- mandatory via vite-plugin-singlefile for iframe compatibility

Process Catalog

Process File	Name	Description
`create-mcp-app.js`	Create MCP App (Greenfield)	Scaffolds a new interactive UI MCP App from scratch. Covers framework selection, project setup, Tool+Resource implementation, client UI with handler-before-connect pattern, single-file bundling, and verification against basic-host.
`add-app-to-mcp-server.js`	Add Interactive UI to Existing MCP Server	Adds interactive UI capabilities to an existing MCP server. Analyzes existing tools for UI benefit, converts selected tools to App tools, preserves backward compatibility with text fallbacks, and adds build pipeline.
`convert-web-app-to-mcp.js`	Convert Web App to Hybrid MCP App	Converts an existing web application into a hybrid that works both standalone in browser AND as an MCP App. Preserves original standalone functionality while adding MCP data path alongside.
`migrate-openai-app-to-mcp.js`	Migrate OpenAI Apps SDK to MCP Apps	Migrates existing OpenAI Apps SDK applications to the open MCP Apps standard. Handles paradigm shift from synchronous global object to async App instance with event handlers.

Skills Catalog

Skill Directory	Name	Description
`skills/mcp-app-scaffolding/`	MCP App Scaffolding	Scaffolds MCP App project structure with framework selection decision tree (React/Vanilla JS/Vue/Svelte/Preact/Solid), directory layout, dependencies, and entry points.
`skills/mcp-csp-investigation/`	MCP CSP Investigation	Performs comprehensive Content Security Policy audit for MCP Apps in sandboxed iframes. Builds the app, traces every network origin to source, produces categorized domain lists.
`skills/mcp-tool-resource-pattern/`	MCP Tool + Resource Pattern	Implements the core Tool + Resource architectural pattern with `registerAppTool`, `registerAppResource`, text fallback, structuredContent, and app-only helper tools.
`skills/mcp-host-styling-integration/`	MCP Host Styling Integration	Integrates MCP App UI with host theming system using CSS variables with fallback values, `onhostcontextchanged` event, safe area insets, and display mode detection.
`skills/mcp-app-verification/`	MCP App Verification	Runs comprehensive verification checklists for MCP Apps. Tests with basic-host, checks handler-before-connect invariant, text fallback, resource URI linking, single-file bundling, host styling.
`skills/single-file-bundling/`	Single-File Bundling	Configures Vite with `vite-plugin-singlefile` for mandatory single-file HTML bundling. All assets (JS, CSS, images, fonts) must be inlined into a single HTML file.

Agents Catalog

Agent Directory	Name	Role
`agents/mcp-app-architect/`	MCP App Architect	Senior architect for MCP Apps -- designs Tool + Resource patterns, transport configuration, data flow, and CSP security model.
`agents/mcp-ui-developer/`	MCP UI Developer	Frontend specialist for MCP App client-side UI -- implements App lifecycle, event handlers, host styling, and framework-specific patterns.
`agents/mcp-migration-specialist/`	MCP Migration Specialist	Expert in migrating applications to MCP Apps from other platforms (OpenAI Apps SDK, plain web apps). Handles paradigm shifts, API mapping, and verification.
`agents/csp-security-auditor/`	CSP Security Auditor	Security specialist focused on Content Security Policy for MCP Apps in sandboxed iframes. Audits network origins and ensures complete CSP configuration.

Usage Guide

**Create a new MCP App from scratch:**

bash

agent-platform call --harness claude-code \
  --process library/specializations/ai-agents-conversational/create-mcp-app.js#process \
  --workspace .

**Add UI to an existing MCP server:**

bash

agent-platform call --harness claude-code \
  --process library/specializations/ai-agents-conversational/add-app-to-mcp-server.js#process \
  --workspace .

**Convert a web app to hybrid MCP App:**

bash

agent-platform call --harness claude-code \
  --process library/specializations/ai-agents-conversational/convert-web-app-to-mcp.js#process \
  --workspace .

**Migrate from OpenAI Apps SDK:**

bash

agent-platform call --harness claude-code \
  --process library/specializations/ai-agents-conversational/migrate-openai-app-to-mcp.js#process \
  --workspace .

Critical Implementation Notes

**Handler-before-connect invariant**: Register ALL handlers (ontoolinput, ontoolresult, onhostcontextchanged, onteardown) BEFORE calling app.connect(). Violating this produces subtle, hard-to-debug failures.
**CSP silent failures**: All network requests fail SILENTLY without proper CSP declaration in sandboxed iframes. Missing even one origin causes silent failure.
**Text fallback**: Always include a content array with text representation for non-UI hosts.
**Resource URI linking**: Tool's _meta.ui.resourceUri must match the registered resource URI.
**vite-plugin-singlefile is mandatory**: Without it, assets will not load in the sandboxed iframe.

Conclusion

AI Agents and Conversational AI specialization is at the forefront of human-computer interaction, enabling natural, intelligent conversations and autonomous task execution. From simple chatbots to sophisticated multi-agent systems, this field combines natural language processing, machine learning, software engineering, and user experience design. The addition of MCP Apps extends this into interactive UI delivery within AI chat hosts, creating a bridge between conversational AI and rich visual interfaces. By understanding agent architectures, conversation design patterns, MCP Apps patterns, and best practices for deployment and safety, practitioners can build powerful conversational AI systems that deliver real value while maintaining trust and reliability.

AI Agents and Conversational AI Specialization (Library) reference

Page nodewiki/library/ai-agents-conversational.mdNearby pages · 111Documents · 1

Continue reading

Nearby pages in the same section.

Documented graph nodes

Records linked directly from this page’s Page node.

specialization:ai-agents-conversational · Specialization

AI Agents and Conversational AI Specialization

Overview

Roles and Responsibilities

Conversational AI Engineer

A conversational AI engineer designs and implements chatbots, virtual assistants, and conversational interfaces.

**Core responsibilities:**

Design and implement dialogue flows and conversation logic
Integrate NLU/NLP models for intent recognition and entity extraction
Build conversation state management and context tracking
Implement multi-turn conversation handling
Develop fallback strategies and error recovery mechanisms
Integrate with backend services and APIs
Create training data and fine-tune language models
Monitor conversation quality and user satisfaction metrics

**Skills required:**

Natural language processing and understanding
Machine learning and deep learning fundamentals
Python, JavaScript/TypeScript for bot frameworks
Dialogue management systems and state machines
API integration and webhook development
Prompt engineering and LLM fine-tuning
Analytics and conversation flow optimization

AI Agent Developer

An AI agent developer creates autonomous agents that can perceive, reason, plan, and act to achieve goals.

**Core responsibilities:**

Design agent architectures (ReAct, ReWOO, Plan-and-Execute, LangGraph)
Implement tool use and function calling capabilities
Build memory systems (short-term, long-term, semantic)
Create planning and reasoning modules
Develop multi-agent coordination and communication
Implement safety guardrails and content filtering
Optimize agent performance and token efficiency
Design evaluation frameworks for agent behaviors

**Skills required:**

LLM APIs and prompt engineering
Agent frameworks (LangChain, LlamaIndex, AutoGPT, CrewAI)
Vector databases and semantic search
Retrieval-Augmented Generation (RAG)
Function calling and tool integration
Reinforcement learning from human feedback (RLHF)
Software engineering and system design

Conversation Designer

A conversation designer crafts the user experience of conversational interfaces, focusing on natural, intuitive interactions.

**Core responsibilities:**

Design conversation flows and dialogue trees
Write chatbot scripts and response templates
Create persona and brand voice guidelines
Design error handling and repair strategies
Map user journeys and conversation paths
Conduct user research and usability testing
Collaborate with linguists and UX designers
Create conversation design documentation

**Skills required:**

UX design and user research methodologies
Linguistics and pragmatics
Storytelling and copywriting
Flowchart and dialogue mapping tools
Understanding of NLU capabilities and limitations
Human-computer interaction principles
Empathy and user-centered design thinking

Other Related Roles

**NLP Engineer** - Specializes in natural language understanding models, intent classification, entity recognition
**LLM Ops Engineer** - Manages deployment, monitoring, and optimization of large language model systems
**Prompt Engineer** - Designs and optimizes prompts for LLM-based systems
**AI Safety Researcher** - Ensures AI agents behave safely, ethically, and aligned with human values
**Multi-Agent Systems Architect** - Designs complex systems with multiple coordinating agents
**Voice AI Developer** - Specializes in speech-to-text, text-to-speech, and voice-based interactions

Agent Architectures

ReAct (Reasoning and Acting)

An agent architecture that interleaves reasoning traces with action execution.

**Components:**

**Thought** - Agent reasons about the current state and what to do next
**Action** - Agent executes a tool or function call
**Observation** - Agent receives the result of the action
**Loop** - Repeat thought-action-observation until task completion

**Characteristics:**

Explicit reasoning traces improve interpretability
Self-reflection and error correction capabilities
Works well for multi-step tasks requiring planning
Handles complex tool use scenarios
Can recover from mistakes through reasoning

**Use cases:**

Research and information gathering agents
Complex problem-solving tasks
Debugging and troubleshooting assistants
Mathematical reasoning and code generation

ReWOO (Reasoning WithOut Observation)

An agent architecture that plans all actions upfront before executing them, reducing LLM calls.

**Components:**

**Planning phase** - Generate complete action plan with dependencies
**Execution phase** - Execute all tools in parallel where possible
**Response phase** - Synthesize results into final answer

**Characteristics:**

More efficient than ReAct (fewer LLM calls)
Enables parallel tool execution
Better for predictable workflows
Less adaptive to unexpected results
Lower latency for multi-step tasks

**Use cases:**

Data analysis pipelines
Batch processing tasks
Workflows with clear dependencies
Cost-sensitive applications

Plan-and-Execute

An agent architecture that separates high-level planning from low-level execution.

**Components:**

**Planner** - Creates high-level task decomposition
**Executor** - Carries out individual steps
**Replanner** - Updates plan based on execution results
**Monitor** - Tracks progress and detects failures

**Characteristics:**

Hierarchical task decomposition
Can handle long-horizon tasks
Adapts plan based on intermediate results
Better resource allocation
Clear separation of concerns

**Use cases:**

Project management and task automation
Complex workflow orchestration
Long-running research tasks
Multi-step creative projects

LangGraph (Stateful Multi-Agent Workflows)

A framework for building cyclic graphs of agents with explicit state management.

**Components:**

**Nodes** - Individual agents or processing steps
**Edges** - Transitions between nodes (conditional or unconditional)
**State** - Shared state accessible to all nodes
**Checkpoints** - Persistence and resumability

**Characteristics:**

Explicit control flow with cycles and conditionals
Built-in state management and persistence
Human-in-the-loop integration points
Time travel and debugging capabilities
Composable and reusable agent graphs

**Use cases:**

Complex decision-making workflows
Multi-agent collaboration systems
Customer support automation
Agentic applications requiring human oversight

Multi-Agent Systems

Systems with multiple specialized agents working together to solve problems.

**Patterns:**

**Hierarchical** - Manager agent delegates to specialist agents
**Cooperative** - Agents collaborate and share information
**Competitive** - Agents debate or vote on solutions
**Sequential** - Agents process tasks in a pipeline
**Parallel** - Agents work independently and results are aggregated

**Characteristics:**

Specialization and division of labor
Scalability through parallel processing
Robustness through redundancy
Complex coordination requirements
Emergent behaviors from agent interactions

**Use cases:**

Software development teams (architect, developer, tester, reviewer)
Research and analysis (researcher, fact-checker, synthesizer)
Content creation (writer, editor, reviewer)
Complex simulations and game playing

Conversational AI Components

Natural Language Understanding (NLU)

**Intent Recognition:**

Classify user input into predefined intent categories
Handle multi-intent utterances
Confidence scoring and threshold tuning
Out-of-scope intent detection
Few-shot and zero-shot intent classification

**Entity Extraction:**

Named entity recognition (NER) for people, places, organizations
Custom entity types (products, dates, quantities)
Slot filling for dialogue state
Entity linking and normalization
Composite entity extraction

**Sentiment Analysis:**

Emotion detection (positive, negative, neutral)
Fine-grained sentiment scoring
Aspect-based sentiment analysis
Emotion classification (joy, anger, fear, surprise)

**NLU Approaches:**

Rule-based pattern matching (regex, keyword)
Traditional ML (SVM, CRF, logistic regression)
Deep learning (BERT, RoBERTa, transformer models)
LLM-based (GPT-4, Claude for zero-shot/few-shot)
Hybrid approaches combining multiple methods

Dialogue Management

**State Management:**

Conversation context tracking
Slot filling and form handling
User profile and preferences
Session management
Multi-turn context window

**Dialogue Policies:**

Rule-based (if-then conditions, finite state machines)
Retrieval-based (select from predefined responses)
Generative (LLM-generated responses)
Hybrid approaches
Reinforcement learning policies

**Conversation Flow:**

Happy path scenarios
Error handling and clarification requests
Context switching and digression handling
Conversation repair strategies
Proactive suggestions and nudges

**Dialogue State Tracking (DST):**

Track user goals and intents across turns
Update belief state with each user input
Handle slot value changes and corrections
Maintain conversation history
Support multi-domain conversations

Natural Language Generation (NLG)

**Template-Based Generation:**

Predefined response templates with variable slots
Simple, predictable, and fast
Limited flexibility and naturalness
Good for structured domains

**Retrieval-Based Generation:**

Select best response from corpus
Uses similarity matching
Requires large response database
Consistent quality

**Neural NLG:**

Seq2seq models with attention
Transformer-based generation (GPT, T5)
Fine-tuned language models
Controllable generation with attributes

**LLM-Based Generation:**

Zero-shot and few-shot prompting
Instruction-tuned models (GPT-4, Claude)
RAG for grounded generation
Chain-of-thought for reasoning
Constitutional AI for safe outputs

**NLG Considerations:**

Consistency with brand voice and persona
Contextual appropriateness
Diversity and avoiding repetition
Safety and content filtering
Hallucination mitigation

Memory Systems

**Short-Term Memory:**

Recent conversation history (last N turns)
Current dialogue state and context
Active goals and incomplete tasks
Session-based, cleared after conversation ends

**Long-Term Memory:**

User preferences and profile information
Historical interactions and patterns
Learned facts about the user
Cross-session persistence

**Semantic Memory:**

Vector database storage (Pinecone, Weaviate, Chroma)
Embedding-based retrieval
Relevant context injection into prompts
Knowledge base and documentation
Episodic memory of past interactions

**Memory Strategies:**

Sliding window with summarization
Hierarchical memory (short, medium, long-term)
Selective retention of important information
Memory consolidation and compression
Retrieval-augmented generation (RAG)

Chatbot Frameworks and Platforms

LLM Agent Frameworks

**LangChain:**

Comprehensive framework for LLM applications
Chains, agents, tools, and memory abstractions
LangGraph for stateful multi-agent workflows
LangSmith for observability and debugging
Extensive integrations with LLM providers and tools

**LlamaIndex:**

Data framework for LLM applications
Advanced RAG patterns and indexing strategies
Query engines and routers
Sub-question decomposition
Multi-document agents

**Semantic Kernel (Microsoft):**

SDK for integrating LLMs into applications
Planners and plugins architecture
Memory and skills abstractions
Native support for C#, Python, Java
Enterprise-ready with Azure integration

**Haystack:**

End-to-end NLP framework
Pipeline-based architecture
RAG and semantic search focus
Document processing and indexing
Production-ready with monitoring

**AutoGPT / BabyAGI:**

Autonomous agent frameworks
Self-directed task creation and execution
Goal-oriented behavior
Experimental and research-focused
Demonstrates emergent agent capabilities

**CrewAI:**

Multi-agent orchestration framework
Role-based agent design
Task delegation and coordination
Built on LangChain
Simpler API for multi-agent systems

Traditional Chatbot Platforms

**Rasa:**

Open source conversational AI platform
Custom NLU and dialogue management
Machine learning-based intent classification
Rule-based and ML dialogue policies
Self-hosted and full control over data

**Botpress:**

Open source chatbot building platform
Visual flow builder
NLU engine with slot filling
Multi-channel deployment
Extensible with custom code

**Microsoft Bot Framework:**

Enterprise bot development framework
Azure Bot Service integration
Adaptive dialogs and cards
Multi-channel support (Teams, Slack, web)
Language Understanding (LUIS) integration

**Dialogflow (Google):**

Conversational AI platform
Intent and entity recognition
Multi-turn conversations
Rich response types
Integration with Google Cloud

**Amazon Lex:**

AWS service for building conversational interfaces
Powered by Alexa technology
Automatic speech recognition (ASR)
Integration with AWS Lambda
Omnichannel deployment

**IBM Watson Assistant:**

Enterprise conversational AI platform
Intent classification and entity extraction
Dialog skill and search skill
Integration with Watson Discovery
Enterprise features (analytics, versioning)

Conversational UI Frameworks

**Voiceflow:**

Visual conversation design platform
No-code/low-code builder
Prototyping and production deployment
Collaboration features
Multi-channel publishing

**Landbot:**

No-code chatbot builder
Visual drag-and-drop interface
Conversational landing pages
WhatsApp and web chat support
Integration with popular tools

**ManyChat / Chatfuel:**

Facebook Messenger bot platforms
Visual flow builders
Marketing and e-commerce focus
Broadcasting and sequences
Limited to messaging platforms

Tool Use and Function Calling

Function Calling Patterns

**Single Function Call:**

Agent selects one tool to call
Waits for result before continuing
Simple and predictable
Good for sequential workflows

**Parallel Function Calling:**

Agent calls multiple tools simultaneously
Reduces latency for independent operations
Requires careful result aggregation
Supported by GPT-4, Claude 3+

**Nested Function Calling:**

Results of one function call used in subsequent calls
Forms a dependency graph
Requires planning or dynamic execution
Common in ReAct-style agents

**Conditional Function Calling:**

Branch logic based on function results
If-then-else patterns in agent reasoning
Handles different scenarios gracefully

Tool Integration Strategies

**API Integration:**

REST API calls to external services
Authentication and rate limiting
Error handling and retries
Response parsing and validation

**Database Queries:**

SQL or NoSQL database access
Semantic search over vector databases
Caching and query optimization
Security and access control

**Code Execution:**

Python REPL or Jupyter kernel
Sandboxed execution environments
Code validation and safety checks
Result capture and error handling

**Web Browsing:**

Web scraping and crawling
Headless browser automation (Playwright, Puppeteer)
Content extraction and parsing
Rate limiting and politeness policies

**Custom Tools:**

Domain-specific business logic
Internal system integrations
Workflow automation
Report generation

Function Calling Best Practices

**Tool Design:**

Clear, descriptive tool names and descriptions
Well-defined input schemas with examples
Consistent error handling and return formats
Minimal and focused tool functionality
Idempotent operations where possible

**Prompt Engineering:**

Provide clear tool documentation in prompts
Include examples of correct tool usage
Specify when tools should and shouldn't be used
Handle tool failures gracefully
Guide agent through multi-step tool use

**Error Handling:**

Validate tool inputs before execution
Provide informative error messages
Implement retry logic with backoff
Graceful degradation when tools fail
Log errors for debugging

**Security:**

Validate and sanitize all inputs
Implement authentication and authorization
Rate limit tool calls
Audit tool usage
Sandbox dangerous operations

Retrieval-Augmented Generation (RAG)

RAG Pipeline

Advanced RAG Patterns

**Hierarchical Retrieval:**

Multi-level document structure (document, section, chunk)
Retrieve broader context then drill down
Parent-child chunk relationships

**Multi-Query RAG:**

Generate multiple query variations
Retrieve for each variation
Aggregate and deduplicate results

**Agentic RAG:**

Agent decides when to retrieve
Iterative retrieval and reasoning
Query decomposition and routing
Self-correction based on retrieval quality

**Hybrid Search:**

Combine semantic (vector) and keyword (BM25) search
Reciprocal rank fusion for result merging
Best of both retrieval paradigms

**Self-RAG:**

Agent critiques retrieval quality
Decides when retrieval is needed
Filters irrelevant retrieved passages
Reflection and self-correction

RAG Optimization

**Chunking Strategies:**

Fixed-size chunks with overlap
Semantic chunking (sentence, paragraph)
Recursive character text splitting
Document structure-aware splitting
Small-to-big retrieval (retrieve small, provide context from large)

**Embedding Models:**

OpenAI text-embedding-ada-002, text-embedding-3
Cohere embed-english-v3.0
Sentence-transformers (all-MiniLM-L6-v2)
Domain-specific fine-tuned embeddings
Multilingual embeddings

**Reranking:**

Cross-encoder models for reranking
Cohere Rerank API
Diversity-based reranking
Maximal marginal relevance (MMR)
LLM-based relevance scoring

**Query Optimization:**

Query expansion and rewriting
Hypothetical document embeddings (HyDE)
Query decomposition for complex questions
Step-back prompting for high-level context

Conversation Design Patterns

Onboarding and Discovery

**First-Time User Experience:**

Welcoming message and value proposition
Capability discovery (what can the bot do)
Example queries or quick replies
Personality and tone setting
Privacy and data handling disclosure

**Progressive Disclosure:**

Start with simple, core functionality
Gradually introduce advanced features
Contextual help and tips
Avoid overwhelming users upfront

Error Handling and Repair

**No Match / Out of Scope:**

Acknowledge confusion gracefully
Suggest alternative phrasings
Offer menu of available actions
Escalation to human agent

**Clarification Requests:**

Ask focused, closed-ended questions
Provide examples of expected input
Use confirmation prompts
Allow user to correct misunderstandings

**Conversation Repair:**

Detect and recover from errors
Allow undo and restart
Maintain conversation history for context
Learn from failed interactions

Personality and Tone

**Persona Design:**

Define character traits and background
Consistent voice across interactions
Match brand identity
Cultural and linguistic appropriateness

**Tone Adaptation:**

Formal vs. casual based on context
Empathetic responses to negative sentiment
Professional for business contexts
Playful for entertainment applications

**Anthropomorphization:**

Set appropriate expectations (bot vs. human)
Transparency about limitations
Avoid uncanny valley
Balance personality with utility

Proactive Engagement

**Contextual Suggestions:**

Predict user needs based on context
Offer relevant follow-up actions
Smart defaults and auto-complete
Progressive disclosure of features

**Notifications and Reminders:**

Push notifications for important events
Periodic check-ins and engagement
Personalized recommendations
Opt-in and preference management

Safety and Alignment

Content Filtering

**Input Filtering:**

Detect harmful, toxic, or inappropriate inputs
Block prompt injection attempts
Sanitize user inputs
Rate limiting and abuse detection

**Output Filtering:**

Safety classifiers for generated content
Toxicity detection (Perspective API)
PII and sensitive data redaction
Hallucination detection

Prompt Injection Defense

**Mitigation Strategies:**

Separate user content from system instructions
Use XML tags or clear delimiters
Instruction hierarchy (system > user)
Output validation and parsing
LLM-based injection detection

**System Prompts:**

Explicit role definition
Clear task boundaries
Safety guidelines and guardrails
Examples of correct behavior

Bias and Fairness

**Bias Mitigation:**

Diverse training data
Bias detection in outputs
Counterfactual fairness testing
Regular audits and monitoring
User feedback integration

**Ethical Considerations:**

Transparency about AI usage
Informed consent for data collection
Right to explanation for decisions
Appeal and correction mechanisms
Human oversight and control

RLHF and Alignment

**Reinforcement Learning from Human Feedback:**

Preference learning from human comparisons
Reward model training
Policy optimization (PPO, DPO)
Constitutional AI principles
Red teaming and adversarial testing

**Alignment Techniques:**

Instruction tuning for helpfulness
Constitutional AI for harmlessness
Value alignment frameworks
Recursive reward modeling
Debate and amplification

Evaluation and Testing

Conversation Quality Metrics

**Automatic Metrics:**

Intent classification accuracy
Entity extraction F1 score
Dialogue success rate
Average conversation length
Fallback rate and containment

**Human Evaluation:**

Conversation satisfaction (CSAT)
Task completion rate
Response appropriateness
Naturalness and fluency
Personality consistency

**LLM-as-Judge:**

Use GPT-4/Claude to score responses
Evaluate relevance, accuracy, helpfulness
Compare multiple agent responses
Scalable alternative to human evaluation

Agent Benchmarks

**General Capability:**

MMLU (Massive Multitask Language Understanding)
HellaSwag (commonsense reasoning)
TruthfulQA (factual accuracy)
HumanEval (code generation)

**Agent-Specific:**

WebArena (web navigation tasks)
AgentBench (multi-domain agent tasks)
Tool-using benchmarks (API-Bank)
Multi-agent coordination tasks
Long-horizon planning tasks

Testing Strategies

**Unit Testing:**

Test individual components (NLU, dialogue policy)
Mock external dependencies
Validate function calling
Test error handling

**Integration Testing:**

End-to-end conversation flows
Multi-turn dialogue scenarios
Tool integration testing
State persistence verification

**Regression Testing:**

Test suite of known good conversations
Track metrics over time
Catch performance degradation
Automated with CI/CD

**A/B Testing:**

Compare different prompts or models
Test conversation flow variations
Measure user engagement
Data-driven optimization

Deployment and Operations

Hosting Options

**Cloud Platforms:**

AWS (Lambda, ECS, API Gateway)
Google Cloud (Cloud Run, Cloud Functions)
Azure (Functions, App Service)
Vercel, Netlify (serverless)

**LLM Providers:**

OpenAI API (GPT-4, GPT-3.5)
Anthropic API (Claude 3 Opus, Sonnet, Haiku)
Google Gemini API
Mistral API, Cohere API
Azure OpenAI Service

**Self-Hosted:**

Ollama for local LLM inference
vLLM for high-throughput serving
TGI (Text Generation Inference)
Ray Serve for scalable deployment

Monitoring and Observability

**Metrics to Track:**

Request latency (p50, p95, p99)
Token usage and costs
Error rates and types
Conversation completion rates
User satisfaction scores

**Logging:**

Conversation history and context
Tool calls and results
Errors and exceptions
Model inputs and outputs
User feedback and ratings

**Tracing:**

LangSmith for LangChain applications
LangFuse for LLM observability
OpenTelemetry for distributed tracing
Custom instrumentation

**Alerting:**

High error rates or latency
Cost anomalies
Safety filter triggers
Unusual usage patterns

Optimization

**Latency Optimization:**

Streaming responses for better UX
Parallel function calling
Caching frequent queries
Smaller/faster models for simple tasks
Speculative execution

**Cost Optimization:**

Prompt compression and optimization
Use smaller models when possible
Cache embeddings and results
Batch processing
Token budget management

**Quality Optimization:**

Few-shot examples and demonstrations
Chain-of-thought prompting
Self-consistency and voting
Retrieval-augmented generation
Fine-tuning for domain adaptation

Best Practices Summary

Success Metrics

**User Engagement:**

Daily/monthly active users
Conversation frequency and length
Return rate and retention
Feature adoption rates

**Performance Metrics:**

Task completion rate
First contact resolution
Average handling time
Escalation rate

**Quality Metrics:**

User satisfaction (CSAT, NPS)
Response accuracy
Intent recognition accuracy
Conversation coherence

**Business Metrics:**

Cost per conversation
Support ticket deflection
Conversion rates
ROI and cost savings

Career Development

Learning Path

Resources

**Courses** - DeepLearning.AI LangChain courses, Hugging Face NLP course, fast.ai
**Books** - "Building LLMs for Production", "Designing Voice User Interfaces"
**Communities** - LangChain Discord, r/LangChain, AI Agent communities
**Conferences** - NeurIPS, ACL, EMNLP, Applied AI conferences

MCP Apps Sub-Specialization

Overview

**Repository**: https://github.com/modelcontextprotocol/ext-apps **npm Package**: @modelcontextprotocol/ext-apps **Specification Version**: 2026-01-26 (stable)

Key Principles

Process Catalog

Process File	Name	Description
`create-mcp-app.js`	Create MCP App (Greenfield)	Scaffolds a new interactive UI MCP App from scratch. Covers framework selection, project setup, Tool+Resource implementation, client UI with handler-before-connect pattern, single-file bundling, and verification against basic-host.
`add-app-to-mcp-server.js`	Add Interactive UI to Existing MCP Server	Adds interactive UI capabilities to an existing MCP server. Analyzes existing tools for UI benefit, converts selected tools to App tools, preserves backward compatibility with text fallbacks, and adds build pipeline.
`convert-web-app-to-mcp.js`	Convert Web App to Hybrid MCP App	Converts an existing web application into a hybrid that works both standalone in browser AND as an MCP App. Preserves original standalone functionality while adding MCP data path alongside.
`migrate-openai-app-to-mcp.js`	Migrate OpenAI Apps SDK to MCP Apps	Migrates existing OpenAI Apps SDK applications to the open MCP Apps standard. Handles paradigm shift from synchronous global object to async App instance with event handlers.

Skills Catalog

Skill Directory	Name	Description
`skills/mcp-app-scaffolding/`	MCP App Scaffolding	Scaffolds MCP App project structure with framework selection decision tree (React/Vanilla JS/Vue/Svelte/Preact/Solid), directory layout, dependencies, and entry points.
`skills/mcp-csp-investigation/`	MCP CSP Investigation	Performs comprehensive Content Security Policy audit for MCP Apps in sandboxed iframes. Builds the app, traces every network origin to source, produces categorized domain lists.
`skills/mcp-tool-resource-pattern/`	MCP Tool + Resource Pattern	Implements the core Tool + Resource architectural pattern with `registerAppTool`, `registerAppResource`, text fallback, structuredContent, and app-only helper tools.
`skills/mcp-host-styling-integration/`	MCP Host Styling Integration	Integrates MCP App UI with host theming system using CSS variables with fallback values, `onhostcontextchanged` event, safe area insets, and display mode detection.
`skills/mcp-app-verification/`	MCP App Verification	Runs comprehensive verification checklists for MCP Apps. Tests with basic-host, checks handler-before-connect invariant, text fallback, resource URI linking, single-file bundling, host styling.
`skills/single-file-bundling/`	Single-File Bundling	Configures Vite with `vite-plugin-singlefile` for mandatory single-file HTML bundling. All assets (JS, CSS, images, fonts) must be inlined into a single HTML file.

Agents Catalog

Agent Directory	Name	Role
`agents/mcp-app-architect/`	MCP App Architect	Senior architect for MCP Apps -- designs Tool + Resource patterns, transport configuration, data flow, and CSP security model.
`agents/mcp-ui-developer/`	MCP UI Developer	Frontend specialist for MCP App client-side UI -- implements App lifecycle, event handlers, host styling, and framework-specific patterns.
`agents/mcp-migration-specialist/`	MCP Migration Specialist	Expert in migrating applications to MCP Apps from other platforms (OpenAI Apps SDK, plain web apps). Handles paradigm shifts, API mapping, and verification.
`agents/csp-security-auditor/`	CSP Security Auditor	Security specialist focused on Content Security Policy for MCP Apps in sandboxed iframes. Audits network origins and ensures complete CSP configuration.

Usage Guide

**Create a new MCP App from scratch:**

bash

agent-platform call --harness claude-code \
  --process library/specializations/ai-agents-conversational/create-mcp-app.js#process \
  --workspace .

**Add UI to an existing MCP server:**

bash

agent-platform call --harness claude-code \
  --process library/specializations/ai-agents-conversational/add-app-to-mcp-server.js#process \
  --workspace .

**Convert a web app to hybrid MCP App:**

bash

agent-platform call --harness claude-code \
  --process library/specializations/ai-agents-conversational/convert-web-app-to-mcp.js#process \
  --workspace .

**Migrate from OpenAI Apps SDK:**

bash

agent-platform call --harness claude-code \
  --process library/specializations/ai-agents-conversational/migrate-openai-app-to-mcp.js#process \
  --workspace .

Critical Implementation Notes

**Handler-before-connect invariant**: Register ALL handlers (ontoolinput, ontoolresult, onhostcontextchanged, onteardown) BEFORE calling app.connect(). Violating this produces subtle, hard-to-debug failures.
**CSP silent failures**: All network requests fail SILENTLY without proper CSP declaration in sandboxed iframes. Missing even one origin causes silent failure.
**Text fallback**: Always include a content array with text representation for non-UI hosts.
**Resource URI linking**: Tool's _meta.ui.resourceUri must match the registered resource URI.
**vite-plugin-singlefile is mandatory**: Without it, assets will not load in the sandboxed iframe.