AI Voice Chat using Webhook, Memory Manager, OpenAI, Google Gemini & ElevenLabs
This workflow builds an intelligent voice chat system that converts voice input into text, manages contextual memory, generates intelligent responses, and produces natural voice output. By combining several AI services, it accurately recognizes the user's speech while retaining the context of the conversation, enabling personalized replies. The result is a smooth voice interaction experience suited to scenarios such as smart assistants, customer service, and online education, improving both user experience and service efficiency.

Workflow Name
AI Voice Chat using Webhook, Memory Manager, OpenAI, Google Gemini & ElevenLabs
Key Features and Highlights
This workflow implements a voice-based intelligent chat system that supports speech-to-text conversion, contextual memory management, intelligent response generation, and speech synthesis output. Its key highlights: OpenAI handles speech-to-text, the Google Gemini model handles context understanding and response generation, LangChain's memory management modules maintain conversational continuity, and ElevenLabs' high-quality text-to-speech API produces natural, fluent voice replies.
Core Problems Addressed
- Accurate recognition of user speech content in voice chat (via OpenAI speech-to-text)
- Bridging context gaps and limited conversation memory in intelligent chat (via Memory Manager and Window Buffer Memory for context management)
- Providing natural and personalized voice reply experiences (using Google Gemini for response generation and ElevenLabs for text-to-speech)
- Enabling end-to-end automated voice Q&A without manual user intervention
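The sliding-window context management mentioned above can be sketched in a few lines of Python. This is not LangChain's actual API; `WindowBufferMemory` and its methods are illustrative names for the idea behind the Window Buffer Memory node, which keeps only the most recent exchanges as model context:

```python
from collections import deque

class WindowBufferMemory:
    """Keeps only the most recent `window_size` exchanges as chat context."""

    def __init__(self, window_size: int = 5):
        # deque with maxlen silently drops the oldest turn when full.
        self.buffer = deque(maxlen=window_size)

    def insert(self, user_text: str, ai_text: str) -> None:
        """Store one question/answer pair (the 'Insert Chat' step)."""
        self.buffer.append({"user": user_text, "ai": ai_text})

    def get_context(self) -> str:
        """Flatten the retained history into a prompt prefix (the 'Get Chat' step)."""
        return "\n".join(
            f"User: {turn['user']}\nAssistant: {turn['ai']}"
            for turn in self.buffer
        )
```

With a window of 2, inserting a third exchange evicts the first, so the model only ever sees the most recent turns.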
Application Scenarios
- Intelligent voice assistants and customer service chatbots
- Voice-interactive smart home control
- Q&A systems for education tutoring and language learning
- Voice interfaces for online consultation or information retrieval services
- Any natural language interaction system requiring voice input and output
Main Workflow Steps
- Webhook receives voice message: Accept user voice data via HTTP POST.
- OpenAI speech-to-text: Convert the uploaded user voice into text content.
- Retrieve historical conversation context: Use Memory Manager’s “Get Chat” node to fetch prior dialogue, ensuring context continuity.
- Aggregate contextual data: Organize and consolidate historical conversations to provide complete context for the model.
- Invoke Google Gemini chat model: Generate intelligent textual replies based on aggregated context and current user input.
- Save latest conversation content: Use Memory Manager’s “Insert Chat” node to store current Q&A, updating the context memory.
- Text-to-speech conversion: Call ElevenLabs’ API to convert AI-generated text replies into natural speech.
- Respond to Webhook: Return the generated voice data to the requester, completing the voice Q&A loop.
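The steps above can be sketched as plain Python functions. The endpoint URLs follow the public OpenAI, Google Gemini, and ElevenLabs REST APIs, but the model names, helper signatures, and in-memory history are illustrative assumptions, not taken from the workflow itself:

```python
import os
import requests

# Credentials are assumed to come from environment variables.
OPENAI_KEY = os.environ.get("OPENAI_API_KEY", "")
GEMINI_KEY = os.environ.get("GEMINI_API_KEY", "")
ELEVEN_KEY = os.environ.get("ELEVENLABS_API_KEY", "")

def speech_to_text(audio_bytes: bytes) -> str:
    # Step 2: OpenAI transcription endpoint (Whisper).
    resp = requests.post(
        "https://api.openai.com/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {OPENAI_KEY}"},
        files={"file": ("voice.mp3", audio_bytes, "audio/mpeg")},
        data={"model": "whisper-1"},
    )
    resp.raise_for_status()
    return resp.json()["text"]

def build_prompt(history: list, user_text: str) -> str:
    # Steps 3-4: aggregate prior turns plus the current question into one prompt.
    lines = [f"User: {t['user']}\nAssistant: {t['ai']}" for t in history]
    lines.append(f"User: {user_text}\nAssistant:")
    return "\n".join(lines)

def generate_reply(prompt: str) -> str:
    # Step 5: Google Gemini generateContent (model name is an assumption).
    resp = requests.post(
        "https://generativelanguage.googleapis.com/v1beta/models/"
        f"gemini-1.5-flash:generateContent?key={GEMINI_KEY}",
        json={"contents": [{"parts": [{"text": prompt}]}]},
    )
    resp.raise_for_status()
    return resp.json()["candidates"][0]["content"]["parts"][0]["text"]

def text_to_speech(text: str, voice_id: str) -> bytes:
    # Step 7: ElevenLabs TTS; voice_id selects a workflow-specific voice.
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": ELEVEN_KEY},
        json={"text": text},
    )
    resp.raise_for_status()
    return resp.content  # raw audio bytes

def handle_voice_message(audio_bytes: bytes, history: list, voice_id: str) -> bytes:
    user_text = speech_to_text(audio_bytes)             # step 2
    prompt = build_prompt(history, user_text)           # steps 3-4
    ai_text = generate_reply(prompt)                    # step 5
    history.append({"user": user_text, "ai": ai_text})  # step 6
    return text_to_speech(ai_text, voice_id)            # steps 7-8
```

In the actual workflow, each function corresponds to one n8n node, and the history list is replaced by the Memory Manager nodes.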
Involved Systems or Services
- Webhook: Handles receiving and responding to HTTP requests, serving as the entry and exit point for voice messages.
- OpenAI (Speech to Text): Converts user speech into text.
- LangChain Memory Manager & Window Buffer Memory: Manages conversation history and context to maintain memory continuity.
- Google Gemini Chat Model: Generates intelligent text replies based on context.
- ElevenLabs: Converts reply text into high-quality speech output, supporting various voice styles.
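To make the Webhook node's role as entry and exit point concrete, here is a minimal sketch of the HTTP contract it implements, written as a stdlib-only WSGI app. The real workflow uses n8n's Webhook node rather than custom server code, and `run_voice_pipeline` is a stand-in for the full STT, Gemini, and TTS chain:

```python
def run_voice_pipeline(audio_in: bytes) -> bytes:
    # Placeholder: the real chain would run speech-to-text, the chat model,
    # and text-to-speech. Here it echoes the input so the contract is testable.
    return audio_in

def voice_webhook_app(environ, start_response):
    """Minimal WSGI app: POST body is the user's audio, response body is reply audio."""
    if environ["REQUEST_METHOD"] != "POST":
        start_response("405 Method Not Allowed", [("Content-Type", "text/plain")])
        return [b"POST audio to this endpoint"]
    length = int(environ.get("CONTENT_LENGTH") or 0)
    audio_in = environ["wsgi.input"].read(length)
    audio_out = run_voice_pipeline(audio_in)
    start_response("200 OK", [
        ("Content-Type", "audio/mpeg"),
        ("Content-Length", str(len(audio_out))),
    ])
    return [audio_out]
```

Returning the synthesized audio directly in the HTTP response is what closes the voice Q&A loop for the caller.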
Target Users and Value Proposition
- Developers and enterprises aiming to rapidly build intelligent voice interaction systems to enhance user experience and service efficiency.
- Technical teams needing to integrate multimodal AI capabilities (speech recognition, natural language understanding and generation, speech synthesis).
- Product managers and engineers in voice assistants, intelligent customer service, online education, smart home, and related fields.
- Those seeking to leverage advanced AI models and cloud APIs to create voice bots with contextual memory and natural interaction capabilities.
Through multi-node collaboration, this workflow closes the loop from voice input to intelligent voice output, combining context understanding with multi-service integration to significantly improve the intelligence and user experience of voice interactions.