AI Voice Chat using Webhook, Memory Manager, OpenAI, Google Gemini & ElevenLabs
This workflow builds a complete AI voice chat system: it transcribes user speech to text in real time and uses context memory management to understand and generate multi-turn conversations. By combining an advanced language model with high-quality text-to-speech, the system delivers natural, fluent spoken responses, making it well suited to scenarios such as intelligent customer service and voice assistants and improving both the user experience and the efficiency of interaction.

Workflow Name
AI Voice Chat using Webhook, Memory Manager, OpenAI, Google Gemini & ElevenLabs
Key Features and Highlights
This workflow implements a complete AI voice chat system: it transcribes voice input in real time, manages conversational context in memory, understands and generates multi-turn dialogue, and returns natural, fluent spoken replies through high-quality text-to-speech. It integrates OpenAI’s speech-to-text capabilities, Google Gemini’s language model, and the ElevenLabs text-to-speech API to deliver intelligent interaction with natural-sounding voice output.
Core Problems Addressed
- Real-time conversion of user speech to text, eliminating input barriers.
- Maintaining multi-turn dialogue context through a memory management node to ensure conversation coherence and accurate understanding of user intent.
- Leveraging powerful language models to generate contextually appropriate intelligent responses.
- Converting AI-generated text into high-quality speech output to support natural voice interaction experiences.
- Flexible invocation of the ElevenLabs API for text-to-speech without requiring a pre-configured ElevenLabs node (see the sketch after this list).
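
Because the workflow reaches ElevenLabs through a plain HTTP request rather than a dedicated node, the shape of that request is the main thing to get right. The following is a minimal Python sketch of such a call against the standard ElevenLabs text-to-speech endpoint; the voice ID and model ID are placeholders, not values taken from this workflow.

```python
import os
import requests

# Placeholders: substitute your own voice ID and preferred ElevenLabs model.
VOICE_ID = "YOUR_VOICE_ID"
ELEVENLABS_API_KEY = os.environ["ELEVENLABS_API_KEY"]

def synthesize_speech(text: str) -> bytes:
    """Call the ElevenLabs text-to-speech endpoint and return MP3 audio bytes."""
    response = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": ELEVENLABS_API_KEY, "Content-Type": "application/json"},
        json={"text": text, "model_id": "eleven_multilingual_v2"},
        timeout=60,
    )
    response.raise_for_status()
    return response.content  # audio/mpeg bytes


if __name__ == "__main__":
    with open("reply.mp3", "wb") as f:
        f.write(synthesize_speech("Hello! How can I help you today?"))
```

In n8n, the equivalent is an HTTP Request node sending the same header and JSON body, with the API key stored as a credential.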
Application Scenarios
- Intelligent customer service chatbots supporting voice Q&A and continuous conversations.
- Voice assistants and voice interaction systems.
- Accessible voice communication platforms.
- Voice learning and training tools.
- Any intelligent application requiring natural voice dialogue interaction.
Main Workflow Steps
- Webhook Receives Voice Request: Listens for and receives user voice messages.
- OpenAI Speech-to-Text: Transcribes the received audio into text in real time.
- Retrieve Historical Dialogue Context: Uses the Memory Manager node to fetch previous conversation content, ensuring continuity.
- Aggregate Context Data: Integrates dialogue history to form a complete context.
- Invoke Google Gemini Language Model: Generates intelligent text replies based on the context.
- Insert New Dialogue Content into Memory Manager: Updates the context to keep memory synchronized.
- Text-to-Speech (ElevenLabs): Uses the ElevenLabs API to synthesize speech from the text reply.
- Respond with Audio Data via Webhook: Returns the generated audio to the caller, completing the voice Q&A loop (an end-to-end sketch follows this list).
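
For readers who want to see how these stages fit together outside of n8n, here is a hedged end-to-end sketch in Python: Flask stands in for the Webhook node, a per-session list stands in for the Memory Manager node, and the OpenAI, Gemini, and ElevenLabs calls mirror steps 2, 5, and 7. The endpoint path, model names, voice ID, and history format are assumptions for illustration, not the workflow's exact configuration.

```python
import os
import requests
from flask import Flask, Response, request
from openai import OpenAI
import google.generativeai as genai

app = Flask(__name__)
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
gemini = genai.GenerativeModel("gemini-1.5-flash")  # assumption: any chat-capable Gemini model

# Stand-in for the Memory Manager node: per-session chat history kept in memory.
histories: dict[str, list] = {}

def text_to_speech(text: str) -> bytes:
    """Step 7: ElevenLabs text-to-speech over plain HTTP (voice ID is a placeholder)."""
    resp = requests.post(
        "https://api.elevenlabs.io/v1/text-to-speech/YOUR_VOICE_ID",
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": text, "model_id": "eleven_multilingual_v2"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.content

@app.post("/voice-chat")  # Step 1: the webhook receives the voice request
def voice_chat():
    session_id = request.form.get("session_id", "default")
    audio_upload = request.files["audio"]  # assumption: audio arrives as a multipart upload

    # Step 2: OpenAI speech-to-text.
    transcript = openai_client.audio.transcriptions.create(
        model="whisper-1",
        file=(audio_upload.filename or "voice.ogg", audio_upload.read()),
    )
    user_text = transcript.text

    # Steps 3-4: retrieve and aggregate this session's dialogue history.
    history = histories.setdefault(session_id, [])

    # Step 5: generate a reply with Google Gemini, conditioned on that history.
    chat = gemini.start_chat(history=history)
    reply_text = chat.send_message(user_text).text

    # Step 6: write the new turn back so the next request sees it.
    history.append({"role": "user", "parts": [user_text]})
    history.append({"role": "model", "parts": [reply_text]})

    # Steps 7-8: synthesize speech and return the audio to the caller.
    return Response(text_to_speech(reply_text), mimetype="audio/mpeg")

if __name__ == "__main__":
    app.run(port=5678)
```

In the actual workflow these responsibilities belong to dedicated nodes, and the Memory Manager maintains the dialogue history instead of the in-process dictionary used here.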
Involved Systems or Services
- Webhook: Receives and responds to HTTP requests.
- OpenAI: Speech-to-text service.
- LangChain Memory Manager: Dialogue memory management to maintain context.
- Google Gemini (PaLM API): Powerful multi-turn dialogue language generation model.
- ElevenLabs: High-quality text-to-speech API.
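
If the same stack is driven from code, as in the sketches above, each external service needs a credential; the environment variable names below are placeholders, and inside n8n these keys would live in node credentials instead.

```python
import os

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]          # OpenAI speech-to-text
GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]          # Google Gemini (PaLM API)
ELEVENLABS_API_KEY = os.environ["ELEVENLABS_API_KEY"]  # ElevenLabs text-to-speech
# The Webhook and Memory Manager nodes run inside n8n itself and need no external key.
```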
Target Users and Value
- Developers and enterprises building intelligent voice interaction systems.
- Industries such as customer service, education, and accessibility technology aiming to enhance user interaction experiences.
- Organizations seeking to reduce manual costs and improve response speed through automated workflows.
- Technical teams with high demands for multi-turn voice dialogue context management.
By integrating industry-leading AI speech recognition, language understanding, and speech synthesis technologies, this workflow lets users rapidly build intelligent voice chatbots with contextual memory, significantly improving the naturalness and efficiency of voice interaction.