AI Voice Chat using Webhook, Memory Manager, OpenAI, Google Gemini & ElevenLabs
This workflow builds an intelligent voice chat system that converts voice input into text, manages contextual memory, generates intelligent responses, and produces natural voice output. By combining several AI services, it accurately recognizes the user's speech while retaining the context of the conversation, enabling personalized replies. The result is a smooth voice interaction experience suited to scenarios such as smart assistants, customer service, and online education, improving both user experience and service efficiency.

Workflow Name
AI Voice Chat using Webhook, Memory Manager, OpenAI, Google Gemini & ElevenLabs
Key Features and Highlights
This workflow implements a voice-based intelligent chat system that supports speech-to-text conversion, contextual memory management, intelligent response generation, and speech synthesis output. Its key highlights: OpenAI handles speech-to-text, the Google Gemini model handles context understanding and response generation, LangChain's memory management modules maintain conversational continuity, and ElevenLabs' high-quality text-to-speech API produces natural, fluent voice replies.
Core Problems Addressed
- Accurate recognition of user speech content in voice chat (via OpenAI speech-to-text)
- Bridging context gaps and limited conversation memory in intelligent chat (via Memory Manager and Window Buffer Memory for context management)
- Providing natural and personalized voice reply experiences (using Google Gemini for response generation and ElevenLabs for text-to-speech)
- Enabling end-to-end automated voice Q&A without manual user intervention
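The sliding-window context management mentioned above can be sketched in a few lines of Python. This is not LangChain's actual API; `WindowBufferMemory` and its methods are illustrative names for the idea behind the Window Buffer Memory node, which keeps only the most recent exchanges as model context:

```python
from collections import deque

class WindowBufferMemory:
    """Keeps only the most recent `window_size` exchanges as chat context."""

    def __init__(self, window_size: int = 5):
        # deque with maxlen silently drops the oldest turn when full.
        self.buffer = deque(maxlen=window_size)

    def insert(self, user_text: str, ai_text: str) -> None:
        """Store one question/answer pair (the 'Insert Chat' step)."""
        self.buffer.append({"user": user_text, "ai": ai_text})

    def get_context(self) -> str:
        """Flatten the retained history into a prompt prefix (the 'Get Chat' step)."""
        return "\n".join(
            f"User: {turn['user']}\nAssistant: {turn['ai']}"
            for turn in self.buffer
        )
```

With a window of 2, inserting a third exchange evicts the first, so the model only ever sees the most recent turns.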
Application Scenarios
- Intelligent voice assistants and customer service chatbots
- Voice-interactive smart home control
- Q&A systems for education tutoring and language learning
- Voice interfaces for online consultation or information retrieval services
- Any natural language interaction system requiring voice input and output
Main Workflow Steps
- Webhook receives voice message: Accept user voice data via HTTP POST.
- OpenAI speech-to-text: Convert the uploaded user voice into text content.
- Retrieve historical conversation context: Use Memory Manager’s “Get Chat” node to fetch prior dialogue, ensuring context continuity.
- Aggregate contextual data: Organize and consolidate historical conversations to provide complete context for the model.
- Invoke Google Gemini chat model: Generate intelligent textual replies based on aggregated context and current user input.
- Save latest conversation content: Use Memory Manager’s “Insert Chat” node to store current Q&A, updating the context memory.
- Text-to-speech conversion: Call ElevenLabs’ API to convert AI-generated text replies into natural speech.
- Respond to Webhook: Return the generated voice data to the requester, completing the voice Q&A loop.
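The steps above can be sketched as plain Python functions. The endpoint URLs follow the public OpenAI, Google Gemini, and ElevenLabs REST APIs, but the model names, helper signatures, and in-memory history are illustrative assumptions, not taken from the workflow itself:

```python
import os
import requests

# Credentials are assumed to come from environment variables.
OPENAI_KEY = os.environ.get("OPENAI_API_KEY", "")
GEMINI_KEY = os.environ.get("GEMINI_API_KEY", "")
ELEVEN_KEY = os.environ.get("ELEVENLABS_API_KEY", "")

def speech_to_text(audio_bytes: bytes) -> str:
    # Step 2: OpenAI transcription endpoint (Whisper).
    resp = requests.post(
        "https://api.openai.com/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {OPENAI_KEY}"},
        files={"file": ("voice.mp3", audio_bytes, "audio/mpeg")},
        data={"model": "whisper-1"},
    )
    resp.raise_for_status()
    return resp.json()["text"]

def build_prompt(history: list, user_text: str) -> str:
    # Steps 3-4: aggregate prior turns plus the current question into one prompt.
    lines = [f"User: {t['user']}\nAssistant: {t['ai']}" for t in history]
    lines.append(f"User: {user_text}\nAssistant:")
    return "\n".join(lines)

def generate_reply(prompt: str) -> str:
    # Step 5: Google Gemini generateContent (model name is an assumption).
    resp = requests.post(
        "https://generativelanguage.googleapis.com/v1beta/models/"
        f"gemini-1.5-flash:generateContent?key={GEMINI_KEY}",
        json={"contents": [{"parts": [{"text": prompt}]}]},
    )
    resp.raise_for_status()
    return resp.json()["candidates"][0]["content"]["parts"][0]["text"]

def text_to_speech(text: str, voice_id: str) -> bytes:
    # Step 7: ElevenLabs TTS; voice_id selects a workflow-specific voice.
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": ELEVEN_KEY},
        json={"text": text},
    )
    resp.raise_for_status()
    return resp.content  # raw audio bytes

def handle_voice_message(audio_bytes: bytes, history: list, voice_id: str) -> bytes:
    user_text = speech_to_text(audio_bytes)             # step 2
    prompt = build_prompt(history, user_text)           # steps 3-4
    ai_text = generate_reply(prompt)                    # step 5
    history.append({"user": user_text, "ai": ai_text})  # step 6
    return text_to_speech(ai_text, voice_id)            # steps 7-8
```

In the actual workflow, each function corresponds to one n8n node, and the history list is replaced by the Memory Manager nodes.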
Involved Systems or Services
- Webhook: Handles receiving and responding to HTTP requests, serving as the entry and exit point for voice messages.
- OpenAI (Speech to Text): Converts user speech into text.
- LangChain Memory Manager & Window Buffer Memory: Manages conversation history and context to maintain memory continuity.
- Google Gemini Chat Model: Generates intelligent text replies based on context.
- ElevenLabs: Converts reply text into high-quality speech output, supporting various voice styles.
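To make the Webhook node's role as entry and exit point concrete, here is a minimal sketch of the HTTP contract it implements, written as a stdlib-only WSGI app. The real workflow uses n8n's Webhook node rather than custom server code, and `run_voice_pipeline` is a stand-in for the full STT, Gemini, and TTS chain:

```python
def run_voice_pipeline(audio_in: bytes) -> bytes:
    # Placeholder: the real chain would run speech-to-text, the chat model,
    # and text-to-speech. Here it echoes the input so the contract is testable.
    return audio_in

def voice_webhook_app(environ, start_response):
    """Minimal WSGI app: POST body is the user's audio, response body is reply audio."""
    if environ["REQUEST_METHOD"] != "POST":
        start_response("405 Method Not Allowed", [("Content-Type", "text/plain")])
        return [b"POST audio to this endpoint"]
    length = int(environ.get("CONTENT_LENGTH") or 0)
    audio_in = environ["wsgi.input"].read(length)
    audio_out = run_voice_pipeline(audio_in)
    start_response("200 OK", [
        ("Content-Type", "audio/mpeg"),
        ("Content-Length", str(len(audio_out))),
    ])
    return [audio_out]
```

Returning the synthesized audio directly in the HTTP response is what closes the voice Q&A loop for the caller.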
Target Users and Value Proposition
- Developers and enterprises aiming to rapidly build intelligent voice interaction systems to enhance user experience and service efficiency.
- Technical teams needing to integrate multimodal AI capabilities (speech recognition, natural language understanding and generation, speech synthesis).
- Product managers and engineers in voice assistants, intelligent customer service, online education, smart home, and related fields.
- Those seeking to leverage advanced AI models and cloud APIs to create voice bots with contextual memory and natural interaction capabilities.
Through multi-node collaboration, this workflow closes the loop from voice input to intelligent voice output, combining context understanding with multi-service integration to significantly improve the intelligence and user experience of voice interactions.