Easily Compare LLMs Using OpenAI and Google Sheets
This workflow automates the comparison of large language models: for each user chat input it invokes multiple models independently in real time, then records their responses and contextual information in Google Sheets for later evaluation and comparison. It keeps each model's conversation memory isolated so context is transmitted accurately, and it provides a user-friendly Google Sheets template so that non-technical team members can take part in model performance evaluation, improving the team's decision-making efficiency and testing accuracy.

Workflow Name
Easily Compare LLMs Using OpenAI and Google Sheets
Key Features and Highlights
- Receives user chat input in real time and independently invokes two different large language models (LLMs) to respond to the same input.
- Automatic synchronization and recording of both models’ responses along with contextual information into Google Sheets for convenient subsequent comparison and evaluation.
- Side-by-side display of the two models’ answers within the chat interface, supporting intuitive comparison.
- Support for session ID-based isolation of each model's memory to ensure accurate context transfer (see the sketch after this list).
- Flexible compatibility with multiple model providers, including OpenRouter, OpenAI, and Google Vertex AI, making it easy to expand or switch models.
- Provides teams with simple, user-friendly Google Sheets templates, enabling non-technical personnel to participate in model performance evaluation.
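To make the session-ID-based memory isolation concrete, here is a minimal TypeScript sketch of one way to derive a separate memory key per model from the chat session; the model list, field names, and helper function are illustrative assumptions, not identifiers from the workflow itself.

```typescript
// Hypothetical sketch: give each model its own memory key so conversation
// history from one model never leaks into the other's context.
interface ChatInput {
  sessionId: string; // session ID supplied by the chat trigger
  message: string;   // the user's latest message
}

// Models under comparison (assumed defaults; replace with your own list).
const MODELS = ["openai/gpt-4.1", "mistralai/mistral-large"];

// Build one isolated session key per model, e.g. "abc123__openai/gpt-4.1".
function memoryKeysFor(input: ChatInput): Map<string, string> {
  const keys = new Map<string, string>();
  for (const model of MODELS) {
    keys.set(model, `${input.sessionId}__${model}`);
  }
  return keys;
}
```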
Core Problem Addressed
In AI agent development, the non-deterministic nature of large language models means that selecting the most suitable model often requires repeated testing and comparison. This workflow automates that comparison, eliminating tedious manual invocation and answer collection, thereby improving efficiency and accuracy.
Application Scenarios
- AI product development teams evaluating the performance of different LLMs.
- Selecting the best language model among multiple options for production deployment.
- Enabling non-technical members within organizations to participate in assessing model response quality.
- Educational and research institutions conducting comparative experiments on language models.
Main Process Steps
- User sends a message via the chat interface to trigger the workflow.
- Define and split the list of models to be compared (defaulting to OpenAI GPT-4.1 and Mistral Large).
- Assign independent session IDs for each model to achieve memory isolation.
- Simultaneously invoke both models to generate responses; AI Agent nodes handle the model calls and context management (a sketch of this fan-out and aggregation follows the list).
- Aggregate and organize the two models’ answers, formatting them into easily readable and comparable text.
- Write user input, model responses, context, and evaluation fields into Google Sheets.
- Display both models’ answers in the chat interface to support immediate side-by-side comparison.
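The TypeScript sketch below illustrates the fan-out and aggregation in the steps above under stated assumptions: `callModel` is a hypothetical stand-in for the AI Agent invocation, and the row fields mirror, rather than reproduce, the workflow's Google Sheets columns.

```typescript
// Hypothetical sketch of the orchestration: fan the same prompt out to every
// model, then fold the answers into one spreadsheet row and one chat reply.
type CallModel = (model: string, sessionKey: string, prompt: string) => Promise<string>;

interface ComparisonRow {
  timestamp: string;
  sessionId: string;
  prompt: string;
  answers: Record<string, string>; // model name -> answer
  evaluation: string;              // left blank for human reviewers in Sheets
}

async function compareModels(
  models: string[],
  sessionId: string,
  prompt: string,
  callModel: CallModel,
): Promise<{ row: ComparisonRow; chatReply: string }> {
  // Invoke every model in parallel, each with its own isolated session key.
  const answers = await Promise.all(
    models.map((m) => callModel(m, `${sessionId}__${m}`, prompt)),
  );

  // One structured row destined for Google Sheets.
  const row: ComparisonRow = {
    timestamp: new Date().toISOString(),
    sessionId,
    prompt,
    answers: Object.fromEntries(models.map((m, i) => [m, answers[i]] as const)),
    evaluation: "",
  };

  // Side-by-side text shown back in the chat interface.
  const chatReply = models
    .map((m, i) => `### ${m}\n${answers[i]}`)
    .join("\n\n---\n\n");

  return { row, chatReply };
}
```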
Involved Systems or Services
- OpenRouter API (supporting calls to OpenAI, Mistral, and other models; see the request sketch after this list)
- Google Sheets (used as the results recording and evaluation platform)
- Core n8n automation platform nodes (Set, Split, Loop, Aggregate, etc.)
- LangChain-related nodes (chat trigger, memory management, AI agent)
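For reference, a direct call to OpenRouter's OpenAI-compatible chat-completions endpoint looks roughly like this; the workflow itself issues these calls through n8n's chat-model nodes, and the helper name and environment variable below are placeholders rather than part of the template.

```typescript
// Rough sketch of a direct OpenRouter request (OpenAI-compatible API).
async function askOpenRouter(model: string, prompt: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model, // e.g. "openai/gpt-4.1" or "mistralai/mistral-large"
      messages: [{ role: "user", content: prompt }],
    }),
  });
  if (!res.ok) {
    throw new Error(`OpenRouter request failed: ${res.status}`);
  }
  // OpenAI-compatible response shape: take the first choice's message content.
  const data = (await res.json()) as { choices: { message: { content: string } }[] };
  return data.choices[0].message.content;
}
```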
Target Users and Value
- AI developers and data scientists: Quickly benchmark model performance and optimize model selection.
- Product managers and business personnel: Participate intuitively in model evaluation via Google Sheets.
- Educators and researchers: Conveniently set up multi-model comparison experimental environments.
- Teams: Manage and compare model responses on a unified platform, enhancing decision-making efficiency.
This workflow significantly simplifies the multi-model comparison process. Through automation and structured data recording, it helps teams scientifically and systematically select the best language model, reducing trial-and-error costs in AI projects.