Demonstration Workflow for Prompt-Based Object Detection and Image Annotation Using Google Gemini 2.0

This workflow uses the Google Gemini 2.0 multimodal AI model to detect and annotate objects in images based on text prompts. It locates the requested objects (rabbits, in this demonstration) and draws bounding boxes around them on the original image. Because the target is named in the prompt rather than fixed by the model, different semantic targets can be located dynamically, and the returned coordinates are scaled to match the original image size. This makes it suitable for scenarios such as intelligent image analysis, anomalous-behavior detection, and automated labeling in e-commerce.

Object Detection, Image Annotation

Key Features and Highlights

This workflow demonstrates text-prompt-driven object detection with the Google Gemini 2.0 multimodal AI model. It locates the requested objects in an image (e.g., rabbits) and draws bounding boxes on the original image. The model returns normalized coordinates, which are scaled to the original image's width and height so that the boxes line up with the image. Once triggered, the workflow runs end to end without manual intervention, which speeds up image analysis and annotation.

Core Problems Addressed

Traditional object detection often relies on fixed models, which makes it hard to change the detection target. This workflow names the target in the prompt, so objects can be located according to different semantic descriptions, enabling context-driven recognition and localization. In addition, the coordinate-scaling step and the image editing node resolve the mismatch between the detection results and the original image size, so the output can be read directly off the annotated image.
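As a rough illustration of the prompt-based request described above, the following Python sketch builds a request body in which only the `target` string changes between detection tasks. The function name `build_detection_request`, the prompt wording, and the `box_2d`/`label` schema are hypothetical choices for this example. The field names (`contents`, `inline_data`, `generationConfig`, `response_schema`) follow the publicly documented Gemini REST request shape as an assumption; check them against the current API reference before relying on them. The template itself may configure its HTTP Request node differently.

```python
import base64


def build_detection_request(image_bytes: bytes, target: str,
                            mime_type: str = "image/jpeg") -> dict:
    """Build a request body asking the model for boxes around `target`.

    Hypothetical helper: the prompt text and response schema are
    illustrative assumptions, not taken from the workflow itself.
    """
    prompt = (
        f"Detect all {target} in the image. "
        "Return a bounding box for each one as [ymin, xmin, ymax, xmax]."
    )
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {
                        # The image travels inline as base64 text.
                        "inline_data": {
                            "mime_type": mime_type,
                            "data": base64.b64encode(image_bytes).decode("ascii"),
                        }
                    },
                ]
            }
        ],
        # Ask for structured JSON so the next step can parse it directly.
        "generationConfig": {
            "response_mime_type": "application/json",
            "response_schema": {
                "type": "ARRAY",
                "items": {
                    "type": "OBJECT",
                    "properties": {
                        "box_2d": {"type": "ARRAY", "items": {"type": "NUMBER"}},
                        "label": {"type": "STRING"},
                    },
                },
            },
        },
    }


if __name__ == "__main__":
    # Switching the target needs no model change, only a different string.
    for target in ("rabbits", "adults with children"):
        body = build_detection_request(b"placeholder image bytes", target)
        print(body["contents"][0]["parts"][0]["text"])
```

The point of the design is that the detection target lives entirely in the prompt text, which is what lets one workflow serve different semantic targets.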

Application Scenarios

Intelligent image content analysis and annotation
Visual search and classification, e.g., “label all adults with children”
Anomaly detection in surveillance scenarios
Automated product image annotation for e-commerce
Media content management and retrieval
AI-assisted image editing and enhancement

Main Workflow Steps

Download Test Image: Retrieve the target image via an HTTP request node.
Obtain Image Size Information: Extract the image’s width and height using the image editing node.
Invoke Google Gemini 2.0 Object Detection API: Send a text prompt such as “detect all rabbits in the image” and receive bounding box coordinates in normalized form.
Extract and Process Returned Coordinates: Use a code node to scale the normalized coordinates to the original image dimensions.
Draw Bounding Boxes: Utilize the image editing node to draw detected object bounding boxes on the original image for visual annotation.
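The coordinate-processing step above can be sketched in Python as follows. This is a minimal illustration, not the workflow's actual code node: it assumes the model returns a JSON array of objects with a `box_2d` field in `[ymin, xmin, ymax, xmax]` order, normalized to a 0 to 1000 range, and the function name `scale_boxes` and the output keys are hypothetical. If the response uses a different order or range, adjust the unpacking and the `scale` argument accordingly.

```python
import json


def scale_boxes(response_text: str, image_width: int, image_height: int,
                scale: int = 1000) -> list:
    """Convert normalized boxes to pixel rectangles on the original image.

    Assumes each detection looks like
    {"box_2d": [ymin, xmin, ymax, xmax], "label": "..."} with values
    in the range 0..scale (an assumption for this sketch).
    """
    detections = json.loads(response_text)
    boxes = []
    for det in detections:
        ymin, xmin, ymax, xmax = det["box_2d"]
        # x values scale with the width, y values with the height.
        left = round(xmin * image_width / scale)
        top = round(ymin * image_height / scale)
        right = round(xmax * image_width / scale)
        bottom = round(ymax * image_height / scale)
        boxes.append({
            "label": det.get("label", ""),
            "x": left,
            "y": top,
            "width": right - left,
            "height": bottom - top,
        })
    return boxes


if __name__ == "__main__":
    sample = '[{"box_2d": [250, 100, 750, 500], "label": "rabbit"}]'
    for box in scale_boxes(sample, image_width=1200, image_height=800):
        print(box)
```

The resulting pixel rectangles are what a drawing step needs: a top-left corner plus a width and height, all expressed in the original image's dimensions obtained in the size-extraction step.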

Systems and Services Involved

Google Gemini 2.0 API: Provides multimodal, text-prompt-driven object detection capabilities.
n8n HTTP Request Node: Downloads images and calls the API.
n8n Image Editing Node: Retrieves image metadata and draws bounding boxes.
n8n Code Node: Performs coordinate scaling calculations.
n8n Manual Trigger Node: Initiates the entire workflow execution.

Target Users and Value

AI developers and image processing engineers looking to quickly build and validate multimodal object detection capabilities.
Content moderators and managers who require automated image annotation and filtering.
Product managers and business personnel exploring AI-driven intelligent image solutions.
Any user who needs to identify and annotate specific objects in images from a textual description, reducing the time spent on manual labeling.

This workflow is a practical demonstration of multimodal AI applied to image understanding, and a starting point for building visual automation processes.
