Demonstration Workflow for Prompt-Based Object Detection and Image Annotation Using Google Gemini 2.0
This workflow uses the Google Gemini 2.0 multimodal AI model to perform image object detection and annotation driven by text prompts. By automatically locating prompt-specified objects (such as rabbits) and drawing precise bounding boxes, it improves the efficiency of image analysis and annotation. It overcomes the limited flexibility of traditional fixed-label models, supports dynamic localization of arbitrary semantic targets, and ensures that detection results match the original image dimensions. This makes it suitable for scenarios such as intelligent image analysis, anomaly behavior detection, and automated labeling in e-commerce.

Workflow Name
Demonstration Workflow for Prompt-Based Object Detection and Image Annotation Using Google Gemini 2.0
Key Features and Highlights
This workflow demonstrates how to leverage the Google Gemini 2.0 multimodal AI model to perform text-prompt-driven object detection. It automatically locates specified objects in an image (e.g., rabbits) and draws precise bounding boxes on the original. Because the coordinates returned by the model are normalized, they are scaled back to the original image dimensions before drawing, so the annotations align exactly with the source image. The entire process runs without manual intervention, significantly improving the efficiency of image analysis and annotation.
Core Problems Addressed
Traditional object detection pipelines rely on models trained for fixed label sets, making it difficult to change detection targets flexibly. This workflow addresses that limitation by locating objects dynamically from prompt-based requests, enabling context-driven image recognition and localization for arbitrary semantic targets. In addition, by scaling the model's normalized coordinates and drawing boxes with the image editing node, it resolves the mismatch between detection output and the original image size, producing results that are intuitive and ready to use.
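The coordinate scaling described above can be sketched in plain JavaScript (the same logic fits directly into an n8n Code node). The `[ymin, xmin, ymax, xmax]` box order and the 0–1000 normalization range are assumptions based on Gemini's published spatial-understanding examples; the function name `scaleBox` is illustrative.

```javascript
// Sketch: convert one Gemini-style bounding box (assumed to be
// [ymin, xmin, ymax, xmax], normalized to 0-1000) into pixel
// coordinates matching the original image.
function scaleBox(box2d, imageWidth, imageHeight) {
  const [ymin, xmin, ymax, xmax] = box2d;
  return {
    x: Math.round((xmin / 1000) * imageWidth),
    y: Math.round((ymin / 1000) * imageHeight),
    width: Math.round(((xmax - xmin) / 1000) * imageWidth),
    height: Math.round(((ymax - ymin) / 1000) * imageHeight),
  };
}

// Example: a box covering the central quarter of a 2000x1000 image.
console.log(scaleBox([250, 250, 750, 750], 2000, 1000));
// → { x: 500, y: 250, width: 1000, height: 500 }
```

The resulting `x`, `y`, `width`, and `height` values can be fed straight into a drawing operation on the original image.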
Application Scenarios
- Intelligent image content analysis and annotation
- Visual search and classification, e.g., “label all adults with children”
- Anomaly detection in surveillance scenarios
- Automated product image annotation for e-commerce
- Media content management and retrieval
- AI-assisted image editing and enhancement
Main Workflow Steps
- Download Test Image: Retrieve the target image via an HTTP request node.
- Obtain Image Size Information: Extract the image’s width and height using the image editing node.
- Invoke Google Gemini 2.0 Object Detection API: Send a text prompt such as “detect all rabbits in the image” and receive bounding box coordinates in normalized form.
- Extract and Process Returned Coordinates: Use a code node to scale the normalized coordinates to the original image dimensions.
- Draw Bounding Boxes: Utilize the image editing node to draw detected object bounding boxes on the original image for visual annotation.
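The API call in step 3 can be sketched as a request body for Gemini's `generateContent` REST endpoint. The model name `gemini-2.0-flash`, the endpoint URL, and the requested response schema are assumptions based on the public Gemini REST API; adapt them in your own HTTP Request node.

```javascript
// Sketch: build a generateContent request body that sends an image
// plus a detection prompt and asks for a JSON reply.
function buildDetectionRequest(base64Image, prompt) {
  return {
    contents: [{
      parts: [
        { inline_data: { mime_type: "image/jpeg", data: base64Image } },
        { text: prompt },
      ],
    }],
    generationConfig: { response_mime_type: "application/json" },
  };
}

const body = buildDetectionRequest(
  "<base64-encoded image>",
  "Detect all rabbits in the image. Return a JSON list where each entry " +
  "has a 'label' and a 'box_2d' as [ymin, xmin, ymax, xmax] normalized to 0-1000."
);
// POST this body as JSON to (API key/auth per your Gemini setup):
// https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent
```

Requesting an explicit JSON schema in the prompt keeps the downstream coordinate-extraction step simple and predictable.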
Systems and Services Involved
- Google Gemini 2.0 API: Provides multimodal, text-prompt-driven object detection capabilities.
- n8n HTTP Request Node: Downloads images and calls the API.
- n8n Image Editing Node: Retrieves image metadata and draws bounding boxes.
- n8n Code Node: Performs coordinate scaling calculations.
- n8n Manual Trigger Node: Initiates the entire workflow execution.
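The coordinate-extraction work done in the Code node can be sketched as pulling the model's JSON answer out of a `generateContent` response. The `candidates[0].content.parts[0].text` path follows the public Gemini REST response format; the fence-stripping and the detection schema are illustrative assumptions.

```javascript
// Sketch: extract the detection list from a Gemini generateContent response.
function parseDetections(response) {
  const text = response.candidates[0].content.parts[0].text;
  // Strip an optional ```json ... ``` fence the model sometimes emits.
  const cleaned = text.replace(/^```(?:json)?\s*/, "").replace(/\s*```$/, "");
  return JSON.parse(cleaned);
}

// Mock response for illustration:
const mock = {
  candidates: [{
    content: {
      parts: [{ text: '[{"label": "rabbit", "box_2d": [120, 80, 540, 400]}]' }],
    },
  }],
};
console.log(parseDetections(mock));
// → [ { label: 'rabbit', box_2d: [ 120, 80, 540, 400 ] } ]
```

Each parsed `box_2d` entry is then scaled to pixel coordinates and passed to the image editing node for drawing.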
Target Users and Value
- AI developers and image processing engineers looking to quickly build and validate multimodal object detection capabilities.
- Content moderators and managers who require automated image annotation and filtering.
- Product managers and business personnel exploring AI-driven intelligent image solutions.
- Anyone needing to automatically identify and annotate specific objects in images from textual descriptions, significantly reducing manual labeling time while improving efficiency and accuracy.
This workflow offers a practical and intuitive demonstration of cutting-edge multimodal AI technology applied to image understanding, empowering users to effortlessly build intelligent visual automation processes.