AI-Powered Automatic Image Caption and Text Watermark Generation
This workflow uses a multimodal vision-language model to generate a title and description for an image and overlay them on the picture as a text watermark. Users import an image; the workflow then resizes it, generates the text, and positions the text so that it stays legible, which removes most of the manual caption writing. It suits media, e-commerce, and social media work, where content creators and designers need captioned images quickly.

Workflow Name
AI-Powered Automatic Image Caption and Text Watermark Generation
Key Features and Highlights
This workflow uses Google Gemini’s multimodal vision-language model to generate a title and descriptive text for each input image, then overlays that text as a watermark at the bottom of the image. The process runs end to end without manual intervention. It resizes the image and calculates the text position so that the watermark stays clear and legible.
Core Problems Addressed
- Automates the generation of contextually relevant titles and descriptions for images, significantly reducing the time and effort required for manual text creation.
- Enables seamless fusion of images and text, facilitating content publishing, copyright marking, and social media sharing.
- Leverages advanced multimodal AI models to enhance the accuracy and creativity of image understanding and text generation.
Application Scenarios
- Media and publishing: generate image captions automatically to speed up content production.
- E-commerce platforms: create titles and descriptions for product images to improve the shopping experience.
- Social media management: produce captioned, watermarked visuals quickly to support brand communication.
- Photography and design: add copyright information or creative descriptions to finished works.
Main Workflow Steps
- Import Images: Download an image from a free stock photo site such as Pexels with an HTTP Request node, or replace that node with another trigger or image source.
- Image Preprocessing: Resize images to 512x512 pixels, the input size this workflow sends to the AI model.
- Invoke Google Gemini Vision-Language Model: Send the preprocessed images to the Google Gemini model to generate titles and descriptive text.
- Structured Parsing of Generated Content: Use a structured output parser to format and process the AI-generated text.
- Calculate Text Positioning: Employ a custom code node to compute the position and size of the text box, ensuring the text is appropriately placed at the bottom of the image.
- Text Overlay and Composition: Use the Edit Image node to overlay the generated title and description onto the image with a semi-transparent background and white font.
- Output Result: Produce images with AI-generated text watermarks, ready for publication and reuse.
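
The text positioning step can be sketched as follows. This is a minimal Python illustration, not the workflow's actual Code node: the name layout_caption, the font size, the average character width ratio, the line spacing, and the padding are all assumptions chosen for the example. It wraps the caption to the image width and anchors a full-width background box to the bottom edge.

```python
import textwrap
from dataclasses import dataclass
from typing import List


@dataclass
class CaptionLayout:
    lines: List[str]   # wrapped caption lines, top to bottom
    box_x: int         # left edge of the background box
    box_y: int         # top edge of the background box
    box_width: int
    box_height: int
    text_x: int        # left edge of the first text line
    text_y: int        # top edge of the first text line


def layout_caption(
    text: str,
    image_width: int = 512,
    image_height: int = 512,
    font_size: int = 18,             # assumed font size in pixels
    char_width_ratio: float = 0.55,  # assumed average glyph width / font size
    line_spacing: float = 1.3,       # assumed line height / font size
    padding: int = 12,               # assumed padding inside the box
) -> CaptionLayout:
    """Wrap text to the image width and anchor a box at the bottom edge."""
    usable_width = image_width - 2 * padding
    # Estimate how many characters fit on one line (a real font needs measuring).
    chars_per_line = max(1, int(usable_width / (font_size * char_width_ratio)))
    lines = textwrap.wrap(text, width=chars_per_line) or [""]
    line_height = round(font_size * line_spacing)
    # Never let the box grow taller than the image itself.
    box_height = min(image_height, len(lines) * line_height + 2 * padding)
    box_y = image_height - box_height
    return CaptionLayout(
        lines=lines,
        box_x=0,
        box_y=box_y,
        box_width=image_width,
        box_height=box_height,
        text_x=padding,
        text_y=box_y + padding,
    )


layout = layout_caption("A quiet harbor at dawn")
print(layout.box_y, layout.box_height)
```

With these assumed defaults, a one-line caption on a 512x512 image gets a box 47 pixels tall starting at y = 465. Estimating width from an average character ratio is a simplification; measuring the rendered text with the actual font would give tighter wrapping.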
Systems and Services Involved
- Google Gemini Chat Model (uses n8n's Google Gemini (PaLM) API credential): Multimodal vision-language AI model
- HTTP Request Node: Image resource acquisition
- Edit Image Node: Image editing and text overlay
- Code Node: Calculation of text position and size
- LangChain Node Suite: AI model invocation and output parsing
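
The output parsing role can be illustrated outside n8n with a short Python sketch. It is not the LangChain parser the workflow uses: the field names title and description, the Caption class, and the parse_caption function are hypothetical, chosen to match the two pieces of text the workflow generates. The idea is to pull a JSON object out of the model's reply and reject it if an expected field is missing.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class Caption:
    title: str
    description: str


def parse_caption(raw_reply: str) -> Caption:
    """Extract and validate a JSON caption from a model reply."""
    # Models often surround JSON with extra prose or a code fence,
    # so take everything from the first "{" to the last "}".
    match = re.search(r"\{.*\}", raw_reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(match.group(0))  # raises ValueError on malformed JSON
    missing = [
        key for key in ("title", "description")
        if not isinstance(data.get(key), str)
    ]
    if missing:
        raise ValueError(f"missing or non-string fields: {missing}")
    return Caption(data["title"].strip(), data["description"].strip())


reply = 'Sure! {"title": " Harbor at Dawn ", "description": "Boats rest on still water."} Hope this helps.'
caption = parse_caption(reply)
print(caption.title)
print(caption.description)
```

Failing loudly on a malformed reply keeps a bad generation from reaching the overlay step, where it would otherwise be rendered onto the image.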
Target Users and Value
- Content creators, editors, and media professionals seeking rapid generation of image captions.
- E-commerce operators aiming to improve product image copy quality and visual appeal.
- Social media managers automating the creation of visually engaging images with text.
- Designers and photographers who want to effortlessly add copyright or descriptive information to their works.
- Automation enthusiasts and developers interested in practical applications of multimodal AI models in image-text processing.
This workflow combines n8n’s low-code automation with a multimodal AI model, so users can generate image text and composite it onto the image in a single automated run instead of writing and placing captions by hand.