Generate AI-Ready llms.txt Files from Screaming Frog Website Crawls

This workflow processes the CSV file exported from a Screaming Frog crawl and generates an `llms.txt` file: a structured list of a site's pages intended for use by large language models. It adapts to exports in multiple languages, filters URLs by status code, indexability, and content type, and can optionally apply an AI text classifier so that only relevant, high-quality pages are kept. Users upload the CSV through a form, and the resulting file can be downloaded directly or saved to cloud storage.

Tags

Web Crawler, Text Generation

Key Features and Highlights

This workflow generates an llms.txt file from the CSV export of a Screaming Frog website crawl. It adapts automatically to the fields of exports produced in different languages and includes URL filtering criteria that can be extended. An optional AI text classifier can filter pages for quality and relevance. The resulting llms.txt file can be downloaded within the n8n interface, or uploaded automatically to a cloud drive such as Google Drive or OneDrive.

Core Problems Addressed

Raw web crawl data is often disorganized and unsuitable for direct use with large language models (LLMs). This workflow automates the cleaning and filtering of crawl results, keeping only indexable, high-quality pages and producing a structured text file that is easy for language models to parse. It reduces manual filtering and formatting effort and makes data preparation less error-prone.

Use Cases

  • SEO specialists and content strategists needing to quickly generate website content index files to aid content optimization and discovery
  • AI developers training custom language models using website crawl data
  • Digital marketing teams organizing website structure and content descriptions for automated reporting and analysis
  • Multilingual website content management, supporting languages such as French, Italian, German, Spanish, and more

Main Workflow Steps

  1. Form Trigger: Upload the website name, a brief description, and the Screaming Frog-exported internal_html.csv file
  2. Data Extraction: Parse the CSV file to extract seven key fields including URL, title, description, and status code
  3. URL Filtering: Keep only pages that return status code 200, are indexable by search engines, and have an HTML content type
  4. (Optional) Text Classification: Enable an AI text classifier to separate high-quality pages from the rest based on URL, title, description, and word count
  5. Formatting: Generate one text line per record in the format `- [Title](URL): Description`; omit the colon and description if no description is available
  6. Content Aggregation: Combine all qualifying lines into a complete llms.txt file content
  7. File Generation and Download: Produce the final text file, which can be downloaded directly or automatically saved to cloud storage via a replaceable upload node
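
Steps 2, 3, 5, and 6 above can be sketched in a few lines of Python. This is a minimal illustration rather than the workflow's actual n8n implementation. The column names are assumed from an English-language Screaming Frog export, the function names are hypothetical, and the header layout (site name as a `#` heading, description as a `>` quote) is an assumption, since the description above does not specify how the name and description are written to the file. The optional AI classification step is omitted.

```python
import csv
import io

# Column names assumed from an English-language Screaming Frog export;
# adjust them if your export uses different headers.
COL_URL = "Address"
COL_TITLE = "Title 1"
COL_DESC = "Meta Description 1"
COL_STATUS = "Status Code"
COL_INDEXABILITY = "Indexability"
COL_CONTENT_TYPE = "Content Type"


def field(row, column):
    """Return a trimmed cell value, treating missing cells as empty."""
    return (row.get(column) or "").strip()


def keep_row(row):
    """Step 3: status code 200, indexable, HTML content type."""
    return (
        field(row, COL_STATUS) == "200"
        and field(row, COL_INDEXABILITY).lower() == "indexable"
        and "text/html" in field(row, COL_CONTENT_TYPE).lower()
    )


def format_line(row):
    """Step 5: '- [Title](URL): Description', or no colon without one."""
    line = f"- [{field(row, COL_TITLE)}]({field(row, COL_URL)})"
    description = field(row, COL_DESC)
    return f"{line}: {description}" if description else line


def build_llms_txt(csv_text, site_name, site_description):
    """Steps 2 and 6: parse the CSV and join the qualifying lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = [format_line(row) for row in reader if keep_row(row)]
    # Header layout is an assumption, not taken from the workflow.
    header = f"# {site_name}\n\n> {site_description}\n\n"
    return header + "\n".join(lines) + "\n"
```

Calling `build_llms_txt(csv_text, "Example", "Demo site")` on the contents of an export returns the full file as a string, which can then be written to disk or passed to an upload step.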

Involved Systems or Services

  • Screaming Frog SEO Spider (website crawler with CSV export)
  • n8n Automation Platform (workflow engine)
  • OpenAI GPT-4o-mini (optional AI text classification model)
  • Cloud Storage Services (e.g., Google Drive, OneDrive; users must configure and replace the upload node accordingly)

Target Users and Value Proposition

  • Website administrators and SEO experts: Quickly organize website content structure and improve SEO content filtering efficiency
  • AI engineers and data scientists: Build high-quality training corpora to enhance language model performance
  • Content operations and digital marketing professionals: Automate content directory generation to support content management and optimization decisions
  • Multilingual website operation teams: Process exports in different languages without manual field mapping, since fields are adapted automatically

With this workflow, users only need to upload a Screaming Frog export file to obtain a structured llms.txt file, which makes website content easier to use in AI applications.
