Convert URL HTML to Markdown and Get Page Links

This workflow converts webpage content from HTML into structured Markdown and extracts every link on the page. It batch-processes multiple URLs with built-in rate-limit management, reads its URL list from your own database, and writes results to a data store of your choice, making it well suited to content analysis, market research, and website link management.

Key Features and Highlights

This workflow leverages the Firecrawl.dev API to convert webpage content from HTML into structured Markdown while extracting all links present on the page. It supports batch processing of URLs and automatically paces its API requests to stay within rate limits, ensuring efficient, stable crawling. The workflow is also designed for flexibility: URLs can be read from your own database, and the processed results can be written to the data storage system of your choice.
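For orientation, here is a minimal sketch of the kind of request the workflow issues, based on Firecrawl's public v1 scrape endpoint. The helper name `scrapePage` and the `FIRECRAWL_API_KEY` environment variable are illustrative assumptions, and the response fields (`markdown`, `links`, `metadata`) should be verified against the current Firecrawl documentation.

```typescript
// Minimal sketch of a Firecrawl v1 scrape call that requests Markdown
// plus the page's links in a single request. FIRECRAWL_API_KEY is an
// assumed environment variable, not part of the workflow itself.
const FIRECRAWL_API_KEY = process.env.FIRECRAWL_API_KEY;

async function scrapePage(url: string) {
  const res = await fetch("https://api.firecrawl.dev/v1/scrape", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${FIRECRAWL_API_KEY}`,
      "Content-Type": "application/json",
    },
    // Ask for both Markdown content and the page's links in one call.
    body: JSON.stringify({ url, formats: ["markdown", "links"] }),
  });
  if (!res.ok) throw new Error(`Firecrawl request failed: ${res.status}`);
  const { data } = await res.json();
  // Firecrawl returns the scraped content under `data`; check field
  // names against the current API docs before relying on them.
  return {
    title: data?.metadata?.title as string | undefined,
    description: data?.metadata?.description as string | undefined,
    markdown: data?.markdown as string | undefined,
    links: data?.links as string[] | undefined,
  };
}
```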

Core Problems Addressed

  • Convert webpage content into clean Markdown that AI systems and text-analysis tools can process more easily, eliminating complex HTML tags.
  • Batch crawl and organize multiple webpages and their links, saving time and labor costs associated with manual collection and formatting.
  • Handle API call rate limits to prevent request overload and failed calls (a retry sketch follows this list).
  • Support customizable data sources and output endpoints for seamless integration into existing business workflows.
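Inside n8n the rate limiting is done with Wait nodes; outside n8n the same idea can be approximated with a small retry helper. This is a sketch under the assumption that the API signals overload with HTTP 429 and an optional Retry-After header (in seconds); the function name and retry policy are illustrative, not the workflow's actual mechanism.

```typescript
// Retry on HTTP 429, honoring the Retry-After header when present,
// otherwise backing off exponentially (1s, 2s, 4s, ...).
async function fetchWithBackoff(
  url: string,
  init: RequestInit,
  maxRetries = 3,
): Promise<Response> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url, init);
    if (res.status !== 429) return res;
    const waitSeconds = Number(res.headers.get("retry-after")) || 2 ** attempt;
    await new Promise((resolve) => setTimeout(resolve, waitSeconds * 1000));
  }
  throw new Error(`Still rate limited after ${maxRetries} retries: ${url}`);
}
```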

Application Scenarios

  • Scenarios requiring conversion of webpage content into AI-friendly formats for large language model (LLM) analysis.
  • Content collection and organization for market research, competitor analysis, and content aggregation platforms.
  • Automated crawling of webpage links to build sitemaps or link repositories.
  • Enterprises or developers aiming to efficiently process web data in bulk via API.

Main Workflow Steps

  1. Manual Workflow Trigger: Initiate the process via a manual trigger node.
  2. Retrieve URL List: Read the list of webpages to crawl from the user’s own data source or example configuration, requiring the URL column to be named “Page.”
  3. Batch URL Splitting: Split the URLs into batches (by default 40 URLs per batch, processed 10 requests at a time) to stay within server memory limits and API constraints; see the sketch after this list.
  4. Wait for Rate Control: Pause between batches for a set duration to avoid triggering API rate limits.
  5. Call Firecrawl API: For each URL, call Firecrawl.dev's scrape endpoint to obtain the page's Markdown content and all of its links.
  6. Data Formatting: Organize the returned title, description, content, and links into structured data.
  7. Output Results: Export the processed data to the user’s specified database or storage system (e.g., Airtable).
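Steps 3 through 6 can be sketched in plain code as follows, reusing the `scrapePage` helper from earlier. The batch sizes mirror the defaults described above; the 30-second pause between batches is an assumption to tune against your Firecrawl plan's rate limit, not the workflow's exact wait duration.

```typescript
// One output row per processed URL, matching step 6's structured data.
type PageRow = {
  page: string;            // source URL, i.e. the "Page" column
  title?: string;
  description?: string;
  markdown?: string;
  links?: string[];
};

// Split an array into consecutive chunks of the given size.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

async function processUrls(urls: string[]): Promise<PageRow[]> {
  const rows: PageRow[] = [];
  for (const batch of chunk(urls, 40)) {        // 40 URLs per batch
    for (const group of chunk(batch, 10)) {     // 10 requests at a time
      const results = await Promise.all(
        group.map(async (page) => ({ page, ...(await scrapePage(page)) })),
      );
      rows.push(...results);
    }
    // Pause between batches so the API's rate limit is not exceeded
    // (duration is an assumption; adjust to your plan).
    await new Promise((resolve) => setTimeout(resolve, 30_000));
  }
  return rows;
}
```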

Involved Systems or Services

  • Firecrawl.dev API: Webpage content crawling and conversion service.
  • User Data Source: The database holding the URLs to crawl (must include a column named “Page”).
  • Data Output Services: Third-party databases such as Airtable (a write sketch follows this list).
  • n8n Automation Platform: For workflow orchestration and node management.
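For the Airtable case in step 7, a write might look like the sketch below, using Airtable's standard REST API. `BASE_ID`, `TABLE_NAME`, the field names, and the `AIRTABLE_TOKEN` environment variable are hypothetical placeholders for your own base and schema; `PageRow` and `chunk` come from the batching sketch above.

```typescript
const AIRTABLE_TOKEN = process.env.AIRTABLE_TOKEN;
const BASE_ID = "appXXXXXXXXXXXXXX";  // hypothetical Airtable base id
const TABLE_NAME = "Pages";           // hypothetical table name

async function saveRows(rows: PageRow[]): Promise<void> {
  // Airtable accepts at most 10 records per create request.
  for (const group of chunk(rows, 10)) {
    const res = await fetch(
      `https://api.airtable.com/v0/${BASE_ID}/${encodeURIComponent(TABLE_NAME)}`,
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${AIRTABLE_TOKEN}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          records: group.map((row) => ({
            fields: {
              Page: row.page,
              Title: row.title ?? "",
              Description: row.description ?? "",
              Markdown: row.markdown ?? "",
              Links: (row.links ?? []).join("\n"),
            },
          })),
        }),
      },
    );
    if (!res.ok) throw new Error(`Airtable write failed: ${res.status}`);
  }
}
```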

Target Users and Value

  • Content operators and data analysts who need to batch crawl and organize webpage content for subsequent analysis.
  • AI researchers and developers requiring high-quality Markdown-formatted training data.
  • Market research teams seeking to quickly aggregate competitor webpage information through automation.
  • Developers and automation engineers looking to rapidly integrate web data crawling capabilities to improve work efficiency.

This workflow was created by Simon @ automake.io and is designed for ease of use and efficiency, helping users collect and manage webpage content in a structured way with minimal effort.