Read Sitemap and Filter URLs
This workflow reads a website's sitemap.xml file, converts the XML to JSON, and extracts every URL entry. A configurable filter then keeps only the links that match a rule, such as document links ending with .pdf. This removes the need to scan the sitemap by hand and suits tasks such as SEO analysis, content management, and data analysis.

Workflow Name
Read Sitemap and Filter URLs
Key Features and Highlights
The workflow fetches the sitemap.xml file of a specified website, converts the XML data to JSON, splits out the individual URL entries, and filters them against a user-defined rule. The default example keeps only links ending with .pdf, so a specific resource type can be identified and retrieved quickly.
Core Problem Addressed
A sitemap.xml file often lists a large number of links covering many page and resource types. Extracting and filtering them by hand is slow and error-prone. This workflow automates the parsing and filtering so that users get the target links directly.
Application Scenarios
- SEO specialists needing to extract specific types of page links from a website for analysis.
- Content operators or resource managers locating PDFs, images, and other resources on a website for download.
- Developers or data analysts automatically crawling and organizing website structure data.
- Automated testing processes requiring verification of link validity and resource distribution on websites.
Main Process Steps
- Manually trigger the workflow to start.
- Set the target website's sitemap.xml URL.
- Retrieve the sitemap.xml content via an HTTP request.
- Convert the sitemap from XML to JSON so that later nodes can work with it.
- Split the URL entries in the JSON so that each one is processed as a separate item.
- Filter the links by a user-defined rule (the default keeps links ending with .pdf).
- Output the filtered results for subsequent use or further processing.
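For readers who want to see the same steps outside the workflow editor, the following is a minimal Python sketch, not the workflow's own implementation. The function names (`fetch_sitemap`, `extract_urls`, `filter_urls`) and the sample sitemap are hypothetical, and the sketch assumes a plain `<urlset>` sitemap in the sitemaps.org namespace rather than a sitemap index.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by sitemaps that follow the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def fetch_sitemap(sitemap_url: str) -> str:
    """Download the sitemap and return its XML text (HTTP request step)."""
    with urllib.request.urlopen(sitemap_url, timeout=30) as response:
        return response.read().decode("utf-8")


def extract_urls(xml_text: str) -> list[str]:
    """Parse the XML and return the text of every <url>/<loc> entry."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def filter_urls(urls: list[str], suffix: str = ".pdf") -> list[str]:
    """Keep only the URLs that end with the given suffix (filter step)."""
    return [url for url in urls if url.endswith(suffix)]


# Hypothetical sample so the sketch runs without network access.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/files/report.pdf</loc></url>
  <url><loc>https://example.com/about.html</loc></url>
</urlset>"""

pdf_links = filter_urls(extract_urls(SAMPLE))
print(pdf_links)
```

Against a live site, `extract_urls(fetch_sitemap(url))` would replace the inline sample; the URL itself is whatever the user sets in the second step.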
Involved Systems or Services
- HTTP Request Node: Used to fetch the sitemap.xml file.
- XML Conversion Node: Converts XML data to JSON format.
- Filter Node: Filters URLs based on specified rules.
- Manual Trigger Node: Lets users start the workflow on demand.
- Sticky Note Node: Provides process explanations and configuration tips.
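To illustrate what the XML-to-JSON conversion and the split step produce, here is a rough Python sketch. The nested shape shown (`urlset` containing a `url` list) is an assumption about a typical converter's output and may differ from what the XML Conversion Node actually emits; `sitemap_to_dict` and `strip_namespace` are hypothetical helpers.

```python
import json
import xml.etree.ElementTree as ET


def strip_namespace(tag: str) -> str:
    """Turn '{namespace}loc' into 'loc'."""
    return tag.rsplit("}", 1)[-1]


def sitemap_to_dict(xml_text: str) -> dict:
    """Convert a flat <urlset> sitemap into a nested, JSON-ready dict."""
    root = ET.fromstring(xml_text)
    entries = [
        # Each <url> element becomes one dict keyed by its child tag names.
        {strip_namespace(child.tag): (child.text or "").strip() for child in url}
        for url in root
    ]
    return {strip_namespace(root.tag): {"url": entries}}


# Hypothetical sample sitemap.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/guide.pdf</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/index.html</loc>
  </url>
</urlset>"""

converted = sitemap_to_dict(SAMPLE)
print(json.dumps(converted, indent=2))

# "Splitting" then means handling each list entry as its own item.
items = converted["urlset"]["url"]
for item in items:
    print(item["loc"])
```

Once the entries are separate items, a filter only has to inspect one `loc` value at a time.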
Target Users and Usage Value
- Website administrators and SEO experts, facilitating quick access to website structure and resource links.
- Content managers, simplifying the organization and download of resources in specified formats.
- Automation and data analysis engineers, enhancing data extraction and preprocessing efficiency.
- Any users who need to automate sitemap data processing to save time and reduce manual errors.
The workflow has a clear structure and is easy to configure. To adapt it to another website or requirement, users only need to change the sitemap URL and the filter rule.
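As an example of adapting the filter rule, the sketch below matches on the URL path instead of the whole string, so links with query strings or upper-case extensions are still caught. This is a possible variant, not the workflow's default rule, and `matches_extension` is a hypothetical name.

```python
from urllib.parse import urlsplit


def matches_extension(url: str, extensions: tuple[str, ...] = (".pdf",)) -> bool:
    """Return True if the URL path ends with one of the given extensions.

    Extensions are expected in lower case; the query string and fragment
    are ignored because only the path component is checked.
    """
    path = urlsplit(url).path.lower()
    return path.endswith(extensions)


# Hypothetical URLs for illustration.
urls = [
    "https://example.com/docs/Report.PDF?download=1",
    "https://example.com/pdf-guide.html",
    "https://example.com/img/logo.png#top",
]

pdf_only = [u for u in urls if matches_extension(u)]
images = [u for u in urls if matches_extension(u, (".png", ".jpg"))]
print(pdf_only)
print(images)
```

Passing a different tuple of extensions is enough to retarget the filter at images or other file types.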