API Schema Crawler & Extractor

The API Schema Crawler & Extractor workflow is an automation tool that searches for, crawls, and extracts API documentation for specified services. By combining a search engine, a web crawler, and a large language model, it identifies API operations and structures them for storage in Google Sheets. It also generates custom API schema JSON files for centralized management and sharing, helping users obtain and organize API information quickly.

Workflow Diagram
[Figure: API Schema Crawler & Extractor workflow diagram]

Workflow Name

API Schema Crawler & Extractor

Key Features and Highlights

This workflow automates search, content crawling, information extraction, and custom API schema generation for the APIs of specified services. Core highlights include:

  • Automatically retrieving web links related to target service APIs via Google Search
  • Using the Apify platform for web content crawling, filtering out irrelevant resources so that only documentation content is kept
  • Using the Google Gemini large language model (LLM) for content classification, API operation extraction, and product identification
  • Structuring extracted API operations and storing them in Google Sheets for easy management and review
  • Generating customized API schema JSON files and uploading them to Google Drive for centralized document management
  • Multi-stage workflow design (Research, Extraction, Generation) supporting asynchronous batch processing and status tracking
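The status tracking mentioned in the last point can be pictured with a small sketch. It assumes each service row carries one status column per stage; the column names (`research_status` and so on) and the `"done"` marker are hypothetical, since the sheet layout is not described here.

```python
# Hypothetical stage names and status columns; the real sheet layout is not given.
STAGES = ("research", "extraction", "generation")


def next_stage(row):
    """Return the first stage not yet marked "done", or None when all are done."""
    for stage in STAGES:
        if row.get(f"{stage}_status") != "done":
            return stage
    return None


def pending_for(rows, stage):
    """Select the rows whose next unfinished stage is `stage`."""
    return [row for row in rows if next_stage(row) == stage]


# Illustrative rows, as they might be read from the tracking sheet.
rows = [
    {"service": "alpha", "research_status": "done"},
    {"service": "beta"},
    {"service": "gamma", "research_status": "done",
     "extraction_status": "done", "generation_status": "done"},
]
extraction_batch = pending_for(rows, "extraction")
```

Deriving the batch from stored status rather than from in-memory state is what would let each stage run asynchronously and resume after an interruption.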

Core Problems Addressed

  • Manual API documentation search is cumbersome and prone to missing critical information
  • API documentation formats vary widely and lack uniform structure, making it difficult to quickly extract usable API operation data
  • Teams lack a unified, standardized way to manage and generate API schema documents, which slows development and integration

Application Scenarios

  • Software development teams requiring rapid understanding of third-party service APIs
  • Automated API documentation collection and maintenance systems
  • Product managers or technical analysts conducting API service research and comparative analysis
  • Automated testing or integration platforms that need to retrieve API information dynamically
  • Data-driven API catalogs or knowledge base construction

Main Workflow Steps

  1. Research Phase:
    • Retrieve the list of services to research from Google Sheets
    • Use Google Search to find API-related documentation links
    • Crawl web content via Apify, filtering out irrelevant files
    • Store crawled content as vector embeddings in Qdrant for subsequent retrieval
  2. Extraction Phase:
    • Extract pending items from Google Sheets based on research results
    • Query the vector database to locate relevant products and documentation content
    • Use the Google Gemini model to extract REST API operations (GET, POST, PATCH, DELETE, etc.)
    • Write the extracted API operation information back into Google Sheets
  3. Generation Phase:
    • Aggregate all extracted API operation data
    • Use code nodes to consolidate and generate customized JSON-format API schema documents
    • Upload the generated documents to Google Drive for sharing and archiving
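The consolidation step of the Generation Phase can be sketched as follows. This is a minimal illustration, not the workflow's actual code node: the row fields (`service`, `resource`, `operation`, `method`, `url`, `description`) and the output layout are assumptions, and Python is used here only for readability.

```python
import json
from collections import defaultdict
from typing import Dict, List


def build_api_schema(service: str, rows: List[Dict[str, str]]) -> dict:
    """Group extracted operations by resource into one schema document."""
    grouped = defaultdict(list)
    for row in rows:
        if row["service"] != service:
            continue  # skip operations that belong to other services
        grouped[row["resource"]].append({
            "operation": row["operation"],
            "method": row["method"].upper(),  # normalize e.g. "get" to "GET"
            "url": row["url"],
            "description": row["description"],
        })
    return {
        "service": service,
        "resources": [
            {"resource": name, "operations": operations}
            for name, operations in sorted(grouped.items())
        ],
    }


# Hypothetical rows, as they might be read back from the extraction sheet.
rows = [
    {"service": "alpha", "resource": "users", "operation": "List users",
     "method": "get", "url": "/v1/users", "description": "Returns all users."},
    {"service": "alpha", "resource": "users", "operation": "Create user",
     "method": "POST", "url": "/v1/users", "description": "Creates a user."},
    {"service": "beta", "resource": "orders", "operation": "List orders",
     "method": "GET", "url": "/orders", "description": "Returns all orders."},
]
schema = build_api_schema("alpha", rows)
schema_json = json.dumps(schema, indent=2)  # file content to upload
```

Grouping by resource keeps related operations together in the resulting file; a flat list of operations would be an equally valid layout.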

Involved Systems and Services

  • Google Sheets: Serves as the database storing service lists, intermediate crawl and extraction data, and final results
  • Apify: Used for web content crawling and batch crawl management
  • Google Gemini Model (LLM): Performs text classification, information extraction, and semantic search
  • Qdrant Vector Database: Stores vector representations of web content for efficient semantic retrieval
  • Google Drive: Stores the generated API schema document files
  • n8n Automation Platform: Integrates the above services to realize workflow automation
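As an illustration of how irrelevant files might be filtered out of crawl results before they are embedded, the sketch below drops static assets and duplicate URLs. The extension list and function names are hypothetical; the workflow's actual filter rules are not specified here.

```python
from urllib.parse import urlparse

# Hypothetical list of file types treated as irrelevant to API documentation.
SKIPPED_EXTENSIONS = (
    ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico",
    ".css", ".js", ".zip", ".woff", ".woff2",
)


def is_relevant_page(url: str) -> bool:
    """Return True unless the URL path ends in a skipped file extension."""
    path = urlparse(url).path.lower()  # query strings are ignored
    return not path.endswith(SKIPPED_EXTENSIONS)


def filter_crawl_results(urls):
    """Drop static assets and duplicates while keeping the original order."""
    seen = set()
    kept = []
    for url in urls:
        if url in seen or not is_relevant_page(url):
            continue
        seen.add(url)
        kept.append(url)
    return kept


crawled = [
    "https://example.com/docs/api",
    "https://example.com/logo.png",
    "https://example.com/docs/api",
    "https://example.com/static/app.js?v=2",
    "https://example.com/docs/api/users",
]
pages = filter_crawl_results(crawled)
```

Filtering on the URL path alone is cheap but coarse; checking the response content type would catch assets served without a file extension.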

Target Users and Value

  • API developers, architects, and technical analysts can quickly and automatically obtain and organize API information, boosting work efficiency
  • Product managers and business analysts gain better understanding of service functionalities and API capabilities to support decision-making and planning
  • Test automation and integration teams can keep API documentation updated and managed dynamically
  • Enterprises or teams that need to research and maintain API documentation for many services in bulk

In summary, the API Schema Crawler & Extractor workflow automates API documentation collection and processing. By combining search engines, web crawlers, large language models, and vector databases, it identifies API operations and manages them in structured form, simplifying both API research and schema generation.