News Extraction

This workflow scrapes the latest articles from a specified news website and extracts each article's publication date, title, and body. It then uses an AI model to generate a summary and a set of key technical keywords for each article, and stores the structured results in a database. This makes it possible to monitor and analyze news sources that offer no RSS feed, which suits scenarios such as media monitoring, market research, and content management.

Workflow Diagram
[Figure: News Extraction workflow diagram]

Workflow Name

News Extraction

Key Features and Highlights

This workflow automatically scrapes the latest news content from the specified news website (https://www.colt.net/resources/type/news/), extracting the publication date, title, and full text of each news article. It then leverages ChatGPT to generate a concise summary and three key technical keywords for each news item. The processed data is ultimately stored in a NocoDB database, enabling end-to-end automated news collection and intelligent analysis.
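
The extraction half of this pipeline pulls fields out of fetched HTML. As a rough illustration only, the sketch below collects article links and dates from a listing page using Python's standard-library `HTMLParser` in place of real CSS selectors. The class names `news-link` and `news-date`, the sample markup, and the `example.com` URLs are all hypothetical; they are not the markup of the target site or the selectors used by the workflow.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect links and dates from a news listing page.

    Approximates the CSS selectors `a.news-link` (href attribute) and
    `span.news-date` (text content). Both class names are hypothetical.
    """

    def __init__(self):
        super().__init__()
        self.links = []
        self.dates = []
        self._in_date = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "a" and "news-link" in classes and attr.get("href"):
            self.links.append(attr["href"])
        elif tag == "span" and "news-date" in classes:
            # Start a new date entry; text arrives via handle_data.
            self._in_date = True
            self.dates.append("")

    def handle_endtag(self, tag):
        # Simplification: assumes the date span has no nested spans.
        if tag == "span":
            self._in_date = False

    def handle_data(self, data):
        if self._in_date:
            self.dates[-1] += data


def extract_listing(html):
    """Return (links, dates) as two parallel lists."""
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return parser.links, [d.strip() for d in parser.dates]


# Hypothetical listing markup, for illustration only.
SAMPLE = """
<div><a class="news-link" href="https://example.com/news/a">A</a>
<span class="news-date"> 2024-03-10 </span></div>
<div><a class="news-link" href="https://example.com/news/b">B</a>
<span class="news-date">2024-02-01</span></div>
"""

links, dates = extract_listing(SAMPLE)
```

In the workflow itself this job is done by n8n's HTML extraction node with CSS selectors; the sketch only shows the shape of the result, two parallel lists that a later step pairs up.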

Core Problems Addressed

  • The target news website does not provide RSS feeds, making traditional subscription methods ineffective for obtaining the latest updates.
  • The news listing page offers only links and publication dates, with no article summaries or keywords.
  • Manual filtering and organizing of news content is time-consuming and prone to omissions.
  • News data goes stale unless it is refreshed automatically on a regular schedule.

Application Scenarios

  • Monitoring technology media and aggregating news information.
  • Tracking industry trends for enterprises or R&D teams.
  • Enabling market researchers to quickly capture key points from competitor news.
  • Feeding data into automated content management systems.

Main Workflow Steps

  1. Scheduled Trigger: Automatically initiate the workflow via a weekly scheduled task.
  2. Web Scraping: Retrieve the HTML content of the news listing page.
  3. Data Extraction: Use CSS selectors to extract news links and publication dates separately.
  4. Data Splitting: Split the extracted links and dates into individual entries for iterative processing.
  5. Filter Recent News: Select news articles published within the last 7 days.
  6. Single News Scraping: Visit each news link sequentially to extract the title and full article content.
  7. Intelligent Analysis: Call the ChatGPT API to generate a news summary and extract three key technical keywords.
  8. Data Integration: Combine the title, date, link, summary, and keywords into a complete record.
  9. Storage and Output: Write the final structured data into the NocoDB database for easy querying and analysis.
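
Steps 4, 5, and 8 above are plain data handling and can be sketched outside n8n. The following is a minimal illustration, assuming the dates have already been normalized to ISO format (`YYYY-MM-DD`); the function names, field names, and sample values are hypothetical and are not taken from the workflow.

```python
from datetime import date, timedelta


def split_items(links, dates):
    """Step 4: pair the parallel link and date lists, one entry per article."""
    if len(links) != len(dates):
        raise ValueError("links and dates must have the same length")
    return [{"link": link, "date": day} for link, day in zip(links, dates)]


def filter_recent(items, today, days=7):
    """Step 5: keep entries published within the last `days` days."""
    cutoff = today - timedelta(days=days)
    return [
        item for item in items
        if cutoff <= date.fromisoformat(item["date"]) <= today
    ]


def build_record(item, title, summary, keywords):
    """Step 8: merge all fields into one flat record ready for storage."""
    return {
        "title": title,
        "date": item["date"],
        "link": item["link"],
        "summary": summary,
        "keywords": ", ".join(keywords),
    }


# Illustrative values only.
items = split_items(
    ["https://example.com/news/a", "https://example.com/news/b"],
    ["2024-03-10", "2024-02-01"],
)
recent = filter_recent(items, today=date(2024, 3, 14))
record = build_record(
    recent[0],
    title="Example title",
    summary="Example summary.",
    keywords=["network", "cloud", "fiber"],
)
```

Passing `today` explicitly rather than reading the clock inside `filter_recent` keeps the filter deterministic. The 7-day window matches the weekly trigger, so each article should fall into one run, although an article dated exactly on the cutoff could appear in two consecutive runs.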

Involved Systems and Services

  • n8n Automation Platform: Workflow design and scheduling.
  • HTTP Request Node: Web page content retrieval.
  • HTML Extract Node: Data extraction from pages using CSS selectors.
  • OpenAI ChatGPT API: Summary generation and keyword extraction.
  • NocoDB Database: Storage of news data with SQL query support.
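
For the summary and keyword step, one possible approach is to ask the model for both outputs in a single structured reply and validate it before storage. The sketch below makes no API call; it only builds the prompt text and checks a reply. The JSON reply format and the names `build_prompt` and `parse_reply` are assumptions for illustration, not the prompts or nodes used by the workflow.

```python
import json


def build_prompt(title, body):
    """Build one prompt asking for a summary and exactly three keywords."""
    return (
        "Summarize the news article below in a few sentences, then list "
        "the three most important technical keywords.\n"
        'Reply as JSON: {"summary": "...", "keywords": ["...", "...", "..."]}\n\n'
        f"Title: {title}\n\n{body}"
    )


def parse_reply(reply):
    """Validate the model's reply; raise ValueError if it is malformed."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError("reply is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    summary = data.get("summary")
    keywords = data.get("keywords")
    if not isinstance(summary, str) or not summary.strip():
        raise ValueError("missing summary")
    if (
        not isinstance(keywords, list)
        or len(keywords) != 3
        or not all(isinstance(k, str) and k.strip() for k in keywords)
    ):
        raise ValueError("expected exactly three keywords")

    return {
        "summary": summary.strip(),
        "keywords": [k.strip() for k in keywords],
    }
```

Validating the reply before writing to the database means a malformed model response fails loudly instead of storing an incomplete record.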

Target Users and Value

  • Media monitoring personnel and content editors seeking automated acquisition and organization of news from sites without RSS feeds.
  • Corporate market analysts and technical R&D teams aiming to quickly grasp the latest industry developments and technical keywords.
  • Automation workflow developers interested in integrating web scraping with AI-based text processing.
  • Any users requiring regular batch collection, analysis, and structured storage of news content.

By combining web scraping with AI-based text processing, this workflow extracts and summarizes news from sites without RSS feeds, replacing manual collection with a scheduled pipeline that produces structured records. It can be adapted to news automation needs across a range of industries.
