ai-driven-web-data-acquisition-service

A backend service for acquiring web content. It answers natural-language questions about retrieved pages, extracts structured fields, and captures raw HTML, all with optional client-side JavaScript execution. It handles parallel requests, can mimic different user agents, and uses rotating proxies to scrape highly interactive pages reliably.

Author

webscraping-ai

No License

Quick Info

GitHub GitHub Stars 31
NPM Weekly Downloads 0
Tools 1
Last Updated 2026-02-19

Tags

webscraping, apis, scraping, ai webscraping, webscraping ai, web data

AI-Powered Web Data Retrieval Engine (MCP Endpoint)

This project is a Model Context Protocol (MCP) server that uses the WebScraping.AI platform to extract content from web pages.

Core Capabilities

  • Question answering about the content of a web page.
  • Extraction of structured data fields from a page.
  • Retrieval of raw HTML, with optional JavaScript rendering.
  • Plain-text extraction of page content.
  • Targeted extraction using CSS selectors.
  • Support for multiple proxy types (datacenter, residential) with country selection.
  • Dynamic page rendering via headless Chromium/Chrome.
  • Concurrent request handling with rate limiting.
  • Execution of custom JavaScript within the target page.
  • Emulation of desktop, tablet, and mobile devices.
  • Transparent reporting on API usage metrics.
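
To illustrate how a cap on simultaneous requests can work, here is a minimal sketch of a concurrency limiter. `createLimiter` is a hypothetical helper written for this example; it is not taken from the server's source, which may throttle requests differently.

```javascript
// Minimal sketch of a concurrency limiter (hypothetical helper, not the server's code).
// At most `limit` tasks run at once; the rest wait in a FIFO queue.
function createLimiter(limit) {
  let active = 0;
  const queue = [];

  // Start the next queued task if a slot is free.
  const next = () => {
    if (active >= limit || queue.length === 0) return;
    active += 1;
    const { task, resolve, reject } = queue.shift();
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        active -= 1;
        next();
      });
  };

  // `run` accepts a function returning a promise and resolves with its result.
  return function run(task) {
    return new Promise((resolve, reject) => {
      queue.push({ task, resolve, reject });
      next();
    });
  };
}
```

A caller would wrap each scraping request, for example `run(() => fetchPage(url))` where `fetchPage` stands for any promise-returning function, so that a burst of requests never exceeds the configured limit.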

Deployment Instructions

Quick Start via npx

env WEBSCRAPING_AI_API_KEY=your_secret_key npx -y webscraping-ai-mcp

Local Setup

# Clone the source repository
git clone https://github.com/webscraping-ai/webscraping-ai-mcp-server.git
cd webscraping-ai-mcp-server

# Install the required packages
npm install

# Run the service
npm start

Configuration within IDE (e.g., Cursor v0.45.6+)

Integration is achieved by defining the server connection in Cursor's configuration files:

  1. Project Scope (Recommended): Place a .cursor/mcp.json file in your project's root directory:

{
  "servers": {
    "webscraping-ai-remote": {
      "type": "command",
      "command": "npx -y webscraping-ai-mcp",
      "env": {
        "WEBSCRAPING_AI_API_KEY": "your_actual_key",
        "WEBSCRAPING_AI_CONCURRENCY_LIMIT": "10"
      }
    }
  }
}

  2. Global Scope: Configure via ~/.cursor/mcp.json for system-wide access.

Windows Users Note: If execution fails, substitute the command with shell execution: cmd /c "set WEBSCRAPING_AI_API_KEY=your_key && npx -y webscraping-ai-mcp".

This setup enables automatic invocation of the WebScraping.AI toolset whenever the AI agent determines remote data access is required.

Claude Desktop Integration

Modify claude_desktop_config.json as follows:

{
  "mcpServers": {
    "scraper-proxy-service": {
      "command": "npx",
      "args": ["-y", "webscraping-ai-mcp"],
      "env": {
        "WEBSCRAPING_AI_API_KEY": "YOUR_SECURE_KEY",
        "WEBSCRAPING_AI_CONCURRENCY_LIMIT": "8"
      }
    }
  }
}

Parameterization Details

Environment Variables (Configuration)

Mandatory Setting

  • WEBSCRAPING_AI_API_KEY: The credential required for authenticating requests against the WebScraping.AI service. Obtainable from the service provider's portal.

Optional Overrides

  • WEBSCRAPING_AI_CONCURRENCY_LIMIT: Maximum parallel operations allowed (Default: 5).
  • WEBSCRAPING_AI_DEFAULT_PROXY_TYPE: Initial proxy category (Options: datacenter, residential; Default: residential).
  • WEBSCRAPING_AI_DEFAULT_JS_RENDERING: Boolean toggle for enabling client-side script execution (Default: true).
  • WEBSCRAPING_AI_DEFAULT_TIMEOUT: Maximum allowable time (in milliseconds) for page retrieval (Default: 15000; Max: 30000).
  • WEBSCRAPING_AI_DEFAULT_JS_TIMEOUT: Maximum time (in ms) permitted for browser script execution/rendering (Default: 2000).
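
As an illustration of how these settings and their defaults fit together, the sketch below resolves them from an environment object. `loadConfig` and the returned property names are hypothetical; the variable names, defaults, and the 30000 ms ceiling are the ones listed above. How the real server parses boolean values is not stated here, so the handling of `WEBSCRAPING_AI_DEFAULT_JS_RENDERING` is an assumption.

```javascript
// Sketch: resolve the documented settings from an environment object.
// `loadConfig` is a hypothetical helper, not part of the server's API.
function loadConfig(env) {
  const apiKey = env.WEBSCRAPING_AI_API_KEY;
  if (!apiKey) {
    throw new Error("WEBSCRAPING_AI_API_KEY is required");
  }

  // Parse an integer setting, falling back to the documented default.
  const int = (name, fallback) => {
    const parsed = Number.parseInt(env[name] ?? "", 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };

  return {
    apiKey,
    concurrencyLimit: int("WEBSCRAPING_AI_CONCURRENCY_LIMIT", 5),
    proxyType: env.WEBSCRAPING_AI_DEFAULT_PROXY_TYPE ?? "residential",
    // Assumption: anything other than the literal string "false" enables rendering.
    jsRendering: (env.WEBSCRAPING_AI_DEFAULT_JS_RENDERING ?? "true") !== "false",
    // Clamp to the documented 30000 ms maximum.
    timeoutMs: Math.min(int("WEBSCRAPING_AI_DEFAULT_TIMEOUT", 15000), 30000),
    jsTimeoutMs: int("WEBSCRAPING_AI_DEFAULT_JS_TIMEOUT", 2000),
  };
}
```

In a Node.js process this would typically be called as `loadConfig(process.env)`.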

Tool Interface Definitions

1. Content Query Tool (webscraping_ai_question)

Engages the service to answer specific queries about page contents.

{
  "name": "webscraping_ai_question",
  "arguments": {
    "url": "https://some-target.org",
    "question": "Summarize the key findings presented on this page.",
    "timeout": 25000,
    "js": true,
    "wait_for": "#article-body",
    "proxy": "datacenter",
    "country": "gb"
  }
}

Sample Output Structure:

{
  "content": [
    {
      "type": "text",
      "text": "The primary subject matter revolves around regulatory compliance standards for digital finance."
    }
  ],
  "isError": false
}

2. Data Schema Tool (webscraping_ai_fields)

Automates the extraction of structured key-value pairs defined by the user.

{
  "name": "webscraping_ai_fields",
  "arguments": {
    "url": "https://storefront.net/item-101",
    "fields": {
      "product_identifier": "Locate the SKU number",
      "inventory_count": "Retrieve current stock level",
      "shipping_policy": "Find the return window details"
    },
    "js": true
  }
}

3. Full Markup Tool (webscraping_ai_html)

Retrieves the entire DOM structure after rendering dynamic elements.

{
  "name": "webscraping_ai_html",
  "arguments": {
    "url": "https://interactive-site.io",
    "js": true,
    "wait_for": "body"
  }
}

4. Text Content Tool (webscraping_ai_text)

Extracts only the rendered, user-visible textual information.

5. Single Element Selector Tool (webscraping_ai_selected)

Fetches the HTML content corresponding to a single, specific CSS path.

6. Multiple Element Selector Tool (webscraping_ai_selected_multiple)

Retrieves content arrays corresponding to multiple defined CSS paths.
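
Tools 4 to 6 are listed without sample payloads. The sketch below shows what such payloads might look like, following the shape of the earlier examples. The tool names come from this document, but the argument names `selector` and `selectors` are assumptions, and `buildTextCall` and `buildSelectorCall` are hypothetical helpers.

```javascript
// Sketch of request payloads for the text and selector tools.
// Assumption: the selector tools take `selector` (string) and `selectors` (array).
function buildTextCall(url, options = {}) {
  return { name: "webscraping_ai_text", arguments: { url, ...options } };
}

// Pick the single or multiple selector tool based on how many CSS selectors are given.
function buildSelectorCall(url, selectors, options = {}) {
  const list = Array.isArray(selectors) ? selectors : [selectors];
  if (list.length === 0) {
    throw new Error("at least one CSS selector is required");
  }
  if (list.length === 1) {
    return {
      name: "webscraping_ai_selected",
      arguments: { url, selector: list[0], ...options },
    };
  }
  return {
    name: "webscraping_ai_selected_multiple",
    arguments: { url, selectors: list, ...options },
  };
}

// Print one example payload for inspection.
console.log(
  JSON.stringify(buildSelectorCall("https://example.com", ["h1", "nav a"], { js: true }), null, 2)
);
```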

7. Account Status Tool (webscraping_ai_account)

Queries the provisioning status of the associated API key.

Universal Parameter Set (Applicable to all scraping methods)

  • timeout: Operation deadline (ms).
  • js: Boolean to activate browser rendering.
  • js_timeout: Rendering deadline (ms).
  • wait_for: CSS selector to wait for before the page content is returned.
  • proxy: Proxy pool designation (datacenter or residential).
  • country: Desired proxy geographic origin (e.g., us, de, jp).
  • custom_proxy: Direct URL for a proprietary proxy endpoint.
  • device: Emulated client type (desktop, mobile, tablet).
  • error_on_404: Treat HTTP 404 as a fatal error (Boolean).
  • error_on_redirect: Treat HTTP redirection as a fatal error (Boolean).
  • js_script: Inline JavaScript code to inject and execute.
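
A client may want to check these parameters before sending a request. The following sketch validates only the constraints stated in the list above (allowed values for `proxy` and `device`, boolean flags, millisecond deadlines). `validateCommonParams` is a hypothetical helper, and the server may apply additional or different checks.

```javascript
// Allowed values taken from the parameter list above.
const ALLOWED_VALUES = {
  proxy: ["datacenter", "residential"],
  device: ["desktop", "mobile", "tablet"],
};
const BOOLEAN_KEYS = ["js", "error_on_404", "error_on_redirect"];
const MILLISECOND_KEYS = ["timeout", "js_timeout"];

// Return a list of human-readable problems; an empty list means the params look valid.
function validateCommonParams(params) {
  const problems = [];
  for (const [key, allowed] of Object.entries(ALLOWED_VALUES)) {
    if (key in params && !allowed.includes(params[key])) {
      problems.push(`${key} must be one of: ${allowed.join(", ")}`);
    }
  }
  for (const key of BOOLEAN_KEYS) {
    if (key in params && typeof params[key] !== "boolean") {
      problems.push(`${key} must be a boolean`);
    }
  }
  for (const key of MILLISECOND_KEYS) {
    if (key in params && !(Number.isFinite(params[key]) && params[key] > 0)) {
      problems.push(`${key} must be a positive number of milliseconds`);
    }
  }
  return problems;
}
```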

Operational Resilience and Feedback

The system incorporates mechanisms for:

  • Automatic retries for transient service errors.
  • Graduated delays (backoff) when the service rate-limits requests.
  • Detailed diagnostic messages on failure.
  • Built-in resilience to network errors.
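
Exponential backoff is one common way to implement graduated delays. The sketch below is an illustration under that assumption; the server's actual retry policy, delay values, and error shape are not stated in this document, so `backoffDelay`, `withRetry`, and the `err.status === 429` check are all hypothetical.

```javascript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelay(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Default sleep based on setTimeout; injectable so callers can avoid real waits in tests.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call `fn`, retrying up to `retries` times while `isRetryable(err)` is true.
async function withRetry(fn, options = {}) {
  const {
    retries = 3,
    isRetryable = (err) => err.status === 429, // assumption: errors carry an HTTP status
    sleep = defaultSleep,
  } = options;

  for (let attempt = 0; ; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries || !isRetryable(err)) throw err;
      await sleep(backoffDelay(attempt));
    }
  }
}
```

Capping the delay keeps a long run of failures from producing unbounded waits, and making `isRetryable` a parameter lets callers decide which failures are transient.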

Illustrative Failure Response:

{
  "content": [
    {
      "type": "text",
      "text": "Acquisition Failed: Service responded with HTTP 429 (Rate Limit Exceeded)."
    }
  ],
  "isError": true
}
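
Both the success and failure responses shown in this document share the same shape: a `content` array of text parts plus an `isError` flag. A client could handle them with a small helper like the one sketched below; `unwrapToolResult` is a hypothetical name, and the sketch assumes only the fields that appear in the samples.

```javascript
// Join the text parts of a tool result; throw if the result is flagged as an error.
function unwrapToolResult(result) {
  const text = (result.content ?? [])
    .filter((part) => part.type === "text")
    .map((part) => part.text)
    .join("\n");

  if (result.isError) {
    throw new Error(text || "Tool call failed without a message");
  }
  return text;
}
```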

Integration Framework (MCP)

This server adheres to the Model Context Protocol specification, ensuring interoperability with diverse LLM orchestrators capable of protocol negotiation.
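
At the wire level, MCP clients invoke tools through JSON-RPC 2.0 requests using the `tools/call` method. The sketch below builds such a request for one of the tools described earlier. `buildToolCallRequest` is a hypothetical helper; in practice an MCP client library would normally construct and send this message.

```javascript
// Build a JSON-RPC 2.0 request that asks an MCP server to run a tool.
function buildToolCallRequest(id, name, args) {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// Example: ask a question about a page, reusing the parameters shown earlier.
const request = buildToolCallRequest(1, "webscraping_ai_question", {
  url: "https://some-target.org",
  question: "Summarize the key findings presented on this page.",
  js: true,
});
console.log(JSON.stringify(request, null, 2));
```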

Development Cycle

# Clone the repository
git clone https://github.com/webscraping-ai/webscraping-ai-mcp-server.git
cd webscraping-ai-mcp-server

# Install dependencies
npm install

# Run the tests
npm test

# Create a local environment file from the template
cp .env.example .env

# Launch with the MCP Inspector
npx @modelcontextprotocol/inspector node src/index.js

Contribution Guidelines

  1. Fork the primary repository.
  2. Establish a dedicated feature branch.
  3. Validate changes: npm test.
  4. Submit a pull request.

Licensing

Distributed under the terms of the MIT License (Refer to LICENSE file).
