ai-driven-web-data-acquisition-service
A backend service for acquiring web content. It answers natural-language questions about retrieved pages, extracts structured data against a user-defined schema, and captures raw HTML, all with optional client-side script execution. It handles parallel requests, can mimic various user agents, and routes traffic through rotating proxies so that highly interactive pages can be scraped reliably.
Author

webscraping-ai
AI-Powered Web Data Retrieval Engine (MCP Endpoint)
This implementation serves as a Model Context Protocol (MCP) intermediary, leveraging the capabilities of the WebScraping.AI platform to facilitate sophisticated web content harvesting.
Core Capabilities
- Interpretive analysis of retrieved web page material based on user queries.
- Automated discovery and extraction of predefined data structures.
- Fetching of raw HyperText Markup Language, including execution of embedded JavaScript.
- Plain-text document content extraction.
- Targeted data retrieval utilizing Cascading Style Sheets (CSS) selectors.
- Support for multiple proxy types (datacenter, residential) with geographical targeting.
- Dynamic page rendering via headless Chromium/Chrome environments.
- Coordinated handling of numerous simultaneous data requests with throttling controls.
- Execution of custom JavaScript payloads within the target page context.
- Emulation of desktop, tablet, and mobile devices.
- Transparent reporting on API usage metrics.
Deployment Instructions
Quick Start via npx
env WEBSCRAPING_AI_API_KEY=your_secret_key npx -y webscraping-ai-mcp
Local Setup
# Clone source repository
git clone https://github.com/webscraping-ai/webscraping-ai-mcp-server.git
cd webscraping-ai-mcp-server
# Install necessary packages
npm install
# Execute the service
npm start
Configuration within IDE (e.g., Cursor v0.45.6+)
Integration is achieved by defining the server connection in Cursor's configuration files:
- Project Scope (Recommended): Place a .cursor/mcp.json file in your project's root directory:
{
  "servers": {
    "webscraping-ai-remote": {
      "type": "command",
      "command": "npx -y webscraping-ai-mcp",
      "env": {
        "WEBSCRAPING_AI_API_KEY": "your_actual_key",
        "WEBSCRAPING_AI_CONCURRENCY_LIMIT": "10"
      }
    }
  }
}
- Global Scope: Configure via ~/.cursor/mcp.json for system-wide access.
Windows Users Note: If execution fails, substitute the command with shell execution. Leave no space before &&, because cmd would otherwise store that trailing space as part of the key:
cmd /c "set WEBSCRAPING_AI_API_KEY=your_key&& npx -y webscraping-ai-mcp"
This setup enables automatic invocation of the WebScraping.AI toolset whenever the AI agent determines remote data access is required.
Claude Desktop Integration
Modify claude_desktop_config.json as follows:
{
  "mcpServers": {
    "scraper-proxy-service": {
      "command": "npx",
      "args": ["-y", "webscraping-ai-mcp"],
      "env": {
        "WEBSCRAPING_AI_API_KEY": "YOUR_SECURE_KEY",
        "WEBSCRAPING_AI_CONCURRENCY_LIMIT": "8"
      }
    }
  }
}
Parameterization Details
Environment Variables (Configuration)
Mandatory Setting
WEBSCRAPING_AI_API_KEY: The credential required for authenticating requests against the WebScraping.AI service. Obtainable from the service provider's portal.
Optional Overrides
- WEBSCRAPING_AI_CONCURRENCY_LIMIT: Maximum parallel operations allowed (Default: 5).
- WEBSCRAPING_AI_DEFAULT_PROXY_TYPE: Initial proxy category (Options: datacenter, residential; Default: residential).
- WEBSCRAPING_AI_DEFAULT_JS_RENDERING: Boolean toggle for enabling client-side script execution (Default: true).
- WEBSCRAPING_AI_DEFAULT_TIMEOUT: Maximum allowable time (in milliseconds) for page retrieval (Default: 15000; Max: 30000).
- WEBSCRAPING_AI_DEFAULT_JS_TIMEOUT: Maximum time (in milliseconds) permitted for browser script execution/rendering (Default: 2000).
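To make the defaults concrete, here is a minimal sketch of how these variables could be resolved into a settings object. The variable names, defaults, and the 30000 ms ceiling come from the list above. The function name `loadConfig`, the returned field names, and the choice to clamp rather than reject an oversized timeout are assumptions for illustration, not the server's confirmed behavior.

```javascript
// Hypothetical helper: resolves settings from an environment-like object.
// ASSUMPTION: the function name, field names, and clamping are illustrative.
function loadConfig(env) {
  const apiKey = env.WEBSCRAPING_AI_API_KEY;
  if (!apiKey) {
    throw new Error("WEBSCRAPING_AI_API_KEY is required");
  }

  // Parse an integer setting, falling back to the documented default.
  const intOr = (value, fallback) => {
    const parsed = Number.parseInt(value, 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };

  return {
    apiKey,
    concurrencyLimit: intOr(env.WEBSCRAPING_AI_CONCURRENCY_LIMIT, 5),
    proxyType: env.WEBSCRAPING_AI_DEFAULT_PROXY_TYPE || "residential",
    // In this sketch only the literal string "false" disables rendering.
    jsRendering: env.WEBSCRAPING_AI_DEFAULT_JS_RENDERING !== "false",
    // Cap the page timeout at the documented maximum of 30000 ms.
    timeoutMs: Math.min(intOr(env.WEBSCRAPING_AI_DEFAULT_TIMEOUT, 15000), 30000),
    jsTimeoutMs: intOr(env.WEBSCRAPING_AI_DEFAULT_JS_TIMEOUT, 2000),
  };
}
```

A caller would typically pass `process.env`, for example `const config = loadConfig(process.env);`.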
Tool Interface Definitions
1. Content Query Tool (webscraping_ai_question)
Engages the service to answer specific queries about page contents.
{
  "name": "webscraping_ai_question",
  "arguments": {
    "url": "https://some-target.org",
    "question": "Summarize the key findings presented on this page.",
    "timeout": 25000,
    "js": true,
    "wait_for": "#article-body",
    "proxy": "datacenter",
    "country": "gb"
  }
}
Sample Output Structure:
{
  "content": [
    {
      "type": "text",
      "text": "The primary subject matter revolves around regulatory compliance standards for digital finance."
    }
  ],
  "isError": false
}
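A client usually needs the answer text rather than the whole envelope. The following sketch unwraps a result shaped like the sample above: it joins the text parts and raises when `isError` is set. The helper name `readToolText` is hypothetical, and the sketch assumes only the `content`, `type`, `text`, and `isError` fields shown in the sample.

```javascript
// Hypothetical helper: extracts the text from a tool result and throws
// when the result is flagged as an error.
function readToolText(result) {
  const parts = Array.isArray(result.content) ? result.content : [];
  const text = parts
    .filter((part) => part.type === "text")
    .map((part) => part.text)
    .join("\n");
  if (result.isError) {
    throw new Error(text || "Tool call failed without a message");
  }
  return text;
}

// Usage with the sample output from this section.
const sampleResult = {
  content: [
    {
      type: "text",
      text: "The primary subject matter revolves around regulatory compliance standards for digital finance.",
    },
  ],
  isError: false,
};
console.log(readToolText(sampleResult));
```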
2. Data Schema Tool (webscraping_ai_fields)
Automates the extraction of structured key-value pairs defined by the user.
{
  "name": "webscraping_ai_fields",
  "arguments": {
    "url": "https://storefront.net/item-101",
    "fields": {
      "product_identifier": "Locate the SKU number",
      "inventory_count": "Retrieve current stock level",
      "shipping_policy": "Find the return window details"
    },
    "js": true
  }
}
3. Full Markup Tool (webscraping_ai_html)
Retrieves the entire DOM structure after rendering dynamic elements.
{
  "name": "webscraping_ai_html",
  "arguments": {
    "url": "https://interactive-site.io",
    "js": true,
    "wait_for": "body"
  }
}
4. Text Content Tool (webscraping_ai_text)
Extracts only the rendered, user-visible textual information.
5. Single Element Selector Tool (webscraping_ai_selected)
Fetches the HTML content corresponding to a single, specific CSS path.
6. Multiple Element Selector Tool (webscraping_ai_selected_multiple)
Retrieves content arrays corresponding to multiple defined CSS paths.
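Tools 4 to 6 are described without request examples, so the sketch below shows what such requests might look like. The tool names come from the headings above. The argument names `selector` and `selectors`, the builder function names, and the `https://example.com` URL are assumptions that this document does not confirm, so check the server's tool schema before relying on them.

```javascript
// Hypothetical request builders for tools 4 to 6.
// ASSUMPTION: `selector` and `selectors` are illustrative argument names.
function textRequest(url) {
  return { name: "webscraping_ai_text", arguments: { url, js: true } };
}

function selectedRequest(url, selector) {
  return {
    name: "webscraping_ai_selected",
    arguments: { url, selector, js: true },
  };
}

function selectedMultipleRequest(url, selectors) {
  if (!Array.isArray(selectors) || selectors.length === 0) {
    throw new TypeError("selectors must be a non-empty array of CSS selectors");
  }
  return {
    name: "webscraping_ai_selected_multiple",
    arguments: { url, selectors, js: true },
  };
}

// Print one request in the same JSON shape used by the other tool examples.
console.log(
  JSON.stringify(
    selectedMultipleRequest("https://example.com", ["h1", ".price"]),
    null,
    2
  )
);
```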
7. Account Status Tool (webscraping_ai_account)
Queries the provisioning status of the associated API key.
Universal Parameter Set (Applicable to all scraping methods)
- timeout: Operation deadline (ms).
- js: Boolean to activate browser rendering.
- js_timeout: Rendering deadline (ms).
- wait_for: CSS selector that must appear before the page counts as loaded.
- proxy: Proxy pool designation (datacenter or residential).
- country: Desired proxy geographic origin (e.g., us, de, jp).
- custom_proxy: Direct URL for a proprietary proxy endpoint.
- device: Emulated client type (desktop, mobile, tablet).
- error_on_404: Treat HTTP 404 as a fatal error (Boolean).
- error_on_redirect: Treat HTTP redirection as a fatal error (Boolean).
- js_script: Inline JavaScript code to inject and execute.
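Because these parameters are shared by every scraping tool, it can help to check them once before sending a request. The following is a minimal sketch of such a check. The allowed values for `proxy` and `device` come from the list above. The function name `validateCommonOptions` and the rule that timeouts must be positive integers are assumptions.

```javascript
// Allowed values taken from the parameter list in this section.
const PROXY_TYPES = ["datacenter", "residential"];
const DEVICES = ["desktop", "mobile", "tablet"];

// Hypothetical validator: returns a list of problems, which is empty when
// the options are consistent with the documented parameter set.
function validateCommonOptions(options) {
  const problems = [];
  if (options.proxy !== undefined && !PROXY_TYPES.includes(options.proxy)) {
    problems.push(`proxy must be one of: ${PROXY_TYPES.join(", ")}`);
  }
  if (options.device !== undefined && !DEVICES.includes(options.device)) {
    problems.push(`device must be one of: ${DEVICES.join(", ")}`);
  }
  // ASSUMPTION: deadlines are positive integers expressed in milliseconds.
  for (const key of ["timeout", "js_timeout"]) {
    const value = options[key];
    if (value !== undefined && (!Number.isInteger(value) || value <= 0)) {
      problems.push(`${key} must be a positive integer (milliseconds)`);
    }
  }
  for (const key of ["js", "error_on_404", "error_on_redirect"]) {
    if (options[key] !== undefined && typeof options[key] !== "boolean") {
      problems.push(`${key} must be a boolean`);
    }
  }
  return problems;
}

console.log(validateCommonOptions({ proxy: "datacenter", device: "phone" }));
```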
Operational Resilience and Feedback
The system incorporates mechanisms for:
- Automatic retries after transient service failures.
- Increasing delays between attempts when the service throttles requests.
- Detailed diagnostic messages on failure.
- Built-in handling for unstable network conditions.
Illustrative Failure Response:
{
  "content": [
    {
      "type": "text",
      "text": "Acquisition Failed: Service responded with HTTP 429 (Rate Limit Exceeded)."
    }
  ],
  "isError": true
}
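The server's own retry policy is not spelled out here, so the following is only a client-side sketch of a graduated delay strategy. It retries a call whose result looks like the failure response above and doubles the wait each time. The function names, the 500 ms base delay, the 8000 ms cap, the retry count, and the check for "429" in the message text are all assumptions.

```javascript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelay(attempt, baseMs = 500, capMs = 8000) {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Hypothetical wrapper: `callTool` is any async function returning a tool
// result. It is retried while the result is an error mentioning HTTP 429.
async function callWithBackoff(callTool, maxRetries = 3) {
  for (let attempt = 0; ; attempt += 1) {
    const result = await callTool();
    const text = (result.content || [])
      .map((part) => part.text || "")
      .join(" ");
    const rateLimited = result.isError === true && text.includes("429");
    if (!rateLimited || attempt >= maxRetries) {
      return result; // success, a non-retryable error, or retries exhausted
    }
    await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
  }
}
```

With the default values the waits are 500 ms, 1000 ms, and 2000 ms before the final result is returned as is.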
Integration Framework (MCP)
This server adheres to the Model Context Protocol specification, so it works with any LLM client that supports MCP.
Development Cycle
# Repository access
git clone https://github.com/webscraping-ai/webscraping-ai-mcp-server.git
cd webscraping-ai-mcp-server
npm install
npm test
# Environment setup template
cp .env.example .env
# Launch with the MCP Inspector
npx @modelcontextprotocol/inspector node src/index.js
Contribution Guidelines
- Fork the primary repository.
- Establish a dedicated feature branch.
- Validate changes with npm test.
- Submit a Pull Request.
Licensing
Distributed under the terms of the MIT License (Refer to LICENSE file).
