mcp-internet-scraper-toolkit
A Model Context Protocol (MCP) service enabling sophisticated web data retrieval via Google Search integration, capable of rendering and processing retrieved HTML content, featuring robust anti-bot mechanisms, resource pooling, and content caching.
Author: Claw256
Web Content Acquisition Utility (MCP Server)
This specialized MCP server furnishes advanced capabilities for interfacing with Google Search and viewing resultant webpage content, incorporating sophisticated evasion techniques against automated detection systems.
Core Capabilities
- Integration with Google Custom Search, supporting granular query refinement.
- Facility to display retrieved web documents, including dynamic Markdown transformation.
- Built-in request throttling and content result caching layers.
- Efficient management of browser execution environments via instance pooling.
- Advanced countermeasures against bot fingerprinting, leveraging rebrowser-puppeteer.
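The throttling and caching layers listed above can be pictured with a small sketch. This is not the server's actual code: the class names `TtlCache` and `SlidingWindowLimiter`, the injected `now` clock, and the eviction policy (oldest entry first) are all assumptions made for illustration.

```typescript
// Hypothetical sketch; names and policies are illustrative assumptions.

/** Bounded cache whose entries expire ttlMs after they are written. */
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private maxItems: number;
  private now: () => number;

  constructor(ttlMs: number, maxItems: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.maxItems = maxItems;
    this.now = now;
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key); // re-inserting moves the key to the newest position
    if (this.entries.size >= this.maxItems) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

/** Allows at most maxRequests calls within any windowMs interval. */
class SlidingWindowLimiter {
  private timestamps: number[] = [];
  private windowMs: number;
  private maxRequests: number;
  private now: () => number;

  constructor(windowMs: number, maxRequests: number, now: () => number = Date.now) {
    this.windowMs = windowMs;
    this.maxRequests = maxRequests;
    this.now = now;
  }

  tryAcquire(): boolean {
    const cutoff = this.now() - this.windowMs;
    // Forget requests that have fallen out of the window.
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    if (this.timestamps.length >= this.maxRequests) return false;
    this.timestamps.push(this.now());
    return true;
  }
}
```

In a sketch like this, a search handler would call `tryAcquire()` before contacting Google and consult the cache first, so repeated queries do not count against the request budget.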
System Requirements
- Execution environment: Bun runtime, version 1.0 or newer.
- Authentication credentials: Valid Google API Key and a configured Search Engine ID (CX).
Deployment Steps
```bash
# Dependency acquisition
bun install

# Compilation artifacts generation
bun run build
```
Configuration Directives
Authentication Token Management (Cookies)
To facilitate access to sites requiring authenticated sessions, the following procedure must be observed:
- Install the designated Chrome extension: Get cookies.txt LOCALLY
- Navigate to and authenticate successfully with all necessary external domains.
- Export your session cookies using the extension utility, saving them in JSON format.
- Safeguard the exported credentials file.
- Declare the absolute filesystem path to this file via the BROWSER_COOKIES_PATH environment variable.
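As an illustration of the last step, the sketch below validates the contents of an exported cookie file before it is handed to a browser. It assumes the export is a JSON array of objects with `name`, `value`, `domain`, and an optional `path` field; that shape, the `CookieEntry` type, and the `parseCookieExport` helper are assumptions, not something the server documents.

```typescript
// Hypothetical sketch; the field names are assumed, not taken from the server.

interface CookieEntry {
  name: string;
  value: string;
  domain: string;
  path: string;
}

/** Parses exported cookie JSON, skipping entries that lack required fields. */
function parseCookieExport(json: string): CookieEntry[] {
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed)) {
    throw new Error("cookie export must be a JSON array");
  }
  const cookies: CookieEntry[] = [];
  for (const item of parsed) {
    if (typeof item !== "object" || item === null) continue;
    const record = item as Record<string, unknown>;
    const name = record["name"];
    const value = record["value"];
    const domain = record["domain"];
    const path = record["path"];
    if (typeof name !== "string" || typeof value !== "string" || typeof domain !== "string") {
      continue; // incomplete entry: ignore rather than fail the whole file
    }
    cookies.push({ name, value, domain, path: typeof path === "string" ? path : "/" });
  }
  return cookies;
}
```

The text of the file named by BROWSER_COOKIES_PATH would be read first and passed to this function. Because the file contains live session credentials, keep it out of version control.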
MCP Service Initialization Settings
Add the following configuration block to your MCP settings file (cline_mcp_settings.json or claude_desktop_config.json):
- For Cline Users: %APPDATA%\Code\User\globalStorage\rooveterinaryinc.roo-cline\settings\cline_mcp_settings.json
- For Claude Desktop Users:
  - macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  - Windows: %APPDATA%\Claude\claude_desktop_config.json
```json
{
  "mcpServers": {
    "web-search": {
      "command": "bun",
      "args": [
        "run",
        "/ABSOLUTE/PATH/TO/web_search_mcp/dist/index.js"
      ],
      "env": {
        "GOOGLE_API_KEY": "your_api_key",
        "GOOGLE_SEARCH_ENGINE_ID": "your_search_engine_id",
        "MAX_CONCURRENT_BROWSERS": "3",
        "BROWSER_TIMEOUT": "30000",
        "RATE_LIMIT_WINDOW": "60000",
        "RATE_LIMIT_MAX_REQUESTS": "60",
        "SEARCH_CACHE_TTL": "3600",
        "VIEW_URL_CACHE_TTL": "7200",
        "MAX_CACHE_ITEMS": "1000",
        "BROWSER_POOL_MIN": "1",
        "BROWSER_POOL_MAX": "5",
        "BROWSER_POOL_IDLE_TIMEOUT": "30000",
        "REBROWSER_PATCHES_RUNTIME_FIX_MODE": "addBinding",
        "REBROWSER_PATCHES_SOURCE_URL": "jquery.min.js",
        "REBROWSER_PATCHES_UTILITY_WORLD_NAME": "util",
        "REBROWSER_PATCHES_DEBUG": "0",
        "BROWSER_COOKIES_PATH": "C:\\path\\to\\cookies.json",
        "LOG_LEVEL": "info",
        "NO_COLOR": "0",
        "BUN_FORCE_COLOR": "1",
        "FORCE_COLOR": "1"
      }
    }
  }
}
```
Backslashes in Windows paths must be doubled inside JSON strings, as in the BROWSER_COOKIES_PATH value above.
Substitute /ABSOLUTE/PATH/TO/web_search_mcp with the canonical installation directory path for the server.
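Because every value in the `env` block is a string, a server of this kind has to convert and validate them at startup. The sketch below shows one way that might look. The helper names (`intFromEnv`, `requireEnv`, `loadSettings`) are hypothetical, and the fallback numbers are simply copied from the sample configuration; they are not confirmed to be the server's real defaults.

```typescript
// Hypothetical sketch; fallbacks mirror the sample config, not verified defaults.

type Env = Record<string, string | undefined>;

/** Reads a non-negative integer from the environment, or returns the fallback. */
function intFromEnv(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  if (!/^\d+$/.test(raw.trim())) {
    throw new Error(`${key} must be a non-negative integer, got "${raw}"`);
  }
  return Number.parseInt(raw.trim(), 10);
}

/** Returns a mandatory variable or throws a descriptive error. */
function requireEnv(env: Env, key: string): string {
  const value = env[key];
  if (value === undefined || value === "") throw new Error(`${key} is required`);
  return value;
}

function loadSettings(env: Env) {
  const settings = {
    googleApiKey: requireEnv(env, "GOOGLE_API_KEY"),
    searchEngineId: requireEnv(env, "GOOGLE_SEARCH_ENGINE_ID"),
    browserTimeout: intFromEnv(env, "BROWSER_TIMEOUT", 30000),
    rateLimitWindow: intFromEnv(env, "RATE_LIMIT_WINDOW", 60000),
    rateLimitMaxRequests: intFromEnv(env, "RATE_LIMIT_MAX_REQUESTS", 60),
    maxCacheItems: intFromEnv(env, "MAX_CACHE_ITEMS", 1000),
    poolMin: intFromEnv(env, "BROWSER_POOL_MIN", 1),
    poolMax: intFromEnv(env, "BROWSER_POOL_MAX", 5),
  };
  if (settings.poolMin > settings.poolMax) {
    throw new Error("BROWSER_POOL_MIN must not exceed BROWSER_POOL_MAX");
  }
  return settings;
}
```

Under Bun or Node.js, `loadSettings(process.env)` would be the natural call. Failing fast on a malformed value tends to be easier to diagnose than a server that starts with a silently ignored setting.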
Operational Telemetry Control
The verbosity and presentation of operational output are governed by these environment variables:
- LOG_LEVEL: Controls message severity (options: error, warn, info, debug). Default is info.
- NO_COLOR: Suppresses terminal colorization when set to "1".
- BUN_FORCE_COLOR: Manages color output specifically within the Bun runtime (set to "0" to disable).
- FORCE_COLOR: Global override for color rendering (set to "0" to disable).
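The sketch below illustrates how these four variables could be interpreted. It is an assumption-laden illustration rather than the server's logger: the function names are hypothetical, and the rule that any single "disable" setting turns color off is a guess about precedence.

```typescript
// Hypothetical sketch; function names and precedence rules are assumptions.

type Env = Record<string, string | undefined>;

// Ordered from most to least severe.
const LEVELS = ["error", "warn", "info", "debug"] as const;
type LogLevel = (typeof LEVELS)[number];

/** Falls back to "info" when LOG_LEVEL is unset or unrecognized. */
function parseLogLevel(raw: string | undefined): LogLevel {
  const found = LEVELS.find((level) => level === raw);
  return found ?? "info";
}

/** A message is emitted when it is at least as severe as the configured level. */
function shouldLog(message: LogLevel, configured: LogLevel): boolean {
  return LEVELS.indexOf(message) <= LEVELS.indexOf(configured);
}

/** Color stays on unless one of the documented switches disables it. */
function colorEnabled(env: Env): boolean {
  if (env["NO_COLOR"] === "1") return false;
  if (env["BUN_FORCE_COLOR"] === "0") return false;
  if (env["FORCE_COLOR"] === "0") return false;
  return true;
}
```

With LOG_LEVEL set to warn, for example, `shouldLog("info", "warn")` is false, so informational messages are dropped while warnings and errors still appear.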
Bot Evasion Protocol Details
This utility employs the rebrowser-puppeteer library for enhanced stealth:
- Runtime Integrity Patching (Leak Prevention):
  - Utilizes the addBinding mechanism to circumvent detection based on Runtime.Enable calls.
  - Ensures functional integrity across Worker threads and cross-origin iframes.
  - Retains necessary access context within the main JavaScript environment.
- Source Attribution Obfuscation:
  - Modifies Puppeteer’s internal sourceURL metadata to mimic standard, non-automated script loading patterns.
- Execution Environment Normalization:
  - Assigns a generic identifier to the utility execution context.
- Browser Launch Hardening:
  - Deactivates flags commonly associated with automated browsing instances.
  - Applies optimized Chromium launch arguments.
  - Configures consistent viewport dimensions and window properties.
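The launch-hardening items can be illustrated with a sketch of an options object. `headless`, `args`, and `defaultViewport` are standard Puppeteer launch options, and `--disable-blink-features=AutomationControlled` and `--window-size` are existing Chromium switches. Which arguments and dimensions this server actually applies is not stated here, so the values and the `buildLaunchOptions` helper are assumptions.

```typescript
// Hypothetical sketch; the server's real launch arguments are not documented here.

interface LaunchSketch {
  headless: boolean;
  args: string[];
  defaultViewport: { width: number; height: number };
}

/** Builds launch options whose viewport matches the window size. */
function buildLaunchOptions(width = 1920, height = 1080): LaunchSketch {
  return {
    headless: true,
    args: [
      // Turns off the Blink feature that marks the session as automation-controlled.
      "--disable-blink-features=AutomationControlled",
      // Keeps the outer window consistent with the viewport below.
      `--window-size=${width},${height}`,
    ],
    defaultViewport: { width, height },
  };
}
```

An object of this shape would be passed to the library's `launch()` call. Keeping the window size and viewport in agreement avoids one easily checked inconsistency between a page's reported dimensions.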
Integration with Claude Desktop
- Verify that Claude Desktop is installed and running the most recent release.
- Locate and open the configuration file path specified above.
- Embed the server configuration details shown in the Configuration section.
- Initiate a restart of the Claude Desktop application.
- Confirmation of successful tool registration is indicated by the presence of the specialized tool icon (a hammer icon).
Exposed Functionality
1. Web Search Interface
```typescript
{
  name: "search",
  params: {
    query: string;              // The search phrase
    trustedDomains?: string[];  // Optional list of preferred sources
    excludedDomains?: string[]; // Optional list of domains to ignore
    resultCount?: number;       // Maximum number of results to fetch
    safeSearch?: boolean;       // Toggle for restricted content filtering
    dateRestrict?: string;      // Time constraint (e.g., 'd' for day, 'w' for week)
  }
}
```
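To make the parameters concrete, the sketch below maps them onto a request URL for the Google Custom Search JSON API. The endpoint and the `key`, `cx`, `q`, `num`, `safe`, and `dateRestrict` query parameters belong to that API, which caps `num` at 10 and expects `dateRestrict` values such as `d7` or `w1`. Folding the domain lists into the query with `site:` operators is an assumption about how the server might do it, and `buildSearchUrl` is a hypothetical helper.

```typescript
// Hypothetical sketch; the mapping of domain lists onto the query is assumed.

interface SearchParams {
  query: string;
  trustedDomains?: string[];
  excludedDomains?: string[];
  resultCount?: number;
  safeSearch?: boolean;
  dateRestrict?: string;
}

function buildSearchUrl(p: SearchParams, apiKey: string, cx: string): string {
  const terms: string[] = [p.query];
  const trusted = p.trustedDomains ?? [];
  if (trusted.length > 0) {
    // Restrict results to any of the trusted domains.
    terms.push("(" + trusted.map((d) => `site:${d}`).join(" OR ") + ")");
  }
  for (const d of p.excludedDomains ?? []) {
    terms.push(`-site:${d}`); // a leading minus excludes the domain
  }

  // The API accepts between 1 and 10 results per request.
  const num = Math.min(10, Math.max(1, Math.trunc(p.resultCount ?? 10)));

  const pairs: Array<[string, string]> = [
    ["key", apiKey],
    ["cx", cx],
    ["q", terms.join(" ")],
    ["num", String(num)],
  ];
  if (p.safeSearch !== undefined) pairs.push(["safe", p.safeSearch ? "active" : "off"]);
  if (p.dateRestrict) pairs.push(["dateRestrict", p.dateRestrict]);

  const qs = pairs.map(([k, v]) => `${encodeURIComponent(k)}=${encodeURIComponent(v)}`).join("&");
  return `https://www.googleapis.com/customsearch/v1?${qs}`;
}
```

Fetching more than ten results would require paging with the API's `start` parameter, which this sketch leaves out.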
2. Remote Document Retrieval
```typescript
{
  name: "view_url",
  params: {
    url: string;             // The precise Uniform Resource Locator to access
    includeImages?: boolean; // Option to fetch associated imagery
    includeVideos?: boolean; // Option to fetch embedded video assets
    preserveLinks?: boolean; // Maintain hyperlink structure upon rendering
    formatCode?: boolean;    // Special rendering for source code blocks
  }
}
```
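An MCP client invokes either tool by sending a JSON-RPC 2.0 request with the method `tools/call`, carrying the tool name and its arguments. The sketch below builds such a message for `view_url`. The `toolCall` helper and the example URL are illustrative assumptions; in practice the client application constructs and sends these messages for you.

```typescript
// Hypothetical helper; MCP clients normally build these messages internally.

interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

function toolCall(id: number, name: string, args: Record<string, unknown>): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Example: ask the server to render a page with links preserved.
const viewRequest = toolCall(1, "view_url", {
  url: "https://example.com/docs",
  preserveLinks: true,
  formatCode: true,
});

// Serialized form as it would travel over the stdio transport.
const wireMessage = JSON.stringify(viewRequest);
```

Optional parameters that are omitted, such as `includeImages` here, are left for the server to default.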
Diagnostics and Remediation
Issues with Claude Desktop Connectivity
- Review output streams:
```bash
# macOS
tail -n 20 -f ~/Library/Logs/Claude/mcp*.log

# Windows systems
type %APPDATA%\Claude\Logs\mcp*.log
```
- Common Failure Modes:
  - Service not appearing: Verify JSON syntax accuracy and absolute path specifications.
  - Tool invocations failing: Consult server logs and restart the Claude Desktop application.
  - Path resolution problems: Use fully qualified (absolute) paths throughout.
For in-depth troubleshooting assistance, consult the official MCP debugging documentation.
Development Workflow
```bash
# Initiate development session with file watching
bun --watch run dev

# Execute automated quality assurance routines
bun run test

# Run code style validation
bun run lint
```
Critical Observations
- Stealth Limitations:
  - The integrated evasion features thwart prevalent detection methodologies.
  - For maximum success against advanced targets, supplementary measures (e.g., proxy rotation, user-agent cycling) may be necessary.
  - Certain sites employ proprietary detection logic that may still identify automated activity.
- Resource Management:
  - Browser environments are maintained in a pool for rapid reuse.
  - Unutilized browser sessions are automatically terminated to conserve overhead.
  - Built-in constraints manage resource consumption, preventing system overload.
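The pooling behavior described under Resource Management can be sketched as a small generic pool with a minimum size, a maximum size, and an idle timeout, mirroring the BROWSER_POOL_MIN, BROWSER_POOL_MAX, and BROWSER_POOL_IDLE_TIMEOUT settings. The `IdlePool` class, its synchronous API, and the injected clock are assumptions for illustration; a real browser pool would be asynchronous and is likely more involved.

```typescript
// Hypothetical sketch; a real browser pool would create and close browsers asynchronously.

interface PoolOptions<T> {
  create: () => T;
  destroy: (resource: T) => void;
  min: number;
  max: number;
  idleTimeoutMs: number;
  now?: () => number;
}

class IdlePool<T> {
  private idle: Array<{ resource: T; since: number }> = [];
  private inUse = 0;
  private opts: PoolOptions<T>;
  private now: () => number;

  constructor(opts: PoolOptions<T>) {
    this.opts = opts;
    this.now = opts.now ?? Date.now;
  }

  /** Reuses an idle resource, creates one if under max, else returns undefined. */
  acquire(): T | undefined {
    const reused = this.idle.pop();
    if (reused !== undefined) {
      this.inUse++;
      return reused.resource;
    }
    if (this.inUse >= this.opts.max) return undefined;
    this.inUse++;
    return this.opts.create();
  }

  /** Returns a resource to the pool and records when it became idle. */
  release(resource: T): void {
    this.inUse--;
    this.idle.push({ resource, since: this.now() });
  }

  /** Destroys resources idle for idleTimeoutMs or longer, keeping at least min alive. */
  reapIdle(): number {
    const cutoff = this.now() - this.opts.idleTimeoutMs;
    const kept: Array<{ resource: T; since: number }> = [];
    let closed = 0;
    for (const entry of this.idle) {
      const alive = this.inUse + this.idle.length - closed;
      if (entry.since <= cutoff && alive > this.opts.min) {
        this.opts.destroy(entry.resource);
        closed++;
      } else {
        kept.push(entry);
      }
    }
    this.idle = kept;
    return closed;
  }
}
```

Calling `reapIdle()` on a timer gives the automatic termination behavior, while the `max` check in `acquire()` is what bounds resource consumption.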
Licensing
Distributed under the MIT License.
Contextual Definition: Headless Browser
WIKIPEDIA: A headless browser is a web browser without a graphical user interface. Headless browsers provide automated control of a web page in an environment similar to popular web browsers, but they are executed via a command-line interface or using network communication. They are particularly useful for testing web pages as they are able to render and understand HTML the same way a browser would, including styling elements such as page layout, color, font selection and execution of JavaScript and Ajax which are usually not available when using other testing methods. Since version 59 of Google Chrome and version 56 of Firefox, there is native support for remote control of the browser. This made earlier efforts obsolete, notably PhantomJS.
Primary Applications
The chief use cases for headless browsers include:
- Executing automated testing sequences for contemporary web platforms.
- Generating static image captures of rendered web documents.
- Orchestrating automated validation runs for JavaScript libraries.
- Automating complex user interactions with dynamic web interfaces.
Secondary Applications
Headless browsers also serve valuable functions in web data harvesting. Google indicated in 2009 that employing a headless browser could aid in indexing content from sites heavily reliant on Ajax. Conversely, headless browser technology has been leveraged nefariously for activities such as:
- Initiating distributed denial-of-service attacks against web servers.
- Artificially inflating digital advertisement view counts.
- Executing web interactions contrary to site terms of service (e.g., automated credential testing).
However, a comprehensive traffic analysis conducted in 2018 revealed no discernible bias by malicious entities favoring headless browser deployment over traditional browsers for harmful operations like DDoS, SQL injection, or XSS exploits.
Automation Frameworks
As several leading browser engines natively support headless execution modes via dedicated APIs, established software solutions have emerged to offer standardized automation interfaces. These include:
- Selenium WebDriver – A framework adhering to the W3C WebDriver specification.
- Playwright – A Node.js utility designed for cross-engine automation (Chromium, Firefox, WebKit).
- Puppeteer – A specialized Node.js library targeting Chrome and Firefox control.
Test Harness Integration
Certain software solutions for quality assurance integrate headless browsing capabilities directly into their operational apparatus.
- Capybara relies on headless browsing (via WebKit or Headless Chrome) to simulate user actions within its testing protocols.
- Jasmine defaults to Selenium but permits configuration for WebKit or Headless Chrome execution.
- Cypress, a prominent frontend testing environment.
- QF-Test, a tool for automated GUI testing that supports headless browser utilization.
Alternative Rendering Approaches
An alternative is to use environments that expose browser-like APIs without rendering pages. For example, Deno provides browser APIs as part of its design, and for Node.js, jsdom is the most feature-complete simulation. These environments typically support core browser functionality (HTML parsing, cookies, XHR, limited JavaScript execution), but they do not render the DOM and offer only restricted DOM event handling. In exchange, they usually run faster than a full headless browser.
