
mcp-internet-scraper-toolkit

A Model Context Protocol (MCP) service enabling sophisticated web data retrieval via Google Search integration, capable of rendering and processing retrieved HTML content, featuring robust anti-bot mechanisms, resource pooling, and content caching.

Author

Claw256

MIT License

Quick Info

GitHub Stars: 5
NPM Weekly Downloads: 0
Tools: 1
Last Updated: 2026-02-19

Tags

scraping, automation, browser, browser automation, automation web, searching scraping

Web Content Acquisition Utility (MCP Server)

This specialized MCP server furnishes advanced capabilities for interfacing with Google Search and viewing resultant webpage content, incorporating sophisticated evasion techniques against automated detection systems.

Core Capabilities

  • Integration with Google Custom Search, supporting granular query refinement.
  • Facility to display retrieved web documents, including dynamic Markdown transformation.
  • Built-in request throttling and content result caching layers.
  • Efficient management of browser execution environments via instance pooling.
  • Advanced countermeasures against bot fingerprinting, leveraging rebrowser-puppeteer.
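The request throttling named in the list above could work roughly as follows. This is a minimal sketch of a sliding-window limiter, not the server's actual implementation: the class name `SlidingWindowLimiter`, its method `tryAcquire`, and the injectable clock are hypothetical. The two constructor arguments correspond in spirit to the `RATE_LIMIT_WINDOW` and `RATE_LIMIT_MAX_REQUESTS` settings shown in the configuration section.

```typescript
// Hypothetical sliding-window limiter: allows at most `maxRequests`
// calls within any `windowMs` interval.
class SlidingWindowLimiter {
  private timestamps: number[] = [];
  private readonly windowMs: number;
  private readonly maxRequests: number;
  private readonly now: () => number;

  constructor(windowMs: number, maxRequests: number, now: () => number = Date.now) {
    this.windowMs = windowMs;
    this.maxRequests = maxRequests;
    this.now = now; // injectable clock, so the limiter can be tested without waiting
  }

  // Returns true and records the request if it fits in the window,
  // otherwise returns false and records nothing.
  tryAcquire(): boolean {
    const t = this.now();
    const cutoff = t - this.windowMs;
    // Drop timestamps that have fallen out of the window.
    this.timestamps = this.timestamps.filter((ts) => ts > cutoff);
    if (this.timestamps.length >= this.maxRequests) {
      return false;
    }
    this.timestamps.push(t);
    return true;
  }
}
```

With the sample settings (a 60000 ms window and 60 requests), `new SlidingWindowLimiter(60000, 60)` would reject the 61st call made within one minute.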

System Requirements

  • Execution environment: Bun runtime, version 1.0 or newer.
  • Authentication credentials: Valid Google API Key and a configured Search Engine ID (CX).

Deployment Steps

```bash
# Dependency acquisition
bun install

# Compilation artifacts generation
bun run build
```

Configuration Directives

Authentication Token Management (Cookies)

To facilitate access to sites requiring authenticated sessions, the following procedure must be observed:

  1. Install the designated Chrome extension: Get cookies.txt LOCALLY
  2. Navigate to and authenticate successfully with all necessary external domains.
  3. Export your session cookies using the extension utility, saving them in JSON format.
  4. Safeguard the exported credentials file.
  5. Declare the absolute filesystem path to this file via the BROWSER_COOKIES_PATH environment variable.
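Before pointing `BROWSER_COOKIES_PATH` at the exported file, it can help to sanity-check its contents. The sketch below assumes the export is a JSON array of objects with at least `name`, `value`, and `domain` string fields. That shape, the `ExportedCookie` interface, and the `parseCookieExport` helper are assumptions for illustration, not the format the server is documented to require.

```typescript
// Hypothetical shape of one exported cookie entry; a real export may carry more fields.
interface ExportedCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
}

// Parse the exported JSON text and keep only well-formed entries.
// Entries missing name, value, or domain are skipped rather than rejected.
function parseCookieExport(jsonText: string): ExportedCookie[] {
  const parsed: unknown = JSON.parse(jsonText);
  if (!Array.isArray(parsed)) {
    throw new Error("Cookie export must be a JSON array");
  }
  const cookies: ExportedCookie[] = [];
  for (const entry of parsed) {
    if (typeof entry !== "object" || entry === null) continue;
    const c = entry as Record<string, unknown>;
    const name = c["name"];
    const value = c["value"];
    const domain = c["domain"];
    const path = c["path"];
    if (typeof name === "string" && typeof value === "string" && typeof domain === "string") {
      // Default the path to "/" when the export omits it (an assumption).
      cookies.push({ name, value, domain, path: typeof path === "string" ? path : "/" });
    }
  }
  return cookies;
}
```

Running this over the file contents before starting the server gives an early signal that the export was saved in JSON rather than the Netscape text format.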

MCP Service Initialization Settings

Inject the following configuration block into your primary MCP manifest file (cline_mcp_settings.json or claude_desktop_config.json):

  • For Cline Users: %APPDATA%\Code\User\globalStorage\rooveterinaryinc.roo-cline\settings\cline_mcp_settings.json
  • For Claude Desktop Users:
    • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
    • Windows: %APPDATA%\Claude\claude_desktop_config.json

```json
{
  "mcpServers": {
    "web-search": {
      "command": "bun",
      "args": [
        "run",
        "/ABSOLUTE/PATH/TO/web_search_mcp/dist/index.js"
      ],
      "env": {
        "GOOGLE_API_KEY": "your_api_key",
        "GOOGLE_SEARCH_ENGINE_ID": "your_search_engine_id",
        "MAX_CONCURRENT_BROWSERS": "3",
        "BROWSER_TIMEOUT": "30000",
        "RATE_LIMIT_WINDOW": "60000",
        "RATE_LIMIT_MAX_REQUESTS": "60",
        "SEARCH_CACHE_TTL": "3600",
        "VIEW_URL_CACHE_TTL": "7200",
        "MAX_CACHE_ITEMS": "1000",
        "BROWSER_POOL_MIN": "1",
        "BROWSER_POOL_MAX": "5",
        "BROWSER_POOL_IDLE_TIMEOUT": "30000",
        "REBROWSER_PATCHES_RUNTIME_FIX_MODE": "addBinding",
        "REBROWSER_PATCHES_SOURCE_URL": "jquery.min.js",
        "REBROWSER_PATCHES_UTILITY_WORLD_NAME": "util",
        "REBROWSER_PATCHES_DEBUG": "0",
        "BROWSER_COOKIES_PATH": "C:\\path\\to\\cookies.json",
        "LOG_LEVEL": "info",
        "NO_COLOR": "0",
        "BUN_FORCE_COLOR": "1",
        "FORCE_COLOR": "1"
      }
    }
  }
}
```

Substitute /ABSOLUTE/PATH/TO/web_search_mcp with the canonical installation directory path for the server.
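Every value in the `env` block is a string, so the server has to convert the numeric ones at startup. The following is a minimal sketch of how that might look, not the project's actual loader: `loadSettings`, `readInt`, and the `ServerSettings` interface are hypothetical names, and the fallback numbers simply mirror the sample configuration because the real defaults are not stated.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical subset of the settings; field names are illustrative.
interface ServerSettings {
  maxConcurrentBrowsers: number;
  browserTimeoutMs: number;
  rateLimitWindowMs: number;
  rateLimitMaxRequests: number;
}

// Read a non-negative integer from the environment, or use the fallback when unset.
function readInt(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number.parseInt(raw, 10);
  if (!Number.isFinite(n) || n < 0) {
    throw new Error(`${key} must be a non-negative integer, got "${raw}"`);
  }
  return n;
}

function loadSettings(env: Env): ServerSettings {
  // The two Google credentials are listed as requirements, so fail fast without them.
  for (const key of ["GOOGLE_API_KEY", "GOOGLE_SEARCH_ENGINE_ID"]) {
    if (!env[key]) throw new Error(`Missing required variable ${key}`);
  }
  return {
    // Fallbacks copy the sample values above; they are assumptions, not documented defaults.
    maxConcurrentBrowsers: readInt(env, "MAX_CONCURRENT_BROWSERS", 3),
    browserTimeoutMs: readInt(env, "BROWSER_TIMEOUT", 30000),
    rateLimitWindowMs: readInt(env, "RATE_LIMIT_WINDOW", 60000),
    rateLimitMaxRequests: readInt(env, "RATE_LIMIT_MAX_REQUESTS", 60),
  };
}
```

Taking the environment as a parameter rather than reading `process.env` directly keeps the function easy to test; in the server it would be called as `loadSettings(process.env)`.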

Operational Telemetry Control

The verbosity and presentation of operational output are governed by these environment variables:

  • LOG_LEVEL: Controls message severity (options: error, warn, info, debug). Default is info.
  • NO_COLOR: Suppresses terminal colorization when set to "1".
  • BUN_FORCE_COLOR: Manages color output specifically within the Bun runtime (set to "0" to disable).
  • FORCE_COLOR: Global override for color rendering (set to "0" to disable).
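As an illustration of how these four variables might interact, the sketch below resolves a log level and a color decision from an environment object. The helper names (`resolveLevel`, `shouldLog`, `colorEnabled`) are hypothetical, and the precedence between the three color variables is an assumption, since the description above does not define it.

```typescript
const LEVELS = ["error", "warn", "info", "debug"] as const;
type LogLevel = (typeof LEVELS)[number];

// Map the raw LOG_LEVEL value to a known level, falling back to the documented default "info".
function resolveLevel(raw: string | undefined): LogLevel {
  const normalized = raw?.toLowerCase();
  const match = LEVELS.find((level) => level === normalized);
  return match ?? "info";
}

// A message is emitted when its severity is at or above the configured level.
function shouldLog(configured: LogLevel, message: LogLevel): boolean {
  return LEVELS.indexOf(message) <= LEVELS.indexOf(configured);
}

// Assumed precedence: NO_COLOR="1" wins, then either FORCE_COLOR or BUN_FORCE_COLOR set to "0".
function colorEnabled(env: Record<string, string | undefined>): boolean {
  if (env["NO_COLOR"] === "1") return false;
  if (env["FORCE_COLOR"] === "0" || env["BUN_FORCE_COLOR"] === "0") return false;
  return true;
}
```

With `LOG_LEVEL` set to `warn`, for example, `shouldLog` accepts `error` and `warn` messages and drops `info` and `debug`.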

Bot Evasion Protocol Details

This utility employs the rebrowser-puppeteer library for enhanced stealth:

  1. Runtime Integrity Patching (Leak Prevention):

    • Utilizes the addBinding mechanism to circumvent detection based on Runtime.Enable calls.
    • Ensures functional integrity across Worker threads and cross-origin iframes.
    • Retains necessary access context within the main JavaScript environment.
  2. Source Attribution Obfuscation:

    • Modifies Puppeteer’s internal sourceURL metadata to mimic standard, non-automated script loading patterns.
  3. Execution Environment Normalization:

    • Assigns a generic identifier to the utility execution context.
  4. Browser Launch Hardening:

    • Deactivates flags commonly associated with automated browsing instances.
    • Applies optimized Chromium launch arguments.
    • Configures consistent viewport dimensions and window properties.

Integration with Claude Desktop

  1. Verify that Claude Desktop is installed and running the most recent release.
  2. Locate and open the configuration file path specified above.
  3. Embed the server configuration details shown in the Configuration section.
  4. Initiate a restart of the Claude Desktop application.
  5. Successful tool registration is indicated by the hammer icon appearing in the Claude Desktop interface.

Exposed Functionality

1. Web Search Interface

```typescript
{
  name: "search",
  params: {
    query: string;              // The search phrase
    trustedDomains?: string[];  // Optional list of preferred sources
    excludedDomains?: string[]; // Optional list of domains to ignore
    resultCount?: number;       // Maximum number of results to fetch
    safeSearch?: boolean;       // Toggle for restricted content filtering
    dateRestrict?: string;      // Time constraint (e.g., 'd' for day, 'w' for week)
  }
}
```
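To make the parameters concrete, here is one way they could be mapped onto a Google Custom Search JSON API request. This is a sketch under assumptions, not the server's code: `buildSearchUrl` is a hypothetical helper, and expressing `trustedDomains` and `excludedDomains` as `site:` and `-site:` query operators is one possible choice among several. The `key`, `cx`, `q`, `num`, `safe`, and `dateRestrict` query parameters are those of the public API, which caps `num` at 10.

```typescript
interface SearchParams {
  query: string;
  trustedDomains?: string[];
  excludedDomains?: string[];
  resultCount?: number;
  safeSearch?: boolean;
  dateRestrict?: string;
}

// Hypothetical helper: turn the tool parameters into a Custom Search request URL.
function buildSearchUrl(p: SearchParams, apiKey: string, cx: string): string {
  // Assumed mapping: preferred domains become an OR group, excluded ones become -site: terms.
  const include = (p.trustedDomains ?? []).map((d) => `site:${d}`).join(" OR ");
  const exclude = (p.excludedDomains ?? []).map((d) => `-site:${d}`).join(" ");
  const q = [p.query, include ? `(${include})` : "", exclude]
    .filter((part) => part !== "")
    .join(" ");

  const pairs: Array<[string, string]> = [["key", apiKey], ["cx", cx], ["q", q]];
  if (p.resultCount !== undefined) {
    // Clamp to the 1..10 range the API accepts per request.
    const num = Math.min(10, Math.max(1, Math.trunc(p.resultCount)));
    pairs.push(["num", String(num)]);
  }
  if (p.safeSearch !== undefined) {
    pairs.push(["safe", p.safeSearch ? "active" : "off"]);
  }
  if (p.dateRestrict) {
    pairs.push(["dateRestrict", p.dateRestrict]); // passed through unchanged
  }
  const queryString = pairs
    .map(([k, v]) => `${encodeURIComponent(k)}=${encodeURIComponent(v)}`)
    .join("&");
  return `https://www.googleapis.com/customsearch/v1?${queryString}`;
}
```

Folding the domain filters into the query string keeps the request to a single call even when several domains are listed.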

2. Remote Document Retrieval

```typescript
{
  name: "view_url",
  params: {
    url: string;             // The precise Uniform Resource Locator to access
    includeImages?: boolean; // Option to fetch associated imagery
    includeVideos?: boolean; // Option to fetch embedded video assets
    preserveLinks?: boolean; // Maintain hyperlink structure upon rendering
    formatCode?: boolean;    // Special rendering for source code blocks
  }
}
```
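MCP clients invoke tools through a JSON-RPC 2.0 `tools/call` request. The sketch below shows what such a request for `view_url` might look like when built by hand. The `makeViewUrlRequest` helper and the http/https check are illustrative assumptions; in normal use the MCP client constructs this message for you.

```typescript
interface ViewUrlArgs {
  url: string;
  includeImages?: boolean;
  includeVideos?: boolean;
  preserveLinks?: boolean;
  formatCode?: boolean;
}

// Shape of an MCP tool invocation over JSON-RPC 2.0.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: ViewUrlArgs };
}

// Hypothetical helper: build a tools/call request for the view_url tool.
function makeViewUrlRequest(id: number, args: ViewUrlArgs): ToolCallRequest {
  // Assumed client-side check: only http and https URLs are forwarded.
  if (!/^https?:\/\//i.test(args.url)) {
    throw new Error(`Unsupported URL scheme: ${args.url}`);
  }
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "view_url", arguments: args },
  };
}
```

Serializing the result with `JSON.stringify(makeViewUrlRequest(1, { url: "https://example.com", preserveLinks: true }))` yields the message a client would send to the server.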

Diagnostics and Remediation

Issues with Claude Desktop Connectivity

  1. Review output streams:

    ```bash
    # macOS
    tail -n 20 -f ~/Library/Logs/Claude/mcp*.log

    # Windows systems
    type %APPDATA%\Claude\Logs\mcp*.log
    ```

  2. Common Failure Modes:

    • Service not appearing: Verify JSON syntax accuracy and absolute path specifications.
    • Tool invocations failing: Consult server logs and cycle the Claude Desktop application.
    • Path resolution problems: Guarantee the use of fully qualified paths.

For in-depth troubleshooting assistance, consult the official MCP debugging documentation.

Development Workflow

```bash
# Initiate development session with file watching
bun --watch run dev

# Execute automated quality assurance routines
bun run test

# Run code style validation
bun run lint
```

Critical Observations

  1. Stealth Limitations:

    • The integrated evasion features thwart prevalent detection methodologies.
    • For maximum success against advanced targets, supplementary measures (e.g., proxy rotation, user-agent cycling) may be necessary.
    • Certain sites employ proprietary detection logic that may still identify automated activity.
  2. Resource Management:

    • Browser environments are maintained in a pool for rapid reuse.
    • Unutilized browser sessions are automatically terminated to conserve overhead.
    • Built-in constraints manage resource consumption, preventing system overload.
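The pooling behaviour described under Resource Management can be pictured with a small generic pool. This is a sketch under assumptions rather than the server's implementation: `IdlePool` and its methods are hypothetical, the resource type is left generic instead of being a real browser instance, and only the maximum size and idle timeout are modelled (compare `BROWSER_POOL_MAX` and `BROWSER_POOL_IDLE_TIMEOUT`); a minimum size is not.

```typescript
interface IdleEntry<T> {
  resource: T;
  idleSince: number;
}

// Hypothetical pool: reuses released resources and destroys those idle for too long.
class IdlePool<T> {
  private idle: IdleEntry<T>[] = [];
  private inUse = 0;
  private readonly create: () => T;
  private readonly destroy: (resource: T) => void;
  private readonly max: number;
  private readonly idleTimeoutMs: number;
  private readonly now: () => number;

  constructor(
    create: () => T,
    destroy: (resource: T) => void,
    max: number,
    idleTimeoutMs: number,
    now: () => number = Date.now,
  ) {
    this.create = create;
    this.destroy = destroy;
    this.max = max;
    this.idleTimeoutMs = idleTimeoutMs;
    this.now = now;
  }

  // Prefer an idle resource; create a new one only while under the cap.
  acquire(): T {
    const entry = this.idle.pop();
    if (entry) {
      this.inUse++;
      return entry.resource;
    }
    if (this.inUse >= this.max) throw new Error("Pool exhausted");
    this.inUse++;
    return this.create();
  }

  // Return a resource to the pool and note when it became idle.
  release(resource: T): void {
    this.inUse--;
    this.idle.push({ resource, idleSince: this.now() });
  }

  // Destroy resources idle for at least the timeout; returns how many were removed.
  evictIdle(): number {
    const cutoff = this.now() - this.idleTimeoutMs;
    const keep: IdleEntry<T>[] = [];
    let evicted = 0;
    for (const entry of this.idle) {
      if (entry.idleSince <= cutoff) {
        this.destroy(entry.resource);
        evicted++;
      } else {
        keep.push(entry);
      }
    }
    this.idle = keep;
    return evicted;
  }
}
```

In a real server, `evictIdle` would typically run on a timer, and `create` and `destroy` would launch and close browser instances.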

Licensing

Distributed under the MIT License.

Contextual Definition: Headless Browser

WIKIPEDIA: A headless browser is a web browser without a graphical user interface. Headless browsers provide automated control of a web page in an environment similar to popular web browsers, but they are executed via a command-line interface or using network communication. They are particularly useful for testing web pages as they are able to render and understand HTML the same way a browser would, including styling elements such as page layout, color, font selection and execution of JavaScript and Ajax which are usually not available when using other testing methods. Since version 59 of Google Chrome and version 56 of Firefox, there is native support for remote control of the browser. This made earlier efforts obsolete, notably PhantomJS.

Primary Applications

The chief use cases for headless browsers include:

  • Executing automated testing sequences for contemporary web platforms.
  • Generating static image captures of rendered web documents.
  • Orchestrating automated validation runs for JavaScript libraries.
  • Automating complex user interactions with dynamic web interfaces.

Secondary Applications

Headless browsers also serve valuable functions in web data harvesting. Google indicated in 2009 that employing a headless browser could aid in indexing content from sites heavily reliant on Ajax. Conversely, headless browser technology has been leveraged nefariously for activities such as:

  • Initiating distributed denial-of-service attacks against web servers.
  • Artificially inflating digital advertisement view counts.
  • Executing web interactions contrary to site terms of service (e.g., automated credential testing).

However, a comprehensive traffic analysis conducted in 2018 revealed no discernible bias by malicious entities favoring headless browser deployment over traditional browsers for harmful operations like DDoS, SQL injection, or XSS exploits.

Automation Frameworks

As several leading browser engines natively support headless execution modes via dedicated APIs, established software solutions have emerged to offer standardized automation interfaces. These include:

  • Selenium WebDriver – A framework adhering to the W3C WebDriver specification.
  • Playwright – A Node.js utility designed for cross-engine automation (Chromium, Firefox, WebKit).
  • Puppeteer – A specialized Node.js library targeting Chrome and Firefox control.

Test Harness Integration

Certain software solutions for quality assurance integrate headless browsing capabilities directly into their operational apparatus.

  • Capybara relies on headless browsing (via WebKit or Headless Chrome) to simulate user actions within its testing protocols.
  • Jasmine defaults to Selenium but permits configuration for WebKit or Headless Chrome execution.
  • Cypress, a prominent frontend testing environment.
  • QF-Test, a tool for automated GUI testing that supports headless browser utilization.

Alternative Rendering Approaches

An alternative paradigm involves utilizing environments that expose browser-like APIs without full rendering capabilities. For example, Deno incorporates inherent browser APIs within its architecture. For Node.js environments, jsdom stands as the most feature-complete simulation. These environments often support core browser functionalities (HTML parsing, cookies, XHR, limited JavaScript execution), but they generally do not render the DOM and offer more restricted DOM event handling than a true headless browser. In exchange, they typically run faster than full-stack headless browsers.
