web-content-extractor-mcp

An MCP server that lets AI assistants read text content from websites that use bot detection, bridging the gap between what a browser can render and what automated access can reach.

Designated Application

This tool is optimized for low-volume retrieval of reference documentation and instructional text (HTML/text only) from sites that use bot detection. It is not designed or tested for general-purpose web crawling or bulk data collection.

Note: This project was developed in collaboration with Claude Sonnet 3.7 and 4.5, using the LLM Context framework.

Deployment Procedure

Prerequisites

Operation requires Python version 3.10 or newer
The uv package manager is mandatory

Installation Steps

Install the extractor utility:

uv tool install scrapling-fetch-mcp

Install the required browser binaries (mandatory; this is a large download):

uvx --from scrapling-fetch-mcp scrapling install

Important: The browser installation downloads several hundred megabytes and must finish before first use. If the MCP server times out on its first invocation, wait a few minutes for the browser installation to complete, then try again.

Configuration in Claude Desktop

Integrate the following configuration snippet into your Claude Desktop MCP settings file:

macOS Path: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows Path: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "scrapling-fetch": {
      "command": "uvx",
      "args": ["scrapling-fetch-mcp"]
    }
  }
}

Restart Claude Desktop after modifying the configuration file so that the change takes effect.
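If the settings file already lists other servers, the entry has to be merged in rather than pasted over the whole file. The following is a minimal sketch of that merge, not part of this project: `merge_server_entry` and `update_config_file` are hypothetical helper names, and the sketch assumes the file contains valid JSON.

```python
import json
from pathlib import Path

# Server name and entry copied from the configuration snippet above.
SERVER_NAME = "scrapling-fetch"
SERVER_ENTRY = {"command": "uvx", "args": ["scrapling-fetch-mcp"]}


def merge_server_entry(config: dict) -> dict:
    """Return a copy of `config` with the server entry added under "mcpServers".

    Existing servers and unrelated top-level keys are preserved.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers[SERVER_NAME] = dict(SERVER_ENTRY)
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path) -> None:
    """Hypothetical helper: load the JSON file at `path` (if present), merge, write back."""
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    path.write_text(json.dumps(merge_server_entry(config), indent=2), encoding="utf-8")
```

On macOS, for example, this could be called as `update_config_file(Path.home() / "Library/Application Support/Claude/claude_desktop_config.json")`. Backing up the file first is a sensible precaution.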

Core Functionality

This MCP server exposes two tools that Claude can invoke on its own when asked to access web content:

Document Fetching: Retrieves the full content of a webpage, with pagination support for long pages.
Pattern Localization: Finds and extracts specific passages that match a regular expression.

The selection between these capabilities is determined dynamically by the AI based on the user's directive:

"Retrieve the official specifications located at https://example.com/api"
"Isolate every instance of 'security protocol' on the fetched document"
"Present the setup guide found on their primary landing page"
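To give a feel for what pattern localization means, here is a minimal sketch of regex matching with surrounding context over text that has already been fetched. It is an illustration only, not the server's implementation: the function name `find_with_context` and the `context_chars` parameter are assumptions made for this example.

```python
import re


def find_with_context(text: str, pattern: str, context_chars: int = 20) -> list[str]:
    """Return each regex match with up to `context_chars` characters on either side."""
    snippets = []
    for match in re.finditer(pattern, text):
        # Clamp the window so it never runs past either end of the text.
        start = max(0, match.start() - context_chars)
        end = min(len(text), match.end() + context_chars)
        snippets.append(text[start:end])
    return snippets


page = "Enable the security protocol first. Each security protocol has a version."
for snippet in find_with_context(page, r"security protocol", context_chars=5):
    print(repr(snippet))
```

Returning a little context around each match lets the reader (or the model) judge relevance without loading the whole page.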

Evasion Modes

The tools support three graduated tiers for circumventing bot countermeasures:

basic: Expedited performance (1-2 seconds), adequate for the majority of targets.
stealth: Medium latency (3-8 seconds), effective against more robust defenses.
max-stealth: Maximal latency (10+ seconds), reserved for highly obfuscated or protected endpoints.

Claude defaults to the basic mode, automatically escalating to higher levels if the initial attempt fails.
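The escalation described above can be pictured as a simple loop over the three modes. This is a sketch under stated assumptions, not the project's code: `fetch` is a hypothetical callable standing in for whatever actually retrieves the page, and it is assumed to return `None` or raise when a request is blocked.

```python
from typing import Callable, Optional

# Mode names from the list above, ordered from fastest to most thorough.
MODES = ["basic", "stealth", "max-stealth"]


def fetch_with_escalation(
    url: str, fetch: Callable[[str, str], Optional[str]]
) -> tuple[str, str]:
    """Try each mode in order and return (mode, content) for the first success.

    `fetch(url, mode)` is a hypothetical callable; it is assumed to return the
    page text, or None (or raise) when the site blocks the request.
    """
    last_error = None
    for mode in MODES:
        try:
            content = fetch(url, mode)
        except Exception as exc:  # treat any failure as a reason to escalate
            last_error = exc
            continue
        if content:
            return mode, content
    raise RuntimeError(f"all modes failed for {url}") from last_error
```

Starting with the cheapest mode keeps typical requests fast and reserves the slow modes for sites that need them.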

Recommendations for Optimal Outcomes

Employ natural language queries; the system manages all underlying technical execution.
For exceptionally lengthy documents, Claude can autonomously manage multi-page traversal.
When seeking precise information, specify the target content, which prompts the use of pattern matching.
The contextual metadata returned helps Claude decide between loading the full page and running a targeted search.
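The multi-page traversal mentioned above amounts to reading a long document in fixed-size slices. The sketch below shows the idea on a plain string; the names `iter_chunks`, `max_length`, and `start_index` are assumptions chosen for this illustration and are not taken from the server's documented interface.

```python
from typing import Iterator


def iter_chunks(text: str, max_length: int) -> Iterator[tuple[int, str]]:
    """Yield (start_index, chunk) pairs that together cover `text` in order."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    for start_index in range(0, len(text), max_length):
        yield start_index, text[start_index:start_index + max_length]


for start, chunk in iter_chunks("abcdefghij", 4):
    print(start, repr(chunk))
```

Each chunk carries its starting offset, so a caller can resume from a known position instead of re-reading the whole document.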

Constraints

Restricted to the extraction of static textual data (manuals, articles, references).
Unsuitable for high-throughput scraping operations or mass data warehousing.
Functionality may be compromised on sites necessitating user credentials or session tokens.
Operational speed is contingent upon the target site's complexity and applied protection intensity.

Developed using Scrapling to enable web access bypassing bot detection protocols.

Licensing

Apache 2.0

WIKIPEDIA: XMLHttpRequest (XHR) is an API in the form of a JavaScript object whose methods transmit HTTP requests from a web browser to a web server. The methods allow a browser-based application to send requests to the server after page loading is complete, and receive information back. XMLHttpRequest is a component of Ajax programming. Prior to Ajax, hyperlinks and form submissions were the primary mechanisms for interacting with the server, often replacing the current page with another one.

== History ==

The concept behind XMLHttpRequest was conceived in 2000 by the developers of Microsoft Outlook. The concept was then implemented within the Internet Explorer 5 browser (1999). However, the original syntax did not use the XMLHttpRequest identifier. Instead, the developers used the identifiers ActiveXObject("Msxml2.XMLHTTP") and ActiveXObject("Microsoft.XMLHTTP"). As of Internet Explorer 7 (2006), all browsers support the XMLHttpRequest identifier. The XMLHttpRequest identifier is now the de facto standard in all the major browsers, including Mozilla's Gecko layout engine (2002), Safari 1.2 (2004) and Opera 8.0 (2005).

=== Standards ===

The World Wide Web Consortium (W3C) published a Working Draft specification for the XMLHttpRequest object on April 5, 2006. On February 25, 2008, the W3C published the Working Draft Level 2 specification. Level 2 added methods to monitor event progress, allow cross-site requests, and handle byte streams. At the end of 2011, the Level 2 specification was absorbed into the original specification. At the end of 2012, the WHATWG took over development and maintains a living document using Web IDL.

== Usage ==

Generally, sending a request with XMLHttpRequest has several programming steps.

1. Create an XMLHttpRequest object by calling a constructor.
2. Call the "open" method to specify the request type, identify the relevant resource, and select synchronous or asynchronous operation.
3. For an asynchronous request, set a listener that will be notified when the request's state changes.
4. Initiate the request by calling the "send" method.
5. Respond to state changes in the event listener.

If the server sends response data, by default it is captured in the "responseText" property. When the object stops processing the response, it changes to state 4, the "done" state.

Aside from these general steps, XMLHttpRequest has many options to control how the request is sent and how the response is processed. Custom header fields can be added to the request to indicate how the server should fulfill it, and data can be uploaded to the server by providing it in the "send" call. The response can be parsed from the JSON format into a readily usable JavaScript object, or processed gradually as it arrives rather than waiting for the entire text. The request can be aborted prematurely or set to fail if not completed in a specified amount of time.

== Cross-domain requests ==

In the early development of the World Wide Web, it was found possible to brea