mcp-web-content-processor-py

A Python MCP server that fetches content from web addresses and converts it into other formats. It handles both static content and HTML that is generated dynamically by JavaScript, and it can extract structured content from web assets, including media.

Author

tatn

MIT License

Quick Info

GitHub Stars: 7
NPM Weekly Downloads: 0
Tools: 1
Last Updated: 2026-02-19

Tags

scraping, automation, web, automation web, browser automation, scraping processing

mcp-server-fetch-python Reimagined: Web Content Acquisition and Transformation Engine

This repository contains an MCP server that fetches web content and converts it into several output formats. It can retrieve pages whose content depends on client-side JavaScript rendering, and it can extract information from embedded media.

Core Capabilities

Available Utilities

The server exposes four tools:

  • extract-plain-data: Fetches raw text content directly from a URL, without any browser rendering.
    • Parameters:
      • url: URL of the target document (e.g., plain text or structured formats such as JSON, XML, CSV, TSV). (Mandatory)
    • Best suited to fast access and to structured, non-interactive data sources.

  • retrieve-fully-rendered-html: Returns the fully rendered HTML of a page, using a headless browser to execute its JavaScript.
    • Parameters:
      • url: URL of the web page. (Mandatory)
    • Needed for modern web applications and Single Page Applications (SPAs) that require JavaScript execution to produce their content.

  • format-as-markdown: Fetches a web page and converts its content to cleanly formatted Markdown.
    • Parameters:
      • url: URL of the web page. (Mandatory)
    • Preserves the document's hierarchical structure while producing readable text.

  • ai-process-media-to-markdown: Uses AI to extract text content from media files.
    • Parameters:
      • url: URL of the target media resource (e.g., images, videos). (Mandatory)
    • Uses computer vision and Optical Character Recognition (OCR) to interpret the content.
    • Prerequisite: a valid OPENAI_API_KEY must be set in the environment variables.
    • Returns an explicit error message if the key is missing or if media processing fails.
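
As a rough illustration of how a client invokes one of these tools, the sketch below builds the JSON-RPC 2.0 `tools/call` request that the Model Context Protocol uses for tool invocation. The helper name `build_tool_call` and the example URL are assumptions made for illustration; in practice an MCP client such as Claude Desktop constructs and sends these messages for you.

```python
import json

# Tool names as listed above; each one takes a single mandatory "url" argument.
TOOLS = (
    "extract-plain-data",
    "retrieve-fully-rendered-html",
    "format-as-markdown",
    "ai-process-media-to-markdown",
)


def build_tool_call(tool: str, url: str, request_id: int = 1) -> str:
    """Return a JSON-RPC 2.0 `tools/call` request as a JSON string.

    `build_tool_call` is a hypothetical helper, not part of this server.
    """
    if tool not in TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": {"url": url}},
    }
    return json.dumps(request)
```

For example, `build_tool_call("format-as-markdown", "https://example.com")` produces a request whose `params.name` is the tool and whose `params.arguments` carries the URL.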

Operational Integration

Utilization within Claude Desktop Client

To use this server with Claude Desktop, add the configuration snippet below to the settings file for your platform:

On macOS: ~/Library/Application\ Support/Claude/claude_desktop_config.json
On Windows: %APPDATA%/Claude/claude_desktop_config.json

"mcpServers": {
  "mcp-server-fetch-python": {
    "command": "uvx",
    "args": [
      "mcp-server-fetch-python"
    ]
  }
}

Environment Configuration Parameters

The server's behavior can be adjusted through the following environment variables:

  • OPENAI_API_KEY: Required by the ai-process-media-to-markdown tool, which uses it for AI-based image analysis and content extraction.
  • PYTHONIOENCODING: Set to "utf-8" if the output shows character encoding problems.
  • MODEL_NAME: Name of the model used for AI operations. Defaults to "gpt-4o" if not set.
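
To make the defaults and the missing-key behavior concrete, here is a minimal sketch of how such settings could be read. It is not the server's actual implementation: the `Settings` class and the `load_settings` and `require_api_key` helpers are hypothetical names, and only the variable names and the "gpt-4o" default come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: Optional[str]  # None when OPENAI_API_KEY is unset or empty
    model_name: str                # falls back to "gpt-4o"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables from a mapping (the process environment by default)."""
    return Settings(
        openai_api_key=env.get("OPENAI_API_KEY") or None,
        model_name=env.get("MODEL_NAME") or "gpt-4o",
    )


def require_api_key(settings: Settings) -> str:
    """Fail with an explicit message when the media tool is used without a key."""
    if not settings.openai_api_key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; ai-process-media-to-markdown cannot run"
        )
    return settings.openai_api_key
```

Passing the environment in as a mapping keeps the function easy to test with a plain dictionary instead of the real process environment.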

Example configuration with environment variables set:

"mcpServers": {
  "mcp-server-fetch-python": {
    "command": "uvx",
    "args": [
      "mcp-server-fetch-python"
    ],
    "env": {
      "OPENAI_API_KEY": "sk-****",
      "PYTHONIOENCODING": "utf-8",
      "MODEL_NAME": "gpt-4o"
    }
  }
}
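
The configuration snippets above are fragments: in claude_desktop_config.json the "mcpServers" key sits inside the file's top-level JSON object. As a minimal sketch, assuming you would rather script the change than edit the file by hand, the following merges a server entry into an existing config without disturbing other keys. The helper names `default_config_path` and `add_server` are illustrative and are not part of this project or of Claude Desktop.

```python
import json
import os
import sys
from pathlib import Path


def default_config_path() -> Path:
    """Return the Claude Desktop config path for the two platforms listed in this document."""
    if sys.platform == "darwin":
        return (Path.home() / "Library" / "Application Support"
                / "Claude" / "claude_desktop_config.json")
    if sys.platform == "win32":
        return Path(os.environ["APPDATA"]) / "Claude" / "claude_desktop_config.json"
    raise RuntimeError("this sketch only covers macOS and Windows")


def add_server(config_path: Path, name: str, entry: dict) -> dict:
    """Merge one server entry under "mcpServers" and write the file back."""
    config = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8")
        if text.strip():
            config = json.loads(text)
    # Keep any servers and unrelated settings that are already present.
    config.setdefault("mcpServers", {})[name] = entry
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

A call such as `add_server(default_config_path(), "mcp-server-fetch-python", {"command": "uvx", "args": ["mcp-server-fetch-python"], "env": {"MODEL_NAME": "gpt-4o"}})` would then register the server; back up the file first, since the sketch overwrites it in place.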

Local Deployment Instructions

Alternatively, the service can be initialized and executed directly on a local machine:

git clone https://github.com/tatn/mcp-server-fetch-python.git
cd mcp-server-fetch-python
uv sync
uv build

After the build completes, add the following entry to the Claude Desktop configuration file. Replace path\\to\\mcp-server-fetch-python with the actual path to the cloned repository (backslashes must be doubled inside JSON strings):

"mcpServers": {
  "mcp-server-fetch-python": {
    "command": "uv",
    "args": [
      "--directory",
      "path\\to\\mcp-server-fetch-python",
      "run",
      "mcp-server-fetch-python"
    ]
  }
}

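Windows paths are easy to get wrong in JSON because each backslash has to be escaped. One way to avoid hand-escaping, shown here as a sketch, is to build the entry in Python and let `json.dumps` do the escaping. The function name `local_server_fragment` is hypothetical, and `REPO_PATH` is the same placeholder used in the instructions, not a real location.

```python
import json

# Placeholder path from the instructions above; substitute your own checkout.
REPO_PATH = r"path\to\mcp-server-fetch-python"


def local_server_fragment(repo_path: str) -> str:
    """Return the "mcpServers" fragment for a locally built server as JSON text."""
    entry = {
        "command": "uv",
        "args": ["--directory", repo_path, "run", "mcp-server-fetch-python"],
    }
    # json.dumps doubles each backslash so the result is valid JSON.
    return json.dumps({"mcpServers": {"mcp-server-fetch-python": entry}}, indent=2)
```

Printing `local_server_fragment(REPO_PATH)` gives text that can be pasted into the configuration file, with the path already escaped.
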
Development and Diagnostics

Troubleshooting Interface Access

To launch the MCP Inspector for debugging, run one of the following commands with npx:

npx @modelcontextprotocol/inspector uvx mcp-server-fetch-python

npx @modelcontextprotocol/inspector uv --directory path\to\mcp-server-fetch-python run mcp-server-fetch-python

WIKIPEDIA REFERENCE: A headless browser is a web browser without a graphical user interface. Headless browsers provide automated control of a web page in an environment similar to standard browsers, but they are driven through a command-line interface or a network protocol. They are particularly useful for testing web pages because they render and interpret HTML the way a regular browser does, including styling (layout, typography, color) and JavaScript execution, which simpler parsing methods usually cannot handle. Since Google Chrome (version 59) and Firefox (version 56) added native support for remote control, older automation tools such as PhantomJS have largely been superseded.

== Primary Application Scenarios ==

The main use cases for headless browsers are:

  • Systematic validation of contemporary web architectures (functional testing).
  • Automated capture of high-fidelity static page snapshots.
  • Execution of automated tests for client-side scripting frameworks.
  • Programmatic control and simulation of user interaction patterns on web interfaces.

=== Secondary Utility Cases ===

Headless browsers are also useful for web scraping. In 2009, Google noted that a headless browser could help its search engine index content from sites that rely heavily on Ajax. Headless browsers have also been misused, for example for:

  • Carrying out Distributed Denial of Service (DDoS) attacks against websites.
  • Inflating advertising impressions.
  • Automating websites in unintended ways (e.g., credential stuffing).

However, a 2018 traffic analysis found no pronounced preference among malicious actors for headless browsers over conventional ones when carrying out attacks such as DDoS, SQL injection, or XSS.

== Software Ecosystem ==

Because the major browser vendors now support headless mode natively through APIs, several tools have emerged that provide a unified interface for browser automation. Notable examples include:

  • Selenium WebDriver – Adheres to W3C standards for WebDriver implementation.
  • Playwright – A Node.js utility designed for automating Chromium, Firefox, and WebKit engines.
  • Puppeteer – A Node.js framework focused on automated control of Chrome or Firefox instances.

=== Test Automation Frameworks ===

Several test frameworks build on headless browsers:

  • Capybara employs headless browsing (via WebKit or Headless Chrome) to precisely mimic end-user behavior in its testing protocols.
  • Jasmine typically defaults to Selenium, but supports configuration for WebKit or Headless Chrome for browser-based test execution.
  • Cypress, a specialized framework for front-end validation.
  • QF-Test, a commercial utility for automated GUI testing, which accommodates headless browser operations.

=== Alternative Methodologies ===

Another approach is to use software that exposes browser-like APIs. For instance, Deno provides browser APIs as part of its design, and for Node.js, jsdom offers the most complete emulation. These alternatives typically support common browser features (HTML parsing, cookies, XHR, some JavaScript), but they do not fully render the DOM and have limited support for DOM events. They usually run faster than a full browser.
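
The gap between parsing and rendering can be shown with a small sketch. It uses Python's standard `html.parser` as a stand-in for a parser-only approach (it is not jsdom or Deno, and unlike them it runs no JavaScript at all). The `TextExtractor` class, the `visible_text` helper, and the sample `PAGE` are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text a parser-only tool sees; scripts are never executed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# A tiny page whose main content is injected by JavaScript at load time.
PAGE = """<html><body>
<h1>Static title</h1>
<div id="app"></div>
<script>document.getElementById("app").textContent = "Rendered by JavaScript";</script>
</body></html>"""
```

Here `visible_text(PAGE)` returns only "Static title": the text that the script would insert never appears, which is why pages like this need a headless browser to retrieve their full content.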
