omni-fetch-gateway-mcp
A high-performance MCP service engineered for advanced web data acquisition, featuring multi-source retrieval pipelines fine-tuned for Large Language Model (LLM) semantic ingestion. It systematically prunes extraneous webpage elements and serializes the remaining high-fidelity context into an AI-consumable structure.
Author

weidwonder
OmniFetch Gateway MCP Service
OmniFetch Gateway is an information-retrieval server that implements the Model Context Protocol (MCP). It gives AI agents web search and web content extraction, with output optimized for consumption by Large Language Models (LLMs). By querying multiple search engines concurrently and filtering page content, the server turns raw web pages into text that an LLM can consume directly.
Core Capabilities
- 🌐 Aggregated Search Engine Interface: Native support for diverse search backends, including DuckDuckGo and Google.
- 🧠 LLM-Centric Extraction: Advanced parsing algorithms that intelligently discard boilerplate/noise and isolate high-signal content.
- ✅ Value Focus: Automated identification and retention of primary narrative and evidentiary material.
- 🔗 Verifiability Output: Generates diverse serialization formats, intrinsically linking extracted data back to its origin source.
- ⚡ Performance Architecture: Built on FastMCP with a non-blocking, asynchronous design.
Deployment Instructions
Method A: Standard Environment Setup
- Prerequisites Check:
  - Python version requirement: Minimum 3.9.
  - Strong recommendation for utilizing isolated virtual environments.
- Source Retrieval:

      git clone https://github.com/yourusername/crawl4ai-mcp-server.git
      cd crawl4ai-mcp-server

- Environment Initialization:

      python -m venv fetch_env
      source fetch_env/bin/activate    # For Unix-like systems
      # or
      .\fetch_env\Scripts\activate     # For Windows PowerShell/CMD

- Dependency Installation:

      pip install -r requirements.txt

- Browser Component Installation (for advanced rendering):

      playwright install
Method B: Integrated Deployment via Smithery (for Claude Desktop)
Use the Smithery CLI to register the OmniFetch Gateway service with your local Claude Desktop client:
    npx -y @smithery/cli install @weidwonder/crawl4ai-mcp-server --client claude
Operational Interface
The server exposes the following primary functional modules:
Module: network_search
This utility provides comprehensive web querying across integrated search providers:
- DuckDuckGo (Default): Operational without requiring external API credentials; processes AbstractText, Search Snippets, and Related Concepts.
- Google Search: Requires prior configuration of API credentials for utilization; offers superior result precision in some domains.
- Unified Mode: Capability to poll all configured search engines concurrently for maximal result breadth.
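The unified mode can be pictured as a concurrent fan-out over the configured providers. The following is a minimal sketch under stated assumptions, not the server's actual implementation: `search_duckduckgo` and `search_google` are hypothetical stand-ins that return canned results, and the merge step simply concatenates what each provider returns.

```python
import asyncio

# Hypothetical stand-ins for real provider clients; each returns canned data.
async def search_duckduckgo(search_term: str, result_count: int) -> list:
    await asyncio.sleep(0)  # placeholder for a network round trip
    return [{"provider": "duckduckgo", "title": f"{search_term} #{i}"}
            for i in range(result_count)]

async def search_google(search_term: str, result_count: int) -> list:
    await asyncio.sleep(0)  # placeholder for a network round trip
    return [{"provider": "google", "title": f"{search_term} #{i}"}
            for i in range(result_count)]

PROVIDERS = {"duckduckgo": search_duckduckgo, "google": search_google}

async def search_all(search_term: str, result_count: int = 10) -> list:
    """Query every provider concurrently and concatenate the results."""
    tasks = [fn(search_term, result_count) for fn in PROVIDERS.values()]
    # return_exceptions=True keeps one failing provider from discarding the rest.
    outcomes = await asyncio.gather(*tasks, return_exceptions=True)
    merged = []
    for outcome in outcomes:
        if isinstance(outcome, Exception):
            continue  # skip the provider that failed
        merged.extend(outcome)
    return merged

if __name__ == "__main__":
    results = asyncio.run(search_all("quantum computing theory", result_count=2))
    print(len(results))
```

Because the provider calls are awaited together, total latency is bounded by the slowest provider rather than the sum of all of them.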
Parameters:
- search_term: The textual query string.
- result_count: Maximum number of indexed results to fetch (Default: 10).
- provider: Selection for the search indexer.
  - "duckduckgo": Default, no API needed.
  - "google": Requires configured credentials.
  - "all": Executes queries across all active providers.
Invocation Examples:

    # Standard DuckDuckGo execution
    { "search_term": "quantum computing theory", "result_count": 5 }

    # Parallel execution across all available indices
    { "search_term": "quantum computing theory", "result_count": 5, "provider": "all" }
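When assembling these arguments programmatically, it can help to validate them before sending. The helper below is a hypothetical client-side sketch (`build_search_request` is not part of the server); it only mirrors the parameter names, default values, and provider options listed above.

```python
from typing import Any, Dict

# Provider values taken from the parameter list above.
VALID_PROVIDERS = {"duckduckgo", "google", "all"}

def build_search_request(search_term: str,
                         result_count: int = 10,
                         provider: str = "duckduckgo") -> Dict[str, Any]:
    """Return a network_search argument payload, rejecting invalid input early."""
    if not search_term.strip():
        raise ValueError("search_term must not be empty")
    if result_count < 1:
        raise ValueError("result_count must be at least 1")
    if provider not in VALID_PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    return {
        "search_term": search_term,
        "result_count": result_count,
        "provider": provider,
    }

if __name__ == "__main__":
    print(build_search_request("quantum computing theory", result_count=5, provider="all"))
```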
Module: content_ingest
This specialized tool performs LLM-oriented semantic parsing on fetched URLs, transforming HTML into structured, context-rich text:
- source_attribution_markdown: Default output. Markdown format enriched with inline source references for lineage tracking.
- lean_context_markdown: Highly compressed Markdown, scrubbed of non-essential prose for maximum token efficiency.
- base_markdown: Simple conversion from HTML to Markdown structure.
- reference_extract: Isolates and presents only citation and bibliography sections.
- lean_html: The raw HTML equivalent of the lean_context_markdown output.
- standard_markdown: The standard Markdown serialization.
Invocation Example:

    { "target_uri": "https://example.com/deep_dive", "serialization_format": "source_attribution_markdown" }
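A caller might check the URL and format name before invoking the tool. This is a hypothetical client-side sketch (`build_ingest_request` is an illustrative name, not a server function); the format names and parameter names are the ones documented for content_ingest.

```python
from urllib.parse import urlparse

# Format names as documented for content_ingest.
SERIALIZATION_FORMATS = (
    "source_attribution_markdown",
    "lean_context_markdown",
    "base_markdown",
    "reference_extract",
    "lean_html",
    "standard_markdown",
)

def build_ingest_request(target_uri: str,
                         serialization_format: str = "source_attribution_markdown") -> dict:
    """Return a content_ingest argument payload after basic validation."""
    parsed = urlparse(target_uri)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"target_uri must be an absolute http(s) URL: {target_uri!r}")
    if serialization_format not in SERIALIZATION_FORMATS:
        raise ValueError(f"unknown serialization_format: {serialization_format!r}")
    return {"target_uri": target_uri, "serialization_format": serialization_format}

if __name__ == "__main__":
    print(build_ingest_request("https://example.com/deep_dive"))
```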
Configuration Note: For Google Search enablement, credentials must be provisioned in config.json:
{ "google_credentials": { "api_key": "your-g-api-key", "search_engine_id": "your-cse-id" } }
LLM Context Optimization Strategies
The gateway employs systematic processing layers designed to enhance data suitability for neural network comprehension:
- Semantic Chunking: Automated differentiation and preservation of main article body versus peripheral elements.
- Noise Suppression: Aggressive filtering of navigational aids, advertisements, site footers, and other non-substantive content.
- Evidential Integrity: Mandatory inclusion of source URLs within the output stream to facilitate fact-checking.
- Minimalist Filtering: Removal of excessively short or context-free text fragments (minimum length threshold of 10 tokens).
- Output Standardization: Prioritizing source_attribution_markdown to ensure high context fidelity for subsequent AI reasoning.
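Two of the strategies above, minimalist filtering and evidential integrity, are simple enough to sketch. The function below is an illustration only, not the gateway's code: it assumes a "token" is a whitespace-separated word, and `filter_blocks` and the trailing `Source:` line are hypothetical choices.

```python
from typing import List

def filter_blocks(blocks: List[str], source_url: str, min_tokens: int = 10) -> str:
    """Drop fragments shorter than min_tokens and append the source URL.

    Assumption: tokens are approximated by whitespace-separated words.
    """
    kept = [block.strip() for block in blocks if len(block.split()) >= min_tokens]
    body = "\n\n".join(kept)
    # Always carry the origin URL so the text can be traced back to its source.
    return f"{body}\n\nSource: {source_url}"

if __name__ == "__main__":
    page_blocks = [
        "Home | About | Contact",  # navigation residue, 5 tokens
        "Quantum computing uses qubits, which can exist in superposition, "
        "to perform certain calculations faster.",  # body text, 14 tokens
    ]
    print(filter_blocks(page_blocks, "https://example.com/deep_dive"))
```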
Project Structure Outline
    fetch_gateway_root/
    ├── service_modules/
    │   ├── main_entry.py            # Primary server initialization and routing
    │   └── query_engine.py          # Logic for search orchestration
    ├── configuration_defaults.json  # Template for runtime parameters
    ├── metadata.toml                # Project dependency and build info
    ├── dependency_list.txt          # List of required external libraries
    └── MANUAL.md                    # Comprehensive documentation
Configuration Management
- Establish the active configuration file:

      cp configuration_defaults.json config.json

- Integrate Google service credentials into config.json:

      { "google_credentials": { "api_key": "your-google-api-key-here", "search_engine_id": "your-google-cse-id-here" } }
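A server or helper script could read this file at startup and fall back to DuckDuckGo when Google is not configured. The sketch below assumes only the key names shown above; `load_google_credentials` is a hypothetical helper, and how the actual server loads its configuration is not documented here.

```python
import json
from pathlib import Path
from typing import Optional, Tuple

def load_google_credentials(config_path: str = "config.json") -> Optional[Tuple[str, str]]:
    """Return (api_key, search_engine_id), or None when Google is not configured."""
    path = Path(config_path)
    if not path.exists():
        return None  # no config file: only credential-free providers are usable
    config = json.loads(path.read_text(encoding="utf-8"))
    creds = config.get("google_credentials", {})
    api_key = creds.get("api_key", "")
    engine_id = creds.get("search_engine_id", "")
    if not api_key or not engine_id:
        return None  # incomplete credentials are treated as missing
    return api_key, engine_id

if __name__ == "__main__":
    creds = load_google_credentials()
    print("Google search enabled" if creds else "Google search disabled")
```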
Chronology of Releases
- 2025.02.08: Integrated multi-provider search support (DuckDuckGo primary, Google secondary).
- 2025.02.07: Architectural refactor to utilize FastMCP paradigm; improved dependency resolution.
- 2025.02.07: Refined content exclusion parameters, optimizing token density while guaranteeing URL traceability.
Licensing
Distributed under the MIT License.
Collaborations
Contributions via Issues and Pull Requests are highly encouraged.
Personnel
- Steward: weidwonder
- Development: Claude Sonnet 3.5
- Note: 100% of source code generated by Claude. Estimated consumption: $9 ($2 for initial coding, $7 for iterative correction/debugging). Development duration: 3 hours (0.5h coding, 0.5h setup, 2.0h iterative refinement).
Acknowledgment
Gratitude extended to all contributors.
Special mention to:
- The original Crawl4ai repository for foundational concepts in web data extraction methodology.

