mcp-web-ingest-analyzer
Synchronizes web indexing outputs with sophisticated AI language mechanisms for automated vetting and semantic interpretation of retrieved digital assets. Features an advanced full-document search portal and supports integration with various crawling agents to deepen data comprehension.
Author

pragmar
Portal | Source Repository | Reference Manual | Package Index
mcp-web-ingest-analyzer
High-fidelity retrieval and querying capabilities for data gathered via web spiders. Utilizing mcp-web-ingest-analyzer, your generative intelligence pipeline can vet and interpret digitized web artifacts either under explicit command or through autonomous operation. The operational server component incorporates a comprehensive text search engine with support for logical operators, alongside resource classification based on MIME type, HTTP response code, and other metadata.
mcp-web-ingest-analyzer equips the Large Language Model (LLM) with a comprehensive functional repertoire and is interoperable with a spectrum of established web harvesting utilities:
| Harvester/Format | Functionality Summary | Supported OS | Configuration Guide |
|---|---|---|---|
| ArchiveBox | Offline web preservation utility | macOS/Linux | Setup Guide |
| HTTrack | Desktop web mirroring application | macOS/Windows/Linux | Setup Guide |
| InterroBot | Interactive crawling and evaluation suite | macOS/Windows/Linux | Setup Guide |
| Katana | Command-line instrument focused on security reconnaissance | macOS/Windows/Linux | Setup Guide |
| SiteOne | Interactive crawling and evaluation suite | macOS/Windows/Linux | Setup Guide |
| WARC | Standardized digital archive container format | diverse | Setup Guide |
| wget | CLI utility for recursive site copying | macOS/Linux | Setup Guide |
mcp-web-ingest-analyzer is distributed under a permissive, open-source license, and mandates the presence of Claude Desktop along with Python (version 3.10 or newer). Installation is facilitated via the command line interface using pip:
pip install mcp-server-webcrawl
For comprehensive, sequential instructions on deploying the MCP server stack, consult the Setup Guides.
Core Functionalities
- Compatibility layer for Claude Desktop operations
- Support for heterogeneous crawler outputs
- Metadata-driven filtering by classification, response code, and more
- Native support for Boolean expression parsing
- Utility for Markdown conversion and result segment extraction
- Facility for constructing bespoke site-specific data repositories
Procedure Definitions (Routines)
mcp-web-ingest-analyzer furnishes the necessary tooling to explore archived web indices fluidly, allowing for on-the-fly adaptation based on user inquiry. This is its foundational design principle.
It can also run predefined procedural sequences packaged as prompts. These routines can be custom-authored or taken from the included library, and they are designed to be copied and pasted as raw Markdown. They draw on the advanced querying mechanism granted to the LLM, so a routine can embed complex logic, sequential instructions, or even interactive loops, as the Gopher Interface routine demonstrates.
| Routine | Retrieval Link | Group | Purpose Description |
|---|---|---|---|
| 🔍 SEO Analysis | auditseo.md | audit | Deep dive into technical Search Engine Optimization metrics. Covers fundamentals with branching options for detailed review. |
| 🔗 Broken Link Scan | audit404.md | audit | Identifies inaccessible uniform resource locators (URLs) and analyzes error recurrence patterns. Proposes corrective actions alongside fault identification. |
| ⚡ Speed Evaluation | auditperf.md | audit | Assessment of site load performance and optimization bottlenecks. Delivers candid, actionable feedback. |
| 📁 Asset Inventory | auditfiles.md | audit | Examination of file structure and resource composition across the indexed site. Reveals the underlying makeup of the digital footprint. |
| 🌐 Gopher Interface | gopher.md | interface | A throwback search interface reminiscent of legacy Gopher client environments. |
| ⚙️ Query Validator | testsearch.md | self-test | A comprehensive test suite to verify the logical consistency of the search query parser during translation to the FTS5 indexing structure. |
If you wish to bypass the initial site selection step (reducing one interaction), append the command "run pasted for [site identifier or URL]" within the same request body as the routine's Markdown. If pasted without this explicit context, the system will prompt you to select an indexed archive from a displayed roster.
Logical Search Syntax
The retrieval mechanism supports targeted attribute searches (attribute: value) alongside compound logical constructions. The default fulltext search spans the URI, document body, and metadata headers.
Familiarity with the query grammar is beneficial, even though the LLM is the primary consumer of the API interface; queries are typically abstracted away in the main view. The specific generated statement can be revealed by expanding the MCP disclosure element.
Illustrative Search Expressions
| Sample Query | Interpretation |
|---|---|
| privacy | Single-term fulltext match |
| "privacy policy" | Fulltext match for the exact character sequence |
| boundar* | Fulltext prefix match (e.g., boundary, boundaries) |
| id: 12345 | Attribute search for a specific resource identifier |
| url: example.com/somedir | Attribute search for URIs containing the specified path segment |
| type: html | Attribute search restricted solely to HTML documents |
| status: 200 | Attribute search matching resources that returned HTTP success code 200 |
| status: >=400 | Attribute search matching resources with HTTP error codes of 400 or higher |
| content: h1 | Attribute search within the document body for the term 'h1' |
| headers: text/xml | Attribute search within the HTTP response metadata headers |
| privacy AND policy | Fulltext match requiring both terms to be present |
| privacy OR policy | Fulltext match requiring either term to be present |
| policy NOT privacy | Fulltext match for 'policy' excluding any that also contain 'privacy' |
| (login OR signin) AND form | Complex fulltext match: finds documents containing 'form' and either 'login' or 'signin' |
| type: html AND status: 200 | Combined filter: finds only successfully retrieved HTML documents |
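To make the Boolean semantics in the table above concrete, the following JavaScript sketch evaluates two of the sample expressions against a few in-memory documents. It only illustrates the logic: the sample documents and the `hasTerm` helper are hypothetical, and the server itself delegates matching to its search index rather than to code like this.

```javascript
// Hypothetical sample documents standing in for crawled resources.
const docs = [
  { id: 1, type: "html", status: 200, content: "Please use the login form below" },
  { id: 2, type: "html", status: 404, content: "Signin form not found" },
  { id: 3, type: "html", status: 200, content: "Our privacy policy" },
];

// Case-insensitive whole-word test, a simplification of fulltext matching.
function hasTerm(doc, term) {
  const words = doc.content.toLowerCase().split(/\W+/);
  return words.includes(term.toLowerCase());
}

// (login OR signin) AND form
const formPages = docs.filter(
  (d) => (hasTerm(d, "login") || hasTerm(d, "signin")) && hasTerm(d, "form")
);

// type: html AND status: 200
const okHtml = docs.filter((d) => d.type === "html" && d.status === 200);

console.log(formPages.map((d) => d.id)); // ids 1 and 2
console.log(okHtml.map((d) => d.id)); // ids 1 and 3
```

Parentheses group the OR before the AND is applied, which is why document 3 is excluded from the first result even though it is a successfully retrieved HTML page.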
Attribute Search Definitions
Attribute searching enables fine-grained filtering by designating specific columns within the search index. This shifts the focus from scanning all textual content to isolating specific metadata points such as resource location, headers, or body text, leading to superior operational efficiency for targeted retrievals.
| Attribute | Data Scope |
|---|---|
| id | Internal database identifier |
| url | Uniform Resource Locator of the archived asset |
| type | Enumerated category of the resource (refer to content types table) |
| size | File size reported in bytes |
| status | HTTP response code received during retrieval |
| headers | Full set of HTTP response meta-information |
| content | The primary payload body (HTML, CSS, JavaScript, etc.) |
Data Field Inclusion Policy
A subset of attributes can be explicitly requested alongside search results, while core identifiers are perpetually included. Invoking headers or content carries a significant cost in token consumption; employ these judiciously or utilize the 'Extras' feature to summarize high-volume data efficiently for the LLM context window. Field inclusion is a primary configuration argument, separate from the attribute filtering performed via the search query.
| Field | Availability Status |
|---|---|
| id | Always present |
| url | Always present |
| type | Always present |
| status | Always present |
| created | Requires explicit request |
| modified | Requires explicit request |
| size | Requires explicit request |
| headers | Requires explicit request |
| content | Requires explicit request |
Content Type Classifications
Archived data encompasses more than just standard web pages. The type: filter allows grouping by general resource category, which is particularly useful for isolating media without resorting to intricate file extension matching. For instance, one might query for type: html NOT content: login to find pages lacking specific text, or type: img to focus analysis on image assets. The following table details all recognized content types within the retrieval system.
| Type | Classification Description |
|---|---|
| html | Standard web documents |
| iframe | Embedded document frames |
| img | Raster and vector graphics from the web |
| audio | Sound assets loaded by the browser |
| video | Multimedia clips loaded by the browser |
| font | Web typography files |
| style | Cascading Style Sheets (CSS) |
| script | Executable JavaScript files |
| rss | XML-based content syndication feeds |
| text | Unformatted, raw text content |
| pdf | Portable Document Format files |
| doc | Microsoft Word binary or XML documents |
| other | Any resource not fitting a defined category |
Supplementary Processing (Extras)
The extras parameter governs optional post-retrieval processing steps, allowing transformation of raw HTTP data (e.g., rendering HTML as Markdown, generating context snippets, applying custom Regular Expressions, or utilizing XPath selectors) or linking the LLM to external data visualizations (like thumbnails). These options are combinable to tailor the output format precisely to the analytical requirement.
| Extra | Processing Detail |
|---|---|
| thumbnails | Renders images into base64 encoded format for direct visual ingestion and semantic description by AI agents. Optimizes token usage. Functional for image types; SVG formats are excluded from this process. Compatible when filtering by type: img. |
| markdown | Converts raw HTML payload into a condensed Markdown representation, significantly minimizing token overhead and enhancing LLM readability. Applicable when filtering by type: html. |
| regex | Applies specified regular expression patterns against crawled text artifacts (HTML, CSS, JS, etc.) to extract matching data segments. Offers broader scope than XPath for non-HTML documents. Patterns are defined via the extrasRegex argument. |
| snippets | Locates and returns small contextual blocks surrounding fulltext query matches. If used without requesting the content field or markdown extra, it serves as a highly efficient method for result refinement without loading full pages. Effective across HTML, CSS, JS, and other text-based archives. Mimics classic search engine result highlighting. |
| xpath | Selects and returns data based on W3C standard XPath expressions applied to HTML documents. Use selectors like text() for pure text extraction, or element selectors for the surrounding HTML structure. Only operative for resources matching type: html. Selectors are supplied via the extrasXpath argument. |
By leveraging these supplemental features, users can engineer token-efficient data responses. Markdown conversion can shrink HTML size by roughly two-thirds; snippets yield small, fixed-size contextual summaries; and XPath allows for surgical extraction. More precise data requests ensure a higher volume of relevant findings can be accommodated within the active LLM context window.
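The idea behind snippets can be sketched in a few lines of JavaScript: instead of returning a whole page, return only a short window of text around each match. This is an illustration under assumptions, not the server's implementation; the `makeSnippets` name and the `radius` parameter are hypothetical.

```javascript
// Return short context windows around each occurrence of `term` in `text`.
// `radius` is the assumed number of context characters kept on each side.
function makeSnippets(text, term, radius = 20) {
  const snippets = [];
  const haystack = text.toLowerCase();
  const needle = term.toLowerCase();
  if (needle.length === 0) return snippets; // nothing to search for

  let index = haystack.indexOf(needle);
  while (index !== -1) {
    const start = Math.max(0, index - radius);
    const end = Math.min(text.length, index + needle.length + radius);
    // Mark truncated edges so the reader can tell the text was cut.
    const prefix = start > 0 ? "..." : "";
    const suffix = end < text.length ? "..." : "";
    snippets.push(prefix + text.slice(start, end).trim() + suffix);
    index = haystack.indexOf(needle, index + needle.length);
  }
  return snippets;
}

const page = "We value your privacy. Read the privacy policy for details.";
console.log(makeSnippets(page, "privacy", 10));
```

Because each snippet is bounded by `radius`, the response size grows with the number of matches rather than with the size of the page, which is what makes this approach economical for the context window.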
The underlying philosophy is that the LLM should manage this complexity autonomously. If monitoring indicates an over-reliance on the raw "content" field (unfiltered HTML), a directive within the chat to leverage the extras functionality for token budgeting should suffice.
Direct Terminal Operation
Bypass AI integration: Execute traditional Boolean queries directly within the command line interface against your web archives.
mcp-web-ingest-analyzer possesses the capacity to function as a standalone terminal search utility for locally or remotely indexed web crawls. While local archive searches are straightforward, its utility expands significantly when combined with SSH access to remote hosts, allowing immediate interrogation of archives situated elsewhere without necessitating downloads, synchronization, or complex authentication procedures. The interactive mode facilitates rapid initiation of searches against remote crawl datasets.
Launch with the --crawler and --datasrc flags to preload the harvester and data source, or configure both interactively after launch.
mcp-web-ingest-analyzer --crawler wget --datasrc /path/to/datasrc --interactive
Alternatively, define the harvester and source within the interactive session:
mcp-web-ingest-analyzer --interactive
Interactive mode provides a mechanism for iterative exploration of indexed data subsets, accessible anytime, anywhere, directly within a terminal session.
WIKIPEDIA: XMLHttpRequest (XHR) is an API accessible via a JavaScript object that provides methods for dispatching HTTP requests from a browser environment to a web server. These methods permit browser-based applications to communicate with the server subsequent to initial page rendering, allowing for dynamic data exchange. XMLHttpRequest constitutes a fundamental pillar of Asynchronous JavaScript and XML (Ajax) programming methodologies. Before the widespread adoption of Ajax, server interaction primarily relied on standard hyperlink navigation and HTML form submissions, operations that typically resulted in the complete replacement of the current displayed page.
== Genesis ==
The concept behind XMLHttpRequest originated with the Microsoft developers of Outlook Web Access, built for Exchange Server 2000, and first shipped in the Internet Explorer 5 browser (1999). However, the initial implementation did not use the standardized XMLHttpRequest identifier. Instead, developers instantiated a COM object via ActiveXObject("Msxml2.XMLHTTP") or ActiveXObject("Microsoft.XMLHTTP"). Since Internet Explorer 7 (2006), all major browsers have supported the XMLHttpRequest identifier; Mozilla's Gecko engine adopted it in 2002, Apple's Safari in version 1.2 (2004), and Opera in version 8.0 (2005).
=== Standardization Efforts ===
The World Wide Web Consortium (W3C) formally published an initial Working Draft specification for the XMLHttpRequest object on April 5, 2006. A subsequent Level 2 specification was advanced by the W3C on February 25, 2008. The Level 2 specification introduced enhancements such as event progress monitoring, support for cross-origin data transfers, and capacity for handling raw byte streams. By the close of 2011, the enhancements defined in Level 2 were merged back into the primary specification document. In late 2012, stewardship of the specification's maintenance transitioned to the WHATWG, which now sustains a continuously updated living document defined using Web IDL notation.
== Implementation Protocol ==
Generally, the process of dispatching a request using XMLHttpRequest necessitates adherence to a sequence of programming actions.
- Instantiate an XMLHttpRequest object by invoking its constructor.
- Invoke the "open" method to define the request HTTP verb, specify the target resource URI, and select either synchronous or asynchronous execution mode.
- If utilizing asynchronous operation, establish an event listener callback function intended to process changes in the request's state.
- Commence the data transfer by executing the "send" method.
- Process state transitions within the registered event handler. Upon successful receipt of server data, it is typically stored in the "responseText" property by default. When the object concludes its processing cycle, its state transitions to 4, signifying the "done" status.
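The steps above can be sketched as a single JavaScript function. The function name `fetchText` and the optional `XHR` parameter are assumptions made for this illustration; the parameter defaults to the browser's global constructor and exists only so the sketch can be exercised with a stand-in outside a browser.

```javascript
// Minimal sketch of the request sequence described above.
function fetchText(url, onDone, XHR = globalThis.XMLHttpRequest) {
  // Step 1: create the request object.
  const request = new XHR();

  // Step 2: set the HTTP verb, the target URI, and asynchronous mode (true).
  request.open("GET", url, true);

  // Step 3: register a listener for state changes.
  request.onreadystatechange = function () {
    // Step 5: readyState 4 means the request is done.
    if (request.readyState === 4) {
      onDone(request.status, request.responseText);
    }
  };

  // Step 4: start the transfer.
  request.send();
  return request;
}
```

In a browser, a call such as `fetchText("/api/message", (status, text) => console.log(status, text))` would log the status code and body once the response arrives; the `/api/message` path is a placeholder.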
Beyond these fundamental steps, XMLHttpRequest offers extensive configurability for controlling transmission behavior and response parsing. Custom HTTP headers can be prepended to the request to convey server instructions, and data payloads can be uploaded to the server via an argument passed to the "send" call. The returned data stream can be deserialized from JSON into native JavaScript objects or processed incrementally as data arrives, avoiding a mandatory wait for the complete transmission. Furthermore, a request can be terminated prematurely or assigned a timeout threshold, causing failure if not completed within the specified duration.
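Several of these options can be combined in one request. The sketch below sends a JSON payload with a custom header, applies a timeout, and parses a JSON response. The `postJson` name, the five-second timeout, and the injectable `XHR` parameter are assumptions for illustration; `setRequestHeader`, `timeout`, `onload`, `ontimeout`, `onerror`, and `abort` are standard XMLHttpRequest members.

```javascript
// Sketch: POST a JSON payload and report either an error or the parsed reply.
function postJson(url, payload, onResult, XHR = globalThis.XMLHttpRequest) {
  const request = new XHR();
  request.open("POST", url, true);

  // Headers must be set after open() and before send().
  request.setRequestHeader("Content-Type", "application/json");

  // Fail the request if it has not completed within 5000 ms (assumed value).
  request.timeout = 5000;

  request.onload = function () {
    let data;
    try {
      data = JSON.parse(request.responseText);
    } catch (parseError) {
      onResult(new Error("Response was not valid JSON"));
      return;
    }
    onResult(null, data);
  };
  request.ontimeout = () => onResult(new Error("Request timed out"));
  request.onerror = () => onResult(new Error("Network error"));

  // The payload is passed as the argument to send().
  request.send(JSON.stringify(payload));

  // The caller may invoke request.abort() to cancel the request early.
  return request;
}
```

Returning the request object keeps cancellation in the caller's hands, so the function does not need its own cancellation interface.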
== Inter-Domain Communication ==
Early in the development of the World Wide Web, it was recognized that unrestricted access across different security origins could create significant security vulnerabilities. Browsers therefore implemented origin policies that govern cross-domain communication via XHR.
