mcp-web-ingest-analyzer

Advanced search and retrieval for web crawler data. With mcp-web-ingest-analyzer, your AI client can filter and analyze archived web content either under your direction or autonomously. The server includes a fulltext search engine with Boolean operator support, plus resource filtering by type, HTTP status code, and other metadata.

mcp-web-ingest-analyzer gives the Large Language Model (LLM) a complete set of tools for searching web content, and it works with a range of established web crawlers:

Harvester/Format	Functionality Summary	Supported OS	Configuration Guide
ArchiveBox	Offline web preservation utility	macOS/Linux	Setup Guide
HTTrack	Desktop web mirroring application	macOS/Windows/Linux	Setup Guide
InterroBot	Interactive crawling and evaluation suite	macOS/Windows/Linux	Setup Guide
Katana	Command-line instrument focused on security reconnaissance	macOS/Windows/Linux	Setup Guide
SiteOne	Interactive crawling and evaluation suite	macOS/Windows/Linux	Setup Guide
WARC	Standardized digital archive container format	diverse	Setup Guide
wget	CLI utility for recursive site copying	macOS/Linux	Setup Guide

mcp-web-ingest-analyzer is distributed under a permissive open-source license and requires Claude Desktop and Python 3.10 or newer. Install it from the command line with pip:

pip install mcp-server-webcrawl

For step-by-step instructions on setting up the MCP server, see the Setup Guides.

Core Functionalities

Compatibility layer for Claude Desktop operations
Support for heterogeneous crawler outputs
Metadata-driven filtering by classification, response code, and more
Native support for Boolean expression parsing
Utility for Markdown conversion and result segment extraction
Facility for constructing bespoke site-specific data repositories

Procedure Definitions (Routines)

mcp-web-ingest-analyzer gives the LLM the tools to search archived web content freely, adapting on the fly to whatever you ask. That is its primary design goal.

It can also run pre-defined routines, which are packaged as prompts. You can write your own or use the ones in the included library. The prompts are raw Markdown, meant to be copied and pasted directly. They build on the search capabilities available to the LLM and can embed complex logic, sequential steps, or even interactive loops, as the Gopher Interface routine demonstrates.

Routine	Retrieval Link	Group	Purpose Description
🔍 SEO Analysis	`auditseo.md`	audit	Deep dive into technical Search Engine Optimization metrics. Covers fundamentals with branching options for detailed review.
🔗 Broken Link Scan	`audit404.md`	audit	Identifies inaccessible uniform resource locators (URLs) and analyzes error recurrence patterns. Proposes corrective actions alongside fault identification.
⚡ Speed Evaluation	`auditperf.md`	audit	Assessment of site load performance and optimization bottlenecks. Delivers candid, actionable feedback.
📁 Asset Inventory	`auditfiles.md`	audit	Examination of file structure and resource composition across the indexed site. Reveals the underlying makeup of the digital footprint.
🌐 Gopher Interface	`gopher.md`	interface	A throwback search interface reminiscent of legacy Gopher client environments.
⚙️ Query Validator	`testsearch.md`	self-test	A comprehensive test suite to verify the logical consistency of the search query parser during translation to the FTS5 indexing structure.

To skip the initial site selection step (saving one interaction), add "run pasted for [site identifier or URL]" to the same message as the routine's Markdown. If you paste the routine without this context, you will be asked to choose an indexed archive from a list.

Logical Search Syntax

The search supports attribute searches (attribute: value) and compound Boolean expressions. By default, a fulltext search covers the URL, the document body, and the HTTP headers.

Knowing the query grammar is useful even though the LLM is the primary consumer of the API; queries are usually hidden in the main chat view. To see the query that was actually generated, expand the MCP disclosure element.

Illustrative Search Expressions

Sample Query	Interpretation
privacy	Single-term fulltext match
"privacy policy"	Fulltext match for the exact character sequence
boundar*	Fulltext prefix match (e.g., boundary, boundaries)
id: 12345	Attribute search for a specific resource identifier
url: example.com/somedir	Attribute search for URIs containing the specified path segment
type: html	Attribute search restricted solely to HTML documents
status: 200	Attribute search matching resources that returned HTTP success code 200
status: >=400	Attribute search matching resources with HTTP error codes of 400 or higher
content: h1	Attribute search within the document body for the term 'h1'
headers: text/xml	Attribute search within the HTTP response metadata headers
privacy AND policy	Fulltext match requiring both terms to be present
privacy OR policy	Fulltext match requiring either term to be present
policy NOT privacy	Fulltext match for 'policy' excluding any that also contain 'privacy'
(login OR signin) AND form	Complex fulltext match: finds documents containing 'form' and either 'login' or 'signin'
type: html AND status: 200	Combined filter: finds only successfully retrieved HTML documents
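
To get a feel for how the Boolean expressions above behave, here is a minimal sketch that runs a few of them against an in-memory SQLite FTS5 table using Python's standard sqlite3 module. The table name `resources`, its three columns, and the sample rows are assumptions made for illustration; they are not the server's actual index schema, and the server's own query parser may translate expressions differently. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3

# Hypothetical, simplified index; the real schema is not documented here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE resources USING fts5(url, headers, content)")
conn.executemany(
    "INSERT INTO resources (url, headers, content) VALUES (?, ?, ?)",
    [
        ("example.com/privacy", "content-type: text/html",
         "Our privacy policy explains data boundaries"),
        ("example.com/terms", "content-type: text/html",
         "Terms policy and cookie notice"),
        ("example.com/login", "content-type: text/html",
         "Please signin using the form"),
    ],
)

def search(query: str) -> list[int]:
    """Return the row ids matching an FTS5 query, in insertion order."""
    rows = conn.execute(
        "SELECT rowid FROM resources WHERE resources MATCH ? ORDER BY rowid",
        (query,),
    )
    return [row[0] for row in rows]

# Run a handful of the sample expressions from the table above.
results = {
    query: search(query)
    for query in [
        'privacy AND policy',
        '"privacy policy"',
        'boundar*',
        'policy NOT privacy',
        '(login OR signin) AND form',
    ]
}
```

In this sample data, the first three expressions match only the privacy page, `policy NOT privacy` matches only the terms page, and the parenthesized expression matches only the login page.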

Attribute Search Definitions

Attribute searches filter on specific columns of the search index. Rather than scanning all text, a query can target a single attribute such as the URL, the headers, or the body, which makes targeted retrievals more precise and efficient.

Attribute	Data Scope
id	Internal database identifier
url	Uniform Resource Locator of the archived asset
type	Enumerated category of the resource (refer to content types table)
size	File size reported in bytes
status	HTTP response code received during retrieval
headers	Full set of HTTP response meta-information
content	The primary payload body (HTML, CSS, JavaScript, etc.)
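
As a rough illustration of how an attribute term narrows a search, the sketch below translates a single `attribute: value` term into a parameterized SQL condition. The function name `field_term_to_sql`, the use of plain column comparisons, and the LIKE-based substring matching are all hypothetical; the server's real parser targets its own index and may behave differently.

```python
import re

# Attributes from the table above; numeric ones accept comparison operators.
NUMERIC_FIELDS = {"id", "size", "status"}
TEXT_FIELDS = {"url", "type", "headers", "content"}
TERM = re.compile(r"^(\w+):\s*(>=|<=|>|<|=)?\s*(.+)$")

def field_term_to_sql(term: str) -> tuple[str, list]:
    """Translate one `attribute: value` term into (condition, parameters)."""
    match = TERM.match(term.strip())
    if not match:
        raise ValueError(f"not an attribute term: {term!r}")
    field, operator, value = match.groups()
    if field in NUMERIC_FIELDS:
        # Default to equality when no comparison operator is given.
        return f"{field} {operator or '='} ?", [int(value)]
    if field in TEXT_FIELDS:
        if operator:
            raise ValueError(f"{field} does not accept comparison operators")
        if field == "type":
            return "type = ?", [value]  # enumerated category, exact match
        return f"{field} LIKE ?", [f"%{value}%"]  # substring match
    raise ValueError(f"unknown attribute: {field}")

condition, params = field_term_to_sql("status: >=400")
```

Keeping values in a separate parameter list, rather than interpolating them into the SQL string, is a deliberate choice here: it avoids quoting problems when a URL or header value contains special characters.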

Data Field Inclusion Policy

A subset of fields can be requested alongside search results, while core identifiers are always included. Requesting headers or content is expensive in tokens; use them sparingly, or use the extras parameter to condense large results for the LLM context window. Field inclusion is a top-level argument, separate from the attribute filtering performed in the search query.

Field	Availability Status
id	Always present
url	Always present
type	Always present
status	Always present
created	Requires explicit request
modified	Requires explicit request
size	Requires explicit request
headers	Requires explicit request
content	Requires explicit request

Content Type Classifications

Archived data encompasses more than just standard web pages. The type: filter allows grouping by general resource category, which is particularly useful for isolating media without resorting to intricate file extension matching. For instance, one might query for type: html NOT content: login to find pages lacking specific text, or type: img to focus analysis on image assets. The following table details all recognized content types within the retrieval system.

Type	Classification Description
html	Standard web documents
iframe	Embedded document frames
img	Raster and vector graphics from the web
audio	Sound assets loaded by the browser
video	Multimedia clips loaded by the browser
font	Web typography files
style	Cascading Style Sheets (CSS)
script	Executable JavaScript files
rss	XML-based content syndication feeds
text	Unformatted, raw text content
pdf	Portable Document Format files
doc	Microsoft Word binary or XML documents
other	Any resource not fitting a defined category
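
For intuition about how resources could fall into these categories, the following sketch maps an HTTP Content-Type value to one of the type labels above. The mapping is illustrative and incomplete, and the function name `classify` is hypothetical; the server's actual classification rules may differ. The `iframe` type is omitted because it depends on how a document is embedded, not on its MIME type.

```python
def classify(content_type: str) -> str:
    """Map an HTTP Content-Type value to one of the type labels above."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    exact = {
        "text/html": "html",
        "text/css": "style",
        "application/javascript": "script",
        "text/javascript": "script",
        "application/rss+xml": "rss",
        "text/plain": "text",
        "application/pdf": "pdf",
        "application/msword": "doc",
    }
    if mime in exact:
        return exact[mime]
    prefixes = {"image/": "img", "audio/": "audio", "video/": "video", "font/": "font"}
    for prefix, label in prefixes.items():
        if mime.startswith(prefix):
            return label
    return "other"

labels = [classify(value) for value in ("text/html; charset=utf-8", "image/png", "application/zip")]
```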

Supplementary Processing (Extras)

The extras parameter selects optional post-retrieval processing. Extras can transform raw HTTP data (rendering HTML as Markdown, generating context snippets, applying regular expressions, or evaluating XPath selectors) or give the LLM access to additional data such as image thumbnails. They can be combined to tailor the output to the task at hand.

Extra	Processing Detail
thumbnails	Generates base64-encoded images that AI models can view and describe, while keeping token output low. Works with image resources, which can be selected with `type: img`; SVG is not supported.
markdown	Converts raw HTML payload into a condensed Markdown representation, significantly minimizing token overhead and enhancing LLM readability. Applicable when filtering by `type: html`.
regex	Applies specified regular expression patterns against crawled text artifacts (HTML, CSS, JS, etc.) to extract matching data segments. Offers broader scope than XPath for non-HTML documents. Patterns are defined via the `extrasRegex` argument.
snippets	Locates and returns small contextual blocks surrounding fulltext query matches. If used without requesting the `content` field or `markdown` extra, it serves as a highly efficient method for result refinement without loading full pages. Effective across HTML, CSS, JS, and other text-based archives. Mimics classic search engine result highlighting.
xpath	Selects and returns data based on W3C standard XPath expressions applied to HTML documents. Use selectors like `text()` for pure text extraction, or element selectors for the surrounding HTML structure. Only operative for resources matching `type: html`. Selectors are supplied via the `extrasXpath` argument.

By leveraging these supplemental features, users can engineer token-efficient data responses. Markdown conversion can shrink HTML size by roughly two-thirds; snippets yield small, fixed-size contextual summaries; and XPath allows for surgical extraction. More precise data requests ensure a higher volume of relevant findings can be accommodated within the active LLM context window.
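
To make the token savings concrete, here is a minimal sketch of snippet-style and regex-style extraction using only Python's re module. It is not the server's implementation; the helper `make_snippets`, its `radius` and `limit` parameters, and the sample page are hypothetical.

```python
import re

def make_snippets(text: str, term: str, radius: int = 30, limit: int = 3) -> list[str]:
    """Return up to `limit` excerpts of `text` centered on matches of `term`."""
    snippets = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(match.start() - radius, 0)
        end = min(match.end() + radius, len(text))
        excerpt = text[start:end].strip()
        prefix = "..." if start > 0 else ""
        suffix = "..." if end < len(text) else ""
        snippets.append(f"{prefix}{excerpt}{suffix}")
        if len(snippets) == limit:
            break
    return snippets

# A deliberately repetitive sample page standing in for crawled HTML.
page = "<h1>Privacy</h1><p>Read our privacy policy before creating an account.</p>" * 50

snippets = make_snippets(page, "policy")        # snippet-style extraction
headings = re.findall(r"<h1>(.*?)</h1>", page)  # regex-style extraction
total_snippet_chars = sum(len(s) for s in snippets)
```

Here the three snippets together are a small fraction of the full page, which is the effect the extras aim for: the LLM sees the matching context without the surrounding markup.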

The idea is that the LLM should manage this complexity on its own. If you notice it relying heavily on the raw content field (unfiltered HTML), a short instruction in chat to use extras to conserve tokens is usually enough.

Direct Terminal Operation

Bypass AI integration: Execute traditional Boolean queries directly within the command line interface against your web archives.

mcp-web-ingest-analyzer can also run as a standalone terminal search utility for local or remote web crawl archives. Searching local archives is straightforward, but the tool becomes more useful over SSH, where you can query an archive on a remote host without downloads, synchronization, or complex authentication. Interactive mode gets you searching remote crawl data quickly.

Launch with the --crawler and --datasrc flags to start searching immediately, or configure the data source and crawler interactively after launch.

mcp-web-ingest-analyzer --crawler wget --datasrc /path/to/datasrc --interactive

Alternatively, define the harvester and source within the interactive session:

mcp-web-ingest-analyzer --interactive

Interactive mode lets you explore indexed crawl data iteratively from any terminal session.

WIKIPEDIA: XMLHttpRequest (XHR) is an API accessible via a JavaScript object that provides methods for dispatching HTTP requests from a browser environment to a web server. These methods permit browser-based applications to communicate with the server subsequent to initial page rendering, allowing for dynamic data exchange. XMLHttpRequest constitutes a fundamental pillar of Asynchronous JavaScript and XML (Ajax) programming methodologies. Before the widespread adoption of Ajax, server interaction primarily relied on standard hyperlink navigation and HTML form submissions, operations that typically resulted in the complete replacement of the current displayed page.

== Genesis ==

The concept behind XMLHttpRequest originated with the developers of Outlook Web Access for Microsoft Exchange Server 2000, and it first shipped in Internet Explorer 5 (1999). However, the initial implementation did not use the XMLHttpRequest identifier. Instead, developers instantiated a COM object via ActiveXObject("Msxml2.XMLHTTP") or ActiveXObject("Microsoft.XMLHTTP"). As of Internet Explorer 7 (released in 2006), all major browsers support the XMLHttpRequest identifier, including Mozilla's Gecko engine (2002), Apple's Safari 1.2 (2004), and Opera 8.0 (2005).

=== Standardization Efforts ===

The World Wide Web Consortium (W3C) formally published an initial Working Draft specification for the XMLHttpRequest object on April 5, 2006. A subsequent Level 2 specification was advanced by the W3C on February 25, 2008. The Level 2 specification introduced enhancements such as event progress monitoring, support for cross-origin data transfers, and capacity for handling raw byte streams. By the close of 2011, the enhancements defined in Level 2 were merged back into the primary specification document. In late 2012, stewardship of the specification's maintenance transitioned to the WHATWG, which now sustains a continuously updated living document defined using Web IDL notation.

== Implementation Protocol ==

Generally, dispatching a request with XMLHttpRequest involves the following sequence of steps.

Instantiate an XMLHttpRequest object by invoking its constructor.
Invoke the "open" method to define the request HTTP verb, specify the target resource URI, and select either synchronous or asynchronous execution mode.
If utilizing asynchronous operation, establish an event listener callback function intended to process changes in the request's state.
Commence the data transfer by executing the "send" method.
Process state transitions within the registered event handler. Upon successful receipt of server data, it is typically stored in the "responseText" property by default. When the object concludes its processing cycle, its state transitions to 4, signifying the "done" status.

Beyond these fundamental steps, XMLHttpRequest offers extensive configurability for controlling transmission behavior and response parsing. Custom HTTP headers can be prepended to the request to convey server instructions, and data payloads can be uploaded to the server via an argument passed to the "send" call. The returned data stream can be deserialized from JSON into native JavaScript objects or processed incrementally as data arrives, avoiding a mandatory wait for the complete transmission. Furthermore, a request can be terminated prematurely or assigned a timeout threshold, causing failure if not completed within the specified duration.

== Inter-Domain Communication ==

Early in the development of the World Wide Web, it was recognized that allowing unrestricted access across different security origins could create significant security vulnerabilities. Browsers therefore implement origin policies that govern cross-domain communication via XHR.

Author

pragmar
