mcp-web-ingest-analyzer

Synchronizes web indexing outputs with sophisticated AI language mechanisms for automated vetting and semantic interpretation of retrieved digital assets. Features an advanced full-document search portal and supports integration with various crawling agents to deepen data comprehension.

Author

pragmar

Other

Quick Info

GitHub Stars 25
NPM Weekly Downloads 0
Tools 1
Last Updated 2026-02-19

Tags

webcrawl, apis, crawl, web crawl, crawl data, webcrawl integrate

mcp-web-ingest-analyzer

High-fidelity retrieval and querying capabilities for data gathered via web spiders. Utilizing mcp-web-ingest-analyzer, your generative intelligence pipeline can vet and interpret digitized web artifacts either under explicit command or through autonomous operation. The operational server component incorporates a comprehensive text search engine with support for logical operators, alongside resource classification based on MIME type, HTTP response code, and other metadata.

mcp-web-ingest-analyzer equips the Large Language Model (LLM) with a comprehensive functional repertoire and is interoperable with a spectrum of established web harvesting utilities:

| Harvester/Format | Functionality Summary | Supported OS | Configuration Guide |
|---|---|---|---|
| ArchiveBox | Offline web preservation utility | macOS/Linux | Setup Guide |
| HTTrack | Desktop web mirroring application | macOS/Windows/Linux | Setup Guide |
| InterroBot | Interactive crawling and evaluation suite | macOS/Windows/Linux | Setup Guide |
| Katana | Command-line instrument focused on security reconnaissance | macOS/Windows/Linux | Setup Guide |
| SiteOne | Interactive crawling and evaluation suite | macOS/Windows/Linux | Setup Guide |
| WARC | Standardized digital archive container format | diverse | Setup Guide |
| wget | CLI utility for recursive site copying | macOS/Linux | Setup Guide |

mcp-web-ingest-analyzer is distributed under a permissive, open-source license, and mandates the presence of Claude Desktop along with Python (version 3.10 or newer). Installation is facilitated via the command line interface using pip:

    pip install mcp-server-webcrawl

For comprehensive, sequential instructions on deploying the MCP server stack, consult the Setup Guides.

Core Functionalities

  • Compatibility layer for Claude Desktop operations
  • Support for heterogeneous crawler outputs
  • Metadata-driven filtering by classification, response code, and more
  • Native support for Boolean expression parsing
  • Utility for Markdown conversion and result segment extraction
  • Facility for constructing bespoke site-specific data repositories

Procedure Definitions (Routines)

mcp-web-ingest-analyzer furnishes the necessary tooling to explore archived web indices fluidly, allowing for on-the-fly adaptation based on user inquiry. This is its foundational design principle.

It is equally equipped to execute pre-defined procedural sequences (packaged as prompts). These routines can be custom-authored or sourced from the included library. These instructional sets are designed for direct copy-paste integration as raw Markdown blocks. They leverage the advanced querying mechanism granted to the LLM, enabling the embedding of complex logic, sequential instructions, or even interactive loops, as demonstrated by the Gopher Service implementation.

| Routine | Retrieval Link | Group | Purpose Description |
|---|---|---|---|
| 🔍 SEO Analysis | auditseo.md | audit | Deep dive into technical Search Engine Optimization metrics. Covers fundamentals with branching options for detailed review. |
| 🔗 Broken Link Scan | audit404.md | audit | Identifies inaccessible uniform resource locators (URLs) and analyzes error recurrence patterns. Proposes corrective actions alongside fault identification. |
| ⚡ Speed Evaluation | auditperf.md | audit | Assessment of site load performance and optimization bottlenecks. Delivers candid, actionable feedback. |
| 📁 Asset Inventory | auditfiles.md | audit | Examination of file structure and resource composition across the indexed site. Reveals the underlying makeup of the digital footprint. |
| 🌐 Gopher Interface | gopher.md | interface | A throwback search interface reminiscent of legacy Gopher client environments. |
| ⚙️ Query Validator | testsearch.md | self-test | A comprehensive test suite to verify the logical consistency of the search query parser during translation to the FTS5 indexing structure. |

If you wish to bypass the initial site selection step (reducing one interaction), append the command "run pasted for [site identifier or URL]" within the same request body as the routine's Markdown. If pasted without this explicit context, the system will prompt you to select an indexed archive from a displayed roster.

Logical Search Syntax

The retrieval mechanism supports targeted attribute searches (attribute: value) alongside compound logical constructions. The default fulltext search spans the URI, document body, and metadata headers.

Familiarity with the query grammar is beneficial, even though the LLM is the primary consumer of the API interface; queries are typically abstracted away in the main view. The specific generated statement can be revealed by expanding the MCP disclosure element.

Illustrative Search Expressions

| Sample Query | Interpretation |
|---|---|
| privacy | Single-term fulltext match |
| "privacy policy" | Fulltext match for the exact character sequence |
| boundar* | Fulltext prefix match (e.g., boundary, boundaries) |
| id: 12345 | Attribute search for a specific resource identifier |
| url: example.com/somedir | Attribute search for URIs containing the specified path segment |
| type: html | Attribute search restricted solely to HTML documents |
| status: 200 | Attribute search matching resources that returned HTTP success code 200 |
| status: >=400 | Attribute search matching resources with HTTP error codes of 400 or higher |
| content: h1 | Attribute search within the document body for the term 'h1' |
| headers: text/xml | Attribute search within the HTTP response metadata headers |
| privacy AND policy | Fulltext match requiring both terms to be present |
| privacy OR policy | Fulltext match requiring either term to be present |
| policy NOT privacy | Fulltext match for 'policy' excluding any that also contain 'privacy' |
| (login OR signin) AND form | Complex fulltext match: finds documents containing 'form' and either 'login' or 'signin' |
| type: html AND status: 200 | Combined filter: finds only successfully retrieved HTML documents |
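
The expressions in the table above can also be assembled programmatically. The following is a minimal sketch in JavaScript; the helper names (`term`, `field`, `group`, `and`, `or`) are hypothetical and not part of the server, and the sketch only produces strings in the grammar shown in the table.

```javascript
// Hypothetical helpers (not part of the server) that compose query strings
// in the grammar illustrated by the table above.

// Quote a fulltext term only when it contains whitespace, making it an exact phrase.
function term(text) {
  return /\s/.test(text) ? `"${text}"` : text;
}

// Attribute search, e.g. field("status", ">=400") gives "status: >=400".
function field(name, value) {
  return `${name}: ${value}`;
}

// Wrap a sub-expression in parentheses so it is evaluated as a unit.
const group = (expression) => `(${expression})`;

// Join sub-expressions with Boolean operators.
const and = (...parts) => parts.join(" AND ");
const or = (...parts) => parts.join(" OR ");

const loginForms = and(group(or("login", "signin")), "form");
const okHtml = and(field("type", "html"), field("status", 200));

console.log(loginForms); // (login OR signin) AND form
console.log(okHtml);     // type: html AND status: 200
```

Keeping grouping explicit through `group` avoids emitting parentheses the table does not demonstrate, such as a pair around the whole expression.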

Attribute Search Definitions

Attribute searching enables fine-grained filtering by designating specific columns within the search index. This shifts the focus from scanning all textual content to isolating specific metadata points such as resource location, headers, or body text, leading to superior operational efficiency for targeted retrievals.

| Attribute | Data Scope |
|---|---|
| id | Internal database identifier |
| url | Uniform Resource Locator of the archived asset |
| type | Enumerated category of the resource (refer to content types table) |
| size | File size reported in bytes |
| status | HTTP response code received during retrieval |
| headers | Full set of HTTP response meta-information |
| content | The primary payload body (HTML, CSS, JavaScript, etc.) |

Data Field Inclusion Policy

A subset of attributes can be explicitly requested alongside search results, while core identifiers are perpetually included. Invoking headers or content carries a significant cost in token consumption; employ these judiciously or utilize the 'Extras' feature to summarize high-volume data efficiently for the LLM context window. Field inclusion is a primary configuration argument, separate from the attribute filtering performed via the search query.

| Field | Availability Status |
|---|---|
| id | Always present |
| url | Always present |
| type | Always present |
| status | Always present |
| created | Requires explicit request |
| modified | Requires explicit request |
| size | Requires explicit request |
| headers | Requires explicit request |
| content | Requires explicit request |
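
As an illustration of the inclusion policy, the sketch below works out which fields a result would carry for a given request and flags the ones the text describes as token-expensive. The function name `resolveFields` and its return shape are hypothetical; only the field names and their availability come from the table above.

```javascript
// Field availability as listed in the table above.
const ALWAYS_PRESENT = ["id", "url", "type", "status"];
const ON_REQUEST = ["created", "modified", "size", "headers", "content"];
// Fields the text flags as costly in tokens.
const COSTLY = new Set(["headers", "content"]);

// Hypothetical helper: list the fields a result would carry and the
// requested fields that deserve a token-cost warning.
function resolveFields(requested = []) {
  const unknown = requested.filter(
    (name) => !ALWAYS_PRESENT.includes(name) && !ON_REQUEST.includes(name)
  );
  if (unknown.length > 0) {
    throw new Error(`Unknown field(s): ${unknown.join(", ")}`);
  }
  const extra = requested.filter((name) => ON_REQUEST.includes(name));
  return {
    fields: [...ALWAYS_PRESENT, ...new Set(extra)],
    costly: extra.filter((name) => COSTLY.has(name)),
  };
}

const plan = resolveFields(["size", "content"]);
console.log(plan.fields.join(", ")); // id, url, type, status, size, content
console.log(plan.costly.join(", ")); // content
```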

Content Type Classifications

Archived data encompasses more than just standard web pages. The type: filter allows grouping by general resource category, which is particularly useful for isolating media without resorting to intricate file extension matching. For instance, one might query for type: html NOT content: login to find pages lacking specific text, or type: img to focus analysis on image assets. The following table details all recognized content types within the retrieval system.

| Type | Classification Description |
|---|---|
| html | Standard web documents |
| iframe | Embedded document frames |
| img | Raster and vector graphics from the web |
| audio | Sound assets loaded by the browser |
| video | Multimedia clips loaded by the browser |
| font | Web typography files |
| style | Cascading Style Sheets (CSS) |
| script | Executable JavaScript files |
| rss | XML-based content syndication feeds |
| text | Unformatted, raw text content |
| pdf | Portable Document Format files |
| doc | Microsoft Word binary or XML documents |
| other | Any resource not fitting a defined category |

Supplementary Processing (Extras)

The extras parameter governs optional post-retrieval processing steps, allowing transformation of raw HTTP data (e.g., rendering HTML as Markdown, generating context snippets, applying custom Regular Expressions, or utilizing XPath selectors) or linking the LLM to external data visualizations (like thumbnails). These options are combinable to tailor the output format precisely to the analytical requirement.

| Extra | Processing Detail |
|---|---|
| thumbnails | Renders images into base64 encoded format for direct visual ingestion and semantic description by AI agents. Optimizes token usage. Functional for image types; SVG formats are excluded from this process. Compatible when filtering by type: img. |
| markdown | Converts raw HTML payload into a condensed Markdown representation, significantly minimizing token overhead and enhancing LLM readability. Applicable when filtering by type: html. |
| regex | Applies specified regular expression patterns against crawled text artifacts (HTML, CSS, JS, etc.) to extract matching data segments. Offers broader scope than XPath for non-HTML documents. Patterns are defined via the extrasRegex argument. |
| snippets | Locates and returns small contextual blocks surrounding fulltext query matches. If used without requesting the content field or markdown extra, it serves as a highly efficient method for result refinement without loading full pages. Effective across HTML, CSS, JS, and other text-based archives. Mimics classic search engine result highlighting. |
| xpath | Selects and returns data based on W3C standard XPath expressions applied to HTML documents. Use selectors like text() for pure text extraction, or element selectors for the surrounding HTML structure. Only operative for resources matching type: html. Selectors are supplied via the extrasXpath argument. |
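
To make the combination of extras concrete, here is a hedged sketch of assembling an arguments object for a search call. The names `extras`, `extrasRegex`, and `extrasXpath` appear in the text; the `query` key, the array shapes, and the helper `buildSearchArgs` itself are assumptions made for illustration and may not match the server's actual interface.

```javascript
// Extras named in the table above.
const KNOWN_EXTRAS = ["thumbnails", "markdown", "regex", "snippets", "xpath"];

// Hypothetical helper: assemble an arguments object for a search call.
// `extras`, `extrasRegex`, and `extrasXpath` are named in the text; the
// `query` key and the array shapes are assumptions of this sketch.
function buildSearchArgs({ query, extras = [], extrasRegex = [], extrasXpath = [] }) {
  for (const extra of extras) {
    if (!KNOWN_EXTRAS.includes(extra)) throw new Error(`Unknown extra: ${extra}`);
  }
  if (extras.includes("regex") && extrasRegex.length === 0) {
    throw new Error("The regex extra needs at least one pattern in extrasRegex");
  }
  if (extras.includes("xpath") && extrasXpath.length === 0) {
    throw new Error("The xpath extra needs at least one selector in extrasXpath");
  }
  const args = { query, extras };
  if (extrasRegex.length > 0) args.extrasRegex = extrasRegex;
  if (extrasXpath.length > 0) args.extrasXpath = extrasXpath;
  return args;
}

// Snippets for quick triage plus page headings pulled out with XPath.
const searchArgs = buildSearchArgs({
  query: "type: html AND status: 200",
  extras: ["snippets", "xpath"],
  extrasXpath: ["//h1/text()"],
});
console.log(JSON.stringify(searchArgs, null, 2));
```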

By leveraging these supplemental features, users can engineer token-efficient data responses. Markdown conversion can shrink HTML size by roughly two-thirds; snippets yield small, fixed-size contextual summaries; and XPath allows for surgical extraction. More precise data requests ensure a higher volume of relevant findings can be accommodated within the active LLM context window.
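
A rough back-of-the-envelope calculation shows why this matters. In the sketch below, only the two-thirds Markdown reduction comes from the text; the bytes-per-token ratio, the page size, and the token budget are assumed values chosen for illustration.

```javascript
// Rough sizing sketch. Only the "roughly two-thirds" Markdown reduction comes
// from the text; every other number here is an assumption for illustration.
const BYTES_PER_TOKEN = 4;        // common rule of thumb, not a measured value
const MARKDOWN_REDUCTION = 2 / 3; // Markdown is roughly two-thirds smaller than raw HTML

function estimateTokens(bytes) {
  return Math.ceil(bytes / BYTES_PER_TOKEN);
}

// How many pages of a given size fit into a token budget?
function pagesThatFit(budgetTokens, pageBytes) {
  return Math.floor(budgetTokens / estimateTokens(pageBytes));
}

const htmlBytes = 60000; // hypothetical page size
const markdownBytes = Math.round(htmlBytes * (1 - MARKDOWN_REDUCTION)); // 20000
const budgetTokens = 100000; // hypothetical token budget

console.log(pagesThatFit(budgetTokens, htmlBytes));     // 6
console.log(pagesThatFit(budgetTokens, markdownBytes)); // 20
```

Under these assumptions, the same budget holds roughly three times as many pages once they are converted to Markdown.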

The underlying philosophy is that the LLM should manage this complexity autonomously. If monitoring indicates an over-reliance on the raw "content" field (unfiltered HTML), a directive within the chat to leverage the extras functionality for token budgeting should suffice.

Direct Terminal Operation

Bypass AI integration: Execute traditional Boolean queries directly within the command line interface against your web archives.

mcp-web-ingest-analyzer possesses the capacity to function as a standalone terminal search utility for locally or remotely indexed web crawls. While local archive searches are straightforward, its utility expands significantly when combined with SSH access to remote hosts, allowing immediate interrogation of archives situated elsewhere without necessitating downloads, synchronization, or complex authentication procedures. The interactive mode facilitates rapid initiation of searches against remote crawl datasets.

Initiate execution using the --crawler and --datasrc flags to pre-load parameters, or configure the required source and harvester interactively post-launch.

    mcp-web-ingest-analyzer --crawler wget --datasrc /path/to/datasrc --interactive

Alternatively, define the harvester and source within the interactive session:

    mcp-web-ingest-analyzer --interactive

Interactive mode provides a mechanism for iterative exploration of indexed data subsets, accessible anytime, anywhere, directly within a terminal session.

WIKIPEDIA: XMLHttpRequest (XHR) is an API accessible via a JavaScript object that provides methods for dispatching HTTP requests from a browser environment to a web server. These methods permit browser-based applications to communicate with the server subsequent to initial page rendering, allowing for dynamic data exchange. XMLHttpRequest constitutes a fundamental pillar of Asynchronous JavaScript and XML (Ajax) programming methodologies. Before the widespread adoption of Ajax, server interaction primarily relied on standard hyperlink navigation and HTML form submissions, operations that typically resulted in the complete replacement of the current displayed page.

== Genesis ==

The concept behind XMLHttpRequest originated with the developers of Outlook Web Access for Microsoft Exchange Server 2000 and first shipped in Internet Explorer 5 (1999). However, the initial implementation did not use the XMLHttpRequest identifier. Instead, developers instantiated a COM object via ActiveXObject("Msxml2.XMLHTTP") or ActiveXObject("Microsoft.XMLHTTP"). Since Internet Explorer 7 (2006), all major browsers have supported the XMLHttpRequest identifier; Mozilla's Gecko engine adopted it in 2002, Apple's Safari 1.2 in 2004, and Opera 8.0 in 2005.

=== Standardization Efforts ===

The World Wide Web Consortium (W3C) formally published an initial Working Draft specification for the XMLHttpRequest object on April 5, 2006. A subsequent Level 2 specification was advanced by the W3C on February 25, 2008. The Level 2 specification introduced enhancements such as event progress monitoring, support for cross-origin data transfers, and capacity for handling raw byte streams. By the close of 2011, the enhancements defined in Level 2 were merged back into the primary specification document. In late 2012, stewardship of the specification's maintenance transitioned to the WHATWG, which now sustains a continuously updated living document defined using Web IDL notation.

== Implementation Protocol ==

Generally, the process of dispatching a request using XMLHttpRequest necessitates adherence to a sequence of programming actions.

  1. Instantiate an XMLHttpRequest object by invoking its constructor.
  2. Invoke the "open" method to define the request HTTP verb, specify the target resource URI, and select either synchronous or asynchronous execution mode.
  3. If utilizing asynchronous operation, establish an event listener callback function intended to process changes in the request's state.
  4. Commence the data transfer by executing the "send" method.
  5. Process state transitions within the registered event handler. Upon successful receipt of server data, it is typically stored in the "responseText" property by default. When the object concludes its processing cycle, its state transitions to 4, signifying the "done" status.
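
The five steps above can be sketched as a single function. The URL in the usage comment is a placeholder, and passing the constructor in as a parameter is an assumption of this sketch, made so the function can be exercised outside a browser with a stand-in object.

```javascript
// Follows the five steps listed above. The XHR parameter defaults to the
// browser's constructor; injecting it is a choice of this sketch, not a
// requirement of the API.
function fetchText(url, onDone, XHR = globalThis.XMLHttpRequest) {
  // 1. Create the request object.
  const request = new XHR();

  // 2. Set the HTTP verb, the target URL, and asynchronous mode (third argument true).
  request.open("GET", url, true);

  // 3. Register a listener that runs whenever the request state changes.
  request.onreadystatechange = () => {
    // 5. State 4 means "done"; the body is then available in responseText.
    if (request.readyState === 4) {
      onDone(request.status, request.responseText);
    }
  };

  // 4. Start the transfer.
  request.send();
  return request;
}

// Browser usage (placeholder URL):
// fetchText("/example/resource.txt", (status, body) => console.log(status, body));
```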

Beyond these fundamental steps, XMLHttpRequest offers extensive configurability for controlling transmission behavior and response parsing. Custom HTTP headers can be prepended to the request to convey server instructions, and data payloads can be uploaded to the server via an argument passed to the "send" call. The returned data stream can be deserialized from JSON into native JavaScript objects or processed incrementally as data arrives, avoiding a mandatory wait for the complete transmission. Furthermore, a request can be terminated prematurely or assigned a timeout threshold, causing failure if not completed within the specified duration.
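
Several of these options are shown together in the sketch below: a custom header, a JSON body passed to "send", JSON parsing of the reply, a timeout, and early cancellation. The function name `postJson`, the header value, and the Promise wrapper are choices of this sketch, and the injectable constructor is again an assumption that allows testing outside a browser.

```javascript
// Sketch of the options described above. Wrapping the request in a Promise
// and returning a cancel() function are design choices of this example.
function postJson(url, payload, { timeoutMs = 5000, XHR = globalThis.XMLHttpRequest } = {}) {
  const request = new XHR();
  const result = new Promise((resolve, reject) => {
    request.open("POST", url, true);
    // Custom headers must be set after open() and before send().
    request.setRequestHeader("Content-Type", "application/json");
    request.timeout = timeoutMs; // milliseconds; 0 means no timeout
    request.onload = () => {
      try {
        resolve(JSON.parse(request.responseText));
      } catch (error) {
        reject(error);
      }
    };
    request.onerror = () => reject(new Error("Network error"));
    request.ontimeout = () => reject(new Error(`Timed out after ${timeoutMs} ms`));
    request.onabort = () => reject(new Error("Request aborted"));
    // The request body is the argument to send().
    request.send(JSON.stringify(payload));
  });
  // Callers can invoke cancel() to terminate the request early.
  return { result, cancel: () => request.abort() };
}

// Browser usage (placeholder URL and payload):
// const { result, cancel } = postJson("/example/api", { name: "value" });
// result.then((data) => console.log(data)).catch((error) => console.error(error.message));
```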

== Inter-Domain Communication ==

Early in the development of the World Wide Web, it was recognized that unrestricted access across different security origins could create significant security vulnerabilities. Browsers therefore enforce a same-origin policy that governs cross-domain communication via XHR.
