DocVectorIndex: Semantic Archival and Retrieval for Technical Artifacts

This system crawls, converts, and indexes documentation repositories, storing embeddings in PostgreSQL with the pgvector extension to support semantic search. It exposes that search through MCP (Model Context Protocol) tools so that AI-assisted IDE workflows can query it directly.

Core Capabilities

Automated Ingestion: Performs recursive scraping of designated documentation endpoints, respecting established crawl boundaries and throttling policies.
Data Normalization: Transforms raw web content (HTML) into canonical Markdown format, systematically extracting pertinent structural and descriptive metadata.
Vectorization Pipeline: Uses embedding models hosted on AWS Bedrock to generate vector representations of content chunks.
Asynchronous Job Orchestration: Tracks each ingestion run as a job, so the entire ingestion lifecycle can be monitored and audited.
IDE Interoperability: Features native support for MCP (Model Context Protocol), enabling invocation from AI agents embedded within coding platforms.
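The crawl boundaries and throttling mentioned under Automated Ingestion can be illustrated with a minimal sketch. It is not the project's crawler: `LinkFetcher`, `CrawlOptions`, `inScope`, and `crawl` are hypothetical names, the scope rule (same origin and path prefix) is an assumption, and robots.txt handling is omitted.

```typescript
// Hypothetical sketch: depth-limited, same-origin breadth-first discovery.
// `LinkFetcher` stands in for real HTTP fetching and link extraction.
type LinkFetcher = (url: string) => Promise<string[]>;

interface CrawlOptions {
  maxDepth: number; // maximum number of link hops from the start URL
  delayMs: number;  // pause after each request (throttling)
}

// Assumed boundary rule: same origin, and at or below the start path.
function inScope(start: URL, candidate: URL): boolean {
  const base = start.pathname.endsWith("/") ? start.pathname : start.pathname + "/";
  return (
    candidate.origin === start.origin &&
    (candidate.pathname === start.pathname || candidate.pathname.startsWith(base))
  );
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function crawl(startUrl: string, fetchLinks: LinkFetcher, opts: CrawlOptions): Promise<string[]> {
  const start = new URL(startUrl);
  const discovered = new Set<string>([start.href]);
  let frontier: string[] = [start.href];

  // Each pass of the outer loop follows links one hop further from the start.
  for (let depth = 0; depth < opts.maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      const links = await fetchLinks(url);
      for (const link of links) {
        let resolved: URL;
        try {
          resolved = new URL(link, url); // resolve relative links
        } catch {
          continue; // skip malformed links
        }
        resolved.hash = ""; // fragments point at the same page
        if (inScope(start, resolved) && !discovered.has(resolved.href)) {
          discovered.add(resolved.href);
          next.push(resolved.href);
        }
      }
      await sleep(opts.delayMs);
    }
    frontier = next;
  }
  return Array.from(discovered);
}
```

With `maxDepth` set to 3, pages up to three link hops from the start URL are discovered; anything on another origin or outside the start path is ignored.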

Future Enhancements (V2 Focus)

Single-Page Application (SPA) Support: Current web traversal logic requires updating to properly handle modern, client-side rendered documentation portals.
Data Persistence Optimization: Refactor persistence to use write-behind caching, so that bulk operations do not wait on each individual database insert.
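One simple form the planned write-behind approach could take is a batching buffer. Rows are queued in memory and written out in groups rather than one insert at a time. The sketch below is only an illustration under that assumption. `WriteBehindBuffer` and `FlushFn` are hypothetical names, and a production version would also need retry handling and a flush on shutdown.

```typescript
// Hypothetical batching buffer: rows accumulate in memory and are handed to
// `flushFn` (for example, a multi-row INSERT) once a full batch is available.
type FlushFn<T> = (batch: T[]) => Promise<void>;

class WriteBehindBuffer<T> {
  private pending: T[] = [];
  private readonly flushFn: FlushFn<T>;
  private readonly batchSize: number;

  constructor(flushFn: FlushFn<T>, batchSize: number) {
    this.flushFn = flushFn;
    this.batchSize = batchSize;
  }

  // Queue a row; the write only happens when the batch is full.
  async add(row: T): Promise<void> {
    this.pending.push(row);
    if (this.pending.length >= this.batchSize) {
      await this.flush();
    }
  }

  // Write out whatever is queued. Call once more when the job finishes.
  async flush(): Promise<void> {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = []; // swap first so rows added during the write are kept
    await this.flushFn(batch);
  }
}
```

The trade-off is durability: rows still in the buffer are lost if the process exits before `flush` runs.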

Initial Deployment Guide (Local Development)

Prerequisites Checklist

Docker Engine and Docker Compose installed.
Node.js environment (version 16 or newer).
Git source control utility.
Active AWS Credential Set with access permissions for Amazon Bedrock services.
Configured AWS Command Line Interface (CLI).

Setup Sequence

Repository Cloning: Secure a local copy of the repository:

    git clone https://github.com/visheshd/docmcp.git
    cd docmcp

Environment Variable Configuration:
- Duplicate the template file:

      cp .env.example .env

- Database Connection: Adjust DATABASE_URL to point to the local PostgreSQL instance (port 5433): postgresql://postgres:postgres@localhost:5433/docmcp
- AWS Bedrock Credentials: Configure regional settings and access keys within .env (e.g., AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), or ensure the environment satisfies AWS CLI credential resolution.
- Review other operational parameters like logging verbosity.
Environment Launch:

    # Ensure execution rights
    chmod +x dev-start.sh

    # Initiate the stack
    ./dev-start.sh

This script automates the instantiation of the necessary PostgreSQL service (with pgvector enabled), dependency resolution, schema migration execution, and the loading of initial configuration data. The database endpoint is exposed externally on port 5433.
Data Ingestion Command: Utilize the dedicated ingestion script to populate the vector index:

    # Fundamental crawl operation
    npm run add-docs -- --url https://example.com/docs --max-depth 3

    # Advanced ingestion with metadata tagging
    npm run add-docs -- \
      --url https://example.com/docs \
      --max-depth 3 \
      --tags library,api-v2 \
      --package toolkit \
      --version 2.1.0 \
      --wait

Argument reference:
- --url: Source documentation endpoint (Mandatory).
- --max-depth: Defines traversal limits (Default is 3).
- --tags: Keywords for semantic grouping.
- --package/--version: Schema identifiers for version control.
- --wait: Blocks until the entire processing workflow is concluded.
Query Execution: After ingestion, semantic lookups are performed using the established MCP interfaces. Refer to the section detailing MCP tool invocation.
Environment Shutdown:

    docker-compose -f docker-compose.dev.yml down

This setup prioritizes a rapid development iteration cycle using containerized infrastructure for the database backend.
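Because the development database listens on port 5433 rather than PostgreSQL's default 5432, a quick check of DATABASE_URL can catch a misconfigured .env early. The following is a small sketch using the standard URL class. `parseDatabaseUrl` and `DbTarget` are hypothetical helpers, not part of the project.

```typescript
// Hypothetical helper: break a PostgreSQL connection string into its parts.
interface DbTarget {
  host: string;
  port: number;
  database: string;
  user: string;
}

function parseDatabaseUrl(raw: string): DbTarget {
  const url = new URL(raw);
  if (url.protocol !== "postgresql:" && url.protocol !== "postgres:") {
    throw new Error(`Unexpected protocol: ${url.protocol}`);
  }
  return {
    host: url.hostname,
    // Fall back to PostgreSQL's default port when none is given.
    port: url.port === "" ? 5432 : Number(url.port),
    database: url.pathname.replace(/^\//, ""),
    user: decodeURIComponent(url.username),
  };
}

const target = parseDatabaseUrl("postgresql://postgres:postgres@localhost:5433/docmcp");
if (target.port !== 5433) {
  console.warn(`DATABASE_URL points at port ${target.port}; the dev stack exposes 5433`);
}
```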

AI IDE Integration (Cursor Example)

To facilitate immediate access within the Cursor IDE, integrate the following transport configuration into the IDE's settings file:

    {
      "docmcp-local-stdio": {
        "transport": "stdio",
        "command": "node",
        "args": [
          "<DOCMCP_INSTALL_ROOT>/dist/stdio-server.js"
        ],
        "clientInfo": {
          "name": "cursor-client",
          "version": "1.0.0"
        }
      }
    }

Ensure <DOCMCP_INSTALL_ROOT> is substituted with the absolute filesystem path where the DocVectorIndex project resides (e.g., /home/user/code/docmcp). Cursor must be restarted for this configuration to become active.
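To avoid hand-editing the path, the same configuration can be generated from the install root. This is an illustrative sketch only. `buildCursorConfig` is a hypothetical helper, and it assumes a POSIX-style absolute path.

```typescript
// Hypothetical helper: produce the Cursor transport entry for a given install root.
function buildCursorConfig(installRoot: string) {
  if (!installRoot.startsWith("/")) {
    throw new Error("installRoot must be an absolute path");
  }
  const root = installRoot.replace(/\/+$/, ""); // drop trailing slashes
  return {
    "docmcp-local-stdio": {
      transport: "stdio",
      command: "node",
      args: [`${root}/dist/stdio-server.js`],
      clientInfo: { name: "cursor-client", version: "1.0.0" },
    },
  };
}

// Print JSON ready to paste into the IDE's settings file.
console.log(JSON.stringify(buildCursorConfig("/home/user/code/docmcp"), null, 2));
```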

System Topology

The architecture is decomposed into specialized operational modules:

CrawlerEngine: Responsible for adherence to robots.txt and link graph traversal.
TransformationEngine: Manages content sanitization (HTML to Markdown) and metadata abstraction.
WorkflowManager: Oversees the state transitions and progress reporting for all background processing activities.
VectorStore: Handles the persistent storage and nearest-neighbor search execution utilizing vector similarity metrics.
MCP Bridge: The standardized communication layer facilitating AI agent interaction.
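The state transitions overseen by the WorkflowManager can be pictured as a small state machine. Only the 'pending' and 'complete' statuses appear in this document. The intermediate and 'failed' statuses, and the names `JobStatus`, `Job`, and `advance`, are assumptions made for illustration.

```typescript
// Hypothetical job lifecycle. Only "pending" and "complete" come from the
// document; the other statuses are assumed stages of the pipeline.
type JobStatus = "pending" | "crawling" | "processing" | "embedding" | "complete" | "failed";

const allowedTransitions: Record<JobStatus, JobStatus[]> = {
  pending: ["crawling", "failed"],
  crawling: ["processing", "failed"],
  processing: ["embedding", "failed"],
  embedding: ["complete", "failed"],
  complete: [], // terminal
  failed: [],   // terminal
};

const stageOrder: JobStatus[] = ["pending", "crawling", "processing", "embedding", "complete"];

interface Job {
  id: string;
  status: JobStatus;
  progress: number; // fraction of stages finished, between 0 and 1
}

// Return an updated copy of the job, rejecting transitions that skip stages.
function advance(job: Job, next: JobStatus): Job {
  if (allowedTransitions[job.status].indexOf(next) === -1) {
    throw new Error(`Illegal transition ${job.status} -> ${next}`);
  }
  const progress =
    next === "failed" ? job.progress : stageOrder.indexOf(next) / (stageOrder.length - 1);
  return { id: job.id, status: next, progress };
}
```

Keeping the transition table explicit makes it easy to report progress and to refuse updates to a job that has already finished or failed.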

Document Indexing Workflow

Initiation: A documentation ingestion request is submitted via the add_documentation MCP utility, generating a transient Job entity marked as 'pending'.
Acquisition: The CrawlerEngine traverses the specified domain, respecting depth constraints, and materializes raw Document records.
Refinement: The TransformationEngine cleans the raw content, normalizes formatting, and annotates records with contextual attributes (version, package).
Embedding Generation: The VectorStore component segments documents into semantically meaningful units, uses Bedrock to generate numerical embeddings for each segment, and commits them to the pgvector index.
Completion: The WorkflowManager finalizes the Job record, updating its status to 'complete' and ensuring all derived artifacts are query-ready.
Information Retrieval: User queries trigger a reverse process: the query is vectorized, similarity search is run against the index, and highly relevant text snippets, tied back to their source documents, are returned via the query_documentation MCP function.
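The chunking and similarity-search steps above can be sketched in memory. This is not the project's implementation: the embeddings are supplied by the caller (standing in for Bedrock), paragraph-based chunking is an assumed strategy, and `Chunk`, `chunkText`, `cosineSimilarity`, and `topK` are hypothetical names. In pgvector, the `<=>` operator computes cosine distance, so ordering by it with a LIMIT yields the same ranking as `topK` here.

```typescript
interface Chunk {
  documentUrl: string; // link back to the source document
  text: string;
  embedding: number[];
}

// Split on blank lines and pack paragraphs into chunks of at most maxChars.
// A single paragraph longer than maxChars becomes its own oversized chunk.
function chunkText(text: string, maxChars: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const para of text.split(/\n\s*\n/)) {
    const p = para.trim();
    if (p === "") continue;
    if (current !== "" && current.length + p.length + 2 > maxChars) {
      chunks.push(current);
      current = "";
    }
    current = current === "" ? p : current + "\n\n" + p;
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

// Cosine similarity: 1 for identical directions, 0 for orthogonal vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Rank stored chunks against the query embedding and keep the k best.
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

A brute-force scan like this is only practical for small collections; the pgvector index exists so the database can run the same nearest-neighbor search at scale.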