DocVectorIndex
A utility for ingesting and querying technical manuals and API specifications through a vector database, powered by LLM embeddings for precise semantic retrieval within AI development environments.
Author

visheshd
DocVectorIndex: Semantic Archival and Retrieval for Technical Artifacts
This system crawls, ingests, transforms, and indexes documentation repositories, storing the results in PostgreSQL with the pgvector extension so they can be searched semantically. It exposes this capability through standardized MCP interfaces for integration into AI-augmented IDE workflows.
Core Capabilities
- Automated Ingestion: Performs recursive scraping of designated documentation endpoints, respecting established crawl boundaries and throttling policies.
- Data Normalization: Transforms raw web content (HTML) into canonical Markdown format, systematically extracting pertinent structural and descriptive metadata.
- Vectorization Pipeline: Uses embedding models accessed through AWS Bedrock to generate high-dimensional vector representations of content chunks.
- Asynchronous Job Orchestration: Provides robust mechanisms for monitoring and auditing the entire ingestion lifecycle via job tracking.
- IDE Interoperability: Features native support for the Model Context Protocol (MCP), enabling invocation from AI agents embedded within coding platforms.
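The vectorization pipeline works on content chunks rather than whole pages. As a rough illustration only, the sketch below splits Markdown at paragraph boundaries under a character budget. The function name `chunkMarkdown` and the `maxChars` parameter are assumptions made for this example; the project's actual chunking strategy is not described here and may differ.

```typescript
// Hypothetical helper: split Markdown into chunks of at most maxChars,
// breaking on blank lines so paragraphs stay intact where possible.
function chunkMarkdown(markdown: string, maxChars: number = 1000): string[] {
  const paragraphs = markdown
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";

  for (const paragraph of paragraphs) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (candidate.length <= maxChars) {
      current = candidate; // paragraph still fits in the current chunk
      continue;
    }
    if (current) chunks.push(current);
    // A single oversized paragraph is hard-split into fixed-size pieces.
    let rest = paragraph;
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current = rest;
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Splitting on paragraph boundaries keeps related sentences together, which tends to give each embedding a more coherent unit of meaning than fixed-width slicing alone.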
Future Enhancements (V2 Focus)
- Single-Page Application (SPA) Support: Current web traversal logic requires updating to properly handle modern, client-side rendered documentation portals.
- Data Persistence Optimization: Refactoring to leverage database write-behind caching strategies to reduce immediate DB insertion latency during bulk operations.
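To make the write-behind idea concrete, here is a minimal sketch under stated assumptions: rows are buffered in memory and handed to a bulk-write callback once a batch fills up. The class name `WriteBehindBuffer` and the `writeBatch` callback are hypothetical, and the flush is synchronous for simplicity; a real implementation would likely be asynchronous and time-triggered as well.

```typescript
// Hypothetical write-behind buffer: rows accumulate in memory and are
// passed to a bulk-write callback once the batch size is reached.
class WriteBehindBuffer<T> {
  private pending: T[] = [];

  constructor(
    private readonly batchSize: number,
    // Stands in for a bulk INSERT against the database.
    private readonly writeBatch: (rows: T[]) => void,
  ) {}

  add(row: T): void {
    this.pending.push(row);
    if (this.pending.length >= this.batchSize) this.flush();
  }

  // Write whatever is buffered; call once more when a job finishes.
  flush(): void {
    if (this.pending.length === 0) return;
    const rows = this.pending;
    this.pending = [];
    this.writeBatch(rows);
  }
}
```

The trade-off is durability: rows still sitting in the buffer are lost if the process exits before the final `flush()`.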
Initial Deployment Guide (Local Development)
Prerequisites Checklist
- Docker Engine and Docker Compose installed.
- Node.js environment (version 16 or newer).
- Git source control utility.
- Active AWS Credential Set with access permissions for Amazon Bedrock services.
- Configured AWS Command Line Interface (CLI).
Setup Sequence
- Repository Cloning: Secure a local copy of the repository:
      git clone https://github.com/visheshd/docmcp.git
      cd docmcp
- Environment Variable Configuration:
  - Duplicate the template file:
        cp .env.example .env
  - Database Connection: Adjust DATABASE_URL to point to the local PostgreSQL instance (port 5433): postgresql://postgres:postgres@localhost:5433/docmcp
  - AWS Bedrock Credentials: Configure regional settings and access keys within .env (e.g., AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), or ensure the environment satisfies AWS CLI credential resolution.
  - Review other operational parameters like logging verbosity.
- Environment Launch:
      # Ensure execution rights
      chmod +x dev-start.sh
      # Initiate the stack
      ./dev-start.sh
  This script automates the instantiation of the necessary PostgreSQL service (with pgvector enabled), dependency resolution, schema migration execution, and the loading of initial configuration data. The database endpoint is exposed externally on port 5433.
- Data Ingestion Command: Utilize the dedicated ingestion script to populate the vector index:
      # Fundamental crawl operation
      npm run add-docs -- --url https://example.com/docs --max-depth 3

      # Advanced ingestion with metadata tagging
      npm run add-docs -- \
        --url https://example.com/docs \
        --max-depth 3 \
        --tags library,api-v2 \
        --package toolkit \
        --version 2.1.0 \
        --wait
  Argument reference:
  - --url: Source documentation endpoint (Mandatory).
  - --max-depth: Defines traversal limits (Default is 3).
  - --tags: Keywords for semantic grouping.
  - --package/--version: Schema identifiers for version control.
  - --wait: Blocks until the entire processing workflow is concluded.
- Query Execution: After ingestion, semantic lookups are performed using the established MCP interfaces. Refer to the section detailing MCP tool invocation.
- Environment Shutdown:
      docker-compose -f docker-compose.dev.yml down
This setup prioritizes a rapid development iteration cycle using containerized infrastructure for the database backend.
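The argument reference in the ingestion step can be read as a small parsing contract: `--url` is mandatory, `--max-depth` defaults to 3, `--tags` is a comma-separated list, and `--wait` is a boolean switch. The sketch below shows one way such flags could be parsed. The names `AddDocsOptions` and `parseAddDocsArgs` are hypothetical and are not taken from the project's actual script.

```typescript
// Hypothetical shape of the parsed options; field names are illustrative.
interface AddDocsOptions {
  url: string;
  maxDepth: number;
  tags: string[];
  packageName?: string;
  version?: string;
  wait: boolean;
}

function parseAddDocsArgs(argv: string[]): AddDocsOptions {
  // Defaults follow the argument reference: max depth 3, no tags, no waiting.
  const options: AddDocsOptions = { url: "", maxDepth: 3, tags: [], wait: false };

  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    if (flag === "--wait") {
      options.wait = true; // boolean switch, takes no value
      continue;
    }
    const value = argv[i + 1];
    if (value === undefined || value.startsWith("--")) {
      throw new Error(`Missing value for ${flag}`);
    }
    i++; // consume the value
    switch (flag) {
      case "--url":
        options.url = value;
        break;
      case "--max-depth": {
        const depth = Number.parseInt(value, 10);
        if (Number.isNaN(depth) || depth < 0) {
          throw new Error("--max-depth must be a non-negative integer");
        }
        options.maxDepth = depth;
        break;
      }
      case "--tags":
        options.tags = value.split(",").map((t) => t.trim()).filter((t) => t.length > 0);
        break;
      case "--package":
        options.packageName = value;
        break;
      case "--version":
        options.version = value;
        break;
      default:
        throw new Error(`Unknown flag: ${flag}`);
    }
  }

  if (!options.url) throw new Error("--url is required");
  return options;
}
```

In a Node.js script this would typically be called as `parseAddDocsArgs(process.argv.slice(2))`.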
AI IDE Integration (Cursor Example)
To facilitate immediate access within the Cursor IDE, integrate the following transport configuration into the IDE's settings file:
{
"docmcp-local-stdio": {
"transport": "stdio",
"command": "node",
"args": [
"
Ensure <DOCMCP_INSTALL_ROOT> is substituted with the absolute filesystem path where the DocVectorIndex project resides (e.g., /home/user/code/docmcp). Cursor must be restarted for this configuration to become active.
System Topology
The architecture is decomposed into specialized operational modules:
- CrawlerEngine: Responsible for adherence to robots.txt and link graph traversal.
- TransformationEngine: Manages content sanitization (HTML to Markdown) and metadata abstraction.
- WorkflowManager: Oversees the state transitions and progress reporting for all background processing activities.
- VectorStore: Handles the persistent storage and nearest-neighbor search execution utilizing vector similarity metrics.
- MCP Bridge: The standardized communication layer facilitating AI agent interaction.
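As an illustration of the link graph traversal attributed to the CrawlerEngine, the following sketch performs a breadth-first crawl bounded by a maximum depth and a URL-prefix boundary. It is not the project's implementation: `crawl` and `getLinks` are hypothetical names, links are assumed to be absolute URLs, and robots.txt handling and throttling are left out.

```typescript
// Breadth-first traversal limited by depth and by a URL-prefix boundary.
// getLinks is a hypothetical stand-in for "fetch the page and extract its links".
function crawl(
  startUrl: string,
  maxDepth: number,
  getLinks: (url: string) => string[],
): string[] {
  const visited = new Set<string>([startUrl]);
  const queue: Array<{ url: string; depth: number }> = [{ url: startUrl, depth: 0 }];
  const order: string[] = [];

  while (queue.length > 0) {
    const next = queue.shift();
    if (next === undefined) break;
    order.push(next.url);
    if (next.depth >= maxDepth) continue; // do not follow links past the limit
    for (const link of getLinks(next.url)) {
      // Assumption: links are absolute; anything outside the start prefix is skipped.
      if (!link.startsWith(startUrl) || visited.has(link)) continue;
      visited.add(link);
      queue.push({ url: link, depth: next.depth + 1 });
    }
  }
  return order;
}
```

Tracking visited URLs in a set keeps cyclic link structures, which are common in documentation sites, from causing repeated fetches.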
Document Indexing Workflow
- Initiation: A documentation ingestion request is submitted via the add_documentation MCP utility, generating a transient Job entity marked as 'pending'.
- Acquisition: The CrawlerEngine traverses the specified domain, respecting depth constraints, and materializes raw Document records.
- Refinement: The TransformationEngine cleans the raw content, normalizes formatting, and annotates records with contextual attributes (version, package).
- Embedding Generation: The VectorStore component segments documents into semantically meaningful units, uses Bedrock to generate numerical embeddings for each segment, and commits them to the pgvector index.
- Completion: The WorkflowManager finalizes the Job record, updating its status to 'complete' and ensuring all derived artifacts are query-ready.
- Information Retrieval: User queries follow the same embedding path in the other direction: the query is vectorized, a similarity search is run against the index, and the most relevant text snippets, tied back to their source documents, are returned via the query_documentation MCP function.
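The retrieval step can be sketched in memory as a ranking of stored chunks by similarity to the query vector. This is only an illustration: in the system described here the search runs inside PostgreSQL through pgvector, and whether it uses cosine similarity is an assumption of this example. pgvector does offer a cosine distance operator (`<=>`), which is why that metric was chosen. The names `Chunk`, `cosineSimilarity`, and `topK` are hypothetical.

```typescript
// Hypothetical record shape: a text chunk with its embedding and source document.
interface Chunk {
  id: string;
  documentUrl: string;
  text: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same dimension");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank chunks by similarity to the query embedding and keep the best k.
function topK(queryEmbedding: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((left, right) => right.score - left.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

A linear scan like this is fine for a handful of chunks; a vector index exists precisely to avoid comparing the query against every stored embedding.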
This systematic pipeline ensures data integrity, traceability, and high-performance semantic retrieval capabilities.
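The job lifecycle in the workflow above can be modeled as a small state machine. The workflow names only the 'pending' and 'complete' statuses; the 'running' and 'failed' states below, along with the `Job` shape and the `transition` function, are assumptions added for illustration.

```typescript
// 'pending' and 'complete' come from the workflow description;
// 'running' and 'failed' are hypothetical intermediate and error states.
type JobStatus = "pending" | "running" | "complete" | "failed";

const allowedTransitions: Record<JobStatus, JobStatus[]> = {
  pending: ["running", "failed"],
  running: ["complete", "failed"],
  complete: [],
  failed: [],
};

interface Job {
  id: string;
  status: JobStatus;
}

// Return an updated copy of the job, rejecting transitions the table does not allow.
function transition(job: Job, next: JobStatus): Job {
  if (!allowedTransitions[job.status].includes(next)) {
    throw new Error(`Invalid transition: ${job.status} -> ${next}`);
  }
  return { ...job, status: next };
}
```

Keeping the allowed transitions in one table makes it harder for a job to be marked 'complete' without having gone through processing.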
