NCBI Bio Data Connector (BDC) Protocol

This repository offers a Python framework adhering to the Model Context Protocol (MCP) standard for robust interaction with core NCBI public data sources.

Deployment Prerequisites

Obtain the source code repository.
Resolve dependencies using the provided manifest:

pip install -r requirements.txt

Configure authentication credentials in a local .env configuration file:

NCBI_API_KEY=your_secret_key
NCBI_EMAIL=contact_email@domain.com
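As a rough illustration of how such a file can be read at startup, here is a minimal sketch using only the standard library. The helper name `load_env` is hypothetical and not taken from the repository; the actual service may load its settings differently (for example through a dotenv package).

```python
from pathlib import Path


def load_env(path=".env"):
    """Parse KEY=VALUE lines from a .env-style file into a dict.

    Hypothetical helper for illustration only. Blank lines, comment lines
    starting with '#', and lines without '=' are skipped; surrounding
    quotes are stripped from values.
    """
    values = {}
    env_file = Path(path)
    if not env_file.exists():
        return values
    for raw_line in env_file.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values
```

Calling `load_env()` in the repository root and checking that both `NCBI_API_KEY` and `NCBI_EMAIL` are present is a quick way to confirm the file is well formed before starting the service.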

Activating the BDC Service

Execute the primary service file to initialize the connection listener:

python ncbi_mcp.py

Interaction via LLM Interface (Cursor/Claude)

Once the BDC service is operational, utilize natural language commands to trigger API operations.

Invoking Queries via Tool Call

Structured invocation using the dedicated command:

tools/call { "name": "nlp-query", "arguments": { "query": "Retrieve scholarly abstracts related to CRISPR applications in immunology" } }
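On the wire, MCP carries a call like this as a JSON-RPC 2.0 request whose method is `tools/call`. The sketch below builds that envelope in Python; the helper name `build_tool_call` and the request id are illustrative assumptions, and a full session would normally begin with an `initialize` exchange that is omitted here.

```python
import json


def build_tool_call(name, arguments, request_id=1):
    """Return a JSON-RPC 2.0 request dict for an MCP tools/call invocation."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = build_tool_call(
    "nlp-query",
    {"query": "Retrieve scholarly abstracts related to CRISPR applications in immunology"},
)
# Serialized as a single line, the shape a JSONL test file would contain.
print(json.dumps(request))
```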

Alternatively, direct invocation syntax is supported for convenience:

@ncbi-bio-data-connector Analyze the significance of the p53 gene in oncogenesis

Illustrative Command Examples

Elucidating molecular roles:

@ncbi-bio-data-connector Provide a functional synopsis for the Interleukin-6 protein.

Obtaining genomic metrics:

@ncbi-bio-data-connector What are the fundamental genomic statistics for Mus musculus?

Assembly quality assessment:

@ncbi-bio-data-connector Determine the N50 length and contig count for the most recent Aspergillus fumigatus assembly.

Data volume interrogation:

@ncbi-bio-data-connector How many sequence read archive (SRA) entries exist for human glioblastoma multiforme (GBM) samples?

Literature surveying:

@ncbi-bio-data-connector Search for recent peer-reviewed literature concerning mRNA vaccine efficacy against circulating SARS-CoV-2 variants.

Targeted gene data retrieval:

@ncbi-bio-data-connector Detail the known characteristics of the CFTR locus.

Chromosomal structure retrieval:

@ncbi-bio-data-connector Fetch high-level genomic structure data for Drosophila melanogaster.

Validation Procedures

Testing the BDC service robustness involves executing bundled test suites:

Default test run (focusing on NLP interface)

.\run_test.bat

Comprehensive tool validation

.\run_test.bat all

Targeted validation using a specific scenario file

.\run_test.bat test_scenario_suite.jsonl

Validation focused solely on abstractive tools

.\run_test.bat test_abstraction_layer.jsonl

The automated test runner sequence initiates the service, injects test payloads derived from the JSONL files, pauses for I/O completion, and then shuts down the service instance, presenting the results.
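The sequence described above can be approximated in Python for platforms where the batch file is unavailable. This is a sketch under assumptions, not a port of `run_test.bat`: the function name `run_scenario` is hypothetical, and it assumes the service reads newline-delimited JSON from standard input and exits once input is closed.

```python
import subprocess
import sys


def run_scenario(command, payload_lines, timeout=30):
    """Start a process, feed it newline-delimited payloads, and collect output.

    communicate() writes the payloads, closes stdin, and waits for exit.
    If the process outlives the timeout, it is killed and its output drained.
    """
    proc = subprocess.Popen(
        command,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    payload = "".join(line.rstrip("\n") + "\n" for line in payload_lines)
    try:
        stdout, stderr = proc.communicate(payload, timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        stdout, stderr = proc.communicate()
    return proc.returncode, stdout.splitlines(), stderr
```

For example, `run_scenario([sys.executable, "ncbi_mcp.py"], lines)`, with `lines` read from `test_nlp_query.jsonl`, would return the exit code, one response per output line, and anything written to standard error.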

For manual verification without the runner's automatic shutdown, pipe a scenario file directly into the service:

Pipe a test file into a newly started service process

type test_nlp_query.jsonl | python ncbi_mcp.py

The test cases emulate JSON-RPC interactions that mirror the communication flow from the LLM agent interface.

Tool Portfolio

The NCBI BDC furnishes two operational tiers: sophisticated, abstracted interfaces responsive to contextual language, and granular, low-level interfaces mapping directly to NCBI E-utilities.

Operational Guidance for AI Models

Recommended Execution Paths

For general biological inquiry, prefer the nlp-query tool; it routes each request to the most pertinent specialized tool.

Standardized Workflow Sequences:

Gene Annotation Analysis: Begin with nlp-query, transition to summarize-gene for consolidated summaries, employ get_gene_info for structured attributes, or combine ncbi-search with ncbi-fetch for precise retrieval.
Genomic Structure Assessment: Utilize genome-stats for metrics like N50/L50, use get_genome_info for metadata, or use count-datasets to check assembly availability.
Scholarly Literature Mining: Use nlp-query for free-form searches, specify ncbi-search with database="pubmed" for explicit literature targeting, and employ ncbi-fetch for full citation retrieval.
Data Set Sourcing: Employ count-datasets to gauge data pool sizes, use nlp-query for exploratory searches, or leverage ncbi-search for systematic database cataloging.
E-utilities Interaction (Expert Use): Use ncbi-info to map available data repositories, employ ncbi-global-query for broad term scanning, use ncbi-search to secure unique identifiers (UIDs), apply ncbi-summary for brief overviews, use ncbi-fetch for full record extraction, and utilize ncbi-link for cross-repository relationship mapping.
Inter-Repository Correlation: Initiate with ncbi-search on one domain, map findings using ncbi-link, summarize metadata with ncbi-summary, and retrieve full details via ncbi-fetch.

Tool Selection Matrix

Abstracted Interfaces (Primary Recommendation):
- nlp-query: General biological context questions, complex synthesis tasks, default choice when tool specificity is uncertain.
- summarize-gene: In-depth gene function analysis and annotation consolidation.
- genome-stats: Querying organism genome dimensions, assembly quality metrics, and comparative genomics data.
- count-datasets: Estimating data corpus sizes relevant to a research topic.
- get_gene_info: Retrieving standardized, structured attribute sets for a specified gene locus.
- get_genome_info: Fetching detailed metadata pertaining to a specific reference genome build.

Granular E-Utilities Interfaces (Advanced Users):
- ncbi-search (ESearch): Precise database subsetting using filtering, Boolean logic, and field qualification (e.g., [Title], [Taxonomy]).
- ncbi-fetch (EFetch): Downloading complete records after a search; supports formats such as FASTA, GenBank, or XML.
- ncbi-summary (ESummary): Obtaining document summaries without the overhead of transferring full records.
- ncbi-link (ELink): Resolving connected entities across NCBI collections (e.g., mapping a sequence ID to associated literature PMIDs).
- ncbi-info (EInfo): Listing the currently available databases and, for each one, its indexed fields and available links.
- ncbi-global-query (EGQuery): Running a single search term across all NCBI databases.
- ncbi-spell (ESpell): Suggesting corrections or alternative spellings for potentially misspelled input terms.
- ncbi-citation-match (ECitMatch): Identifying specific PubMed IDs (PMIDs) based on citation text fragments.
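To make the pairing listed above concrete, the sketch below builds request URLs for the public E-utilities endpoints that several of these tools are named after. How the connector constructs its requests internally is not documented here, so the `ENDPOINTS` table and the `eutils_url` helper are illustrative assumptions; only the base URL, endpoint names, and query parameters (`db`, `dbfrom`, `term`, `id`, `rettype`, `retmode`, `retmax`) come from the E-utilities themselves.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Tool name -> E-utility endpoint, following the pairing in the list above.
ENDPOINTS = {
    "ncbi-search": "esearch.fcgi",
    "ncbi-fetch": "efetch.fcgi",
    "ncbi-summary": "esummary.fcgi",
    "ncbi-link": "elink.fcgi",
    "ncbi-info": "einfo.fcgi",
    "ncbi-spell": "espell.fcgi",
}


def eutils_url(tool, **params):
    """Build the request URL for one E-utility call (no network access)."""
    return EUTILS_BASE + ENDPOINTS[tool] + "?" + urlencode(params)


print(eutils_url("ncbi-search", db="pubmed", term="CRISPR[Title] AND immunology", retmax=5))
print(eutils_url("ncbi-fetch", db="protein", id="P04637", rettype="fasta", retmode="text"))
```

Building the URL separately from sending it keeps the mapping easy to inspect and test without contacting NCBI.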

Biological Data Context and Semantics

NCBI Repository Overview:
- Gene: Centralized repository for gene annotations, functional descriptions, and chromosomal coordinates.
- Protein: Stores sequence data, domain information, and functional modifications for proteins.
- Nucleotide: Repository for raw sequence data (DNA/RNA), encompassing coding and non-coding regions.
- PubMed: Index of biomedical and life sciences journal literature.
- BioSample: Metadata describing the source material used in experiments (cell lines, tissues, isolates).
- SRA: Archive for raw, unprocessed next-generation sequencing output.
- Assembly: Versioned records detailing high-quality reference genome builds.

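Key Biological Terminology:
- Gene Identifier (Gene ID): NCBI's internal, immutable numerical tag (e.g., 672).
- Accession Number: The versioned identifier for a specific sequence entry (e.g., NM_001126114.3).
- N50/L50: Metrics quantifying the contiguity (not the completeness) of a genome assembly. N50 is the length of the shortest contig or scaffold in the smallest set of longest sequences that together cover at least half of the assembly; L50 is the number of sequences in that set. A higher N50 and a lower L50 indicate a more contiguous assembly.
- Reference Assembly: The canonical, highest-quality genome representation for a taxon.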
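A small worked example may help with N50 and L50. The sketch below uses the standard definition (sort sequence lengths from longest to shortest, then accumulate until at least half of the total length is covered); the function name `n50_l50` and the toy lengths are illustrative and are not part of the connector.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # `length` is the shortest sequence needed to reach half the total.
            return length, count


contigs = [80, 70, 50, 40, 30, 20, 10]  # toy lengths, total 300
print(n50_l50(contigs))  # (70, 2): 80 + 70 = 150 reaches half of 300
```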

Search Optimization Techniques:
- Employ recognized Gene Symbols (e.g., TP53) for unambiguous targeting.
- Explicitly state the species (using formal binomial nomenclature when possible) to resolve homonym conflicts.
- Integrate logical connectors (AND/OR/NOT) for complex query construction.
- Utilize NCBI-specific field tags such as [Protein Name] or [Taxonomy ID] to narrow scope.
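These techniques can be combined programmatically when composing the term passed to a search tool. The helpers below (`field`, `all_of`, `any_of`) are hypothetical names introduced for this sketch; the field tags used ([Gene Name], [Organism], [Title]) are standard Entrez tags, though which tags a given database accepts varies.

```python
def field(term, tag):
    """Qualify a term with an Entrez field tag, quoting multi-word terms."""
    text = f'"{term}"' if " " in term else term
    return f"{text}[{tag}]"


def all_of(*clauses):
    """Join clauses with AND."""
    return " AND ".join(clauses)


def any_of(*clauses):
    """Join clauses with OR, parenthesized so the group nests safely inside AND."""
    return "(" + " OR ".join(clauses) + ")"


gene_term = all_of(field("TP53", "Gene Name"), field("Homo sapiens", "Organism"))
print(gene_term)  # TP53[Gene Name] AND "Homo sapiens"[Organism]

literature_term = all_of(
    any_of(field("CRISPR", "Title"), field("Cas9", "Title")),
    field("immunology", "Title"),
)
print(literature_term)  # (CRISPR[Title] OR Cas9[Title]) AND immunology[Title]
```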

Interface Demonstrations

Context-Aware Query Handler

tools/call { "name": "nlp-query", "arguments": { "query": "Summarize the pathological mechanisms associated with NF1 loss-of-function mutations" } }

Gene Annotation Synthesizer

tools/call { "name": "summarize-gene", "arguments": { "gene_name": "TP53" } }

Organism Assembly Metric Extractor

tools/call { "name": "genome-stats", "arguments": { "organism": "Canis familiaris" } }

Data Availability Assessor

tools/call { "name": "count-datasets", "arguments": { "database": "bioproject", "query": "human cancer epigenetics study" } }

Low-Level E-Utility Mappings

Precision Database Retrieval

tools/call { "name": "ncbi-search", "arguments": { "database": "nucleotide", "term": "BRCA2 transcript variant 1", "filters": { "organism": "Homo sapiens", "seq_type": "mRNA" } } }

Record Materialization

tools/call { "name": "ncbi-fetch", "arguments": { "database": "protein", "ids": ["P04637"], "rettype": "fasta" } }

Gene Data Extraction

tools/call { "name": "get_gene_info", "arguments": { "gene_id": "1017" } }

Reference Genome Metadata Retrieval

tools/call { "name": "get_genome_info", "arguments": { "organism": "Zea mays", "reference": true } }

Operational License

Licensed under the terms of the Apache License, Version 2.0.

WIKIPEDIA NOTE: The underlying transport mechanism shares conceptual similarities with the Asynchronous JavaScript and XML (Ajax) methodology, relying on background HTTP communication primitives defined by the XMLHttpRequest (XHR) interface to communicate between client applications and remote servers without full page refreshes.

Developmental Genesis

The foundational concept enabling asynchronous server polling originated with the Microsoft Outlook Web Access developers working on Exchange 2000. The mechanism first shipped in Internet Explorer 5 (1999), where it was invoked through COM-based object instantiation: ActiveXObject("Msxml2.XMLHTTP") or ActiveXObject("Microsoft.XMLHTTP"). With the release of IE7 (2006), the XMLHttpRequest identifier was supported across all major browser engines, including Mozilla's Gecko (2002), Apple's Safari 1.2 (2004), and Opera 8.0 (2005).

Standardization Efforts

The World Wide Web Consortium (W3C) formally published its initial specification draft for the XHR object on April 5, 2006. A subsequent Level 2 draft, introduced on February 25, 2008, enhanced capabilities by adding asynchronous event monitoring, enabling cross-origin requests (CORS), and support for raw byte stream handling. The Level 2 features were later consolidated back into the primary specification late in 2011. Development stewardship transitioned to the WHATWG consortium in 2012, which currently maintains the living document using Web IDL definitions.

Routine Implementation Pattern

Communicating via XHR typically involves a defined sequence of programmatic actions:

Instantiation: Instantiate the primary communication object using its constructor: new XMLHttpRequest().
Configuration: Invoke the open() method to establish the request method (GET/POST), define the target URI, and select between synchronous or asynchronous execution mode.
Asynchronous Setup (If applicable): Attach an event handler function to monitor the object's state transitions.
Transmission: Invoke the send() method to dispatch the request to the server, optionally including payload data.
Response Handling: Monitor state changes within the listener. Upon reaching state 4 ("done"), the server response is accessible, typically within the responseText property.

Beyond these basics, XHR offers granular control: request headers can be customized for server guidance; data payloads can be uploaded; responses can be immediately deserialized from JSON; processing can occur incrementally as data streams in; and operations can be forcefully terminated or subjected to time constraints.

ncbi-bio-data-connector

Author

noahzeidenberg
