ncbi-bio-data-connector
Interface for accessing and processing comprehensive data sets from the National Center for Biotechnology Information (NCBI) repositories, such as PubMed literature, Gene annotations, and Protein records, to support advanced bioinformatics analyses and data integration.
Author

noahzeidenberg
NCBI Bio Data Connector (BDC) Protocol
This repository offers a Python framework adhering to the Model Context Protocol (MCP) standard for robust interaction with core NCBI public data sources.
Deployment Prerequisites
- Obtain the source code repository.
- Resolve dependencies using the provided manifest:
pip install -r requirements.txt
- Configure authentication credentials in a local .env configuration file:
NCBI_API_KEY=your_secret_key
NCBI_EMAIL=contact_email@domain.com
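As a rough illustration of how such a file can be consumed, the sketch below parses KEY=VALUE lines with the standard library only. How ncbi_mcp.py actually reads its configuration is not shown in this README; the helper names load_env_file and ncbi_credentials are hypothetical.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env-style file into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def ncbi_credentials(path=".env"):
    """Return (api_key, email); real environment variables win over the file."""
    file_values = load_env_file(path)
    api_key = os.environ.get("NCBI_API_KEY", file_values.get("NCBI_API_KEY"))
    email = os.environ.get("NCBI_EMAIL", file_values.get("NCBI_EMAIL"))
    return api_key, email
```

Each variable sits on its own line in the file; a parser like this one would treat two assignments on one line as a single value.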
Activating the BDC Service
Execute the primary service file to initialize the connection listener:
python ncbi_mcp.py
Interaction via LLM Interface (Cursor/Claude)
Once the BDC service is operational, utilize natural language commands to trigger API operations.
Invoking Queries via Tool Call
Structured invocation using the dedicated command:
tools/call { "name": "nlp-query", "arguments": { "query": "Retrieve scholarly abstracts related to CRISPR applications in immunology" } }
Alternatively, direct invocation syntax is supported for convenience:
@ncbi-bio-data-connector Analyze the significance of the p53 gene in oncogenesis
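On the wire, MCP expresses the structured tools/call form above as a JSON-RPC 2.0 request. The sketch below builds such a message and serializes it as a single line, the shape a JSONL scenario file would hold. The helper names are hypothetical, and whether this server expects an initialization exchange before the first call is not stated here.

```python
import json


def build_tool_call(name, arguments, request_id=1):
    """Wrap a tool name and its arguments in a JSON-RPC 2.0 tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def to_jsonl_line(message):
    """Serialize one message compactly on a single line, as JSONL requires."""
    return json.dumps(message, separators=(",", ":"))


request = build_tool_call(
    "nlp-query",
    {"query": "Retrieve scholarly abstracts related to CRISPR applications in immunology"},
)
print(to_jsonl_line(request))
```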
Illustrative Command Examples
- Elucidating molecular roles:
@ncbi-bio-data-connector Provide a functional synopsis for the Interleukin-6 protein.
- Obtaining genomic metrics:
@ncbi-bio-data-connector What are the fundamental genomic statistics for Mus musculus?
- Assembly quality assessment:
@ncbi-bio-data-connector Determine the N50 length and contig count for the most recent Aspergillus fumigatus assembly.
- Data volume interrogation:
@ncbi-bio-data-connector How many sequence read archive (SRA) entries exist for human glioblastoma multiforme (GBM) samples?
- Literature surveying:
@ncbi-bio-data-connector Search for recent peer-reviewed literature concerning mRNA vaccine efficacy against circulating SARS-CoV-2 variants.
- Targeted gene data retrieval:
@ncbi-bio-data-connector Detail the known characteristics of the CFTR locus.
- Chromosomal structure retrieval:
@ncbi-bio-data-connector Fetch high-level genomic structure data for Drosophila melanogaster.
Validation Procedures
Testing the BDC service robustness involves executing bundled test suites:
Default test run (focusing on NLP interface)
.\run_test.bat
Comprehensive tool validation
.\run_test.bat all
Targeted validation using a specific scenario file
.\run_test.bat test_scenario_suite.jsonl
Validation focused solely on abstractive tools
.\run_test.bat test_abstraction_layer.jsonl
The automated test runner sequence initiates the service, injects test payloads derived from the JSONL files, pauses for I/O completion, and then shuts down the service instance, presenting the results.
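The sequence described above can be approximated in Python for platforms where the batch file is not available. This is a minimal sketch, not the contents of run_test.bat: it assumes the service reads one JSON payload per line from standard input and exits once input is closed, and the function name run_payloads is hypothetical.

```python
import subprocess


def run_payloads(command, payload_lines, timeout=30):
    """Start a process, send one payload per line on stdin, and collect its stdout.

    subprocess.run closes stdin after writing, waits for the process to exit,
    and kills it (raising TimeoutExpired) if the timeout elapses first.
    """
    stdin_text = "".join(line.rstrip("\n") + "\n" for line in payload_lines)
    completed = subprocess.run(
        command,
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return completed.returncode, completed.stdout.splitlines()
```

A call such as run_payloads([sys.executable, "ncbi_mcp.py"], lines), with lines read from one of the JSONL files, would then return the exit code and the response lines for inspection.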
For manual verification without the test runner, pipe a scenario file directly into the service:
Start the service and feed it a test file on standard input
type test_nlp_query.jsonl | python ncbi_mcp.py
The test cases emulate JSON-RPC interactions that mirror the communication flow from the LLM agent interface.
Tool Portfolio
The NCBI BDC furnishes two operational tiers: sophisticated, abstracted interfaces responsive to contextual language, and granular, low-level interfaces mapping directly to NCBI E-utilities.
Operational Guidance for AI Models
Recommended Execution Paths
For general biological questions, prefer the nlp-query tool; it routes each request to the most relevant specialized tool.
Standardized Workflow Sequences:
- Gene Annotation Analysis: Begin with nlp-query, transition to summarize-gene for consolidated summaries, employ get_gene_info for structured attributes, or combine ncbi-search with ncbi-fetch for precise retrieval.
- Genomic Structure Assessment: Utilize genome-stats for metrics like N50/L50, use get_genome_info for metadata, or use count-datasets to check assembly availability.
- Scholarly Literature Mining: Use nlp-query for free-form searches, specify ncbi-search with database="pubmed" for explicit literature targeting, and employ ncbi-fetch for full citation retrieval.
- Data Set Sourcing: Employ count-datasets to gauge data pool sizes, use nlp-query for exploratory searches, or leverage ncbi-search for systematic database cataloging.
- E-utilities Interaction (Expert Use): Use ncbi-info to map available data repositories, employ ncbi-global-query for broad term scanning, use ncbi-search to secure unique identifiers (UIDs), apply ncbi-summary for brief overviews, use ncbi-fetch for full record extraction, and utilize ncbi-link for cross-repository relationship mapping.
- Inter-Repository Correlation: Initiate with ncbi-search on one domain, map findings using ncbi-link, summarize metadata with ncbi-summary, and retrieve full details via ncbi-fetch.
Tool Selection Matrix
Abstracted Interfaces (Primary Recommendation):
- nlp-query: General biological context questions, complex synthesis tasks, default choice when tool specificity is uncertain.
- summarize-gene: In-depth gene function analysis and annotation consolidation.
- genome-stats: Querying organism genome dimensions, assembly quality metrics, and comparative genomics data.
- count-datasets: Estimating data corpus sizes relevant to a research topic.
- get_gene_info: Retrieving standardized, structured attribute sets for a specified gene locus.
- get_genome_info: Fetching detailed metadata pertaining to a specific reference genome build.
Granular E-Utilities Interfaces (Advanced Users):
- ncbi-search (ESearch): Precise database subsetting using filtering, Boolean logic, and field qualification (e.g., [Title], [Taxonomy]).
- ncbi-fetch (EFetch): Downloading complete record objects post-search; supports formats like FASTA, GenBank, or XML.
- ncbi-summary (ESummary): Obtaining metadata abstracts without incurring the overhead of full record transmission.
- ncbi-link (ELink): Resolving connected entities across NCBI collections (e.g., mapping a sequence ID to associated literature PMIDs).
- ncbi-info (EInfo): Listing the available databases and, for a given database, its searchable fields, available links, and record statistics.
- ncbi-global-query (EGQuery): Executing a unified search operation across the entire NCBI data landscape.
- ncbi-spell (ESpell): Suggesting corrections or alternative spellings for potentially ambiguous input terms.
- ncbi-citation-match (ECitMatch): Identifying specific PubMed IDs (PMIDs) based on citation text fragments.
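To make the mapping concrete, the sketch below pairs each low-level tool name with the public E-utilities endpoint it is named after and builds request URLs without sending them. The endpoint scripts and the db, dbfrom, term, id, api_key, and email parameters are standard E-utilities; how this server constructs its requests internally is an assumption, and eutils_url is a hypothetical helper.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Tool name from the list above -> E-utilities endpoint script.
ENDPOINTS = {
    "ncbi-search": "esearch.fcgi",
    "ncbi-fetch": "efetch.fcgi",
    "ncbi-summary": "esummary.fcgi",
    "ncbi-link": "elink.fcgi",
    "ncbi-info": "einfo.fcgi",
    "ncbi-global-query": "egquery.fcgi",
    "ncbi-spell": "espell.fcgi",
    "ncbi-citation-match": "ecitmatch.cgi",
}


def eutils_url(tool, api_key=None, email=None, **params):
    """Build (but do not send) an E-utilities URL for the given tool name."""
    query = dict(params)
    if api_key:
        query["api_key"] = api_key
    if email:
        query["email"] = email
    return f"{EUTILS_BASE}/{ENDPOINTS[tool]}?{urlencode(query)}"


# Search one database, then link the resulting Gene ID to PubMed records.
search_url = eutils_url("ncbi-search", db="gene", term="BRCA1[Gene Name]")
link_url = eutils_url("ncbi-link", dbfrom="gene", db="pubmed", id="672")
print(search_url)
print(link_url)
```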
Biological Data Context and Semantics
NCBI Repository Overview:
- Gene: Centralized repository for gene annotations, functional descriptions, and chromosomal coordinates.
- Protein: Stores sequence data, domain information, and functional modifications for proteins.
- Nucleotide: Repository for raw sequence data (DNA/RNA), encompassing coding and non-coding regions.
- PubMed: Index of biomedical and life sciences journal literature.
- BioSample: Metadata describing the source material used in experiments (cell lines, tissues, isolates).
- SRA: Archive for raw, unprocessed next-generation sequencing output.
- Assembly: Versioned records detailing high-quality reference genome builds.
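Key Biological Terminology:
- Gene Identifier (Gene ID): NCBI's internal, immutable numerical tag (e.g., 672).
- Accession Number: The versioned identifier for a specific sequence entry (e.g., NM_001126114.3).
- N50/L50: Metrics quantifying the contiguity of a genome assembly. N50 is the length of the shortest contig (or scaffold) in the smallest set of longest contigs that together cover at least half of the total assembly length; L50 is the number of contigs in that set. A higher N50 and a lower L50 indicate a more contiguous assembly, but neither measures completeness or correctness.
- Reference Assembly: The canonical, highest-quality genome representation for a taxon.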
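A small worked example may help with N50 and L50. The sketch below uses the common definition (sort lengths from longest to shortest, accumulate until at least half the total length is covered); the contig lengths are made up for illustration, and the function is not taken from this repository.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first contig that pushes the running sum past half the total
        # defines N50 (its length) and L50 (how many contigs were needed).
        if running >= half_total:
            return length, count


# Total length is 290, so half is 145; 80 + 70 = 150 reaches it.
print(n50_l50([80, 70, 50, 40, 30, 20]))  # (70, 2)
```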
Search Optimization Techniques:
- Employ recognized Gene Symbols (e.g., TP53) for unambiguous targeting.
- Explicitly state the species (using formal binomial nomenclature when possible) to resolve homonym conflicts.
- Integrate logical connectors (AND/OR/NOT) for complex query construction.
- Utilize NCBI-specific field tags such as [Protein Name] or [Taxonomy ID] to narrow scope.
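The techniques above can be combined into a single search term. The sketch below composes field-tagged terms with Boolean connectors; [Gene Name] and [Organism] are standard Entrez field tags, while the helper functions are hypothetical and not part of this repository.

```python
def field_term(value, field):
    """Quote multi-word values and append an Entrez field tag."""
    text = f'"{value}"' if " " in value else value
    return f"{text}[{field}]"


def all_of(*terms):
    """Join terms so that every one must match."""
    return " AND ".join(terms)


def any_of(*terms):
    """Join terms so that at least one must match, grouped in parentheses."""
    return "(" + " OR ".join(terms) + ")"


query = all_of(
    field_term("TP53", "Gene Name"),
    field_term("Homo sapiens", "Organism"),
)
print(query)  # TP53[Gene Name] AND "Homo sapiens"[Organism]
```

A string like this could be passed as the term argument of ncbi-search.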
Interface Demonstrations
Context-Aware Query Handler
tools/call { "name": "nlp-query", "arguments": { "query": "Summarize the pathological mechanisms associated with NF1 loss-of-function mutations" } }
Gene Annotation Synthesizer
tools/call { "name": "summarize-gene", "arguments": { "gene_name": "TP53" } }
Organism Assembly Metric Extractor
tools/call { "name": "genome-stats", "arguments": { "organism": "Canis familiaris" } }
Data Availability Assessor
tools/call { "name": "count-datasets", "arguments": { "database": "bioproject", "query": "human cancer epigenetics study" } }
Low-Level E-Utility Mappings
Precision Database Retrieval
tools/call { "name": "ncbi-search", "arguments": { "database": "nucleotide", "term": "BRCA2 transcript variant 1", "filters": { "organism": "Homo sapiens", "seq_type": "mRNA" } } }
Record Materialization
tools/call { "name": "ncbi-fetch", "arguments": { "database": "protein", "ids": ["P04637"], "rettype": "fasta" } }
Gene Data Extraction
tools/call { "name": "get_gene_info", "arguments": { "gene_id": "1017" } }
Reference Genome Metadata Retrieval
tools/call { "name": "get_genome_info", "arguments": { "organism": "Zea mays", "reference": true } }
Operational License
Licensed under the terms of Apache License, Version 2.0.
WIKIPEDIA NOTE: The underlying transport mechanism shares conceptual similarities with the Asynchronous JavaScript and XML (Ajax) methodology, relying on background HTTP communication primitives defined by the XMLHttpRequest (XHR) interface to communicate between client applications and remote servers without full page refreshes.
== Developmental Genesis ==
The foundational concept enabling asynchronous server requests originated with Microsoft's Outlook Web Access developers in the late 1990s. The mechanism first shipped in Internet Explorer 5 (1999), though it was invoked through COM-based object instantiation, ActiveXObject("Msxml2.XMLHTTP") or ActiveXObject("Microsoft.XMLHTTP"), rather than through the XMLHttpRequest identifier. By the release of IE7 (2006), the XMLHttpRequest identifier was supported by all major browser engines, including Mozilla's Gecko (2002), Apple's Safari 1.2 (2004), and Opera 8.0 (2005), which had adopted it earlier.
=== Standardization Efforts ===
The World Wide Web Consortium (W3C) formally published its initial specification draft for the XHR object on April 5, 2006. A subsequent Level 2 draft, introduced on February 25, 2008, enhanced capabilities by adding asynchronous event monitoring, enabling cross-origin requests (CORS), and support for raw byte stream handling. The Level 2 features were later consolidated back into the primary specification late in 2011. Development stewardship transitioned to the WHATWG consortium in 2012, which currently maintains the living document using Web IDL definitions.
== Routine Implementation Pattern ==
Communicating via XHR typically involves a defined sequence of programmatic actions:
- Instantiation: Instantiate the primary communication object using its constructor: new XMLHttpRequest().
- Configuration: Invoke the open() method to establish the request method (GET/POST), define the target URI, and select between synchronous or asynchronous execution mode.
- Asynchronous Setup (If applicable): Attach an event handler function to monitor the object's state transitions.
- Transmission: Invoke the send() method to dispatch the request to the server, optionally including payload data.
- Response Handling: Monitor state changes within the listener. Upon reaching state 4 ("done"), the server response is accessible, typically within the responseText property.
Beyond these basics, XHR offers granular control: request headers can be customized for server guidance; data payloads can be uploaded; responses can be immediately deserialized from JSON; processing can occur incrementally as data streams in; and operations can be forcefully terminated or subjected to time constraints.
