mcp-databricks-unified-access-service
A service facilitating advanced interaction with Databricks environments, centered on Unity Catalog (UC) metadata interrogation, deep lineage tracing (including code artifacts), and secure SQL execution for empowering autonomous AI agents in data operations.
Author

RafaelCartenet
Quick Info
Actions
Tags
Databricks Unified Data Access Service (MCP)
This document details the Model Context Protocol (MCP) server engineered to bridge Large Language Models (LLMs) with comprehensive Databricks environments, focusing intensely on leveraging Unity Catalog (UC) for contextual awareness.
- Rationale and Value Proposition
- Architectural Overview
- Tangible Benefits of Enriched UC Metadata for Agentic Systems
- Tooling Capabilities and Interfaces
- Deployment and Configuration Instructions
- Prerequisites
- Installation Procedure
- Security and Access Prerequisites
- Execution Modes
- Direct Execution Launch
- Integration with Cursor IDE
- Exemplary Agentic Workflow Sequence
- Governance through Infrastructure as Code (IaC) for Metadata
- Considerations for Extended Query Latency
- Software Dependencies
Rationale and Value Proposition
Unity Catalog (UC) offers sophisticated governance and documentation capabilities for data assets (catalogs, schemas, tables, columns). Realizing the full return on investment for meticulously documenting these assets often requires external systems to consume this context intelligently. This MCP server serves as that crucial conduit, allowing sophisticated reasoning engines (LLMs) direct, programmatic access to the totality of your UC documentation.
The central justification is performance: the richer the descriptive metadata input into UC, the more proficient an AI handler becomes at autonomously formulating precise, contextually accurate SQL commands to satisfy complex data inquiries.
Architectural Overview
This specialized MCP apparatus is architected for deep integration with Databricks, prioritizing the utilization of Unity Catalog metadata and fostering exhaustive data origin and transformation pathway analysis. The core objective is to furnish an AI agent with a holistic operational toolkit, thereby achieving high levels of autonomy in resolving data-centric requests. By programmatically traversing UC structures, comprehending data artifact relationships (including data lineage linking notebooks and scheduled jobs), and executing data manipulation language (DML) statements, the agent minimizes dependency on explicit human instruction for incremental steps.
Furthermore, moving beyond simple metadata enumeration, this service permits agents to investigate the underlying computational artifacts that govern data transformations. Through advanced lineage mapping, agents can pinpoint associated execution units (notebooks or jobs) acting upon specific data entities, subsequently examining the source code therein to discern embedded transformation logic, adherence to business mandates, and quality validation procedures. This establishes a potent reinforcement cycle where the agent understands not merely what data exists, but how that data is methodologically conditioned and structured.
When operating within an established Agent framework, this component excels at tackling sequences of objectives, spanning from pure data discovery and dependency impact assessment to exhaustive source code examination.
Tangible Benefits of Enriched UC Metadata for Agentic Systems
The tooling exposed by this access mechanism is specifically tuned to interpret and materialize the narratives embedded within Unity Catalog descriptions, while simultaneously enabling rigorous scrutiny of the computation workflows impacting the data. This yields measurable advantages for LLM-driven agents, directly amplifying their capability to synthesize utility-driven SQL and interpret the broader data operations topology:
- Enhanced Contextual Clarity: Agents rapidly assimilate the intended purpose of data constructs (tables/columns), drastically lowering semantic ambiguity. This forms the bedrock for accurate query construction.
- Superior Query Synthesis: Availability of schema documentation, data typings, and inferred relationships elevates the precision and semantic fidelity of generated SQL statements.
- Optimized Data Exploration for Planning: Metadata facilitates efficient traversal across the catalog hierarchy, enabling agents to select the correct data entities and attributes for inclusion in resultant queries.
- Comprehensive Data Provenance: Extending beyond tabular relationships, agents can map data flows to the originating notebooks and orchestration jobs, critical for dependency charting and debugging pipeline anomalies.
- Source Code Interpretability: Direct access to notebook contents allows agents to dissect explicit transformation logic, underlying business rules, and integrated data validation mechanisms.
- End-to-End Flow Tracing: Agents gain the ability to trace data provenance from initial ingress through complex pipelines to final consumption layers, understanding both structure and procedural logic at every juncture.
Leveraging richly documented UC assets via this service furnishes an LLM agent with superior informational capital, leading to more judicious decision-making and, ultimately, the production of more effective data retrieval commands. For instance, schema documentation directly informs the agent about viable data origins for a specific analytical query:
Fig 1: A schema in Unity Catalog with user-provided descriptions. This MCP server makes this information directly accessible to an LLM, informing its query strategy.
Conversely, granular comments attached to individual columns clarify field semantics, which is paramount for formulating precise SQL conditional clauses and selections:
Fig 2: Column-level descriptions in Unity Catalog. These details are passed to the LLM, aiding its understanding of the data structure for precise SQL generation.
Tooling Capabilities and Interfaces
This MCP service furnishes a curated set of computational instruments designed to augment an LLM agent's interaction capabilities with the Databricks runtime:
Fundamental Operations:
* SQL Command Dispatch: Execute arbitrary SQL using the Databricks SDK via the execute_sql_query(sql: str) function. Excellent for granular data extraction or executing complex procedures.
* LLM-Optimized Output Format: All descriptive functions systematically serialize their output into Markdown syntax, which is pre-optimized for efficient syntactic parsing and comprehension by Large Language Models.
Unity Catalog Navigation Interfaces: These tools empower agentic exploration to build context before query formulation:
-
list_uc_catalogs() -> str- Function: Enumerates all discoverable Unity Catalogs, including their assigned identifiers and descriptive text.
- Usage Context: Initiates the discovery phase when catalog names are unknown, providing a high-level inventory of accessible environments.
-
describe_uc_catalog(catalog_name: str) -> str- Function: Delivers a summary profile of a designated UC catalog, detailing its constituent schemas along with their respective identifiers and documentation.
- Usage Context: Employed subsequent to catalog identification, preceding detailed schema inspection.
- Parameters:
catalog_name: The unique identifier for the catalog (e.g.,enterprise_analytics,sandbox).
-
describe_uc_schema(catalog_name: str, schema_name: str, include_columns: Optional[bool] = False) -> str- Function: Renders detailed metadata for a specific schema within a catalog. The response lists all contained tables, with an option to incorporate full column specifications.
- Usage Context: Essential for inventorying schema contents. Set
include_columns=Truefor query preparation, though this increases output verbosity. Defaulting toFalseyields a quicker table summary. - Parameters:
catalog_name: Parent catalog identifier.schema_name: Target schema identifier.include_columns: Boolean flag to embed column-level details. Defaults toFalse.
-
describe_uc_table(full_table_name: str, include_lineage: Optional[bool] = False) -> str- Function: Provides an exhaustive description of a singular UC table, featuring advanced capabilities for dependency mapping.
- Usage Context: Prerequisite for accurate SQL construction against the table. When lineage is enabled, it yields rich context far beyond simple structural attributes:
- Tabular Dependency Tracing: Upstream data sources and downstream consumers.
- Artifact Lineage: Mapping of associated processing notebooks and scheduled jobs (including job identifiers, task definitions, and notebook path references).
- Source Code Insight: The lineage output includes paths that guide the agent to read and analyze the actual transformation logic implemented in repository-backed code.
- Parameters:
full_table_name: The FQN:catalog.schema.tableformat.include_lineage: If set toTrue, triggers the retrieval of comprehensive lineage information, which may incur higher latency but unlocks code exploration capabilities.
-
execute_sql_query(sql: str) -> str- Function: Dispatches the provided SQL string to the configured Databricks SQL endpoint and returns the resulting dataset presentation.
- Usage Context: Invoked subsequent to metadata verification, for data retrieval or manipulation.
- Parameters:
sql: The complete, validated SQL command string.
Deployment and Configuration Instructions
Prerequisites
- A system running Python version 3.10 or newer.
- If utilizing the
uvpackage manager, ensure it is properly established.
Installation Procedure
- Acquire necessary packages:
bash pip install -r requirements.txt
Alternatively, employing uv:
bash uv pip install -r requirements.txt
-
Establish operational credentials via environment configuration:
Method A: Utilizing a configuration file (Recommended Practice)
Create a
.envfile in the root directory of this service repository, populating it with your Databricks endpoint details:env DATABRICKS_HOST="your-databricks-instance.cloud.databricks.com" DATABRICKS_TOKEN="your-databricks-personal-access-token" DATABRICKS_SQL_WAREHOUSE_ID="your-sql-warehouse-id"
Method B: Setting system environment variables directly
bash export DATABRICKS_HOST="your-databricks-instance.cloud.databricks.com" export DATABRICKS_TOKEN="your-databricks-personal-access-token" export DATABRICKS_SQL_WAREHOUSE_ID="your-sql-warehouse-id"
The
DATABRICKS_SQL_WAREHOUSE_IDis vital for executing queries and fetching lineage data. Basic metadata navigation relies on broader UC API permissions and may not strictly mandate this ID unless lineage retrieval is part of the request.
Security and Access Prerequisites
Prior to initialization, verify that the security principal identified by the DATABRICKS_TOKEN possesses the requisite privileges within the Databricks workspace:
- Unity Catalog Authorization:
USE CATALOGprivilege across all targeted catalogs.USE SCHEMAprivilege for relevant schemas.SELECTprivilege on tables slated for detailed inspection or query execution.- For comprehensive catalog listing, permissions at the metastore level may be required, or the service will only report on catalogs where the principal holds at least
USE CATALOGrights.
- SQL Endpoint Authorization (Mandatory for
execute_sql_queryand lineage retrieval):- The
CAN_USEentitlement on the SQL Warehouse identified byDATABRICKS_SQL_WAREHOUSE_ID.
- The
- Token Scoping:
- The generated token (PAT or Service Principal credential) must possess the minimal scopes necessary for workspace interaction. SQL execution necessitates SQL access rights.
- Security guideline dictates the use of narrowly scoped Service Principals in automated, production environments.
For robust security posture, implement periodic credential rotation policies and utilize audit logging mechanisms (both query history and UC audit logs) to maintain oversight of service activities.
Execution Modes
Direct Execution Launch
To initiate the service in isolation (e.g., for testing compatibility with Agent Composer):
bash python main.py
This command activates the MCP service utilizing the standard input/output (stdio) transport mechanism, which is compatible with various MCP consumers.
Integration with Cursor IDE
To integrate this functionality within the Cursor IDE environment, configure the service endpoint in your user settings file (~/.cursor/mcp.json):
- Ensure the
.cursordirectory exists in your home directory. - Create or modify
mcp.jsonwithin that directory:
bash mkdir -p ~/.cursor touch ~/.cursor/mcp.json
- Populate
mcp.jsonwith the command execution context, substituting the directory path with the actual location of this service's root:
{ "mcpServers": { "databricks": { "command": "uv", "args": [ "--directory", "/path/to/your/mcp-databricks-server", "run", "main.py" ] } } }
Alternatively, using the standard Python launcher:
{ "mcpServers": { "databricks": { "command": "python", "args": [ "/path/to/your/mcp-databricks-server/main.py" ] } } }
Cursor requires a restart to register these new server definitions, after which the databricks agent becomes available for utilization.
Exemplary Agentic Workflow Sequence
This service empowers an LLM agent to navigate the Databricks landscape without external guidance. The subsequent illustration depicts a typical sequence where the agent iteratively probes schemas and tables, dynamically adjusting its exploration strategy until the required data is successfully localized and retrieved.
Fig 3: An LLM agent using the Databricks MCP tools, demonstrating iterative exploration and query refinement to locate specific page view data.
A hypothetical agent execution flow might proceed as follows:
- Initial Discovery: Invoke
list_uc_catalogs()- Agent identifies
platform_prodas the catalog likely containing target data.
- Agent identifies
- Catalog Deep Dive: Execute
describe_uc_catalog(catalog_name="platform_prod")- Agent notes the existence of schemas named
web_eventsanduser_profiles.
- Agent notes the existence of schemas named
- Schema Quick Scan: Initiate
describe_uc_schema(catalog_name="platform_prod", schema_name="web_events")- Agent observes tables like
page_viewsandsession_logs.
- Agent observes tables like
- Structure Inspection (Detailed): Call
describe_uc_schema(catalog_name="platform_prod", schema_name="web_events", include_columns=True)- Alternatively, for a single entity:
describe_uc_table(full_table_name="platform_prod.web_events.page_views")
- Alternatively, for a single entity:
- Dependency & Code Tracing: Request
describe_uc_table(full_table_name="platform_prod.web_events.page_views", include_lineage=True)- Agent discovers upstream dependencies, downstream consumers, and critically, associated ETL scripts (e.g., identifies that
/Repos/data_pipelines/web_aggregator.pywrites data here). - Agent utilizes the notebook path to fetch and examine the source code for inherent business logic and quality checks.
- Agent discovers upstream dependencies, downstream consumers, and critically, associated ETL scripts (e.g., identifies that
- Final Query Formulation and Execution: Dispatch
execute_sql_query(sql="SELECT page_url, COUNT(*) FROM platform_prod.web_events.page_views WHERE event_timestamp >= date_sub(current_date(), 7) GROUP BY 1 ORDER BY 2 DESC LIMIT 50")
Governance through Infrastructure as Code (IaC) for Metadata
While manual metadata entry via the Databricks interface is feasible, a superior, more robust methodology involves defining Unity Catalog governance objects declaratively using Infrastructure as Code tools like Terraform. This approach yields substantial advantages:
- Version Control Integration: Governance definitions reside in Git, facilitating tracking and auditing alongside application code.
- Environment Homogeneity: Guarantees that metadata documentation remains consistent across development, testing, and production tiers.
- Automated Management: Allows metadata lifecycle management to be integrated seamlessly into established CI/CD toolchains.
- Durability for Core Assets: Defining persistent entities (catalogs, schemas) as code ensures their documentation structure is resilient to manual drift.
An illustrative example of defining catalog structure via the Terraform Databricks provider:
terraform resource "databricks_catalog" "data_core" { name = "data_core" comment = "Primary enterprise catalog for governed production datasets." storage_root = var.global_storage_prefix force_destroy = false }
Schemas within the 'data_core' catalog
resource "databricks_schema" "data_ingest" { catalog_name = databricks_catalog.data_core.name name = "bronze_raw" comment = "Landing zone for immutable, raw ingested data. No transformations applied." }
resource "databricks_schema" "data_curated" { catalog_name = databricks_catalog.data_core.name name = "silver_trusted" comment = "Cleaned, conformed, and quality-checked data sets. Schema enforced." }
Importing existing UC assets into Terraform management is straightforward using the terraform import command. Once imported, you can safely modify descriptive attributes, such as the comment field, within your configuration files, ensuring that terraform apply commits only that metadata update without altering the underlying data structure.
Adopting this metadata-as-code philosophy, particularly for foundational governance layers, substantially elevates the dependability and quality of the contextual input consumed by this MCP service, thereby maximizing the efficacy of automated agent operations.
For exhaustive details on Terraform governance with Databricks UC, consult the official HashiCorp documentation: * Databricks Provider: Catalog Resource (https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/catalog) * Databricks Provider: Schemas Data Source (https://registry.terraform.io/providers/databricks/databricks/latest/docs/data-sources/schemas)
Considerations for Extended Query Latency
The execute_sql_query utility relies on the Databricks SDK's synchronous execution model, which incorporates a finite waiting period (wait_timeout configured internally, often around '50s'). Queries exceeding this threshold may result in the tool returning a statement identifier for asynchronous polling, although the current synchronous implementation might time out waiting for a complete response. Users must be aware of this constraint when submitting operations demanding extensive processing time.
Software Dependencies
databricks-sdk: Core library for interfacing with Databricks REST APIs and the Unity Catalog ecosystem.python-dotenv: Utility for sourcing environmental configuration variables from.envfiles.mcp[cli]: The foundational Model Context Protocol framework.asyncio: Required for managing asynchronous operations inherent to the MCP server structure.httpx: HTTP client, typically pulled in as a transitive dependency bydatabricks-sdkormcp.
