logo
Free, unlimited AI code reviews that run on commit
git-lrc git-lrc GitHub Install Now We'd appreciate a star git-lrc - Free, unlimited AI code reviews that run on commit | Product Hunt git-lrc - Free, unlimited AI code reviews that run on commit | Product Hunt

mcp-databricks-unified-access-service

A service facilitating advanced interaction with Databricks environments, centered on Unity Catalog (UC) metadata interrogation, deep lineage tracing (including code artifacts), and secure SQL execution for empowering autonomous AI agents in data operations.

Author

mcp-databricks-unified-access-service logo

RafaelCartenet

MIT License

Quick Info

GitHub GitHub Stars 24
NPM Weekly Downloads 0
Tools 1
Last Updated 2026-02-19

Tags

databricksautomationcloudqueries databricksdatabricks servermcp databricks

Databricks Unified Data Access Service (MCP)

This document details the Model Context Protocol (MCP) server engineered to bridge Large Language Models (LLMs) with comprehensive Databricks environments, focusing intensely on leveraging Unity Catalog (UC) for contextual awareness.

Rationale and Value Proposition

Unity Catalog (UC) offers sophisticated governance and documentation capabilities for data assets (catalogs, schemas, tables, columns). Realizing the full return on investment for meticulously documenting these assets often requires external systems to consume this context intelligently. This MCP server serves as that crucial conduit, allowing sophisticated reasoning engines (LLMs) direct, programmatic access to the totality of your UC documentation.

The central justification is performance: the richer the descriptive metadata input into UC, the more proficient an AI handler becomes at autonomously formulating precise, contextually accurate SQL commands to satisfy complex data inquiries.

Architectural Overview

This specialized MCP apparatus is architected for deep integration with Databricks, prioritizing the utilization of Unity Catalog metadata and fostering exhaustive data origin and transformation pathway analysis. The core objective is to furnish an AI agent with a holistic operational toolkit, thereby achieving high levels of autonomy in resolving data-centric requests. By programmatically traversing UC structures, comprehending data artifact relationships (including data lineage linking notebooks and scheduled jobs), and executing data manipulation language (DML) statements, the agent minimizes dependency on explicit human instruction for incremental steps.

Furthermore, moving beyond simple metadata enumeration, this service permits agents to investigate the underlying computational artifacts that govern data transformations. Through advanced lineage mapping, agents can pinpoint associated execution units (notebooks or jobs) acting upon specific data entities, subsequently examining the source code therein to discern embedded transformation logic, adherence to business mandates, and quality validation procedures. This establishes a potent reinforcement cycle where the agent understands not merely what data exists, but how that data is methodologically conditioned and structured.

When operating within an established Agent framework, this component excels at tackling sequences of objectives, spanning from pure data discovery and dependency impact assessment to exhaustive source code examination.

Tangible Benefits of Enriched UC Metadata for Agentic Systems

The tooling exposed by this access mechanism is specifically tuned to interpret and materialize the narratives embedded within Unity Catalog descriptions, while simultaneously enabling rigorous scrutiny of the computation workflows impacting the data. This yields measurable advantages for LLM-driven agents, directly amplifying their capability to synthesize utility-driven SQL and interpret the broader data operations topology:

  • Enhanced Contextual Clarity: Agents rapidly assimilate the intended purpose of data constructs (tables/columns), drastically lowering semantic ambiguity. This forms the bedrock for accurate query construction.
  • Superior Query Synthesis: Availability of schema documentation, data typings, and inferred relationships elevates the precision and semantic fidelity of generated SQL statements.
  • Optimized Data Exploration for Planning: Metadata facilitates efficient traversal across the catalog hierarchy, enabling agents to select the correct data entities and attributes for inclusion in resultant queries.
  • Comprehensive Data Provenance: Extending beyond tabular relationships, agents can map data flows to the originating notebooks and orchestration jobs, critical for dependency charting and debugging pipeline anomalies.
  • Source Code Interpretability: Direct access to notebook contents allows agents to dissect explicit transformation logic, underlying business rules, and integrated data validation mechanisms.
  • End-to-End Flow Tracing: Agents gain the ability to trace data provenance from initial ingress through complex pipelines to final consumption layers, understanding both structure and procedural logic at every juncture.

Leveraging richly documented UC assets via this service furnishes an LLM agent with superior informational capital, leading to more judicious decision-making and, ultimately, the production of more effective data retrieval commands. For instance, schema documentation directly informs the agent about viable data origins for a specific analytical query:

Fig 1: A schema in Unity Catalog with user-provided descriptions. This MCP server makes this information directly accessible to an LLM, informing its query strategy.

Conversely, granular comments attached to individual columns clarify field semantics, which is paramount for formulating precise SQL conditional clauses and selections:

Fig 2: Column-level descriptions in Unity Catalog. These details are passed to the LLM, aiding its understanding of the data structure for precise SQL generation.

Tooling Capabilities and Interfaces

This MCP service furnishes a curated set of computational instruments designed to augment an LLM agent's interaction capabilities with the Databricks runtime:

Fundamental Operations: * SQL Command Dispatch: Execute arbitrary SQL using the Databricks SDK via the execute_sql_query(sql: str) function. Excellent for granular data extraction or executing complex procedures. * LLM-Optimized Output Format: All descriptive functions systematically serialize their output into Markdown syntax, which is pre-optimized for efficient syntactic parsing and comprehension by Large Language Models.

Unity Catalog Navigation Interfaces: These tools empower agentic exploration to build context before query formulation:

  1. list_uc_catalogs() -> str

    • Function: Enumerates all discoverable Unity Catalogs, including their assigned identifiers and descriptive text.
    • Usage Context: Initiates the discovery phase when catalog names are unknown, providing a high-level inventory of accessible environments.
  2. describe_uc_catalog(catalog_name: str) -> str

    • Function: Delivers a summary profile of a designated UC catalog, detailing its constituent schemas along with their respective identifiers and documentation.
    • Usage Context: Employed subsequent to catalog identification, preceding detailed schema inspection.
    • Parameters:
      • catalog_name: The unique identifier for the catalog (e.g., enterprise_analytics, sandbox).
  3. describe_uc_schema(catalog_name: str, schema_name: str, include_columns: Optional[bool] = False) -> str

    • Function: Renders detailed metadata for a specific schema within a catalog. The response lists all contained tables, with an option to incorporate full column specifications.
    • Usage Context: Essential for inventorying schema contents. Set include_columns=True for query preparation, though this increases output verbosity. Defaulting to False yields a quicker table summary.
    • Parameters:
      • catalog_name: Parent catalog identifier.
      • schema_name: Target schema identifier.
      • include_columns: Boolean flag to embed column-level details. Defaults to False.
  4. describe_uc_table(full_table_name: str, include_lineage: Optional[bool] = False) -> str

    • Function: Provides an exhaustive description of a singular UC table, featuring advanced capabilities for dependency mapping.
    • Usage Context: Prerequisite for accurate SQL construction against the table. When lineage is enabled, it yields rich context far beyond simple structural attributes:
      • Tabular Dependency Tracing: Upstream data sources and downstream consumers.
      • Artifact Lineage: Mapping of associated processing notebooks and scheduled jobs (including job identifiers, task definitions, and notebook path references).
      • Source Code Insight: The lineage output includes paths that guide the agent to read and analyze the actual transformation logic implemented in repository-backed code.
    • Parameters:
      • full_table_name: The FQN: catalog.schema.table format.
      • include_lineage: If set to True, triggers the retrieval of comprehensive lineage information, which may incur higher latency but unlocks code exploration capabilities.
  5. execute_sql_query(sql: str) -> str

    • Function: Dispatches the provided SQL string to the configured Databricks SQL endpoint and returns the resulting dataset presentation.
    • Usage Context: Invoked subsequent to metadata verification, for data retrieval or manipulation.
    • Parameters:
      • sql: The complete, validated SQL command string.

Deployment and Configuration Instructions

Prerequisites

  • A system running Python version 3.10 or newer.
  • If utilizing the uv package manager, ensure it is properly established.

Installation Procedure

  1. Acquire necessary packages:

bash pip install -r requirements.txt

Alternatively, employing uv:

bash uv pip install -r requirements.txt

  1. Establish operational credentials via environment configuration:

    Method A: Utilizing a configuration file (Recommended Practice)

    Create a .env file in the root directory of this service repository, populating it with your Databricks endpoint details:

    env DATABRICKS_HOST="your-databricks-instance.cloud.databricks.com" DATABRICKS_TOKEN="your-databricks-personal-access-token" DATABRICKS_SQL_WAREHOUSE_ID="your-sql-warehouse-id"

    Method B: Setting system environment variables directly

    bash export DATABRICKS_HOST="your-databricks-instance.cloud.databricks.com" export DATABRICKS_TOKEN="your-databricks-personal-access-token" export DATABRICKS_SQL_WAREHOUSE_ID="your-sql-warehouse-id"

    The DATABRICKS_SQL_WAREHOUSE_ID is vital for executing queries and fetching lineage data. Basic metadata navigation relies on broader UC API permissions and may not strictly mandate this ID unless lineage retrieval is part of the request.

Security and Access Prerequisites

Prior to initialization, verify that the security principal identified by the DATABRICKS_TOKEN possesses the requisite privileges within the Databricks workspace:

  1. Unity Catalog Authorization:
    • USE CATALOG privilege across all targeted catalogs.
    • USE SCHEMA privilege for relevant schemas.
    • SELECT privilege on tables slated for detailed inspection or query execution.
    • For comprehensive catalog listing, permissions at the metastore level may be required, or the service will only report on catalogs where the principal holds at least USE CATALOG rights.
  2. SQL Endpoint Authorization (Mandatory for execute_sql_query and lineage retrieval):
    • The CAN_USE entitlement on the SQL Warehouse identified by DATABRICKS_SQL_WAREHOUSE_ID.
  3. Token Scoping:
    • The generated token (PAT or Service Principal credential) must possess the minimal scopes necessary for workspace interaction. SQL execution necessitates SQL access rights.
    • Security guideline dictates the use of narrowly scoped Service Principals in automated, production environments.

For robust security posture, implement periodic credential rotation policies and utilize audit logging mechanisms (both query history and UC audit logs) to maintain oversight of service activities.

Execution Modes

Direct Execution Launch

To initiate the service in isolation (e.g., for testing compatibility with Agent Composer):

bash python main.py

This command activates the MCP service utilizing the standard input/output (stdio) transport mechanism, which is compatible with various MCP consumers.

Integration with Cursor IDE

To integrate this functionality within the Cursor IDE environment, configure the service endpoint in your user settings file (~/.cursor/mcp.json):

  1. Ensure the .cursor directory exists in your home directory.
  2. Create or modify mcp.json within that directory:

bash mkdir -p ~/.cursor touch ~/.cursor/mcp.json

  1. Populate mcp.json with the command execution context, substituting the directory path with the actual location of this service's root:

{ "mcpServers": { "databricks": { "command": "uv", "args": [ "--directory", "/path/to/your/mcp-databricks-server", "run", "main.py" ] } } }

Alternatively, using the standard Python launcher:

{ "mcpServers": { "databricks": { "command": "python", "args": [ "/path/to/your/mcp-databricks-server/main.py" ] } } }

Cursor requires a restart to register these new server definitions, after which the databricks agent becomes available for utilization.

Exemplary Agentic Workflow Sequence

This service empowers an LLM agent to navigate the Databricks landscape without external guidance. The subsequent illustration depicts a typical sequence where the agent iteratively probes schemas and tables, dynamically adjusting its exploration strategy until the required data is successfully localized and retrieved.

Fig 3: An LLM agent using the Databricks MCP tools, demonstrating iterative exploration and query refinement to locate specific page view data.

A hypothetical agent execution flow might proceed as follows:

  1. Initial Discovery: Invoke list_uc_catalogs()
    • Agent identifies platform_prod as the catalog likely containing target data.
  2. Catalog Deep Dive: Execute describe_uc_catalog(catalog_name="platform_prod")
    • Agent notes the existence of schemas named web_events and user_profiles.
  3. Schema Quick Scan: Initiate describe_uc_schema(catalog_name="platform_prod", schema_name="web_events")
    • Agent observes tables like page_views and session_logs.
  4. Structure Inspection (Detailed): Call describe_uc_schema(catalog_name="platform_prod", schema_name="web_events", include_columns=True)
    • Alternatively, for a single entity: describe_uc_table(full_table_name="platform_prod.web_events.page_views")
  5. Dependency & Code Tracing: Request describe_uc_table(full_table_name="platform_prod.web_events.page_views", include_lineage=True)
    • Agent discovers upstream dependencies, downstream consumers, and critically, associated ETL scripts (e.g., identifies that /Repos/data_pipelines/web_aggregator.py writes data here).
    • Agent utilizes the notebook path to fetch and examine the source code for inherent business logic and quality checks.
  6. Final Query Formulation and Execution: Dispatch execute_sql_query(sql="SELECT page_url, COUNT(*) FROM platform_prod.web_events.page_views WHERE event_timestamp >= date_sub(current_date(), 7) GROUP BY 1 ORDER BY 2 DESC LIMIT 50")

Governance through Infrastructure as Code (IaC) for Metadata

While manual metadata entry via the Databricks interface is feasible, a superior, more robust methodology involves defining Unity Catalog governance objects declaratively using Infrastructure as Code tools like Terraform. This approach yields substantial advantages:

  • Version Control Integration: Governance definitions reside in Git, facilitating tracking and auditing alongside application code.
  • Environment Homogeneity: Guarantees that metadata documentation remains consistent across development, testing, and production tiers.
  • Automated Management: Allows metadata lifecycle management to be integrated seamlessly into established CI/CD toolchains.
  • Durability for Core Assets: Defining persistent entities (catalogs, schemas) as code ensures their documentation structure is resilient to manual drift.

An illustrative example of defining catalog structure via the Terraform Databricks provider:

terraform resource "databricks_catalog" "data_core" { name = "data_core" comment = "Primary enterprise catalog for governed production datasets." storage_root = var.global_storage_prefix force_destroy = false }

Schemas within the 'data_core' catalog

resource "databricks_schema" "data_ingest" { catalog_name = databricks_catalog.data_core.name name = "bronze_raw" comment = "Landing zone for immutable, raw ingested data. No transformations applied." }

resource "databricks_schema" "data_curated" { catalog_name = databricks_catalog.data_core.name name = "silver_trusted" comment = "Cleaned, conformed, and quality-checked data sets. Schema enforced." }

Importing existing UC assets into Terraform management is straightforward using the terraform import command. Once imported, you can safely modify descriptive attributes, such as the comment field, within your configuration files, ensuring that terraform apply commits only that metadata update without altering the underlying data structure.

Adopting this metadata-as-code philosophy, particularly for foundational governance layers, substantially elevates the dependability and quality of the contextual input consumed by this MCP service, thereby maximizing the efficacy of automated agent operations.

For exhaustive details on Terraform governance with Databricks UC, consult the official HashiCorp documentation: * Databricks Provider: Catalog Resource (https://registry.terraform.io/providers/databricks/databricks/latest/docs/resources/catalog) * Databricks Provider: Schemas Data Source (https://registry.terraform.io/providers/databricks/databricks/latest/docs/data-sources/schemas)

Considerations for Extended Query Latency

The execute_sql_query utility relies on the Databricks SDK's synchronous execution model, which incorporates a finite waiting period (wait_timeout configured internally, often around '50s'). Queries exceeding this threshold may result in the tool returning a statement identifier for asynchronous polling, although the current synchronous implementation might time out waiting for a complete response. Users must be aware of this constraint when submitting operations demanding extensive processing time.

Software Dependencies

  • databricks-sdk: Core library for interfacing with Databricks REST APIs and the Unity Catalog ecosystem.
  • python-dotenv: Utility for sourcing environmental configuration variables from .env files.
  • mcp[cli]: The foundational Model Context Protocol framework.
  • asyncio: Required for managing asynchronous operations inherent to the MCP server structure.
  • httpx: HTTP client, typically pulled in as a transitive dependency by databricks-sdk or mcp.

See Also

`