youtube-caption-utility-mcp
Provides rapid, multilingual summarization and full timed-transcript extraction for YouTube media via integration with the DeepSRT processing backend. Optimized for consumption within MCP environments.
Author: DeepSRT
DeepSRT Utility Server for Video Content Analysis
This package operates as a Model Context Protocol (MCP) service endpoint, utilizing DeepSRT's infrastructure to process YouTube video captions, generate concise abstractive summaries, and extract structured, timestamped text records. It bypasses traditional API key requirements by accessing YouTube caption tracks directly via the InnerTube mechanism.
Core Features Overview
- Video Abstraction: Generates summaries in specified formats (e.g., narrative, bulleted lists).
- Transcript Retrieval: Outputs complete, time-coded transcripts from available caption sources.
- Multilingual Support: Handles processing and output localization for various languages (e.g., en, zh-tw, ja).
- Direct Access: Leverages built-in mechanisms to fetch captions directly from YouTube without external credentials.
MCP Configuration Snippet
To enable this service within an MCP-aware application (like Claude Desktop):
{
  "mcpServers": {
    "deepsrt": {
      "type": "stdio",
      "command": "bunx",
      "args": [
        "@deepsrt/deepsrt-mcp@latest",
        "--server"
      ]
    }
  }
}
System Blueprint
graph TD
    subgraph ClientLayer [MCP Consumer Interface]
        A["User Application (e.g., Cline/Desktop)"]
    end
    subgraph ServerNode [Utility MCP Daemon]
        B[Request Handler]
        C[Summary Generator Tool]
        D[Transcript Fetcher Tool]
        B --> C
        B --> D
    end
    subgraph ExternalServices [Remote Resources]
        E[YouTube InnerTube API]
        F[DeepSRT Computation Cluster]
        G[YouTube Caption XML Source]
    end
    A --> B
    C --> E
    C --> F
    D --> E
    D --> G
Operational Sequence: Summarization
sequenceDiagram
    participant U as User
    participant C as Client
    participant S as Daemon
    participant Y as YouTube
    participant D as DeepSRT
    U->>C: Initiate summary request (URL/ID)
    C->>S: Call get_summary(params)
    S->>Y: Probe video ID via InnerTube (metadata/tracks)
    Y-->>S: Return manifest & caption listings
    S->>S: Select optimal caption stream (Prioritize human-authored)
    par Processing Path
        S->>D: Transmit data block for abstraction
        D-->>S: Deliver synthesized summary
    and Title Localization
        S->>D: Request title translation if necessary
        D-->>S: Return translated title string
    end
    S->>S: Apply required output formatting (Markdown)
    S-->>C: Respond with structured summary data
    C-->>U: Present result
Transcript Retrieval Flow
This path focuses on extracting raw timed text:
- Client Request: User requests transcript via get_transcript(videoId, lang).
- Metadata Fetch: Server queries YouTube InnerTube for video details and available caption tracks.
- Track Selection: The daemon identifies the highest quality caption source matching the requested language, favoring manually uploaded sets.
- XML Retrieval: The raw <timedtext> XML file is fetched from its CDN location.
- Local Parsing: The server parses the XML, decodes HTML entities, synchronizes timestamps into [MM:SS] format, and filters irrelevant markers.
- Output: Returns the finalized, time-aligned text block to the client.
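The Local Parsing step above can be sketched as a small extraction pass over the caption XML. This is a minimal illustration, not the server's actual parser: it assumes each caption line is a `<p t="startMs" d="durationMs">` element with millisecond attributes, and the `Cue` type and `parseTimedText` name are hypothetical.

```typescript
// Hypothetical cue shape; the server's internal types are not documented here.
interface Cue {
  startMs: number;
  text: string;
}

// Assumption: every caption line is a <p t="..." d="..."> element whose
// `t` attribute holds the start time in milliseconds.
function parseTimedText(xml: string): Cue[] {
  const cues: Cue[] = [];
  const pattern = /<p\b[^>]*\bt="(\d+)"[^>]*>([\s\S]*?)<\/p>/g;
  for (const match of xml.matchAll(pattern)) {
    // Drop nested markup (e.g. per-word segments) and collapse whitespace.
    const text = match[2].replace(/<[^>]+>/g, "").replace(/\s+/g, " ").trim();
    if (text.length > 0) {
      cues.push({ startMs: Number(match[1]), text });
    }
  }
  return cues;
}
```

A regular expression keeps the sketch dependency-free; a real XML parser would be the safer choice for malformed or unusual caption files.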
Technical Highlights
Caption Selection Logic
Intelligent stream selection ensures data fidelity:
1. Manually Uploaded Captions (Highest fidelity).
2. Auto-Generated Captions (Fallback).
3. Any Available Stream (Lowest priority, if the preferred language is absent).
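The priority order above could be expressed roughly as follows. This is a sketch under assumptions: the `CaptionTrack` shape and its field names are illustrative and do not mirror the InnerTube response schema, and `selectTrack` is a hypothetical helper name.

```typescript
// Hypothetical track shape; field names are illustrative only.
interface CaptionTrack {
  languageCode: string;
  autoGenerated: boolean;
  url: string;
}

function selectTrack(tracks: CaptionTrack[], lang: string): CaptionTrack | undefined {
  const wanted = lang.toLowerCase();
  const sameLang = tracks.filter(t => t.languageCode.toLowerCase() === wanted);
  return (
    sameLang.find(t => !t.autoGenerated) ?? // 1. manually uploaded, requested language
    sameLang.find(t => t.autoGenerated) ??  // 2. auto-generated, requested language
    tracks[0]                               // 3. any available track (undefined if none)
  );
}
```

Returning `undefined` for an empty track list leaves it to the caller to decide how to report a video with no captions.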
Data Transformation
Transcript data undergoes rigorous cleaning:
* Conversion from milliseconds to readable time codes.
* HTML entity resolution (e.g., &amp;amp; becomes &amp;).
* Filtering of boilerplate text or audio descriptions (e.g., "(Music)").
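A minimal sketch of these three cleaning steps follows. The function names, the small entity table, and the rule for what counts as a non-speech marker are all assumptions made for illustration; the package's actual rules may differ.

```typescript
// Milliseconds to a [MM:SS] label. Minutes are not wrapped at 60,
// so a cue at 75 minutes renders as [75:00].
function formatTimestamp(ms: number): string {
  const totalSeconds = Math.floor(ms / 1000);
  const minutes = String(Math.floor(totalSeconds / 60)).padStart(2, "0");
  const seconds = String(totalSeconds % 60).padStart(2, "0");
  return `[${minutes}:${seconds}]`;
}

// Assumed subset of entities; a full decoder would cover many more.
const ENTITIES: Record<string, string> = {
  "&amp;": "&",
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&#39;": "'",
};

// Single pass, so "&amp;lt;" becomes "&lt;" rather than "<".
function decodeEntities(text: string): string {
  return text.replace(/&(?:amp|lt|gt|quot|#39);/g, m => ENTITIES[m] ?? m);
}

// Assumption: a marker is a line made up entirely of one bracketed or
// parenthesized phrase, such as "(Music)" or "[Applause]".
function isMarker(text: string): boolean {
  return /^[\[(][^\])]*[\])]$/.test(text);
}

// Returns a formatted transcript line, or null when the cue should be dropped.
function cleanCue(startMs: number, rawText: string): string | null {
  const text = decodeEntities(rawText).trim();
  if (text.length === 0 || isMarker(text)) return null;
  return `${formatTimestamp(startMs)} ${text}`;
}
```

For example, `cleanCue(125000, "R&amp;D")` yields `[02:05] R&D`, while `cleanCue(5000, "(Music)")` yields `null`.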
Command Line Interface (CLI)
When run without the --server flag, the package also works as a command-line tool, either through bunx or as a globally installed binary.
Example: Direct Summary Fetch
bunx @deepsrt/deepsrt-mcp get-summary <video_id> --lang=en --mode=bullet
Example: Direct Transcript Fetch
bunx @deepsrt/deepsrt-mcp get-transcript <video_url> --lang=fr
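The two examples above take a bare video ID and a full URL respectively. How the CLI normalizes these inputs is not documented here; the following is one plausible sketch, with `extractVideoId` as a hypothetical helper and the accepted hosts chosen as an assumption.

```typescript
// YouTube video IDs are conventionally 11 characters from this alphabet.
const VIDEO_ID = /^[A-Za-z0-9_-]{11}$/;

// Hypothetical helper: accepts a bare ID, a youtube.com watch URL,
// or a youtu.be short link. Returns null for anything else.
function extractVideoId(input: string): string | null {
  const trimmed = input.trim();
  if (VIDEO_ID.test(trimmed)) return trimmed;

  let url: URL;
  try {
    url = new URL(trimmed);
  } catch {
    return null; // not an ID and not a parseable URL
  }

  const host = url.hostname.replace(/^www\./, "");
  let candidate: string | null = null;
  if (host === "youtu.be") {
    candidate = url.pathname.slice(1).split("/")[0];
  } else if (host === "youtube.com" || host === "m.youtube.com") {
    candidate = url.searchParams.get("v");
  }
  return candidate !== null && VIDEO_ID.test(candidate) ? candidate : null;
}
```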
Version Log (Selected)
v0.1.9
- Refined API interaction logic to improve resilience against temporary rate limits (HTTP 429 responses).
- Significantly boosted stability of the end-to-end validation suite.
v0.1.3
- Corrected behavior for the --mode bullet parameter.
- Ensured seamless execution environment compatibility with the latest bunx toolchain.
Integration Methods
For Claude Desktop/Cline Users
Place the provided JSON configuration block into your respective application settings file. The system defaults to invoking the latest version of the package via bunx for automatic updates and dependency management.
For Standalone Node/Bun Execution
The development workflow favors Bun for speed; the commands below use the equivalent npm scripts:
# Install dependencies
npm install
# Start server in development mode (live reloading)
npm run dev
Frequently Encountered Issues (FAQ)
Q: Why might I receive a 404 error from the DeepSRT backend?
A: A 404 often indicates that the requested summary content has not yet been indexed or cached by the DeepSRT CDN edge network. To pre-warm the cache, users should access the video URL once using the associated DeepSRT browser extension. Verification can be done via curl, checking for cache-status: HIT.
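The same check can be scripted instead of read off curl output. The sketch below only interprets a header value; the endpoint URL is deliberately left out because it is not given here, and `isCacheHit` is a hypothetical helper name.

```typescript
// Takes the raw value of the cache-status response header (or null when
// the header is absent) and reports whether it signals a cache hit.
function isCacheHit(headerValue: string | null): boolean {
  if (headerValue === null) return false;
  // Split on whitespace, semicolons, and commas so that both a plain "HIT"
  // and a structured value such as "edge; hit" are recognized.
  return headerValue.toUpperCase().split(/[\s;,]+/).includes("HIT");
}
```

With the Fetch API this would be called as `isCacheHit(response.headers.get("cache-status"))` on the response from whichever DeepSRT URL is being checked.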
