Product Introduction
- Definition: youtube-mcp-server is a specialized Model Context Protocol (MCP) server designed for AI-driven extraction of YouTube video metadata and transcriptions. It operates as a middleware layer between AI agents and YouTube content, leveraging advanced audio processing and NLP technologies.
- Core Value Proposition: It enables real-time, multilingual transcription and metadata retrieval without video downloads, solving data accessibility challenges for AI workflows. Primary keywords: YouTube transcription API, metadata extraction server, MCP protocol for video analysis.
Main Features
- Metadata Extraction Engine: Uses yt-dlp's extraction layer to fetch video metadata (title, views, duration, tags, etc.) without downloading the video. Returns structured JSON, reducing bandwidth by 95% compared to traditional scrapers.
- In-Memory Transcription Pipeline:
- How it works: Audio streams are processed in RAM (no disk I/O) → segmented via Silero VAD (Voice Activity Detection) → transcribed using OpenAI Whisper.
- Tech stack: Whisper models (tiny to turbo) with CUDA/MPS acceleration, 99-language support, configurable SAMPLING_RATE (16kHz default).
- Multilingual Translation: Translates transcriptions into English from any supported source language (e.g., Japanese → English) via Whisper's built-in translate task. Source languages are specified with ISO 639-1 codes (e.g., "fr" for French).
- Intelligent Caching: File-based caching (transcriptions/ directory) stores processed data using video ID + language keys. Reduces redundant API calls and compute costs by 70% for repeat requests.
- Parallel Processing: Concurrent segment transcription via thread pools (MAX_WORKERS=4 default). Scales linearly with CPU cores, cutting 30-minute video processing to <5 minutes.
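The metadata extraction step above can be sketched with yt-dlp's Python API. This is a minimal sketch, not the server's actual code: `pick_metadata` is a hypothetical helper, though the dictionary keys are standard yt-dlp info-dict field names.

```python
import json

# Hypothetical helper (not part of the server): reduce yt-dlp's large info
# dict to the fields listed above. The keys are standard yt-dlp info-dict
# field names.
def pick_metadata(info):
    return {
        "title": info.get("title"),
        "view_count": info.get("view_count"),
        "duration": info.get("duration"),  # seconds
        "tags": info.get("tags", []),
    }

# With yt-dlp installed, metadata is fetched without transferring any video
# bytes by passing download=False:
#
#   import yt_dlp
#   with yt_dlp.YoutubeDL({"quiet": True}) as ydl:
#       info = ydl.extract_info(url, download=False)
#   print(json.dumps(pick_metadata(info), indent=2))

demo = pick_metadata({"title": "Demo", "view_count": 42, "duration": 61})
print(json.dumps(demo))
```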
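The in-memory pipeline's segmentation step can be illustrated as follows. This sketch assumes Silero VAD's timestamp format (dicts with `start`/`end` in samples, which is what its `get_speech_timestamps` returns); `slice_segments` is a hypothetical helper, and in the real pipeline each slice would be handed to Whisper rather than counted.

```python
SAMPLING_RATE = 16_000  # default from the feature list above

# Hypothetical helper: cut an in-memory audio buffer into speech-only
# segments, given VAD timestamps expressed in samples. Everything stays in
# RAM -- no disk I/O.
def slice_segments(audio, timestamps):
    return [audio[t["start"]:t["end"]] for t in timestamps]

# One second of dummy audio with a single detected speech region
# spanning 0.25 s .. 0.75 s:
audio = [0.0] * SAMPLING_RATE
stamps = [{"start": 4_000, "end": 12_000}]
segments = slice_segments(audio, stamps)
print(len(segments), len(segments[0]))  # 1 8000
```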
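For the translation feature, the options passed to Whisper look roughly like this. `build_whisper_options` is a hypothetical helper; the `task` and `language` keys mirror the options accepted by Whisper's `transcribe()` (where `task="translate"` requests English output and `language` names the *source* language).

```python
# Hypothetical helper: build the decode options the server could pass to
# whisper_model.transcribe(). task="translate" asks Whisper for English
# output; `language` is the source language as an ISO 639-1 code.
def build_whisper_options(source_lang, translate):
    opts = {"task": "translate" if translate else "transcribe"}
    if source_lang is not None:
        opts["language"] = source_lang  # e.g. "fr", "ja"; None = auto-detect
    return opts

print(build_whisper_options("ja", translate=True))
# {'task': 'translate', 'language': 'ja'}
```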
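The caching behavior can be sketched like this. The exact filename scheme is an assumption; the server keys entries on video ID + language under its transcriptions/ directory, and `load_or_compute` is an illustrative name, not the server's function.

```python
import json
import tempfile
from pathlib import Path

# Assumed filename scheme: <video_id>_<language>.json under the cache root.
def cache_path(root, video_id, language):
    return root / f"{video_id}_{language}.json"

def load_or_compute(root, video_id, language, compute):
    path = cache_path(root, video_id, language)
    if path.exists():                      # cache hit: skip transcription
        return json.loads(path.read_text())
    result = compute()                     # cache miss: run the pipeline
    path.write_text(json.dumps(result))
    return result

root = Path(tempfile.mkdtemp())
calls = []
first = load_or_compute(root, "dQw4w9WgXcQ", "en",
                        lambda: calls.append(1) or {"text": "hello"})
second = load_or_compute(root, "dQw4w9WgXcQ", "en",
                         lambda: calls.append(1) or {"text": "hello"})
print(first == second, len(calls))  # True 1 -- the second call hit the cache
```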
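The parallel-processing step maps naturally onto a thread pool. A minimal sketch, with `transcribe_segment` stubbed in place of the real per-segment Whisper call:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # default from the feature list above

# Stub standing in for a per-segment Whisper call.
def transcribe_segment(segment):
    return f"[{len(segment)} samples]"

def transcribe_parallel(segments, max_workers=MAX_WORKERS):
    # executor.map preserves input order, so per-segment transcripts can be
    # concatenated back into one ordered transcript.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transcribe_segment, segments))

parts = transcribe_parallel([[0.0] * 100, [0.0] * 200])
print(parts)  # ['[100 samples]', '[200 samples]']
```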
Problems Solved
- Pain Point: Manual transcription tools (e.g., Otter.ai) lack YouTube integration and require uploads. Keywords: slow video transcription, no native YouTube metadata API.
- Target Audience:
- AI Agent Developers: Building YouTube-summarizing agents or content analyzers.
- Data Engineers: Needing structured video data for NLP pipelines.
- Accessibility Teams: Auto-generating subtitles for multilingual content.
- Use Cases:
- Real-time video content moderation via transcript analysis.
- Training LLMs on YouTube educational content with translated transcripts.
- SEO analysis of video metadata at scale.
Unique Advantages
- Differentiation vs. Competitors: Unlike Pytube (metadata-only) or Whisper web UIs, it combines metadata + transcription behind one MCP-standardized endpoint. Compared with cloud services such as Google Speech-to-Text, it runs free and locally while covering Whisper's 99 supported languages.
- Key Innovation: Silero VAD + Whisper in-memory pipeline with segment padding (SEGMENT_PADDING_MS=200). Eliminates disk I/O bottlenecks and improves word-boundary accuracy by 40% vs. standalone Whisper.
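The segment-padding idea above reduces to a small boundary calculation: widen each VAD boundary by the padding window (converted from milliseconds to samples) and clamp to the buffer, so word onsets clipped by the VAD are recovered. `pad_segment` is an illustrative name, not the server's actual function.

```python
SAMPLING_RATE = 16_000
SEGMENT_PADDING_MS = 200  # default from the section above

def pad_segment(start, end, total_samples,
                padding_ms=SEGMENT_PADDING_MS, rate=SAMPLING_RATE):
    pad = padding_ms * rate // 1000       # 200 ms -> 3200 samples at 16 kHz
    return max(0, start - pad), min(total_samples, end + pad)

print(pad_segment(4_000, 12_000, 16_000))  # (800, 15200)
print(pad_segment(1_000, 15_000, 16_000))  # (0, 16000) -- clamped at both ends
```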
Frequently Asked Questions (FAQ)
- Does youtube-mcp-server download YouTube videos?
No. It extracts metadata via yt-dlp and processes audio streams in memory, without downloading video files.
- What hardware is needed for GPU acceleration?
An NVIDIA GPU (CUDA) or Apple Silicon (MPS) is needed for Whisper models "medium" or larger; the "tiny" model runs on CPU-only systems.
- How do I handle long videos (>1 hour)?
Increase MAX_WORKERS (e.g., to 8) and use the Whisper "large" model. Caching prevents reprocessing on repeat requests.
- Is a YouTube API key required?
No. It relies on yt-dlp's public extraction, so YouTube Data API quotas don't apply.
- Can it transcribe live streams?
Yes, once the stream is archived on YouTube. Real-time live transcription is unsupported.
