Product Introduction
- Dubbing 3.0 by Sieve is an API for studio-quality video and audio localization, automating translation, voice cloning, and lip synchronization with AI. It processes content in 30+ languages while preserving speaker voices, accents, and timing through advanced machine learning models.
- The core value lies in replacing manual dubbing workflows with a scalable, enterprise-grade pipeline that maintains linguistic accuracy, multi-speaker consistency, and lip movement alignment. It enables businesses to adapt global content without sacrificing production quality or operational efficiency.
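To make the workflow concrete, here is a minimal sketch of what a dubbing request could look like over plain HTTP. The endpoint URL, field names, and header shown are assumptions for illustration, not the documented Sieve API; consult the official reference for the actual interface.

```python
import requests

API_KEY = "YOUR_API_KEY"  # issued from the Sieve dashboard

# Hypothetical endpoint and field names, for illustration only; the real
# request shape is defined in the Sieve API reference.
response = requests.post(
    "https://api.example.com/v3/dub",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "source_url": "https://example.com/product-demo.mp4",
        "target_language": "pt-BR",   # one of the 30+ supported languages
        "voice_engine": "cloning",    # preserve the original speakers' voices
        "enable_lipsync": True,       # align generated speech with on-screen mouth movements
    },
    timeout=60,
)
response.raise_for_status()
job = response.json()
print("Dubbing job submitted:", job.get("id"))
```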
Main Features
- Studio-quality voice cloning with accent preservation replicates original speaker voices using spectral analysis and noise suppression, maintaining vocal characteristics even in dynamic environments. This includes tone matching for regional dialects and handling overlapping speakers in multi-party conversations.
- Multi-speaker segmentation leverages proprietary diarization models to isolate individual voices, translate their speech, and redub each segment with synchronized timing. The system automatically detects speaker changes and processes them as independent audio tracks.
- Frame-accurate lip synchronization uses viseme prediction algorithms to align generated speech with on-screen mouth movements, achieving under 100ms audio-visual drift. This integrates with video processing pipelines to maintain sync across variable playback speeds.
- Customizable translation rules allow users to enforce brand terminology through safe word lists, regional style guides (e.g., "Brazilian Portuguese"), and phrase-level overrides via JSON dictionaries (see the example settings after this list). API parameters enable strict literal translations or context-aware adaptations.
- Modular audio engines support switching between voice cloning, generic TTS, and hybrid modes, with adjustable parameters for speaking rate (50-200% of original speed) and pitch modulation (±20 semitones). Each engine optimizes for either voice similarity or computational efficiency.
- Enterprise-grade infrastructure provides SOC 2 Type 2 compliant processing, sub-30-minute SLA guarantees for batch jobs, and configurable data retention policies. The API scales linearly across distributed GPU clusters for petabyte-level workloads.
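As a concrete illustration of the translation-rule and audio-engine options above, the settings object below sketches one plausible shape for such a configuration. Every field name here is an assumption; the authoritative schema lives in the Sieve documentation.

```python
import json

# All field names below are assumptions used to illustrate the translation-rule
# and audio-engine options described above; the authoritative schema is in the
# Sieve documentation.
dubbing_settings = {
    "target_language": "pt-BR",               # regional style guide: Brazilian Portuguese
    "translation_mode": "context_aware",      # or "literal" for strict translations
    "safe_words": ["Sieve", "Dubbing 3.0"],   # brand terms kept untranslated
    "phrase_overrides": {                     # phrase-level overrides as a JSON dictionary
        "sign up": "cadastre-se",
        "dashboard": "painel",
    },
    "audio_engine": "voice_cloning",          # "voice_cloning", "tts", or "hybrid"
    "speaking_rate": 1.1,                     # within the 50-200% range of original speed
    "pitch_shift_semitones": 0,               # within the ±20 semitone range
}

print(json.dumps(dubbing_settings, ensure_ascii=False, indent=2))
```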
Problems Solved
- Eliminates manual dubbing workflows by automating translation, voice synthesis, and lip synchronization through a unified API, reducing production time from weeks to hours. Traditional methods requiring separate teams for transcription, translation, and voice acting are replaced with a single integration.
- Addresses inconsistent localization quality in global markets through linguist-reviewed translation models trained on 15M+ parallel text pairs, achieving 98.6% BLEU scores for major language pairs. Regional slang and idioms are preserved using context-aware NMT architectures.
- Solves speaker voice discontinuity in multi-language content by maintaining vocal fingerprints across translations, using 256-dimensional speaker embeddings that remain stable through language transitions. This prevents jarring voice changes when switching between dubbed versions.
Unique Advantages
- Outperforms competitors in human evaluations, with 89% preference rates for lip sync accuracy and voice naturalness across third-party benchmarks. The system uses a patented phoneme duration predictor that reduces timing errors by 62% compared to industry averages.
- Only solution offering simultaneous multi-speaker diarization and per-voice translation parameters, enabling complex scenarios like overlapping dialogue localization. The diarization model achieves 92% accuracy on 8-speaker recordings through contrastive learning techniques.
- Provides atomic control over localization pipelines via 43 API parameters, including `force_pronunciation` for IPA overrides and `max_speech_gap` for speaker segmentation thresholds. This contrasts with black-box competitors offering only preset localization profiles.
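A sketch of how that parameter-level control might be exercised in a request. `force_pronunciation` and `max_speech_gap` are the parameters named above; the endpoint URL and the rest of the payload are illustrative assumptions rather than the documented interface.

```python
import requests

# `force_pronunciation` and `max_speech_gap` are the parameters named above;
# the endpoint URL and the surrounding payload shape are assumptions made for
# the sake of a complete example.
payload = {
    "source_url": "https://example.com/webinar.mp4",
    "target_language": "fr",
    "force_pronunciation": {   # IPA overrides for specific terms
        "Sieve": "siːv",
    },
    "max_speech_gap": 0.4,     # seconds of silence treated as a speaker-segment boundary
}

requests.post(
    "https://api.example.com/v3/dub",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=payload,
    timeout=60,
)
```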
Frequently Asked Questions (FAQ)
- How many languages are supported? The API supports 32 languages with full voice cloning capabilities and 100+ languages using non-cloning TTS engines, covering markets that represent 93% of global GDP. Language packs include regional variants like Latin American Spanish and Canadian French.
- Can Sieve preserve the original speaker's voice across translations? Yes, the default voice cloning engine uses a 12-layer convolutional network to extract speaker embeddings, maintaining vocal characteristics with <1.5% spectral centroid deviation across language transitions.
- How does Sieve handle videos with multiple speakers? The system first applies speaker diarization using contrastive predictive coding, then processes each speaker's segments through separate translation and voice cloning pipelines. Outputs are recombined with frame-accurate alignment.
- Are outputs editable after processing? Users can edit translations via the `edit_segments` endpoint, which accepts JSON arrays containing revised text, timestamps, and pronunciation guides. Edits propagate automatically through voice synthesis and lip sync stages (see the sketch after this FAQ).
- Does Sieve train on user-provided content? No, all processing occurs ephemerally with zero data retention by default. Custom vocabulary terms are hashed before model ingestion and purged post-processing.
- Is real-time dubbing supported? Batch processing currently runs at a minimum of 2x real-time speed, with real-time capabilities slated for a Q4 2024 release. Current latency averages 18 seconds per minute of audio on A100 GPUs.
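For the editing workflow mentioned in the FAQ above, here is a hedged sketch of a call to the `edit_segments` endpoint. The segment fields mirror the description (revised text, timestamps, pronunciation guides), but the exact schema, job identifier, and URL are assumptions.

```python
import requests

# Field names inside each segment follow the FAQ description (revised text,
# timestamps, pronunciation guides); the exact schema, job identifier, and URL
# are assumptions, so check the Sieve API reference before relying on them.
edits = [
    {
        "start": 12.4,                                 # segment start, in seconds
        "end": 15.1,                                   # segment end, in seconds
        "text": "Bienvenue sur le tableau de bord.",   # revised translation
        "pronunciation": {"Sieve": "siːv"},            # optional IPA guide
    }
]

requests.post(
    "https://api.example.com/v3/jobs/JOB_ID/edit_segments",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"segments": edits},
    timeout=60,
)
```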
