The localization industry has always had a blind spot: songs. Dialogue can be dubbed relatively cheaply by hiring voice actors in the target language. Songs require singers, studios, producers, and often rewrites of the lyrics to preserve meter and rhyme. The cost per minute of localized song content has historically been 5 to 20 times the cost per minute of localized dialogue.
AI singing voice synthesis is changing that math. Not by replacing singers entirely (the quality ceiling still favors humans for hero content) but by dramatically reducing the cost of secondary and tertiary song localization: background tracks in games, musical moments in ads, licensed music in narrative video, and lyrical content in educational media. A dubbing studio that can generate a passable localized version of a song in a new language for a fraction of the cost of hiring a session singer has a lot of new business.
The bottleneck, as usual, is training data. Cross-lingual singing synthesis requires multilingual vocal datasets, and almost all of the public options are monolingual (and mostly Mandarin). This post is a guide for dubbing and localization teams on how to source licensed multilingual vocal training data.
The specific problem: why cross-lingual singing is hard
Cross-lingual voice synthesis is an established area of research for speech. Models like XTTS can clone a voice from one language and generate speech in another language with acceptable quality. Singing is significantly harder for a few reasons.
Phonetic inventory mismatches
Each language has a different phonetic inventory (its set of phonemes). When a singer trained in English attempts to sing in Japanese, sounds that English lacks (the unrounded /ɯ/ vowel, the moraic nasal, geminate consonants) are reproduced as approximations. A model trained only on English singing data will produce similarly approximate outputs when asked to sing in other languages.
True cross-lingual singing requires training data that covers the full phonetic inventory of the target languages. Either you train a monolingual model per language, or you train a multilingual model on data that includes multiple languages.
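One practical way to evaluate a candidate corpus for a target language is to compare the phoneme labels in its annotations against that language's inventory. Here is a minimal sketch of such a coverage check; the partial Japanese IPA inventory and the corpus counts are made-up illustrations, not a real dataset or a complete phonology.

```python
from collections import Counter

# Hypothetical, partial target inventory (a handful of Japanese IPA symbols;
# illustration only, not a complete phonology).
TARGET_INVENTORY = {"a", "i", "ɯ", "e", "o", "k", "s", "ɕ", "t", "ts",
                    "n", "ɴ", "h", "ɸ", "m", "j", "ɾ", "w", "ɡ"}

def coverage_report(phoneme_counts: Counter, inventory: set, min_examples: int = 50) -> dict:
    """Report which target phonemes the corpus covers, and how deeply."""
    covered = {p for p in inventory if phoneme_counts.get(p, 0) >= min_examples}
    sparse = {p for p in inventory if 0 < phoneme_counts.get(p, 0) < min_examples}
    missing = inventory - covered - sparse
    return {
        "coverage": round(len(covered) / len(inventory), 2),
        "sparse": sorted(sparse),    # present but under-represented
        "missing": sorted(missing),  # absent; the model will approximate these
    }

# Made-up counts from an English-heavy corpus: /ɯ/, /ɴ/, /ɸ/ never appear.
corpus_counts = Counter({"a": 4200, "i": 3900, "e": 3100, "o": 3300,
                         "k": 2500, "s": 2400, "t": 2600, "n": 2800,
                         "m": 1900, "w": 800, "ɕ": 12})

print(coverage_report(corpus_counts, TARGET_INVENTORY))
```

If the "missing" list is long for the language you care about, no amount of fine-tuning on that corpus will produce native-sounding output in it.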
Prosodic differences
Singing prosody (the timing, pitch contour, and dynamics of a vocal phrase) varies by language. Romance languages tend to favor smooth legato phrasing. Asian tonal languages (Mandarin, Vietnamese, Thai) carry pitch information at the lexical level that interacts with the musical melody. Germanic languages allow heavier consonant clustering that affects phrasing. A model trained on one language's prosody does not automatically generalize to others.
Meter and rhyme preservation
When a song is localized, the lyrics must be rewritten to match the original melody's meter and (ideally) rhyme scheme. This is a creative translation task that is hard for humans and harder for AI. A cross-lingual singing model that can only sing whatever lyrics it is given still needs a localization step to produce the lyrics in the first place.
Cultural specificity
Some vocal styles are inseparable from cultural context. A Spanish flamenco vocal technique does not translate cleanly to Swedish. A Bollywood vocal style does not translate cleanly to French. AI singing models can produce technically acceptable outputs that feel culturally off, which is often worse than a lower-fidelity output that feels right.
What a multilingual vocal dataset needs to include
For a dubbing or localization use case, the relevant dataset attributes are:
Multilingual dataset requirements
- Multiple target languages represented with sufficient depth per language (typically 10+ hours per language for single-speaker work, 30+ hours for multi-speaker)
- Phonetic annotations in each language, ideally using IPA (International Phonetic Alphabet) for cross-language compatibility (see the manifest sketch after this list)
- Native speakers for each language, not approximations by non-native performers
- Culturally representative vocal styles within each language
- Genre diversity within each language if the use case is broad
- Consent agreements that contemplate cross-lingual use (some performers may have language-specific concerns about voice use)
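To make these attributes concrete, here is a sketch of what one track's manifest entry might look like in a dataset that records them explicitly. The field names and values are hypothetical, not a schema any particular vendor uses.

```python
import json

# Hypothetical manifest entry for one track; field names are illustrative.
track = {
    "track_id": "es-0142",
    "language": "es",                        # ISO 639-1 code
    "native_speaker": True,                  # performed by a native speaker
    "vocalist_id": "vox-031",
    "duration_sec": 214.6,
    "genre": "pop-ballad",
    "lyrics_ipa": "ˈkan.ta.me ˈo.tɾa ˈβes",  # IPA transcription (excerpt)
    "alignment": "phoneme-level",            # granularity of time alignment
    "consent": {
        "ai_training": True,
        "cross_lingual_synthesis": True,     # outputs in languages the vocalist
                                             # did not record are explicitly allowed
        "commercial_outputs": True,
    },
}

def usable_for_cross_lingual(entry: dict) -> bool:
    """Filter a data or licensing team can apply before training."""
    c = entry["consent"]
    return entry["native_speaker"] and c["ai_training"] and c["cross_lingual_synthesis"]

print(usable_for_cross_lingual(track))
print(json.dumps(track, ensure_ascii=False, indent=2))
```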
The open data situation
As of 2026, open-source multilingual singing datasets are scarce. The public options are:
- OpenCpop: Single speaker, Mandarin only.
- M4Singer: 20 speakers, Mandarin only.
- OpenSinger: 66 singers, Mandarin only.
- PopBuTFy: Mandarin and English, but limited in scope.
- GTSinger: Multi-language but research-only license.
- JVS-MuSiC: Japanese only, mixed licensing.
Notice the pattern. The vast majority of open singing datasets are Mandarin. There are a few Japanese and Korean datasets. English singing data in open repositories is almost nonexistent at scale, and other European and Latin American languages are essentially absent.
This is a structural gap in the open data landscape. Any localization team that needs English singing plus at least one non-English language for training (so basically every localization team) cannot assemble a training corpus from public datasets alone.
The licensing alternatives
The realistic sources of licensed multilingual vocal data are:
Commercial stock libraries with multilingual filters
Shutterstock, Pond5, and Epidemic Sound all have catalogs that include non-English vocals. The usable portion for AI training is smaller than their total catalogs because not every track has AI-training-specific licensing. Query these platforms specifically for multilingual vocal training packages.
Dedicated multilingual vocal datasets
A growing category of purpose-built datasets that explicitly cover multiple languages. These are typically more focused than general stock libraries and better suited to ML training because the metadata and alignment are built for the use case. The Vocal Market's enterprise dataset is one example, covering 4 languages at present with more planned.
Direct commissioning
For languages not covered by existing datasets, direct commissioning (paying singers to record a custom dataset) is always an option. It is expensive but produces the highest-quality language-specific training material. Some dubbing studios that operate in 20+ languages do this regularly for their internal tools.
Dubbing studio archives
Larger dubbing studios have built up libraries of localized song recordings over years. These can be repurposed for AI training if the original contracts with the singers and labels permit it. In practice, most old contracts do not permit AI training use, so these archives are usable only after contract amendments or re-licensing.
The specific use cases for localization teams
Dubbing and localization teams typically want to train or use AI singing models for a few specific tasks:
Song translation and melodic rewriting
Given a source song in language A, produce a lyrical translation in language B that preserves the melody's meter and ideally its rhyme scheme. This is a language task, not an audio task, but it is the precursor to everything else.
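One check that localization tooling can run automatically is comparing the syllable count of each translated line against the number of melody notes it has to fill. A minimal sketch, assuming the translated lyrics arrive pre-syllabified (hyphens between syllables) and the per-line note counts come from the melody; the Spanish lines are made-up examples.

```python
# Hypothetical inputs: translated lyric lines with syllables marked by hyphens,
# and the number of melody notes each line must fill.
translated_lines = [
    "can-ta con-mi-go o-tra vez",   # 8 syllables
    "ba-jo la luz de la lu-na",     # 8 syllables
]
notes_per_line = [8, 7]

def syllable_count(line: str) -> int:
    """Count syllables in a pre-syllabified line (hyphen- and space-separated)."""
    return sum(len(word.split("-")) for word in line.split())

def meter_report(lines, note_counts):
    report = []
    for line, notes in zip(lines, note_counts):
        syllables = syllable_count(line)
        report.append({
            "line": line,
            "syllables": syllables,
            "notes": notes,
            "fits": syllables == notes,  # a mismatch means a rewrite or a melisma
        })
    return report

for row in meter_report(translated_lines, notes_per_line):
    print(row)
```

The second line fails the check against its 7-note phrase, which is exactly the kind of flag a lyric translator wants before anything reaches the recording or synthesis stage.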
Cross-lingual voice synthesis
Given a source vocal performance in language A and a target language B, produce a new vocal performance of the same song in language B, ideally preserving the singer's timbre and style. This is the core technical capability that makes cheap song localization possible.
Voice cloning with language switching
Given a target singer's voice (cloned from a reference sample) and target lyrics in a specific language, produce singing in that singer's voice and that language. This is the most demanding variant because it combines voice identity preservation with cross-lingual generation.
Quality tier matching
Different content tiers need different quality levels. Hero content (lead vocal in a musical number) still needs human singers most of the time. Secondary content (background vocal, ambient song) can use AI at current quality levels. Localization teams need models that can handle multiple quality tiers from the same pipeline, with the ability to upgrade specific outputs to human singers when necessary.
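In workflow terms this usually reduces to a per-cue routing decision. Here is a minimal sketch of one way to express that policy in a pipeline; the tier names, rules, and override flag are hypothetical, not an industry standard or any particular studio's setup.

```python
from dataclasses import dataclass

# Hypothetical routing policy: which production path each content tier uses.
TIER_POLICY = {
    "hero": "human_singer",       # lead vocal in a musical number
    "secondary": "ai_synthesis",  # background vocals, ambient songs
    "tertiary": "ai_synthesis",   # incidental or placeholder tracks
}

@dataclass
class Cue:
    cue_id: str
    tier: str
    language: str
    client_requires_human: bool = False  # contractual override

def route(cue: Cue) -> str:
    """Pick a production path for one cue, with a human-upgrade escape hatch."""
    if cue.client_requires_human:
        return "human_singer"
    return TIER_POLICY.get(cue.tier, "human_singer")  # default to the safe path

cues = [
    Cue("ep03-song-lead", "hero", "ja"),
    Cue("ep03-cafe-background", "secondary", "ja"),
    Cue("ep04-ad-jingle", "tertiary", "de", client_requires_human=True),
]
for cue in cues:
    print(cue.cue_id, "->", route(cue))
```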
Legal considerations specific to dubbing and localization
Song localization has a complex rights structure even before AI is added. For a typical localized song in a dubbed film or game, the rights to clear include:
- The original composition (melody and lyrics in the source language)
- The translated lyrics (a derivative work)
- The original sound recording (if sampled or used as a reference)
- The new localized recording (a new sound recording copyright)
- The vocalist's performance rights in the new recording
AI singing adds a sixth consideration: the rights to use the training data that produced the singing model. The training data license must permit the AI to generate outputs that will be embedded in derivative works of copyrighted content (the original song), in multiple languages, for commercial distribution.
A dubbing studio using an AI singing model for localization needs to ensure that:
- The training data license permits commercial use in localization work
- The output rights allow the localized song to be distributed through the studio's normal channels
- The training data contributors consented to their voices being used for cross-lingual synthesis, which may involve outputs in languages they do not speak
The last point is subtle but important. A vocalist who signed a standard "AI training" agreement for English singing may have a reasonable objection to being used as a training source for Japanese singing outputs. Thoughtful licensing agreements address this explicitly.
How The Vocal Market handles multilingual licensing
Our enterprise vocal dataset currently covers 4 languages with native vocalists in each, and the catalog is growing. Our vocalist agreements include explicit authorization for cross-lingual synthesis use cases, which means a dubbing studio licensing our data can train models that produce outputs in languages our vocalists did not themselves record.
The current language coverage leans toward European markets (English, Spanish, German, Dutch), with expansion to additional markets planned as the vocalist roster grows. If your localization needs include a language not yet represented, we can discuss direct commissioning arrangements alongside the standard catalog licensing.
Request a sample dataset and specify the target languages for your dubbing or localization use case. We will include samples from each available language and share the specific licensing language that covers cross-lingual synthesis, so your legal team can confirm alignment before we proceed.



