Free vs Licensed Vocal Datasets: What Academic Research Datasets Won't Give You

Before we built The Vocal Market's enterprise licensing program, the default answer to "where do I get vocal training data?" for most ML teams was "start with the academic datasets." MUSDB18, OpenCpop, VCTK, OpenSinger, M4Singer, and a handful of others are freely available, well-documented, and widely used in published research.

They are also, in almost every case, not actually usable for commercial AI training. Not because the data is bad, but because the licenses that attach to the data either prohibit commercial use entirely or impose restrictions that make commercial use legally fragile. This post walks through the major academic vocal and music datasets, explains exactly what each one's license permits, and shows where the gap between research-grade availability and commercial-grade availability actually lives.

The pattern: research datasets are free but not commercial

Most academic vocal and music datasets are released under one of three licensing patterns:

Creative Commons Non-Commercial (CC BY-NC, CC BY-NC-SA): Free for non-commercial use, explicitly prohibits commercial use. Training a commercial AI model is commercial use.
Custom research-only licenses: Access granted only to researchers for academic purposes, often with a signed agreement. Commercial use is either prohibited or requires a separate deal.
Mixed or unclear licensing: The dataset is published without a clear license, or aggregates material from multiple sources with different licenses. Commercial use is legally ambiguous and should be treated as prohibited by default.

A small number of datasets (VCTK, VocalSet) are released under CC BY 4.0, which permits commercial use with attribution. But VCTK is a speech corpus and VocalSet is a set of technique demonstrations; neither is a full singing corpus suitable for training commercial models.
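The licensing patterns above can be sketched as a small default-deny check. This is a minimal illustration, not legal tooling: the names `LicensePattern`, `classify_license`, and `commercial_training_allowed` are hypothetical, and the string matching assumes license labels written the way this post writes them (for example "CC BY-NC-SA 4.0").

```python
from enum import Enum


class LicensePattern(Enum):
    NON_COMMERCIAL_CC = "Creative Commons Non-Commercial"
    RESEARCH_ONLY = "custom research-only"
    UNCLEAR = "mixed or unclear"
    PERMISSIVE_CC = "CC BY (commercial use with attribution)"


def classify_license(license_id):
    """Map a free-text license label to one of the patterns in this post.

    None or an empty string means no license was published.
    """
    if not license_id or not license_id.strip():
        return LicensePattern.UNCLEAR
    label = license_id.strip().upper()
    if label.startswith("CC BY-NC"):
        return LicensePattern.NON_COMMERCIAL_CC
    if label == "CC BY" or label.startswith("CC BY "):
        return LicensePattern.PERMISSIVE_CC
    if "RESEARCH" in label:
        return LicensePattern.RESEARCH_ONLY
    # Anything unrecognized is treated as unclear, i.e. prohibited by default.
    return LicensePattern.UNCLEAR


def commercial_training_allowed(license_id):
    """Only the permissive pattern passes; everything else is blocked."""
    return classify_license(license_id) is LicensePattern.PERMISSIVE_CC


if __name__ == "__main__":
    for label in ["CC BY 4.0", "CC BY-NC 4.0", "Custom research license", None]:
        print(label, "->", classify_license(label).value)
```

The design choice worth noting is the fallthrough: a label the function does not recognize lands in the "unclear" bucket rather than the permissive one, which mirrors the "treat as prohibited by default" rule.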

The major datasets, by license

MUSDB18 and MUSDB18-HQ

The standard benchmark for music source separation research. Contains 150 full-track songs (100 train + 50 test), with a 4-stem taxonomy (vocals, drums, bass, other). Used extensively in the Signal Separation Evaluation Campaign (SiSEC).

License: Mixed. 100 tracks come from DSD100, which is itself drawn from Mike Senior's Mixing Secrets library. 46 are from MedleyDB under CC BY-NC-SA 4.0. 2 are from The Easton Ellises under CC BY-NC-SA 3.0. The remaining 2 were provided by Native Instruments. The net effect is that most of the data is non-commercial.
Commercial use: Not permitted for the majority of the catalog.
Usability for commercial training: No.

DSD100 and DSD100-HQ

100 tracks, 50 train / 50 test. Predecessor to MUSDB18, derived from the same Mixing Secrets library.

License: Research use under Mike Senior's permission. Not a clean commercial license.
Commercial use: Not permitted.
Usability for commercial training: No. Now considered legacy; MUSDB18-HQ is preferred.

VCTK (CSTR VCTK Corpus)

44 hours of speech data from 110 English speakers with varied accents.

License: CC BY 4.0 (commercial use permitted with attribution).
Commercial use: Yes, with an important caveat: VCTK is speech, not singing. It is frequently misremembered as a singing dataset. A singing voice synthesis model trained on it produces speech-like output.
Usability for commercial training: Yes for speech, no for singing.

OpenCpop

5.2 hours of Mandarin singing from a single professional female singer, with phoneme boundaries and note annotations. Standard benchmark for Mandarin singing voice synthesis.

License: CC BY-NC 4.0 (non-commercial only).
Commercial use: Not permitted.
Usability for commercial training: No.

OpenSinger

50 hours, 1,146 songs, 66 singers. Multi-singer Mandarin corpus, one of the larger open singing datasets.

License: Not clearly published. Treated as research-only by default.
Commercial use: Ambiguous; assume not permitted.
Usability for commercial training: No.

M4Singer

700 Mandarin pop songs from 20 professional singers covering all SATB voice types. NeurIPS 2022 Datasets and Benchmarks track.

License: Custom research license with acceptance terms. Typically research-only.
Commercial use: Not permitted under standard terms.
Usability for commercial training: No.

NUS-48E

169 minutes (2.8 hours) of singing from 12 singers. Annotated with phoneme-level transcriptions.

License: Research-only, request via NUS.
Commercial use: Not permitted.
Usability for commercial training: No.

VocalSet

10.1 hours of professional vocalists demonstrating 17 different vocal techniques (vibrato, belt, breathy, vocal fry, etc.).

License: CC BY 4.0 (commercial use permitted).
Commercial use: Yes.
Usability for commercial training: Limited. VocalSet contains technique demonstrations, not songs. It is useful as supplementary training data for technique conditioning but is not sufficient as a primary singing corpus.

Children's Song Dataset (CSD)

100 children's songs (50 Korean + 50 English), each recorded in 2 keys. Single female vocalist.

License: CC BY-NC-SA 4.0 (non-commercial, share-alike).
Commercial use: Not permitted.
Usability for commercial training: No.

JVS-MuSiC

Japanese singing corpus. 100 singers singing a common Japanese children's song plus one unique song each.

License: Split. The tags (annotations) are released under CC BY-SA 4.0. The audio is free for personal use, but commercial redistribution is prohibited under the standard terms.
Commercial use: Restricted.
Usability for commercial training: No under standard terms.

GTSinger

Multi-language, multi-technique singing dataset with realistic music scores. NeurIPS 2024 Spotlight.

License: Research license.
Commercial use: Not permitted.
Usability for commercial training: No.

The total open singing data ceiling

If you sum up every major open-source clean singing dataset, the total comes to roughly 230 hours of clean solo vocals. That includes OpenCpop (5 hours), OpenSinger (50 hours), M4Singer (approximately 29 hours), PopBuTFy (50 hours), PopCS (5 hours), GTSinger (tens of hours), VocalSet (10 hours), and a few smaller corpora.

That 230-hour total has two problems. First, it is heavily weighted toward Mandarin Chinese. English clean singing data at scale essentially does not exist in the open-source pool. Second, almost none of it is commercially licensable. Of the datasets covered here, the only commercially licensable ones are VocalSet, which contains technique demonstrations rather than songs, and VCTK, which is speech and therefore not part of the singing total at all.

For an ML team that wants commercial-grade clean English singing data in training volumes, open data cannot meet the need. This is a structural gap that is not going to be filled by any single academic release in the foreseeable future.
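The tally in this section can be checked with a few lines of arithmetic. The hours are the rounded figures quoted above; GTSinger is given only as "tens of hours", so this sketch does not assign it a number and instead derives what is left over after the itemized corpora. The variable names are illustrative.

```python
# Hours as quoted in this section. GTSinger is described only as
# "tens of hours", so it is derived as part of the remainder below.
stated_hours = {
    "OpenCpop": 5,
    "OpenSinger": 50,
    "M4Singer": 29,
    "PopBuTFy": 50,
    "PopCS": 5,
    "VocalSet": 10,
}

APPROX_TOTAL = 230  # the rough ceiling quoted above

stated_sum = sum(stated_hours.values())
remainder = APPROX_TOTAL - stated_sum  # GTSinger plus the smaller corpora

# VocalSet is the only itemized singing corpus this post lists as CC BY 4.0.
commercial_share = stated_hours["VocalSet"] / APPROX_TOTAL

print(f"Itemized corpora: {stated_sum} hours")
print(f"Left for GTSinger and smaller corpora: about {remainder} hours")
print(f"Commercially licensable share: {commercial_share:.1%}")
# Itemized corpora: 149 hours
# Left for GTSinger and smaller corpora: about 81 hours
# Commercially licensable share: 4.3%
```

On these figures, roughly 10 of 230 hours carry a commercial license, and those 10 hours are technique demonstrations rather than songs.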

What "non-commercial" actually means

Some teams assume that "non-commercial" only prohibits selling the dataset itself, and that training commercial products on it is still allowed. This is not correct. Creative Commons defines non-commercial use as use "not primarily intended for or directed towards commercial advantage or monetary compensation." Training a model that will be sold, licensed, or embedded in a commercial product is commercial use.

There is a gray area around whether academic research followed by commercialization constitutes commercial use of the original training data. The conservative reading is that once the trained model is deployed commercially, the underlying training data use becomes commercial retroactively. The aggressive reading is that research use is separate from productionization. Courts have not definitively resolved this question.

The safe posture for enterprise teams is to treat non-commercial licenses as prohibiting any use that contributes to a commercial product. This is what acquirer legal teams will assume during due diligence, and it is what rightsholders will argue in a dispute.

Why academic datasets exist despite the limitations

The point is not to criticize academic datasets. They serve their purpose: they enable research, benchmark comparisons, and reproducible results. The authors who release them typically do so on research-only terms because commercial licensing is not their business model and dealing with commercial licensees would slow them down.

The issue is that enterprise teams treat academic datasets as if they were a plausible commercial source, which they are not. The solution is not to expect academic datasets to change their licensing terms. It is to recognize that commercial training data requires commercial licensing.

What commercial licensing adds

A commercial vocal dataset license provides several things that academic datasets cannot:

Commercial use permission. The core legal difference.
AI training-specific rights. Academic datasets rarely contemplate AI training explicitly in their licensing terms, even when the license is permissive. Commercial licenses can address this directly.
Explicit performer consent for AI use. Academic datasets sometimes come from performers who never specifically consented to AI training (they consented to research use). Commercial datasets built for AI training collect explicit consent for that specific use.
Indemnification. Academic datasets come with no warranty. Commercial licenses can include indemnification against third-party IP claims.
Scale matched to production needs. Academic datasets are sized for research papers (typically 5-50 hours). Commercial datasets can scale to production requirements (hundreds to thousands of hours).
Metadata optimized for production. Academic datasets have metadata optimized for the authors' specific research questions. Commercial datasets have metadata optimized for downstream ML pipelines.
Ongoing updates. Academic datasets are typically snapshot releases. Commercial datasets can include ongoing updates, new recordings, and expanded coverage as part of the license term.

A hybrid strategy

The best training data strategy for most teams combines academic datasets (for research experiments, ablations, and technique comparisons) with commercial datasets (for production training and deployment). Academic data is used during R&D phases where commercial exposure is low. Commercial data is used for the production model that actually ships.

This approach respects the licensing terms of each dataset and also matches the economic realities: academic data is free but limited to research use, while commercial data is paid but cleared for production use. Using the right tool for the right phase of the project is usually cheaper overall than trying to force one approach to cover both phases.
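One way to enforce that separation is a guard in the training pipeline that refuses research-only data in production runs. The sketch below is a hypothetical illustration: the `COMMERCIAL_OK` table copies the "Usability for commercial training" verdicts from this post (with VocalSet's "Limited" treated as allowed and VCTK marked as not usable for singing), `blocked_datasets` is an invented helper, and `licensed_catalog_v1` is a placeholder name for a commercially licensed corpus. It also assumes research runs never feed the model that ships.

```python
# Verdicts copied from the "Usability for commercial training" lines above.
# VocalSet is "Limited" in the post; it is allowed here as supplementary data.
COMMERCIAL_OK = {
    "MUSDB18": False,
    "DSD100": False,
    "VCTK": False,  # commercially licensed, but speech rather than singing
    "OpenCpop": False,
    "OpenSinger": False,
    "M4Singer": False,
    "NUS-48E": False,
    "VocalSet": True,
    "CSD": False,
    "JVS-MuSiC": False,
    "GTSinger": False,
}


def blocked_datasets(phase, datasets, licensed=frozenset()):
    """Return the datasets that must not be used in the given phase.

    phase: "research" or "production".
    licensed: names of corpora covered by a commercial license.
    Unknown datasets are blocked in production by default.
    """
    if phase not in ("research", "production"):
        raise ValueError(f"unknown phase: {phase!r}")
    if phase == "research":
        return []
    return [
        name
        for name in datasets
        if name not in licensed and not COMMERCIAL_OK.get(name, False)
    ]


if __name__ == "__main__":
    run = ["OpenCpop", "VocalSet", "licensed_catalog_v1"]
    print(blocked_datasets("production", run, licensed={"licensed_catalog_v1"}))
```

A production job would fail fast if the returned list is non-empty, while the same dataset list passes unchanged in the research phase.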

What The Vocal Market offers that academic datasets cannot

Our enterprise vocal dataset licensing program provides what academic datasets structurally cannot: explicit commercial use rights, explicit AI training authorization, performer consent collected specifically for AI training, indemnification against third-party IP claims, and a catalog sized for production training (500+ recordings, 150+ vocalists, 16 genres, 4 languages, with both dry and wet versions and full metadata).

We also complement academic datasets rather than replacing them. If your research pipeline uses OpenCpop and VocalSet for technique benchmarking, you can continue to use them for that purpose while licensing our catalog for production training. The two are not mutually exclusive, and the licensing allows clean separation between the research and production phases of your project.

If you are currently running research on academic datasets and are starting to plan the commercial training phase, request a sample dataset and tell us what your research corpus looks like. We will help you size a production-grade license that picks up where the academic data leaves off.
