"High quality" is one of those phrases that marketing pages love and ML engineers distrust. If you are evaluating vocal datasets for training a singing voice synthesis model, the word is meaningless unless you can decompose it into measurable attributes. This post does exactly that.
The quality of a vocal dataset has six orthogonal dimensions. A dataset can score well on some and badly on others. A dataset that is excellent on five out of six can still be unusable if it fails on the sixth. Below we walk through each dimension, describe what "good" looks like, and explain how to test a sample before committing to a full licensing deal.
Dimension 1: Signal quality
The first and most basic question is whether the audio itself is clean. This is not about whether the vocalist can sing. It is about whether the recording signal carries the vocalist's performance without contamination.
Sample rate and bit depth
For modern singing voice synthesis, the practical options are 44.1 kHz and 48 kHz. 44.1 kHz is the consumer music standard and captures frequency content up to 22.05 kHz (the Nyquist limit). 48 kHz is the broadcast standard and captures up to 24 kHz. For the purposes of vocal training, 44.1 kHz is usually sufficient because the human voice rarely contains useful harmonic content above 18 kHz.
Bit depth should be 24-bit for studio recordings. 16-bit indicates either consumer-grade source material or recordings that have been through a reduction step, which loses dynamic range headroom. If a vendor is offering 16-bit material for enterprise AI training, ask specifically whether the recordings were captured natively at 16-bit (legacy material) or reduced from 24-bit at some point (lossy processing).
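Checking sample rate and bit depth takes seconds and catches the "reduced from 24-bit" problem before any listening test. Here is a minimal sketch using Python's standard-library `wave` module; it handles only uncompressed PCM WAV files, and the function name is ours, not a vendor convention:

```python
import wave

def check_format(path):
    """Return (sample_rate, bit_depth) for an uncompressed PCM WAV file.

    Bit depth is the sample width in bytes times 8, so 24-bit studio
    material reports sampwidth 3 -> 24 bits.
    """
    with wave.open(path, "rb") as w:
        return w.getframerate(), w.getsampwidth() * 8
```

Run it across every file in a sample delivery; a mix of 16-bit and 24-bit results is itself a red flag, because it means the recordings did not come from one consistent capture chain.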
One counterintuitive point: higher sample rates are not automatically better for training. The HiFiSinger paper (arXiv:2009.01776) showed that moving from 24 kHz to 48 kHz created wider spectrum bands and longer waveforms that made acoustic models and vocoders struggle to converge. Dedicated architectures are needed to actually benefit from the extra bandwidth. For most production use cases, 44.1 kHz is the sweet spot.
Signal-to-noise ratio
Studio recordings should have a noise floor below -60 dBFS. Anything higher introduces background hiss that the model will learn as legitimate signal, producing outputs with baked-in noise that is impossible to remove downstream. When you receive a sample dataset, run a quick noise-floor measurement on the silent intros and outros of a few tracks. If the noise floor is inconsistent across recordings, the dataset was captured in different environments and will need normalization.
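The noise-floor measurement described above is a one-liner once you have the samples as a float array. A minimal sketch, assuming the first half-second of each track is silent (no vocal yet) and the samples are normalized to [-1.0, 1.0]; the function name is illustrative:

```python
import numpy as np

def noise_floor_dbfs(samples, sr, head_seconds=0.5):
    """Estimate the noise floor from a track's silent intro, in dBFS.

    Takes the RMS level of the first `head_seconds` of audio. For a
    studio-grade recording this should come out below -60 dBFS.
    """
    head = samples[: int(sr * head_seconds)].astype(np.float64)
    rms = np.sqrt(np.mean(head ** 2))
    if rms == 0:
        return -np.inf  # digital silence (likely a gated or edited intro)
    return 20 * np.log10(rms)
```

Note the digital-silence case: a reading of exactly negative infinity usually means the intro was gated or edited, which tells you nothing about the real room noise. Measure a quiet passage mid-track instead.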
Dynamic range and clipping
Clipped recordings are common in material that has been through streaming mastering or consumer processing. Look for peaks that hit exactly 0 dBFS with flat tops in the waveform view. A clipped recording has lost information that cannot be recovered, and the model will learn the clipping as part of the signal.
Dynamic range should be wide enough to capture both quiet passages and loud crescendos. If every recording has been hit with heavy compression (DR under 6 dB), the dataset will produce outputs that sound compressed by default, which is fine for pop but limiting for classical or jazz training.
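Flat-topped peaks can be detected programmatically rather than by eyeballing waveforms. A minimal sketch: it flags runs of consecutive samples pinned at full scale, which is the signature a limiter or ADC overload leaves behind (the threshold and run length are our illustrative defaults):

```python
import numpy as np

def clipping_ratio(samples, threshold=0.999, min_run=3):
    """Fraction of samples sitting in flat-topped runs at full scale.

    A single sample touching 0 dBFS can be a legitimate peak; three or
    more consecutive samples at full scale is almost always clipping.
    """
    hot = np.abs(samples) >= threshold
    clipped = 0
    run = 0
    for h in hot:
        if h:
            run += 1
        else:
            if run >= min_run:
                clipped += run
            run = 0
    if run >= min_run:
        clipped += run
    return clipped / len(samples)
```

Anything above a fraction of a percent on a vocal stem is worth a manual listen; sustained belted notes that were limited on the way in will show up here immediately.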
Dimension 2: Isolation
The second dimension is whether the vocal signal is actually isolated or whether it contains bleed from other sources. This is the single biggest quality differentiator between truly studio-recorded datasets and datasets built from source-separated stems.
The problem with source-separated stems
Modern source separation models like HTDemucs can extract a vocal stem from a mixed track with a signal-to-distortion ratio around 9 dB. That is impressive, but it is not equivalent to a dry studio recording. The extracted stem still contains:
- Reverb tails from the room or hall in which the mix was produced
- Harmony bleed from background vocals that share frequency content with the lead
- Phase artifacts from the separation process itself
- Transient smearing around consonants and fast vocal passages
- Spectral bleed from instrumental elements that overlap the vocal range
All of these contaminations get learned by the model. Train on separated stems and you get a model that produces outputs with inherited separation artifacts. The outputs may be indistinguishable to a casual listener but will be audibly degraded to a producer or engineer.

The research literature on singing voice synthesis is explicit about this. As the DiffSinger paper notes, research datasets use "solo vocals in controlled environments with limited effects" for a reason. The signal is cleaner and the model learns voice rather than voice-plus-room-plus-processing.
How to test for isolation
Take a few tracks from the sample dataset and do the following:
- Load them into a spectrogram viewer and look for horizontal bands of energy in the silent passages. True silence should look black. Bleed shows up as faint horizontal lines.
- Listen with headphones to the tails of vocal phrases. A dry studio recording cuts cleanly when the vocalist stops singing. Reverb tails and bleed from other instruments are audible.
- Run the tracks through a phase-invert comparison with the original mix (if available). A cleanly isolated stem should not phase-cancel with anything else in the mix.
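The phase-invert step can be scripted when the vendor supplies both the stem and the original mix. A minimal sketch of the null test, assuming both signals are mono float arrays already sample-aligned (alignment is the hard part in practice, and the function name is ours):

```python
import numpy as np

def null_test_db(mix, stem):
    """Phase-invert null test: subtract the stem from the mix and report
    the residual level relative to the mix, in dB.

    For a stem lifted directly from the session, mix - stem removes the
    vocal completely and the residual is pure instrumental. A
    source-separated stem leaves vocal remnants and artifacts behind,
    which you will hear when you audition the residual.
    """
    residual = mix - stem
    res_rms = np.sqrt(np.mean(residual.astype(np.float64) ** 2))
    mix_rms = np.sqrt(np.mean(mix.astype(np.float64) ** 2))
    return 20 * np.log10(res_rms / mix_rms)
```

The number alone is not a verdict; the point is to render the residual to a file and listen. Ghostly vocal fragments in the residual mean the "stem" was separated, not tracked.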
Dimension 3: Dry vs wet, processed vs unprocessed
This is a close cousin of isolation but worth treating separately. Isolation asks "does the vocal have bleed from other sources?" Processed/unprocessed asks "has the vocal itself been altered?"
Dry stems are unprocessed vocal recordings straight from the microphone, possibly with basic gain staging but nothing else. They are the rawest form of the performance and the most flexible training material because any effect you want (reverb, compression, EQ) can be added later and the model's outputs will match your processing chain.
Wet stems are vocals with effects already applied, usually including reverb, compression, EQ, and de-essing. They are ready-to-use in a professional mix context and reflect a specific production aesthetic. A model trained exclusively on wet stems will produce outputs that sound pre-processed, which may be desirable (pop production context) or undesirable (research application that wants to apply custom processing downstream).
The best enterprise datasets include both versions of each recording: a dry version for flexible training and a wet version for production-aesthetic training. This doubles the effective dataset size without requiring additional recording sessions and gives downstream users the ability to choose their training target.
Dimension 4: Metadata
A vocal dataset is only as useful as the metadata that describes it. Without metadata, every training run requires manual labeling or automated extraction, which adds cost and introduces errors. With rich metadata, the same dataset can support multiple model architectures and conditioning strategies.
The baseline metadata for enterprise-grade vocal training data includes:
| Field | What it enables | Priority |
|---|---|---|
| Genre | Conditional generation, genre-specific fine-tuning | Essential |
| BPM | Tempo alignment, temporal conditioning | Essential |
| Key | Key-aware generation, harmonic conditioning | Essential |
| Vocalist gender | Gender-balanced training, conditional generation | Essential |
| Vocal type (lead, harmony, adlib) | Role-specific training | Essential |
| Language | Multilingual training, language filtering | Essential |
| Phoneme alignment | Controllable SVS, lyric-to-voice modeling | High |
| F0 (pitch) contour | Pitch-aware generation, expression transfer | High |
| Vocal range (low/high note) | Range-matched generation | Medium |
| Vocal technique (belt, mix, head voice) | Style transfer, technique conditioning | Medium |
| MIDI score (if applicable) | Score-to-audio training, DiffSinger-style models | Medium |
| Recording conditions (mic, room) | Acoustic filtering, robustness training | Nice-to-have |
| Lyrics (text) | Text-to-singing, lyric-conditioned generation | High |
| Licensing and consent ID | Compliance, withdrawal tracking | Essential |
Notice the last row. Consent and licensing metadata are as important as technical metadata because they are what make the dataset defensible if a vocalist ever withdraws consent or if your legal team needs to audit the source of a specific recording.
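A completeness check over the essential rows of the table is easy to automate before any licensing conversation. A minimal sketch, with our own illustrative field names (match them to whatever keys the vendor's metadata files actually use):

```python
# Essential fields from the metadata table, under our hypothetical naming.
REQUIRED_FIELDS = {
    "genre", "bpm", "key", "vocalist_gender",
    "vocal_type", "language", "consent_id",
}

def validate_record(record):
    """Return the set of essential metadata fields missing from one record.

    An empty set means the record carries everything marked Essential,
    including the consent ID that makes it legally auditable.
    """
    return REQUIRED_FIELDS - record.keys()
```

Run this over the sample delivery and tally the misses per field; a dataset where 10% of records lack a consent ID is a compliance problem, not a metadata problem.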
Dimension 5: Diversity
A dataset with high signal quality, perfect isolation, and rich metadata is still unusable if every recording is from the same three vocalists singing the same genre in the same language.
Diversity matters across several axes:
- Vocalist count. A dataset with 200 unique vocalists is structurally different from a dataset with 20. More vocalists means better generalization and less overfitting to specific voices.
- Gender distribution. Most commercial vocal datasets skew female because female lead vocals dominate pop. A balanced dataset has roughly equal male and female contributions, ideally with some non-binary representation.
- Language distribution. Open-source singing datasets are overwhelmingly Mandarin-heavy. If your target market is English-speaking, a Mandarin dataset is poorly matched. Multilingual datasets are rare and valuable.
- Genre distribution. Pop, R&B, hip-hop, rock, electronic, folk, classical, jazz, country, reggae, Latin. A dataset with coverage across multiple genres produces more flexible models.
- Age and tonality diversity. A dataset that is 95% Gen Z pop will produce outputs that sound like Gen Z pop. If your target audience is broader, the training data needs to be broader.
Enterprise buyers often ask for a "distribution sheet" showing the count and percentage across each dimension. Any vendor running a serious dataset operation can produce this within a day.
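If the vendor cannot produce a distribution sheet, you can build one yourself from the metadata files in minutes. A minimal sketch over a list of per-track metadata dicts; the field names are our illustrative assumptions:

```python
from collections import Counter

def distribution_sheet(records, fields=("vocalist_id", "gender", "language", "genre")):
    """Count/percentage breakdown across the diversity axes.

    Returns {field: {value: (count, percent)}} -- the same shape as the
    distribution sheet an enterprise buyer would request.
    """
    n = len(records)
    sheet = {}
    for field in fields:
        counts = Counter(r[field] for r in records)
        sheet[field] = {k: (c, round(100 * c / n, 1)) for k, c in counts.items()}
    return sheet
```

For the vocalist axis, the number you actually care about is `len(sheet["vocalist_id"])` (unique vocalists) plus the share held by the top few voices; a dataset where three vocalists account for half the tracks is far less diverse than its headline vocalist count suggests.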
Dimension 6: Alignment
The final dimension is whether the training material is properly aligned for the model architecture you plan to use. Alignment is the process of matching audio frames to linguistic or musical annotations. It is invisible to most users, but it determines whether a model can be trained in days or weeks.
For singing voice synthesis, the relevant alignments are:
- Phoneme alignment. Each syllable in the lyric is mapped to a time range in the audio. This is essential for controllable SVS (DiffSinger, VISinger2) and significantly reduces training complexity.
- Note alignment. If the dataset includes MIDI or musical scores, each note should be mapped to a time range in the audio. This enables score-to-audio training.
- F0 contour. The pitch contour should be extracted per-frame using a robust estimator like RMVPE or CREPE. Hand-corrected F0 is the gold standard but rarely available.
Hand-aligned datasets are expensive to produce. Most vendors use automated alignment tools (Montreal Forced Aligner, Whisper-based tools) and offer the alignments "as-is." The question for a buyer is how accurate the automated alignments are and whether there is any hand-correction pass. For research-grade SVS, hand-corrected alignment is typically required. For commercial fine-tuning on top of a pre-trained base model, automated alignment is often sufficient.
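To make the F0-contour requirement concrete, here is what per-frame pitch extraction looks like in miniature. This is a naive autocorrelation estimator for illustration only; production pipelines use RMVPE or CREPE, which handle octave errors, breathiness, and unvoiced frames far better than this sketch:

```python
import numpy as np

def f0_contour(samples, sr, frame=2048, hop=512, fmin=65.0, fmax=1000.0):
    """Per-frame F0 estimate via autocorrelation (illustration only).

    For each hop-spaced frame, finds the lag with the strongest
    self-similarity inside the [fmin, fmax] pitch range and converts
    it to a frequency. Returns 0.0 for silent frames.
    """
    lo, hi = int(sr / fmax), int(sr / fmin)
    f0 = []
    for start in range(0, len(samples) - frame, hop):
        x = samples[start:start + frame].astype(np.float64)
        x = x - x.mean()
        ac = np.correlate(x, x, mode="full")[frame - 1:]
        if ac[0] <= 0:
            f0.append(0.0)  # silent frame, no pitch
            continue
        lag = lo + int(np.argmax(ac[lo:hi]))
        f0.append(sr / lag)
    return np.array(f0)
```

The gap between this and a hand-corrected contour is exactly the gap the buyer question above is probing: automated extraction gets the broad shape right and stumbles on vibrato extremes, fry, and breathy onsets, which is where hand-correction earns its cost.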
The high-quality dataset scorecard
When evaluating a sample dataset, score it against these six dimensions on a simple 1-to-5 scale. A dataset that scores 4 or 5 on every dimension is production-ready. A dataset that scores below 3 on any dimension is a risk.
Dataset scorecard
- Signal quality: Sample rate, bit depth, SNR, dynamic range
- Isolation: Dry vs separated, bleed level, phase integrity
- Processing state: Dry version available, wet version available, unprocessed option
- Metadata completeness: Essential fields, high-priority fields, consent tracking
- Diversity: Vocalists, gender balance, languages, genres
- Alignment: Phoneme, note, F0, hand-correction presence
A failure on signal quality is unrecoverable. A failure on isolation can sometimes be papered over with careful fine-tuning but shows up in outputs. A failure on metadata is recoverable but expensive. A failure on diversity limits the model's applicability. A failure on alignment means you are going to spend weeks generating your own alignments before training can even start.
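The scorecard rule above reduces to a few lines of code, which is useful when you are comparing several vendors side by side. A minimal sketch, with our own illustrative function name and dimension keys:

```python
def dataset_verdict(scores):
    """Apply the scorecard rule to {dimension: 1-5 score}.

    Below 3 on any dimension -> "risk"; 4 or better on every
    dimension -> "production-ready"; anything in between needs a
    closer look.
    """
    if any(s < 3 for s in scores.values()):
        return "risk"
    if all(s >= 4 for s in scores.values()):
        return "production-ready"
    return "needs review"
```

The asymmetry noted above still applies on top of the verdict: a 2 on signal quality and a 2 on metadata both return "risk", but only one of those failures can be bought back with money and time.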
How The Vocal Market handles these six dimensions
Our enterprise vocal dataset is built to score high on all six dimensions. Every recording is captured by a professional vocalist at 44.1 kHz / 24-bit in a studio environment with a noise floor below -65 dBFS. Every recording is available in both dry (unprocessed) and wet (produced) formats. Every recording has full metadata including genre, BPM, key, vocal type, gender, and language, along with a unique consent ID tied to the vocalist's agreement. The dataset includes over 500 recordings from more than 150 unique vocalists across 16 genres and 4 languages, with roughly balanced gender distribution.
If you want to evaluate how the dataset scores against the six-dimension framework above, request a sample dataset. We will send you a representative subset along with the metadata files and the measurement numbers for signal quality. You can run your own tests before deciding whether to proceed with a full licensing agreement.