If you are new to audio data, the terms "dry" and "wet" in the context of vocal stems might sound like production jargon with no ML relevance. They are not. The choice between training on dry stems or wet stems is one of the most consequential decisions in building a voice or music AI model, and it affects everything from dataset sizing to downstream mix compatibility.
This post explains what dry and wet stems actually are, which use cases each serves, and why the best enterprise datasets include both.
Definitions
A vocal stem is the isolated vocal track from a recording, separated from the instrumental accompaniment. In a professional production, the vocal is recorded onto its own track and then processed through a chain of effects before being mixed into the final song. Depending on where in that chain you capture the audio, the resulting stem is either dry, wet, or something in between.
- Dry stem: The raw vocal signal captured by the microphone, with no reverb, no EQ, no compression, no effects of any kind. Just the voice as it sounded in the recording room.
- Wet stem: The vocal signal after the full production chain has been applied. Typically includes reverb, compression, EQ, de-essing, possibly pitch correction, and any creative effects the producer added.
- Hybrid stem: A stem captured partway through the chain. For example, "dry with compression" or "EQ applied but no reverb." These are less common in commercial datasets.
The distinction is not just aesthetic. Each type contains different information and teaches a model different things.
What dry stems teach a model
A dry stem contains the voice itself, nothing else. A model trained on dry stems learns:
- The acoustic characteristics of the vocalist's voice (formant structure, spectral envelope, timbral variation)
- Pitch control and F0 contour over time
- Articulation of phonemes and consonants
- Breath patterns and microdynamics
- Vocal technique variations (vibrato, belt, head voice transitions)
What a model trained on dry stems does not learn is what the voice "sounds like in a mix." It does not learn reverb tails, compression character, or stereo placement. Those are post-processing decisions, and the model's outputs will reflect that: dry, intimate, and completely unprocessed.
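To make the "pitch control and F0 contour" point concrete, here is a toy sketch of the kind of F0 information a dry stem exposes, using plain autocorrelation on a synthetic frame. This is illustrative only: the frame is a generated tone, not a real vocal, and production pipelines use dedicated estimators such as pYIN or CREPE rather than raw peak picking.

```python
import numpy as np

def estimate_f0(frame, sr, fmin=80.0, fmax=1000.0):
    """Crude F0 estimate for one frame via autocorrelation peak picking."""
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # search lags within [fmin, fmax]
    lag = lo + np.argmax(corr[lo:hi])
    return sr / lag

# Synthetic stand-in for one frame of a dry vocal: a 220 Hz tone plus light noise
sr = 16000
t = np.arange(2048) / sr
frame = (np.sin(2 * np.pi * 220.0 * t)
         + 0.01 * np.random.default_rng(0).normal(size=t.size))
f0 = estimate_f0(frame, sr)  # close to 220 Hz
```

On a wet stem, reverb tails smear the autocorrelation function across frames, which is one small illustration of why dry data gives cleaner pitch supervision.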
When dry stems are the right choice
Dry stems are the correct training target when:
- You plan to apply your own processing downstream. If the model's output is going into a production pipeline where a mixing engineer (human or AI) will add reverb and effects, you want the input to that stage to be clean.
- You are building a voice cloning or TTS-for-singing model. These applications care about voice identity, not production aesthetic. Dry stems give the model the cleanest signal for learning the voice.
- You are building a research-grade SVS model. Research systems like DiffSinger, VISinger2, and NaturalSpeech are typically evaluated on dry singing data because the community treats dry recordings as the cleanest basis for comparison.
- You are training a model for high-end production use. Professional producers want to hear the raw voice so they can apply their own processing, not inherit someone else's choices.
What wet stems teach a model
A wet stem contains the voice plus the entire production chain applied to it. A model trained on wet stems learns everything a dry-trained model learns, plus:
- The reverb character of the production space
- The compression and dynamic envelope of the final mix
- The EQ curve used to fit the vocal into the instrumental
- Creative effects like delay throws, chorus, or distortion
- The stereo imaging and spatial placement of the vocal
The model's outputs will reflect all of this. Trained exclusively on wet stems, a voice model will produce vocals that already sound "finished" — compressed, reverb-wet, EQ-shaped. If your downstream application wants that aesthetic, wet training is a shortcut to it. If your downstream application wants flexibility, wet training is a constraint.
When wet stems are the right choice
Wet stems are the correct training target when:
- You want the model's output to sound "production-ready" without post-processing. Consumer-facing AI music tools often want this: the user should not have to apply their own reverb.
- You are matching a specific production aesthetic. Training on wet stems from 80s pop will produce a model that generates 80s-sounding vocals out of the box, including the characteristic reverb and compression of that era.
- You are building a mix-ready singing model for karaoke or backing-track generation. The output needs to sit in a mix immediately.
- You are doing style transfer research. Wet stems with metadata about the production chain let you train models that can apply or remove specific processing styles.
The subtler problem: what "wet" hides
There is a structural issue with wet stems that dry stems do not have. Wet stems combine the voice signal with the processing chain in a way that the model cannot disentangle without help.
When a model learns from wet stems, it learns "this voice with this reverb with this compression." It cannot cleanly separate those three components. If you later want to generate new vocals with different reverb, you cannot easily ask the model to "use the same voice but dry." The voice and the processing are entangled in the weights.
Some research approaches try to disentangle timbre and effects using factorized latent spaces (NaturalSpeech 3 is a recent example). But these approaches require disentanglement-specific architectures and are still an active research area. For most production use cases, you get the entanglement whether you want it or not.
Dry stems avoid this problem entirely. The voice is the voice. Reverb and effects are added at inference time by a separate stage. The separation is clean.
The case for including both
The best enterprise vocal datasets include both dry and wet versions of every recording. This roughly doubles the effective training material (the two versions are correlated, so the gain is in target diversity rather than independent examples) and gives the downstream user a choice about which format to train on.
The practical workflow looks like this:
- Record the vocal in a studio. This is the raw captured signal.
- Save the raw captured signal as the dry stem. Done.
- Run the dry stem through the full production chain (reverb, compression, EQ, de-essing, any creative processing).
- Save the processed signal as the wet stem.
- Deliver both versions of the same recording in the dataset, labeled clearly.
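Step three is normally a DAW session with studio-grade plugins, but the dry-to-wet transformation can be sketched in code. The chain below is a deliberately minimal stand-in (a static compressor followed by a synthetic convolution reverb); the function names, parameters, and decay shape are all illustrative assumptions, not a real production chain.

```python
import numpy as np

def simple_compressor(x, threshold=0.3, ratio=4.0):
    """Static compressor: reduce gain on samples whose magnitude exceeds threshold."""
    mag = np.abs(x)
    over = mag > threshold
    out = x.copy()
    out[over] = np.sign(x[over]) * (threshold + (mag[over] - threshold) / ratio)
    return out

def simple_reverb(x, sr, decay_s=0.3, wet=0.25, seed=0):
    """Toy convolution reverb: exponentially decaying noise impulse response."""
    rng = np.random.default_rng(seed)
    n = int(decay_s * sr)
    ir = rng.normal(size=n) * np.exp(-5.0 * np.arange(n) / n)
    ir /= np.abs(ir).sum()            # normalize so the tail cannot clip
    tail = np.convolve(x, ir)[: len(x)]
    return (1.0 - wet) * x + wet * tail

sr = 16000
t = np.arange(sr) / sr
dry = 0.8 * np.sin(2 * np.pi * 220.0 * t)   # stand-in for a dry vocal stem
wet_stem = simple_reverb(simple_compressor(dry), sr)
```

Even this toy version shows the key property of the workflow: the wet stem is a deterministic function of the dry stem, so delivering both versions costs one render pass, not a second recording session.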
A downstream user who wants dry training data uses the dry files. A user who wants production-ready training data uses the wet files. A user who wants to experiment with paired training (dry input, wet output, teaching the model an effects chain) uses both.
Paired training: the hidden third use case
Paired dry-and-wet stems enable a training mode that neither format alone can support: teaching a model the transformation itself. If the same recording exists in both dry and wet form, you can train a model to learn the mapping from dry to wet. That is effectively a learned effects chain, and it is a recent area of interest in audio ML research.
This use case is only possible when the dataset includes paired versions. It is one of the reasons serious enterprise datasets include both.
A quick note on "stems vs recordings"
There is a semantic issue worth clarifying. In professional music production, "stems" sometimes refers to submixed groups of related tracks (all drums, all keys, all vocals) and sometimes refers to individual tracks. In the AI training context, "vocal stems" almost always means individual vocal tracks, not a submix.
If you are evaluating a dataset and the vendor refers to "stems," confirm whether they mean individual tracks or submixes. For training purposes, individual tracks are usually what you want. A submix containing lead, harmonies, and adlibs bundled together is less useful because the model cannot cleanly learn any of the components.
The stem separation trap
We should say this explicitly because it is a source of confusion for teams new to audio ML. Some vendors market "stems" that are actually the output of a source separation model (typically HTDemucs, Spleeter, or a similar tool) applied to fully mixed tracks.
These are not true stems. They are reconstructions. The separation process leaves behind bleed, phase artifacts, and residual harmonic content from other sources. A dataset built from separated stems will teach your model those artifacts as legitimate signal, and the model's outputs will inherit them.
When a vendor offers "stems," ask one question: "Were these recorded to isolated tracks in a studio, or extracted from mixed audio?" If the answer is extracted, the quality ceiling is capped by the separation model's SDR, which is currently around 9 dB for the best open-source options. That is not the same as true studio isolation.
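SDR here is the standard energy ratio between the true source and the residual error, expressed in dB. A minimal sketch with synthetic signals; the amplitudes and leakage level below are toy numbers chosen for illustration, not measurements of any real separator.

```python
import numpy as np

def sdr_db(reference, estimate):
    """Signal-to-distortion ratio: true-source energy over residual energy, in dB."""
    err = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(err ** 2))

# Toy numbers: a "separated" stem that keeps 90% of the source plus some leakage
rng = np.random.default_rng(0)
true_stem = rng.normal(size=16000)
separated = 0.9 * true_stem + 0.1 * rng.normal(size=16000)
separation_sdr = sdr_db(true_stem, separated)
```

A true studio isolation has no residual at all, so its SDR against the source is unbounded; an extracted stem is capped at whatever the separator achieves, and everything below that cap is artifact the model will learn.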
Summary: which to use when
| Use case | Recommended training format |
|---|---|
| Voice cloning (identity preservation) | Dry |
| Singing voice synthesis (research) | Dry |
| Singing voice synthesis (production) | Dry with downstream processing |
| Consumer AI music generator (mix-ready output) | Wet |
| Style transfer / aesthetic conditioning | Wet |
| Learned effects chain / dry-to-wet mapping | Paired dry + wet |
| Karaoke / backing track generation | Wet |
| Emotion / prosody modeling | Dry |
What The Vocal Market offers
Our enterprise vocal dataset includes both dry and wet versions of every recording. The dry versions are captured directly from the microphone with no processing applied. The wet versions go through a professional production chain including reverb, compression, EQ, and de-essing, producing mix-ready vocals. Both versions share the same metadata (genre, BPM, key, vocalist, language) and are linked by a common identifier so paired training is straightforward.
If you are evaluating a training approach and want to test both formats before committing, request a sample dataset and ask specifically for the paired dry-and-wet package. We will include representative samples from multiple genres so you can compare training results directly.