If you are new to audio data, the terms "dry" and "wet" in the context of vocal stems might sound like production jargon with no ML relevance. They are not. The choice between training on dry stems or wet stems is one of the most consequential decisions in building a voice or music AI model, and it affects everything from dataset sizing to downstream mix compatibility.
This post explains what dry and wet stems actually are, which use cases each serves, and why the best enterprise datasets include both.
Definitions
A vocal stem is the isolated vocal track from a recording, separated from the instrumental accompaniment. In a professional production, the vocal is recorded onto its own track and then processed through a chain of effects before being mixed into the final song. Depending on where in that chain you capture the audio, the resulting stem is either dry, wet, or something in between.
- Dry stem: The raw vocal signal captured by the microphone, with no reverb, no EQ, no compression, no effects of any kind. Just the voice as it sounded in the recording room.
- Wet stem: The vocal signal after the full production chain has been applied. Typically includes reverb, compression, EQ, de-essing, possibly pitch correction, and any creative effects the producer added.
- Hybrid stem: A stem captured partway through the chain. For example, "dry with compression" or "EQ applied but no reverb." These are less common in commercial datasets.
The distinction is not just aesthetic. Each type contains different information and teaches a model different things.
What dry stems teach a model
A dry stem contains the voice itself, nothing else. A model trained on dry stems learns:
- The acoustic characteristics of the vocalist's voice (formant structure, spectral envelope, timbral variation)
- Pitch control and F0 contour over time
- Articulation of phonemes and consonants
- Breath patterns and microdynamics
- Vocal technique variations (vibrato, belt, head voice transitions)
What a model trained on dry stems does not learn is what the voice "sounds like in a mix." It does not learn reverb tails, compression character, or stereo placement. Those are post-processing decisions, and the model's outputs will reflect that: dry, intimate, and completely unprocessed.
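To make the "pitch control and F0 contour" point concrete, here is a toy sketch of the kind of F0 information a dry stem exposes, using plain autocorrelation on a synthetic frame. This is illustrative only: the frame is a generated tone, not a real vocal, and production pipelines use dedicated estimators such as pYIN or CREPE rather than raw peak picking.

```python
import numpy as np

def estimate_f0(frame, sr, fmin=80.0, fmax=1000.0):
    """Crude F0 estimate for one frame via autocorrelation peak picking."""
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # search lags within [fmin, fmax]
    lag = lo + np.argmax(corr[lo:hi])
    return sr / lag

# Synthetic stand-in for one frame of a dry vocal: a 220 Hz tone plus light noise
sr = 16000
t = np.arange(2048) / sr
frame = (np.sin(2 * np.pi * 220.0 * t)
         + 0.01 * np.random.default_rng(0).normal(size=t.size))
f0 = estimate_f0(frame, sr)  # close to 220 Hz
```

On a wet stem, reverb tails smear the autocorrelation function across frames, which is one small illustration of why dry data gives cleaner pitch supervision.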
When dry stems are the right choice
Dry stems are the correct training target when:
- You plan to apply your own processing downstream. If the model's output is going into a production pipeline where a mixing engineer (human or AI) will add reverb and effects, you want the input to that stage to be clean.
- You are building a voice cloning or TTS-for-singing model. These applications care about voice identity, not production aesthetic. Dry stems give the model the cleanest signal for learning the voice.
- You are building a research-grade SVS model. Research systems like DiffSinger, VISinger2, and NaturalSpeech are typically evaluated on dry singing data because the community treats dry recordings as the cleanest basis for comparison.
- You are training a model for high-end production use. Professional producers want to hear the raw voice so they can apply their own processing, not inherit someone else's choices.
What wet stems teach a model
A wet stem contains the voice plus the entire production chain applied to it. A model trained on wet stems learns everything a dry-trained model learns, plus:
- The reverb character of the production space
- The compression and dynamic envelope of the final mix
- The EQ curve used to fit the vocal into the instrumental
- Creative effects like delay throws, chorus, or distortion
- The stereo imaging and spatial placement of the vocal
The model's outputs will reflect all of this. Trained exclusively on wet stems, a voice model will produce vocals that already sound "finished" — compressed, reverb-wet, EQ-shaped. If your downstream application wants that aesthetic, wet training is a shortcut to it. If your downstream application wants flexibility, wet training is a constraint.
When wet stems are the right choice
Wet stems are the correct training target when:
- You want the model's output to sound "production-ready" without post-processing. Consumer-facing AI music tools often want this: the user should not have to apply their own reverb.
- You are matching a specific production aesthetic. Training on wet stems from 80s pop will produce a model that generates 80s-sounding vocals out of the box, including the characteristic reverb and compression of that era.
- You are building a mix-ready singing model for karaoke or backing-track generation. The output needs to sit in a mix immediately.
- You are doing style transfer research. Wet stems with metadata about the production chain let you train models that can apply or remove specific processing styles.
The subtler problem: what "wet" hides
There is a structural issue with wet stems that dry stems do not have. Wet stems combine the voice signal with the processing chain in a way that the model cannot disentangle without help.
When a model learns from wet stems, it learns "this voice with this reverb with this compression." It cannot cleanly separate those three components. If you later want to generate new vocals with different reverb, you cannot easily ask the model to "use the same voice but dry." The voice and the processing are entangled in the weights.
Some research approaches try to disentangle timbre and effects using factorized latent spaces (NaturalSpeech 3 is a recent example). But these approaches require disentanglement-specific architectures and are still an active research area. For most production use cases, you get the entanglement whether you want it or not.
Dry stems avoid this problem entirely. The voice is the voice. Reverb and effects are added at inference time by a separate stage. The separation is clean.
The case for including both
The best enterprise vocal datasets include both dry and wet versions of every recording. This roughly doubles the effective training material (the two versions are correlated, so the gain is in target diversity rather than independent examples) and gives the downstream user a choice about which format to train on.
The practical workflow looks like this:
- Record the vocal in a studio. This is the raw captured signal.
- Save the raw captured signal as the dry stem. Done.
- Run the dry stem through the full production chain (reverb, compression, EQ, de-essing, any creative processing).
- Save the processed signal as the wet stem.
- Deliver both versions of the same recording in the dataset, labeled clearly.
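Step three is normally a DAW session with studio-grade plugins, but the dry-to-wet transformation can be sketched in code. The chain below is a deliberately minimal stand-in (a static compressor followed by a synthetic convolution reverb); the function names, parameters, and decay shape are all illustrative assumptions, not a real production chain.

```python
import numpy as np

def simple_compressor(x, threshold=0.3, ratio=4.0):
    """Static compressor: reduce gain on samples whose magnitude exceeds threshold."""
    mag = np.abs(x)
    over = mag > threshold
    out = x.copy()
    out[over] = np.sign(x[over]) * (threshold + (mag[over] - threshold) / ratio)
    return out

def simple_reverb(x, sr, decay_s=0.3, wet=0.25, seed=0):
    """Toy convolution reverb: exponentially decaying noise impulse response."""
    rng = np.random.default_rng(seed)
    n = int(decay_s * sr)
    ir = rng.normal(size=n) * np.exp(-5.0 * np.arange(n) / n)
    ir /= np.abs(ir).sum()            # normalize so the tail cannot clip
    tail = np.convolve(x, ir)[: len(x)]
    return (1.0 - wet) * x + wet * tail

sr = 16000
t = np.arange(sr) / sr
dry = 0.8 * np.sin(2 * np.pi * 220.0 * t)   # stand-in for a dry vocal stem
wet_stem = simple_reverb(simple_compressor(dry), sr)
```

Even this toy version shows the key property of the workflow: the wet stem is a deterministic function of the dry stem, so delivering both versions costs one render pass, not a second recording session.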
A downstream user who wants dry training data uses the dry files. A user who wants production-ready training data uses the wet files. A user who wants to experiment with paired training (dry input, wet output, teaching the model an effects chain) uses both.
Paired training: the hidden third use case
Paired dry-and-wet stems enable a training mode that neither format alone can support: teaching a model the transformation itself. If the same recording exists in both dry and wet form, you can train a model to learn the mapping from dry to wet. That is effectively a learned effects chain, and it is a recent area of interest in audio ML research.
This use case is only possible when the dataset includes paired versions. It is one of the reasons serious enterprise datasets include both.
A quick note on "stems vs recordings"
There is a semantic issue worth clarifying. In professional music production, "stems" sometimes refers to submixed groups of related tracks (all drums, all keys, all vocals) and sometimes refers to individual tracks. In the AI training context, "vocal stems" almost always means individual vocal tracks, not a submix.
If you are evaluating a dataset and the vendor refers to "stems," confirm whether they mean individual tracks or submixes. For training purposes, individual tracks are usually what you want. A submix containing lead, harmonies, and adlibs bundled together is less useful because the model cannot cleanly learn any of the components.
The stem separation trap
We should say this explicitly because it is a source of confusion for teams new to audio ML. Some vendors market "stems" that are actually the output of a source separation model (typically HTDemucs, Spleeter, or a similar tool) applied to fully mixed tracks.
These are not true stems. They are reconstructions. The separation process leaves behind bleed, phase artifacts, and residual harmonic content from other sources. A dataset built from separated stems will teach your model those artifacts as legitimate signal, and the model's outputs will inherit them.
When a vendor offers "stems," ask one question: "Were these recorded to isolated tracks in a studio, or extracted from mixed audio?" If the answer is extracted, the quality ceiling is capped by the separation model's SDR, which is currently around 9 dB for the best open-source options. That is not the same as true studio isolation.
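SDR here is the standard energy ratio between the true source and the residual error, expressed in dB. A minimal sketch with synthetic signals; the amplitudes and leakage level below are toy numbers chosen for illustration, not measurements of any real separator.

```python
import numpy as np

def sdr_db(reference, estimate):
    """Signal-to-distortion ratio: true-source energy over residual energy, in dB."""
    err = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(err ** 2))

# Toy numbers: a "separated" stem that keeps 90% of the source plus some leakage
rng = np.random.default_rng(0)
true_stem = rng.normal(size=16000)
separated = 0.9 * true_stem + 0.1 * rng.normal(size=16000)
separation_sdr = sdr_db(true_stem, separated)
```

A true studio isolation has no residual at all, so its SDR against the source is unbounded; an extracted stem is capped at whatever the separator achieves, and everything below that cap is artifact the model will learn.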
Summary: which to use when
| Use case | Recommended training format |
|---|---|
| Voice cloning (identity preservation) | Dry |
| Singing voice synthesis (research) | Dry |
| Singing voice synthesis (production) | Dry with downstream processing |
| Consumer AI music generator (mix-ready output) | Wet |
| Style transfer / aesthetic conditioning | Wet |
| Learned effects chain / dry-to-wet mapping | Paired dry + wet |
| Karaoke / backing track generation | Wet |
| Emotion / prosody modeling | Dry |
What The Vocal Market offers
Our enterprise vocal dataset includes both dry and wet versions of every recording. The dry versions are captured directly from the microphone with no processing applied. The wet versions go through a professional production chain including reverb, compression, EQ, and de-essing, producing mix-ready vocals. Both versions share the same metadata (genre, BPM, key, vocalist, language) and are linked by a common identifier so paired training is straightforward.
If you are evaluating a training approach and want to test both formats before committing, request a sample dataset and ask specifically for the paired dry-and-wet package. We will include representative samples from multiple genres so you can compare training results directly.