If your ML background is in computer vision, NLP, or tabular data, the audio world comes with its own vocabulary and a few genuinely confusing distinctions. Most of the terms are borrowed from music production, where they have been stable for decades. A few of them have been reused by the ML community in ways that do not perfectly match the original meaning. This post is a short glossary to help you navigate.
The glossary is organized roughly by topic: sources and types, processing states, container formats, and audio signal properties. For each term we give the definition, how ML engineers typically use it, and common sources of confusion.
Sources and types
A cappella
A vocal performance or recording without instrumental accompaniment. The Italian phrase literally means "in the chapel style," referring to unaccompanied choral music. In music production the term is used more loosely: any vocal recording without instruments is "a cappella," whether solo or multi-part, whether recorded dry or processed.
In the ML context, "a cappella" usually means "vocal-only audio," which makes it useful training data for singing voice synthesis, voice cloning, and vocal analysis tasks. Confusion arises because a cappella tracks released commercially are usually wet (processed), whereas the cleanest training material is dry (unprocessed). See the entries for "dry" and "wet" below.
Stem
In music production, a stem is a track or group of tracks that forms a specific component of a mix. A song might have a vocal stem, a drum stem, a bass stem, and a keys stem. Each stem is the audio for that component, isolated from the others.
The word is used inconsistently. Sometimes "stem" means an individual track (just the lead vocal). Sometimes it means a submix (all vocals together: lead, harmonies, adlibs). For ML training purposes, you almost always want individual tracks, not submixes. If a vendor says "vocal stems," ask whether that means individual vocal tracks or a submix of all vocal-related tracks bundled together.
Isolated vocals
A vocal recording with no other instruments present. The term is functionally the same as "a cappella" but with stronger connotations of being a production-ready standalone recording.
Critical distinction: "isolated vocals" can mean either (a) a vocal that was recorded alone in the studio (genuinely isolated) or (b) a vocal that was extracted from a full mix using source separation software (reconstructed isolation). The two are not equivalent. Genuinely isolated recordings have no bleed, no artifacts, no phase issues. Separated recordings have all of those, just at varying levels of severity.
Lead vocal
The primary vocal line in a song. Usually the one carrying the melody and lyrics. In a multi-layered vocal arrangement, the lead is distinguished from harmonies, backing vocals, and adlibs.
Harmony
Additional vocal lines that sing different notes simultaneously with the lead, creating harmonic intervals (thirds, fifths, octaves). In a recorded context, harmonies are usually separate tracks from the lead vocal.
Adlib
Improvised or semi-improvised vocal additions on top of the lead. Common in R&B, hip-hop, and pop. Usually short phrases, runs, or exclamations that add texture but do not carry the main melody.
Background vocal (BG vocal, bg, bgv)
A general term for any non-lead vocal. Includes harmonies, choir sections, and atmospheric vocal textures. Often bundled into a single "BG" stem in professional productions.
Processing states
Dry
Audio with no processing applied. For a vocal, "dry" means the raw microphone signal with no reverb, compression, EQ, or effects. Dry vocals sound close and unprocessed, often uncomfortably so because listeners are used to hearing vocals with some amount of production.
Dry is the cleanest training material for most ML use cases because the model learns only the voice, not the voice plus effects.
Wet
Audio with processing applied. For a vocal, "wet" means the signal has been through some combination of reverb, delay, compression, EQ, de-essing, and other effects. Wet vocals sound like what you hear in a finished song: polished, sitting in a space, shaped to fit a mix.
Wet is useful for training when you want the model to produce production-ready outputs without post-processing. It is limiting when you want flexibility, because the effects are baked into the training signal.
Raw
Sometimes used interchangeably with "dry," but it can also mean "unedited": the original recording with no cuts, comping, or cleanup applied. A raw vocal includes all the breaths, retakes, and mistakes; a dry vocal contains only the final take, just without processing.
Mixed / mixdown
The final combination of all tracks in a song into a single stereo (or surround) audio file. A mixdown is the output of the mixing process and contains everything: vocals, instruments, effects, panning, levels. For ML training, mixdowns are the least useful format because there is no way to isolate components without source separation.
Master / mastered
A mixdown that has been through a final polishing stage called mastering. Mastering applies broad EQ, compression, loudness normalization, and format adjustments to prepare the mix for distribution. Mastered audio is even further from raw training material than a mixdown because it has had additional global processing applied.
Container formats
WAV
Uncompressed audio container format, almost always holding raw PCM. The standard for studio recording and ML training because it preserves the full audio signal without any compression artifacts. WAV files are large (roughly 10 MB per minute of stereo audio at 44.1 kHz / 16-bit, 15 MB per minute at 44.1 kHz / 24-bit) but lossless.
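Those size figures follow directly from the PCM math. A minimal sketch (the function name is ours, and it assumes stereo PCM and ignores the small WAV header):

```python
def wav_minutes_to_mb(sample_rate_hz, bit_depth, channels=2, minutes=1):
    """Approximate uncompressed PCM size in megabytes (header ignored)."""
    bytes_per_second = sample_rate_hz * (bit_depth // 8) * channels
    return bytes_per_second * 60 * minutes / 1_000_000

print(wav_minutes_to_mb(44_100, 16))  # 10.584 MB per stereo minute
print(wav_minutes_to_mb(44_100, 24))  # 15.876 MB per stereo minute
```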
FLAC
Lossless compressed audio format. Retains the full audio signal but uses compression to reduce file size by roughly half compared to WAV. FLAC is effectively equivalent to WAV for training purposes — no information is lost in the compression.
MP3, AAC, OGG, OPUS
Lossy compressed audio formats. Discard some audio information to achieve smaller file sizes. For training a model that will produce high-quality output, lossy-compressed source audio is a problem because the compression artifacts get learned as part of the signal. If a dataset is delivered in MP3 format, that is typically a sign that the original source was also lossy (consumer material, streaming rips, etc.) rather than studio masters.
AIFF
Another uncompressed format, historically associated with Apple systems. Equivalent to WAV for practical purposes.
Audio signal properties
Sample rate (Hz, kHz)
The number of audio samples per second. 44,100 Hz (44.1 kHz) is the CD and consumer music standard. 48,000 Hz is the broadcast and film standard. 96 kHz and 192 kHz are used in some high-end production contexts. By the Nyquist theorem, a sample rate captures frequencies up to half its value, so 44.1 kHz already covers the full ~20 kHz range of human hearing; for most ML training on music and voice, 44.1 or 48 kHz is sufficient.
Bit depth
The number of bits used to represent each audio sample. 16-bit is the CD standard and captures roughly 96 dB of dynamic range. 24-bit is the studio standard and captures roughly 144 dB of theoretical dynamic range, pushing the quantization noise floor far below audibility and leaving generous headroom during recording. For training, 24-bit is preferred when available.
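The 96 dB and 144 dB figures come from the roughly 6 dB-per-bit rule. A quick sketch (helper name is illustrative):

```python
import math

def dynamic_range_db(bits):
    """Theoretical dynamic range of n-bit PCM: 20*log10(2**n), about 6.02 dB per bit."""
    return 20 * math.log10(2 ** bits)

print(round(dynamic_range_db(16), 1))  # 96.3
print(round(dynamic_range_db(24), 1))  # 144.5
```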
Dynamic range
The ratio between the loudest and quietest parts of a recording. High dynamic range means the recording has both quiet and loud passages. Low dynamic range means the recording is relatively uniform in loudness. Heavily compressed recordings have low dynamic range.
Signal-to-noise ratio (SNR)
The ratio of the desired signal (the voice) to the background noise (hiss, room tone, electrical interference). High SNR is good. Studio recordings should have SNR above 60 dB.
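As a sketch, SNR can be estimated from the RMS of a signal segment and a noise-only segment (the helper and sample values are invented for illustration):

```python
import math

def snr_db(signal, noise):
    """SNR in dB: 20*log10(rms of signal / rms of noise)."""
    rms = lambda xs: math.sqrt(sum(x * x for x in xs) / len(xs))
    return 20 * math.log10(rms(signal) / rms(noise))

# A signal with 1000x the RMS of the noise floor sits at 60 dB SNR.
print(round(snr_db([0.5, -0.5], [0.0005, -0.0005]), 1))  # 60.0
```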
Headroom
The difference between the peak level of a recording and the maximum possible level (0 dBFS, or "digital full scale"). Positive headroom means the peaks sit below the ceiling. Zero headroom means the loudest moment is exactly at 0 dBFS. Fixed-point samples cannot exceed 0 dBFS, so "negative headroom" in practice means either a floating-point file whose peaks exceed full scale, or a recording that was already clipped at the ceiling.
Clipping
When a recording exceeds 0 dBFS, the waveform is truncated at the ceiling, producing a harsh distortion. Clipped recordings have lost information that cannot be recovered. For training, clipped source material introduces distortion that the model will learn as part of the signal.
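Headroom and clipping are both easy to check programmatically. A minimal sketch on float samples normalized to [-1, 1] (the names and the clipping heuristic are ours, not a standard API):

```python
import math

def peak_dbfs(samples):
    """Peak level in dBFS for float samples normalized to [-1, 1]."""
    peak = max(abs(s) for s in samples)
    return 20 * math.log10(peak) if peak > 0 else float("-inf")

def headroom_db(samples):
    """Distance from the peak to digital full scale (0 dBFS)."""
    return -peak_dbfs(samples)

def looks_clipped(samples, ceiling=0.999, run_length=3):
    """Heuristic: consecutive samples pinned at the ceiling suggest clipping."""
    run = 0
    for s in samples:
        run = run + 1 if abs(s) >= ceiling else 0
        if run >= run_length:
            return True
    return False

print(round(headroom_db([0.5, -0.5, 0.25]), 1))  # 6.0 dB of headroom
print(looks_clipped([1.0, 1.0, 1.0, 0.2]))       # True
```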
F0 (fundamental frequency)
The primary pitch of a voice, measured in Hz. When a singer hits the note A4 at standard concert pitch, the F0 of their voice is 440 Hz. The F0 contour is the curve of the pitch over time as it moves through a vocal phrase. F0 tracking is a fundamental step in most singing voice synthesis pipelines.
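F0 relates to named notes through the equal-temperament formula. A minimal sketch assuming A4 = 440 Hz (the function name is ours):

```python
def midi_to_hz(midi_note, a4_hz=440.0):
    """Equal-temperament pitch: A4 is MIDI note 69, and each semitone is a factor of 2**(1/12)."""
    return a4_hz * 2 ** ((midi_note - 69) / 12)

print(midi_to_hz(69))            # 440.0 (A4)
print(round(midi_to_hz(60), 2))  # 261.63 (middle C, C4)
```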
Formant
Resonant frequencies of the vocal tract that shape the timbre of a voice. Formants are a large part of what makes a singer's voice recognizable, and different vowel sounds have different formant patterns. Models learn formant structure implicitly from training data.
Phoneme
The smallest unit of sound in a language. "Cat" has three phonemes: /k/, /æ/, /t/. Phoneme alignment (mapping each phoneme to a time range in the audio) is required for controllable singing voice synthesis.
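As a sketch, a phoneme alignment is just a list of symbols with time ranges (the schema and timings below are invented for illustration; real alignment tools have their own formats):

```python
from dataclasses import dataclass

@dataclass
class PhonemeSpan:
    """One aligned phoneme: its symbol plus a start/end time in seconds."""
    phoneme: str
    start: float
    end: float

# "cat" aligned against a short clip (illustrative values, not real aligner output)
cat = [
    PhonemeSpan("k", 0.00, 0.08),
    PhonemeSpan("ae", 0.08, 0.31),
    PhonemeSpan("t", 0.31, 0.40),
]

for span in cat:
    print(span.phoneme, round(span.end - span.start, 2))
```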
Source separation and its limitations
Source separation
The task of extracting individual sources (vocals, drums, bass, other) from a fully mixed audio file. Modern source separation models like HTDemucs can achieve signal-to-distortion ratios around 9 dB, which is impressive but not equivalent to a true isolated recording.
SDR (signal-to-distortion ratio)
A metric for evaluating source separation quality. Higher is better. An SDR above 6 dB is considered good; above 8 dB is excellent. The current state of the art for music source separation is approximately 9.2 dB on MUSDB18-HQ benchmarks. No separation model achieves perfect separation.
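In its simplest form, SDR compares the power of the reference signal to the power of the residual error. A sketch of that simplified definition (not the full BSSEval decomposition used in benchmarks):

```python
import math

def sdr_db(reference, estimate):
    """Basic SDR: 10*log10(power of reference / power of the error)."""
    error = [r - e for r, e in zip(reference, estimate)]
    p_ref = sum(r * r for r in reference)
    p_err = sum(x * x for x in error)
    return 10 * math.log10(p_ref / p_err)

ref = [0.0, 1.0, 0.0, -1.0]
est = [0.1, 0.9, 0.1, -0.9]  # a small residual error on every sample
print(round(sdr_db(ref, est), 1))  # 17.0
```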
Bleed
Audio from one source leaking into another source. In a multi-track recording, bleed happens when a microphone picks up sound from a nearby instrument. In source separation, bleed is residual audio from one source that remains in the extracted stem of another source. Vocal stems extracted from a full mix via source separation typically contain some bleed from drums, bass, and other instruments.
Phase artifacts
Distortions that arise from imperfect separation in the frequency domain. Spectrogram-based separation methods (like early Spleeter versions) are particularly prone to phase artifacts because they reconstruct the time-domain signal from a modified spectrogram without perfectly handling phase relationships. Time-domain methods (like Demucs) handle phase better but still leave residual artifacts.
Terms that get conflated
Some terms in the audio and ML worlds get used interchangeably when they shouldn't. Here are the most common sources of confusion.
- "A cappella" vs "isolated vocal" vs "vocal stem": All three can refer to vocal-only audio, but the provenance matters. A cappella can be a standalone vocal performance. An isolated vocal can be a studio recording or a separation output. A vocal stem is a track from a multi-track production. When in doubt, ask how the audio was produced.
- "Dry" vs "raw": Dry means unprocessed. Raw can mean unprocessed or it can mean unedited (including takes, mistakes, breaths). Ask specifically which is meant.
- "Stems" (individual tracks) vs "stems" (submixes): Both usages exist. Confirm whether the vendor means individual tracks or submix groups.
- "High quality" vs "high fidelity": "High quality" is a marketing term with no fixed meaning. "High fidelity" originally referred to playback equipment but in audio production usually means high sample rate and high bit depth. Neither term tells you anything about the actual signal characteristics.
- "Isolated" (genuinely recorded alone) vs "isolated" (separated from mix): Critical distinction for training. Ask how the file was created.
Why this matters for training
The vocabulary in this glossary is not just jargon. Each term implies a specific set of properties about the audio, and those properties determine what the audio is useful for. A "wet a cappella stem extracted from source separation" is a fundamentally different training resource from a "dry lead vocal recorded as an isolated studio track," even if both are described casually as "isolated vocals."
When you are evaluating a vocal dataset, the precision of the vendor's language is a strong signal. Vendors who use these terms precisely tend to have the underlying craft to go with them. Vendors who use the terms loosely often have datasets with loose properties to match.
What The Vocal Market provides
Our enterprise vocal dataset consists of genuinely isolated vocal recordings captured in professional studios by professional vocalists. Every recording is a lead vocal (with harmonies and adlibs delivered as separate tracks where applicable), captured dry at 44.1 kHz / 24-bit as uncompressed WAV files. Wet versions are processed from the dry originals and delivered alongside. Nothing in the catalog was extracted from a full mix via source separation, so there is no bleed, no phase artifact, no SDR ceiling to worry about.
If you want to hear the difference between a true studio isolate and a source-separated stem, request a sample and we will include a comparison pair: one of our dry studio vocals and an equivalent file extracted from a full mix using HTDemucs. The quality gap is immediately audible on headphones.