Every few months we get an email from a founder asking the same question in slightly different words. It goes something like: "We are a small team, we have a clever idea for a music model, and we need training data. Can we just scrape YouTube? It's free, it's massive, and everybody does it."
The short answer is that you can, but you should not. The longer answer is that the companies that have already tried are now paying for it in court, and the mechanism they are getting sued under is not the one most people expect. This post walks through why stream-ripping from YouTube is a particularly bad legal strategy for an AI music startup, what the Suno amended complaint actually alleges, and what the DMCA's anti-circumvention provisions do to your exposure.
The obvious problem: copyright
Let's dispatch the obvious part first. Every song on YouTube that was not uploaded by its rightsholder is there either because the rightsholder allows it, because the rightsholder tolerates it, or because the rightsholder has not yet sent a takedown notice. None of those scenarios constitute a license to copy the audio file for commercial AI training.
Copying a copyrighted sound recording without authorization is copyright infringement under 17 U.S.C. §106(1), which grants the copyright holder the exclusive right to reproduce the work. Training a machine learning model requires reproducing the work inside the training pipeline, typically multiple times: once to download it, again to preprocess it, and again every time a data loader serves it to the model. That is copying, full stop.
The common defense is fair use. Fair use is a four-factor analysis, and the factor that most often goes against AI defendants is Factor 4, the effect on the potential market for the original work. In Thomson Reuters v. Ross Intelligence, decided February 11, 2025, Judge Stephanos Bibas called Factor 4 "undoubtedly the single most important element of fair use" and found it weighed against Ross. In Bartz v. Anthropic, Judge William Alsup held on June 23, 2025 that while training on lawfully acquired books could be fair use, downloading and retaining pirated copies was not. That distinction is what drove the case to a $1.5 billion settlement.
If your AI model generates music, and the training data was music, Factor 4 is always going to be a problem. You are building something that competes in the same market as the inputs. That is exactly what Factor 4 is designed to stop.
The non-obvious problem: the DMCA
Here is the part most founders miss. Even if you won the fair use argument, the DMCA would still be a problem.
Section 1201 of the Digital Millennium Copyright Act, 17 U.S.C. §1201, prohibits circumventing technological protection measures that control access to copyrighted works. This is the "anti-circumvention" provision, and it is entirely separate from the copyright infringement analysis. Even if the underlying copying qualified as fair use, circumventing the protection measure is a separate violation with its own statutory penalties.
YouTube implements access controls. The specific mechanism is a rolling cipher that obfuscates the video URLs and requires a client to execute JavaScript to decrypt them. When you use a stream-ripping tool or download script to pull audio from YouTube, that tool is bypassing the rolling cipher. That is circumvention under Section 1201.
Why this matters for your exposure
Section 1201 carries its own statutory damages, ranging from $200 to $2,500 per act of circumvention under 17 U.S.C. §1203(c)(3)(A). "Per act" is ambiguous, and rightsholders tend to argue that each ripped file counts as a separate act. If you scraped 100,000 audio files, the theoretical DMCA exposure at the statutory ceiling is $250 million before you even get to the underlying copyright claim.
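A back-of-envelope sketch of that arithmetic, assuming the rightsholder-friendly reading that each ripped file is one "act" (the file count here is illustrative, not a claim about any real dataset):

```python
# Back-of-envelope DMCA exposure under 17 U.S.C. §1203(c)(3)(A),
# which sets statutory damages per act of circumvention.
# Assumes (as rightsholders argue) each ripped file = one act.

STATUTORY_MIN = 200    # dollars per act, statutory floor
STATUTORY_MAX = 2_500  # dollars per act, statutory ceiling

def dmca_exposure(num_files: int) -> tuple[int, int]:
    """Return (floor, ceiling) theoretical statutory exposure in dollars."""
    return num_files * STATUTORY_MIN, num_files * STATUTORY_MAX

low, high = dmca_exposure(100_000)
print(f"${low:,} - ${high:,}")  # $20,000,000 - $250,000,000
```

Even the statutory floor, $20 million on 100,000 files, is an extinction-level number for a seed-stage company.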
In practice, courts do not always award the statutory maximum. But the mere fact that the exposure exists gives rightsholders enormous settlement leverage. It is the reason the RIAA's complaints against Suno and Udio included DMCA claims alongside the copyright claims. The DMCA claims are the cudgel that makes the settlement conversation serious.
What the Suno amended complaint actually alleges
On September 22, 2025, the plaintiffs in UMG v. Suno filed an amended complaint that materially changed the case. The original complaint, filed in June 2024, alleged copyright infringement based on the assumption that Suno had trained on major-label recordings. The September 2025 amendment added a specific factual allegation: Suno obtained its training data by stream-ripping YouTube.
The significance of this amendment is subtle but important. Stream-ripping is a circumvention activity under Section 1201. By alleging stream-ripping specifically, the plaintiffs added DMCA claims to the case. That is a different legal theory with different damages, different defenses, and a different settlement calculus.
It also changed the public narrative around Suno's defense. Before the amendment, Suno was arguing that its training was transformative fair use. After the amendment, Suno had to defend against a theory that its training data was obtained by breaking through access controls in the first place, which is a much harder story to tell a jury.
The settlement with Warner Music followed on November 25, 2025. The terms were not disclosed, but the settlement included a commitment by Suno to phase out its current models and replace them with new, licensed models. You do not phase out your core models voluntarily. You do it because the alternative is worse.
The "but I didn't rip it myself" defense
A common strategy in the AI training space is to use a third-party dataset that was already scraped, arguing that the training company itself did not perform the circumvention and therefore is not liable under Section 1201.
This defense is weaker than it looks. Section 1201 prohibits both direct circumvention and trafficking in circumvention tools. If you run the ripping tool yourself, that is direct circumvention under §1201(a)(1). Using a dataset that you knew was obtained by circumvention can be analyzed as secondary liability for that circumvention. And the trafficking provisions, §1201(a)(2) and §1201(b), separately reach anyone who provides a tool whose "primary purpose" is to circumvent, which is why the dataset vendor's exposure does not insulate yours.
More practically, the dataset provider's knowledge can be imputed to you. During discovery in an AI copyright case, the plaintiffs' lawyers will subpoena the dataset provider, obtain their source documentation, and establish that the training data was obtained through scraping. Your defense then becomes "we did not know," which is not a legal defense to copyright infringement (strict liability) and is a difficult factual defense under the DMCA if your engineers discussed the data source in internal messages.
We know from discovery in AI cases that internal discussions about training data are routinely produced. If any engineer ever wrote "we got this from YouTube scrapes, right?" in Slack, that message is going to end up in a plaintiff's summary judgment motion.
The "everyone does it" defense
The other common founder argument is that scraping is industry standard practice and therefore not actually risky. It's worth naming the logical structure of this argument: it assumes that because the biggest AI companies have not yet been fully litigated to resolution, their practices are safe to copy.
This is the wrong inference. The biggest AI companies are being litigated to resolution right now: the OpenAI cases, the Anthropic cases, the Suno and Udio cases, the Concord Music case, the Kadrey case, the NYT case, the Authors Guild case, the Stability AI cases, the GEMA case. What looked like a stable "industry practice" in 2023 is currently being priced in the form of multi-hundred-million and multi-billion dollar settlements.
The $1.5 billion Bartz v. Anthropic settlement is the clearest signal in the current market. It is the largest copyright settlement in U.S. history, and it was triggered specifically by the distinction between legally acquired and pirated training data. Your startup does not have $1.5 billion. Your startup probably does not even have $15 million for legal fees to defend a lawsuit to conclusion.
Why this matters even if you are small
A founder might object: "Fine, but the RIAA is going after Suno because Suno is big. They are not going to notice a small startup."
There is a kernel of truth here. The RIAA prioritizes enforcement against high-visibility targets because those actions have deterrent effect. A two-person AI startup is not getting a complaint filed against it tomorrow.
But the framing misses three things:
- Acquirers notice. When you go to sell your company, the acquirer's legal team is going to ask where the training data came from. If the answer is "YouTube," the deal either dies or the acquirer demands a massive indemnification carve-out that functionally transfers the liability to you personally.
- Investors notice. Institutional VCs now routinely ask about training data provenance during due diligence. If your answer is unclear, your valuation drops. If your answer is "we scraped," you may not be fundable at all by the top funds.
- Small companies become big. Suno was a two-person startup once. The exposure you accrue today compounds as your company grows. By the time you are big enough to be noticed, you are also big enough to be sued, and your training data has been in production for three years.
The time to decide on training data is when you are small. Decisions made in year one are effectively locked in for the lifetime of the product because retraining is expensive and model weights are path-dependent.
What the settled cases tell us about the price of fixing it later
The two cleanest settlement data points in the current environment are Warner-Suno and UMG-Udio. Both deals closed in late 2025. Neither has published financial terms, but the structural elements are instructive.
- Warner-Suno: Suno committed to phase out current models and launch new licensed models. Warner became a commercial partner. Suno acquired Songkick from WMG as part of the deal.
- UMG-Udio: UMG and Udio agreed to build a joint AI music platform launching in 2026. Artists opt in and are compensated when their work is used in training. The platform is structured as a "walled garden" where outputs cannot be freely exported.
In both cases, the settlement includes a commitment to move from scraped to licensed data. That move is not free. It requires retraining models, rebuilding datasets, negotiating artist-level economics, and in some cases giving up product features that depended on the scraped data (downloads, style imitation, etc.).
The cost of that transition is something you can amortize over years if you plan it from the start. It becomes a crisis when you are forced into it mid-litigation.
The alternative
The alternative to scraping is licensing. Licensing has become significantly more accessible and significantly cheaper in the last 18 months as the market has matured. Stable Audio 2.0, for example, is trained exclusively on licensed data from AudioSparx, a stock music library with over 800,000 audio files. Meta's MusicGen was trained on 20,000 hours of licensed music from Shutterstock, Pond5, and internal Meta sources. These are not boutique arrangements — they are functional supply chains for commercial AI.
For vocal-specific training data, the supply chain is still developing, which creates both a risk (fewer vendors, less standardization) and an opportunity (cleaner deal structures, stronger positioning). We built The Vocal Market's enterprise licensing program specifically to be the licensed alternative for vocal stems. Every recording in the dataset was contributed by a professional vocalist under a timestamped, GDPR-compliant agreement that explicitly authorizes AI training use. The documentation is auditable, the consent chain is specific, and the dataset is structured to survive the due diligence phase of any future acquisition or litigation.
If you are weighing "scrape now, deal with it later" against "license from day one," run the numbers honestly. Include the retraining cost if you get sued. Include the legal fees. Include the acquirer discount. Include the VC valuation drop. When you add those costs up, the scraping path is almost always more expensive in expected value, and it is always more volatile. Request a sample dataset and we will walk through the licensing options with your team.
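One hedged way to "run the numbers" is a toy expected-value comparison. Every figure below is an invented placeholder, not an estimate for any real company; the point is the structure of the calculation, not the inputs:

```python
# Toy expected-value comparison: "scrape now, deal with it later"
# vs "license from day one". ALL numbers are illustrative placeholders.

def expected_cost(upfront: float,
                  contingent: dict[str, tuple[float, float]]) -> float:
    """upfront: certain cost in dollars.
    contingent: name -> (probability, cost if it happens)."""
    return upfront + sum(p * cost for p, cost in contingent.values())

scrape = expected_cost(
    upfront=0,  # scraping looks free up front
    contingent={
        "litigation_defense": (0.30, 15_000_000),  # assumed odds and fees
        "forced_retraining":  (0.30, 5_000_000),   # mid-litigation crisis cost
        "acquirer_discount":  (0.50, 10_000_000),  # haircut at exit
    },
)
license_first = expected_cost(upfront=2_000_000, contingent={})

print(f"scrape EV: ${scrape:,.0f}, license EV: ${license_first:,.0f}")
```

With these made-up inputs the scraping path costs several times more in expectation, and that is before accounting for variance: the licensed path is a known number, while the scraped path is a lottery over outcomes that include company-ending ones.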