If you are building a music or voice AI product, the legal status of your training data is no longer a theoretical question. It is a cap-table question. It is a due-diligence question. It is the difference between a clean acquisition and a nine-figure settlement.
In the last twelve months, a federal court rejected the AI industry's favorite fair use defense for the first time, a major AI startup signed a $1.5 billion settlement over pirated training data, and two of the three major record labels settled lawsuits against AI music generators in exchange for licensing deals and equity stakes. The rules are being written in real time, and the direction is unambiguous: scraping is getting more expensive, licensing is getting cheaper by comparison, and the companies that can prove where their training data came from are going to win the next cycle.
This guide walks through the current state of the law as of April 2026. Every claim links to a primary source. If your legal team reads this and wants to verify something, they can. That is the bar we are holding ourselves to, because the teams reading this are the ones whose lawyers will read it next.
The short answer
There is no single statute that makes scraping music for AI training illegal in the United States. There is also no statute that makes it legal. The question is governed by copyright law, which asks whether the use was authorized or qualifies for an exception like fair use. In 2026, three developments have made the answer a lot less friendly to scrapers than it was in 2023.
- Courts are rejecting fair use defenses for commercial AI training. The first federal ruling on the question went against the AI defendant. Subsequent rulings have been mixed, but the trend line is clear.
- Pirated training data carries nine-figure settlement exposure. The Bartz v. Anthropic settlement set the price tag at $1.5 billion for one model trained on pirated books.
- The EU AI Act now requires disclosure of training data sources. If you are building a general-purpose model and selling it in Europe, you will have to publish a summary of what you trained on. That disclosure becomes its own liability surface.
If your product touches the EU market, if you ever plan to raise from a top-tier VC, if you intend to be acquired by a strategic buyer, or if you want to sleep at night, the cost-benefit of scraping has flipped. The rest of this post explains why.
The Suno and Udio lawsuits: what actually happened
On June 24, 2024, the Recording Industry Association of America, on behalf of Sony Music Entertainment, Universal Music Group, and Warner Records, filed two parallel lawsuits: one against Suno in the U.S. District Court for the District of Massachusetts, and one against Udio (operated by Uncharted Labs) in the Southern District of New York.
The complaints alleged direct copyright infringement of sound recordings plus violations of the Digital Millennium Copyright Act's anti-circumvention provisions. The plaintiffs sought statutory damages of up to $150,000 per infringed work under 17 U.S.C. §504(c)(2), which is the standard maximum for willful copyright infringement, plus an additional $2,500 per act of circumventing technological protection measures.
The lawsuits were widely covered as a $500 million exposure or "billions in damages," but those totals never appeared in the complaints themselves. The concrete, citable number is $150,000 per work. Given that the complaints referenced thousands of recordings, the theoretical exposure is large, but no specific total has been ordered by any court.
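The exposure arithmetic is simple enough to sketch. The $150,000 per-work and $2,500 per-circumvention figures come from the complaints as described above; the corpus sizes below are hypothetical, chosen only to show how quickly the ceiling compounds.

```python
# Statutory-damages ceiling sketch. The per-work and per-act figures are the
# statutory maxima cited in the complaints; the work counts are HYPOTHETICAL.
STATUTORY_MAX_PER_WORK = 150_000   # 17 U.S.C. §504(c)(2) willful-infringement ceiling
DMCA_PER_CIRCUMVENTION = 2_500     # per act of circumventing a protection measure

def theoretical_exposure(works: int, circumvention_acts: int = 0) -> int:
    """Upper-bound exposure if every work drew the willful maximum."""
    return works * STATUTORY_MAX_PER_WORK + circumvention_acts * DMCA_PER_CIRCUMVENTION

# A hypothetical corpus of 10,000 recordings, each allegedly ripped once:
print(theoretical_exposure(10_000, 10_000))  # 1525000000 -> ~$1.5B
```

At 10,000 works the theoretical ceiling already matches the Bartz settlement figure discussed below, which is why "thousands of recordings" in a complaint is not a rhetorical flourish.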
What happened next
Suno initially defended on fair use grounds, arguing that training a model is transformative use and that the outputs are new creative works. That position started to erode in September 2025, when the plaintiffs filed an amended complaint alleging that Suno obtained its training data by stream-ripping YouTube videos. Stream-ripping involves bypassing technological protection measures, which strengthens the DMCA anti-circumvention claims, and the DMCA carries separate statutory penalties regardless of how the underlying fair use analysis comes out.
Then the settlements started.
The 2025 settlement cascade
- October 29, 2025: Universal Music Group settled its suit against Udio. Terms were not disclosed, but the deal includes a new licensing agreement and a joint AI music platform launching in 2026 where artists opt in and are compensated when their work is used for training.
- November 25, 2025: Warner Music Group settled with Suno. Warner became the first major label to strike a licensing partnership with Suno. Suno acquired Songkick from Warner as part of the deal and committed to phasing out its current models in favor of new, licensed models.
- As of April 2026: Sony Music continues to litigate against both Suno and Udio. UMG continues to litigate against Suno. Warner continues to litigate against Udio. Separate class actions brought by independent artists are pending against both defendants.
The lesson here is not "everything got resolved." The lesson is that major label plaintiffs extracted licensing deals and equity relationships as the price of settlement, and the remaining plaintiffs are still pursuing damages. For anyone building on scraped data, the settlement arithmetic is getting worse month by month.
The precedent that should scare you: Bartz v. Anthropic
The single most citable number in the current AI copyright landscape is $1.5 billion. That is the size of the preliminarily approved settlement in Bartz v. Anthropic, and it is the largest copyright settlement in U.S. history.
The case was decided by Judge William Alsup in the Northern District of California. On June 23, 2025, Judge Alsup issued a split summary judgment:
- Training on legally purchased books was fair use. Judge Alsup called it "exceedingly transformative" and granted summary judgment for Anthropic on that claim.
- Training on pirated books was not fair use. Anthropic had acquired over 7 million books from pirate sites. The judge held that this portion of the training data was not protected by fair use and ordered a trial on damages.
Rather than go to trial on the pirated-books claim, Anthropic agreed to settle for $1.5 billion. Judge Alsup preliminarily approved the settlement on September 25, 2025. For reference, the previous largest copyright settlement in U.S. history was in the low hundreds of millions.
The teaching of Bartz is precise and it is brutal. Even if a court is inclined to hold that AI training is transformative under fair use doctrine, the moment your training data came from an unauthorized source, you lose the fair use shield on that portion of the corpus. And the measure of damages is set per work, not per dataset.
The first ruling to reject AI fair use: Thomson Reuters v. Ross Intelligence
Before Bartz, there was Thomson Reuters v. Ross Intelligence, decided February 11, 2025, in the District of Delaware. Judge Stephanos Bibas granted summary judgment for Thomson Reuters on the question of whether Ross Intelligence could use 2,243 Westlaw headnotes to train its AI legal search tool.
It was the first federal court ruling to reject an AI fair use defense. Judge Bibas weighed all four fair use factors and concluded that Factor 4, the effect on the potential market for the original work, weighed decisively against Ross. He called Factor 4 "undoubtedly the single most important element of fair use."
The usual defense-side response to Ross is that it involved non-generative AI and therefore does not directly apply to models like Suno or ChatGPT. That is technically correct. But the market-substitution logic Judge Bibas applied does carry over. If your AI model can generate outputs that compete in the same market as the training inputs, Factor 4 weighs against you, and the fair use analysis gets much harder.
The mixed signal: Kadrey v. Meta
Two days after Bartz, on June 25, 2025, Judge Vince Chhabria granted summary judgment for Meta in Kadrey v. Meta, a case brought by Richard Kadrey, Sarah Silverman, Ta-Nehisi Coates, Junot Díaz, and other authors over the use of their books to train Llama.
This is where AI defense lawyers like to stop the story. But Judge Chhabria's opinion contained a warning that every plaintiff's lawyer reading this guide has memorized. He wrote that the plaintiffs had failed to develop the right evidentiary record, and that on a properly argued "market dilution" theory, they "very well might have won."
Kadrey is a defensive procedural win for Meta. It is not a precedent endorsing AI training on copyrighted books. When the next plaintiff shows up with a proper market-dilution record, Judge Chhabria has already signaled which way he leans.
Europe: the first copyright ruling against OpenAI happened in Germany
In November 2025, the Munich Regional Court issued the first worldwide copyright ruling against OpenAI. The plaintiff was GEMA, Germany's music rights collecting society. GEMA argued that ChatGPT reproduced copyrighted German song lyrics in its outputs without authorization. The court agreed and granted an injunction. OpenAI is appealing.
The significance of the GEMA ruling is not just the outcome. It is the venue. Europe is moving faster on AI copyright than the United States, and European courts are more willing to treat AI outputs that reproduce copyrighted material as prima facie infringement. If your product is available in the EU, a copyright ruling against you in one Member State can trigger injunctions across the entire single market.
The EU AI Act: disclosure as a liability surface
The EU Artificial Intelligence Act, Regulation (EU) 2024/1689, entered into force on August 1, 2024. Its general-purpose AI provisions became applicable on August 2, 2025, and enforcement begins on August 2, 2026.
The provision that every ML team should be reading is Article 53(1)(d). It requires providers of general-purpose AI models to publish a "sufficiently detailed summary about the content used for training." The European Commission's AI Office published the mandatory Training Data Summary Template on July 24, 2025. The template requires disclosure of:
- Large public datasets used in training
- Licensed and private data sources
- Scraped content, including the most relevant domains
- User-contributed data
- Synthetic data
- The copyright policy the provider follows, including compliance with the Article 4(3) text-and-data-mining opt-out under the CDSM Directive
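To make the template's categories concrete, here is one way a team might model the required disclosure internally. This is a sketch, not the Commission's official schema: the field names and structure are my assumptions, mapped onto the categories listed above.

```python
# Illustrative internal model of the Article 53(1)(d) summary categories.
# Field names are ASSUMPTIONS for illustration, not the official template.
from dataclasses import dataclass, field

@dataclass
class TrainingDataSummary:
    public_datasets: list[str] = field(default_factory=list)
    licensed_sources: list[str] = field(default_factory=list)
    scraped_domains: list[str] = field(default_factory=list)  # "most relevant domains"
    user_contributed: bool = False
    synthetic_data: bool = False
    copyright_policy_url: str = ""  # incl. Art. 4(3) CDSM opt-out compliance

# A hypothetical, fully licensed posture:
summary = TrainingDataSummary(
    licensed_sources=["example-vocal-dataset-v2"],
    copyright_policy_url="https://example.com/copyright-policy",
)
```

Note that every field in this record is something a rightsholder can read. A team whose `scraped_domains` list would be embarrassing to publish should treat that as the compliance signal it is.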
Penalties for non-compliance can reach €15 million or 3% of global annual turnover, whichever is higher. For a large AI company, 3% of global turnover can eclipse any fair use damages award.
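The "whichever is higher" structure matters: below €500 million in global turnover the €15 million floor binds, and above it the 3% term dominates. A minimal sketch of the formula, with hypothetical turnover figures:

```python
# AI Act maximum fine: EUR 15M or 3% of global annual turnover, whichever is higher.
def ai_act_max_fine(global_turnover_eur: float) -> float:
    return max(15_000_000, 0.03 * global_turnover_eur)

print(ai_act_max_fine(100_000_000))     # 15000000.0  (floor binds below EUR 500M)
print(ai_act_max_fine(10_000_000_000))  # 300000000.0 (3% dominates for large firms)
```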
There is a subtler point here that is not being discussed enough. The required disclosure itself becomes a liability surface. Once you publish a summary saying "we scraped 120 million audio files from YouTube," every rightsholder in every jurisdiction where you operate has received a written admission they can cite in a complaint. The AI Act is not just a compliance regime. It is a discovery pipeline.
What the U.S. Copyright Office said
On May 9, 2025, the U.S. Copyright Office released Part 3 of its report on AI and copyright. The report is not binding law, but it is persuasive authority and it has already been cited in active litigation.
The Office's bottom-line conclusion: using pirated works to train commercial AI models that compete in the source market is "unlikely to qualify as fair use." The report also rejected the idea of a compulsory licensing regime, which is the AI industry's preferred backstop. The Office's position is that the answer to training data questions is a market licensing system, not statutory immunity.
When the Copyright Office, the District of Delaware, and Judge Alsup in the Northern District of California all converge on the same basic view in the same twelve-month period, the legal weather is changing.
Japan is the outlier
One jurisdiction bucks the trend. Article 30-4 of Japan's Copyright Act, in effect since January 1, 2019, permits the use of copyrighted works for "information analysis," including AI training, for both commercial and non-commercial purposes. It is the most permissive text-and-data-mining exception in any major economy.
There is an important limitation. The exception does not apply when the purpose of training is "enjoyment of the thoughts or sentiments expressed" in the works. In practice, that means training a general-purpose model is probably fine under Japanese law, but training a style-imitation fine-tune designed to reproduce a specific artist's work is not.
For teams that operate exclusively in Japan and never serve users outside Japan, Article 30-4 is a real competitive advantage. For everyone else, the minute your product is accessible from the EU or the United States, you are back in the regulatory environment described above.
So what counts as safe training data in 2026?
Based on the cases above, here is the working definition enterprise legal teams are converging on for music and voice AI:
| Data source | Legal status | Typical posture |
|---|---|---|
| Scraped from YouTube, Spotify, streaming | High risk. Likely copyright infringement + DMCA circumvention. | Avoid |
| Public datasets (MUSDB18, OpenSinger, etc.) | Usually research-only or CC non-commercial. Commercial use requires separate clearance. | Research only |
| Licensed stock libraries (Audiosparx, Pond5) | Usable if the license specifically permits AI training and opt-outs were honored. | Check license terms |
| Dedicated AI training datasets with consent chain | Lowest risk. Performer consent + explicit AI training rights + auditable records. | Preferred |
| In-house recordings with work-for-hire agreements | Clean but expensive. Common for large tech companies. | Build in-house |
The cleanest position for a music or voice AI company is a combination of the last two rows: licensed datasets where performer consent was collected at the point of recording, plus in-house recordings for anything custom. The middle ground of "we scraped carefully" has collapsed.
What to do if you have already trained on scraped data
Many of the companies reading this are in the awkward middle position: they have already trained a model on data that would not survive discovery, and they are trying to figure out what to do next. Based on the posture of the settled cases, the patterns being used are:
- Phase out affected models. Warner's settlement with Suno included a commitment to phase out the current models and replace them with new licensed ones. Expect similar terms in future settlements.
- Backfill licenses. Post-hoc licensing deals are expensive but they exist. The UMG-Udio deal and the Warner-Suno deal both cover past use as well as future use.
- Rebuild on licensed data for new models. Model retraining is painful, but the cost of retraining is lower than the cost of operating a model that is the subject of active litigation in multiple jurisdictions.
- Document everything going forward. Start tracking data provenance now. If your current dataset will not survive discovery, make sure the next one will.
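What "tracking data provenance" means in practice is an append-only record per training asset, created at ingestion time. Here is a minimal sketch under my own assumptions; the field names and the consent/license identifiers are hypothetical, and a production system would add signatures and storage guarantees.

```python
# Minimal provenance record for one training asset. All field names and
# identifier formats are ILLUSTRATIVE assumptions, not a standard.
import hashlib
from datetime import datetime, timezone

def provenance_record(content: bytes, source: str,
                      license_id: str, consent_doc: str) -> dict:
    """Build an auditable record tying one asset to its license and consent."""
    return {
        "sha256": hashlib.sha256(content).hexdigest(),  # identifies the exact bytes
        "source": source,            # e.g. "licensed-dataset" or "in-house"
        "license_id": license_id,    # points at the signed license agreement
        "consent_doc": consent_doc,  # points at the performer consent record
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

rec = provenance_record(b"raw-audio-bytes", "licensed-dataset",
                        "LIC-001", "CONSENT-001")
```

The point of the content hash is that it survives discovery: you can prove which exact file was trained on and walk backward to the agreement that authorized it.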
The practical takeaway
The question "is it legal to train AI on scraped music" is less useful than the question "what does the risk-adjusted cost of each training data strategy look like for my company in the next 24 months?" That is a calculation every ML team should be running right now, and for most companies the answer has already flipped in favor of licensing.
At The Vocal Market, we built our enterprise vocal dataset licensing program specifically for this moment. Every recording in our catalog was contributed by a professional vocalist who signed a timestamped, GDPR-compliant agreement explicitly authorizing AI training use. The dataset includes dry and wet stems, 16 genres, four languages, and full metadata (BPM, key, vocal type, gender, genre). The consent chain is auditable and documented. If you need clean training data that will survive the discovery phase of any future litigation, request a sample dataset and we will walk your legal team through the consent documentation.