Will AI companies have to share where they got their training data?

A battle brews between AI companies in need of data and creators who want to protect their work.

An inevitable battle has been brewing over where and how companies like OpenAI and Google get the data they use to train their AI models.


Last week, Rep. Adam Schiff of California proposed a bill that would require companies to be transparent about training data. Typically, companies do not disclose this information.

What would this bill do, exactly?

Under the Generative AI Copyright Disclosure Act, anyone planning to release a generative AI model would have to file paperwork with the US government listing any copyrighted works in the model’s training dataset at least 30 days before its public launch, or face a fine.

This matters because:

  • AI companies claim it’s fair use to train their models on copyrighted materials, and that their tools wouldn’t work without them.
  • Plaintiffs including artists, authors, media companies, and musicians have filed lawsuits alleging that AI models were trained on their work without permission or compensation.

Unsurprisingly, Schiff’s bill is endorsed by myriad media and entertainment organizations, including several Hollywood unions. As you may recall, AI protections were a key part of both the actors’ and writers’ guild strikes.

But AI…

… is but a hungry, hungry caterpillar for data.

Large language models (LLMs) like ChatGPT train on text broken into tokens. Tokens can be single words, parts of words, or characters.
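To make those three granularities concrete, here is a toy sketch in Python. It is not how ChatGPT or any production model tokenizes text: the function names and the tiny `TOY_VOCAB` are made up for illustration, and real tokenizers learn their vocabularies from large amounts of data.

```python
import re

# Hypothetical, hand-picked vocabulary; real models learn theirs from data.
TOY_VOCAB = {"train", "ing", "token", "s"}


def word_tokens(text):
    """Split text into whole words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokens(text):
    """Split text into individual characters."""
    return list(text)


def subword_tokens(word, vocab):
    """Greedily split a word into the longest pieces found in vocab."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining slice first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here: fall back to one character.
            pieces.append(word[i])
            i += 1
    return pieces


if __name__ == "__main__":
    print(word_tokens("AI models train on tokens."))
    print(subword_tokens("training", TOY_VOCAB))
    print(char_tokens("AI"))
```

The same sentence yields far more tokens at the character level than at the word level, which is one reason token counts, not word counts, are the usual measure of training data size.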

AI researcher Pablo Villalobos estimates that GPT-4 trained on up to 12T tokens, per The Wall Street Journal. That’s a lot, but he estimates a GPT-5 would need 60T-100T tokens, roughly five to eight times that figure.

Already, AI companies and experts have expressed concern that they’ll run out of public data within two years. And though some companies are working on ways around this, models typically deteriorate when trained on AI-generated content.

This bill…

… would add more legislative backing for content creators, if passed, giving them more power to negotiate their own terms.

Either way, the debate will continue as tech companies race to build the greatest model and creators try to prevent their work from being used to train their robot replacements.
