Will AI companies have to share where they got their training data?

Published: October 01, 2024

An inevitable battle has been brewing over where and how companies like OpenAI and Google get the data they use to train their AI models.

Last week, Rep. Adam Schiff of California proposed a bill that would require companies to be transparent about training data. Typically, companies do not disclose this information.

What would this bill do, exactly?

Under the Generative AI Copyright Disclosure Act, anyone planning to release a generative AI model would have to submit paperwork to the US government disclosing any copyrighted works in the model’s training dataset at least 30 days before its public launch, or face a fine.

This matters because:

AI companies claim it’s fair use to train their models on copyrighted materials, and that their tools wouldn’t work without them.
Plaintiffs including artists, authors, media companies, and musicians have filed lawsuits alleging that AI models were trained on their work without permission or compensation.

Unsurprisingly, Schiff’s bill is endorsed by myriad media and entertainment organizations, including several Hollywood unions. As you may recall, AI protections were a key part of both the actors’ and writers’ guild strikes.

But AI…

… is but a hungry, hungry caterpillar for data.

Large language models (LLMs) like ChatGPT train on text broken into tokens. Tokens can be single words, parts of words, or characters.
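
To make the three token sizes concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not how ChatGPT actually tokenizes text: production LLMs use learned subword vocabularies, while the `toy_vocab` set and the helper names below are hypothetical and hand-written for illustration.

```python
import re

def word_tokens(text):
    # Word-level: runs of word characters, plus each punctuation mark on its own.
    return re.findall(r"\w+|[^\w\s]", text)

def subword_tokens(text, vocab):
    # Subword-level: greedy longest match against a tiny, hand-written
    # vocabulary (an assumption for this sketch). Falls back to a single
    # character when no vocabulary entry matches.
    tokens = []
    for word in word_tokens(text):
        i = 0
        while i < len(word):
            for j in range(len(word), i, -1):
                if word[i:j] in vocab or j == i + 1:
                    tokens.append(word[i:j])
                    i = j
                    break
    return tokens

def char_tokens(text):
    # Character-level: every non-whitespace character is its own token.
    return [c for c in text if not c.isspace()]

sample = "Training models"
toy_vocab = {"Train", "ing", "model", "s"}  # hypothetical vocabulary

print(word_tokens(sample))                # ['Training', 'models']
print(subword_tokens(sample, toy_vocab))  # ['Train', 'ing', 'model', 's']
print(len(char_tokens(sample)))           # 14
```

The same two words become 2, 4, or 14 tokens depending on the granularity, which is why token counts, not word counts, are the unit used when people talk about training data size.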

AI researcher Pablo Villalobos estimates that GPT-4 trained on up to 12T tokens, per The Wall Street Journal. That’s a lot, but he estimates a GPT-5 would need 60T-100T tokens.

Already, AI companies and experts have expressed concern that they’ll run out of public data within two years. And though some companies are working on ways to change this, AI models typically can’t train on AI-generated content without their output quality deteriorating.

This bill…

… would add more legislative backing for content creators, if passed, giving them more power to negotiate their own terms.

Either way, the debate will continue as tech companies race to build the greatest model and creators try to prevent their work from being used to train their robot replacements.
