
Running out of data

Published: 3/14/2025, 6:00:00 PM

Houston, we have a problem. We are running out of data.

When it comes to training powerful AI models, we need data.

A lot of data.

For instance, training GPT-4 reportedly took around 10 trillion words.

And it can't be words from just any source, like social media posts.

Garbage in, garbage out: the lower the data quality, the lower the accuracy, reliability, and effectiveness of the model.

So it's better to use high-quality sources, like books, filtered web content, or scientific articles.

The problem is that this stock of data grows much more slowly than our appetite for new training datasets.

In fact, there are predictions that we will run out of high-quality data by 2026, and of low-quality data sometime between 2030 and 2060.
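You can get a feel for why exponential demand overtakes slow supply growth with a back-of-the-envelope projection. The numbers below (stock size, growth rates) are purely illustrative assumptions, not measurements from any study:

```python
# Toy projection: when would a single training run need more text
# than the entire high-quality stock? All numbers are assumptions.

stock = 3_000          # assumed high-quality text stock, in billions of words
stock_growth = 1.07    # assume the stock grows ~7% per year
demand = 300           # assumed words used by this year's largest model (billions)
demand_growth = 2.0    # assume demand doubles every year

year = 2024
while demand < stock:
    stock *= stock_growth    # stock grows slowly (linear-ish in log space)
    demand *= demand_growth  # demand grows exponentially faster
    year += 1

print(year)  # the crossover year under these assumptions
```

With these made-up inputs the crossover arrives within a few years; the point is not the exact date but that doubling demand always catches a single-digit-percent supply, no matter where you start.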

Will it slow down the development of new LLMs, or even stop it?

Can we find a way around it?

Want to know even more?

Join our AI newsletter! You will unlock premium articles and get the latest news!