Artificial intelligence (AI) chatbots like ChatGPT may soon stop getting smarter, as the pool of online data that fuels their training could run dry.

A recent study conducted by Epoch AI suggests that the available pool of publicly accessible training data for AI language models may be exhausted by the early 2030s, raising concerns about the sustainability of current AI advancements.

A photo taken on October 4, 2023 in Manta, near Turin, shows a smartphone and a laptop displaying the logo of OpenAI's ChatGPT chatbot. (Photo: MARCO BERTORELLO/AFP via Getty Images)

Will AI Run Out of Data Soon?

According to Epoch AI's findings, the rapid growth in AI capabilities in recent years has been driven primarily by scaling up models and expanding their training datasets.

However, there is only a finite amount of high-quality, human-generated text data available on the internet, and this data serves as the primary source for training AI language models like ChatGPT.

The study estimates that the effective stock of such data amounts to approximately 300 trillion tokens and projects that AI models could exhaust it sometime between 2026 and 2032.
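For a sense of scale, here is a rough back-of-the-envelope projection in Python. Only the 300-trillion-token stock comes from the study; the starting dataset size and the annual growth rate below are illustrative assumptions, not figures from the paper.

```python
# A rough, illustrative projection, not the study's actual model.
# The 300-trillion-token stock is the study's estimate; the starting
# dataset size (~15 trillion tokens in 2024) and the 2.5x annual growth
# factor are assumptions made here purely for illustration.

STOCK_TOKENS = 300e12       # estimated stock of public human-written text
dataset_tokens = 15e12      # assumed size of a 2024 frontier training set
GROWTH_PER_YEAR = 2.5       # assumed annual growth in dataset size

year = 2024
while dataset_tokens < STOCK_TOKENS:
    year += 1
    dataset_tokens *= GROWTH_PER_YEAR
    print(f"{year}: ~{dataset_tokens / 1e12:.0f} trillion tokens")

print(f"Under these assumptions, a single training run would need "
      f"the entire stock around {year}.")
```

With these assumed numbers the crossover lands in 2028, comfortably inside the 2026-2032 window the study describes; different starting points or growth rates shift the year but not the basic picture.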

The study highlights the critical role of training data in scaling AI models and emphasizes the need for a sustainable approach to AI development. It warns that the current trajectory of AI progress may be unsustainable if companies rely solely on the limited pool of publicly available text data for training.

Furthermore, the study discusses how different scaling strategies affect the timeline for data depletion. It notes that overtraining AI models, training them with fewer parameters but more data, could bring the depletion of available training data forward to as early as 2025.
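A simple, hedged calculation shows why overtraining draws down the data stock faster. The roughly 20-tokens-per-parameter figure echoes the widely cited Chinchilla scaling result; the overtrained ratio and the 70-billion-parameter model size are arbitrary values chosen for this sketch.

```python
# Illustration only: compare the data appetite of a compute-optimal
# training run with an aggressively overtrained one for the same model.

STOCK_TOKENS = 300e12            # the study's estimated public-text stock
PARAMS = 70e9                    # example model size: 70 billion parameters

ratios = {
    "compute-optimal (~20 tokens/param)": 20,
    "overtrained (assumed 100 tokens/param)": 100,
}

for label, ratio in ratios.items():
    tokens_needed = PARAMS * ratio
    share = tokens_needed / STOCK_TOKENS
    print(f"{label}: {tokens_needed / 1e12:.1f}T tokens "
          f"({share:.2%} of the estimated stock)")
```

In this toy example the overtrained run consumes five times as many tokens for the same parameter count, which is the mechanism behind the earlier depletion date.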

The study also acknowledges recent advancements in AI training methodology, such as the use of carefully filtered web data and the ability to reuse data over multiple training passes without significant degradation in model quality.

However, the team warns that even with these advancements, the long-term sustainability of AI development remains uncertain. As the demand for AI capabilities grows, companies may face challenges in sourcing high-quality training data to fuel further advancements in AI technology.


New Innovations Needed to Sustain AI Progress

In response to the looming challenge of data depletion, tech companies like OpenAI and Google are seeking alternative sources of training data, including partnerships with online platforms like Reddit and news media outlets.

However, these efforts may only offer short-term solutions, as the supply of new publicly available text data is finite and subject to depletion over time.

The study suggests that new innovations will be required to sustain AI progress once publicly available training data is depleted. These could include synthetic data generation, the use of alternative data modalities, and improvements in data efficiency.
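As a rough illustration of the synthetic-data idea, the sketch below uses a stand-in generator and a toy quality filter; `generate_text` and `passes_quality_filter` are hypothetical placeholders, not any company's actual pipeline.

```python
import random

def generate_text(prompt: str) -> str:
    """Stand-in for a call to an existing trained language model."""
    templates = [
        f"{prompt} Here is a detailed explanation with worked examples.",
        f"{prompt} ok",  # deliberately low quality; should be filtered out
    ]
    return random.choice(templates)

def passes_quality_filter(text: str) -> bool:
    """Toy filter: keep only reasonably long outputs."""
    return len(text.split()) >= 8

# Generate several candidate samples per prompt and keep the good ones,
# building a small synthetic corpus that could supplement scarce human text.
synthetic_corpus = []
prompts = ["Explain photosynthesis.", "Summarize the French Revolution."]

for prompt in prompts:
    for _ in range(3):
        sample = generate_text(prompt)
        if passes_quality_filter(sample):
            synthetic_corpus.append(sample)

print(f"Kept {len(synthetic_corpus)} synthetic training examples.")
```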

While the future of AI development remains uncertain, the study anticipates continued investment in research and development to address these challenges and drive further advances in AI technology. The research team's findings were published on arXiv.

