AI Faces Text Data Shortage: Study Warns of Looming Human-Written Content Crisis by 2032
Between 2026 and 2032, AI language models may exhaust the supply of publicly available training data. Tech companies are racing to secure high-quality data sources for AI training, and in the long run there may not be enough new content to sustain AI development.
According to a new analysis by the research organisation Epoch AI, tech companies will exhaust the publicly available training data for AI language models roughly by the turn of the decade, sometime between 2026 and 2032.
The paper compares the current situation to a "literal gold rush" that depletes finite natural resources. Tamay Besiroglu, one of the study's authors, cautions that once the reserves of human-generated writing are depleted, the AI sector will struggle to maintain its current pace of advancement.
In the short term, companies such as OpenAI and Google are competing to secure high-quality data sources for training their AI models, signing agreements for a steady flow of text from platforms such as Reddit and news media outlets.
In the long run, however, there will not be enough new content, such as blog posts, news stories, and social media discussion, to sustain AI progress. That puts pressure on companies to tap sensitive data, such as emails or text messages, or to rely on less reliable "synthetic data" produced by the chatbots themselves.
Besiroglu emphasises the gravity of the situation, arguing that data limits will impede the efficient scaling up of AI models. Scaling up has been critical to improving their capabilities and output quality.
The researchers made their original forecasts two years ago, projecting that high-quality text data would run out by 2026. Epoch has since revised that estimate and now foresees running out of public text data within the next two to eight years.
The study, which is peer-reviewed, will be presented at the International Conference on Machine Learning in Vienna, Austria, this summer. Epoch is a nonprofit institute funded by proponents of effective altruism.
The amount of text data fed into AI language models has been growing about 2.5 times per year, while computing power has grown about 4 times per year, meaning compute capacity is outpacing the supply of text. Meta Platforms, the parent company of Facebook, claims that its upcoming Llama 3 model has been trained on up to 15 trillion tokens.
There is a debate about the significance of the data bottleneck. Some argue that building more specialised AI models for specific tasks can be an alternative to training larger models. However, concerns exist about training generative AI systems on their own outputs, which can lead to degraded performance and the amplification of existing mistakes, biases, and unfairness.
As AI developers depend on high-quality, human-written text, platforms such as Reddit and Wikipedia, along with news publishers, are weighing how their data is used. While some limit access to their data, Wikipedia imposes few restrictions on AI companies. There is hope, however, that incentives for human contributions will persist, particularly as low-quality, machine-generated content floods the internet.
According to the study, paying millions of people to write text for AI models would not be a cost-effective way to improve technical performance. Instead, companies such as OpenAI are exploring the use of synthetic data for training, though reservations remain about relying too heavily on this approach.
Source: AP NEWS