Large language models like those offered by OpenAI and Google famously require vast troves of training data to work. The latest versions of these models have already scoured much of the existing internet, which has led some to fear there may not be enough new data left to train future iterations. Some prominent voices in the industry, like Meta CEO Mark Zuckerberg, have posited a solution to that data dilemma: simply train new AI systems on old AI outputs.
But new research suggests that cannibalizing past model outputs would quickly result in strings of babbling AI gibberish and could eventually lead to what’s being called “model collapse.” In one example, researchers fed an AI a benign paragraph about church architecture only to have it rapidly degrade over generations. The final, most “advanced” model simply repeated the phrase “black-tailed jackrabbits” continuously.
A study published in Nature this week put that AI-trained-on-AI scenario to the test. The researchers made their own language model, which they initially fed original, human-generated text. They then made nine more generations of models, each trained on the text output generated by the model before it. The end result in the final generation was nonsensical, surrealist-sounding gibberish that had essentially nothing to do with the original text. Over time and successive generations, the researchers say, their model “becomes poisoned with its own projection of reality.”
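To get a rough feel for how such a generational loop is structured (this is not the paper’s actual method, which used a full language model), here is a minimal sketch. A toy bigram model stands in for the LLM, the file name human_corpus.txt is a hypothetical placeholder for real human-written text, and each generation is trained only on text sampled from the generation before it.

```python
import random
from collections import defaultdict

def train_bigram(text):
    """Build a toy 'model': for each word, the words observed to follow it."""
    words = text.split()
    table = defaultdict(list)
    for a, b in zip(words, words[1:]):
        table[a].append(b)
    return table

def generate(table, start, length=5000):
    """Sample new text by repeatedly picking a random observed successor."""
    out = [start]
    for _ in range(length):
        followers = table.get(out[-1])
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

# Generation 0 trains on human-written text; every later generation
# trains only on the synthetic text produced by its predecessor.
corpus = open("human_corpus.txt").read()  # hypothetical human-written corpus
for gen in range(10):
    model = train_bigram(corpus)
    corpus = generate(model, start=corpus.split()[0])
    print(f"generation {gen}: distinct words remaining = {len(set(corpus.split()))}")
```

Even in this crude setup, the number of distinct words tends to shrink generation after generation, which is the same drift toward sameness the study documents in a far more sophisticated model.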
AI models forget meaning the more they train on themselves
The researchers refer to this odd case of AI seemingly imploding on itself as “model collapse,” a degenerative process that can present itself in early and late-stage forms. On the early side of things, collapse begins to occur when AI models several generations removed from the original training data seemingly forget outliers, or rarities in the original text. This has the effect of making the most likely outputs more and more common. That would be an issue in the real world, because it could result in a whittling down of minority views and forms of expression. An LLM showing signs of early collapse could present a version of reality that lacks diversity and suffers from an overwhelming sameness.
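A simplified way to see why the tails disappear: treat each generation as fitting a word distribution to a finite sample drawn from the previous generation’s output. The words and counts below are invented purely for illustration; the point is only that words with tiny probabilities are often never sampled at all, and once dropped they cannot come back.

```python
import random
from collections import Counter

# Invented counts for illustration: one very common word, a few mid-frequency
# words, and one genuinely rare word sitting in the tail of the distribution.
counts = Counter({"the": 5000, "church": 300, "architecture": 120, "jackrabbits": 3})

for gen in range(10):
    population = list(counts.elements())
    # Each generation "trains" on a finite sample of the previous generation's output.
    sample = [random.choice(population) for _ in range(2000)]
    counts = Counter(sample)
    # Rare words are frequently never drawn, so they vanish from all later generations.
    print(f"gen {gen}: occurrences of the rare word = {counts.get('jackrabbits', 0)}")
```

Run a few times, the rare word usually drops to zero within a handful of generations, while the most common word only tightens its grip.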
Things get weirder in the later stages of collapse. In those last generations, the models trained on models are so far removed from the original training data that they begin to forget key aspects of the initial training and lose the plot entirely. It’s at this stage that models begin generating completely meaningless gibberish. When this happens, the researchers say, the model’s “indiscriminate” self-cannibalizing of its own previous outputs “causes irreversible defects in the resulting model.”
The researchers claim this cascading effect, and the eventual model collapse it produces, is inevitable for large models trained on their own data. It’s important to note that this research focused specifically on language models and does not weigh in on what could happen if multimodal models like image and video generators were trained on themselves. It also zeroes in on what happens when a model trains on its own data; it’s unclear exactly what would happen if one model, say from Meta, were to train on output generated by an OpenAI model.
Preserving original human text could stave off collapse
The prospect of real-world model collapse isn’t an unthinkable hypothetical. Right now, countless websites are up and running featuring articles and blog posts entirely generated by LLMs. In the race to build new models as fast as possible, it’s entirely possible that much of that AI-generated slop could wind up seeping its way into training sets.
One possible safeguard against inadvertently including AI-generated content in training sets would be to encourage a watermarking standard across platforms that clearly marks the authenticity of content and whether or not it was produced by a machine. Google, Adobe, and other big tech players are trying to do just that with a special “content credential” badge they hope to standardize as part of the Coalition for Content Provenance and Authenticity (C2PA).
But that would only apply to images. AI-generated text is much more difficult to watermark feasibly, or even to identify accurately using available detection software. A more realistic approach may require AI developers to scrupulously vet material for signs of AI manipulation, and potentially pay reputable human sources for access to train on their high-quality data. Without those safeguards of human training data, the internet risks being flooded by a wave of AI vomit. Nobody wants that.