Can One Chatbot Catch Another’s Lies?
A new approach uses language models to interrogate other language models and sniff out lies
If you ask an artificial intelligence system such as ChatGPT where the Eiffel Tower is, chances are the chatbot will correctly say “Paris.” But repeatedly ask that AI the same question, and you may eventually be told that no, actually, the answer is Rome. This mistake might seem trivial, but it signals a more serious issue plaguing generative AI: hallucination, in which an AI creates content that is unfaithful to reality.
Sometimes, as with the Eiffel Tower example, a hallucination is obvious and harmless. But there are times when a glitch could have dangerous repercussions: an AI could hallucinate when generating, say, medical advice. Because of the way cutting-edge chatbots are built, they tend to present all their claims with uniform confidence—regardless of subject matter or accuracy. “There is no difference to a language model between something that is true and something that’s not,” says AI researcher Andreas Kirsch, who was formerly at the University of Oxford.
Hallucinations have proved elusive and persistent, but computer scientists are refining ways to detect them in a large language model, or LLM (the type of generative AI system that includes ChatGPT and other chatbots). And now a new project aims to check an LLM’s output for suspected flubs—by running it through another LLM. This second AI system examines multiple answers from the first, assessing their consistency and determining the system’s level of uncertainty. It’s similar, in principle, to realizing that a certain person is prone to “inconsistent stories,” says Jannik Kossen, a Ph.D. student at the University of Oxford and an author of a new study published in Nature. AI systems cross-examining one another isn’t a new idea, but Kossen and his colleagues’ approach outperforms previous methods at spotting hallucinations.
AI Lie Detectors
The study authors focus on a form of LLM hallucinations they identify as “confabulations”—arbitrary and incorrect statements. Unlike other types of AI errors, which might arise from incorrect training data or failures in reasoning, confabulations result from the inherent randomness of a model’s generation process.
But using a computer to detect confabulations is tricky. “You can [correctly] say the same thing many different ways, and that’s a challenge for any system,” says Karin Verspoor, dean of the School of Computing Technologies at RMIT University in Australia, who was not involved in the study.
To pinpoint when a language model might be confabulating, the new method involves asking a question multiple times to produce several AI-generated answers. Then a second LLM groups these answers according to their meaning; for instance, “John drove his car to the store” and “John went to the store in his car” would be clustered together. This process is repeated for every generated answer.
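In code, that grouping step might look roughly like the sketch below. This is a minimal illustration, not the study’s implementation: the hypothetical same_meaning() function stands in for the second LLM’s judgment of whether two answers say the same thing, and here it simply compares normalized text so the example runs end to end.

```python
def same_meaning(answer_a: str, answer_b: str) -> bool:
    # Stand-in for the second LLM's equivalence check (do the two answers
    # mean the same thing?). Crude text normalization is used purely so the
    # sketch is runnable.
    return answer_a.strip().lower() == answer_b.strip().lower()


def cluster_by_meaning(answers: list[str]) -> list[list[str]]:
    # Group sampled answers into clusters of (approximately) equivalent meaning.
    clusters: list[list[str]] = []
    for answer in answers:
        for cluster in clusters:
            if same_meaning(answer, cluster[0]):  # compare with one representative
                cluster.append(answer)
                break
        else:
            clusters.append([answer])  # no match found: start a new cluster
    return clusters


answers = [
    "The Eiffel Tower is in Paris.",
    "the Eiffel Tower is in Paris.",
    "It's in Rome.",
]
print([len(c) for c in cluster_by_meaning(answers)])  # prints [2, 1]
```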
To determine consistency within these AI-generated responses, Kossen and colleagues compute a new measure that they call “semantic entropy.” If an LLM answers a question in many ways that all mean roughly the same thing, indicating high certainty or agreement in the grouped responses, the LLM’s semantic entropy is deemed low. But if the answers vary widely in meaning, the semantic entropy is considered high—signaling that the model is unsure and may be confabulating responses. If a chatbot’s multiple statements include “The Eiffel Tower is in Paris,” “It’s in Rome,” “Paris is home to the Eiffel Tower” and “in France’s capital, Paris,” this approach could identify “Rome” as the outlier and a probable confabulation.
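The measure itself can be sketched in a few lines. The study weights each meaning cluster by the model’s own probabilities for the answers in it; as a simpler assumption for illustration, the example below weights clusters by how many of the sampled answers land in each one, which is enough to show why a stray “Rome” raises the entropy.

```python
import math


def semantic_entropy(clusters: list[list[str]]) -> float:
    # Shannon entropy over meaning clusters. The study weights clusters by the
    # model's own answer probabilities; counting sampled answers, as done here,
    # is an assumed simplification.
    total = sum(len(cluster) for cluster in clusters)
    probabilities = [len(cluster) / total for cluster in clusters]
    return -sum(p * math.log(p) for p in probabilities)


# Five sampled answers: four mean "Paris," one says "Rome."
clusters = [
    [
        "The Eiffel Tower is in Paris.",
        "It's in Paris.",
        "Paris is home to the Eiffel Tower.",
        "In France's capital, Paris.",
    ],
    ["It's in Rome."],
]
print(round(semantic_entropy(clusters), 2))  # 0.5; it would be 0.0 if every answer agreed
```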
Other antihallucination methods have used LLMs to evaluate generated answers, through approaches such as asking a single model to double-check its own work. But the paired system improves on this, distinguishing correct from incorrect answers with about 10 percent more accuracy, according to the new study.
Evading Detection
Still, the new process isn’t a flawless way to spot AI hallucinations. For one thing, obtaining multiple answers to improve an LLM’s reliability amplifies such a system’s already-high energy consumption. “There’s always the cost-benefit trade-off,” Kirsch says. But he thinks it’s worth the effort to “sample a bit more and pay a little extra to make sure that we avoid as many hallucinations as possible.”
Another problem arises when a model lacks the data to answer a question correctly—which forces it to answer with its most probable guess. In this way, some hallucinations are simply unavoidable. Ask an LLM to summarize new papers on the topic of semantic entropy and it might point to this latest study if it has access to recent publications; if not, it might cite seemingly credible research with reasonable, yet fabricated, authors and titles.
Having new methods to detect confabulations is helpful, but “this particular paper only covers one little corner of this space,” Verspoor says. “We can trust [LLMs] to a certain extent. But there has to be a limit.”