After writing about LLM error rates, I wanted to talk about a specific kind of error: the hallucination. I am aware that there is a lot of research into this subject, so I decided to read a scholarly review:
“Survey of Hallucination in Natural Language Generation” by Ziwei Ji et al. (2023), publicly accessible on arXiv.
I’m not aiming to summarize the entire subject, but rather to answer a specific question: Are hallucinations an effectively solvable problem, or are they here to stay?
What is a hallucination?
“Hallucination” is a term used in the technical literature on AI, but it’s also entered popular usage. I’ve noticed some differences, and I’d like to put the two definitions in dialogue with each other.
The review defines a hallucination as an AI response that is not faithful to the source material. For example, if asked to summarize a paragraph about COVID-19 clinical trials, the AI might claim that China has started clinical trials, even though the paragraph never mentions China. Now, it might actually be true that China has started clinical trials–but if the paragraph didn’t say so, then the AI shouldn’t include the claim in its summary. The review notes the distinction between faithfulness (i.e. to a provided source) and factuality (i.e. aligning with the real world).
As for the public understanding, I think of all the interactions people have with ChatGPT, where it provides a wildly inaccurate answer with complete confidence. For example, my husband asked it for tips on Zelda: Tears of the Kingdom, and it provided a listicle full of actions that are not possible within the game.
One example that garnered public attention was when someone asked Google what to do when cheese falls off a pizza. Google’s AI Overviews suggested that they “add some glue”. It was summarizing a real answer that someone provided (in jest) in a Reddit thread. This was characterized as a hallucination in the media, although I would ask if it really counts by the technical definition. It is, after all, being faithful to a source.
Here are three major differences between the public definition and the scholarly review’s definition:
- The public does not distinguish between factuality and faithfulness. It is assumed either that the source is factually correct, or else that it is the LLM’s responsibility to provide factual information even when given a source that is not factual. (The review observes that researchers also often fail to distinguish factuality from faithfulness. Researchers often make the dubious assumption that sources are factual.)
- The public talks about models being overconfident. It’s not just about models saying something wrong, it’s about the way they say it. People are used to expressing some degree of uncertainty when they feel uncertain, and to picking up on uncertainty in other people. AI models often lack these signals of uncertainty, and this can be a problem in natural conversation. However, this subject is not discussed at all in the review, and so it appears not to be a major research area.
- When we speak of faithfulness to a “source”, what is the source? We can imagine providing the AI with a paragraph and asking it to summarize that paragraph–here, the paragraph is the source. However, what if I ask it for general information without providing any source? Here, the source is the AI’s “memory”. That is to say, we expect the AI to retain information that it learned from training data. Research on hallucinations has primarily looked at the first kind of source, while public discussion has largely centered on the second kind.
Processing machines and knowledge machines
That last distinction is so important, because it gets to the heart of what LLMs are even for. Do we view an LLM as a processing machine, able to process information that is immediately in front of it? Or do we view an LLM as a knowledge machine, dispensing knowledge from its memory banks?
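To make the distinction concrete, here’s a minimal sketch in Python. The generate function is just a stand-in for whatever model API you happen to use, not a real library call.

```python
def generate(prompt: str) -> str:
    """Stand-in for a call to some LLM; plug in your model of choice."""
    raise NotImplementedError

# Processing task: the relevant information is right there in the prompt,
# and we judge the answer by its faithfulness to that provided source.
document = "The city council voted 7-2 on Tuesday to approve the new bus routes."
summary = generate(f"Summarize the following paragraph:\n\n{document}")

# Knowledge task: no source is provided, so the model can only draw on
# whatever it memorized during training (its parametric knowledge).
answer = generate("When did the city council approve the new bus routes?")
```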
The scholarly review discusses hallucinations almost entirely in the context of processing tasks: for example, summarizing a paragraph, answering a question about a paragraph, summarizing data in a table, or translating a sentence into another language.
However, most of the review is about the broader topic of natural language generation. When it shifts focus to the narrower topic of large language models, it also shifts focus from processing tasks to knowledge tasks. By virtue of their size, LLMs have a much greater capacity for parametric knowledge–knowledge that is stored in the model parameters during training. So when researchers explore hallucinations in LLMs, they focus on models’ faithfulness to parametric knowledge, rather than faithfulness to provided documents.
In my opinion, there are two completely distinct research questions: Do LLMs perform well as processing machines? Do LLMs perform well as knowledge machines?
Researchers seem to be aware of the distinction, but I think they put far too little emphasis on it. In a more recent paper titled “AI Hallucinations: a Misnomer Worth Clarifying”, the authors survey the definition of “hallucination” across the literature, and conclude that the term is too vague and that it may be stigmatizing toward people with mental illness. I don’t disagree with those points, but it’s telling that the authors don’t even mention the distinction between hallucinating from a document and hallucinating from parametric knowledge. I would go so far as to say that AI researchers are dropping the ball.
This also has important consequences for communicating with the public. For example, let’s look at the Wikipedia article on AI hallucinations. Wikipedia describes several possible causes (drawing from the very same scholarly review that I am reading). Did you know, one of the causes of hallucinations is that the AI places too much emphasis on parametric knowledge? This only makes sense once we realize that it’s talking about hallucinations that occur during processing tasks, not during parametric knowledge recall. But the general public largely reads the article through the lens of parametric knowledge hallucinations, so to them it will be confusing and misleading.
Can we rely on parametric knowledge?
Let’s return to my opening question. Are hallucinations an effectively solvable problem, or are they here to stay? More specifically, is it possible to solve parametric knowledge hallucinations? Because if not, then maybe we should focus on using LLMs as processing machines rather than knowledge machines. We should be educating the general public about this. We should be educating the CEOs who are deciding how to invest in AI.
I have good reason to ask the question. Fundamentally, parametric knowledge seems like an inefficient way to store knowledge. Imagine freezing and compressing all of human knowledge into a few hundred gigabytes of matrices. Imagine that whenever we want to look up one little thing, no matter how small, we need to load up all those hundreds of gigabytes and do some matrix multiplication with them. That’s effectively what LLMs are doing! Surely it would be more efficient to use a conventional database.
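To put some rough numbers on that intuition, here’s a back-of-envelope sketch. Every figure is an illustrative assumption, not a measurement of any real system.

```python
# Suppose a 70-billion-parameter model stored in 16-bit weights.
params = 70e9
bytes_per_param = 2
model_size_gb = params * bytes_per_param / 1e9
print(f"weights involved in answering any query: ~{model_size_gb:.0f} GB")

# Rule of thumb: a transformer forward pass costs roughly 2 FLOPs per
# parameter per generated token.
flops_per_token = 2 * params
tokens_in_answer = 50
print(f"compute for one short answer: ~{flops_per_token * tokens_in_answer:.1e} FLOPs")

# By contrast, a keyed lookup in a conventional database touches a handful
# of index pages: kilobytes of data and microseconds of work.
```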
So what does the review say about it? There’s a short section discussing methods to mitigate hallucination in large language models. Most methods are variants on the theme of “fix the training data”. Getting rid of low quality data, getting rid of duplicate data, punishing answers that users mark as non-factual. Is it enough? I can’t say.
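To give a flavor of the data-side fixes: the simplest of them, removing exact duplicates from the training corpus, can be sketched in a few lines. Real pipelines use fuzzier matching (e.g. MinHash), but the shape of the idea is the same.

```python
import hashlib

def deduplicate(documents: list[str]) -> list[str]:
    """Drop exact duplicates from a corpus by hashing each document."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```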
And then there’s retrieval augmented generation (RAG). This is a technique where responses are enhanced by an external source of information. For example, if you query Google’s Gemini model, it may run a conventional Google search, and then summarize the results. Effectively, this transforms the task from a knowledge task into a processing task. Of course, even processing tasks are still plagued by hallucinations. And then we have the problem of unreliable and inconsistent information–the metaphorical glue on our pizza.
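Here’s a minimal sketch of the idea, with retrieve and generate as stand-ins for a real search backend and a real model API:

```python
def retrieve(query: str, top_k: int = 3) -> list[str]:
    """Stand-in for an external search step (web search, vector database, etc.)."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Stand-in for a call to some LLM."""
    raise NotImplementedError

def answer_with_rag(question: str) -> str:
    # Step 1: fetch documents relevant to the question from an external source.
    documents = retrieve(question)
    # Step 2: ask the model to answer using only those documents. This turns
    # a knowledge task into a processing task: the model's job is now to be
    # faithful to the retrieved text, not to its own memory.
    context = "\n\n".join(documents)
    prompt = (
        "Answer the question using only the sources below. "
        "If the sources don't contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```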
So, I clearly have my prejudices, and I tend to think RAG-type solutions are the best. But I cannot say that my viewpoint is the one exclusively supported by the literature. RAG is just one of the approaches researchers have used to mitigate hallucinations, and it’s a solution that introduces problems of its own. I still think it’s the way to go.
This all leads back to our chronic question: what are LLMs good for? When I consider a potential use case, there are two main questions I ask myself. First, how error tolerant is the task? Second, is it a processing task or is it a knowledge task?
John Morales says
State of the art, at least for free.
Early days. S-Curve.
—
I found this to be an amusing story, and relevant to your post:
https://www.theguardian.com/australia-news/2024/nov/12/real-estate-listing-gaffe-exposes-widespread-use-of-ai-in-australian-industry-and-potential-risks
flex says
We’ve discussed this extensively at our workplace, and our top management is saying they are investing heavily in AI, but the best we can come up with is that an LLM may provide a solution to a fairly difficult data analysis problem: the problem of lessons-learned.
For years we have been solving issues, or non-compliances to requirements, in our products. The issues are reported by our customers, end users, found during testing, etc. And the team finds the root cause and a solution. Then the team is supposed (although many don’t bother) to enter the problem and solution into a lessons-learned database.
When a new project is awarded and designed, the intent is for the engineering team to review the lessons-learned database to avoid making the same mistakes again. However, this is rarely successful. The database is disjointed: a lot of the entries are only partially filled out, the terms used are not consistent (terms of art have changed over the decades), the descriptions of the problems are not consistent, and the problems in the database may be design-adjacent but not close enough to the design being reviewed to even be found.
Now, a lot of problems found in the past have been implemented as DRCs (Design Rule Checks), and a lot of these are automated, but there are a lot of lessons in this database which cannot be implemented as a DRC. But an LLM could be built which could make the connections based on similar (rather than identical) problems or conditions. And as it is used, it could even get better at identifying design weaknesses, as well as offering suggestions based on previous solutions.
But while this type of LLM implementation would require some parametric knowledge, it would mainly be a processing task.
That’s the best application we have been able to come up with.
Siggy says
@flex,
Not bad, as far as use cases go! Sounds like a case for RAG or semantic search. Could be unreliable in practice though, hard to say.
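Roughly what I have in mind, with an invented embed helper standing in for whatever sentence-embedding model you’d actually plug in:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a sentence-embedding model that maps text to a vector."""
    raise NotImplementedError

def find_relevant_lessons(design_description: str, lessons: list[str], top_k: int = 5) -> list[str]:
    """Rank past lessons-learned entries by semantic similarity to a new design."""
    query_vec = embed(design_description)
    scored = []
    for lesson in lessons:
        vec = embed(lesson)
        # Cosine similarity can surface entries about similar problems even
        # when the terminology has drifted over the decades.
        similarity = float(np.dot(query_vec, vec) /
                           (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
        scored.append((similarity, lesson))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [lesson for _, lesson in scored[:top_k]]
```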
grahamjones says
Anil Seth is a psychologist and neuroscientist. In his book Being You, he says that our perceptions are ‘controlled hallucinations’. So in the sense that he means, yes, we’re stuck with hallucinations in all agents, natural or artificial.
Another way of putting it is to say that a perception is a working hypothesis which will be used by the agent to make decisions about actions until something goes wrong. When something does go wrong a new working hypothesis may be generated, usually but not necessarily in the direction of moving closer to reality.
I’m using voice access in Windows. When I said ‘closer to reality’ I got an error ‘We can’t close r to reality because it isn’t open.’ This struck me as serendipitous. In order to improve the factuality of LLMs, we need to open reality to them. Then, when something goes wrong they have a chance of controlling their hallucinations. The experiment might be dangerous of course.
Siggy says
I prefer not to make any comparisons to psychology. The fact that AI hallucinations are named after a psychological phenomenon is frankly a mistake, and scholars in the field know it. It’s only still called that because of inertia.