Why does AI hallucinate?

18 June 2024 at 04:00

MIT Technology Review Explains: Let our writers untangle the complex, messy world of technology to help you understand what’s coming next. You can read more from the series here.

The World Health Organization’s new chatbot launched on April 2 with the best of intentions. 

A fresh-faced virtual avatar backed by GPT-3.5, SARAH (Smart AI Resource Assistant for Health) dispenses health tips in eight different languages, 24/7, about how to eat well, quit smoking, de-stress, and more, for millions around the world.

But like all chatbots, SARAH can flub its answers. It was quickly found to give out incorrect information. In one case, it came up with a list of fake names and addresses for nonexistent clinics in San Francisco. The World Health Organization warns on its website that SARAH may not always be accurate.

Here we go again. Chatbot fails are now a familiar meme. Meta’s short-lived scientific chatbot Galactica made up academic papers and generated wiki articles about the history of bears in space. In February, Air Canada was ordered to honor a refund policy invented by its customer service chatbot. Last year, a lawyer was fined for submitting court documents filled with fake judicial opinions and legal citations made up by ChatGPT. 

This tendency to make things up—known as hallucination—is one of the biggest obstacles holding chatbots back from more widespread adoption. Why do they do it? And why can’t we fix it?

Magic 8 Ball

To understand why large language models hallucinate, we need to look at how they work. The first thing to note is that making stuff up is exactly what these models are designed to do. When you ask a chatbot a question, it draws its response from the large language model that underpins it. But it’s not like looking up information in a database or using a search engine on the web. 

Peel open a large language model and you won’t see ready-made information waiting to be retrieved. Instead, you’ll find billions and billions of numbers. It uses these numbers to calculate its responses from scratch, producing new sequences of words on the fly. A lot of the text that a large language model generates looks as if it could have been copy-pasted from a database or a real web page. But as in most works of fiction, the resemblances are coincidental. A large language model is more like an infinite Magic 8 Ball than an encyclopedia. 

Large language models generate text by predicting the next word in a sequence. If a model sees “the cat sat,” it may guess “on.” That new sequence is fed back into the model, which may now guess “the.” Go around again and it may guess “mat”—and so on. That one trick is enough to generate almost any kind of text you can think of, from Amazon listings to haiku to fan fiction to computer code to magazine articles and so much more. As Andrej Karpathy, a computer scientist and cofounder of OpenAI, likes to put it: large language models learn to dream internet documents. 
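
To make that loop concrete, here is a minimal sketch in Python. A tiny hand-built table of word-to-word likelihoods stands in for the billions of numbers inside a real model; the table, the words, and the probabilities are all made up for illustration.

```python
# A toy illustration of next-word prediction. A hand-built table of
# word-to-word likelihoods stands in for a real model's parameters.
next_word_probs = {
    "sat": {"on": 0.9, "down": 0.1},
    "on":  {"the": 0.8, "a": 0.2},
    "the": {"mat": 0.5, "cat": 0.3, "floor": 0.2},
}

def generate(prompt_words, steps):
    words = list(prompt_words)
    for _ in range(steps):
        candidates = next_word_probs.get(words[-1], {})
        if not candidates:
            break
        # Pick the highest-scoring candidate, then feed the new sequence back in.
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)

print(generate(["the", "cat", "sat"], steps=3))  # "the cat sat on the mat"
```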

Think of the billions of numbers inside a large language model as a vast spreadsheet that captures the statistical likelihood that certain words will appear alongside certain other words. The values in the spreadsheet get set when the model is trained, a process that adjusts those values over and over again until the model’s guesses mirror the linguistic patterns found across terabytes of text taken from the internet. 

To guess a word, the model simply runs its numbers. It calculates a score for each word in its vocabulary that reflects how likely that word is to come next in the sequence in play. The word with the best score wins. In short, large language models are statistical slot machines. Crank the handle and out pops a word. 
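
A rough sketch of that scoring step, assuming a toy vocabulary and made-up raw scores (logits) in place of a real model's output:

```python
import math

# A minimal sketch of next-word scoring. The vocabulary and the raw scores
# (logits) below are invented; a real model produces them for tens of
# thousands of words at every step.
logits = {"on": 4.1, "under": 1.3, "banana": -2.0, "mat": 0.7}

# Softmax turns raw scores into probabilities that sum to 1.
total = sum(math.exp(v) for v in logits.values())
probs = {word: math.exp(v) / total for word, v in logits.items()}

# Greedy decoding: the word with the best score wins.
best = max(probs, key=probs.get)
print(best, round(probs[best], 2))  # "on" with probability ~0.91
```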

It’s all hallucination

The takeaway here? It’s all hallucination, but we only call it that when we notice it’s wrong. The problem is, large language models are so good at what they do that what they make up looks right most of the time. And that makes trusting them hard. 

Can we control what large language models generate so they produce text that’s guaranteed to be accurate? These models are far too complicated for their numbers to be tinkered with by hand. But some researchers believe that training them on even more text will continue to reduce their error rate. This is a trend we’ve seen as large language models have gotten bigger and better. 

Another approach involves asking models to check their work as they go, breaking responses down step by step. Known as chain-of-thought prompting, this has been shown to increase the accuracy of a chatbot’s output. It’s not possible yet, but future large language models may be able to fact-check the text they are producing and even rewind when they start to go off the rails.
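
As a rough illustration, a chain-of-thought prompt can be as simple as appending an instruction to reason step by step; the wording and the placeholder ask_model call below are generic examples, not any particular product's prompt.

```python
# A rough sketch of chain-of-thought prompting: the same question, asked with
# an extra instruction to reason step by step before answering. ask_model is a
# hypothetical placeholder, not a real API.
question = ("A cafe sells coffee for $3 and muffins for $2. "
            "What do 2 coffees and 3 muffins cost?")

plain_prompt = question
cot_prompt = question + (
    "\nThink through the problem step by step, "
    "then give the final answer on its own line."
)

print(cot_prompt)
# answer = ask_model(cot_prompt)  # hypothetical call to whatever model you use
```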

But none of these techniques will stop hallucinations fully. As long as large language models are probabilistic, there is an element of chance in what they produce. Roll 100 dice and you’ll get a pattern. Roll them again and you’ll get another. Even if the dice are, like large language models, weighted to produce some patterns far more often than others, the results still won’t be identical every time. Even one error in 1,000—or 100,000—adds up to a lot of errors when you consider how many times a day this technology gets used. 
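
The weighted-dice analogy can be sketched in a few lines; the words and probabilities below are invented purely to show how repeated draws from the same skewed distribution still differ.

```python
import random

# A minimal sketch of why probabilistic generation is never identical run to
# run: even heavily weighted choices still vary between draws.
words = ["on", "under", "beside", "banana"]
weights = [0.90, 0.06, 0.03, 0.01]  # made-up probabilities, skewed toward "on"

for trial in range(3):
    sample = random.choices(words, weights=weights, k=10)
    print(trial, sample)  # mostly "on", but low-probability picks still sneak in
```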

The more accurate these models become, the more we will let our guard down. Studies show that the better chatbots get, the more likely people are to miss an error when it happens.  

Perhaps the best fix for hallucination is to manage our expectations about what these tools are for. When the lawyer who used ChatGPT to generate fake documents was asked to explain himself, he sounded as surprised as anyone by what had happened. “I heard about this new site, which I falsely assumed was, like, a super search engine,” he told a judge. “I did not comprehend that ChatGPT could fabricate cases.” 

Chatbot answers are all made up. This new tool helps you figure out which ones to trust.

25 April 2024 at 08:59

Large language models are famous for their ability to make things up—in fact, it’s what they’re best at. But their inability to tell fact from fiction has left many businesses wondering if using them is worth the risk.

A new tool created by Cleanlab, an AI startup spun out of a quantum computing lab at MIT, is designed to give high-stakes users a clearer sense of how trustworthy these models really are. Called the Trustworthy Language Model, it gives any output generated by a large language model a score between 0 and 1, according to its reliability. This lets people choose which responses to trust and which to throw out. In other words: a BS-o-meter for chatbots.

Cleanlab hopes that its tool will make large language models more attractive to businesses worried about how much stuff they invent. “I think people know LLMs will change the world, but they’ve just got hung up on the damn hallucinations,” says Cleanlab CEO Curtis Northcutt.

Chatbots are quickly becoming the dominant way people look up information on a computer. Search engines are being redesigned around the technology. Office software used by billions of people every day to create everything from school assignments to marketing copy to financial reports now comes with chatbots built in. And yet a study put out in November by Vectara, a startup founded by former Google employees, found that chatbots invent information at least 3% of the time. That might not sound like much, but it's a level of error most businesses won't stomach.

Cleanlab’s tool is already being used by a handful of companies, including Berkeley Research Group, a UK-based consultancy specializing in corporate disputes and investigations. Steven Gawthorpe, associate director at Berkeley Research Group, says the Trustworthy Language Model is the first viable solution to the hallucination problem that he has seen: “Cleanlab’s TLM gives us the power of thousands of data scientists.”

In 2021, Cleanlab developed technology that discovered errors in 10 popular data sets used to train machine-learning algorithms; it works by measuring the differences in output across a range of models trained on that data. That tech is now used by several large companies, including Google, Tesla, and the banking giant Chase. The Trustworthy Language Model takes the same basic idea—that disagreements between models can be used to measure the trustworthiness of the overall system—and applies it to chatbots.

In a demo Cleanlab gave to MIT Technology Review last week, Northcutt typed a simple question into ChatGPT: “How many times does the letter ‘n’ appear in ‘enter’?” ChatGPT answered: “The letter ‘n’ appears once in the word ‘enter.’” That correct answer promotes trust. But ask the question a few more times and ChatGPT answers: “The letter ‘n’ appears twice in the word ‘enter.’”

“Not only does it often get it wrong, but it’s also random, you never know what it’s going to output,” says Northcutt. “Why the hell can’t it just tell you that it outputs different answers all the time?”

Cleanlab’s aim is to make that randomness more explicit. Northcutt asks the Trustworthy Language Model the same question. “The letter ‘n’ appears once in the word ‘enter,’” it says—and scores its answer 0.63. Six out of 10 is not a great score, suggesting that the chatbot’s answer to this question should not be trusted.

It’s a basic example, but it makes the point. Without the score, you might think the chatbot knew what it was talking about, says Northcutt. The problem is that data scientists testing large language models in high-risk situations could be misled by a few correct answers and assume that future answers will be correct too: “They try things out, they try a few examples, and they think this works. And then they do things that result in really bad business decisions.”

The Trustworthy Language Model draws on multiple techniques to calculate its scores. First, each query submitted to the tool is sent to one or more large language models. The tech will work with any model, says Northcutt, including closed-source models like OpenAI’s GPT series, the models behind ChatGPT, and open-source models like DBRX, developed by San Francisco-based AI firm Databricks. If the responses from each of these models are the same or similar, it will contribute to a higher score.

At the same time, the Trustworthy Language Model also sends variations of the original query to each of the models, swapping in words that have the same meaning. Again, if the responses to synonymous queries are similar, it will contribute to a higher score. “We mess with them in different ways to get different outputs and see if they agree,” says Northcutt.

The tool can also get multiple models to bounce responses off one another: “It’s like, ‘Here’s my answer—what do you think?’ ‘Well, here’s mine—what do you think?’ And you let them talk.” These interactions are monitored and measured and fed into the score as well.
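
Cleanlab has not published its exact formula, but the underlying agreement idea can be sketched roughly as follows. The crude similarity measure, the hypothetical answers, and the averaging step are illustrative assumptions, not Cleanlab's actual implementation.

```python
from difflib import SequenceMatcher
from itertools import combinations

# A rough sketch of agreement-based trust scoring, NOT Cleanlab's method:
# collect answers from several models (and from paraphrased versions of the
# question), then score how much those answers agree with one another.

def similarity(a: str, b: str) -> float:
    """Crude textual similarity between two answers, between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def trust_score(answers: list[str]) -> float:
    """Average pairwise agreement across all collected answers."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 0.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

# Hypothetical answers gathered from different models and rephrased queries.
answers = [
    "The letter 'n' appears once in 'enter'.",
    "Once.",
    "The letter 'n' appears twice in 'enter'.",
]
print(round(trust_score(answers), 2))  # disagreement drags the score down
```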

Nick McKenna, a computer scientist at Microsoft Research in Cambridge, UK, who works on large language models for code generation, is optimistic that the approach could be useful. But he doubts it will be perfect. “One of the pitfalls we see in model hallucinations is that they can creep in very subtly,” he says.

In a range of tests across different large language models, Cleanlab shows that its trustworthiness scores correlate well with the accuracy of those models' responses. In other words, scores close to 1 line up with correct responses, and scores close to 0 line up with incorrect ones. In another test, Cleanlab also found that using the Trustworthy Language Model with GPT-4 produced more reliable responses than using GPT-4 by itself.

Large language models generate text by predicting the most likely next word in a sequence. In future versions of its tool, Cleanlab plans to make its scores even more accurate by drawing on the probabilities that a model used to make those predictions. It also wants to access the numerical values that models assign to each word in their vocabulary, which they use to calculate those probabilities. This level of detail is provided by certain platforms, such as Amazon’s Bedrock, that businesses can use to run large language models.
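
As a rough sketch of what such a calculation might look like, here is one way per-token log-probabilities could be folded into a confidence signal; the numbers are invented, and this is not Cleanlab's method.

```python
import math

# A rough sketch of turning per-token log-probabilities into a confidence
# proxy. The values below are made up; real ones would come from a platform
# that exposes token log-probabilities alongside the generated text.
token_logprobs = [-0.02, -0.15, -0.01, -2.30, -0.08]  # one value per token

# Average probability per token: closer to 1 means the model was rarely unsure.
avg_prob = math.exp(sum(token_logprobs) / len(token_logprobs))
print(round(avg_prob, 2))  # a single low-probability token pulls the score down
```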

Cleanlab has tested its approach on data provided by Berkeley Research Group. The firm needed to search for references to health-care compliance problems in tens of thousands of corporate documents. Doing this by hand can take skilled staff weeks. By checking the documents using the Trustworthy Language Model, Berkeley Research Group was able to see which documents the chatbot was least confident about and check only those. It reduced the workload by around 80%, says Northcutt.
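
That triage workflow might look something like the sketch below, in which documents scoring under an arbitrary threshold are routed to human reviewers; the scores and the cutoff are illustrative, not Berkeley Research Group's.

```python
# A minimal sketch of the triage workflow described above: hand-check only the
# documents whose answers fall below a trust threshold. Scores are made up.
scored_docs = [
    ("doc_001", 0.94),
    ("doc_002", 0.41),
    ("doc_003", 0.88),
    ("doc_004", 0.22),
]
THRESHOLD = 0.7

needs_review = [doc for doc, score in scored_docs if score < THRESHOLD]
auto_accepted = [doc for doc, score in scored_docs if score >= THRESHOLD]
print("hand-check:", needs_review)        # ['doc_002', 'doc_004']
print("trust the model:", auto_accepted)  # ['doc_001', 'doc_003']
```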

In another test, Cleanlab worked with a large bank (Northcutt would not name it but says it is a competitor to Goldman Sachs). Similar to Berkeley Research Group, the bank needed to search for references to insurance claims in around 100,000 documents. Again, the Trustworthy Language Model reduced the number of documents that needed to be hand-checked by more than half.

Running each query multiple times through multiple models takes longer and costs a lot more than the typical back-and-forth with a single chatbot. But Cleanlab is pitching the Trustworthy Language Model as a premium service to automate high-stakes tasks that would have been off limits to large language models in the past. The idea is not for it to replace existing chatbots but to do the work of human experts. If the tool can slash the amount of time that you need to employ skilled economists or lawyers at $2,000 an hour, the costs will be worth it, says Northcutt.

In the long run, Northcutt hopes that by reducing the uncertainty around chatbots’ responses, his tech will unlock the promise of large language models to a wider range of users. “The hallucination thing is not a large-language-model problem,” he says. “It’s an uncertainty problem.”

Correction: This article has been updated to clarify that the Trustworthy Language Model works with a range of different large language models.
