Hallucinations

Because LLMs predict the next word in a sentence based on patterns in their training data, their responses can sometimes include errors or be nonsensical.
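To make the idea of next-word prediction concrete, here is a minimal sketch in Python of how such a model chooses its next word. It is not a real LLM: the words and probabilities are invented for illustration, and a genuine model learns these values from vast amounts of text. The point is that the output is a probabilistic choice, not a lookup of a verified fact.

```python
import random

# Toy "model": for a given context, the probability of each possible
# next word (all values here are invented for illustration).
next_word_probs = {
    ("the", "court", "held"): {"that": 0.90, "in": 0.06, "unanimously": 0.04},
    # On topics that are thin in the training data, the probabilities
    # are spread across many weakly supported options.
    ("under", "the", "Act"): {"of": 0.30, "the": 0.25, "a": 0.25, "it": 0.20},
}

def generate_next(context):
    """Pick the next word by sampling from the learned probabilities."""
    probs = next_word_probs[context]
    words = list(probs.keys())
    weights = list(probs.values())
    # The choice is a weighted random draw, so a fluent but wrong
    # continuation can easily be produced.
    return random.choices(words, weights=weights, k=1)[0]

print(generate_next(("the", "court", "held")))
```

Because each word is drawn from a probability distribution rather than retrieved from a checked source, a confident-sounding sentence can still be factually wrong.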
A classic example of this is Google’s experimental ‘AI Overviews’ tool encouraging pizza-lovers to use non-toxic glue to make cheese stick to pizza better. A quick search online will reveal many other examples.
In the second course, Skills and strategies for using Generative AI, we learnt about some of the ways to use – or prompt – Large Language Models (LLMs) to try to ensure their outputs are relevant and as reliable as possible. Even with the clearest prompts possible, however, there is still a risk of incorrect or made-up information being produced.
Some errors stem from gaps or poor quality in the training data: if there is not enough information on a topic, the tool cannot predict the next words with high accuracy. Such topics are more prone to mistakes, and at the time of writing this is the situation with legal queries.
Much of the information about the law is hidden behind paywalls, so general LLMs have not been trained on it. This causes two different types of hallucination: either the tool will make up a law or case entirely, or it will cite a genuine law or case whose content does not actually relate to the legal query.
Watch the following video, in which Harry Clark, a lawyer with Mishcon de Reya, explains this further.

Another problem with the legal information within the training data is that it is predominantly from the USA; there is a lack of reliable, up-to-date information about English law. The tools can therefore make errors, for example basing an answer on out-of-date law or on the law of another country. Sometimes these errors are easy to spot, but the persuasive and authoritative tone of LLM outputs can make them harder to identify.
A 2024 study found that both generic and legal LLMs were prone to making errors when answering legal queries: ChatGPT gave a complete and accurate answer in only 49% of cases. The same study looked at specialist legal GenAI tools trained on the legal information in the Lexis and Westlaw databases (and therefore behind the paywalls referred to above). Despite this, Lexis+ gave accurate and complete answers to the same set of legal queries only 69% of the time, while Westlaw Precision managed 42% (Magesh et al., 2024).
As the tools gain access to more legal data and are combined with reasoning capabilities, their accuracy may improve. However, the risk of errors is likely to remain, because these tools work probabilistically rather than as a search engine or database.
