Hallucinations

Because LLMs predict the next word in a sentence based on patterns in their training data, their responses can sometimes include errors or be nonsensical.
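To make the idea of next-word prediction concrete, here is a minimal sketch in Python of how such a model chooses its next word. It is not a real LLM: the words and probabilities are invented for illustration, and a genuine model learns these values from vast amounts of text. The point is that the output is a probabilistic choice, not a lookup of a verified fact.

```python
import random

# Toy "model": for a given context, the probability of each possible
# next word (all values here are invented for illustration).
next_word_probs = {
    ("the", "court", "held"): {"that": 0.90, "in": 0.06, "unanimously": 0.04},
    # On topics that are thin in the training data, the probabilities
    # are spread across many weakly supported options.
    ("under", "the", "Act"): {"of": 0.30, "the": 0.25, "a": 0.25, "it": 0.20},
}

def generate_next(context):
    """Pick the next word by sampling from the learned probabilities."""
    probs = next_word_probs[context]
    words = list(probs.keys())
    weights = list(probs.values())
    # The choice is a weighted random draw, so a fluent but wrong
    # continuation can easily be produced.
    return random.choices(words, weights=weights, k=1)[0]

print(generate_next(("the", "court", "held")))
```

Because each word is drawn from a probability distribution rather than retrieved from a checked source, a confident-sounding sentence can still be factually wrong.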
A classic example of this is Google’s experimental ‘AI Overviews’ tool encouraging pizza-lovers to use non-toxic glue to make cheese stick to pizza better. A quick search online will reveal many other examples.
In the second course, Skills and strategies for using Generative AI, we learnt about some of the ways to use – or prompt – Large Language Models (LLMs) to try to ensure their outputs are relevant and as reliable as possible. Even with the clearest prompts possible, however, there is still a risk of incorrect or made-up information being produced.
Some errors stem from gaps or poor quality in the training data: if there is not enough information on a topic, the tool cannot predict the next words with high accuracy. Such topics are more prone to mistakes, and at the time of writing this is the situation with legal queries.
Much of the information about the law is hidden behind paywalls, so general LLMs have not been trained on it. This causes two different types of hallucination: either the tool will make up a law or case entirely, or it will cite a genuine law or case whose content does not actually relate to the legal query.
Watch the following video, in which Harry Clark, a lawyer with Mishcon de Reya, explains this further.

Another problem with the legal information within the training data is that it is predominantly from the USA; there is a lack of reliable, up-to-date information about English law. The tools can therefore make errors, for example basing an answer on out-of-date law or on the law of another country. Sometimes these errors are easy to spot, but the persuasive and authoritative tone of LLM outputs can make them harder to identify.
A 2024 study found that both generic and legal LLMs were prone to making errors when answering legal queries: ChatGPT gave a complete and accurate answer in only 49% of cases. The same study looked at specialist legal GenAI tools trained on the legal information in the Lexis and Westlaw databases (and therefore behind the paywalls referred to above). Despite this, Lexis+ gave accurate and complete answers to the same set of legal queries only 69% of the time, while Westlaw Precision managed 42% (Magesh et al., 2024).
As the tools gain access to more legal data and are combined with reasoning capabilities, their accuracy may improve. However, the risk of errors is likely to remain, because these tools work probabilistically rather than as a search engine or database.
