Reproducibility

Reproducibility is important in science: asking the same question of the same data should produce the same result. Because LLMs and other GenAI tools are designed to produce novel outputs, you should expect their output to differ from one attempt to the next.

However, the basic facts should remain the same, provided the prompt is clear enough. Using a loose prompt repeatedly in different chats with a GenAI tool is likely to produce very different results.

Activity: Prompt for a bedtime story

Timing: Allow 5 minutes

Use the same simple, single prompt three or four times to ask a GenAI tool to write a bedtime story suitable for a 6-year-old. You should start a new chat with the GenAI tool each time.

Now do the same with a simple prompt that asks the GenAI tool how many gold medals were issued in the 2020 summer Olympics. Again, repeat it three or four times and start a new chat with the GenAI tool each time.

Does either of these prompts produce the same information each time?

Discussion

We used these prompts with Copilot.

  • Write a bedtime story suitable for a 6 year old.
  • How many gold medals were presented at the 2020 summer Olympic Games?

The first prompt about the bedtime story was very generic, and this loose prompt allowed the GenAI neural network to consider lots of distinct patterns for words that can be presented as a bedtime story for a 6-year-old. The ‘don’t always choose the best’ approach built into GenAI means that the story changes. If you repeat the prompt often enough you might start to see repetition in settings and storylines, but the surface language will vary from response to response.

Our three attempts produced three different stories: one about Magic Forests, one about Brave Bunnies and one about Enchanted Gardens.
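The ‘don’t always choose the best’ idea described above can be illustrated with a small sketch. This is a simplified picture, not how Copilot or any particular GenAI tool is actually implemented: the word list, the probabilities and the function names `pick_best` and `pick_sampled` are all invented for illustration.

```python
import random

# Hypothetical next-word probabilities, invented for this illustration only.
next_word_probs = {"forest": 0.4, "bunny": 0.3, "garden": 0.2, "dragon": 0.1}


def pick_best(probs):
    """Always return the single most likely word."""
    return max(probs, key=probs.get)


def pick_sampled(probs, rng):
    """Return a word chosen at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random()  # unseeded, so each run can differ
greedy_picks = [pick_best(next_word_probs) for _ in range(5)]
sampled_picks = [pick_sampled(next_word_probs, rng) for _ in range(5)]

print("Always best:", greedy_picks)   # the same word every time
print("Sampled:    ", sampled_picks)  # likely a mix of words
```

Always taking the most likely word gives an identical result on every attempt, whereas weighted random choice lets less likely words appear from time to time. Repeated over every word in a story, that small amount of randomness is enough to send each attempt in a different direction.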

In contrast, the second prompt, which named a specific event and the fact required, produced similar (but not identical) responses. Copilot repeatedly reported that 339 medals were presented (for some reason always adding that the USA had 39 of these, which we didn’t ask for). The sentences containing that fact varied, and after 5 or 6 repeats no more novel sentences were produced. (The GenAI ran out of ways of telling us the same fact.)
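If you collect several responses yourself, one rough way to check that the fact stays the same while the wording changes is to pull the numbers out of each response and compare them. The sketch below assumes you have pasted the responses into a list by hand. The three example sentences are made up for this sketch and are not actual Copilot output, and the helper names `extract_numbers` and `shared_numbers` are hypothetical.

```python
import re

# Made-up example responses written for this sketch (not real Copilot output).
responses = [
    "A total of 339 medals were presented at the Games.",
    "There were 339 medals presented in Tokyo.",
    "The Games saw 339 medals presented to athletes.",
]


def extract_numbers(text):
    """Return every whole number found in the text as a set of integers."""
    return {int(match) for match in re.findall(r"\d+", text)}


def shared_numbers(texts):
    """Return the numbers that appear in every one of the texts."""
    number_sets = [extract_numbers(text) for text in texts]
    return set.intersection(*number_sets) if number_sets else set()


all_worded_differently = len(set(responses)) == len(responses)

print("Numbers in every response:", shared_numbers(responses))
print("Every response worded differently:", all_worded_differently)
```

A check like this only confirms that the responses agree with each other. It says nothing about whether the shared figure is correct, so the fact itself still needs to be verified against a trusted source.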

Where the prompt asks for a summary or shortening of a text, or for an opinion or explanation, you would like the output to be similar each time, even if the surface presentation (bullet points, short sentences, long paragraphs) is different. A summary, for example, should contain the same key points as the original content, just reduced in length. An explanation of something shouldn’t vary widely depending on the AI’s internal statistics: just as when you ask two experts for an explanation, the responses should be similar.

If you present a well-structured prompt to a GenAI more than once and the responses are very different, this may indicate either that the prompt needs to be improved (the GenAI has chosen to read the prompt in different ways) or that the GenAI is struggling to respond to the prompt (possibly because it hasn’t been trained on relevant information). In either case some remedial action needs to be taken. The next section explains how to deal with this situation.

8 Correcting and adapting