ChatGPT with confidence scores

David Gilbertson
14 min read · Sep 1, 2024

In this post we’ll look at using the confidence scores available through the OpenAI API.

In the first section we’ll start with a gentle exploration of these scores and get a feel for what they mean with the help of a custom chat interface.

In the second section we’ll look at using confidence scores in code.

Exploring ‘confidence’

First, a quick primer on what an LLM is doing for each token in its response:

  • The model outputs a value for every token in its vocabulary (~100,000 values)
  • These values are then transformed into values that we (dubiously) call ‘probabilities’. These values are the focus of this post.
  • A single token is then selected probabilistically (sometimes the one with the highest value, sometimes not) and used in the response

Now, let’s get some terminology sorted: the values we’ll be using in this post aren’t really ‘probabilities’ (in the sense of ‘how likely it is that something will occur’) and they’re not ‘confidence’ in any meaningful way. They’re just the numbers that the LLM outputs, adjusted so that they’re positive and add to one (to a mathematician, this is enough to earn any set of numbers the label ‘probability distribution’).
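
If it helps to make that adjustment concrete, here's a tiny sketch of the usual recipe (softmax), using made-up numbers for three candidate tokens:

import math

# Made-up raw scores for three candidate tokens (not real model output)
logits = {"choose": 2.1, "pick": 1.8, "go": 0.4}

# Exponentiate, then divide by the total, so the values are positive and sum to one
total = sum(math.exp(v) for v in logits.values())
probs = {token: math.exp(v) / total for token, v in logits.items()}

print(probs)  # roughly {'choose': 0.52, 'pick': 0.39, 'go': 0.09}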

So you can add ‘probability’ to the list of terms that mean one thing in academia but a slightly different thing in the real world, causing widespread misunderstanding (along with ‘theory’, ‘significance’, etc.).

In my view, it makes sense to think of these values as ‘confidence’, but remember that LLMs are like humans: just because they’re confident, doesn’t mean they’re right. In short, these values are meaningless until proven otherwise.

Let’s take a look at some examples using a chat interface that does the following:

  • For each token in the LLM’s response, it represents confidence with a red underline — brighter red for lower confidence.
  • When you hover over a word (on a device with a mouse cursor) it shows the top 10 possible tokens for that position ranked by confidence score.

You can try it out at gptconfidence.streamlit.app.

If you want to run this locally (and use a model other than gpt-4o-mini) you can clone the repo here.

Let’s start simple and ask it to pick a number.

The first thing to note is that with the second token, it could have said 'pick' or 'choose' or 'go', etc. And even though there was only a 21% chance of choosing 'choose', it chose it anyway. (That's lesson one, if you didn't know already: LLMs don't just pick the 'most likely next token' unless you configure them to do so.)

It then went on to pick/choose/go for the number 5, and we can see that this is not at all a uniform selection. So in case you were in doubt: you should not be using LLMs to make ‘random’ selections.

You might be wondering if you could somehow use this information to detect hallucinations. Well, yes and no.

This is a little bit off-topic, but trying to answer this question will deepen our understanding of the meaningfulness of these values.

One interesting case is the impossible question, like listing famous people with an interpunct in their name.

Here’s the question and the response, showing the confidence tooltip for the token where the first name starts.

Let’s think this through from the model’s perspective, remembering that the model is only ever concerned with predicting one more token.

Let's say it's received the prompt and the response up to 1. ** and its job is to work out which token goes in the next slot, the first person's name. It's too late to say "I don't know" or "that's a stupid question" … it has to say something, but there is no good answer, so it tries to pass the buck by coming up with a single letter (J, M, G, etc.). You can see it's not very confident in any of these: even the highest-scored token is under 30%. This behaviour is a good sign that there's no obvious token for this particular slot and that the model is about to say something incorrect.

But can you detect hallucinations at the level of a whole response, without looking at individual tokens? Well, compare the amount of red above with the below, a question with a fairly clear answer:

There does appear to be quite a strong difference between a question it knows the answer to, and one where it’s forced to hallucinate.

But it certainly isn’t the case that all hallucinations involve low confidence.

Here the model is quite sure that a perfectly valid query won’t work.

And it certainly isn't the case that all low-confidence tokens imply hallucination, as we saw with the first example where the model could have said 'choose' or 'pick' or 'go for'. That's just the way natural language works: there are often several ways to say the same thing.

And lastly, we have no hope of LLMs being right all the time, because they learn from humans, and humans are wrong all the time (sometimes on purpose!)

Here the model is quite sure that a nonsense psychology concept is real, because a lot of humans talk about it as though it were real.

(Religion would be another good example, except the model has been trained to dodge questions that involve religion and truth.)

So no, looking at confidence scores isn’t some magical way to check whether the LLM is actually correct, although there are signs that it could help catch some cases of hallucination.

Let's edge toward the goal of this post and try a closed question that the model might get wrong: the capital of Kazakhstan. (In case you're not up on the latest from Kazakhstan: for a while the capital was Astana, then in 2019 it changed to Nur-Sultan, then in 2022 it changed back to Astana, when someone realised that 'Astana' means 'Capital city' in Kazakh.)

The model’s training data would have been inconsistent about this topic and thus the model won’t be sure what the capital is.

You can see in the tooltip for the ‘Ast’ token that there was an 88% chance of it (incorrectly) answering Nur(-Sultan), but in this particular run it went with Ast(ana).

Remember that all the model does is predict one token at a time, so once it’s chosen ‘Ast’, the next token will definitely be ‘ana’, and the rest of the response will go on to be consistent with that selection.

If it happened to pick ‘Nur’ instead of ‘Ast’, then the rest of the response would be forced to back up that assertion, conveniently ‘forgetting’ that it knows the capital changed back to Astana in 2022.

Or another way to think of it: once the response asserts that the capital is Nur-Sultan, for the remainder of the response the LLM will only draw on the training data produced while Nur-Sultan was the capital (this is a slightly dodgy claim but it can be an interesting way to think about what’s going on inside the black box).

Side note: this shows a fundamental difference between how humans and LLMs learn. We humans learn sequentially. If a new fact collides with an existing one (where only one can be true), we perform some conflict-resolution actions to work out what’s really true. Meanwhile, LLMs learn probabilistically: if some of the training data says Astana is the capital, and some of it says Nur-Sultan is the capital, the LLM learns that the capital is either Nur-Sultan or Astana.

And just in case you’re still under the illusion that these values represent ‘probability’ in the plain English sense of the word, consider this:

  • If you ask GPT-4o “Is the capital of Kazakhstan Nur-Sultan?”, it will say Yes (85%).
  • If you ask it “Is the capital of Kazakhstan Astana?”, it will say Yes (91%).

Let’s get into some code.

Using confidence programmatically

For simple cases, you should try and compress the response down into a single token (and therefore a single confidence score). That means either structuring your question as multiple choice, or instructing the model to pick from a selection of single-token words (e.g. yes/no).

Let’s start with a simple yes/no question.


import math
from openai import OpenAI

client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        dict(
            role="user",
            content="Is the Great Wall of China visible from the Moon?",
        )
    ],
    temperature=0,
    max_tokens=1,
    logprobs=True,
)

choice = completion.choices[0]
confidence = math.exp(choice.logprobs.content[0].logprob)

print(f"Answer: {choice.message.content} ({confidence:.4%})")
# Answer: No (60.5926%)

The interesting parts are:

  • temperature=0, to ensure we get the token with the highest confidence.
  • max_tokens=1, because we only want one token in the response.
  • logprobs=True, to tell the API we want 'log probabilities' in the response object.

OpenAI returns log probabilities rather than plain probabilities (working in log space avoids floating-point underflow when values get very small), so we need to convert these back to values between 0 and 1 using math.exp.

If you were wondering: no, the returned logprob values aren't affected by the temperature.

BTW, the API also has a logit_bias property that should in theory allow us to coerce the model into using certain tokens in the response (like ‘Yes’ and ‘No’) but I’ve had no luck getting that to behave reliably.
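
For reference, an attempt at that looks something like the sketch below, with the token IDs looked up via tiktoken rather than hardcoded (this needs a recent tiktoken that knows about gpt-4o). Treat it as an experiment, not a guarantee:

import tiktoken
from openai import OpenAI

client = OpenAI()

# Look up the token IDs for 'Yes' and 'No' in the gpt-4o tokenizer
enc = tiktoken.encoding_for_model("gpt-4o")
yes_id = enc.encode("Yes")[0]
no_id = enc.encode("No")[0]

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[dict(role="user", content="Is the Great Wall of China visible from the Moon?")],
    max_tokens=1,
    logprobs=True,
    # A bias of 100 is supposed to (all but) force these tokens to be selected
    logit_bias={str(yes_id): 100, str(no_id): 100},
)

print(completion.choices[0].message.content)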

Next we’ll layer in a bit of real-world complexity and expand this to answer questions about an image.

Let’s say you’re building a system that requires users to upload a photo of their driver’s licence (or license, if you prefer) for authentication. You want to perform some automated checks to make sure it’s a current licence, from your country, and that the name on the licence matches the user’s name.

You could try and train a traditional ML model like ResNet, but that only gets you image recognition, not image understanding; you won’t be able to ask questions about what the model sees. For that, a multi-modal LLM might get better results.

from datetime import datetime
import base64
import math
from pathlib import Path

from openai import OpenAI

client = OpenAI()


def classify_with_confidence(file_path, name) -> tuple[str, float]:
    user_prompt = f"""
Is this a current Australian driver's licence, belonging to {name}?
Answer only 'Yes' or 'No'.
Today's date is {datetime.today().strftime("%Y-%m-%d")}.
"""

    # Encode the image as a base64 data URL
    encoded_img = base64.b64encode(Path(file_path).read_bytes())
    img_url = f"data:image/jpeg;base64,{encoded_img.decode()}"

    # Create a message from the prompt and the image
    message = dict(
        role="user",
        content=[
            dict(type="text", text=user_prompt),
            dict(type="image_url", image_url=dict(url=img_url)),
        ],
    )

    # Call the API, requesting logprobs
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[message],
        temperature=0,
        max_tokens=1,
        logprobs=True,
    )

    # Return the response and confidence
    choice = completion.choices[0]
    confidence = math.exp(choice.logprobs.content[0].logprob)
    return choice.message.content, confidence


is_valid, confidence = classify_with_confidence(
    file_path="my_licence_expired.jpg",
    name="David Gilbertson",
)

print(f"Answer: {is_valid} ({confidence:.4%})")
# Answer: No (99.9955%)

Although you won’t need training data when using an LLM, you’ll still need a few examples for evaluation and calibration.

During evaluation you'll probably find that the model is wrong some of the time; the million-dollar question then becomes: is there a correlation between the correctness of the response and the confidence score returned?

In the case of identifying licences, you’d fire a bunch of examples at the LLM and record two pieces of information for each: whether it was right or wrong, and the model’s confidence. Then with any luck, when you plot these you’ll see something like this:

Each blue line is an instance of a correct answer, and each orange line is an instance of a wrong answer. There is some overlap, but we can see a pretty clear indication that confidence isn’t entirely meaningless in this case.
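
The plumbing for that evaluation is only a few lines. Here's a rough sketch, reusing classify_with_confidence from above and assuming you have a hypothetical list of labelled examples:

import matplotlib.pyplot as plt

# Hypothetical labelled examples: (file_path, name, expected_answer)
labelled_examples = [
    ("licence_current.jpg", "David Gilbertson", "Yes"),
    ("licence_expired.jpg", "David Gilbertson", "No"),
    # ... more examples
]

correct_confidences = []
wrong_confidences = []

for file_path, name, expected in labelled_examples:
    answer, confidence = classify_with_confidence(file_path, name)
    if answer == expected:
        correct_confidences.append(confidence)
    else:
        wrong_confidences.append(confidence)

# One vertical line per example: blue for correct answers, orange for wrong ones
plt.eventplot([correct_confidences, wrong_confidences], colors=["tab:blue", "tab:orange"])
plt.xlabel("Confidence")
plt.show()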

The next step is to think about whether you’d prefer false positives or false negatives, and pick an appropriate cutoff point. In code, this might look something like this:

if is_valid == "Yes":
    if confidence > 0.95:
        ...  # Success
    else:
        ...  # Success, but flag for human review
elif is_valid == "No":
    if confidence > 0.99:
        ...  # Fail. Ask the LLM what's wrong, send that info to user
    else:
        ...  # Success, but flag for human review

Over time you should build up a set of evals so you can regularly adjust these cutoff points (and easily test the accuracy of new models as they come out).

As a side note: despite what I’ve shown in this toy example, you shouldn’t ask an LLM to do things you can do in regular code. If I were actually implementing this check, I’d ask the LLM to return the expiry date. I’d then do the comparison in code to work out if it’s current. (Did you know that when comparing two dates and working out if one comes before the other, Python has a 100% accuracy rate? What a time to be alive!)
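
In sketch form, that version of the check might look like this (the date format and variable names are just illustrative):

from datetime import date

# Suppose the LLM was instructed to return only the expiry date as YYYY-MM-DD
llm_response = "2023-11-30"  # hypothetical model output

expiry_date = date.fromisoformat(llm_response.strip())
is_current = expiry_date >= date.today()  # plain old code, 100% accuracy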

Now, if you run your evaluations and find no clear relationship between correctness and ‘confidence’, you’re pretty much out of luck. About all you can do is try a different LLM. For one task, I found that GPT-4o was giving useful confidence scores while GPT-4o-mini’s scores weren’t useful (e.g. high confidence when wrong). As of August 2024, the Gemini and Claude APIs don’t support logprobs.

As with everything in the field of AI Engineering, if nothing else works, park it and set a reminder to try again in three months.

Multiple choice

Replacing yes/no logic with multiple choice is easy enough: just present the options as a numbered list and tell the LLM to answer with a single number. All whole numbers up to 999 are single tokens in GPT-4o, so you needn't limit yourself to A/B/C/D type questions. I tested it with a list of ~200 countries and gave it some quiz questions ("Which country is the leading producer of platinum?") and it had no problem selecting the correct answer by number.
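
If you want to check that single-token claim for yourself, tiktoken makes it a one-liner:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # needs a recent tiktoken

# Should print True if every whole number from 0 to 999 is a single token
print(all(len(enc.encode(str(n))) == 1 for n in range(1000)))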

But we can do something a bit more interesting than just picking one option: we can extract multiple options from a single token slot.

You could use this to, say, add a number of tags to an article, or for the following example, select a few genres to apply to a movie.

The prompt will look something like this:

Which genre best describes the movie 'Gladiator'?
Select one from the following list and return only the number:
0. Action
1. Adventure
2. Animation
3. Biography
4. Comedy
5. Crime
6. Documentary
7. Drama
8. Family
9. Fantasy
10. Film-Noir
11. History
12. Horror
13. Music
14. Musical
15. Mystery
16. Romance
17. Sci-Fi
18. Short
19. Sport
20. Thriller
21. War
22. Western

And after parsing the single-token response, we’ll end up with a dictionary of values something like this:

Drama: 43.46%
History: 33.85%
Adventure: 14.11%
Action: 8.56%

We’ll ask the model to select a single genre, and force it to return a single token, but we’ll also request the top 10 other tokens that it considered. We do this by passing in a top_logprobs argument.

import math
from openai import OpenAI

client = OpenAI()

movie_name = "Gladiator"

genres = ["Action", "Adventure", "Animation", "Biography", "Comedy", "Crime", "Documentary", "Drama", "Family", "Fantasy", "Film-Noir", "History", "Horror", "Music", "Musical", "Mystery", "Romance", "Sci-Fi", "Short", "Sport", "Thriller", "War", "Western"]
genre_string = "\n".join([f"{i}. {g}" for i, g in enumerate(genres)])

prompt = f"""\
Which genre best describes the movie {movie_name!r}?
Select one from the following list and return only the number:
{genre_string}
"""

# Call the API, requesting logprobs and 10 top_logprobs
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[dict(role="user", content=prompt)],
    max_tokens=1,
    logprobs=True,
    top_logprobs=10,
)

# Extract the options and confidences
label_dict = {}
for item in completion.choices[0].logprobs.content[0].top_logprobs:
    if (confidence := math.exp(item.logprob)) > 0.01:
        genre = genres[int(item.token)]
        label_dict[genre] = confidence


for genre, confidence in label_dict.items():
    print(f"{genre}: {confidence:.2%}")

Remember that you’ll always get 10 options back if you ask for top_logprobs=10, and there’s no guarantee that they’re all sensible tokens.

For example, when it really thinks the best answer is 3, the top 10 tokens might include 3, 03, and ３ (Unicode Chonky Three). All the nonsense tokens tend to have very low confidence scores, which is why the above code only includes genres where the confidence is > 0.01.
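
If you want to be extra defensive, you could also swap the extraction loop above for one that checks the token actually parses as an in-range index before using it:

for item in completion.choices[0].logprobs.content[0].top_logprobs:
    token = item.token.strip()
    # Ignore anything that isn't a usable index into the genres list
    if token.isdigit() and int(token) < len(genres):
        if (confidence := math.exp(item.logprob)) > 0.01:
            label_dict[genres[int(token)]] = confidence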

With advanced prompts

The above works well if you’re happy to only output a single token. But what if you also want the model to use Chain of Thought (CoT), or explain why it thinks that licence isn’t valid?

This is still possible; it just requires a bit more code to process the response and extract the confidence scores. Below is the movie genre example, but I'm now telling the LLM to think it over first, and then put the answer in <answer> tags. To be clear, there's nothing magic about an <answer> tag; it's just a string of characters to look for in the response.

When parsing the response, the below code builds a running string of the answer so far, and if that ends in <answer>, then it means the current token must be the answer, so it then loops over the top 10 tokens for that slot to extract multiple genres, the same as in the code above.

import math
from openai import OpenAI

client = OpenAI()

movie_name = "Gladiator"

genres = ["Action", "Adventure", "Animation", "Biography", "Comedy", "Crime", "Documentary", "Drama", "Family", "Fantasy", "Film-Noir", "History", "Horror", "Music", "Musical", "Mystery", "Romance", "Sci-Fi", "Short", "Sport", "Thriller", "War", "Western"]
genre_string = "\n".join([f"{i}. {g}" for i, g in enumerate(genres)])

prompt = f"""\
Which genre best describes the movie {movie_name!r}?
Consider a few likely genres and explain your reasoning,
then pick an answer from the list below
and show it in answer tags, like: <answer>4</answer>
{genre_string}
"""

# Call the API, requesting logprobs and 10 top_logprobs
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[dict(role="user", content=prompt)],
    logprobs=True,
    top_logprobs=10,
)

# Extract the responses and confidences
label_dict = {}
text = ""
for token_logprob in completion.choices[0].logprobs.content:
    # When we get to the token following '<answer>', extract alternatives listed in top_logprobs
    if text.endswith("<answer>"):
        for item in token_logprob.top_logprobs:
            if (confidence := math.exp(item.logprob)) > 0.01:
                genre = genres[int(item.token)]
                label_dict[genre] = confidence
    text += token_logprob.token


for genre, confidence in label_dict.items():
    print(f"{genre}: {confidence:.2%}")

For the record: in this case the pre-reasoning appears to make the model more certain of the selected genre, so it's less than useful here.

So is all this better than just asking the model for 'the top few genres as a JSON list'? It will depend on the use case. It can be tricky to get LLMs to give you a truly variable number of options; they tend to give you a similar number of examples/tags/genres every time, regardless of the content. Whereas when looking at the top alternatives, you're in control of the cutoff point (based on confidence), so if there's one obvious answer, you'll get one, and if there are seven viable options, you'll get seven.

This — like all prompt engineering ‘wisdom’ — requires evaluation on your own data; everything’s just a hypothesis until you can show it works for your case.

The above examples focus on extracting the confidence for a single token in the response, but you could extend this to multiple tokens.

Imagine you ask the LLM to return a JSON list of objects, each of them having a status field of either "open" or "closed". You could loop over the returned tokens, concatenating them into a string (as above), and whenever the string ends in "status": ", you know the next token is going to be a status. So you can pull out the confidence for that token, save it in a list, then mix that list back in with the JSON object as status_confidence.
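
Here's a sketch of that, with a made-up prompt about support tickets. One caveat: tokenisation won't always split exactly where you expect, so in real code you'd want this matching (and the JSON parsing) to be a bit more defensive:

import json
import math
from openai import OpenAI

client = OpenAI()

prompt = """\
Return a JSON list of the support tickets described below, each as an object
with "title" and "status" ("open" or "closed") fields. Return only JSON, no code fences.

- The login page bug was fixed and shipped last week.
- Customers still can't export reports to CSV.
"""

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[dict(role="user", content=prompt)],
    logprobs=True,
)

# Walk the tokens, grabbing the confidence of whatever token follows '"status": "'
text = ""
status_confidences = []
for token_logprob in completion.choices[0].logprobs.content:
    if text.endswith('"status": "'):
        status_confidences.append(math.exp(token_logprob.logprob))
    text += token_logprob.token

# Mix the confidences back in with the parsed JSON
tickets = json.loads(text)
for ticket, confidence in zip(tickets, status_confidences):
    ticket["status_confidence"] = confidence

print(json.dumps(tickets, indent=2))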

Using confidence for model selection

One last use case: you could use a fast/cheap model to take a stab at a problem, and if it reports low confidence, switch to a better/more expensive model (or slower process, e.g. a RAG step).

As always, this only makes sense if the confidence score correlates with accuracy, and my findings so far suggest that smaller models have less useful confidence than larger ones.
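
In code, the escalation is just a threshold check. Something like this sketch, where the 0.9 cutoff is arbitrary and would come from your own evals:

import math
from openai import OpenAI

client = OpenAI()


def answer_with_confidence(model, question):
    """Ask a single-token question, returning the answer and its confidence."""
    completion = client.chat.completions.create(
        model=model,
        messages=[dict(role="user", content=question)],
        temperature=0,
        max_tokens=1,
        logprobs=True,
    )
    choice = completion.choices[0]
    return choice.message.content, math.exp(choice.logprobs.content[0].logprob)


question = "Is the Great Wall of China visible from the Moon? Answer only 'Yes' or 'No'."

# Try the cheap model first, and escalate if it isn't confident enough
answer, confidence = answer_with_confidence("gpt-4o-mini", question)
if confidence < 0.9:  # arbitrary cutoff
    answer, confidence = answer_with_confidence("gpt-4o", question)

print(f"Answer: {answer} ({confidence:.2%})")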

Hey, thanks for reading, I hope your next Wednesday is pretty good.
