LLM Output Formats: Why JSON Costs More Than TSV
When prompting an LLM to return structured data, your choice of format can make a big difference. This post compares six formats, evaluating their speed, token usage, and limitations.
TL;DR
JSON is the default choice for many, but it’s a serious token hog. It can use twice as many tokens as other formats for the same data.
Of course, no format is best in all scenarios, so here’s a decision tree:
(If you’re wondering why I haven’t included XML, it’s because I’m on a personal mission to go 50 years without using XML — only 4 left!)
I’ll explain these decisions — and the limitations of each format — in more detail below. But first, I’ll compare their token usage and speed.
Token usage
The main reason to explore alternatives to JSON is to reduce the number of tokens used, which reduces costs and response times.
So to compare these formats, we’ll look at the number of tokens they require to represent a given dataset.
Comparison framework
The input I’ll use for this comparison is a block of text with one paragraph of info for each country in the EU.
Austria (Österreich), led by Chancellor Karl Nehammer (born October 18, 1972), is married to Katharina Nehammer. Austria has a population of approximately 9 million people and covers an area of 83,879 square kilometers.
Belgium (Belgique/België), under Prime Minister Alexander De Croo (born November 3, 1975), who is married to Annik Penders, has a population of around 11.6 million and an area of 30,528 square kilometers.
Bulgaria (България), led by acting Prime Minister Dimitar Glavchev, has a population of about 6.9 million and spans an area of 110,879 square kilometers. There is no publicly available information about Glavchev’s significant other.
... etc.
I’ll ask an LLM to turn this plain text into structured data where each country is a record, and each record has key/value pairs for the country’s name, leaderName, leaderDOB, leaderSO, population, and area.
I’ll do this for each of the formats, and check that the results are identical for all six.
For the curious, the full code is in this gist. For the mildly interested, here’s the part where I define the name, optional hint, and parser for each of the formats (I parse everything to a Pandas DataFrame):
from io import StringIO
import json

import pandas as pd
import tomlkit
import yaml

data_formats = [
    dict(
        name="TSV",
        parser=lambda text: pd.read_csv(StringIO(text), sep="\t"),
    ),
    dict(
        name="CSV",
        hint="Wrap any values with commas in quotes.",
        parser=lambda text: pd.read_csv(StringIO(text)),
    ),
    dict(
        name="columnar JSON",
        hint="Return a top-level object with fields as keys and values as lists.",
        parser=lambda text: pd.DataFrame(json.loads(text)),
    ),
    dict(
        name="YAML",
        hint="Return a top-level list.",
        parser=lambda text: pd.DataFrame(yaml.safe_load(text)),
    ),
    dict(
        name="TOML",
        hint="Return an array of tables with the name 'countries'.",
        parser=lambda text: pd.DataFrame(tomlkit.loads(text)["countries"]),
    ),
    dict(
        name="JSON",
        parser=lambda text: pd.DataFrame(json.loads(text)),
    ),
]
I think it’s pretty cool that the LLM (gpt-4o-mini for this test) is able to return identical data in all these formats. Of course, with more complex data, or a less powerful LLM, things might not be so perfect.
Results
The chart below shows the number of tokens required to represent this data in each of the formats.
JSON uses twice as many tokens as TSV. That’s not a small difference. Imagine reading the pricing page of some API and seeing it would cost $1 if you wanted your data in JSON format, 50c if you took it as TSV, or 80c for YAML.
Of course, these results are specific to the example data. The purpose of this chart is not to convince you that JSON is a terrible token hog in all scenarios, it’s to convince you that it’s worth testing your own data with other formats.
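To get a rough feel for the gap on your own data before wiring up a tokenizer, you can compare serialized lengths of the same records. Character counts are only a crude proxy for token counts (my real comparison uses the model’s tokenizer), but the overhead of JSON’s braces, quotes, and repeated keys is already visible:

```python
import json

# Two records from the example data, serialized both ways.
records = [
    {"name": "Austria", "population": 9_000_000},
    {"name": "Belgium", "population": 11_600_000},
]

as_json = json.dumps(records, indent=2)
as_tsv = "name\tpopulation\n" + "\n".join(
    f"{r['name']}\t{r['population']}" for r in records
)

# Crude proxy: compare serialized lengths. JSON repeats every key
# per record; TSV states the header once.
print(len(as_json), len(as_tsv))
```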
Now let’s look at the response times for these formats.
Even though JSON ‘only’ requires twice as many tokens, it routinely takes four times as long compared to TSV. I’d always considered the relationship between token count and response time to be roughly linear — close enough to O(n) — so these exaggerated response times surprised me and I propose we call this O(my).
Again, what really matters is the response times for your data, so get testing.
Limitations and considerations
If all of these formats were equally reliable and flexible, the conclusion would be simple: use TSV. But of course that’s not the case, so let’s take a deeper look at each one.
TSV
For representing tabular data, TSV and CSV are quite similar, with TSV using tabs and CSV using commas to delimit the values in each ‘row’. Differences in token usage only arise if your data contains commas or tabs, since these values need to be wrapped in double quotes.
Since tab characters are less common than commas, TSV is often going to use fewer tokens than CSV.
When it comes to parsing, TSV — like CSV — is a bit trickier than JSON to parse in vanilla Python. You could use Python’s built-in csv module, but it’s easier with the Pandas package. It’s a similar story in other languages, where parsing either requires a bit more work or a third-party package.
Speaking of parsing, it’s easy enough to parse TSV line-by-line (if your data doesn’t have newlines in it). So if you want to stream your response from the LLM and process each row as it arrives, TSV is a good choice (so is CSV). You can do this with TOML and YAML and JSON, but it’s more work. Also, shout out to NDJSON, not tested here.
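Here’s a sketch of that streaming pattern. The generator below is a hypothetical stand-in for lines arriving from an LLM client:

```python
# Hypothetical stand-in for an LLM client yielding one TSV line at a time.
def fake_llm_stream():
    yield "name\tpopulation"
    yield "Austria\t9000000"
    yield "Belgium\t11600000"

lines = fake_llm_stream()
header = next(lines).split("\t")  # first line carries the column names

for line in lines:
    record = dict(zip(header, line.split("\t")))
    # Process each record as soon as its line arrives,
    # instead of waiting for the full response.
    print(record)
```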
CSV
As suggested earlier, the problem with CSV is that commas are common, which either means more tokens (to handle the commas), or the LLM fails to escape correctly and you get invalid data. So if your inputs might have commas, you should either avoid CSV or have a thorough prompt and some good evals in place so you can quantify reliability.
For both TSV and CSV, make sure you test your setup with data that contains all characters that need special attention (commas/tabs, newlines, double quotes).
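As a sanity check of the quoting rules, Python’s csv module shows what correctly escaped output should look like (the comma-containing value is made up for illustration):

```python
import csv
from io import StringIO

buf = StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "leaderSO"])
writer.writerow(["Belgium", "Penders, Annik"])  # made-up value containing a comma

text = buf.getvalue()
print(text)  # the comma-containing value is wrapped in double quotes

# Parsing recovers the original value intact.
rows = list(csv.reader(StringIO(text)))
print(rows[1])  # ['Belgium', 'Penders, Annik']
```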
Columnar JSON
Columnar JSON is not a common concept or phrase; I included it in this comparison because I was curious to see how token-efficient it was.
In case it isn’t clear, this is what columnar JSON would look like for the country data mentioned above:
{
  "name": ["Austria", "Belgium", "Bulgaria", ...],
  "leaderName": ["Karl Nehammer", "Alexander De Croo", "Dimitar Glavchev", ...],
  "leaderDOB": ["October 18, 1972", "November 3, 1975", "Unknown", ...],
  "leaderSO": ["Katharina Nehammer", "Annik Penders", "Unknown", ...],
  "population": [9000000, 11600000, 6900000, ...],
  "area": [83879, 30528, 110879, ...]
}
It’s the least human-readable of all the formats by a long shot. But the structure means that each key only appears once, instead of being repeated for every record, saving tokens.
I’ve found that sometimes the LLM will know what I mean by ‘columnar JSON’, but at other times it needs a hint, like: ‘the column names should be keys with the column contents represented as a list of items’.
To parse columnar JSON, you can pass it to Pandas like so:
df = pd.DataFrame(json.loads(text))
Unlike CSV and TSV, columnar JSON supports nested data structures, so this is a good format for a list of records where some values have their own structure.
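For instance, a nested column parses cleanly into a DataFrame cell. The ‘leader’ column below is a hypothetical restructuring of two of the example fields:

```python
import json

import pandas as pd

# Columnar JSON where one column holds nested objects
# (a hypothetical 'leader' column combining two of the fields).
text = """{
  "name": ["Austria", "Belgium"],
  "leader": [
    {"name": "Karl Nehammer", "dob": "October 18, 1972"},
    {"name": "Alexander De Croo", "dob": "November 3, 1975"}
  ]
}"""

df = pd.DataFrame(json.loads(text))
print(df["leader"][0]["name"])  # Karl Nehammer
```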
These first three formats — TSV, CSV, columnar JSON — are only suitable for representing lists of records (tabular data). All three are terrible options for things like configuration files that are best represented as a single top-level object. The next three formats (YAML, TOML, JSON) are more versatile.
YAML
YAML can return a top-level list, but I’ve found that some LLMs have a preference for returning a top-level object, so you need to be specific in your prompt to ensure a consistent format.
I’ve also had trouble getting LLMs to return string values in the same format each time. Sometimes it doesn’t matter, but of the five ways to represent a string in YAML, only one supports parsing escape sequences (\t, \u03B1, etc.). So if your data has escape sequences, it’s best to explicitly request double quotes for strings.
There are more ‘gotchas’ with YAML than there are with JSON, and I would suggest understanding these rather than just hoping that “the LLM will know how to format YAML”.
To parse YAML you’ll need a third-party package. I’ve used pyyaml, which has no dependencies.
TOML
TOML is the only format here that can’t have a top-level list (it’s designed to be a configuration file format), so if you want to represent a list of records as TOML, you’ll need to put them in a top-level object, and tell the LLM what you’d like to call the key in this object.
TOML typically requires more tokens than YAML because all string values must be in quotes.
When it comes to parsing, there is a TOML parser (tomllib) built into Python if you’re using Python 3.11 or later. If not, you’ll need to install tomlkit or something similar.
TOML is less common than YAML, which might make you wonder if LLMs will struggle to get the format right. I haven’t found any clear signs of this (in the top-tier LLMs). I suspect that the relative simplicity of TOML (compared to YAML) balances things out a bit. Also, there are multiple ways to represent the same data in YAML, which could result in lower certainty in the LLM, bringing the reliability closer to that of TOML.
Anecdotally, I’ve seen errors in both TOML and YAML, but always fixable with a more thorough prompt.
My comments about strings and escape sequences in YAML also apply to TOML.
Overall, TOML and YAML are quite close, with TOML requiring a few more tokens and not supporting top-level lists, but not requiring a third-party package to parse (for Python 3.11+ users).
JSON
There’s not much to say here. JSON is the de facto format for a reason; it’s versatile, easy to parse, and hard to get wrong. It’s just a pity that it comes with so many quote marks, commas, colons, and line breaks, each one adding to the token count.
It’s worth noting that JSON is often the only option if you want to use an LLM provider’s ‘structured data mode’ or some other structure-enforcing package like Guardrails or Outlines. I predict that this will change with time. As more LLM-based apps make it into production and developers turn their attention to optimizations like minimizing token usage, LLM providers will respond by supporting more format options for structured data.
A nice future would be one where LLM providers tune their models to reliably represent data in a wide range of formats, and add those formats as options in structured data mode.
A general note: be wary of old advice regarding the reliability of LLMs when returning certain formats. As you can see in this August 2024 blog post from OpenAI about structured outputs, early versions of GPT-4 handled their complex JSON test 35% of the time, while newer versions of GPT-4 topped 85%. That’s a massive jump within the GPT-4 family.
This is important if you’re using some special feature or package that forces structured data outputs, based on assumptions or evidence from a year or more ago. You might not need that feature or package, and it might be forcing you to use JSON when you could be using a cheaper, faster format.
Practical application
This all sounds great in theory, but let’s say you’ve got a setup that’s working nicely using JSON for structured data. You know that TSV would also work as a format for your data, so should you bother switching over?
If you care about speed — because humans are waiting while the LLM generates tokens — then you need to assign a value to their waiting time and I won’t attempt to cover that scenario here.
But if you’re running a process in the background, then it’s a fairly simple calculation. As an example, let’s make the following assumptions:
- Your time is worth $1,000/day
- It’ll take you half a day to convert an existing setup from JSON to TSV
- Output tokens cost $0.60 per million
- TSV will use 50% fewer tokens
- You want a return on investment in one year
We can plug these values into a little Python script:
daily_rate = 1_000                 # your time, $ per day
days_work = 0.5                    # time needed to convert from JSON to TSV
cost_per_token = 0.6 / 1_000_000   # $0.60 per million output tokens
token_reduction = 0.5              # TSV uses 50% fewer tokens
roi_days = 365                     # want to break even within a year

cost_of_conversion = daily_rate * days_work
break_even_saving_per_day_cost = cost_of_conversion / roi_days
break_even_saving_per_day_toks = break_even_saving_per_day_cost / cost_per_token
required_token_usage = break_even_saving_per_day_toks / token_reduction

print(f"""\
You need to save ${break_even_saving_per_day_cost:,.2f} per day
You need to reduce token usage by {break_even_saving_per_day_toks:,.0f} per day
If you'll be using {token_reduction:.0%} fewer tokens...
You'll break even if you currently use {required_token_usage:,.0f} tokens per day"""
)
This will tell us that we’d break even in a year if we currently produce 4,566,210 tokens of JSON per day.
Of course you should plug in your own values, but this gives you a ballpark figure to think about. If you’re only generating a few thousand tokens per day of structured data (and don’t care about speed) fiddling with formats is a poor use of your time. But if you’re pumping out tens of millions of tokens per day, then investigating other formats is a smart move.
Wrapping up
It’s tempting to stick with the default JSON — it’s versatile, reliable, and easy to parse. But it’s also slow and expensive by comparison. So think about these other formats, run some experiments, and choose the one that gives you a good balance of cost, reliability and speed.