TOON Format¶
TOON (Tabular Object-Oriented Notation) v3 is a compact text serialization format designed to reduce token usage when sending structured data to LLMs.
Why TOON?¶
When you send query results to an LLM as JSON, every row repeats the same field names:
[{"id":1,"name":"Widget","category":"Tools","price":29.99},
{"id":2,"name":"Gadget","category":"Tools","price":19.99}]
TOON moves field names into a header, so they appear only once:
[2,]{id,name,category,price}:
1,Widget,Tools,29.99
2,Gadget,Tools,19.99
The savings grow with row count and stabilize at the dataset's natural ceiling.
Real-World Token Savings¶
Measured on real public datasets using tiktoken cl100k_base (GPT-4o tokenizer):
| Dataset | Rows | JSON Tokens | TOON Tokens | Savings |
|---|---|---|---|---|
| MovieLens (7 cols) | 10 | 632 | 495 | 21.7% |
| MovieLens (7 cols) | 100 | 6,306 | 4,789 | 24.1% |
| MovieLens (7 cols) | 500 | 26,674 | 18,927 | 29.0% |
| Restaurant (9 cols) | 10 | 723 | 473 | 34.6% |
| Restaurant (9 cols) | 100 | 7,071 | 4,326 | 38.8% |
| Restaurant (9 cols) | 500 | 35,663 | 21,787 | 38.9% |
Savings are highest with: many columns, short values, numeric data. Savings are lowest with: long text content, few rows.
Format Specification¶
A TOON tabular document has this structure:
[ROW_COUNT,]{field1,field2,...}:
value1,value2,...
value1,value2,...
[N,]— Row count in square brackets, followed by a comma{field1,field2}— Field names in curly braces:— Header terminator- Each data row is indented with two spaces, values separated by commas
Value Rules¶
| Type | Example | TOON |
|---|---|---|
| String | "Alice" |
Alice (unquoted if safe) |
| String with comma | "Smith, John" |
"Smith, John" (quoted) |
| Number | 29.99 |
29.99 (canonical, no scientific notation) |
| Boolean | true |
true |
| Null | null |
null |
| Empty string | "" |
"" (quoted) |
Quoting Rules (Section 7.2)¶
A string value must be quoted when it:
- Contains the delimiter (comma by default)
- Matches a keyword (
true,false,null) - Looks like a number (
123,3.14) - Has leading/trailing whitespace
- Is empty
Escape Sequences (Section 7.1)¶
| Sequence | Meaning |
|---|---|
\\ |
Backslash |
\" |
Double quote |
\n |
Newline |
\r |
Carriage return |
\t |
Tab |
Conformance¶
Seamless-RAG's TOON encoder passes 166/166 official TOON v3 specification test fixtures, covering: nested escaping, empty rows, unicode content, mixed types, key folding, delimiter options, and number canonicalization.
Side-by-Side Example¶
JSON (207 tokens)¶
[{"movie_id":318,"title":"Shawshank Redemption, The (1994)","genres":"Crime, Drama","year":1994,"avg_rating":4.43,"num_ratings":317},{"movie_id":858,"title":"Godfather, The (1972)","genres":"Crime, Drama","year":1972,"avg_rating":4.29,"num_ratings":192},{"movie_id":2959,"title":"Fight Club (1999)","genres":"Action, Crime, Drama, Thriller","year":1999,"avg_rating":4.27,"num_ratings":218}]
TOON (157 tokens — 24.2% saved)¶
[3,]{movie_id,title,genres,year,avg_rating,num_ratings}:
318,"Shawshank Redemption, The (1994)","Crime, Drama",1994,4.43,317
858,"Godfather, The (1972)","Crime, Drama",1972,4.29,192
2959,Fight Club (1999),"Action, Crime, Drama, Thriller",1999,4.27,218
Usage¶
from seamless_rag.toon.encoder import encode_tabular
rows = [
{"id": 1, "content": "Climate change affects biodiversity", "score": 0.92},
{"id": 2, "content": "Recent studies show temperature rise", "score": 0.87},
]
print(encode_tabular(rows))
Output:
[2,]{id,content,score}:
1,Climate change affects biodiversity,0.92
2,Recent studies show temperature rise,0.87