LLM Thematic Generalization Benchmark

This benchmark measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which item truly fits that theme among a collection of misleading candidates. The overall process involves generating themes, creating examples and anti-examples, filtering out low-quality data via a "double-check" step, and finally prompting LLMs to score the real example among several distractors.

Visualizations

1. Average Rank of the Correct Example

This bar chart displays, for each model, the average rank that model assigns to the true example (when placed among seven distractors). Ranks range from 1 (top score) to 8 (lowest).

Shorter bars indicate better performance, because they mean the correct example is consistently placed near the top.
A bar height of 2.0 would mean that, on average, the leftover correct item was the second-highest-scored candidate.
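As a minimal sketch of how such a rank could be derived from the 0–10 scores, the snippet below ranks the correct candidate by counting how many candidates score strictly higher. The function names, the sample scores, and the tie rule (ties resolved in favor of the correct item) are assumptions made for illustration; the benchmark's own tie handling is not described here.

```python
from statistics import mean


def rank_of_correct(scores, correct_index):
    """Rank of the correct candidate: 1 + number of candidates scored strictly higher.

    Assumption: ties are resolved in favor of the correct item.
    """
    target = scores[correct_index]
    return 1 + sum(1 for s in scores if s > target)


def average_rank(files):
    """Mean rank over (scores, correct_index) pairs, one pair per test file."""
    return mean(rank_of_correct(scores, idx) for scores, idx in files)


# Two hypothetical test files with 8 candidates each.
files = [
    ([3, 9, 4, 2, 5, 1, 0, 6], 1),  # correct item has the top score -> rank 1
    ([7, 2, 8, 6, 9, 1, 3, 4], 3),  # three candidates score higher -> rank 4
]
print(average_rank(files))  # 2.5
```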

2. Distribution of Ranks

A more granular view of the ranks each model assigns to the leftover correct example per file, showing how stable or varied those ranks are across different themes.

3. Model–Model Correlation

A correlation matrix showing how similarly each pair of models scores the correct example relative to the anti-examples (the “difference score”). It highlights which LLMs behave alike and which deviate significantly.
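One way such a matrix might be computed is sketched below: each model is represented by a list of per-file difference scores, and every pair of models is compared with a Pearson correlation. The use of Pearson correlation, the data layout, and the placeholder model names are assumptions, not details confirmed by the benchmark.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(diffs_by_model):
    """Map each (model_a, model_b) pair to the correlation of their difference scores."""
    models = list(diffs_by_model)
    return {
        (a, b): pearson(diffs_by_model[a], diffs_by_model[b])
        for a in models
        for b in models
    }


# Hypothetical per-file difference scores for three placeholder models.
diffs = {
    "model_a": [4.0, 1.0, 3.0, 0.5],
    "model_b": [8.0, 2.0, 6.0, 1.0],  # exactly 2x model_a, so perfectly correlated
    "model_c": [0.5, 3.0, 1.0, 4.0],  # roughly the reverse pattern
}
matrix = correlation_matrix(diffs)
```

In this toy data, `model_a` and `model_b` correlate at 1.0, while `model_a` and `model_c` correlate negatively.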

4. How Often the Correct Example is the Highest Score

A stacked bar chart indicating how frequently each model places the real leftover example strictly at the top (or tied for top). This quickly shows which LLMs are best at ensuring the real item is #1 vs. merely near the top.
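The distinction between "strictly at the top" and "tied for top" can be sketched as follows. The category labels and the sample scores are hypothetical and chosen only to illustrate the three cases a stacked bar would contain.

```python
from collections import Counter


def top_status(scores, correct_index):
    """Classify one file as 'strict_top', 'tied_top', or 'not_top'."""
    target = scores[correct_index]
    best = max(scores)
    if target < best:
        return "not_top"
    # The correct item has the maximum score; check whether it shares it.
    return "strict_top" if scores.count(best) == 1 else "tied_top"


def top_fractions(files):
    """Fraction of files in each category for one model."""
    counts = Counter(top_status(scores, idx) for scores, idx in files)
    total = len(files)
    return {k: counts[k] / total for k in ("strict_top", "tied_top", "not_top")}


# Hypothetical (scores, correct_index) pairs for four files.
files = [
    ([9, 3, 4, 2, 5, 1, 0, 6], 0),   # strictly top
    ([9, 9, 4, 2, 5, 1, 0, 6], 0),   # tied for top
    ([5, 9, 4, 2, 5, 1, 0, 6], 0),   # not top
    ([1, 2, 3, 4, 5, 6, 7, 10], 7),  # strictly top
]
print(top_fractions(files))
# {'strict_top': 0.5, 'tied_top': 0.25, 'not_top': 0.25}
```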

Leaderboard

Rank	Model	Avg Rank	Skipped/Total
1	o1	1.80	0/810
2	Gemini 2.0 Flash Thinking Exp	1.90	0/810
3	Claude 3.5 Sonnet 2024-10-22	1.93	0/810
4	o1-mini	1.95	0/810
5	GPT-4o	1.96	0/810
6	Gemini 2.0 Flash Exp	2.00	0/810
7	DeepSeek-V3	2.03	0/810
8	Qwen QwQ *	2.05	280/810
9	Llama 3.1 405B	2.08	0/810
10	Mistral Large 2	2.11	0/810
11	Llama 3.3 70B	2.12	0/810
12	Gemini 1.5 Pro (Sept)	2.13	0/810
13	Gemini 1.5 Flash	2.13	0/810
14	Grok 2 12-12	2.21	0/810
15	Qwen 2.5 72B	2.21	0/810
16	Claude 3.5 Haiku	2.25	0/810
17	GPT-4o mini	2.30	0/810
18	Gemma 2 27B	2.60	0/810

Avg Rank is the mean ranking assigned to the correct example across 810 test files.
Skipped indicates how many outputs failed to parse or didn’t follow the required output format (e.g., missing the required tags).

Benchmark Method in Detail

Theme & Example Creation

We prompt high-quality LLMs (Claude 3.5 Sonnet, Grok 2, Gemini 1.5 Pro, GPT-4o, DeepSeek-V3) to generate 2,000 unique, succinct “themes” that focus on a narrow concept. Each is based on a random trio of starting points, ensuring novelty.

Gather Examples & Anti-Examples

For each theme, we then request four <example> entries that specifically fit it, plus 20 <anti_example> entries that could belong to a broader or partially overlapping category but do not fit the exact theme.

Quality Check (“Double Check”)
- We create specialized prompts that ask LLMs to score how well each of the four real examples (#1–4) matches the theme, and how well each of the twenty anti-examples (#5–24) fits the notion of being “broader or related but not the theme.”
- We parse these 24 numeric scores (per file, per LLM), computing standardized z-scores. If a real example scores poorly (z < -2.5) or if the top anti-examples fail to show sufficiently high “anti-example” scores, that file is discarded.
- From the initial 2,000 sets, we end up retaining 810 sets (themes + examples + anti-examples).
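The real-example half of this filter can be sketched as below. Only the -2.5 cutoff comes from the description above; the reference population used for standardization (here, one grader's real-example scores pooled over many files) and the function names are assumptions. The anti-example check is omitted because its threshold is not stated.

```python
from statistics import mean, pstdev

Z_THRESHOLD = -2.5  # cutoff quoted in the text above


def z_scores(values, population):
    """Standardize values against the mean and std. dev. of a reference population."""
    mu = mean(population)
    sigma = pstdev(population)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def has_weak_real_example(real_scores, population):
    """True if any real example falls below the z-score cutoff."""
    return any(z < Z_THRESHOLD for z in z_scores(real_scores, population))


# Hypothetical grader scores for real examples, pooled over many files.
population = [9, 9, 10, 8, 9, 10, 9, 8, 9, 10, 9, 9, 10, 9, 8, 9, 9, 10, 9, 2]
print(has_weak_real_example([9, 10, 9, 8], population))  # False
print(has_weak_real_example([9, 10, 9, 2], population))  # True
```

A file flagged by `has_weak_real_example` would be a candidate for discarding under this reading of the rule.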

Final “Pick” Challenge
- For each of the 810 validated sets, the final prompt includes 3 real examples and 3 anti-examples as context.
- The fourth real example is hidden among the 7 top "misleading" anti-examples (8 candidates in total).
- We then prompt 18 different LLMs to assign a 0–10 score to each of these 8 candidates. A perfect approach would always rank the correct example #1.
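The assembly of one challenge might look like the sketch below, using the items from example 376 in this README. Several details are assumptions: that the last example is the held-out one, that anti-examples arrive sorted from most to least misleading, that the three context anti-examples come right after the seven distractors in that ordering, and that candidates are shuffled with a fixed seed.

```python
import random


def build_pick_challenge(examples, ranked_anti_examples, seed=0):
    """Build one 8-candidate challenge from 4 examples and ranked anti-examples.

    Assumptions: the last example is the held-out one, and
    ranked_anti_examples is sorted from most to least misleading.
    """
    if len(examples) != 4 or len(ranked_anti_examples) < 10:
        raise ValueError("need 4 examples and at least 10 anti-examples")
    *context_examples, held_out = examples
    distractors = ranked_anti_examples[:7]
    context_anti = ranked_anti_examples[7:10]
    candidates = distractors + [held_out]
    random.Random(seed).shuffle(candidates)  # hide the position of the correct item
    return {
        "context_examples": context_examples,
        "context_anti_examples": context_anti,
        "candidates": candidates,
        "correct_index": candidates.index(held_out),
    }


examples = ["clay pot", "bamboo sieve", "calabash gourd spoon", "wooden mortar"]
anti_examples = [  # ordering is hypothetical
    "cast iron skillet", "nylon cooking utensils", "ceramic bowl",
    "stone grinding wheel", "porcelain plate", "silicone baking mat",
    "bamboo steamer basket", "plastic serving spoon", "rubber spatula",
    "plastic strainer",
]
challenge = build_pick_challenge(examples, anti_examples)
```

Each model would then be asked to score every entry in `challenge["candidates"]`, with `correct_index` kept aside for evaluation.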

Result Analysis
- If a model consistently places the real leftover example at or near the top, it implies strong thematic generalization.
- We compile the results into multiple stats, including average rank, difference vs. the anti-example average, fraction of times the real item is top, etc.
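The "difference vs. the anti-example average" statistic can be illustrated with a short sketch. The function name and sample scores are hypothetical, and treating the difference as the correct item's score minus the plain mean of the seven distractor scores is an assumption about how the statistic is defined.

```python
from statistics import fmean


def difference_score(scores, correct_index):
    """Score of the correct candidate minus the mean score of the distractors."""
    distractors = [s for i, s in enumerate(scores) if i != correct_index]
    return scores[correct_index] - fmean(distractors)


# Hypothetical 0-10 scores for 8 candidates; the correct item is at index 3.
scores = [2, 3, 1, 8, 4, 2, 1, 1]
print(difference_score(scores, 3))  # 6.0
```

A larger positive value means the model separates the real item more clearly from the distractors, even in cases where the rank alone is the same.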

Examples

862

Examples: mathematical models, decision trees, flowcharts

Anti-examples: diagrams, maps, blueprints

Candidates:

checklists
spreadsheets
weather forecasts
mind mapping <- correct pick
road signs
instruction manuals
summaries
outlines

Theme: "Concepts or systems that involve solving complex problems through simplification or abstraction"

376

Examples: clay pot, bamboo sieve, calabash gourd spoon

Anti-examples: plastic serving spoon, rubber spatula, plastic strainer

Candidates:

cast iron skillet
wooden mortar <- correct pick
nylon cooking utensils
ceramic bowl
stone grinding wheel
porcelain plate
silicone baking mat
bamboo steamer basket

Theme: "Tools or implements traditionally used in West African food preparation that are made primarily from a single, naturally occurring material."

Updates and Other Benchmarks

Also check out the LLM Step Game, LLM Creative Story-Writing Benchmark, LLM Confabulation/Hallucination Benchmark, LLM Deception Benchmark, NYT Connections Benchmark, and LLM Divergent Thinking Creativity Benchmark.
Follow @lechmazur on X (Twitter) for other upcoming benchmarks and more.

Folders and files

Latest commit: lechmazur, "Update README.md", 5a9dbfe, Jan 21, 2025 (4 commits)

Name	Last commit message	Last commit date
double_check	first commit	Jan 14, 2025
double_check_res	first commit	Jan 14, 2025
examples_res	first commit	Jan 14, 2025
pick	first commit	Jan 14, 2025
pick_res	first commit	Jan 14, 2025
prompts	first commit	Jan 14, 2025
prompts_examples	first commit	Jan 14, 2025
themes	first commit	Jan 14, 2025
README.md	Update README.md	Jan 21, 2025
prompt_double_check.txt	first commit	Jan 14, 2025
prompt_examples.txt	first commit	Jan 14, 2025
prompt_pick.txt	first commit	Jan 14, 2025
prompt_rule.txt	first commit	Jan 14, 2025
