first commit
lechmazur committed Jan 14, 2025
0 parents commit 0e526c6
Showing 35,371 changed files with 743,316 additions and 0 deletions.
114 changes: 114 additions & 0 deletions README.md
@@ -0,0 +1,114 @@
# LLM Thematic Generalization Benchmark

This benchmark measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which item truly fits that theme among a collection of misleading candidates. The overall process involves generating themes, creating examples and anti-examples, filtering out low-quality data via a "double-check" step, and finally prompting LLMs to score the real example among several distractors.


## Visualizations

### 1. **Average Rank of the Correct Example**
This bar chart displays, for each model, the **average rank** that model assigns to the true example (when placed among seven distractors). Ranks range from 1 (top score) to 8 (lowest).
- **Smaller bars** indicate **better** performance: the correct example is consistently placed near the top.
- A bar height of 2.0 would mean that on average, the leftover correct item was the second-highest-scored candidate.

### 2. **Distribution of Ranks**
A more granular view of the ranks each model assigns to the leftover correct example per file, showing how stable or varied those ranks are across different themes.

### 3. **Model–Model Correlation**
A correlation matrix based on how similarly each pair of models assigns a “difference score” to the correct example vs. the anti-examples. It highlights which LLMs behave alike and which deviate significantly.

### 4. **How Often the Correct Example is the Highest Score**
A stacked bar chart indicating how frequently each model places the real leftover example strictly at the top (or tied for top). This quickly shows which LLMs are best at ensuring the real item is #1 vs. merely near the top.
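
To make these metrics concrete, here is a minimal sketch in Python that turns one model's eight 0–10 scores for a single file into the rank of the correct example (visualization 1) and a top-or-tied flag (visualization 4). The function name and the tie-handling convention (ties share the average rank) are assumptions for illustration; the repository's own analysis scripts may differ.

```python
def rank_and_top(scores: list[float], correct_idx: int) -> tuple[float, bool]:
    """scores: one model's 0-10 scores for the 8 candidates in one file.
    correct_idx: position of the true leftover example in that list.

    Returns the rank of the correct example (1 = best; tied candidates
    share the average rank) and whether it is strictly top or tied for top."""
    correct = scores[correct_idx]
    higher = sum(s > correct for s in scores)       # candidates scored above it
    tied = sum(s == correct for s in scores) - 1    # other candidates with the same score
    rank = higher + 1 + tied / 2                    # average rank over the tie group
    top_or_tied = higher == 0
    return rank, top_or_tied

# Example: the correct item (index 3) shares the best score with one distractor.
print(rank_and_top([7, 4, 9, 9, 2, 5, 1, 3], 3))    # -> (1.5, True)
```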

## Leaderboard

|Rank|Model|Avg Rank|Skipped/Total|
|----:|-----|-------:|------------:|
|1|o1|1.80|0/810|
|2|Gemini 2.0 Flash Thinking Exp|1.90|0/810|
|3|Claude 3.5 Sonnet 2024-10-22|1.93|0/810|
|4|o1-mini|1.95|0/810|
|5|GPT-4o|1.96|0/810|
|6|Gemini 2.0 Flash Exp|2.00|0/810|
|7|DeepSeek-V3|2.03|0/810|
|8|Qwen QwQ *|2.05|280/810|
|9|Llama 3.1 405B|2.08|0/810|
|10|Mistral Large 2|2.11|0/810|
|11|Llama 3.3 70B|2.12|0/810|
|12|Gemini 1.5 Pro (Sept)|2.13|0/810|
|13|Gemini 1.5 Flash|2.13|0/810|
|14|Grok 2 12-12|2.21|0/810|
|15|Qwen 2.5 72B|2.21|0/810|
|16|Claude 3.5 Haiku|2.25|0/810|
|17|GPT-4o mini|2.30|0/810|
|18|Gemma 2 27B|2.60|0/810|

In the table:
- Avg Rank is the mean rank the model assigns to the correct example across the 810 test files.
- Skipped indicates how many outputs failed to parse or didn’t follow the required output format (e.g., missing `<number>` and `<score>` tags); a minimal parsing sketch follows below.
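
As a point of reference, here is a minimal parsing sketch in Python, assuming a simple regex over the `<number>`/`<score>` tags that the prompts later in this commit require. The function name and the exact skip rule are illustrative assumptions, not the repository's actual script:

```python
import re

TAG_RE = re.compile(r"<number>(\d+)</number>\s*<score>(\d+)</score>")

def parse_scores(output: str, expected: int = 8) -> dict[int, int] | None:
    """Extract candidate-number -> score pairs from a model's raw output.

    Returns None (i.e., the response is counted as skipped) when the tags
    are missing or the wrong number of candidates was scored."""
    pairs = {int(n): int(s) for n, s in TAG_RE.findall(output)}
    return pairs if len(pairs) == expected else None

raw = "<number>1</number><score>6</score>\n<number>2</number><score>9</score>"
print(parse_scores(raw))  # -> None: only 2 of the 8 expected pairs were found
```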

## Benchmark Method in Detail

1. **Theme & Example Creation**

We prompt high-quality LLMs (Claude 3.5 Sonnet, Grok 2, Gemini 1.5 Pro, GPT-4o, DeepSeek-V3) to generate **2,000** unique, succinct “themes” that focus on a narrow concept. Each is based on a random trio of starting points, ensuring novelty.

2. **Gather Examples & Anti-Examples**

For each theme, we then request four `<example>` entries that specifically fit it, plus **twenty** `<anti_example>` entries that could belong to a broader or partially overlapping category but do **not** fit the exact theme.

3. **Quality Check (“Double Check”)**
- We create specialized prompts that ask LLMs to *score* how well each of the **four real examples** (#1–4) matches the theme, and how well each of the **twenty anti-examples** (#5–24) fits the notion of being “broader or related but *not* the theme.”
- We parse these 24 numeric scores (per file, per LLM), computing standardized z-scores. If a real example scores poorly (z < -2.5) or if the top anti-examples fail to show sufficiently high “anti-example” scores, that file is discarded (a minimal filtering sketch appears after this list).
- From the initial 2,000 sets, we end up **retaining 810** sets (themes + examples + anti-examples).

4. **Final “Pick” Challenge**
- From each of the 810 validated sets, the final prompt includes 3 real examples + 3 anti-examples as context.
- The *fourth* real example is hidden among the 7 top "misleading" anti-examples (8 candidates in total).
- We then prompt **18 different LLMs** to assign a 0–10 score to each of these 8 candidates. A perfect approach would always rank the correct example #1.

5. **Result Analysis**
- If a model consistently places the real leftover example at or near the top, it implies strong thematic generalization.
- We compile the results into multiple stats, including average rank, difference vs. the anti-example average, fraction of times the real item is top, etc.
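
The discard rule from step 3 can be sketched roughly as follows in Python. The z < -2.5 cutoff comes from the description above, while the thresholds for the anti-example check (`min_top_anti`, `top_k`) and the function itself are illustrative assumptions rather than the repository's actual criterion; aggregation across the several judging LLMs is also omitted for clarity.

```python
from statistics import mean, pstdev

def keep_file(scores: list[float], z_cut: float = -2.5,
              min_top_anti: float = 7.0, top_k: int = 7) -> bool:
    """scores: the 24 double-check scores for one file from one judging LLM
    (indices 0-3 are the real examples, 4-23 the anti-examples).

    A file is kept only if no real example is a strong negative outlier and
    the best anti-examples look convincingly 'anti'. The min_top_anti/top_k
    rule is an assumed stand-in for the repository's actual criterion."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return False                        # degenerate scoring, discard
    z = [(s - mu) / sigma for s in scores]
    if any(zi < z_cut for zi in z[:4]):     # a real example scored far too low
        return False
    top_anti = sorted(scores[4:], reverse=True)[:top_k]
    return min(top_anti) >= min_top_anti    # the 7 best distractors must score high
```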

## Examples

### 862

Examples: mathematical models, decision trees, flowcharts

Anti-examples: diagrams, maps, blueprints

Candidates:
1. checklists
2. spreadsheets
3. weather forecasts
4. **mind mapping** <- correct pick
5. road signs
6. instruction manuals
7. summaries
8. outlines

Theme: "Concepts or systems that involve solving complex problems through simplification or abstraction"

### 376

Examples: clay pot, bamboo sieve, calabash gourd spoon

Anti-examples: plastic serving spoon, rubber spatula, plastic strainer

Candidates:
1. cast iron skillet
2. **wooden mortar** <- correct pick
3. nylon cooking utensils
4. ceramic bowl
5. stone grinding wheel
6. porcelain plate
7. silicone baking mat
8. bamboo steamer basket

Theme: "Tools or implements traditionally used in West African food preparation that are made primarily from a single, naturally occurring material."


## Updates and Other Benchmarks
- Also check out the [LLM Creative Story-Writing Benchmark](https://github.com/lechmazur/writing), [LLM Confabulation/Hallucination Benchmark](https://github.com/lechmazur/confabulations/), [LLM Deception Benchmark](https://github.com/lechmazur/deception), [NYT Connections Benchmark](https://github.com/lechmazur/nyt-connections/), and [LLM Divergent Thinking Creativity Benchmark](https://github.com/lechmazur/divergent).
- Follow [@lechmazur](https://x.com/LechMazur) on X (Twitter) for other upcoming benchmarks and more.
45 changes: 45 additions & 0 deletions double_check/dc_prompt_0.txt
@@ -0,0 +1,45 @@
Theme, rule, criterion, or category (referred to as "theme"): objects or entities that combine organic and inorganic elements in their design or composition, creating a hybrid aesthetic or functional purpose

Here are four examples intended to follow this theme:
1. a cyborg arm prosthetic
2. a living wall with integrated irrigation system
3. a bionic contact lens
4. a bio-concrete building facade

Your first task is to evaluate the examples above on a scale of integer scores from 0 (least) to 10 (most), based on how well they align with the theme and on whether they avoid revealing the theme too obviously. For each example, output its number in tags <number></number> and its score as an integer in tags <score></score>. For example:

<number>1</number><score>6</score>
<number>2</number><score>9</score>
<number>3</number><score>0</score>
<number>4</number><score>4</score>

Here are anti-examples intended to follow a broader category but not the specific theme:
5. a wooden table
6. a plastic flower
7. a stone sculpture
8. a metal bridge
9. a glass vase
10. a leather jacket
11. a cotton shirt
12. a ceramic bowl
13. a rubber tire
14. a paper notebook
15. a silk scarf
16. a bronze statue
17. a bamboo fence
18. a steel beam
19. a copper wire
20. a wool blanket
21. a concrete wall
22. a marble countertop
23. a porcelain sink
24. a brass doorknob

Your second task is to evaluate the candidates listed above. This time, the candidates are "anti-examples" that are not meant to exemplify the specific theme but rather a theme that is more general or similar; they could mislead the user into confusion. Anti-examples may be things connected, linked, or associated with the specific theme BUT NOT examples of this specific theme (unlike earlier). Evaluate them on a scale of integer scores from 0 (least) to 10 (most) based on how well they fit this specification of not matching the specific theme but matching something broader or connected. Use the same format as before. Example:

<number>5</number><score>2</score>
<number>6</number><score>8</score>
...
<number>24</number><score>3</score>

Do not output anything else.
45 changes: 45 additions & 0 deletions double_check/dc_prompt_1.txt
@@ -0,0 +1,45 @@
Theme, rule, criterion, or category (referred to as "theme"): Objects or concepts that combine a physical or symbolic spiral structure with a functional or cultural significance tied to storytelling or communication.

Here are four examples intended to follow this theme:
1. the spiral staircase in a lighthouse
2. the spiral structure of a conch shell used as a horn
3. the spiral-bound notebook used for writing stories
4. the spiral galaxy depicted in a mythological tale

Your first task is to evaluate the examples above on a scale of integer scores from 0 (least) to 10 (most), based on how well they align with the theme and on whether they avoid revealing the theme too obviously. For each example, output its number in tags <number></number> and its score as an integer in tags <score></score>. For example:

<number>1</number><score>6</score>
<number>2</number><score>9</score>
<number>3</number><score>0</score>
<number>4</number><score>4</score>

Here are anti-examples intended to follow a broader category but not the specific theme:
5. a straight road
6. a flat piece of paper
7. a digital clock
8. a rectangular book
9. a linear timeline
10. a square painting
11. a flat map
12. a straight ladder
13. a rectangular window
14. a flat screen TV
15. a straight river
16. a rectangular table
17. a flat mirror
18. a straight line of text
19. a flat canvas
20. a straight path
21. a rectangular photograph
22. a flat board game
23. a straight beam of light
24. a flat smartphone screen

Your second task is to evaluate the candidates listed above. This time, the candidates are "anti-examples" that are not meant to exemplify the specific theme but rather a theme that is more general or similar; they could mislead the user into confusion. Anti-examples may be things connected, linked, or associated with the specific theme BUT NOT examples of this specific theme (unlike earlier). Evaluate them on a scale of integer scores from 0 (least) to 10 (most) based on how well they fit this specification of not matching the specific theme but matching something broader or connected. Use the same format as before. Example:

<number>5</number><score>2</score>
<number>6</number><score>8</score>
...
<number>24</number><score>3</score>

Do not output anything else.
45 changes: 45 additions & 0 deletions double_check/dc_prompt_10.txt
@@ -0,0 +1,45 @@
Theme, rule, criterion, or category (referred to as "theme"): Objects or devices designed to reduce sensory input or perception, either temporarily or permanently, to enhance focus, comfort, or safety.

Here are four examples intended to follow this theme:
1. Earplugs used during sleep to block out noise
2. Light-blocking sleep masks
3. Noise-canceling headphones used in open office environments
4. White noise machines to mask distracting sounds

Your first task is to evaluate the examples above on a scale of integer scores from 0 (least) to 10 (most), based on how well they align with the theme and on whether they avoid revealing the theme too obviously. For each example, output its number in tags <number></number> and its score as an integer in tags <score></score>. For example:

<number>1</number><score>6</score>
<number>2</number><score>9</score>
<number>3</number><score>0</score>
<number>4</number><score>4</score>

Here are anti-examples intended to follow a broader category but not the specific theme:
5. Headphones for listening to music
6. Sunglasses to protect eyes from UV rays
7. Eye drops to relieve dryness
8. Air purifiers to improve air quality
9. Ergonomic chairs to enhance comfort while working
10. Blue light filters on screens to reduce eye strain
11. Meditation apps to aid in relaxation
12. Heating pads for muscle relaxation
13. Aromatherapy diffusers to create a calming environment
14. Sound amplifiers for hearing assistance
15. Weighted blankets for anxiety relief
16. Smart home devices to control lighting and temperature
17. Virtual reality headsets for immersive experiences
18. Adjustable desks for standing or sitting work positions
19. Noise-making toys for children
20. Flashlights for visibility in dark environments
21. Alarm clocks to wake up on time
22. Insulated mugs to keep beverages at desired temperatures
23. Portable fans for personal cooling
24. Waterproof phone cases for protection against water damage

Your second task is to evaluate the candidates listed above. This time, the candidates are "anti-examples" that are not meant to exemplify the specific theme but rather a theme that is more general or similar; they could mislead the user into confusion. Anti-examples may be things connected, linked, or associated with the specific theme BUT NOT examples of this specific theme (unlike earlier). Evaluate them on a scale of integer scores from 0 (least) to 10 (most) based on how well they fit this specification of not matching the specific theme but matching something broader or connected. Use the same format as before. Example:

<number>5</number><score>2</score>
<number>6</number><score>8</score>
...
<number>24</number><score>3</score>

Do not output anything else.
45 changes: 45 additions & 0 deletions double_check/dc_prompt_100.txt
@@ -0,0 +1,45 @@
Theme, rule, criterion, or category (referred to as "theme"): architectural or theoretical frameworks that emerged in the mid-20th century as responses to global homogenization, emphasizing local context, complexity, and non-linear systems

Here are four examples intended to follow this theme:
1. Metabolism (architecture)
2. Tropical Architecture
3. Critical Regionalism
4. Arcology

Your first task is to evaluate the examples above on a scale of integer scores from 0 (least) to 10 (most), based on how well they align with the theme and on whether they avoid revealing the theme too obviously. For each example, output its number in tags <number></number> and its score as an integer in tags <score></score>. For example:

<number>1</number><score>6</score>
<number>2</number><score>9</score>
<number>3</number><score>0</score>
<number>4</number><score>4</score>

Here are anti-examples intended to follow a broader category but not the specific theme:
5. International Style
6. Modernism (architecture)
7. Deconstructivism
8. Postmodern architecture
9. Brutalism (architecture)
10. Blobitecture
11. Sustainable architecture
12. Vernacular architecture
13. Organic architecture
14. Art Nouveau
15. Arts and Crafts movement
16. Bauhaus
17. parametric design
18. Biomimicry (architecture)
19. High-tech architecture
20. Green building
21. Passive house
22. Feng shui
23. Vaastu Shastra
24. Landscape urbanism

Your second task is to evaluate the candidates listed above. This time, the candidates are "anti-examples" that are not meant to exemplify the specific theme but rather a theme that is more general or similar; they could mislead the user into confusion. Anti-examples may be things connected, linked, or associated with the specific theme BUT NOT examples of this specific theme (unlike earlier). Evaluate them on a scale of integer scores from 0 (least) to 10 (most) based on how well they fit this specification of not matching the specific theme but matching something broader or connected. Use the same format as before. Example:

<number>5</number><score>2</score>
<number>6</number><score>8</score>
...
<number>24</number><score>3</score>

Do not output anything else.
45 changes: 45 additions & 0 deletions double_check/dc_prompt_1000.txt
@@ -0,0 +1,45 @@
Theme, rule, criterion, or category (referred to as "theme"): Natural or man-made phenomena that leave a temporary, visible trace or mark after their primary event or action has ceased.

Here are four examples intended to follow this theme:
1. The skid marks left by a car on a road after it has stopped.
2. The contrails left by an airplane in the sky after it has passed.
3. The footprints left in the sand after a person has walked along the beach.
4. The scorch marks left on a surface after a firework has exploded.

Your first task is to evaluate the examples above on a scale of integer scores from 0 (least) to 10 (most), based on how well they align with the theme and on whether they avoid revealing the theme too obviously. For each example, output its number in tags <number></number> and its score as an integer in tags <score></score>. For example:

<number>1</number><score>6</score>
<number>2</number><score>9</score>
<number>3</number><score>0</score>
<number>4</number><score>4</score>

Here are anti-examples intended to follow a broader category but not the specific theme:
5. A permanent tattoo on someone's skin.
6. The sound of thunder after a lightning strike.
7. A permanent scar from a healed wound.
8. The smell of rain after a storm has passed.
9. A permanent road sign indicating a speed limit.
10. The heat felt after a fire has been extinguished.
11. A permanent monument commemorating an event.
12. The echo of a shout in a canyon after the person has stopped shouting.
13. A permanent graffiti artwork on a wall.
14. The vibration felt after an earthquake has stopped.
15. A permanent memorial plaque on a building.
16. The smell of smoke lingering after a barbecue has ended.
17. A permanent mural painted on a city wall.
18. The sound of a car engine after the car has driven away.
19. A permanent statue in a public park.
20. The feeling of warmth after sitting in the sun.
21. A permanent billboard advertisement.
22. The sound of waves crashing after a boat has passed.
23. A permanent engraving on a piece of jewelry.
24. The smell of a perfume lingering after the person wearing it has left.

Your second task is to evaluate the candidates listed above. This time, the candidates are "anti-examples" that are not meant to exemplify the specific theme but rather a theme that is more general or similar; they could mislead the user into confusion. Anti-examples may be things connected, linked, or associated with the specific theme BUT NOT examples of this specific theme (unlike earlier). Evaluate them on a scale of integer scores from 0 (least) to 10 (most) based on how well they fit this specification of not matching the specific theme but matching something broader or connected. Use the same format as before. Example:

<number>5</number><score>2</score>
<number>6</number><score>8</score>
...
<number>24</number><score>3</score>

Do not output anything else.