Benchmarks
Available benchmark suites
| Name | Description | Questions | Runs | Avg Score | Last Run |
|---|---|---|---|---|---|
| Algebra | A benchmark to evaluate a model's ability to solve linear and quadratic equations with integer solutions. Includes single-variable linear equations (ax + b = c) and quadratic equations with one or two integer roots. | 1080 | 1080 | 60.6/100 | 2026-06-03 18:52 |
| Antonym Identification | A benchmark to evaluate a model's ability to identify the antonym of a word. | 1080 | 1080 | 89.1/100 | 2026-06-03 18:42 |
| Book Author Match | A benchmark to evaluate matching famous books to their correct authors. | 414 | 414 | 76.0/100 | 2026-06-03 19:36 |
| Definitions | A benchmark to evaluate a model's ability to identify the correct definition of words. | 880 | 880 | 83.0/100 | 2026-06-03 19:26 |
| English Plural Generation | A benchmark to evaluate a model's ability to produce the correct plural form of English nouns, covering regular, -es, -ies, -ves, irregular, invariant, and Latin/Greek pluralization rules. | 960 | 960 | 92.0/100 | 2026-06-03 19:28 |
| Food Category Classification | A benchmark to evaluate classification of food items by category. | 480 | 480 | 77.5/100 | 2026-06-03 19:37 |
| Fractions and Percentages | A benchmark to evaluate a model's ability to calculate percentages and fractions, including percent-of, fraction-of, and percent change problems. | 1080 | 1080 | 87.2/100 | 2026-06-03 18:50 |
| Geography Knowledge | A benchmark to evaluate a model's knowledge of world geography through multiple-choice questions about countries, capitals, physical features, and other geographical information. | 960 | 960 | 83.9/100 | 2026-06-03 19:29 |
| Geometry | A benchmark to evaluate a model's ability to calculate area, perimeter, and volume for standard shapes: rectangles, triangles, rectangular boxes, and circles (using π ≈ 3.14159). | 1080 | 1080 | 75.6/100 | 2026-06-03 18:55 |
| Historical Event Year | A benchmark to evaluate selecting the correct year for major historical events. | 432 | 432 | 70.5/100 | 2026-06-03 19:39 |
| Lemma Identification | A benchmark to evaluate a model's ability to identify the lemma (base form) of a given word. The lemma is the dictionary form: for nouns, the singular form (e.g., "cats" → "cat"); for verbs, the infinitive form without "to" (e.g., "running" → "run"); for adjectives, the positive form (e.g., "better" → "good"). No questions generated yet. | 0 | 0 | - | Never |
| Letter Count | A benchmark to evaluate a model's ability to count how many times a specific letter appears in a word. | 1080 | 1080 | 38.5/100 | 2026-06-03 18:37 |
| Math Word Problems | A benchmark to evaluate a model's ability to read math word problems and extract the relevant numbers to compute the correct answer. Approximately one third of questions contain distractor/unused information. | 1080 | 1080 | 85.6/100 | 2026-06-03 18:49 |
| Multilingual Synonym Generation | A benchmark to evaluate a model's ability to generate noun synonyms in multiple languages. | 1404 | 1404 | 76.1/100 | 2026-06-03 18:44 |
| Part of Speech | A benchmark to evaluate a model's ability to identify the part of speech of a specific word in a sentence. | 960 | 960 | 89.9/100 | 2026-06-03 19:27 |
| Pinyin Letter Count | A benchmark to evaluate a model's ability to count how many times a specific letter appears in the Pinyin representation of a Chinese sentence. | 540 | 540 | 23.0/100 | 2026-06-03 18:45 |
| Python GCD With Validation | Write a Python 3.12 function for GCD with invalid-input exceptions. | 23 | 23 | 80.9/100 | 2026-06-03 19:40 |
| Python Hello World Function | Write a Python 3.12 function that prints Hello world. | 23 | 23 | 91.3/100 | 2026-06-03 19:39 |
| Python Letter Count in String | Count occurrences of a target letter in a string. | 22 | 22 | 81.4/100 | 2026-04-02 21:23 |
| Python Minimum Coin Change | Compute minimum number of coins to make a target amount. | 22 | 22 | 73.2/100 | 2026-04-02 21:23 |
| Python Prime Factorization | Return the prime factorization of a positive integer. | 22 | 22 | 70.9/100 | 2026-04-02 21:23 |
| Sentence Decomposition | A benchmark to evaluate a model's ability to produce multilingual token-level sentence decomposition with grammatical metadata. No questions generated yet. | 0 | 0 | - | Never |
| Simple Arithmetic | A benchmark to evaluate a model's ability to perform basic arithmetic: addition, subtraction, multiplication, and division. | 1080 | 1080 | 94.0/100 | 2026-06-03 18:46 |
| Spell Check | A benchmark to evaluate a model's ability to identify misspelled words in a sentence and provide their correct spelling. | 1080 | 1080 | 82.6/100 | 2026-06-03 18:41 |
| Syllable Count | Tests ability to count syllables in words across Latin-alphabet languages. | 1080 | 1080 | 53.1/100 | 2026-06-03 18:40 |
| Syllogism Validity | A benchmark to evaluate whether a model can determine if short categorical syllogisms are logically valid. | 368 | 368 | 67.3/100 | 2026-06-03 19:34 |
| Time Arithmetic | A benchmark to evaluate a model's ability to add and subtract durations from clock times in 24-hour HH:MM format. | 1080 | 1080 | 58.6/100 | 2026-06-03 18:54 |
| Translation en_de | A benchmark to evaluate a model's ability to translate words from one language to another. | 1080 | 1080 | 80.2/100 | 2026-06-03 18:59 |
| Translation en_es | A benchmark to evaluate a model's ability to translate words from one language to another. | 1080 | 1080 | 81.4/100 | 2026-06-03 18:58 |
| Translation en_fr | A benchmark to evaluate a model's ability to translate words from one language to another. | 1080 | 1080 | 79.3/100 | 2026-06-03 18:56 |
| Translation en_ja | A benchmark to evaluate a model's ability to translate words from one language to another. | 1080 | 1080 | 76.9/100 | 2026-06-03 19:05 |
| Translation en_zh | A benchmark to evaluate a model's ability to translate words from one language to another. | 1080 | 1080 | 81.0/100 | 2026-06-03 19:03 |
| Translation fr_es | A benchmark to evaluate a model's ability to translate words from one language to another. | 1080 | 1080 | 73.1/100 | 2026-06-03 19:00 |
| Translation fr_ko | A benchmark to evaluate a model's ability to translate words from one language to another. | 1080 | 1080 | 73.7/100 | 2026-06-03 19:06 |
| Translation it_lt | A benchmark to evaluate a model's ability to translate words from one language to another. | 1080 | 1080 | 63.0/100 | 2026-06-03 19:07 |
| Translation ja_lt | A benchmark to evaluate a model's ability to translate words from one language to another. | 1080 | 1080 | 70.0/100 | 2026-06-03 19:09 |
| Unit Conversion | A benchmark to evaluate a model's ability to accurately convert between different units of measurement. | 1080 | 1080 | 72.1/100 | 2026-06-03 18:48 |
| Validate Bulk IPA/Phonetic (bebras) | A regression benchmark for Bebras bulk pronunciation verification. Tests whether the model returns only words with wrong IPA/phonetic mappings from 20-word lists with English + Chinese disambiguation. | 5 | 5 | 20.0/100 | 2026-04-13 20:24 |
| Validate Definition (lokys) | A regression benchmark for the lokys agent's validate_definition() function. Tests whether the LLM correctly identifies well-formed vs. problematic word definitions (e.g. circular definitions, translations used as definitions). No questions generated yet. | 0 | 0 | - | Never |
| Validate Lemma Form (lokys) | A regression benchmark for the lokys agent's validate_lemma_form() function. Tests whether the LLM correctly identifies if a word is in its base/lemma form and suggests the correct form when it is not. No questions generated yet. | 0 | 0 | - | Never |
| Validate Translation (voras) | A regression benchmark for the voras agent's validate_all_translations_for_word() function. Tests whether the LLM correctly identifies semantically incorrect or non-lemma translations across multiple target languages. No questions generated yet. | 0 | 0 | - | Never |
| Verb Forms | A benchmark to evaluate a model's ability to generate full verb-form paradigms across persons and tenses in multiple languages. No questions generated yet. | 0 | 0 | - | Never |
| Vowel Count | Tests ability to count vowels (a, e, i, o, u and accented forms) in a word across Latin-alphabet languages. | 1080 | 1080 | 43.3/100 | 2026-06-03 18:38 |
| Word Length | A benchmark to evaluate a model's ability to count the total number of letters in a given word. | 1080 | 1080 | 63.4/100 | 2026-06-03 18:36 |
| Word to IPA | A benchmark to evaluate a model's ability to convert words from multiple languages to their IPA (International Phonetic Alphabet) pronunciation. | 80 | 80 | 78.5/100 | 2026-03-24 20:55 |
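
For orientation, the Python GCD With Validation suite listed above asks for a GCD function that raises exceptions on invalid input. The listing does not say which inputs count as invalid or which exception types are expected, so the following is only a minimal sketch under assumed rules: the name `gcd_checked` is hypothetical, non-integers (including `bool`) raise `TypeError`, and the pair (0, 0) raises `ValueError`.

```python
def gcd_checked(a: int, b: int) -> int:
    """Greatest common divisor with input validation (illustrative sketch)."""
    for name, value in (("a", a), ("b", b)):
        # Assumption: bool is rejected even though it is a subclass of int.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{name} must be an int, got {type(value).__name__}")
    # Assumption: gcd(0, 0) is treated as invalid here.
    # The standard library's math.gcd returns 0 for this case instead.
    if a == 0 and b == 0:
        raise ValueError("gcd(0, 0) is treated as undefined in this sketch")
    # Euclidean algorithm on absolute values, so negative inputs are accepted.
    a, b = abs(a), abs(b)
    while b:
        a, b = b, a % b
    return a


if __name__ == "__main__":
    print(gcd_checked(12, 18))
    print(gcd_checked(-4, 6))
```

A submission graded by the actual suite may need different exception types or a different rule for zero and negative inputs; those details would have to come from the suite's own questions.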