Benchmarks

Available benchmark suites

Name and Description	Questions	Runs	Avg Score	Last Run	Actions
Algebra A benchmark to evaluate a model's ability to solve linear and quadratic equations with integer solutions. Includes single-variable linear equations (ax + b = c) and quadratic equations with one or two integer roots.	1080	1080	60.6/100	2026-06-03 18:52	View Run
Antonym Identification A benchmark to evaluate a model's ability to identify the antonym of a word.	1080	1080	89.1/100	2026-06-03 18:42	View Run
Book Author Match A benchmark to evaluate matching famous books to their correct authors.	414	414	76.0/100	2026-06-03 19:36	View Run
Definitions A benchmark to evaluate a model's ability to identify the correct definition of words.	880	880	83.0/100	2026-06-03 19:26	View Run
English Plural Generation A benchmark to evaluate a model's ability to produce the correct plural form of English nouns, covering regular, -es, -ies, -ves, irregular, invariant, and Latin/Greek pluralization rules.	960	960	92.0/100	2026-06-03 19:28	View Run
Food Category Classification A benchmark to evaluate classification of food items by category.	480	480	77.5/100	2026-06-03 19:37	View Run
Fractions and Percentages A benchmark to evaluate a model's ability to calculate percentages and fractions, including percent-of, fraction-of, and percent change problems.	1080	1080	87.2/100	2026-06-03 18:50	View Run
Geography Knowledge A benchmark to evaluate a model's knowledge of world geography through multiple-choice questions about countries, capitals, physical features, and other geographical information.	960	960	83.9/100	2026-06-03 19:29	View Run
Geometry A benchmark to evaluate a model's ability to calculate area, perimeter, and volume for standard shapes: rectangles, triangles, rectangular boxes, and circles (using π ≈ 3.14159).	1080	1080	75.6/100	2026-06-03 18:55	View Run
Historical Event Year A benchmark to evaluate selecting the correct year for major historical events.	432	432	70.5/100	2026-06-03 19:39	View Run
Lemma Identification A benchmark to evaluate a model's ability to identify the lemma (base form) of a given word. The lemma is the dictionary form: the singular form for nouns (e.g., "cats" → "cat"), the infinitive without "to" for verbs (e.g., "running" → "run"), and the positive form for adjectives (e.g., "better" → "good"). (No questions; generate first.)	0	0	-	Never	View Generate
Letter Count A benchmark to evaluate a model's ability to count how many times a specific letter appears in a word.	1080	1080	38.5/100	2026-06-03 18:37	View Run
Math Word Problems A benchmark to evaluate a model's ability to read math word problems and extract the relevant numbers to compute the correct answer. Approximately one third of questions contain distractor/unused information.	1080	1080	85.6/100	2026-06-03 18:49	View Run
Multilingual Synonym Generation A benchmark to evaluate a model's ability to generate noun synonyms in multiple languages.	1404	1404	76.1/100	2026-06-03 18:44	View Run
Part of Speech A benchmark to evaluate a model's ability to identify the part of speech of a specific word in a sentence.	960	960	89.9/100	2026-06-03 19:27	View Run
Pinyin Letter Count A benchmark to evaluate a model's ability to count how many times a specific letter appears in the Pinyin representation of a Chinese sentence.	540	540	23.0/100	2026-06-03 18:45	View Run
Python GCD With Validation Write a Python 3.12 function that computes the GCD and raises exceptions on invalid input.	23	23	80.9/100	2026-06-03 19:40	View Run
Python Hello World Function Write a Python 3.12 function that prints Hello world.	23	23	91.3/100	2026-06-03 19:39	View Run
Python Letter Count in String Count occurrences of a target letter in a string.	22	22	81.4/100	2026-04-02 21:23	View Run
Python Minimum Coin Change Compute minimum number of coins to make a target amount.	22	22	73.2/100	2026-04-02 21:23	View Run
Python Prime Factorization Return the prime factorization of a positive integer.	22	22	70.9/100	2026-04-02 21:23	View Run
Sentence Decomposition A benchmark to evaluate a model's ability to produce multilingual token-level sentence decomposition with grammatical metadata. (No questions; generate first.)	0	0	-	Never	View Generate
Simple Arithmetic A benchmark to evaluate a model's ability to perform basic arithmetic: addition, subtraction, multiplication, and division.	1080	1080	94.0/100	2026-06-03 18:46	View Run
Spell Check A benchmark to evaluate a model's ability to identify misspelled words in a sentence and provide their correct spelling.	1080	1080	82.6/100	2026-06-03 18:41	View Run
Syllable Count A benchmark to evaluate a model's ability to count syllables in words across Latin-alphabet languages.	1080	1080	53.1/100	2026-06-03 18:40	View Run
Syllogism Validity A benchmark to evaluate whether a model can determine if short categorical syllogisms are logically valid.	368	368	67.3/100	2026-06-03 19:34	View Run
Time Arithmetic A benchmark to evaluate a model's ability to add and subtract durations from clock times in 24-hour HH:MM format.	1080	1080	58.6/100	2026-06-03 18:54	View Run
Translation en_de A benchmark to evaluate a model's ability to translate words in the en_de language pair (English and German).	1080	1080	80.2/100	2026-06-03 18:59	View Run
Translation en_es A benchmark to evaluate a model's ability to translate words in the en_es language pair (English and Spanish).	1080	1080	81.4/100	2026-06-03 18:58	View Run
Translation en_fr A benchmark to evaluate a model's ability to translate words in the en_fr language pair (English and French).	1080	1080	79.3/100	2026-06-03 18:56	View Run
Translation en_ja A benchmark to evaluate a model's ability to translate words in the en_ja language pair (English and Japanese).	1080	1080	76.9/100	2026-06-03 19:05	View Run
Translation en_zh A benchmark to evaluate a model's ability to translate words in the en_zh language pair (English and Chinese).	1080	1080	81.0/100	2026-06-03 19:03	View Run
Translation fr_es A benchmark to evaluate a model's ability to translate words in the fr_es language pair (French and Spanish).	1080	1080	73.1/100	2026-06-03 19:00	View Run
Translation fr_ko A benchmark to evaluate a model's ability to translate words in the fr_ko language pair (French and Korean).	1080	1080	73.7/100	2026-06-03 19:06	View Run
Translation it_lt A benchmark to evaluate a model's ability to translate words in the it_lt language pair (Italian and Lithuanian).	1080	1080	63.0/100	2026-06-03 19:07	View Run
Translation ja_lt A benchmark to evaluate a model's ability to translate words in the ja_lt language pair (Japanese and Lithuanian).	1080	1080	70.0/100	2026-06-03 19:09	View Run
Unit Conversion A benchmark to evaluate a model's ability to accurately convert between different units of measurement.	1080	1080	72.1/100	2026-06-03 18:48	View Run
Validate Bulk IPA/Phonetic (bebras) A regression benchmark for Bebras bulk pronunciation verification. Tests whether the model returns only the words with incorrect IPA/phonetic mappings from 20-word lists with English and Chinese disambiguation.	5	5	20.0/100	2026-04-13 20:24	View Run
Validate Definition (lokys) A regression benchmark for the lokys agent's validate_definition() function. Tests whether the LLM correctly identifies well-formed vs. problematic word definitions (e.g. circular definitions, translations used as definitions). (No questions; generate first.)	0	0	-	Never	View Generate
Validate Lemma Form (lokys) A regression benchmark for the lokys agent's validate_lemma_form() function. Tests whether the LLM correctly identifies if a word is in its base/lemma form and suggests the correct form when it is not. (No questions; generate first.)	0	0	-	Never	View Generate
Validate Translation (voras) A regression benchmark for the voras agent's validate_all_translations_for_word() function. Tests whether the LLM correctly identifies semantically incorrect or non-lemma translations across multiple target languages. (No questions; generate first.)	0	0	-	Never	View Generate
Verb Forms A benchmark to evaluate a model's ability to generate full verb-form paradigms across persons and tenses in multiple languages. (No questions; generate first.)	0	0	-	Never	View Generate
Vowel Count A benchmark to evaluate a model's ability to count vowels (a, e, i, o, u and accented forms) in a word across Latin-alphabet languages.	1080	1080	43.3/100	2026-06-03 18:38	View Run
Word Length A benchmark to evaluate a model's ability to count the total number of letters in a given word.	1080	1080	63.4/100	2026-06-03 18:36	View Run
Word to IPA A benchmark to evaluate a model's ability to convert words from multiple languages to their IPA (International Phonetic Alphabet) pronunciation.	80	80	78.5/100	2026-03-24 20:55	View Run
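
The Python rows in the table (Python GCD With Validation, Python Letter Count in String, Python Minimum Coin Change, Python Prime Factorization) describe small function-writing tasks. As an illustration only, the following is a minimal sketch of what solutions to such tasks might look like. The function names, signatures, exception choices, and return conventions (for example, returning -1 when an amount cannot be made) are assumptions; the listing does not specify them, and this is not the benchmarks' reference code.

```python
def gcd_validated(a: int, b: int) -> int:
    """Greatest common divisor via Euclid's algorithm, rejecting invalid input."""
    for value in (a, b):
        # Assumption: only plain integers are valid; bool is excluded explicitly
        # because it is a subclass of int.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError("gcd_validated expects integers")
    if a == 0 and b == 0:
        # Assumption: this sketch treats gcd(0, 0) as invalid input.
        raise ValueError("at least one argument must be non-zero")
    a, b = abs(a), abs(b)
    while b:
        a, b = b, a % b
    return a


def count_letter(text: str, letter: str) -> int:
    """Count case-sensitive occurrences of a single character in text."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return text.count(letter)


def min_coins(coins: list[int], amount: int) -> int:
    """Minimum number of coins summing to amount, or -1 if it cannot be made."""
    if amount < 0 or any(coin <= 0 for coin in coins):
        raise ValueError("amount must be >= 0 and coins must be positive")
    unreachable = float("inf")
    # best[t] holds the fewest coins needed to make total t.
    best = [0] + [unreachable] * amount
    for total in range(1, amount + 1):
        for coin in coins:
            if coin <= total and best[total - coin] + 1 < best[total]:
                best[total] = best[total - coin] + 1
    return -1 if best[amount] == unreachable else best[amount]


def prime_factorization(n: int) -> list[int]:
    """Prime factors of n in non-decreasing order, with multiplicity."""
    if isinstance(n, bool) or not isinstance(n, int) or n < 1:
        raise ValueError("n must be a positive integer")
    factors = []
    divisor = 2
    # Trial division: only divisors up to sqrt(n) need to be checked.
    while divisor * divisor <= n:
        while n % divisor == 0:
            factors.append(divisor)
            n //= divisor
        divisor += 1
    if n > 1:
        factors.append(n)  # whatever remains is itself prime
    return factors
```

Under these assumptions, gcd_validated(12, 18) returns 6, count_letter("banana", "a") returns 3, min_coins([1, 2, 5], 11) returns 3, and prime_factorization(360) returns [2, 2, 2, 3, 3, 5].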