Benchmarks

Available benchmark suites

Name Questions Runs Avg Score Last Run Actions
Algebra
A benchmark to evaluate a model's ability to solve linear and quadratic equations with integer solutions. Includes single-variable linear equations (ax + b = c) and quadratic equations with one or two integer roots.
1080 1080 60.6/100 2026-06-03 18:52 View Run
Antonym Identification
A benchmark to evaluate a model's ability to identify the antonym of a word.
1080 1080 89.1/100 2026-06-03 18:42 View Run
Book Author Match
A benchmark to evaluate matching famous books to their correct authors.
414 414 76.0/100 2026-06-03 19:36 View Run
Definitions
A benchmark to evaluate a model's ability to identify the correct definition of words.
880 880 83.0/100 2026-06-03 19:26 View Run
English Plural Generation
A benchmark to evaluate a model's ability to produce the correct plural form of English nouns, covering regular, -es, -ies, -ves, irregular, invariant, and Latin/Greek pluralization rules.
960 960 92.0/100 2026-06-03 19:28 View Run
Food Category Classification
A benchmark to evaluate classification of food items by category.
480 480 77.5/100 2026-06-03 19:37 View Run
Fractions and Percentages
A benchmark to evaluate a model's ability to calculate percentages and fractions, including percent-of, fraction-of, and percent change problems.
1080 1080 87.2/100 2026-06-03 18:50 View Run
Geography Knowledge
A benchmark to evaluate a model's knowledge of world geography through multiple-choice questions about countries, capitals, physical features, and other geographical information.
960 960 83.9/100 2026-06-03 19:29 View Run
Geometry
A benchmark to evaluate a model's ability to calculate area, perimeter, and volume for standard shapes: rectangles, triangles, rectangular boxes, and circles (using π ≈ 3.14159).
1080 1080 75.6/100 2026-06-03 18:55 View Run
Historical Event Year
A benchmark to evaluate selecting the correct year for major historical events.
432 432 70.5/100 2026-06-03 19:39 View Run
Lemma Identification
A benchmark to evaluate a model's ability to identify the lemma (base form) of a given word. The lemma is the dictionary form:
- For nouns: the singular form (e.g., "cats" → "cat")
- For verbs: the infinitive form without "to" (e.g., "running" → "run")
- For adjectives: the positive form (e.g., "better" → "good")
No questions — generate first
0 0 - Never View Generate
Letter Count
A benchmark to evaluate a model's ability to count how many times a specific letter appears in a word.
1080 1080 38.5/100 2026-06-03 18:37 View Run
Math Word Problems
A benchmark to evaluate a model's ability to read math word problems and extract the relevant numbers to compute the correct answer. Approximately one third of questions contain distractor/unused information.
1080 1080 85.6/100 2026-06-03 18:49 View Run
Multilingual Synonym Generation
A benchmark to evaluate a model's ability to generate noun synonyms in multiple languages.
1404 1404 76.1/100 2026-06-03 18:44 View Run
Part of Speech
A benchmark to evaluate a model's ability to identify the part of speech of a specific word in a sentence.
960 960 89.9/100 2026-06-03 19:27 View Run
Pinyin Letter Count
A benchmark to evaluate a model's ability to count how many times a specific letter appears in the Pinyin representation of a Chinese sentence.
540 540 23.0/100 2026-06-03 18:45 View Run
Python GCD With Validation
Write a Python 3.12 function for GCD with invalid-input exceptions.
23 23 80.9/100 2026-06-03 19:40 View Run
Python Hello World Function
Write a Python 3.12 function that prints Hello world.
23 23 91.3/100 2026-06-03 19:39 View Run
Python Letter Count in String
Count occurrences of a target letter in a string.
22 22 81.4/100 2026-04-02 21:23 View Run
Python Minimum Coin Change
Compute minimum number of coins to make a target amount.
22 22 73.2/100 2026-04-02 21:23 View Run
Python Prime Factorization
Return the prime factorization of a positive integer.
22 22 70.9/100 2026-04-02 21:23 View Run
Sentence Decomposition
A benchmark to evaluate a model's ability to produce multilingual token-level sentence decomposition with grammatical metadata.
No questions — generate first
0 0 - Never View Generate
Simple Arithmetic
A benchmark to evaluate a model's ability to perform basic arithmetic: addition, subtraction, multiplication, and division.
1080 1080 94.0/100 2026-06-03 18:46 View Run
Spell Check
A benchmark to evaluate a model's ability to identify misspelled words in a sentence and provide their correct spelling.
1080 1080 82.6/100 2026-06-03 18:41 View Run
Syllable Count
Tests ability to count syllables in words across Latin-alphabet languages.
1080 1080 53.1/100 2026-06-03 18:40 View Run
Syllogism Validity
A benchmark to evaluate whether a model can determine if short categorical syllogisms are logically valid.
368 368 67.3/100 2026-06-03 19:34 View Run
Time Arithmetic
A benchmark to evaluate a model's ability to add and subtract durations from clock times in 24-hour HH:MM format.
1080 1080 58.6/100 2026-06-03 18:54 View Run
Translation en_de
A benchmark to evaluate a model's ability to translate words from one language to another.
1080 1080 80.2/100 2026-06-03 18:59 View Run
Translation en_es
A benchmark to evaluate a model's ability to translate words from one language to another.
1080 1080 81.4/100 2026-06-03 18:58 View Run
Translation en_fr
A benchmark to evaluate a model's ability to translate words from one language to another.
1080 1080 79.3/100 2026-06-03 18:56 View Run
Translation en_ja
A benchmark to evaluate a model's ability to translate words from one language to another.
1080 1080 76.9/100 2026-06-03 19:05 View Run
Translation en_zh
A benchmark to evaluate a model's ability to translate words from one language to another.
1080 1080 81.0/100 2026-06-03 19:03 View Run
Translation fr_es
A benchmark to evaluate a model's ability to translate words from one language to another.
1080 1080 73.1/100 2026-06-03 19:00 View Run
Translation fr_ko
A benchmark to evaluate a model's ability to translate words from one language to another.
1080 1080 73.7/100 2026-06-03 19:06 View Run
Translation it_lt
A benchmark to evaluate a model's ability to translate words from one language to another.
1080 1080 63.0/100 2026-06-03 19:07 View Run
Translation ja_lt
A benchmark to evaluate a model's ability to translate words from one language to another.
1080 1080 70.0/100 2026-06-03 19:09 View Run
Unit Conversion
A benchmark to evaluate a model's ability to accurately convert between different units of measurement.
1080 1080 72.1/100 2026-06-03 18:48 View Run
Validate Bulk IPA/Phonetic (bebras)
A regression benchmark for Bebras bulk pronunciation verification. Tests whether the model returns only the words whose IPA/phonetic mappings are wrong, given 20-word lists with English + Chinese disambiguation.
5 5 20.0/100 2026-04-13 20:24 View Run
Validate Definition (lokys)
A regression benchmark for the lokys agent's validate_definition() function. Tests whether the LLM correctly identifies well-formed vs. problematic word definitions (e.g. circular definitions, translations used as definitions).
No questions — generate first
0 0 - Never View Generate
Validate Lemma Form (lokys)
A regression benchmark for the lokys agent's validate_lemma_form() function. Tests whether the LLM correctly identifies if a word is in its base/lemma form and suggests the correct form when it is not.
No questions — generate first
0 0 - Never View Generate
Validate Translation (voras)
A regression benchmark for the voras agent's validate_all_translations_for_word() function. Tests whether the LLM correctly identifies semantically incorrect or non-lemma translations across multiple target languages.
No questions — generate first
0 0 - Never View Generate
Verb Forms
A benchmark to evaluate a model's ability to generate full verb-form paradigms across persons and tenses in multiple languages.
No questions — generate first
0 0 - Never View Generate
Vowel Count
Tests ability to count vowels (a, e, i, o, u and accented forms) in a word across Latin-alphabet languages.
1080 1080 43.3/100 2026-06-03 18:38 View Run
Word Length
A benchmark to evaluate a model's ability to count the total number of letters in a given word.
1080 1080 63.4/100 2026-06-03 18:36 View Run
Word to IPA
A benchmark to evaluate a model's ability to convert words from multiple languages to their IPA (International Phonetic Alphabet) pronunciation.
80 80 78.5/100 2026-03-24 20:55 View Run
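
Several of the suites above (Letter Count, Time Arithmetic, Geometry, Fractions and Percentages) have answers that can be computed deterministically. The sketch below shows how reference answers for such questions might be derived. All function names and signatures are hypothetical and are not taken from the benchmark code. Case-insensitive letter matching and wrap-around at midnight are assumptions. Only the value π ≈ 3.14159 comes from the Geometry description.

```python
# Hypothetical reference-answer helpers. These names are illustrative
# assumptions and do not come from the benchmark suite itself.

PI = 3.14159  # approximation stated in the Geometry benchmark description


def letter_count(word: str, letter: str) -> int:
    """Count occurrences of one letter in a word (assumed case-insensitive)."""
    return word.lower().count(letter.lower())


def add_minutes(clock: str, minutes: int) -> str:
    """Shift a 24-hour HH:MM time by a number of minutes.

    Negative values subtract. Wrapping past midnight is an assumption.
    """
    hours, mins = map(int, clock.split(":"))
    total = (hours * 60 + mins + minutes) % (24 * 60)
    return f"{total // 60:02d}:{total % 60:02d}"


def circle_area(radius: float) -> float:
    """Area of a circle using the approximated PI constant above."""
    return PI * radius ** 2


def percent_change(old: float, new: float) -> float:
    """Percent change from old to new, relative to old (old must be nonzero)."""
    return (new - old) / old * 100
```

For example, `letter_count("strawberry", "r")` returns 3 and `add_minutes("23:30", 45)` returns "00:15". Helpers of this kind give an exact expected value to score a model's answer against.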