Benchmarks

Available benchmark suites

Name Questions Runs Avg Score Last Run Actions
Algebra
A benchmark to evaluate a model's ability to solve linear and quadratic equations with integer solutions. Includes single-variable linear equations (ax + b = c) and quadratic equations with one or two integer roots.
920 920 58.6/100 2026-03-03 21:04 View Run
Antonym Identification
A benchmark to evaluate a model's ability to identify the antonym of a word.
920 920 87.6/100 2026-03-03 20:59 View Run
Book Author Match
A benchmark to evaluate a model's ability to match famous books to their correct authors.
342 342 72.8/100 2026-03-03 21:39 View Run
Definitions
A benchmark to evaluate a model's ability to identify the correct definition of words.
720 720 79.3/100 2026-03-03 21:18 View Run
English Plural Generation
A benchmark to evaluate a model's ability to produce the correct plural form of English nouns, covering regular, -es, -ies, -ves, irregular, invariant, and Latin/Greek pluralization rules.
800 800 91.0/100 2026-03-03 21:20 View Run
Food Category Classification
A benchmark to evaluate classification of food items by category.
400 400 73.5/100 2026-03-03 21:42 View Run
Fractions and Percentages
A benchmark to evaluate a model's ability to calculate percentages and fractions, including percent-of, fraction-of, and percent change problems.
920 920 86.3/100 2026-03-03 21:03 View Run
Geography Knowledge
A benchmark to evaluate a model's knowledge of world geography through multiple-choice questions about countries, capitals, physical features, and other geographical information.
800 800 81.0/100 2026-03-03 21:28 View Run
Geometry
A benchmark to evaluate a model's ability to calculate area, perimeter, and volume for standard shapes: rectangles, triangles, rectangular boxes, and circles (using π ≈ 3.14159).
920 920 74.3/100 2026-03-03 21:05 View Run
Historical Event Year
A benchmark to evaluate a model's ability to select the correct year for major historical events.
360 360 65.2/100 2026-03-03 21:44 View Run
Lemma Identification
A benchmark to evaluate a model's ability to identify the lemma (base form) of a given word. The lemma is the dictionary form:
- For nouns: the singular form (e.g., "cats" → "cat")
- For verbs: the infinitive form without "to" (e.g., "running" → "run")
- For adjectives: the positive form (e.g., "better" → "good")
No questions — generate first
0 0 - Never View Generate
Letter Count
A benchmark to evaluate a model's ability to count how many times a specific letter appears in a word.
920 920 35.9/100 2026-03-03 20:55 View Run
Math Word Problems
A benchmark to evaluate a model's ability to read math word problems and extract the relevant numbers to compute the correct answer. Approximately one third of questions contain distractor/unused information.
920 920 85.6/100 2026-03-03 21:02 View Run
Multilingual Synonym Generation
A benchmark to evaluate a model's ability to generate noun synonyms in multiple languages.
1196 1196 73.6/100 2026-03-03 21:00 View Run
Part of Speech
A benchmark to evaluate a model's ability to identify the part of speech of a specific word in a sentence.
800 800 88.6/100 2026-03-03 21:19 View Run
Pinyin Letter Count
A benchmark to evaluate a model's ability to count how many times a specific letter appears in the Pinyin representation of a Chinese sentence.
460 460 22.2/100 2026-03-03 21:00 View Run
Python GCD With Validation
Write a Python 3.12 function that computes the greatest common divisor (GCD) and raises exceptions on invalid input.
19 19 81.1/100 2026-03-03 19:32 View Run
Python Hello World Function
Write a Python 3.12 function that prints Hello world.
19 19 89.5/100 2026-03-03 21:44 View Run
Python Letter Count in String
Count occurrences of a target letter in a string.
19 19 78.9/100 2026-03-03 19:32 View Run
Python Minimum Coin Change
Compute the minimum number of coins needed to make a target amount.
19 19 69.4/100 2026-03-03 19:32 View Run
Python Prime Factorization
Return the prime factorization of a positive integer.
19 19 66.3/100 2026-03-03 19:32 View Run
Sentence Decomposition
A benchmark to evaluate a model's ability to produce multilingual token-level sentence decomposition with grammatical metadata.
No questions — generate first
0 0 - Never View Generate
Simple Arithmetic
A benchmark to evaluate a model's ability to perform basic arithmetic: addition, subtraction, multiplication, and division.
920 920 93.2/100 2026-03-03 21:01 View Run
Spell Check
A benchmark to evaluate a model's ability to identify misspelled words in a sentence and provide their correct spelling.
920 920 81.0/100 2026-03-03 20:58 View Run
Syllable Count
Tests a model's ability to count syllables in words across Latin-alphabet languages.
920 920 50.9/100 2026-03-03 20:57 View Run
Syllogism Validity
A benchmark to evaluate whether a model can determine if short categorical syllogisms are logically valid.
304 304 61.2/100 2026-03-03 21:37 View Run
Time Arithmetic
A benchmark to evaluate a model's ability to add and subtract durations from clock times in 24-hour HH:MM format.
920 920 53.7/100 2026-03-03 21:04 View Run
Translation en_de
A benchmark to evaluate a model's ability to translate words from one language to another.
920 920 78.0/100 2026-03-03 21:23 View Run
Translation en_es
A benchmark to evaluate a model's ability to translate words from one language to another.
920 920 79.3/100 2026-03-03 21:22 View Run
Translation en_fr
A benchmark to evaluate a model's ability to translate words from one language to another.
920 920 77.0/100 2026-03-03 21:21 View Run
Translation en_ja
A benchmark to evaluate a model's ability to translate words from one language to another.
920 920 74.5/100 2026-03-03 21:25 View Run
Translation en_zh
A benchmark to evaluate a model's ability to translate words from one language to another.
920 920 79.9/100 2026-03-03 21:24 View Run
Translation fr_es
A benchmark to evaluate a model's ability to translate words from one language to another.
920 920 70.6/100 2026-03-03 21:23 View Run
Translation fr_ko
A benchmark to evaluate a model's ability to translate words from one language to another.
920 920 70.4/100 2026-03-03 21:26 View Run
Translation it_lt
A benchmark to evaluate a model's ability to translate words from one language to another.
920 920 61.0/100 2026-03-03 21:27 View Run
Translation ja_lt
A benchmark to evaluate a model's ability to translate words from one language to another.
920 920 67.6/100 2026-03-03 21:27 View Run
Unit Conversion
A benchmark to evaluate a model's ability to accurately convert between different units of measurement.
920 920 70.9/100 2026-03-03 21:02 View Run
Validate Definition (lokys)
A regression benchmark for the lokys agent's validate_definition() function. Tests whether the LLM correctly identifies well-formed vs. problematic word definitions (e.g. circular definitions, translations used as definitions).
No questions — generate first
0 0 - Never View Generate
Validate Lemma Form (lokys)
A regression benchmark for the lokys agent's validate_lemma_form() function. Tests whether the LLM correctly identifies if a word is in its base/lemma form and suggests the correct form when it is not.
No questions — generate first
0 0 - Never View Generate
Validate Translation (voras)
A regression benchmark for the voras agent's validate_all_translations_for_word() function. Tests whether the LLM correctly identifies semantically incorrect or non-lemma translations across multiple target languages.
No questions — generate first
0 0 - Never View Generate
Verb Forms
A benchmark to evaluate a model's ability to generate full verb-form paradigms across persons and tenses in multiple languages.
No questions — generate first
0 0 - Never View Generate
Vowel Count
Tests a model's ability to count vowels (a, e, i, o, u and accented forms) in a word across Latin-alphabet languages.
920 920 41.7/100 2026-03-03 20:56 View Run
Word Length
A benchmark to evaluate a model's ability to count the total number of letters in a given word.
920 920 59.7/100 2026-03-03 20:55 View Run
Word to IPA
A benchmark to evaluate a model's ability to convert words from multiple languages to their IPA (International Phonetic Alphabet) pronunciation.
No questions — generate first
0 0 - Never View Generate
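
Example: Python GCD With Validation

The listing only says this suite asks for a Python 3.12 GCD function that raises exceptions on invalid input. As a rough illustration of the kind of answer it might expect, here is a minimal sketch. The function name gcd_validated, the choice of TypeError and ValueError, and the decision to reject negative numbers and the (0, 0) case are all assumptions made for this example, not details taken from the benchmark itself.

```python
def gcd_validated(a: int, b: int) -> int:
    """Return the greatest common divisor of two non-negative integers.

    Hypothetical validation rules (assumed, not from the benchmark):
    - non-integer arguments (including bool) raise TypeError
    - negative arguments raise ValueError
    - gcd(0, 0) raises ValueError
    """
    for value in (a, b):
        # bool is a subclass of int, so it has to be rejected explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"expected int, got {type(value).__name__}")
        if value < 0:
            raise ValueError("arguments must be non-negative")
    if a == 0 and b == 0:
        raise ValueError("gcd(0, 0) is treated as undefined in this sketch")

    # Euclidean algorithm: replace (a, b) with (b, a mod b) until b is 0
    while b:
        a, b = b, a % b
    return a
```

With these assumed rules, gcd_validated(12, 18) returns 6, while gcd_validated(-1, 5) raises ValueError and gcd_validated(2.5, 5) raises TypeError. A real grader for the suite may well expect different exception types or a different treatment of zero and negative inputs.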