# Benchmarks

## Available benchmark suites
| Name | Questions | Runs | Avg Score | Last Run | Actions |
|---|---|---|---|---|---|
| **Algebra**: Evaluates solving linear and quadratic equations with integer solutions, covering single-variable linear equations (ax + b = c) and quadratics with one or two integer roots. | 920 | 920 | 58.6/100 | 2026-03-03 21:04 | View Run |
| **Antonym Identification**: Evaluates identifying the antonym of a word. | 920 | 920 | 87.6/100 | 2026-03-03 20:59 | View Run |
| **Book Author Match**: Evaluates matching famous books to their correct authors. | 342 | 342 | 72.8/100 | 2026-03-03 21:39 | View Run |
| **Definitions**: Evaluates identifying the correct definition of words. | 720 | 720 | 79.3/100 | 2026-03-03 21:18 | View Run |
| **English Plural Generation**: Evaluates producing the correct plural form of English nouns, covering regular, -es, -ies, -ves, irregular, invariant, and Latin/Greek pluralization rules. | 800 | 800 | 91.0/100 | 2026-03-03 21:20 | View Run |
| **Food Category Classification**: Evaluates classifying food items by category. | 400 | 400 | 73.5/100 | 2026-03-03 21:42 | View Run |
| **Fractions and Percentages**: Evaluates calculating percentages and fractions, including percent-of, fraction-of, and percent-change problems. | 920 | 920 | 86.3/100 | 2026-03-03 21:03 | View Run |
| **Geography Knowledge**: Evaluates knowledge of world geography via multiple-choice questions about countries, capitals, physical features, and other geographical facts. | 800 | 800 | 81.0/100 | 2026-03-03 21:28 | View Run |
| **Geometry**: Evaluates calculating area, perimeter, and volume for standard shapes: rectangles, triangles, rectangular boxes, and circles (using π ≈ 3.14159). | 920 | 920 | 74.3/100 | 2026-03-03 21:05 | View Run |
| **Historical Event Year**: Evaluates selecting the correct year for major historical events. | 360 | 360 | 65.2/100 | 2026-03-03 21:44 | View Run |
| **Lemma Identification**: Evaluates identifying the lemma (dictionary base form) of a word: for nouns, the singular (e.g., "cats" → "cat"); for verbs, the infinitive without "to" (e.g., "running" → "run"); for adjectives, the positive form (e.g., "better" → "good"). *No questions yet; generate first.* | 0 | 0 | - | Never | View Generate |
| **Letter Count**: Evaluates counting how many times a specific letter appears in a word. | 920 | 920 | 35.9/100 | 2026-03-03 20:55 | View Run |
| **Math Word Problems**: Evaluates reading math word problems and extracting the relevant numbers to compute the correct answer; roughly one third of the questions contain distractor or unused information. | 920 | 920 | 85.6/100 | 2026-03-03 21:02 | View Run |
| **Multilingual Synonym Generation**: Evaluates generating noun synonyms in multiple languages. | 1196 | 1196 | 73.6/100 | 2026-03-03 21:00 | View Run |
| **Part of Speech**: Evaluates identifying the part of speech of a specific word in a sentence. | 800 | 800 | 88.6/100 | 2026-03-03 21:19 | View Run |
| **Pinyin Letter Count**: Evaluates counting how many times a specific letter appears in the Pinyin representation of a Chinese sentence. | 460 | 460 | 22.2/100 | 2026-03-03 21:00 | View Run |
| **Python GCD With Validation**: Write a Python 3.12 function for GCD that raises exceptions on invalid input. | 19 | 19 | 81.1/100 | 2026-03-03 19:32 | View Run |
| **Python Hello World Function**: Write a Python 3.12 function that prints "Hello world". | 19 | 19 | 89.5/100 | 2026-03-03 21:44 | View Run |
| **Python Letter Count in String**: Count occurrences of a target letter in a string. | 19 | 19 | 78.9/100 | 2026-03-03 19:32 | View Run |
| **Python Minimum Coin Change**: Compute the minimum number of coins needed to make a target amount. | 19 | 19 | 69.4/100 | 2026-03-03 19:32 | View Run |
| **Python Prime Factorization**: Return the prime factorization of a positive integer. | 19 | 19 | 66.3/100 | 2026-03-03 19:32 | View Run |
| **Sentence Decomposition**: Evaluates producing multilingual token-level sentence decompositions with grammatical metadata. *No questions yet; generate first.* | 0 | 0 | - | Never | View Generate |
| **Simple Arithmetic**: Evaluates basic arithmetic: addition, subtraction, multiplication, and division. | 920 | 920 | 93.2/100 | 2026-03-03 21:01 | View Run |
| **Spell Check**: Evaluates identifying misspelled words in a sentence and providing their correct spelling. | 920 | 920 | 81.0/100 | 2026-03-03 20:58 | View Run |
| **Syllable Count**: Evaluates counting syllables in words across Latin-alphabet languages. | 920 | 920 | 50.9/100 | 2026-03-03 20:57 | View Run |
| **Syllogism Validity**: Evaluates determining whether short categorical syllogisms are logically valid. | 304 | 304 | 61.2/100 | 2026-03-03 21:37 | View Run |
| **Time Arithmetic**: Evaluates adding and subtracting durations from clock times in 24-hour HH:MM format. | 920 | 920 | 53.7/100 | 2026-03-03 21:04 | View Run |
| **Translation en_de**: Evaluates translating words from English to German. | 920 | 920 | 78.0/100 | 2026-03-03 21:23 | View Run |
| **Translation en_es**: Evaluates translating words from English to Spanish. | 920 | 920 | 79.3/100 | 2026-03-03 21:22 | View Run |
| **Translation en_fr**: Evaluates translating words from English to French. | 920 | 920 | 77.0/100 | 2026-03-03 21:21 | View Run |
| **Translation en_ja**: Evaluates translating words from English to Japanese. | 920 | 920 | 74.5/100 | 2026-03-03 21:25 | View Run |
| **Translation en_zh**: Evaluates translating words from English to Chinese. | 920 | 920 | 79.9/100 | 2026-03-03 21:24 | View Run |
| **Translation fr_es**: Evaluates translating words from French to Spanish. | 920 | 920 | 70.6/100 | 2026-03-03 21:23 | View Run |
| **Translation fr_ko**: Evaluates translating words from French to Korean. | 920 | 920 | 70.4/100 | 2026-03-03 21:26 | View Run |
| **Translation it_lt**: Evaluates translating words from Italian to Lithuanian. | 920 | 920 | 61.0/100 | 2026-03-03 21:27 | View Run |
| **Translation ja_lt**: Evaluates translating words from Japanese to Lithuanian. | 920 | 920 | 67.6/100 | 2026-03-03 21:27 | View Run |
| **Unit Conversion**: Evaluates accurately converting between different units of measurement. | 920 | 920 | 70.9/100 | 2026-03-03 21:02 | View Run |
| **Validate Definition (lokys)**: Regression benchmark for the lokys agent's `validate_definition()` function; tests whether the LLM correctly distinguishes well-formed from problematic word definitions (e.g., circular definitions, or translations used as definitions). *No questions yet; generate first.* | 0 | 0 | - | Never | View Generate |
| **Validate Lemma Form (lokys)**: Regression benchmark for the lokys agent's `validate_lemma_form()` function; tests whether the LLM correctly identifies whether a word is in its base/lemma form and suggests the correct form when it is not. *No questions yet; generate first.* | 0 | 0 | - | Never | View Generate |
| **Validate Translation (voras)**: Regression benchmark for the voras agent's `validate_all_translations_for_word()` function; tests whether the LLM correctly flags semantically incorrect or non-lemma translations across multiple target languages. *No questions yet; generate first.* | 0 | 0 | - | Never | View Generate |
| **Verb Forms**: Evaluates generating full verb-form paradigms across persons and tenses in multiple languages. *No questions yet; generate first.* | 0 | 0 | - | Never | View Generate |
| **Vowel Count**: Evaluates counting vowels (a, e, i, o, u and accented forms) in a word across Latin-alphabet languages. | 920 | 920 | 41.7/100 | 2026-03-03 20:56 | View Run |
| **Word Length**: Evaluates counting the total number of letters in a given word. | 920 | 920 | 59.7/100 | 2026-03-03 20:55 | View Run |
| **Word to IPA**: Evaluates converting words from multiple languages to their IPA (International Phonetic Alphabet) pronunciation. *No questions yet; generate first.* | 0 | 0 | - | Never | View Generate |
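As an illustration of the Python coding benchmarks above, a plausible solution to the "Python GCD With Validation" task might look like the sketch below. The exact validation rules the benchmark grades against (which types to reject, whether zero or negative arguments are allowed) are assumptions here, not part of the task description.

```python
def gcd(a: int, b: int) -> int:
    """Greatest common divisor via the Euclidean algorithm.

    Raises TypeError for non-integer inputs and ValueError for
    non-positive arguments (one plausible reading of "invalid input").
    """
    for name, value in (("a", a), ("b", b)):
        # bool is a subclass of int, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            raise TypeError(f"{name} must be an int, got {type(value).__name__}")
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
    while b:
        a, b = b, a % b
    return a
```

For example, `gcd(12, 18)` returns `6`, while `gcd(-4, 6)` raises `ValueError`.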
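The "Python Minimum Coin Change" task in the table is the classic bottom-up dynamic-programming exercise. A sketch of one possible solution follows; the function name and the convention of returning -1 for unreachable amounts are assumptions, not mandated by the benchmark.

```python
def min_coins(coins: list[int], target: int) -> int:
    """Minimum number of coins (with repetition) summing to target.

    Fills best[amount] for amount = 0..target; returns -1 if no
    combination of the given denominations reaches the target.
    """
    INF = float("inf")
    best = [0] + [INF] * target
    for amount in range(1, target + 1):
        for coin in coins:
            if coin <= amount and best[amount - coin] + 1 < best[amount]:
                best[amount] = best[amount - coin] + 1
    return best[target] if best[target] != INF else -1
```

For instance, `min_coins([1, 5, 10, 25], 63)` returns `6` (two 25s, one 10, three 1s), and `min_coins([2], 3)` returns `-1`.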
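Likewise, the "Python Prime Factorization" task admits a short trial-division solution; a sketch follows, where treating 1 as having an empty factorization is an assumption about the benchmark's grading.

```python
def prime_factors(n: int) -> list[int]:
    """Prime factorization of a positive integer, with multiplicity,
    in ascending order, via trial division up to sqrt(n)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:  # divide out each prime factor completely
            factors.append(d)
            n //= d
        d += 1
    if n > 1:  # whatever remains is itself prime
        factors.append(n)
    return factors
```

For example, `prime_factors(60)` returns `[2, 2, 3, 5]`.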