Benchmarks

Available benchmark suites

15 benchmarks are registered in code but not yet in the database. Use "Sync" to add them, then use "Generate" on each to create its questions.

0031_definitions — Definitions
0032_part_of_speech — Part of Speech
0033_plural — English Plural Generation
0061_word_to_ipa — Word to IPA
0062_sentence_decomposition — Sentence Decomposition
0121_verb_forms — Verb Forms
0122_lemma — Lemma Identification
0130_validate_lemma_form — Validate Lemma Form (lokys)
0131_validate_definition — Validate Definition (lokys)
0132_validate_translation — Validate Translation (voras)
0151_geography — Geography Knowledge
0152_syllogism_validity — Syllogism Validity
0153_book_author_match — Book Author Match
0154_food_category_classification — Food Category Classification
0155_historical_event_year — Historical Event Year

Name	Description	Questions	Runs	Avg Score	Last Run
Algebra	A benchmark to evaluate a model's ability to solve linear and quadratic equations with integer solutions. Includes single-variable linear equations (ax + b = c) and quadratic equations with one or two integer roots.	720	720	54.2/100	2026-02-28 19:19
Antonym Identification	A benchmark to evaluate a model's ability to identify the antonym of a word.	760	760	85.0/100	2026-02-28 19:42
Fractions and Percentages	A benchmark to evaluate a model's ability to calculate percentages and fractions, including percent-of, fraction-of, and percent change problems.	720	720	82.5/100	2026-02-28 19:19
Geometry	A benchmark to evaluate a model's ability to calculate area, perimeter, and volume for standard shapes: rectangles, triangles, rectangular boxes, and circles (using π ≈ 3.14159).	720	720	70.1/100	2026-02-28 19:20
Letter Count	A benchmark to evaluate a model's ability to count how many times a specific letter appears in a word.	760	760	35.4/100	2026-02-28 19:17
Math Word Problems	A benchmark to evaluate a model's ability to read math word problems and extract the relevant numbers to compute the correct answer. Approximately one third of questions contain distractor/unused information.	720	720	83.4/100	2026-02-28 19:19
Multilingual Synonym Generation	A benchmark to evaluate a model's ability to generate noun synonyms in multiple languages.	988	988	69.7/100	2026-02-28 19:42
Pinyin Letter Count	A benchmark to evaluate a model's ability to count how many times a specific letter appears in the Pinyin representation of a Chinese sentence.	380	380	23.4/100	2026-02-28 19:43
Simple Arithmetic	A benchmark to evaluate a model's ability to perform basic arithmetic: addition, subtraction, multiplication, and division.	760	760	91.9/100	2026-02-28 19:43
Spell Check	A benchmark to evaluate a model's ability to identify misspelled words in a sentence and provide their correct spelling.	760	760	78.9/100	2026-02-28 19:41
Syllable Count	Tests ability to count syllables in words across Latin-alphabet languages.	760	760	50.1/100	2026-02-28 19:41
Time Arithmetic	A benchmark to evaluate a model's ability to add and subtract durations from clock times in 24-hour HH:MM format.	720	720	47.5/100	2026-02-28 19:20
Translation en_de	A benchmark to evaluate a model's ability to translate words from one language to another.	720	720	73.6/100	2026-02-28 19:33
Translation en_es	A benchmark to evaluate a model's ability to translate words from one language to another.	720	720	75.2/100	2026-02-28 19:32
Translation en_fr	A benchmark to evaluate a model's ability to translate words from one language to another.	720	720	72.2/100	2026-02-28 19:31
Translation en_ja	A benchmark to evaluate a model's ability to translate words from one language to another.	720	720	69.3/100	2026-02-28 19:35
Translation en_zh	A benchmark to evaluate a model's ability to translate words from one language to another.	720	720	74.7/100	2026-02-28 19:34
Translation fr_es	A benchmark to evaluate a model's ability to translate words from one language to another.	720	720	65.8/100	2026-02-28 19:33
Translation fr_ko	A benchmark to evaluate a model's ability to translate words from one language to another.	720	720	64.1/100	2026-02-28 19:36
Translation it_lt	A benchmark to evaluate a model's ability to translate words from one language to another.	720	720	54.5/100	2026-02-28 19:36
Translation ja_lt	A benchmark to evaluate a model's ability to translate words from one language to another.	720	720	60.6/100	2026-02-28 19:37
Unit Conversion	A benchmark to evaluate a model's ability to accurately convert between different units of measurement.	720	720	65.3/100	2026-02-28 19:19
Vowel Count	Tests ability to count vowels (a, e, i, o, u and accented forms) in a word across Latin-alphabet languages.	760	760	42.4/100	2026-02-28 19:40
Word Length	A benchmark to evaluate a model's ability to count the total number of letters in a given word.	760	760	53.8/100	2026-02-28 19:16
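
For orientation, here is a minimal sketch of how reference answers for two of the lower-scoring tasks (Letter Count and Time Arithmetic) might be computed. The function names, the case-insensitive matching, and the wrap-around at midnight are all assumptions made for illustration; the registered benchmarks may define these details differently.

```python
def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a single letter in a word.

    Assumption: matching is case-insensitive.
    """
    return word.lower().count(letter.lower())


def add_minutes(clock: str, minutes: int) -> str:
    """Add minutes (negative to subtract) to a 24-hour HH:MM clock time.

    Assumption: results wrap around midnight rather than carrying a day.
    """
    hours, mins = map(int, clock.split(":"))
    total = (hours * 60 + mins + minutes) % (24 * 60)
    return f"{total // 60:02d}:{total % 60:02d}"


if __name__ == "__main__":
    print(count_letter("strawberry", "r"))
    print(add_minutes("23:45", 30))
    print(add_minutes("00:10", -20))
```

Both tasks are deterministic, so answers of this kind can be checked by exact match against the model's output.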