Benchmarks
Available benchmark suites
15 benchmarks are registered in code but not yet in the database.
Click "Sync" above to add them, then use "Generate" on each to create questions.
- 0031_definitions: Definitions
- 0032_part_of_speech: Part of Speech
- 0033_plural: English Plural Generation
- 0061_word_to_ipa: Word to IPA
- 0062_sentence_decomposition: Sentence Decomposition
- 0121_verb_forms: Verb Forms
- 0122_lemma: Lemma Identification
- 0130_validate_lemma_form: Validate Lemma Form (lokys)
- 0131_validate_definition: Validate Definition (lokys)
- 0132_validate_translation: Validate Translation (voras)
- 0151_geography: Geography Knowledge
- 0152_syllogism_validity: Syllogism Validity
- 0153_book_author_match: Book Author Match
- 0154_food_category_classification: Food Category Classification
- 0155_historical_event_year: Historical Event Year
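The notice above implies a comparison between the benchmarks registered in code and those already stored in the database. A minimal sketch of that comparison follows. The `find_unsynced` helper and the plain dict and set standing in for the registry and the database are assumptions for illustration; the page does not show how the real sync is implemented.

```python
def find_unsynced(registered, in_database):
    """Return (id, title) pairs that are registered in code but not in the database.

    Both arguments are hypothetical stand-ins: `registered` maps benchmark ids
    to display titles, and `in_database` is a set of ids already stored.
    """
    return sorted(
        (bench_id, title)
        for bench_id, title in registered.items()
        if bench_id not in in_database
    )


# Three ids taken from the list above; the database contents are made up.
registered = {
    "0031_definitions": "Definitions",
    "0032_part_of_speech": "Part of Speech",
    "0033_plural": "English Plural Generation",
}
in_database = {"0032_part_of_speech"}

for bench_id, title in find_unsynced(registered, in_database):
    print(f"{bench_id}: {title}")
```

Sorting by id keeps the listing stable between page loads, which matches the numeric-prefix ordering shown above.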
| Name | Description | Questions | Runs | Avg Score | Last Run | Actions |
|---|---|---|---|---|---|---|
| Algebra | A benchmark to evaluate a model's ability to solve linear and quadratic equations with integer solutions. Includes single-variable linear equations (ax + b = c) and quadratic equations with one or two integer roots. | 720 | 720 | 54.2/100 | 2026-02-28 19:19 | View Run |
| Antonym Identification | A benchmark to evaluate a model's ability to identify the antonym of a word. | 760 | 760 | 85.0/100 | 2026-02-28 19:42 | View Run |
| Fractions and Percentages | A benchmark to evaluate a model's ability to calculate percentages and fractions, including percent-of, fraction-of, and percent change problems. | 720 | 720 | 82.5/100 | 2026-02-28 19:19 | View Run |
| Geometry | A benchmark to evaluate a model's ability to calculate area, perimeter, and volume for standard shapes: rectangles, triangles, rectangular boxes, and circles (using π ≈ 3.14159). | 720 | 720 | 70.1/100 | 2026-02-28 19:20 | View Run |
| Letter Count | A benchmark to evaluate a model's ability to count how many times a specific letter appears in a word. | 760 | 760 | 35.4/100 | 2026-02-28 19:17 | View Run |
| Math Word Problems | A benchmark to evaluate a model's ability to read math word problems and extract the relevant numbers to compute the correct answer. Approximately one third of questions contain distractor/unused information. | 720 | 720 | 83.4/100 | 2026-02-28 19:19 | View Run |
| Multilingual Synonym Generation | A benchmark to evaluate a model's ability to generate noun synonyms in multiple languages. | 988 | 988 | 69.7/100 | 2026-02-28 19:42 | View Run |
| Pinyin Letter Count | A benchmark to evaluate a model's ability to count how many times a specific letter appears in the Pinyin representation of a Chinese sentence. | 380 | 380 | 23.4/100 | 2026-02-28 19:43 | View Run |
| Simple Arithmetic | A benchmark to evaluate a model's ability to perform basic arithmetic: addition, subtraction, multiplication, and division. | 760 | 760 | 91.9/100 | 2026-02-28 19:43 | View Run |
| Spell Check | A benchmark to evaluate a model's ability to identify misspelled words in a sentence and provide their correct spelling. | 760 | 760 | 78.9/100 | 2026-02-28 19:41 | View Run |
| Syllable Count | Tests ability to count syllables in words across Latin-alphabet languages. | 760 | 760 | 50.1/100 | 2026-02-28 19:41 | View Run |
| Time Arithmetic | A benchmark to evaluate a model's ability to add and subtract durations from clock times in 24-hour HH:MM format. | 720 | 720 | 47.5/100 | 2026-02-28 19:20 | View Run |
| Translation en_de | A benchmark to evaluate a model's ability to translate words from one language to another. | 720 | 720 | 73.6/100 | 2026-02-28 19:33 | View Run |
| Translation en_es | A benchmark to evaluate a model's ability to translate words from one language to another. | 720 | 720 | 75.2/100 | 2026-02-28 19:32 | View Run |
| Translation en_fr | A benchmark to evaluate a model's ability to translate words from one language to another. | 720 | 720 | 72.2/100 | 2026-02-28 19:31 | View Run |
| Translation en_ja | A benchmark to evaluate a model's ability to translate words from one language to another. | 720 | 720 | 69.3/100 | 2026-02-28 19:35 | View Run |
| Translation en_zh | A benchmark to evaluate a model's ability to translate words from one language to another. | 720 | 720 | 74.7/100 | 2026-02-28 19:34 | View Run |
| Translation fr_es | A benchmark to evaluate a model's ability to translate words from one language to another. | 720 | 720 | 65.8/100 | 2026-02-28 19:33 | View Run |
| Translation fr_ko | A benchmark to evaluate a model's ability to translate words from one language to another. | 720 | 720 | 64.1/100 | 2026-02-28 19:36 | View Run |
| Translation it_lt | A benchmark to evaluate a model's ability to translate words from one language to another. | 720 | 720 | 54.5/100 | 2026-02-28 19:36 | View Run |
| Translation ja_lt | A benchmark to evaluate a model's ability to translate words from one language to another. | 720 | 720 | 60.6/100 | 2026-02-28 19:37 | View Run |
| Unit Conversion | A benchmark to evaluate a model's ability to accurately convert between different units of measurement. | 720 | 720 | 65.3/100 | 2026-02-28 19:19 | View Run |
| Vowel Count | Tests ability to count vowels (a, e, i, o, u and accented forms) in a word across Latin-alphabet languages. | 760 | 760 | 42.4/100 | 2026-02-28 19:40 | View Run |
| Word Length | A benchmark to evaluate a model's ability to count the total number of letters in a given word. | 760 | 760 | 53.8/100 | 2026-02-28 19:16 | View Run |
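Several of the benchmarks above have answers that can be computed deterministically, which is what makes automatic scoring possible. The sketch below shows what reference-answer functions for four of them might look like. All function names are hypothetical, and the case-insensitive matching and the wrap past midnight are assumptions; the page does not describe how the real benchmarks handle either. Only the value of π is taken from the Geometry description.

```python
PI = 3.14159  # approximation stated in the Geometry benchmark description


def letter_count(word, letter):
    """Count occurrences of `letter` in `word`, ignoring case (an assumption)."""
    return word.lower().count(letter.lower())


def shift_time(clock, minutes):
    """Add minutes to a 24-hour HH:MM time; pass a negative value to subtract.

    Wrapping past midnight is an assumption made for this sketch.
    """
    hours, mins = map(int, clock.split(":"))
    total = (hours * 60 + mins + minutes) % (24 * 60)
    return f"{total // 60:02d}:{total % 60:02d}"


def percent_change(old, new):
    """Percent change from `old` to `new`, as in the Fractions and Percentages suite."""
    return (new - old) / old * 100


def circle_area(radius):
    """Area of a circle using the fixed approximation of pi above."""
    return PI * radius ** 2


print(letter_count("strawberry", "r"))  # 3
print(shift_time("23:30", 45))          # 00:15
print(shift_time("00:10", -25))         # 23:45
print(percent_change(80, 100))          # 25.0
```

Fixing π to a stated approximation, as the Geometry description does, means a numeric answer can be compared against a single expected value without having to guess which precision the model used.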