Benchmarks

Available benchmark suites

Algebra
A benchmark to evaluate a model's ability to solve linear and quadratic equations with integer solutions. Includes single-variable linear equations (ax + b = c) and quadratic equations with one or two integer roots.
Questions: 720 | Runs: 720 | Avg Score: 54.2/100 | Last Run: 2026-02-28 19:19

Antonym Identification
A benchmark to evaluate a model's ability to identify the antonym of a word.
Questions: 760 | Runs: 760 | Avg Score: 85.0/100 | Last Run: 2026-02-28 19:42

Fractions and Percentages
A benchmark to evaluate a model's ability to calculate percentages and fractions, including percent-of, fraction-of, and percent change problems.
Questions: 720 | Runs: 720 | Avg Score: 82.5/100 | Last Run: 2026-02-28 19:19

Geometry
A benchmark to evaluate a model's ability to calculate area, perimeter, and volume for standard shapes: rectangles, triangles, rectangular boxes, and circles (using π ≈ 3.14159).
Questions: 720 | Runs: 720 | Avg Score: 70.1/100 | Last Run: 2026-02-28 19:20

Letter Count
A benchmark to evaluate a model's ability to count how many times a specific letter appears in a word.
Questions: 760 | Runs: 760 | Avg Score: 35.4/100 | Last Run: 2026-02-28 19:17

Math Word Problems
A benchmark to evaluate a model's ability to read math word problems and extract the relevant numbers to compute the correct answer. Approximately one third of questions contain distractor/unused information.
Questions: 720 | Runs: 720 | Avg Score: 83.4/100 | Last Run: 2026-02-28 19:19

Multilingual Synonym Generation
A benchmark to evaluate a model's ability to generate noun synonyms in multiple languages.
Questions: 988 | Runs: 988 | Avg Score: 69.7/100 | Last Run: 2026-02-28 19:42

Pinyin Letter Count
A benchmark to evaluate a model's ability to count how many times a specific letter appears in the Pinyin representation of a Chinese sentence.
Questions: 380 | Runs: 380 | Avg Score: 23.4/100 | Last Run: 2026-02-28 19:43

Simple Arithmetic
A benchmark to evaluate a model's ability to perform basic arithmetic: addition, subtraction, multiplication, and division.
Questions: 760 | Runs: 760 | Avg Score: 91.9/100 | Last Run: 2026-02-28 19:43

Spell Check
A benchmark to evaluate a model's ability to identify misspelled words in a sentence and provide their correct spelling.
Questions: 760 | Runs: 760 | Avg Score: 78.9/100 | Last Run: 2026-02-28 19:41

Syllable Count
Tests ability to count syllables in words across Latin-alphabet languages.
Questions: 760 | Runs: 760 | Avg Score: 50.1/100 | Last Run: 2026-02-28 19:41

Time Arithmetic
A benchmark to evaluate a model's ability to add and subtract durations from clock times in 24-hour HH:MM format.
Questions: 720 | Runs: 720 | Avg Score: 47.5/100 | Last Run: 2026-02-28 19:20

Translation en_de
A benchmark to evaluate a model's ability to translate words from one language to another.
Questions: 720 | Runs: 720 | Avg Score: 73.6/100 | Last Run: 2026-02-28 19:33

Translation en_es
A benchmark to evaluate a model's ability to translate words from one language to another.
Questions: 720 | Runs: 720 | Avg Score: 75.2/100 | Last Run: 2026-02-28 19:32

Translation en_fr
A benchmark to evaluate a model's ability to translate words from one language to another.
Questions: 720 | Runs: 720 | Avg Score: 72.2/100 | Last Run: 2026-02-28 19:31

Translation en_ja
A benchmark to evaluate a model's ability to translate words from one language to another.
Questions: 720 | Runs: 720 | Avg Score: 69.3/100 | Last Run: 2026-02-28 19:35

Translation en_zh
A benchmark to evaluate a model's ability to translate words from one language to another.
Questions: 720 | Runs: 720 | Avg Score: 74.7/100 | Last Run: 2026-02-28 19:34

Translation fr_es
A benchmark to evaluate a model's ability to translate words from one language to another.
Questions: 720 | Runs: 720 | Avg Score: 65.8/100 | Last Run: 2026-02-28 19:33

Translation fr_ko
A benchmark to evaluate a model's ability to translate words from one language to another.
Questions: 720 | Runs: 720 | Avg Score: 64.1/100 | Last Run: 2026-02-28 19:36

Translation it_lt
A benchmark to evaluate a model's ability to translate words from one language to another.
Questions: 720 | Runs: 720 | Avg Score: 54.5/100 | Last Run: 2026-02-28 19:36

Translation ja_lt
A benchmark to evaluate a model's ability to translate words from one language to another.
Questions: 720 | Runs: 720 | Avg Score: 60.6/100 | Last Run: 2026-02-28 19:37

Unit Conversion
A benchmark to evaluate a model's ability to accurately convert between different units of measurement.
Questions: 720 | Runs: 720 | Avg Score: 65.3/100 | Last Run: 2026-02-28 19:19

Vowel Count
Tests ability to count vowels (a, e, i, o, u and accented forms) in a word across Latin-alphabet languages.
Questions: 760 | Runs: 760 | Avg Score: 42.4/100 | Last Run: 2026-02-28 19:40

Word Length
A benchmark to evaluate a model's ability to count the total number of letters in a given word.
Questions: 760 | Runs: 760 | Avg Score: 53.8/100 | Last Run: 2026-02-28 19:16
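
Because the suites differ in size (380 to 988 questions), a plain average of the Avg Score column is not the same as an overall per-question score. The sketch below shows one way to combine rows, assuming each Avg Score is the mean per-question score for that suite; the listing itself does not state how the score is computed. The names SuiteResult, weighted_mean_score, and lowest_scoring are hypothetical helpers written for this illustration, and the three sample rows are copied from the listing above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SuiteResult:
    """One row of the listing (hypothetical container, not part of the tool)."""
    name: str
    questions: int
    avg_score: float  # score out of 100, as shown in the Avg Score column


def weighted_mean_score(results: List[SuiteResult]) -> float:
    """Mean score weighted by question count.

    Assumes avg_score is a per-question mean, so larger suites count more.
    """
    total_questions = sum(r.questions for r in results)
    if total_questions == 0:
        raise ValueError("no questions to average over")
    return sum(r.questions * r.avg_score for r in results) / total_questions


def lowest_scoring(results: List[SuiteResult], n: int = 3) -> List[SuiteResult]:
    """Return the n suites with the lowest average score, lowest first."""
    return sorted(results, key=lambda r: r.avg_score)[:n]


# Three rows copied from the listing; extend with the remaining suites as needed.
SAMPLE = [
    SuiteResult("Letter Count", 760, 35.4),
    SuiteResult("Pinyin Letter Count", 380, 23.4),
    SuiteResult("Simple Arithmetic", 760, 91.9),
]

if __name__ == "__main__":
    print(f"weighted mean: {weighted_mean_score(SAMPLE):.1f}")
    for suite in lowest_scoring(SAMPLE, n=2):
        print(f"{suite.name}: {suite.avg_score}/100")
```

For these three rows the weighted mean sits above the unweighted one, because Pinyin Letter Count has half as many questions as the other two and therefore pulls the total down less.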