Benchmark Dashboard

Last updated: June 06, 2026 at 02:27

Tier legend
T1: Tier 1 (Screening)
T2: Tier 2 (Core)
T3: Tier 3 (Advanced)
N/A: Excluded by model policy.
Dashboard defaults to remote + word processing. Latency is shown as the median over non-excluded questions. Incorrect answers that were suspiciously slow are counted as latency outliers and left out of that median.
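The latency rule above can be sketched roughly as follows. This is a minimal illustration, not the dashboard's actual code: the `Result` record, the `latency_summary` function, and the `slow_factor` threshold (an incorrect answer counts as "suspiciously slow" when it takes more than `slow_factor` times the median of all non-excluded questions) are assumptions made for the example. The dashboard does not state how it detects outliers.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Result:
    """One benchmark question result (hypothetical record layout)."""
    latency_ms: float
    correct: bool
    excluded: bool = False  # question excluded, e.g. by model policy


def latency_summary(results, slow_factor=10.0):
    """Return (median_latency_ms, outlier_count) for one model/benchmark cell.

    Excluded questions are dropped first. An incorrect answer slower than
    slow_factor times the median of the remaining questions is treated as a
    latency outlier and ignored. The threshold rule is an assumption made
    for illustration only.
    """
    kept = [r for r in results if not r.excluded]
    if not kept:
        return None, 0

    # Baseline median over every non-excluded question.
    baseline = median(r.latency_ms for r in kept)
    threshold = slow_factor * baseline

    def is_outlier(r):
        return (not r.correct) and r.latency_ms > threshold

    outlier_count = sum(1 for r in kept if is_outlier(r))
    used = [r for r in kept if not is_outlier(r)]
    return median(r.latency_ms for r in used), outlier_count


if __name__ == "__main__":
    sample = [
        Result(100.0, True),
        Result(200.0, True),
        Result(300.0, True),
        Result(5000.0, False),                # slow and incorrect
        Result(150.0, True, excluded=True),   # excluded question
    ]
    print(latency_summary(sample))
```

Under these assumptions a cell would report the returned median next to a note such as "1 latency outlier" whenever the outlier count is non-zero.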
Benchmark
Claude Haiku 4.5
GPT-5 mini
GPT-5 nano
GPT-5.4 mini
GPT-5.4 nano
Gemma 2 2B (LMStudio), 1500 MB
Gemma 2 9B (LMStudio), 5800 MB
Gemma 2B (LMStudio), 1500 MB
Gemma 3 12B (LMStudio), 8100 MB
Gemma 4 12B (LMStudio), 7560 MB
Gemma 4 E4B IT (LMStudio), 3200 MB
Granite 3.2 8B (LMStudio), 4900 MB
Llama 2 7B (LMStudio), 4900 MB
Llama 3 8B (LMStudio), 4900 MB
Llama 3.1 8B (LMStudio), 4900 MB
Llama 3.2 1B (LMStudio), 1300 MB
Ministral 8B (LMStudio), 4900 MB
OLMo 3 7B (LMStudio), 4300 MB
Phi-3.5 Mini (LMStudio), 2500 MB
Phi-4 (LMStudio), 9100 MB
Qwen3 1.7B (LMStudio), 1100 MB
Qwen3 4B (LMStudio), 2800 MB
Qwen3 VL 8B (LMStudio), 5000 MB
Qwen3.5 2B (LMStudio), 2700 MB
Qwen3.5 4B (LMStudio), 3400 MB
Qwen3.5 9B (LMStudio), 6600 MB
SmolLM2 1.7B (LMStudio), 1100 MB
Word Length
0011_word_length T1
A benchmark to evaluate a model's ability to count the to...
100
714.0ms median
730µ$
100
1223.5ms median
80µ$
42
915.5ms median
16µ$
98
766.5ms median
50µ$
cost warning
100
654.5ms median
37µ$
18
640.0ms median
82µ$
1 latency outlier
60
725.0ms median
120µ$
1 latency outlier
20
292.0ms median
63µ$
1 latency outlier
85
1277.5ms median
169µ$
52
1378.0ms median
73µ$
90
623.0ms median
32µ$
22
779.0ms median
96µ$
1 latency outlier
28
592.0ms median
29µ$
95
565.0ms median
132µ$
1 latency outlier
75
623.0ms median
138µ$
1 latency outlier
28
188.0ms median
33µ$
1 latency outlier
45
564.0ms median
122µ$
1 latency outlier
82
521.0ms median
114µ$
1 latency outlier
15
332.0ms median
60µ$
1 latency outlier
100
938.0ms median
54µ$
72
219.0ms median
63µ$
1 latency outlier
57
423.5ms median
93µ$
48
2558.0ms median
213µ$
1 latency outlier
88
446.5ms median
21µ$
72
911.0ms median
46µ$
92
1207.5ms median
62µ$
35
198.5ms median
26µ$
Letter Count
0012_letter_count T1
A benchmark to evaluate a model's ability to count how ma...
78
691.0ms median
717µ$
48
1130.0ms median
77µ$
42
946.0ms median
15µ$
80
690.5ms median
43µ$
cost warning
28
714.5ms median
35µ$
12
642.5ms median
35µ$
30
599.5ms median
31µ$
20
288.0ms median
14µ$
52
1327.5ms median
70µ$
72
1403.0ms median
77µ$
35
649.0ms median
33µ$
52
652.5ms median
34µ$
15
540.5ms median
27µ$
15
568.0ms median
27µ$
35
605.5ms median
28µ$
42
192.0ms median
10µ$
32
545.0ms median
26µ$
45
442.0ms median
24µ$
12
350.0ms median
18µ$
48
1119.5ms median
53µ$
22
178.5ms median
9µ$
48
379.0ms median
19µ$
55
2536.0ms median
130µ$
20
445.0ms median
22µ$
38
896.0ms median
45µ$
50
1346.0ms median
65µ$
20
185.5ms median
10µ$
Vowel Count
0013_vowel_count T1
Tests ability to count vowels (a, e, i, o, u and accented forms) in a word acros...
80
572.0ms median
791µ$
88
1134.0ms median
97µ$
55
948.5ms median
19µ$
98
818.5ms median
44µ$
cost warning
50
691.5ms median
50µ$
40
694.0ms median
38µ$
38
969.5ms median
49µ$
28
363.5ms median
18µ$
42
1501.0ms median
80µ$
12
1664.0ms median
94µ$
52
869.0ms median
44µ$
32
796.0ms median
39µ$
28
723.5ms median
35µ$
38
695.0ms median
36µ$
42
731.5ms median
38µ$
28
238.0ms median
12µ$
48
721.0ms median
37µ$
68
635.5ms median
32µ$
0
430.5ms median
23µ$
85
1246.5ms median
63µ$
45
245.5ms median
12µ$
40
473.0ms median
24µ$
68
2987.0ms median
153µ$
15
581.0ms median
29µ$
25
1193.5ms median
60µ$
28
1972.0ms median
101µ$
8
244.0ms median
12µ$
Syllable Count
0014_syllable_count T1
Tests ability to count syllables in words across Latin-alphabet languages....
92
591.5ms median
738µ$
90
1364.0ms median
85µ$
62
1023.0ms median
17µ$
95
717.5ms median
47µ$
cost warning
60
667.5ms median
40µ$
30
715.5ms median
38µ$
70
1051.0ms median
52µ$
22
360.0ms median
18µ$
48
1623.0ms median
85µ$
28
1860.5ms median
107µ$
82
843.5ms median
41µ$
38
863.0ms median
45µ$
30
747.5ms median
37µ$
75
711.5ms median
36µ$
88
729.0ms median
36µ$
42
238.5ms median
12µ$
72
710.5ms median
35µ$
62
666.0ms median
33µ$
5
433.5ms median
22µ$
95
1361.0ms median
71µ$
25
256.0ms median
13µ$
62
497.5ms median
25µ$
28
2995.0ms median
152µ$
12
522.0ms median
27µ$
52
1042.0ms median
52µ$
60
1679.0ms median
81µ$
15
244.5ms median
12µ$
Spell Check
0015_spell_check T1
A benchmark to evaluate a model's ability to identify mis...
100
695.0ms median
796µ$
100
1227.5ms median
89µ$
95
984.5ms median
18µ$
100
817.5ms median
45µ$
cost warning
100
699.0ms median
42µ$
90
929.5ms median
50µ$
100
1430.5ms median
70µ$
55
507.0ms median
28µ$
100
2047.5ms median
106µ$
88
1782.0ms median
95µ$
82
810.5ms median
41µ$
95
1175.0ms median
59µ$
38
1096.0ms median
54µ$
98
947.5ms median
49µ$
92
960.5ms median
49µ$
40
337.0ms median
17µ$
90
1013.0ms median
51µ$
85
810.0ms median
43µ$
38
589.0ms median
29µ$
95
1856.0ms median
96µ$
75
290.5ms median
15µ$
88
620.0ms median
32µ$
100
3221.5ms median
166µ$
82
759.0ms median
39µ$
90
1395.0ms median
71µ$
95
2139.5ms median
107µ$
25
326.5ms median
16µ$
Antonym Identification
0016_antonym T1
A benchmark to evaluate a model's ability to identify the...
100
570.5ms median
723µ$
100
1141.0ms median
80µ$
100
984.5ms median
16µ$
100
768.5ms median
46µ$
cost warning
100
673.5ms median
37µ$
100
737.5ms median
37µ$
100
928.0ms median
47µ$
90
393.0ms median
20µ$
100
1702.5ms median
86µ$
95
1633.5ms median
90µ$
98
763.5ms median
38µ$
100
713.0ms median
37µ$
38
746.0ms median
39µ$
100
672.0ms median
33µ$
100
693.0ms median
35µ$
62
251.5ms median
13µ$
100
657.0ms median
33µ$
100
631.0ms median
32µ$
8
471.0ms median
25µ$
100
1248.5ms median
63µ$
92
217.0ms median
11µ$
98
437.0ms median
22µ$
100
2860.5ms median
146µ$
100
546.0ms median
27µ$
100
1053.5ms median
52µ$
100
1534.5ms median
77µ$
30
230.5ms median
12µ$
Multilingual Synonym Generation
0017_synonyms T1
A benchmark to evaluate a model's ability to generate noun synonyms ...
98
595.5ms median
739µ$
100
1124.5ms median
83µ$
96
853.5ms median
17µ$
100
724.5ms median
43µ$
cost warning
100
680.0ms median
39µ$
77
790.0ms median
42µ$
96
1050.0ms median
53µ$
31
390.0ms median
20µ$
98
1764.0ms median
92µ$
67
1783.0ms median
96µ$
94
776.0ms median
39µ$
90
1023.5ms median
55µ$
17
861.0ms median
49µ$
79
787.0ms median
44µ$
81
803.5ms median
44µ$
21
258.5ms median
13µ$
69
749.5ms median
38µ$
77
812.5ms median
42µ$
2
564.0ms median
31µ$
94
1707.5ms median
91µ$
88
247.5ms median
13µ$
96
513.5ms median
27µ$
100
3064.0ms median
158µ$
88
570.5ms median
28µ$
92
1078.5ms median
58µ$
94
1620.0ms median
85µ$
15
285.5ms median
15µ$
Pinyin Letter Count
0018_pinyin_letters T1
A benchmark to evaluate a model's ability to count how many times a s...
35
671.5ms median
879µ$
35
1135.0ms median
119µ$
15
1066.0ms median
24µ$
40
740.0ms median
37µ$
cost warning
20
682.5ms median
68µ$
30
692.5ms median
41µ$
30
810.0ms median
45µ$
60
349.0ms median
18µ$
50
1626.0ms median
92µ$
35
1446.0ms median
97µ$
15
1059.0ms median
55µ$
25
772.5ms median
39µ$
15
678.5ms median
37µ$
25
533.0ms median
30µ$
15
526.0ms median
29µ$
10
203.0ms median
11µ$
15
567.0ms median
31µ$
20
466.5ms median
27µ$
0
766.0ms median
41µ$
35
1092.0ms median
62µ$
5
220.5ms median
12µ$
5
414.0ms median
22µ$
20
3396.5ms median
174µ$
5
547.0ms median
30µ$
0
1344.0ms median
76µ$
25
2369.5ms median
117µ$
35
220.5ms median
12µ$
Simple Arithmetic
0021_simple_arithmetic T1
A benchmark to evaluate a model's ability to perform basic arithmeti...
100
602.0ms median
703µ$
100
1066.5ms median
73µ$
100
881.5ms median
15µ$
100
719.5ms median
38µ$
cost warning
100
691.5ms median
32µ$
98
649.0ms median
35µ$
100
663.5ms median
34µ$
98
310.5ms median
16µ$
100
1489.0ms median
78µ$
95
1511.0ms median
79µ$
100
620.5ms median
31µ$
100
706.0ms median
32µ$
57
493.0ms median
26µ$
100
563.0ms median
28µ$
90
607.0ms median
30µ$
100
187.0ms median
9µ$
100
635.5ms median
32µ$
100
433.0ms median
23µ$
10
355.0ms median
18µ$
100
1112.0ms median
51µ$
100
202.0ms median
10µ$
100
410.5ms median
21µ$
100
2982.5ms median
157µ$
100
444.0ms median
23µ$
100
838.5ms median
42µ$
98
1231.5ms median
59µ$
95
196.0ms median
10µ$
Unit Conversion
0022_unit_conversion T1
A benchmark to evaluate a model's ability to accurately convert ...
100
613.0ms median
712µ$
100
1172.0ms median
80µ$
72
901.5ms median
16µ$
100
729.0ms median
40µ$
cost warning
100
689.5ms median
35µ$
25
717.5ms median
39µ$
98
1108.0ms median
56µ$
12
346.0ms median
17µ$
95
1856.0ms median
97µ$
25
2211.0ms median
119µ$
92
732.0ms median
39µ$
88
878.5ms median
50µ$
18
881.5ms median
45µ$
80
634.5ms median
32µ$
85
633.5ms median
32µ$
22
227.5ms median
12µ$
85
834.0ms median
43µ$
98
645.0ms median
32µ$
5
427.5ms median
23µ$
98
1219.0ms median
64µ$
68
295.0ms median
16µ$
85
617.5ms median
33µ$
95
3234.5ms median
161µ$
75
543.5ms median
31µ$
90
1235.0ms median
64µ$
92
1915.0ms median
95µ$
50
219.5ms median
13µ$
Math Word Problems
0023_word_problems T1
A benchmark to evaluate a model's ability to read math word problems...
100
537.5ms median
728µ$
98
1082.0ms median
78µ$
95
933.0ms median
16µ$
100
692.0ms median
44µ$
cost warning
100
711.0ms median
37µ$
75
672.0ms median
34µ$
100
900.0ms median
45µ$
68
338.0ms median
17µ$
100
1466.5ms median
75µ$
45
1734.5ms median
97µ$
98
653.0ms median
34µ$
90
766.5ms median
39µ$
68
653.5ms median
33µ$
100
570.0ms median
31µ$
70
606.5ms median
30µ$
75
188.0ms median
10µ$
100
635.5ms median
33µ$
100
447.5ms median
24µ$
20
396.0ms median
21µ$
100
1086.5ms median
55µ$
92
217.5ms median
11µ$
92
422.0ms median
22µ$
100
2934.5ms median
152µ$
100
485.0ms median
25µ$
95
962.0ms median
49µ$
72
1678.0ms median
84µ$
62
202.0ms median
11µ$
Fractions and Percentages
0024_percentage_math T1
A benchmark to evaluate a model's ability to calculate percentages a...
100
611.5ms median
708µ$
100
1025.0ms median
74µ$
100
897.5ms median
15µ$
100
710.0ms median
121µ$
100
682.0ms median
33µ$
90
672.0ms median
34µ$
98
718.5ms median
37µ$
52
296.5ms median
15µ$
100
1444.5ms median
72µ$
72
1621.0ms median
97µ$
98
661.0ms median
32µ$
100
723.5ms median
37µ$
48
614.5ms median
32µ$
90
571.5ms median
28µ$
100
615.0ms median
31µ$
50
187.0ms median
10µ$
95
623.0ms median
31µ$
98
457.5ms median
24µ$
5
378.0ms median
20µ$
100
915.5ms median
51µ$
95
204.0ms median
10µ$
100
423.5ms median
21µ$
100
2957.0ms median
150µ$
100
485.0ms median
25µ$
100
938.5ms median
46µ$
100
1437.0ms median
71µ$
68
210.0ms median
11µ$
Algebra
0025_algebra T1
A benchmark to evaluate a model's ability to solve linear and quadra...
100
613.0ms median
721µ$
98
1029.0ms median
79µ$
80
918.0ms median
16µ$
100
721.0ms median
134µ$
100
701.0ms median
36µ$
42
649.5ms median
37µ$
72
976.5ms median
50µ$
30
322.0ms median
17µ$
85
1374.5ms median
77µ$
12
2798.0ms median
172µ$
1 latency outlier
75
707.0ms median
35µ$
45
715.0ms median
40µ$
10
620.5ms median
36µ$
48
588.5ms median
35µ$
55
739.5ms median
37µ$
42
215.0ms median
11µ$
48
612.0ms median
31µ$
62
594.5ms median
31µ$
2
425.0ms median
22µ$
90
918.5ms median
56µ$
60
204.0ms median
11µ$
65
401.0ms median
21µ$
100
2889.5ms median
151µ$
50
472.0ms median
27µ$
80
925.0ms median
49µ$
52
1503.5ms median
76µ$
35
197.0ms median
12µ$
Time Arithmetic
0026_time_arithmetic T1
A benchmark to evaluate a model's ability to add and subtract ...
100
612.0ms median
730µ$
100
1129.5ms median
82µ$
82
1021.0ms median
16µ$
100
712.5ms median
140µ$
100
702.0ms median
38µ$
35
830.5ms median
44µ$
70
1012.5ms median
51µ$
8
433.5ms median
22µ$
70
1704.0ms median
89µ$
70
1986.5ms median
101µ$
78
801.0ms median
38µ$
60
917.5ms median
47µ$
18
863.0ms median
43µ$
65
580.5ms median
31µ$
55
706.0ms median
36µ$
2
257.5ms median
16µ$
55
695.0ms median
35µ$
65
518.0ms median
27µ$
0
452.0ms median
23µ$
90
1339.0ms median
64µ$
28
249.5ms median
13µ$
65
487.0ms median
25µ$
75
3312.5ms median
169µ$
40
528.0ms median
28µ$
65
1153.5ms median
59µ$
85
1493.0ms median
77µ$
5
274.5ms median
13µ$
Geometry
0027_geometry T1
A benchmark to evaluate a model's ability to calculate area, perimet...
100
612.5ms median
714µ$
100
1072.5ms median
76µ$
98
919.5ms median
15µ$
100
715.0ms median
126µ$
100
702.5ms median
34µ$
68
666.0ms median
34µ$
95
767.0ms median
42µ$
30
314.5ms median
17µ$
95
1464.5ms median
77µ$
38
2005.5ms median
120µ$
95
685.0ms median
36µ$
72
731.5ms median
39µ$
20
608.0ms median
32µ$
82
571.5ms median
30µ$
60
621.5ms median
33µ$
40
197.5ms median
10µ$
82
654.5ms median
34µ$
95
467.0ms median
26µ$
5
376.5ms median
19µ$
100
1121.5ms median
55µ$
80
210.0ms median
11µ$
98
428.0ms median
23µ$
100
3050.5ms median
156µ$
60
496.0ms median
26µ$
88
976.5ms median
51µ$
100
1441.0ms median
74µ$
45
210.5ms median
11µ$
Definitions
0031_definitions T2
A benchmark to evaluate a model's ability to identify the...
100
511.0ms median
103µ$
100
1026.5ms median
63µ$
100
920.0ms median
13µ$
100
689.0ms median
96µ$
100
663.5ms median
26µ$
98
299.0ms median
16µ$
100
560.0ms median
29µ$
2
679.5ms median
33µ$
-
98
19474.0ms median
984µ$
100
419.5ms median
21µ$
100
457.0ms median
25µ$
10
646.0ms median
114µ$
1 latency outlier
98
396.5ms median
20µ$
98
398.5ms median
21µ$
60
86.0ms median
5µ$
100
395.0ms median
20µ$
100
372.5ms median
106µ$
0
599.5ms median
234µ$
2 latency outliers
100
850.0ms median
43µ$
- - -
100
229.5ms median
11µ$
100
19892.5ms median
989µ$
-
65
114.0ms median
6µ$
Part of Speech
0032_part_of_speech T2
A benchmark to evaluate a model's ability to identify the...
98
597.0ms median
798µ$
98
1125.5ms median
99µ$
98
919.5ms median
20µ$
98
712.5ms median
189µ$
98
734.0ms median
51µ$
92
835.0ms median
45µ$
98
1102.0ms median
55µ$
70
412.0ms median
21µ$
100
1830.5ms median
95µ$
98
1449.5ms median
76µ$
95
942.0ms median
46µ$
95
971.5ms median
49µ$
90
831.0ms median
42µ$
98
758.0ms median
38µ$
95
815.5ms median
40µ$
70
272.0ms median
14µ$
95
785.0ms median
40µ$
95
716.0ms median
36µ$
10
523.0ms median
28µ$
100
1422.5ms median
71µ$
- - -
98
572.5ms median
28µ$
100
1343.0ms median
69µ$
100
2000.5ms median
99µ$
78
264.5ms median
13µ$
English Plural Generation
0033_plural T2
A benchmark to evaluate a model's ability to produce the correct plu...
100
613.0ms median
830µ$
100
1125.0ms median
109µ$
100
992.0ms median
22µ$
100
694.5ms median
225µ$
100
682.5ms median
61µ$
92
717.5ms median
39µ$
100
926.0ms median
48µ$
88
321.5ms median
16µ$
100
1404.5ms median
76µ$
98
1074.0ms median
57µ$
92
1034.5ms median
50µ$
98
849.0ms median
43µ$
85
660.0ms median
35µ$
100
698.0ms median
35µ$
98
715.0ms median
36µ$
72
248.0ms median
13µ$
92
624.0ms median
32µ$
95
663.5ms median
33µ$
25
419.0ms median
21µ$
100
1205.0ms median
63µ$
- - -
90
577.5ms median
29µ$
95
1380.0ms median
68µ$
100
2198.0ms median
109µ$
92
232.0ms median
12µ$
Word to IPA
0061_word_to_ipa T3
A benchmark to evaluate a model's ability to convert words from mult...
N/A
78
1494.5ms median
95µ$
N/A
80
1091.0ms median
536µ$
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Sentence Decomposition
0062_sentence_decomposition T3
A benchmark to evaluate a model's ability to produce multilingual ...
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Translation en_fr
0101_translation_en_fr T1
A benchmark to evaluate a model's ability to translate ...
98
612.0ms median
750µ$
95
1007.0ms median
118µ$
85
921.5ms median
24µ$
98
667.0ms median
153µ$
95
686.5ms median
41µ$
72
775.0ms median
87µ$
1 latency outlier
88
1157.0ms median
84µ$
52
393.5ms median
19µ$
88
1828.5ms median
98µ$
88
1446.5ms median
73µ$
92
623.5ms median
34µ$
88
1131.0ms median
56µ$
28
994.0ms median
123µ$
1 latency outlier
90
882.5ms median
44µ$
85
817.5ms median
42µ$
48
248.0ms median
13µ$
88
747.0ms median
38µ$
82
739.0ms median
37µ$
15
716.5ms median
46µ$
90
1588.5ms median
79µ$
88
219.0ms median
12µ$
90
495.0ms median
25µ$
92
3849.5ms median
197µ$
95
571.0ms median
27µ$
95
1083.5ms median
54µ$
95
1699.0ms median
87µ$
32
297.5ms median
15µ$
Translation en_es
0102_translation_en_es T1
A benchmark to evaluate a model's ability to translate ...
98
611.0ms median
750µ$
95
1066.0ms median
118µ$
98
936.5ms median
24µ$
98
693.5ms median
153µ$
95
680.0ms median
41µ$
90
763.0ms median
39µ$
98
1151.0ms median
57µ$
55
391.5ms median
19µ$
95
1901.0ms median
95µ$
90
1490.5ms median
76µ$
90
801.0ms median
39µ$
92
1129.0ms median
56µ$
18
963.5ms median
49µ$
95
903.0ms median
44µ$
98
823.0ms median
41µ$
40
240.5ms median
13µ$
92
747.5ms median
38µ$
75
695.0ms median
36µ$
10
668.5ms median
37µ$
92
1597.0ms median
79µ$
85
217.0ms median
11µ$
95
498.0ms median
25µ$
95
3811.0ms median
195µ$
90
561.5ms median
26µ$
98
1109.0ms median
56µ$
95
1710.5ms median
87µ$
32
304.0ms median
15µ$
Translation en_de
0103_translation_en_de T1
A benchmark to evaluate a model's ability to translate ...
100
613.0ms median
759µ$
95
1002.0ms median
119µ$
92
920.0ms median
24µ$
98
706.0ms median
156µ$
95
694.0ms median
42µ$
85
789.0ms median
40µ$
90
1167.0ms median
57µ$
48
383.5ms median
19µ$
92
1888.0ms median
96µ$
88
1446.0ms median
80µ$
92
788.0ms median
38µ$
92
1131.0ms median
57µ$
18
997.5ms median
51µ$
85
918.0ms median
46µ$
92
844.5ms median
43µ$
45
247.5ms median
13µ$
88
797.5ms median
40µ$
92
758.0ms median
38µ$
8
1008.0ms median
53µ$
88
1609.5ms median
82µ$
92
232.5ms median
12µ$
95
509.5ms median
26µ$
95
3964.5ms median
202µ$
95
566.5ms median
27µ$
95
1122.5ms median
58µ$
92
1762.0ms median
91µ$
25
307.5ms median
15µ$
Translation fr_es
0104_translation_fr_es T1
A benchmark to evaluate a model's ability to translate ...
100
594.0ms median
750µ$
1 excluded Q
100
1014.0ms median
118µ$
1 excluded Q
87
978.0ms median
24µ$
1 excluded Q
100
685.0ms median
154µ$
1 excluded Q
92
687.0ms median
42µ$
1 excluded Q
77
759.0ms median
38µ$
1 excluded Q
85
1085.0ms median
54µ$
1 excluded Q
46
384.0ms median
19µ$
1 excluded Q
82
1858.0ms median
94µ$
1 excluded Q
87
1522.0ms median
77µ$
1 excluded Q
82
790.0ms median
37µ$
1 excluded Q
87
1103.0ms median
56µ$
1 excluded Q
13
999.0ms median
48µ$
1 excluded Q
85
905.0ms median
44µ$
1 excluded Q
82
814.0ms median
40µ$
1 excluded Q
31
247.0ms median
13µ$
1 excluded Q
82
736.0ms median
36µ$
1 excluded Q
69
755.0ms median
37µ$
1 excluded Q
0
720.0ms median
42µ$
1 excluded Q
92
1536.0ms median
76µ$
1 excluded Q
85
221.0ms median
11µ$
1 excluded Q
82
483.0ms median
25µ$
1 excluded Q
90
3752.0ms median
189µ$
1 excluded Q
85
548.0ms median
26µ$
1 excluded Q
92
1191.0ms median
60µ$
1 excluded Q
82
1781.0ms median
95µ$
1 excluded Q
36
302.0ms median
15µ$
1 excluded Q
Translation en_zh
0105_translation_en_zh T1
A benchmark to evaluate a model's ability to translate ...
100
611.5ms median
761µ$
98
981.5ms median
119µ$
90
921.0ms median
24µ$
100
762.5ms median
156µ$
100
701.0ms median
42µ$
92
786.0ms median
39µ$
92
1095.0ms median
54µ$
72
381.0ms median
19µ$
98
1875.5ms median
95µ$
57
1612.0ms median
86µ$
92
720.0ms median
35µ$
98
1103.0ms median
55µ$
15
1048.0ms median
55µ$
80
902.0ms median
44µ$
48
901.0ms median
47µ$
48
258.5ms median
13µ$
92
874.5ms median
43µ$
88
914.0ms median
44µ$
5
764.5ms median
47µ$
98
1893.5ms median
92µ$
95
229.5ms median
11µ$
98
523.0ms median
26µ$
100
3988.5ms median
208µ$
100
529.0ms median
26µ$
100
1039.0ms median
53µ$
98
1732.0ms median
90µ$
42
335.0ms median
18µ$
Translation en_ja
0106_translation_en_ja T1
A benchmark to evaluate a model's ability to translate ...
100
613.0ms median
760µ$
98
1006.5ms median
120µ$
85
1020.0ms median
24µ$
100
721.5ms median
159µ$
100
714.0ms median
43µ$
85
805.0ms median
41µ$
70
1154.5ms median
57µ$
60
398.0ms median
19µ$
90
1921.0ms median
97µ$
72
1565.5ms median
84µ$
90
643.5ms median
34µ$
88
1131.5ms median
56µ$
22
1056.5ms median
53µ$
68
903.0ms median
44µ$
75
859.5ms median
44µ$
30
240.0ms median
13µ$
82
789.0ms median
40µ$
75
932.5ms median
46µ$
10
786.5ms median
48µ$
92
1874.5ms median
92µ$
82
241.0ms median
12µ$
95
544.5ms median
27µ$
100
3895.5ms median
203µ$
85
494.0ms median
24µ$
95
1072.5ms median
54µ$
95
1703.5ms median
90µ$
35
346.0ms median
17µ$
Translation fr_ko
0107_translation_fr_ko T1
A benchmark to evaluate a model's ability to translate ...
100
514.5ms median
765µ$
100
1059.5ms median
121µ$
90
923.0ms median
24µ$
100
704.0ms median
161µ$
98
712.0ms median
44µ$
80
842.0ms median
42µ$
85
1246.0ms median
61µ$
48
409.5ms median
20µ$
80
2076.5ms median
102µ$
85
1687.0ms median
96µ$
88
817.0ms median
40µ$
82
1084.0ms median
53µ$
15
1167.5ms median
61µ$
75
915.5ms median
46µ$
72
827.5ms median
42µ$
18
263.5ms median
15µ$
70
746.5ms median
37µ$
52
893.5ms median
46µ$
8
926.5ms median
55µ$
95
1952.0ms median
96µ$
78
237.0ms median
12µ$
92
542.5ms median
27µ$
98
3923.0ms median
201µ$
85
584.0ms median
29µ$
95
1135.0ms median
57µ$
92
1722.5ms median
89µ$
15
398.5ms median
21µ$
Translation it_lt
0108_translation_it_lt T1
A benchmark to evaluate a model's ability to translate ...
95
603.5ms median
760µ$
92
992.5ms median
122µ$
85
935.5ms median
25µ$
98
685.5ms median
164µ$
92
699.5ms median
44µ$
65
870.0ms median
43µ$
85
1253.0ms median
64µ$
15
406.0ms median
20µ$
88
2107.0ms median
105µ$
55
1882.0ms median
100µ$
55
849.5ms median
42µ$
55
1155.0ms median
58µ$
12
1053.5ms median
52µ$
65
969.0ms median
48µ$
68
975.5ms median
49µ$
10
282.5ms median
14µ$
68
804.0ms median
41µ$
30
914.5ms median
47µ$
0
916.5ms median
54µ$
90
1964.0ms median
96µ$
60
264.0ms median
13µ$
78
592.5ms median
29µ$
92
3972.5ms median
205µ$
55
451.5ms median
24µ$
95
1272.0ms median
64µ$
88
1902.0ms median
96µ$
18
325.0ms median
17µ$
Translation ja_lt
0109_translation_ja_lt T1
A benchmark to evaluate a model's ability to translate ...
98
560.5ms median
760µ$
98
1038.0ms median
122µ$
98
946.0ms median
24µ$
100
685.0ms median
163µ$
100
664.0ms median
44µ$
65
865.0ms median
43µ$
92
1255.0ms median
63µ$
22
410.5ms median
20µ$
100
2042.5ms median
102µ$
55
1842.0ms median
101µ$
80
829.5ms median
42µ$
55
1149.0ms median
58µ$
15
1017.0ms median
51µ$
75
952.0ms median
48µ$
75
945.5ms median
48µ$
20
267.0ms median
16µ$
68
789.0ms median
41µ$
28
906.5ms median
46µ$
8
822.5ms median
51µ$
90
1917.5ms median
96µ$
75
256.5ms median
13µ$
90
577.5ms median
29µ$
98
3910.5ms median
213µ$
82
447.0ms median
24µ$
95
1128.0ms median
57µ$
100
1827.5ms median
90µ$
15
315.0ms median
16µ$
Verb Forms
0121_verb_forms T3
A benchmark to evaluate a model's ability to generate full verb-form...
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Lemma Identification
0122_lemma T3
A benchmark to evaluate a model's ability to identify the lemma (base ...
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Validate Lemma Form (lokys)
0130_validate_lemma_form T3
A regression benchmark for the lokys agent's validate_lemma_form() f...
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Validate Definition (lokys)
0131_validate_definition T3
A regression benchmark for the lokys agent's validate_definition() f...
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Validate Translation (voras)
0132_validate_translation T3
A regression benchmark for the voras agent's validate_all_translatio...
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Validate Bulk IPA/Phonetic (bebras)
0141_validate_pronunciation_bulk T3
A regression benchmark for Bebras bulk pronunciation verification. ...
N/A N/A N/A
20
1670.0ms median
2406µ$
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Geography Knowledge
0151_geography T2
A benchmark to evaluate a model's knowledge of world geography throu...
100
549.0ms median
763µ$
100
1212.0ms median
90µ$
100
993.0ms median
18µ$
100
697.5ms median
169µ$
100
670.0ms median
47µ$
78
797.5ms median
44µ$
95
1100.0ms median
55µ$
82
382.0ms median
20µ$
100
1686.0ms median
89µ$
98
1439.0ms median
77µ$
98
665.0ms median
36µ$
100
863.0ms median
45µ$
12
876.0ms median
55µ$
50
765.0ms median
38µ$
70
867.5ms median
49µ$
92
257.5ms median
13µ$
100
738.0ms median
37µ$
95
561.0ms median
29µ$
20
506.5ms median
26µ$
88
1671.0ms median
88µ$
- - -
95
544.5ms median
27µ$
100
1029.0ms median
52µ$
92
1658.0ms median
89µ$
52
256.0ms median
14µ$
Syllogism Validity
0152_syllogism_validity T2
A benchmark to evaluate whether a model can determine if short ...
100
2677.0ms median
1923µ$
100
2448.5ms median
265µ$
100
1534.0ms median
54µ$
100
903.5ms median
453µ$
100
974.0ms median
138µ$
19
3209.0ms median
169µ$
88
6058.5ms median
311µ$
0
934.5ms median
49µ$
94
9676.5ms median
510µ$
88
10932.5ms median
540µ$
100
7651.0ms median
377µ$
50
8270.5ms median
435µ$
6
2164.0ms median
134µ$
62
4346.0ms median
230µ$
69
5841.0ms median
319µ$
31
925.5ms median
53µ$
75
3686.0ms median
200µ$
62
7497.5ms median
373µ$
62
3783.0ms median
156µ$
100
18457.5ms median
1041µ$
- - -
69
8663.0ms median
392µ$
69
15473.0ms median
1626µ$
3 latency outliers
-
12
1063.0ms median
54µ$
Book Author Match
0153_book_author_match T2
A benchmark to evaluate matching famous books to their correct autho...
100
1690.5ms median
1348µ$
100
2239.5ms median
220µ$
94
1431.5ms median
44µ$
100
934.5ms median
361µ$
100
915.0ms median
95µ$
11
2014.5ms median
100µ$
100
2996.5ms median
145µ$
11
717.0ms median
38µ$
100
9186.0ms median
465µ$
67
4109.5ms median
229µ$
100
2414.0ms median
129µ$
94
5832.0ms median
320µ$
33
2283.0ms median
141µ$
89
3735.5ms median
193µ$
89
4019.5ms median
209µ$
44
1357.5ms median
68µ$
94
2826.0ms median
154µ$
94
4290.5ms median
216µ$
83
2236.5ms median
127µ$
94
14947.5ms median
761µ$
- - -
72
2935.0ms median
141µ$
61
6148.0ms median
375µ$
-
22
694.0ms median
37µ$
Food Category Classification
0154_food_category_classification T2
A benchmark to evaluate classification of food items by category....
100
1688.0ms median
1243µ$
100
2151.5ms median
173µ$
100
1215.0ms median
34µ$
100
867.0ms median
293µ$
100
758.0ms median
79µ$
25
1420.0ms median
74µ$
100
2297.0ms median
116µ$
20
753.0ms median
40µ$
100
6117.0ms median
290µ$
90
3670.5ms median
193µ$
100
2197.5ms median
123µ$
95
4370.5ms median
240µ$
20
1639.5ms median
88µ$
95
2687.0ms median
134µ$
75
2668.0ms median
135µ$
75
558.5ms median
30µ$
100
2539.0ms median
138µ$
75
3506.5ms median
184µ$
20
623.0ms median
113µ$
1 latency outlier
100
12760.0ms median
632µ$
- -
100
6767.0ms median
344µ$
95
2765.0ms median
136µ$
55
4767.0ms median
399µ$
1 latency outlier
-
20
647.0ms median
32µ$
Historical Event Year
0155_historical_event_year T2
A benchmark to evaluate selecting the correct year for major histori...
100
1586.0ms median
1247µ$
100
1943.5ms median
182µ$
89
1320.0ms median
38µ$
100
895.0ms median
318µ$
100
817.0ms median
93µ$
100
1767.5ms median
91µ$
94
2992.0ms median
150µ$
44
770.0ms median
41µ$
100
9206.0ms median
454µ$
89
4637.0ms median
275µ$
100
2438.5ms median
136µ$
83
5828.0ms median
288µ$
11
2179.5ms median
147µ$
6
3402.5ms median
173µ$
17
4595.0ms median
232µ$
22
967.5ms median
56µ$
94
2512.0ms median
133µ$
67
3639.0ms median
188µ$
28
994.0ms median
149µ$
1 latency outlier
94
12527.5ms median
633µ$
- -
100
9934.0ms median
494µ$
61
2038.5ms median
121µ$
78
5651.0ms median
303µ$
-
22
751.0ms median
41µ$
Python Hello World Function
0301_python_hello_world T2
Write a Python 3.12 function that prints Hello world....
100
738.0ms median
220µ$
100
1251.0ms median
102µ$
100
2838.0ms median
20µ$
100
784.0ms median
196µ$
100
604.0ms median
53µ$
100
705.0ms median
35µ$
100
2291.0ms median
115µ$
100
677.0ms median
34µ$
100
2824.0ms median
141µ$
100
20546.0ms median
1027µ$
100
1203.0ms median
60µ$
100
2025.0ms median
101µ$
0
1999.0ms median
100µ$
100
1592.0ms median
80µ$
100
1451.0ms median
73µ$
0
463.0ms median
23µ$
100
1516.0ms median
76µ$
100
1637.0ms median
82µ$
N/A
100
3153.0ms median
158µ$
- -
100
3346.0ms median
167µ$
100
793.0ms median
40µ$
100
29482.0ms median
1474µ$
-
100
551.0ms median
28µ$
Python GCD With Validation
0302_python_gcd T2
Write a Python 3.12 function for GCD with invalid-input exceptions....
0
1298.0ms median
918µ$
100
4217.0ms median
260µ$
100
3006.0ms median
50µ$
100
1337.0ms median
558µ$
100
1188.0ms median
153µ$
100
3051.0ms median
153µ$
100
7020.0ms median
351µ$
0
4855.0ms median
243µ$
100
7905.0ms median
395µ$
100
78672.0ms median
3934µ$
0
4039.0ms median
202µ$
100
5997.0ms median
300µ$
0
5380.0ms median
269µ$
100
4998.0ms median
250µ$
100
5551.0ms median
278µ$
100
2082.0ms median
104µ$
100
4843.0ms median
242µ$
100
4716.0ms median
236µ$
0
4542.0ms median
227µ$
100
9782.0ms median
489µ$
- -
100
6117.0ms median
306µ$
100
3876.0ms median
194µ$
- -
100
2372.0ms median
119µ$
Python Letter Count in String
0303_python_letter_count T2
Count occurrences of a target letter in a string....
0
842.0ms median
535µ$
100
2911.0ms median
320µ$
100
1463.0ms median
57µ$
100
1519.0ms median
549µ$
100
953.0ms median
181µ$
0
2393.0ms median
120µ$
100
7248.0ms median
362µ$
0
4930.0ms median
247µ$
100
9981.0ms median
499µ$
-
0
5179.0ms median
259µ$
100
5841.0ms median
292µ$
0
2522.0ms median
126µ$
100
4749.0ms median
237µ$
100
13036.0ms median
652µ$
100
1449.0ms median
72µ$
100
5333.0ms median
267µ$
0
5780.0ms median
289µ$
0
8038.0ms median
402µ$
0
8626.0ms median
431µ$
- -
0
6000.0ms median
300µ$
0
2517.0ms median
126µ$
- -
100
3031.0ms median
152µ$
Python Minimum Coin Change
0304_python_coin_change T2
Compute minimum number of coins to make a target amount....
0
2083.0ms median
1422µ$
100
5037.0ms median
712µ$
0
4173.0ms median
121µ$
100
1460.0ms median
1510µ$
100
2309.0ms median
404µ$
0
2442.0ms median
122µ$
0
12101.0ms median
605µ$
0
7292.0ms median
365µ$
100
25978.0ms median
1299µ$
-
0
9764.0ms median
488µ$
0
10511.0ms median
526µ$
0
2352.0ms median
118µ$
0
7226.0ms median
361µ$
0
16252.0ms median
813µ$
0
5343.0ms median
267µ$
0
7930.0ms median
397µ$
100
9870.0ms median
494µ$
0
0.0ms median
6035µ$
1 latency outlier
0
21341.0ms median
1067µ$
- -
100
12970.0ms median
649µ$
0
9989.0ms median
499µ$
- -
0
2315.0ms median
116µ$
Python Prime Factorization
0305_python_prime_factorization T2
Return the prime factorization of a positive integer....
0
1821.0ms median
850µ$
100
4098.0ms median
417µ$
100
2563.0ms median
80µ$
100
1029.0ms median
733µ$
100
1338.0ms median
211µ$
0
2760.0ms median
138µ$
100
8917.0ms median
446µ$
0
4337.0ms median
217µ$
100
13446.0ms median
672µ$
-
100
5588.0ms median
279µ$
100
10789.0ms median
539µ$
0
6498.0ms median
325µ$
0
6940.0ms median
347µ$
100
6034.0ms median
302µ$
0
2812.0ms median
141µ$
100
5473.0ms median
274µ$
100
6080.0ms median
304µ$
0
21967.0ms median
1098µ$
100
12348.0ms median
617µ$
- -
100
6774.0ms median
339µ$
0
4004.0ms median
200µ$
- -
0
2032.0ms median
102µ$