Benchmark Dashboard

Last updated: March 16, 2026 at 08:36

Tier legend
T1 Tier 1 (Screening)
T2 Tier 2 (Core)
T3 Tier 3 (Advanced)
N/A Excluded by model policy.
Columns, left to right (each result cell lists a score, a latency in ms, and a cost in µ$):
Benchmark | Claude Haiku 4.5 | GPT-5 mini | GPT-5 nano | Gemma 2 2B (LMStudio) 1500 MB | Gemma 2 9B (LMStudio) 5800 MB | Gemma 2B (LMStudio) 1500 MB | Gemma 3 12B (LMStudio) 8100 MB | Granite 3.2 8B (LMStudio) 4900 MB | Llama 2 7B (LMStudio) 4900 MB | Llama 3 8B (LMStudio) 4900 MB | Llama 3.1 8B (LMStudio) 4900 MB | Llama 3.2 1B (LMStudio) 1300 MB | Ministral 8B (LMStudio) 4900 MB | OLMo 3 7B (LMStudio) 4300 MB | Phi-3.5 Mini (LMStudio) 2500 MB | Phi-4 (LMStudio) 9100 MB | Qwen3 1.7B (LMStudio) 1100 MB | Qwen3 4B (LMStudio) 2800 MB | Qwen3 VL 8B (LMStudio) 5000 MB | Qwen3.5 2B (LMStudio) 2700 MB | Qwen3.5 4B (LMStudio) 3400 MB | Qwen3.5 9B (LMStudio) 6600 MB | SmolLM2 1.7B (LMStudio) 1100 MB
Word Length
0011_word_length T1
A benchmark to evaluate a model's ability to count the to...
100 742.5ms 730µ$ | 100 1341.8ms 80µ$ | 42 992.8ms 16µ$ | 17 1644.2ms 82µ$ | 60 2390.3ms 120µ$ | 20 1256.9ms 63µ$ | 85 3372.6ms 173µ$ | 22 1927.9ms 96µ$ | 27 570.5ms 29µ$ | 95 2631.9ms 132µ$ | 75 2754.4ms 138µ$ | 27 660.2ms 33µ$ | 45 2443.2ms 122µ$ | 82 2280.9ms 114µ$ | 15 1196.8ms 60µ$ | 100 1078.6ms 54µ$ | 72 1265.8ms 63µ$ | 57 1859.3ms 93µ$ | 47 4256.9ms 213µ$ | 87 421.7ms 21µ$ | 72 927.1ms 46µ$ | 92 1229.8ms 62µ$ | 35 518.9ms 26µ$
Letter Count
0012_letter_count T1
A benchmark to evaluate a model's ability to count how ma...
77 704.6ms 717µ$ | 47 1256.8ms 77µ$ | 42 1101.0ms 15µ$ | 12 692.6ms 35µ$ | 30 628.5ms 31µ$ | 20 277.3ms 14µ$ | 52 1398.2ms 70µ$ | 52 670.4ms 34µ$ | 15 534.9ms 27µ$ | 15 532.6ms 27µ$ | 35 560.5ms 28µ$ | 42 194.4ms 10µ$ | 32 522.6ms 26µ$ | 45 485.4ms 24µ$ | 12 360.2ms 18µ$ | 47 1061.0ms 53µ$ | 22 187.9ms 9µ$ | 47 385.5ms 19µ$ | 55 2607.5ms 130µ$ | 20 447.4ms 22µ$ | 37 896.8ms 45µ$ | 50 1308.5ms 65µ$ | 20 190.7ms 10µ$
Vowel Count
0013_vowel_count T1
Tests ability to count vowels (a, e, i, o, u and accented forms) in a word acros...
80 708.0ms 791µ$ | 87 1232.1ms 97µ$ | 55 1334.4ms 19µ$ | 40 754.2ms 38µ$ | 37 984.6ms 49µ$ | 27 364.0ms 18µ$ | 42 1589.7ms 80µ$ | 32 771.2ms 39µ$ | 27 699.3ms 35µ$ | 37 726.3ms 36µ$ | 42 764.7ms 38µ$ | 27 239.8ms 12µ$ | 47 729.8ms 37µ$ | 67 637.0ms 32µ$ | 0 452.1ms 23µ$ | 85 1255.0ms 63µ$ | 45 248.4ms 12µ$ | 40 486.7ms 24µ$ | 67 3062.4ms 153µ$ | 15 584.6ms 29µ$ | 25 1195.6ms 60µ$ | 27 2028.8ms 101µ$ | 7 248.2ms 12µ$
Syllable Count
0014_syllable_count T1
Tests ability to count syllables in words across Latin-alphabet languages....
92 679.5ms 738µ$ | 90 1444.3ms 85µ$ | 62 1248.6ms 17µ$ | 30 769.3ms 38µ$ | 70 1045.7ms 52µ$ | 22 357.9ms 18µ$ | 47 1692.8ms 85µ$ | 37 893.3ms 45µ$ | 30 732.5ms 37µ$ | 75 724.1ms 36µ$ | 87 717.0ms 36µ$ | 42 240.1ms 12µ$ | 72 695.6ms 35µ$ | 62 661.5ms 33µ$ | 5 440.4ms 22µ$ | 95 1412.0ms 71µ$ | 25 259.7ms 13µ$ | 62 499.4ms 25µ$ | 27 3046.4ms 152µ$ | 12 536.6ms 27µ$ | 52 1039.4ms 52µ$ | 60 1628.2ms 81µ$ | 15 248.1ms 12µ$
Spell Check
0015_spell_check T1
A benchmark to evaluate a model's ability to identify mis...
100 805.6ms 796µ$ | 100 1291.1ms 89µ$ | 95 1463.5ms 18µ$ | 90 990.0ms 50µ$ | 100 1398.9ms 70µ$ | 55 559.9ms 28µ$ | 100 2127.1ms 106µ$ | 95 1183.5ms 59µ$ | 37 1076.5ms 54µ$ | 97 970.7ms 49µ$ | 92 982.8ms 49µ$ | 40 339.5ms 17µ$ | 90 1019.6ms 51µ$ | 85 861.5ms 43µ$ | 37 585.0ms 29µ$ | 95 1911.1ms 96µ$ | 75 295.0ms 15µ$ | 87 635.0ms 32µ$ | 100 3319.7ms 166µ$ | 82 775.6ms 39µ$ | 90 1421.6ms 71µ$ | 95 2134.0ms 107µ$ | 25 325.9ms 16µ$
Antonym Identification
0016_antonym T1
A benchmark to evaluate a model's ability to identify the...
100 646.8ms 723µ$ | 100 1251.0ms 80µ$ | 100 1216.5ms 16µ$ | 100 740.1ms 37µ$ | 100 935.1ms 47µ$ | 90 390.8ms 20µ$ | 100 1717.2ms 86µ$ | 100 731.3ms 37µ$ | 37 780.9ms 39µ$ | 100 667.2ms 33µ$ | 100 702.1ms 35µ$ | 62 263.1ms 13µ$ | 100 663.1ms 33µ$ | 100 633.1ms 32µ$ | 7 505.5ms 25µ$ | 100 1259.8ms 63µ$ | 92 216.2ms 11µ$ | 97 440.2ms 22µ$ | 100 2915.6ms 146µ$ | 100 530.0ms 27µ$ | 100 1048.6ms 52µ$ | 100 1542.8ms 77µ$ | 30 231.8ms 12µ$
Multilingual Synonym Generation
0017_synonyms T1
A benchmark to evaluate a model's ability to generate noun synonyms ...
98 689.8ms 739µ$ | 100 1287.4ms 83µ$ | 96 865.5ms 17µ$ | 76 839.6ms 42µ$ | 96 1065.6ms 53µ$ | 30 405.5ms 20µ$ | 98 1848.6ms 92µ$ | 90 1106.2ms 55µ$ | 17 975.7ms 49µ$ | 78 875.4ms 44µ$ | 80 887.8ms 44µ$ | 21 262.1ms 13µ$ | 69 752.2ms 38µ$ | 76 840.1ms 42µ$ | 1 611.6ms 31µ$ | 94 1816.7ms 91µ$ | 88 264.2ms 13µ$ | 96 537.9ms 27µ$ | 100 3160.9ms 158µ$ | 88 555.7ms 28µ$ | 92 1164.3ms 58µ$ | 94 1691.9ms 85µ$ | 15 304.2ms 15µ$
Pinyin Letter Count
0018_pinyin_letters T1
A benchmark to evaluate a model's ability to count how many times a s...
35 819.4ms 879µ$ | 35 1116.7ms 119µ$ | 15 1048.0ms 24µ$ | 30 812.1ms 41µ$ | 30 902.9ms 45µ$ | 60 351.2ms 18µ$ | 50 1847.8ms 92µ$ | 25 787.5ms 39µ$ | 15 735.8ms 37µ$ | 25 600.8ms 30µ$ | 15 587.9ms 29µ$ | 10 212.5ms 11µ$ | 15 624.6ms 31µ$ | 20 531.7ms 27µ$ | 0 827.1ms 41µ$ | 35 1229.7ms 62µ$ | 5 237.4ms 12µ$ | 5 443.6ms 22µ$ | 20 3472.7ms 174µ$ | 5 601.7ms 30µ$ | 0 1524.8ms 76µ$ | 25 2335.8ms 117µ$ | 35 236.0ms 12µ$
Simple Arithmetic
0021_simple_arithmetic T1
A benchmark to evaluate a model's ability to perform basic arithmeti...
100 682.2ms 703µ$ | 100 1101.0ms 73µ$ | 100 936.5ms 15µ$ | 97 707.1ms 35µ$ | 100 678.7ms 34µ$ | 97 309.9ms 16µ$ | 100 1564.2ms 78µ$ | 100 642.0ms 32µ$ | 57 528.7ms 26µ$ | 100 552.4ms 28µ$ | 90 603.8ms 30µ$ | 100 188.5ms 9µ$ | 100 640.6ms 32µ$ | 100 467.8ms 23µ$ | 10 354.2ms 18µ$ | 100 1028.2ms 51µ$ | 100 201.3ms 10µ$ | 100 413.5ms 21µ$ | 100 3133.3ms 157µ$ | 100 449.9ms 23µ$ | 100 833.4ms 42µ$ | 97 1181.1ms 59µ$ | 95 191.7ms 10µ$
Unit Conversion
0022_unit_conversion T1
A benchmark to evaluate a model's ability to accurately convert ...
100 603.5ms 712µ$ | 100 1179.1ms 80µ$ | 72 1141.5ms 16µ$ | 25 772.3ms 39µ$ | 97 1128.2ms 56µ$ | 12 349.1ms 17µ$ | 95 1948.9ms 97µ$ | 87 1006.1ms 50µ$ | 17 900.4ms 45µ$ | 80 646.0ms 32µ$ | 85 647.1ms 32µ$ | 22 229.6ms 12µ$ | 85 864.5ms 43µ$ | 97 641.5ms 32µ$ | 5 458.0ms 23µ$ | 97 1274.2ms 64µ$ | 67 322.5ms 16µ$ | 85 660.5ms 33µ$ | 95 3214.4ms 161µ$ | 75 614.5ms 31µ$ | 90 1274.3ms 64µ$ | 92 1893.0ms 95µ$ | 50 253.3ms 13µ$
Math Word Problems
0023_word_problems T1
A benchmark to evaluate a model's ability to read math word problems...
100 601.5ms 728µ$ | 97 1118.8ms 80µ$ | 95 1056.9ms 16µ$ | 75 676.9ms 34µ$ | 100 897.4ms 45µ$ | 67 340.9ms 17µ$ | 100 1497.3ms 75µ$ | 90 786.5ms 39µ$ | 67 650.5ms 33µ$ | 100 610.9ms 31µ$ | 70 606.1ms 30µ$ | 75 190.4ms 10µ$ | 100 650.8ms 33µ$ | 100 474.7ms 24µ$ | 20 416.9ms 21µ$ | 100 1094.2ms 55µ$ | 92 222.6ms 11µ$ | 92 443.4ms 22µ$ | 100 3032.8ms 152µ$ | 100 505.0ms 25µ$ | 95 981.1ms 49µ$ | 72 1680.0ms 84µ$ | 62 213.8ms 11µ$
Fractions and Percentages
0024_percentage_math T1
A benchmark to evaluate a model's ability to calculate percentages a...
100 740.1ms 708µ$ | 100 1172.5ms 74µ$ | 100 918.6ms 15µ$ | 90 673.3ms 34µ$ | 97 736.6ms 37µ$ | 52 293.4ms 15µ$ | 100 1448.0ms 72µ$ | 100 741.5ms 37µ$ | 47 643.8ms 32µ$ | 90 551.6ms 28µ$ | 100 612.5ms 31µ$ | 50 192.7ms 10µ$ | 95 617.3ms 31µ$ | 97 484.1ms 24µ$ | 5 396.2ms 20µ$ | 100 1019.6ms 51µ$ | 95 208.8ms 10µ$ | 100 424.9ms 21µ$ | 100 2990.4ms 150µ$ | 100 492.3ms 25µ$ | 100 919.8ms 46µ$ | 100 1426.6ms 71µ$ | 67 216.6ms 11µ$
Algebra
0025_algebra T1
A benchmark to evaluate a model's ability to solve linear and quadra...
100 714.4ms 721µ$ | 97 1072.6ms 79µ$ | 80 1036.0ms 16µ$ | 42 735.3ms 37µ$ | 72 1003.0ms 50µ$ | 30 332.6ms 17µ$ | 85 1531.8ms 77µ$ | 45 793.1ms 40µ$ | 10 717.7ms 37µ$ | 47 700.1ms 35µ$ | 55 746.2ms 37µ$ | 42 227.4ms 11µ$ | 47 624.3ms 31µ$ | 62 613.6ms 31µ$ | 2 433.4ms 22µ$ | 90 1115.1ms 56µ$ | 60 223.2ms 11µ$ | 65 425.2ms 21µ$ | 100 3017.0ms 151µ$ | 50 533.4ms 27µ$ | 80 975.8ms 49µ$ | 52 1511.2ms 76µ$ | 35 238.7ms 12µ$
Time Arithmetic
0026_time_arithmetic T1
A benchmark to evaluate a model's ability to add and subtract ...
100 689.0ms 730µ$ | 100 1179.0ms 82µ$ | 82 1085.4ms 16µ$ | 35 882.1ms 44µ$ | 70 1012.1ms 51µ$ | 7 442.7ms 22µ$ | 70 1777.7ms 89µ$ | 60 933.4ms 47µ$ | 17 869.1ms 43µ$ | 65 611.2ms 31µ$ | 55 711.8ms 36µ$ | 2 315.4ms 16µ$ | 55 699.7ms 35µ$ | 65 540.1ms 27µ$ | 0 454.2ms 23µ$ | 90 1277.0ms 64µ$ | 27 263.3ms 13µ$ | 65 495.2ms 25µ$ | 75 3380.5ms 169µ$ | 40 550.1ms 28µ$ | 65 1181.0ms 59µ$ | 85 1536.0ms 77µ$ | 5 258.1ms 13µ$
Geometry
0027_geometry T1
A benchmark to evaluate a model's ability to calculate area, perimet...
100 707.0ms 714µ$ | 100 1144.2ms 76µ$ | 97 965.0ms 15µ$ | 67 684.8ms 34µ$ | 95 837.9ms 42µ$ | 30 335.1ms 17µ$ | 95 1529.5ms 77µ$ | 72 789.0ms 39µ$ | 20 643.2ms 32µ$ | 82 599.6ms 30µ$ | 60 651.1ms 33µ$ | 40 201.6ms 10µ$ | 82 670.8ms 34µ$ | 95 521.3ms 26µ$ | 5 387.3ms 19µ$ | 100 1093.7ms 55µ$ | 80 223.0ms 11µ$ | 97 453.9ms 23µ$ | 100 3113.6ms 156µ$ | 60 515.0ms 26µ$ | 87 1014.8ms 51µ$ | 100 1478.3ms 74µ$ | 45 228.0ms 11µ$
Definitions
0031_definitions T2
A benchmark to evaluate a model's ability to identify the...
100 558.4ms 102µ$ | 100 1293.3ms 63µ$ | 100 994.1ms 13µ$ | 97 318.3ms 16µ$ | 100 585.1ms 29µ$ | 2 654.2ms 33µ$ | - | 100 492.0ms 25µ$ | 10 2288.6ms 114µ$ | 97 395.9ms 20µ$ | 97 423.6ms 21µ$ | 60 89.7ms 5µ$ | 100 397.4ms 20µ$ | 100 2114.6ms 106µ$ | 0 4680.6ms 234µ$ | 100 850.6ms 43µ$ | - | - | - | 100 225.1ms 11µ$ | 100 19770.0ms 989µ$ | - | 65 119.0ms 6µ$
Part of Speech
0032_part_of_speech T2
A benchmark to evaluate a model's ability to identify the...
97 655.9ms 798µ$ | 97 1213.9ms 99µ$ | 97 947.7ms 20µ$ | 92 903.0ms 45µ$ | 97 1100.8ms 55µ$ | 70 417.5ms 21µ$ | 100 1892.0ms 95µ$ | 95 979.7ms 49µ$ | 90 848.5ms 42µ$ | 97 761.0ms 38µ$ | 95 807.1ms 40µ$ | 70 272.4ms 14µ$ | 95 796.9ms 40µ$ | 95 727.5ms 36µ$ | 10 568.4ms 28µ$ | 100 1418.1ms 71µ$ | - | - | - | 97 568.4ms 28µ$ | 100 1372.7ms 69µ$ | 100 1982.8ms 99µ$ | 77 268.8ms 13µ$
English Plural Generation
0033_plural T2
A benchmark to evaluate a model's ability to produce the correct plu...
100 620.4ms 830µ$ | 100 1598.1ms 109µ$ | 100 1046.6ms 22µ$ | 92 781.9ms 39µ$ | 100 956.5ms 48µ$ | 87 329.1ms 16µ$ | 100 1518.5ms 76µ$ | 97 850.7ms 43µ$ | 85 689.9ms 35µ$ | 100 703.8ms 35µ$ | 97 727.0ms 36µ$ | 72 255.0ms 13µ$ | 92 648.2ms 32µ$ | 95 659.7ms 33µ$ | 25 424.5ms 21µ$ | 100 1250.5ms 63µ$ | - | - | - | 90 574.0ms 29µ$ | 95 1369.0ms 68µ$ | 100 2183.0ms 109µ$ | 92 234.3ms 12µ$
Word to IPA
0061_word_to_ipa T3
A benchmark to evaluate a model's ability to convert words from mult...
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Sentence Decomposition
0062_sentence_decomposition T3
A benchmark to evaluate a model's ability to produce multilingual ...
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Translation en_fr
0101_translation_en_fr T1
A benchmark to evaluate a model's ability to translate ...
97 638.1ms 750µ$ | 95 1141.8ms 118µ$ | 85 1065.3ms 24µ$ | 72 1747.2ms 87µ$ | 87 1680.5ms 86µ$ | 52 387.8ms 19µ$ | 87 1964.0ms 98µ$ | 87 1123.4ms 56µ$ | 27 2466.6ms 123µ$ | 90 880.3ms 44µ$ | 85 837.9ms 42µ$ | 47 253.6ms 13µ$ | 87 760.5ms 38µ$ | 82 730.7ms 37µ$ | 15 915.9ms 46µ$ | 90 1571.2ms 79µ$ | 87 229.8ms 12µ$ | 90 505.2ms 25µ$ | 92 3947.1ms 197µ$ | 95 535.4ms 27µ$ | 95 1087.7ms 54µ$ | 95 1740.9ms 87µ$ | 32 307.7ms 15µ$
Translation en_es
0102_translation_en_es T1
A benchmark to evaluate a model's ability to translate ...
97 591.8ms 750µ$ | 95 1151.5ms 118µ$ | 97 983.9ms 24µ$ | 90 772.9ms 39µ$ | 97 1142.9ms 57µ$ | 55 372.2ms 19µ$ | 95 1898.5ms 95µ$ | 92 1118.0ms 56µ$ | 17 979.2ms 49µ$ | 95 877.1ms 44µ$ | 97 825.4ms 41µ$ | 40 260.3ms 13µ$ | 92 751.3ms 38µ$ | 75 715.2ms 36µ$ | 10 738.2ms 37µ$ | 92 1588.0ms 79µ$ | 85 220.6ms 11µ$ | 95 506.1ms 25µ$ | 95 3892.2ms 195µ$ | 90 516.1ms 26µ$ | 97 1116.0ms 56µ$ | 95 1747.1ms 87µ$ | 32 302.1ms 15µ$
Translation en_de
0103_translation_en_de T1
A benchmark to evaluate a model's ability to translate ...
100 727.5ms 759µ$ | 95 1067.5ms 119µ$ | 92 966.7ms 24µ$ | 85 792.1ms 40µ$ | 90 1135.8ms 57µ$ | 47 380.4ms 19µ$ | 92 1924.6ms 96µ$ | 92 1144.9ms 57µ$ | 17 1009.8ms 51µ$ | 85 924.0ms 46µ$ | 92 867.7ms 43µ$ | 45 266.1ms 13µ$ | 87 808.4ms 40µ$ | 92 766.2ms 38µ$ | 7 1060.1ms 53µ$ | 87 1638.5ms 82µ$ | 92 237.1ms 12µ$ | 95 526.8ms 26µ$ | 95 4042.6ms 202µ$ | 95 547.6ms 27µ$ | 95 1157.2ms 58µ$ | 92 1826.0ms 91µ$ | 25 306.2ms 15µ$
Translation fr_es
0104_translation_fr_es T1
A benchmark to evaluate a model's ability to translate ...
97 653.4ms 750µ$ | 97 1061.2ms 118µ$ | 85 994.7ms 24µ$ | 75 764.8ms 38µ$ | 82 1089.3ms 54µ$ | 45 370.5ms 19µ$ | 80 1867.8ms 93µ$ | 85 1109.8ms 56µ$ | 12 953.8ms 49µ$ | 82 877.3ms 44µ$ | 80 807.8ms 40µ$ | 30 253.3ms 13µ$ | 80 726.3ms 36µ$ | 67 739.1ms 37µ$ | 0 841.7ms 42µ$ | 90 1512.5ms 76µ$ | 82 224.7ms 11µ$ | 80 493.1ms 25µ$ | 87 3800.8ms 190µ$ | 82 521.1ms 26µ$ | 90 1192.3ms 60µ$ | 80 1898.6ms 95µ$ | 35 300.0ms 15µ$
Translation en_zh
0105_translation_en_zh T1
A benchmark to evaluate a model's ability to translate ...
100 696.6ms 761µ$ | 97 1008.6ms 119µ$ | 90 945.1ms 24µ$ | 92 785.1ms 39µ$ | 92 1083.9ms 54µ$ | 72 377.1ms 19µ$ | 97 1904.4ms 95µ$ | 97 1108.4ms 55µ$ | 15 1098.6ms 55µ$ | 80 881.4ms 44µ$ | 47 944.7ms 47µ$ | 47 266.4ms 13µ$ | 92 864.5ms 43µ$ | 87 884.1ms 44µ$ | 5 948.9ms 47µ$ | 97 1838.7ms 92µ$ | 95 225.5ms 11µ$ | 97 512.0ms 26µ$ | 100 4150.1ms 208µ$ | 100 520.9ms 26µ$ | 100 1058.3ms 53µ$ | 97 1801.5ms 90µ$ | 42 364.5ms 18µ$
Translation en_ja
0106_translation_en_ja T1
A benchmark to evaluate a model's ability to translate ...
100 643.1ms 760µ$ | 97 1053.0ms 120µ$ | 85 1176.0ms 24µ$ | 85 810.3ms 41µ$ | 70 1137.3ms 57µ$ | 60 386.9ms 19µ$ | 90 1942.6ms 97µ$ | 87 1111.8ms 56µ$ | 22 1060.1ms 53µ$ | 67 878.9ms 44µ$ | 75 880.4ms 44µ$ | 30 255.7ms 13µ$ | 82 793.4ms 40µ$ | 75 913.2ms 46µ$ | 10 956.1ms 48µ$ | 92 1841.2ms 92µ$ | 82 241.7ms 12µ$ | 95 546.8ms 27µ$ | 100 4068.3ms 203µ$ | 85 472.6ms 24µ$ | 95 1081.0ms 54µ$ | 95 1805.0ms 90µ$ | 35 343.4ms 17µ$
Translation fr_ko
0107_translation_fr_ko T1
A benchmark to evaluate a model's ability to translate ...
100 640.6ms 765µ$ | 100 1145.7ms 121µ$ | 90 971.7ms 24µ$ | 80 849.1ms 42µ$ | 85 1228.5ms 61µ$ | 47 404.3ms 20µ$ | 80 2034.4ms 102µ$ | 82 1064.5ms 53µ$ | 15 1212.0ms 61µ$ | 75 909.7ms 46µ$ | 72 844.4ms 42µ$ | 17 300.8ms 15µ$ | 70 746.8ms 37µ$ | 52 926.9ms 46µ$ | 7 1104.2ms 55µ$ | 95 1913.3ms 96µ$ | 77 236.9ms 12µ$ | 92 533.7ms 27µ$ | 97 4012.7ms 201µ$ | 85 582.1ms 29µ$ | 95 1139.7ms 57µ$ | 92 1775.7ms 89µ$ | 15 411.0ms 21µ$
Translation it_lt
0108_translation_it_lt T1
A benchmark to evaluate a model's ability to translate ...
95 630.3ms 760µ$ | 92 1088.0ms 122µ$ | 85 979.0ms 25µ$ | 65 865.8ms 43µ$ | 85 1280.7ms 64µ$ | 15 397.4ms 20µ$ | 87 2105.2ms 105µ$ | 55 1157.8ms 58µ$ | 12 1042.0ms 52µ$ | 65 963.4ms 48µ$ | 67 972.7ms 49µ$ | 10 288.1ms 14µ$ | 67 815.8ms 41µ$ | 30 934.1ms 47µ$ | 0 1081.5ms 54µ$ | 90 1909.9ms 96µ$ | 60 266.2ms 13µ$ | 77 585.1ms 29µ$ | 92 4101.7ms 205µ$ | 55 487.4ms 24µ$ | 95 1278.8ms 64µ$ | 87 1915.7ms 96µ$ | 17 329.6ms 17µ$
Translation ja_lt
0109_translation_ja_lt T1
A benchmark to evaluate a model's ability to translate ...
97 573.6ms 760µ$ | 97 1099.9ms 122µ$ | 97 1002.5ms 24µ$ | 65 860.4ms 43µ$ | 92 1265.5ms 63µ$ | 22 393.3ms 20µ$ | 100 2040.9ms 102µ$ | 55 1152.4ms 58µ$ | 15 1017.3ms 51µ$ | 75 969.4ms 48µ$ | 75 951.0ms 48µ$ | 20 315.9ms 16µ$ | 67 810.5ms 41µ$ | 27 912.7ms 46µ$ | 7 1029.2ms 51µ$ | 90 1916.9ms 96µ$ | 75 256.1ms 13µ$ | 90 580.8ms 29µ$ | 97 4266.3ms 213µ$ | 82 488.2ms 24µ$ | 95 1137.2ms 57µ$ | 100 1802.0ms 90µ$ | 15 318.9ms 16µ$
Verb Forms
0121_verb_forms T3
A benchmark to evaluate a model's ability to generate full verb-form...
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Lemma Identification
0122_lemma T3
A benchmark to evaluate a model's ability to identify the lemma (base ...
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Validate Lemma Form (lokys)
0130_validate_lemma_form T3
A regression benchmark for the lokys agent's validate_lemma_form() f...
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Validate Definition (lokys)
0131_validate_definition T3
A regression benchmark for the lokys agent's validate_definition() f...
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Validate Translation (voras)
0132_validate_translation T3
A regression benchmark for the voras agent's validate_all_translatio...
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Geography Knowledge
0151_geography T2
A benchmark to evaluate a model's knowledge of world geography throu...
100 624.8ms 763µ$ | 100 1346.4ms 90µ$ | 100 1036.4ms 18µ$ | 77 871.5ms 44µ$ | 95 1095.6ms 55µ$ | 82 389.6ms 20µ$ | 100 1772.5ms 89µ$ | 100 900.0ms 45µ$ | 12 1094.0ms 55µ$ | 50 767.0ms 39µ$ | 70 986.8ms 49µ$ | 92 257.1ms 13µ$ | 100 748.3ms 37µ$ | 95 581.9ms 29µ$ | 20 525.3ms 26µ$ | 87 1768.9ms 88µ$ | - | - | - | 95 534.8ms 27µ$ | 100 1046.7ms 52µ$ | 92 1782.9ms 89µ$ | 52 278.5ms 14µ$
Syllogism Validity
0152_syllogism_validity T2
A benchmark to evaluate whether a model can determine if short ...
100 2797.9ms 1923µ$ | 100 2444.8ms 265µ$ | 100 1615.1ms 54µ$ | 18 3381.4ms 169µ$ | 87 6221.0ms 311µ$ | 0 988.0ms 49µ$ | 93 10207.6ms 510µ$ | 50 8709.2ms 435µ$ | 6 2670.4ms 134µ$ | 62 4609.0ms 230µ$ | 68 6374.0ms 319µ$ | 31 1061.8ms 53µ$ | 75 4001.5ms 200µ$ | 62 7461.4ms 373µ$ | 62 3129.2ms 167µ$ | 100 20821.9ms 1041µ$ | - | - | - | 68 7846.6ms 392µ$ | 68 32521.4ms 1626µ$ | - | 12 1086.6ms 54µ$
Book Author Match
0153_book_author_match T2
A benchmark to evaluate matching famous books to their correct autho...
100 1736.1ms 1348µ$ | 100 2286.9ms 220µ$ | 94 1507.6ms 44µ$ | 11 2007.1ms 100µ$ | 100 2901.4ms 145µ$ | 11 761.2ms 38µ$ | 100 9295.4ms 465µ$ | 94 6399.6ms 320µ$ | 33 2815.4ms 141µ$ | 88 3868.4ms 193µ$ | 88 4175.8ms 209µ$ | 44 1355.3ms 68µ$ | 94 3078.9ms 154µ$ | 94 4314.6ms 216µ$ | 83 2542.8ms 127µ$ | 94 15229.5ms 761µ$ | - | - | - | 72 2817.1ms 141µ$ | 61 7502.1ms 375µ$ | - | 22 731.3ms 37µ$
Food Category Classification
0154_food_category_classification T2
A benchmark to evaluate classification of food items by category....
100 1820.7ms 1243µ$ | 100 2229.8ms 173µ$ | 100 1382.6ms 34µ$ | 25 1487.3ms 74µ$ | 100 2329.5ms 116µ$ | 20 791.7ms 40µ$ | 100 5791.7ms 290µ$ | 95 4799.6ms 240µ$ | 20 1759.5ms 88µ$ | 95 2677.8ms 134µ$ | 75 2690.9ms 135µ$ | 75 602.3ms 30µ$ | 100 2749.8ms 138µ$ | 75 3687.3ms 184µ$ | 20 2268.7ms 113µ$ | 100 12646.5ms 632µ$ | - | - | 100 6874.1ms 344µ$ | 95 2709.6ms 136µ$ | 55 7976.4ms 399µ$ | - | 20 647.8ms 32µ$
Historical Event Year
0155_historical_event_year T2
A benchmark to evaluate selecting the correct year for major histori...
100 1598.7ms 1247µ$ | 100 2041.4ms 182µ$ | 88 1713.8ms 38µ$ | 100 1827.2ms 91µ$ | 94 2994.6ms 150µ$ | 44 825.5ms 41µ$ | 100 9076.9ms 454µ$ | 83 5751.8ms 288µ$ | 11 2946.1ms 147µ$ | 5 3458.3ms 173µ$ | 16 4640.4ms 232µ$ | 22 1125.6ms 56µ$ | 94 2659.7ms 133µ$ | 66 3754.7ms 188µ$ | 27 2985.6ms 149µ$ | 94 12668.4ms 633µ$ | - | - | 100 9887.1ms 494µ$ | 61 2422.7ms 121µ$ | 77 6056.6ms 303µ$ | - | 22 817.7ms 41µ$
Python Hello World Function
0301_python_hello_world T2
Write a Python 3.12 function that prints Hello world....
100 738.0ms 220µ$ | 100 1251.0ms 102µ$ | 100 2838.0ms 20µ$ | 100 705.0ms 35µ$ | 100 2291.0ms 115µ$ | 100 677.0ms 34µ$ | 100 2824.0ms 141µ$ | 100 2025.0ms 101µ$ | 0 1999.0ms 100µ$ | 100 1592.0ms 80µ$ | 100 1451.0ms 73µ$ | 0 463.0ms 23µ$ | 100 1516.0ms 76µ$ | 100 1637.0ms 82µ$ | N/A | 100 3153.0ms 158µ$ | - | - | 100 3346.0ms 167µ$ | 100 793.0ms 40µ$ | 100 29482.0ms 1474µ$ | - | 100 551.0ms 28µ$
Python GCD With Validation
0302_python_gcd T2
Write a Python 3.12 function for GCD with invalid-input exceptions....
0 1298.0ms 918µ$ | 100 4217.0ms 260µ$ | 100 3006.0ms 50µ$ | 100 3051.0ms 153µ$ | 100 7020.0ms 351µ$ | 40 4855.0ms 243µ$ | 100 7905.0ms 395µ$ | 100 5997.0ms 300µ$ | 0 5380.0ms 269µ$ | 100 4998.0ms 250µ$ | 100 5551.0ms 278µ$ | 100 2082.0ms 104µ$ | 100 4843.0ms 242µ$ | 100 4716.0ms 236µ$ | 0 4542.0ms 227µ$ | 100 9782.0ms 489µ$ | - | - | 100 6117.0ms 306µ$ | 100 3876.0ms 194µ$ | - | - | 100 2372.0ms 119µ$
Python Letter Count in String
0303_python_letter_count T2
Count occurrences of a target letter in a string....
90 842.0ms 535µ$ | 100 2911.0ms 320µ$ | 100 1463.0ms 57µ$ | 80 2393.0ms 120µ$ | 100 7248.0ms 362µ$ | 70 4930.0ms 247µ$ | 100 9981.0ms 499µ$ | 100 5841.0ms 292µ$ | 0 2522.0ms 126µ$ | 100 4749.0ms 237µ$ | 100 13036.0ms 652µ$ | 100 1449.0ms 72µ$ | 100 5333.0ms 267µ$ | 0 5780.0ms 289µ$ | 0 8038.0ms 402µ$ | 90 8626.0ms 431µ$ | - | - | 90 6000.0ms 300µ$ | 80 2517.0ms 126µ$ | - | - | 100 3031.0ms 152µ$
Python Minimum Coin Change
0304_python_coin_change T2
Compute minimum number of coins to make a target amount....
92 2083.0ms 1422µ$ | 100 5037.0ms 712µ$ | 92 4173.0ms 121µ$ | 0 2442.0ms 122µ$ | 92 12101.0ms 605µ$ | 0 7292.0ms 365µ$ | 100 25978.0ms 1299µ$ | 75 10511.0ms 526µ$ | 0 2352.0ms 118µ$ | 92 7226.0ms 361µ$ | 92 16252.0ms 813µ$ | 92 5343.0ms 267µ$ | 75 7930.0ms 397µ$ | 100 9870.0ms 494µ$ | 0 120695.0ms 6035µ$ | 92 21341.0ms 1067µ$ | - | - | 100 12970.0ms 649µ$ | 92 9989.0ms 499µ$ | - | - | 33 2315.0ms 116µ$
Python Prime Factorization
0305_python_prime_factorization T2
Return the prime factorization of a positive integer....
0 1821.0ms 850µ$ | 100 4098.0ms 417µ$ | 100 2563.0ms 80µ$ | 0 2760.0ms 138µ$ | 100 8917.0ms 446µ$ | 0 4337.0ms 217µ$ | 100 13446.0ms 672µ$ | 100 10789.0ms 539µ$ | 0 6498.0ms 325µ$ | 60 6940.0ms 347µ$ | 100 6034.0ms 302µ$ | 60 2812.0ms 141µ$ | 100 5473.0ms 274µ$ | 100 6080.0ms 304µ$ | 0 21967.0ms 1098µ$ | 100 12348.0ms 617µ$ | - | - | 100 6774.0ms 339µ$ | 70 4004.0ms 200µ$ | - | - | 70 2032.0ms 102µ$