Classification Tasks

4 benchmark tasks with side-by-side model comparisons

Spam vs Ham Email

EASY
15 runs · Last: Mar 31
o4 Mini: 1296ms · $0.000000
o3 Mini: 1147ms · $0.000000
o3: 1860ms · $0.000000
GPT-4.1 Nano: 1023ms · $0.000000
GPT-4.1 Mini: 1895ms · $0.000000
GPT-4.1: 680ms · $0.000000
Claude Opus 4.6: 2877ms · $0.000000
GPT-4o: 2635ms · $0.000077 · 1/2
GPT-4o Mini: 711ms · $0.000005 · 1/2
Claude Sonnet 4.6: 437ms · $0.000132 · 1/2
Claude Haiku 4.5: 263ms · $0.000044 · 1/2

Sentiment with Sarcasm

MEDIUM
15 runs · Last: Mar 31
o4 Mini: 1611ms · $0.000000
o3 Mini: 2809ms · $0.000000
o3: 1884ms · $0.000000
GPT-4.1 Nano: 432ms · $0.000000
GPT-4.1 Mini: 416ms · $0.000000
GPT-4.1: 598ms · $0.000000
Claude Opus 4.6: 2956ms · $0.000000
Claude Sonnet 4.6: 1052ms · $0.000605
Claude Haiku 4.5: 481ms · $0.000161
GPT-4o Mini: 435ms · $0.000005 · 1/2
GPT-4o: 422ms · $0.000087 · 1/2

Multi-label Intent Detection

HARD
15 runs · Last: Mar 31
o4 Mini: 2968ms · $0.000000
o3 Mini: 2338ms · $0.000000
o3: 2295ms · $0.000000
GPT-4.1 Nano: 404ms · $0.000000
GPT-4.1 Mini: 746ms · $0.000000
GPT-4.1: 1073ms · $0.000000
Claude Opus 4.6: 2133ms · $0.000000
Claude Sonnet 4.6: 1977ms · $0.001173
GPT-4o Mini: 568ms · $0.000012 · 1/2
GPT-4o: 388ms · $0.000194 · 1/2
Claude Haiku 4.5: 227ms · $0.000073 · 1/2

Customer Email Intent Classification

HARD
11 runs · Last: Mar 31
o4 Mini: 3348ms · $0.000000
o3 Mini: 8618ms · $0.000000
o3: 5935ms · $0.000000
GPT-4.1 Nano: 589ms · $0.000000
GPT-4.1 Mini: 857ms · $0.000000
GPT-4.1: 877ms · $0.000000
Claude Opus 4.6: 3221ms · $0.000000
Claude Sonnet 4.6: 2871ms · $0.001506
GPT-4o Mini: 1688ms · $0.000045
GPT-4o: 2122ms · $0.000752
Claude Haiku 4.5: 1760ms · $0.000542