Code Generation Tasks

4 benchmark tasks with side-by-side model comparisons

Simple Function

EASY
11 runs · Last: Mar 31
o4 Mini: 4249ms · $0.000000
o3 Mini: 4551ms · $0.000000
o3: 2740ms · $0.000000
GPT-4.1 Nano: 1692ms · $0.000000
GPT-4.1 Mini: 1563ms · $0.000000
GPT-4.1: 1755ms · $0.000000
Claude Opus 4.6: 10868ms · $0.000000
Claude Sonnet 4.6: 12311ms · $0.014532
GPT-4o Mini: 6150ms · $0.000190
GPT-4o: 5467ms · $0.003735
Claude Haiku 4.5: 4190ms · $0.003664

Edge Case Handling

HARD
11 runs · Last: Mar 31
o4 Mini: 27743ms · $0.000000
o3 Mini: 19384ms · $0.000000
o3: 24018ms · $0.000000
GPT-4.1 Nano: 7903ms · $0.000000
GPT-4.1 Mini: 18718ms · $0.000000
GPT-4.1: 10111ms · $0.000000
Claude Opus 4.6: 38250ms · $0.000000
GPT-4o Mini: 39831ms · $0.000393
Claude Sonnet 4.6: 27367ms · $0.030957
Claude Haiku 4.5: 11671ms · $0.008989
GPT-4o: 10219ms · $0.007950

SQL Query Generation

MEDIUM
11 runs · Last: Mar 31
o4 Mini: 10482ms · $0.000000
o3 Mini: 3186ms · $0.000000
o3: 8353ms · $0.000000
GPT-4.1 Nano: 3131ms · $0.000000
GPT-4.1 Mini: 7395ms · $0.000000
GPT-4.1: 4235ms · $0.000000
Claude Opus 4.6: 11787ms · $0.000000
Claude Sonnet 4.6: 15313ms · $0.013038
GPT-4o Mini: 10387ms · $0.000298
GPT-4o: 6381ms · $0.004300
Claude Haiku 4.5: 3607ms · $0.002311

Balanced Parentheses Validator with Nesting Depth

MEDIUM
11 runs · Last: Mar 31
o4 Mini: 32413ms · $0.000000
o3 Mini: 24937ms · $0.000000
o3: 29742ms · $0.000000
GPT-4.1 Nano: 2220ms · $0.000000
GPT-4.1 Mini: 3703ms · $0.000000
GPT-4.1: 3225ms · $0.000000
Claude Opus 4.6: 4505ms · $0.000000
Claude Sonnet 4.6: 7433ms · $0.008988
GPT-4o Mini: 7330ms · $0.000195
GPT-4o: 3320ms · $0.004060
Claude Haiku 4.5: 2319ms · $0.002471