Model Lab

Real AI models, real tasks, real costs. We run the same prompts against multiple models daily and publish the results — response quality, latency, and cost — with human verdicts.

Today's Challenge (Hard)
New challenge daily at 6:00 UTC

Customer Email Intent Classification

Classification · 0 votes today
Vote on Today’s Challenge

Tasks: 19

Models Tested: 11

Total Runs: 223

Community Votes: 4

Cost vs Quality

Find the cheapest model for your task type. View chart →

Task Categories

Classification

4 tasks · 56 runs

Last run: Mar 31, 2026

Code Generation

4 tasks · 44 runs

Last run: Mar 31, 2026

Extraction

4 tasks · 44 runs

Last run: Mar 31, 2026

Question & Answer

4 tasks · 44 runs

Last run: Mar 31, 2026

Summarization

3 tasks · 35 runs

Last run: Mar 31, 2026

How It Works

  1. We define real-world edge-case tasks — and generate new ones daily with AI
  2. A cron job sends the same prompt to GPT-4o, GPT-4o Mini, Claude Sonnet 4.6, and Claude Haiku 4.5
  3. We record the response, latency, token usage, and cost
  4. A human reviews each response and marks it as correct, incorrect, or partial
  5. All API costs are tracked through AISpendGuard itself
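
The daily loop in steps 2 and 3 can be sketched as follows. This is a minimal illustration, not AISpendGuard's actual implementation: `call_model`, the per-1K-token price table, and the token counts are all hypothetical placeholders for real provider SDK calls and pricing.

```python
import time
from dataclasses import dataclass

@dataclass
class RunRecord:
    model: str
    response: str
    latency_s: float
    tokens: int
    cost_usd: float

# Hypothetical per-1K-token prices; real prices vary by model and over time.
PRICE_PER_1K = {
    "gpt-4o": 0.005,
    "gpt-4o-mini": 0.0006,
    "claude-sonnet-4.6": 0.003,
    "claude-haiku-4.5": 0.001,
}

def run_benchmark(prompt, call_model):
    """Send the same prompt to every model and record the response,
    latency, token usage, and estimated cost."""
    records = []
    for model, price in PRICE_PER_1K.items():
        start = time.monotonic()
        response, tokens = call_model(model, prompt)  # real API call goes here
        latency = time.monotonic() - start
        records.append(RunRecord(model, response, latency, tokens,
                                 tokens / 1000 * price))
    return records

# Offline stub standing in for provider SDK calls, so the sketch runs as-is.
def fake_call(model, prompt):
    return f"{model} answer", 120

records = run_benchmark("Classify this email: 'Where is my refund?'", fake_call)
```

In practice the cron job would persist each `RunRecord` so a human reviewer can later attach a correct/incorrect/partial verdict (step 4).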

Help us find the best AI model

We run the same prompt against multiple models every day. Read the responses, vote on which one got it right, and see how your judgment compares to the community.

223 responses to judge

11 models competing

4 community votes

Sign in to vote. Signed-in users can also propose benchmark tasks or suggest models to test.