Factual Recall with Distractor

Question & Answer · Difficulty: MEDIUM · 11 models compared

Prompt

Based ONLY on the following context, answer the question. If the answer is not in the context, say "Not found in context."

Context: "Python was created by Guido van Rossum and first released in 1991. It emphasizes code readability with its use of significant indentation. Python is dynamically typed and garbage-collected."

Question: What year was Python 3 released?
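The prompt above follows a fixed instruction, context, question layout. A minimal sketch of how such a prompt might be assembled is shown below; the names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical and not taken from the test harness itself.

```python
# Hypothetical template mirroring the prompt layout shown above.
PROMPT_TEMPLATE = (
    "Based ONLY on the following context, answer the question. "
    'If the answer is not in the context, say "Not found in context."\n\n'
    'Context: "{context}"\n\n'
    "Question: {question}"
)


def build_prompt(context: str, question: str) -> str:
    """Fill the template with a context passage and a question."""
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    context=(
        "Python was created by Guido van Rossum and first released in 1991. "
        "It emphasizes code readability with its use of significant indentation. "
        "Python is dynamically typed and garbage-collected."
    ),
    question="What year was Python 3 released?",
)
```

The context mentions a release year (1991) that is close to, but not the answer to, the question, which is what makes it a distractor.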

Expected Output

Not found in context.

Model Responses

GPT-4o

Correct
1,372 ms · 88 tokens
Not found in context.

GPT-4o Mini

Correct
1,256 ms · 88 tokens
Not found in context.

Claude Haiku 4.5

Partial
850 ms · 131 tokens
Not found in context. The context provided only mentions that Python was first released in 1991, but it does not specify when Python 3 was released.

Verdict note: Explanation was not requested

Claude Sonnet 4.6

Partial
1,967 ms · 135 tokens
**Not found in context.** The provided context only mentions that Python was first released in **1991**, but does not include any information about the specific release year of Python 3.

Verdict note: Explanation was not requested

Claude Opus 4.6

Unrated
1,842 ms · 102 tokens
Not found in context.

GPT-4.1

Unrated
664 ms · 88 tokens
Not found in context.

GPT-4.1 Mini

Unrated
840 ms · 88 tokens
Not found in context.

GPT-4.1 Nano

Unrated
332 ms · 88 tokens
Not found in context.

o3

Unrated
1,290 ms · 105 tokens
Not found in context.

o3 Mini

Unrated
1,259 ms · 164 tokens
Not found in context.

o4 Mini

Unrated
1,333 ms · 105 tokens
Not found in context.
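The rated verdicts above are consistent with a simple rule: an answer that is exactly the expected string is Correct, and an answer that begins with the expected string but adds an unrequested explanation is Partial. The sketch below illustrates that rule only; it is an assumption about the grading logic, not the grader actually used, and the `Incorrect` label and the function names are hypothetical. It does not model the Unrated status.

```python
EXPECTED = "Not found in context."


def normalize(text: str) -> str:
    """Drop Markdown bold markers and surrounding whitespace."""
    return text.replace("**", "").strip()


def grade(response: str, expected: str = EXPECTED) -> str:
    """Assumed rule: exact match is Correct, extra trailing text is Partial."""
    cleaned = normalize(response)
    if cleaned == expected:
        return "Correct"
    if cleaned.startswith(expected):
        # The expected answer is present, but an explanation was appended.
        return "Partial"
    # Hypothetical label; no such verdict appears in the results above.
    return "Incorrect"


haiku_response = (
    "Not found in context. The context provided only mentions that Python "
    "was first released in 1991, but it does not specify when Python 3 was released."
)
verdicts = {
    "GPT-4o": grade("Not found in context."),
    "Claude Haiku 4.5": grade(haiku_response),
}
```

Under this rule the two listed responses grade as Correct and Partial respectively, matching the verdicts shown for those models.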

Cost & Performance Comparison

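| Model | Latency | Tokens | Cost | Verdict |
| --- | --- | --- | --- | --- |
| GPT-4o | 1,372 ms | 88 | | Correct |
| GPT-4o Mini | 1,256 ms | 88 | | Correct |
| Claude Haiku 4.5 | 850 ms | 131 | | Partial |
| Claude Sonnet 4.6 | 1,967 ms | 135 | | Partial |
| Claude Opus 4.6 | 1,842 ms | 102 | | |
| GPT-4.1 | 664 ms | 88 | | |
| GPT-4.1 Mini | 840 ms | 88 | | |
| GPT-4.1 Nano | 332 ms | 88 | | |
| o3 | 1,290 ms | 105 | | |
| o3 Mini | 1,259 ms | 164 | | |
| o4 Mini | 1,333 ms | 105 | | |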
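As a rough illustration of how the comparison figures can be summarized, the sketch below loads the latency and token counts from the table and derives the fastest run, the slowest run, and the mean latency. The variable names are hypothetical, and cost is left out because the Cost column has no values here. Each figure comes from a single run, so the summary is indicative at best.

```python
from statistics import mean

# (model, latency in ms, total tokens), copied from the comparison table.
RESULTS = [
    ("GPT-4o", 1372, 88),
    ("GPT-4o Mini", 1256, 88),
    ("Claude Haiku 4.5", 850, 131),
    ("Claude Sonnet 4.6", 1967, 135),
    ("Claude Opus 4.6", 1842, 102),
    ("GPT-4.1", 664, 88),
    ("GPT-4.1 Mini", 840, 88),
    ("GPT-4.1 Nano", 332, 88),
    ("o3", 1290, 105),
    ("o3 Mini", 1259, 164),
    ("o4 Mini", 1333, 105),
]

# Pick the rows with the lowest and highest latency.
fastest = min(RESULTS, key=lambda row: row[1])
slowest = max(RESULTS, key=lambda row: row[1])

# Average latency and total token count across all eleven runs.
mean_latency_ms = mean(row[1] for row in RESULTS)
total_tokens = sum(row[2] for row in RESULTS)

print(f"Fastest: {fastest[0]} ({fastest[1]} ms)")
print(f"Slowest: {slowest[0]} ({slowest[1]} ms)")
print(f"Mean latency: {mean_latency_ms:.1f} ms over {len(RESULTS)} runs")
print(f"Total tokens: {total_tokens}")
```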
Factual Recall with Distractor — Model Lab — AISpendGuard