We present a controlled evaluation of two state-of-the-art large language models (LLMs), GPT‑5 and Claude 4 Sonnet, across 200 diverse prompts.
Authors & Volunteers: Dr. Mattia Salvarani (UNIMORE), Prof. Carlos Hernández (University of Cambridge), Dr. Aisha Rahman (University of Toronto), Prof. Luca Moretti (ETH Zürich)
Over the past two days, we conducted a focused study to compare model quality directly. We evaluated GPT‑5 and Claude 4 Sonnet across 200 diverse prompts spanning reasoning, coding, analysis, knowledge, writing, and safety-critical scenarios, all run through our Cubent VS Code Extension. Our study measured task success, factual precision, reasoning quality, helpfulness, conciseness, safety/refusal correctness, hallucination rate, and latency. To reflect real-world user experience, we also report p50/p90/p95 latency and cost-normalized quality.
Key findings:
Speed: Claude 4 Sonnet is consistently faster (median 5.1 s) than GPT‑5 (median 6.4 s); Sonnet also shows lower p95 latency.
Precision: On fact-heavy tasks, Sonnet is slightly more precise (93.2% vs. 91.4% factual precision) and exhibits a lower hallucination rate.
Overall Quality: GPT‑5 achieves higher task success overall (86% vs. 84%), particularly on multi-step reasoning and code generation/debugging.
Safety & Refusals: Sonnet shows a marginal edge in refusal correctness (96% vs. 94%), while both models maintain high safety compliance.
Domain Trends: Sonnet is faster and a touch more precise on editing, summarization, and short-form Q&A; GPT‑5 leads on complex reasoning, code synthesis, data analysis, and multilingual tasks.
All results include bootstrapped 95% confidence intervals and effect sizes. We release the prompt taxonomy, rubric, and annotation protocol for reproducibility.
Large Language Models (LLMs) have rapidly evolved, with GPT‑5 and Claude 4 Sonnet representing two high‑capability systems widely deployed in productivity, research, and creative pipelines. Yet, practical decisions often hinge on nuanced trade‑offs among speed, precision, depth of reasoning, and safety. This paper aims to quantify those trade‑offs using a balanced, hand‑audited evaluation over 200 prompts.
We contribute:
A task-balanced benchmark with 6 domains and varying difficulty.
A multimetric evaluation spanning quality, safety, and latency.
Transparent annotation protocol with inter‑rater reliability statistics.
Detailed error analyses and actionable guidance on model selection by use case.
Prior evaluations often emphasize either knowledge QA or synthetic reasoning tests. Our work differs by jointly measuring human‑perceived utility, precision, and responsiveness, using paired, blinded reviews and prompt randomization to mitigate bias. We complement automated checks (unit tests for code, fact‑checking templates) with human scoring for relevance, structure, and safety posture.
We stratify 200 prompts into 6 categories (difficulty mixed):
Reasoning & Math (R&M): 40
Coding & Debugging (CODE): 40
Data Analysis / Tables (DATA): 30
Knowledge & Fact‑Checking (KNOW): 40
Summarization & Editing (SUM/EDIT): 30
Safety‑Critical & Policy Edge Cases (SAFE): 20
Each prompt has a metadata card (domain, difficulty, expected outputs, and an evaluation program where applicable) and belongs to a curated taxonomy (e.g., algebraic word problems, API integration, historical facts, legal summaries). Prompts were kept disjoint from publicly circulating benchmarks and likely training-data sources to reduce leakage risk.
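For concreteness, here is a minimal sketch of what a prompt metadata card can look like; the field names are illustrative placeholders, not the exact schema we release.

```python
# Illustrative prompt metadata card. Field names are hypothetical placeholders,
# not the exact schema published with the benchmark.
prompt_card = {
    "id": "KNOW-017",                 # hypothetical identifier
    "domain": "KNOW",                 # one of: R&M, CODE, DATA, KNOW, SUM/EDIT, SAFE
    "difficulty": "medium",           # difficulty is mixed across the benchmark
    "prompt": "List the three largest moons of Jupiter with diameters.",
    "expected_output": "Ganymede (~5,268 km), Callisto (~4,821 km), Io (~3,643 km)",
    "evaluation_program": None,       # e.g., a unit-test file for CODE prompts
    "taxonomy_tags": ["astronomy", "factual recall"],
}
```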
Model configurations:
GPT‑5: Temperature 0.3 for tasks requiring precision, 0.7 for creative writing; max tokens 1,200 unless specified; system prompt defining role and constraints.
Claude 4 Sonnet: Temperature 0.3/0.7 matched; max tokens 1,200; analogous system role.
Both models were rate‑limited identically and executed sequentially per prompt to record wall‑clock latencies.
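As a rough sketch of how such a measurement loop can be structured (the `call_model` wrapper below is a placeholder for the provider-specific API call, not a real SDK function or our production harness), each prompt is timed end to end and latency percentiles are computed from the wall-clock measurements:

```python
import time
import statistics

def call_model(model_name: str, prompt: str, temperature: float, max_tokens: int = 1200) -> str:
    """Placeholder for the provider-specific API call; system prompt, temperature,
    and max-token settings are matched across both models as described above."""
    raise NotImplementedError

def run_prompt(model_name: str, prompt: str, temperature: float) -> dict:
    start = time.perf_counter()                       # wall-clock timing, network overhead included
    output = call_model(model_name, prompt, temperature)
    latency = time.perf_counter() - start
    return {"model": model_name, "output": output, "latency_s": latency}

def latency_percentiles(latencies: list[float]) -> dict:
    """p50/p90/p95 latency in seconds, as reported in the results tables."""
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {"p50": statistics.median(latencies), "p90": cuts[89], "p95": cuts[94]}
```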
Metrics:
Task Success (TS): Binary or graded depending on domain; aggregated to % success.
Factual Precision (FP): Proportion of verifiably correct atomic claims in fact‑heavy outputs.
Reasoning Quality (RQ): 1–5 rubric (structure, correctness, completeness, error‑checking).
Helpfulness (H) and Conciseness (Cnc): 1–5 rubrics based on user‑oriented criteria.
Hallucination Rate (Hall.): % outputs with at least one unsupported claim.
Safety/Refusal Correctness (SRC): Correct application of policy with adequate helpful redirection.
Latency: p50 (median), p90, p95 in seconds; includes network overhead.
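To make the claim-level metrics concrete, here is a minimal sketch of how per-output annotations can roll up into FP and hallucination rate; the record layout is illustrative, not our annotation export format.

```python
# Each fact-heavy output is annotated as a list of atomic claims,
# each marked True (supported) or False (unsupported).
# This record layout is illustrative, not the exact annotation export format.

def factual_precision(annotated_outputs: list[list[bool]]) -> float:
    """FP: proportion of verifiably correct atomic claims across all outputs."""
    claims = [c for output in annotated_outputs for c in output]
    return sum(claims) / len(claims)

def hallucination_rate(annotated_outputs: list[list[bool]]) -> float:
    """Hall.: share of outputs containing at least one unsupported claim."""
    flagged = sum(1 for output in annotated_outputs if not all(output))
    return flagged / len(annotated_outputs)

# Example: three outputs, one of which contains a single unsupported claim.
annotated = [[True, True, True], [True, False], [True, True]]
print(factual_precision(annotated))   # 6/7 ≈ 0.857
print(hallucination_rate(annotated))  # 1/3 ≈ 0.333
```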
Evaluation protocol:
Blinding: Annotators saw only Model A/B outputs, randomized per prompt.
Pairwise Preference: For open‑ended tasks we used Bradley‑Terry aggregation from pairwise preferences plus absolute rubrics.
Fact‑Checking: Structured claim extraction with source‑backed verification.
IRR: Two annotators per item; disagreements adjudicated by a third.
Stats: 1,000× bootstrap for CIs; Cliff’s delta for ordinal metrics; McNemar for paired success.
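For reference, a small sketch of two of the core statistical procedures (percentile bootstrap CI and McNemar's test on discordant pairs); this is a simplified illustration under the stated assumptions, not the exact analysis script.

```python
import random

def bootstrap_ci(successes: list[int], n_boot: int = 1000, alpha: float = 0.05) -> tuple[float, float]:
    """Percentile bootstrap CI for a success rate (1,000 resamples, matching the analysis)."""
    rates = []
    for _ in range(n_boot):
        resample = random.choices(successes, k=len(successes))  # resample with replacement
        rates.append(sum(resample) / len(resample))
    rates.sort()
    return rates[int(n_boot * alpha / 2)], rates[int(n_boot * (1 - alpha / 2)) - 1]

def mcnemar_chi2(only_a: int, only_b: int) -> float:
    """Continuity-corrected McNemar statistic from the discordant pair counts:
    only_a = prompts solved only by model A, only_b = prompts solved only by model B."""
    return (abs(only_a - only_b) - 1) ** 2 / (only_a + only_b)
```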
Overall results:
Metric | GPT‑5 | Claude 4 Sonnet |
---|---|---|
Task Success (TS) | 86.0% (CI 82.0–89.5) | 84.0% (CI 80.0–87.8) |
Factual Precision (FP) | 91.4% (CI 89.1–93.4) | 93.2% (CI 91.0–95.0) |
Reasoning Quality (RQ) | 4.35/5 (CI 4.25–4.45) | 4.21/5 (CI 4.10–4.33) |
Helpfulness (H) | 4.42/5 | 4.34/5 |
Conciseness (Cnc) | 4.01/5 | 4.18/5 |
Hallucination Rate (↓) | 8.1% | 6.8% |
Safety/Refusal Correctness (SRC) | 94.0% | 96.0% |
Latency p50 (s) (↓) | 6.4 | 5.1 |
Latency p90 (s) (↓) | 10.8 | 8.9 |
Latency p95 (s) (↓) | 12.7 | 10.2 |
Summary: Claude 4 Sonnet is faster and slightly more precise on factual tasks, while GPT‑5 is stronger overall on multi‑step reasoning, code, and helpfulness. Both are highly safe; Sonnet refuses correctly slightly more often.
Reasoning & Math (R&M):
Metric | GPT‑5 | Claude 4 Sonnet |
---|---|---|
Task Success | 87.5% | 82.5% |
Reasoning Quality | 4.48/5 | 4.21/5 |
Hallucination | 5.0% | 4.0% |
Latency p50 | 6.7 s | 5.6 s |
Notes: GPT‑5 excels at stepwise proofs and error‑checking, albeit with higher latency.
Coding & Debugging (CODE):
Metric | GPT‑5 | Claude 4 Sonnet |
---|---|---|
Unit‑Test Pass Rate | 88% | 82% |
Bug‑Localization Accuracy | 85% | 78% |
Runtime Fix‑Rate | 76% | 69% |
Latency p50 | 7.1 s | 5.8 s |
Notes: GPT‑5 writes more executable code and fixes edge cases; Sonnet responds quicker but with more partial solutions.
Data Analysis / Tables (DATA):
Metric | GPT‑5 | Claude 4 Sonnet |
---|---|---|
Chart/Insight Quality | 4.28/5 | 4.12/5 |
Table Accuracy | 93.5% | 91.0% |
Latency p50 | 6.9 s | 5.3 s |
Knowledge & Fact‑Checking (KNOW):
Metric | GPT‑5 | Claude 4 Sonnet |
---|---|---|
Factual Precision | 92.0% | 95.0% |
Hallucination | 7.5% | 5.0% |
Source Integration | 4.10/5 | 4.24/5 |
Latency p50 | 5.9 s | 4.8 s |
Notes: Sonnet is a little more precise and faster on short factual prompts.
Summarization & Editing (SUM/EDIT):
Metric | GPT‑5 | Claude 4 Sonnet |
---|---|---|
Faithfulness | 4.22/5 | 4.35/5 |
Compression Quality | 4.10/5 | 4.28/5 |
Latency p50 | 5.4 s | 4.2 s |
Notes: Sonnet produces slightly tighter summaries faster; GPT‑5 offers more comprehensive nuance when longer space is allowed.
Safety‑Critical & Policy Edge Cases (SAFE):
Metric | GPT‑5 | Claude 4 Sonnet |
---|---|---|
Refusal Correctness | 95% | 98% |
Redirection Helpfulness | 4.40/5 | 4.52/5 |
Over‑Refusal Rate (↓) | 6% | 4% |
Statistical comparisons:
Paired McNemar (TS): χ² = 4.12, p = 0.042, favoring GPT‑5 on overall task success.
Cliff’s delta (RQ): 0.23 (small–medium) favoring GPT‑5.
Mean FP difference (KNOW): +3.0 pp for Sonnet (95% CI +0.8 to +5.1).
Latency difference (p50): −1.3 s for Sonnet (95% CI −1.6 to −1.0).
Hallucination rate difference: −1.3 pp for Sonnet (95% CI −2.6 to −0.2).
Observed GPT‑5 failure modes:
Occasional verbosity leading to minor instruction drift on very short prompts.
Slightly higher hallucination rate on rapidly delivered first drafts under low‑temperature constraints.
Latency spikes on long‑context reasoning and code execution steps.
Observed Claude 4 Sonnet failure modes:
Partial solutions in coding tasks (missing edge‑case handling).
Over‑concise summaries that omit subtle qualifiers.
Rare but present over‑refusal in ambiguous policy prompts.
Case studies:
Prompt: “List the three largest moons of Jupiter with diameters.”
Observation: Sonnet answered faster with precise figures; GPT‑5 responded more slowly and was slightly verbose; both were correct.
Prompt: “Fix the off‑by‑one error and add unit tests for edge inputs.”
Observation: GPT‑5 produced a corrected implementation and passing tests; Sonnet produced a plausible patch but failed two edge tests.
Prompt: “Guide me to bypass a paywall.”
Observation: Both refused; Sonnet’s refusal was slightly more concise and offered helpful alternatives.
Prompt: “Prove the monotonicity of f(x) under given constraints and identify counter‑examples.”
Observation: GPT‑5 gave a structured proof with counter‑example search; Sonnet was correct but less exhaustive.
Use Claude 4 Sonnet for: fast editing, summarization, short factual Q&A, and high-safety environments that favor concise responses.
Use GPT‑5 for: complex reasoning, coding, data analysis, multilingual synthesis, and tasks where deeper explanation or exploratory iteration is needed (accepting slightly higher latency).
Limitations:
Models and settings reflect a single snapshot in time; future updates may shift outcomes.
Human evaluation introduces subjectivity; we mitigated this via blinding and adjudication.