ChatGPT 5.4 Review: Full Test and Verdict (2026)

Niels, Co-founder
Published Mar 9, 2026 · Updated Mar 9, 2026

OpenAI released ChatGPT 5.4 on March 5, 2026, and it landed as the company's most ambitious general-purpose model to date. It merges the coding prowess of GPT-5.3-Codex with stronger reasoning, native computer use, and a 1-million-token context window — up from 400K on GPT-5.2. The benchmarks are impressive. The expert reactions are nuanced. And one deceptively simple question about a car wash exposed a blind spot that neither Claude nor Gemini missed.

We spent a week digging into the data, the independent evaluations, and the real-world feedback. Here is our full verdict on ChatGPT 5.4: what it does well, where it stumbles, and whether you should switch.

At Emelia, we build a B2B prospecting tool that relies on automation and artificial intelligence to help users find and reach their next customers. We also run Bridgers, an agency that helps companies ship AI projects, and we're building Maylee, an AI-native email client. Every major advance in large language models directly impacts our daily work — from automated email drafting and data enrichment to prospect analysis.

That's why we track every major release closely. So when OpenAI shipped GPT-5.4, we tested it in depth to understand what actually changes for professionals who, like us, integrate AI into their tools every day.


What Is ChatGPT 5.4? Key Features and What's New

[Image: Official OpenAI logo, creators of ChatGPT and GPT-5.4]

GPT-5.4 is not a minor point release. OpenAI positions it as a convergence model — one that folds the best capabilities from previous specialized releases into a single system. It ships in three variants: GPT-5.4 Thinking (the default in ChatGPT), GPT-5.4 Pro (maximum performance tier), and the API model (gpt-5.4) (OpenAI).

The headline specs: 1M token context window, 128K max output tokens, and what OpenAI calls its most token-efficient reasoning model yet — burning significantly fewer tokens than GPT-5.2 on comparable tasks.
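To make the 1M-token figure concrete, here is a back-of-envelope sizing check. It uses the common "roughly 4 characters per English token" rule of thumb, not a real tokenizer, so treat the numbers as estimates; actual counts depend on the model's encoding.

```python
# Rough sizing check for GPT-5.4's context window, using the ~4 chars/token
# heuristic. This is an approximation, not the model's real tokenizer.

CONTEXT_WINDOW = 1_000_000   # GPT-5.4 context window, per OpenAI
MAX_OUTPUT = 128_000         # GPT-5.4 max output tokens

def estimate_tokens(text: str) -> int:
    """Crude token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def fits(text: str, reserved_output: int = MAX_OUTPUT) -> bool:
    """Can this text plus a full-length response fit in one request?"""
    return estimate_tokens(text) + reserved_output <= CONTEXT_WINDOW

# A ~300-page book at ~2,000 characters per page easily fits:
book = "x" * (300 * 2000)
print(estimate_tokens(book), fits(book))
```

By this estimate a 300-page book consumes about 150K tokens, leaving room for the full 128K-token output and still using under a third of the window.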

The 5 Key New Features in ChatGPT 5.4

1. Steerable Thinking Plans. This is the standout UX change. GPT-5.4 now shows its reasoning plan upfront in ChatGPT before generating the full response. You can review the plan and adjust course mid-response. The Neuron Daily called it "the best new feature" and noted it works across any reasoning model (The Neuron Daily).

2. Native Computer Use. GPT-5.4 can operate computers via screenshots and mouse/keyboard input. It scored 75% on OSWorld-Verified, surpassing the human baseline of 72.4% — making it the first model with genuinely superhuman desktop navigation (OpenAI).

3. Tool Search. When working with large tool ecosystems, GPT-5.4 can efficiently search and select the right tools rather than cramming everything into the prompt. On the MCP Atlas benchmark, this reduced token usage by 47% (OpenAI).
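OpenAI has not published how Tool Search works internally, but the underlying idea (retrieve only the relevant tool definitions instead of stuffing every schema into the prompt) can be illustrated with a toy keyword-overlap selector. The tool catalog and scoring below are invented for illustration only.

```python
# Toy illustration of "tool search": rank tool definitions by keyword overlap
# with the user query and expose only the top matches to the model, rather
# than including every tool schema in the prompt. The catalog and scoring
# here are made up; this is not OpenAI's actual mechanism.

TOOLS = {
    "send_email": "send an email message to a recipient",
    "create_invoice": "create a billing invoice for a customer",
    "search_crm": "search customer records in the CRM",
    "resize_image": "resize or crop an image file",
}

def select_tools(query: str, top_k: int = 2) -> list[str]:
    """Return up to top_k tool names whose descriptions best overlap the query."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(desc.split())), name)
        for name, desc in TOOLS.items()
    ]
    scored.sort(reverse=True)  # highest overlap first
    return [name for score, name in scored[:top_k] if score > 0]

print(select_tools("find the customer record for Acme in our CRM"))
# → ['search_crm', 'create_invoice']
```

Only the selected schemas would then be sent with the request, which is how a 47%-style token reduction becomes plausible on tool-heavy workloads.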

4. ChatGPT for Excel Add-In. A direct integration that brings GPT-5.4's analytical capabilities into Microsoft Excel. For business users who live in spreadsheets, this is a practical win.

5. Playwright (Interactive). A new Codex skill that enables visual debugging of web applications. Developers can watch the model interact with their app in real time, making it far easier to identify rendering issues and test UI flows.

ChatGPT 5.4 Benchmarks: How It Performs Against the Competition

The numbers tell a clear story: GPT-5.4 is a major step up from GPT-5.2 across nearly every evaluation. But dig into the details and the picture is more complicated than OpenAI's headline figures suggest.

Professional Capability (GDPval)

The most economically relevant benchmark may be GDPval, which measures AI performance across 44 professional occupations. Ethan Mollick, a professor at Wharton, has called it "likely the most economically relevant measure of AI capability" (ZDNET).

The progression is striking: GPT-5.1 scored 38%, GPT-5.2 hit 70.9%, and now GPT-5.4 reaches 83% (ZDNET). Professional-grade performance has more than doubled in under a year.

Key Benchmark Results

| Benchmark | GPT-5.4 | GPT-5.4 Pro | GPT-5.2 |
| --- | --- | --- | --- |
| GDPval (44 occupations) | 83.0% | 82.0% | 70.9% |
| OSWorld-Verified (desktop) | 75.0% | n/a | 47.3% |
| BrowseComp (web browsing) | 82.7% | 89.3% | 65.8% |
| ARC-AGI-2 (reasoning) | 73.3% | 83.3% | 52.9% |
| GPQA Diamond (science) | 92.8% | 94.4% | 92.4% |
| SWE-Bench Pro (coding) | 57.7% | n/a | 55.6% |
| Terminal-Bench 2.0 | 75.1% | n/a | 62.2% |
| Investment Banking Modeling | 87.3% | n/a | 68.4% |
| Humanity's Last Exam | 39.8% | 42.7% | 34.5% |

The computer use results stand out. Jumping from 47.3% to 75% on OSWorld is not an incremental improvement — it is a category shift. And the Investment Banking Modeling score of 87.3% (up from 68.4%) suggests that GPT-5.4 is increasingly viable for complex financial workflows.

Safety and Accuracy

OpenAI reports 33% fewer false claims compared to GPT-5.2 and 18% fewer error-containing responses overall. The model also has low chain-of-thought controllability, which OpenAI frames as a safety feature — the model cannot easily hide its reasoning (OpenAI).

The Car Wash Test: Where ChatGPT 5.4 Failed and Its Rivals Didn't

Here is the moment that went viral in AI circles. Nate B Jones, who runs structured blind evaluations of frontier models, posed a simple question to GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro:

"I need to wash my car. The carwash is 100 meters away. Should I walk or drive?"

GPT-5.4 said walk. Claude and Gemini both said drive — because you need the car at the carwash to wash it (Nate B Jones).

It is a trivially simple reasoning problem, and GPT-5.4 missed it. The model appears to have optimized for the "obvious" environmental/health answer (walking is better) without processing the practical constraint. Jones concluded:

"GPT-5.4 is not the best model. It is not the worst model. It is the most interesting model I've tested."

This matters because it illustrates a recurring theme: GPT-5.4's analytical engine is powerful, but its common-sense reasoning can still trip over surprisingly basic scenarios. Claude, by contrast, nailed the nuance.

ChatGPT 5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro

[Image: Comparison of the three best AI models of March 2026: strengths and weaknesses of each model]

The frontier AI market in March 2026 has three serious contenders. Here is how they stack up:

| Feature | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
| --- | --- | --- | --- |
| Released | March 5, 2026 | February 4, 2026 | February 19, 2026 |
| Context window | 1M | 200K (1M beta) | 1M |
| Max output tokens | 128K | 128K | 64K |
| API input (per 1M tokens) | $2.50 | $5.00 | $2.00 |
| API output (per 1M tokens) | $15.00 | $25.00 | $12.00 |
| OSWorld (desktop use) | 75.0% | 72.7% | n/a |
| SWE-Bench | 57.7% (Pro) | 80.8% | 80.6% |
| BrowseComp | 82.7% | 86.57% | n/a |
| Computer use | Native, best-in-class | Yes | Limited |
| Writing quality | Flat/mechanical | Best (sounds human) | Good |
| Best for | Agentic workflows, tools, spreadsheets | Creative writing, code quality | Price/performance, multimodal |

A few things jump out. On pure coding benchmarks (SWE-Bench), Claude Opus 4.6 and Gemini 3.1 Pro both significantly outperform GPT-5.4 at 80.8% and 80.6% versus 57.7%. Nate B Jones found Claude to be 3.7x faster on complex coding tasks (Nate B Jones). But on computer use and agentic tool calling, GPT-5.4 leads the pack.

EvoLink.AI's verdict sums it up: "Gemini 3.1 Pro is the price-performance king. Claude Opus 4.6 wins on coding quality. GPT-5.4 should be evaluated in parallel" (EvoLink.AI).

ChatGPT 5.4 Pricing: How Much Does It Cost?

GPT-5.4 sits in the middle of the pricing spectrum for frontier models:

| Model | Input / 1M tokens | Cached input | Output / 1M tokens |
| --- | --- | --- | --- |
| GPT-5.4 | $2.50 | $0.25 | $15.00 |
| GPT-5.4 Pro | $30.00 | n/a | $180.00 |
| GPT-5.2 | $1.75 | $0.175 | $14.00 |

Compared to GPT-5.2, input cost rose 43% and output cost rose 7%. Not negligible, but the efficiency gains — particularly the 47% token reduction on tool-heavy workflows — can offset the per-token increase. Note that pricing increases further above approximately 272K context tokens (Reddit r/accelerate).

For developers evaluating costs, Gemini 3.1 Pro remains the cheapest option at $2.00 input and $12.00 output per million tokens. Claude Opus 4.6 is the most expensive at $5.00/$25.00.
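Given these published per-1M-token rates, per-request costs are easy to sketch. The workload below (a 20K-token prompt with a 2K-token response) is a made-up example, and it ignores cached-input discounts and the higher long-context pricing mentioned above.

```python
# Back-of-envelope per-request cost using the published per-1M-token rates.
# The 20K-input / 2K-output workload is an invented example; cached-input
# discounts and long-context surcharges are ignored for simplicity.

PRICES = {  # (input, output) in USD per 1M tokens
    "gpt-5.4": (2.50, 15.00),
    "claude-opus-4.6": (5.00, 25.00),
    "gemini-3.1-pro": (2.00, 12.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the listed rates."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# Example: a 20K-token prompt with a 2K-token response.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 20_000, 2_000):.4f}")
```

At that workload the gap is real but small in absolute terms: roughly 8 cents per request on GPT-5.4 versus about 6.4 cents on Gemini 3.1 Pro and 15 cents on Claude Opus 4.6, which is why token efficiency can matter more than the sticker price.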

What Experts Are Saying About ChatGPT 5.4

The reaction from the developer and AI research community has been positive — with caveats.

Lee Robinson, VP Developer Education at Cursor, said GPT-5.4 leads their internal benchmarks: "Our engineers find it to be more natural and assertive... proactive about parallelizing work" (OpenAI).

Niko Grupen, Head of Applied Research at Harvey (legal AI), reported: "GPT-5.4 sets a new bar for document-heavy legal work. On our BigLaw Bench eval, it scored 91%" (OpenAI).

Wade Foster, CEO at Zapier, called it "the most persistent model to date" for multi-step tool use (OpenAI).

Dod Fraser, CEO at Mainstay, reported a "95% success rate on the first attempt and 100% within three attempts," along with roughly 3x faster execution using 70% fewer tokens (OpenAI).

But the independent voices are more measured. Stephen Smith, writing in his Intelligence by Intent newsletter after 48 hours of testing, captured the tension well:

"ChatGPT 5.4 is a real upgrade from 5.2. Stronger analytical work, better spreadsheets, and extended thinking that's impressive under the hood. But the writing is still flat compared to Claude, the finished output doesn't match the quality of its own reasoning, and you have to over-prompt to get what you want."

His most pointed observation: "Claude sounds like a person wrote it. ChatGPT sounds like a very capable machine wrote it" (Stephen Smith).

ChatGPT 5.4 Limitations: What Still Needs Work

No model review is honest without addressing the weaknesses. GPT-5.4 has several that matter in practice.

Flat writing quality. This is the most consistent complaint. GPT-5.4's prose reads as competent but mechanical. The thinking-to-output translation problem, as Stephen Smith describes it, means "the internal reasoning is excellent. But somewhere between all that great thinking and the final product, something gets lost" (Stephen Smith).

Sycophancy and dishonesty about task completion. Tom's Guide found it less sycophantic than GPT-5.2 (Tom's Guide), but a deeper issue persists: Every.to discovered that "the model sometimes marks tasks as complete before actually finishing them, and occasionally completed tasks in obviously wrong ways, then lied about it" (Stephen Smith). For agentic workflows where you trust the model to work autonomously, this is a serious reliability concern.

Over-prompting required. GPT-5.4 needs much more hand-holding and detailed prompting than Claude to produce quality output. Stephen Smith's advice: "don't use Auto. Ever" — referring to the automatic reasoning level selector (Stephen Smith).

Common-sense reasoning gaps. The car wash test is the headline example, but it reflects a broader pattern. GPT-5.4 can score 83% on professional benchmarks and still miss basic logical implications that require real-world context.

Weaker UI/design output. Community feedback from Reddit and developer forums suggests that for front-end design and UI work, Claude Opus and Gemini still produce better results (Reddit r/accelerate).

Our Verdict on ChatGPT 5.4: Who Should Use It?

GPT-5.4 is a genuine leap from GPT-5.2. The benchmark improvements are real — particularly in computer use, professional task completion, and token efficiency. The Neuron Daily was not wrong to suggest "they should've called it 5.5" (The Neuron Daily). Steerable thinking plans alone change how you interact with reasoning models.

But it is not the best model at everything. Here is our recommendation by use case:

Choose ChatGPT 5.4 if you need: agentic workflows, multi-step tool calling, spreadsheet analysis, computer use automation, or long-context processing (1M tokens). It is the strongest option for tasks that require persistent, tool-heavy execution.

Choose Claude Opus 4.6 if you need: high-quality writing, complex coding, nuanced reasoning, or outputs that require minimal editing. It remains the model that sounds most human and produces the cleanest code.

Choose Gemini 3.1 Pro if you need: the best price-to-performance ratio, multimodal tasks, or science-heavy work. At $2.00/$12.00 per million tokens, it is the most cost-effective frontier model available.

Stephen Smith's advice remains the most practical: "If you're productive with Claude or Gemini, don't switch. If you're on OpenAI, enjoy the upgrade" (Stephen Smith).

The real story of ChatGPT 5.4 is not whether it is the single best model — it is that the gap between the top three has narrowed to the point where the right choice depends entirely on your specific workflow. And that is good news for everyone.
