Digital Economy

What GPT-5 is like: the first reviews from international experts

8 August 2025

6 min read

The new GPT-5 model marks a point of maturity for artificial intelligence: a qualitative rather than quantitative evolution that consolidates the progress made so far and makes it more usable and more effective across the board. Even so, GPT-5 shifts the centre of gravity of generative AI. That seems to be the first shared verdict of the international specialised press on the arrival of GPT-5, which, it is worth remembering, is already available in Italy, including for free in ChatGPT.

Whether it really is, as OpenAI chief Sam Altman says, a step towards artificial general intelligence remains to be seen. To the experts it looks more like a transition to 'operational intelligence'. The trade press is impressed by the many practical, measurable improvements: GPT-5 shows more robust reasoning across chains of tasks, the ability to carry out tasks that previously required more human orchestration, and state-of-the-art coding performance, as the American site Tom's Guide notes.

Interface

Many are impressed by the cleaner ChatGPT interface, which no longer asks users to pick a model and decides on its own whether to reason. The change points to greater autonomous decision-making but also to better energy and computational efficiency, notes MIT Technology Review. That efficiency is also reflected in the decision to make GPT-5 free for everyone in ChatGPT. Users can still force 'think longer' and other tools with a click; and if ChatGPT starts to reason, they can force an immediate answer instead.

Reasoning and hallucinations

Initial reviews of the quality of reasoning applied to real problems are positive. According to testers and technical commentators (Tom's Hardware, TechTarget), GPT-5 shows better 'consistency' in tackling multi-step problems and a greater tendency to complete sequences of operations without losing the thread. This progress changes how the model is used: it no longer just answers, but governs workflows that integrate search, data manipulation and final output. Accounts from the first testers at leading sites seem to confirm that the progress is neither accidental nor improvised; it is the result of OpenAI's tuning work aimed at precisely these practical scenarios. Bear in mind that OpenAI took two years to move from model 4 to model 5.

On the subject of practical progress, the company also says that hallucinations have fallen by 26 per cent and that an answer is now 44 per cent less likely to contain a major factual error. For now these are only the company's own figures, but experts already point out that even if they hold, the result would still fall short: one in ten answers can still contain hallucinations, notes Mashable, which is very serious given an increasingly common use: asking ChatGPT for medical answers.

OpenAI tested GPT-5 on its own benchmark, SimpleQA. The test is a collection of 'fact-finding questions with short answers that measure the accuracy of the model for the answers attempted', according to the system card description. For this evaluation GPT-5 did not have access to the web, so the hallucination rate is high: 47 per cent (40 per cent with reasoning), compared with 52 per cent for GPT-4o.
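To put those percentages in perspective, it helps to separate the drop in percentage points from the relative reduction. The short Python sketch below uses only the figures quoted above; the function and variable names are illustrative, not taken from OpenAI's material.

```python
def reductions(old_rate: float, new_rate: float) -> tuple[float, float]:
    """Return (drop in percentage points, relative drop in per cent)."""
    absolute = old_rate - new_rate
    relative = absolute / old_rate * 100
    return absolute, relative


# Hallucination rates on the benchmark as quoted in the article, in per cent.
previous_model = 52.0
gpt5_without_reasoning = 47.0
gpt5_with_reasoning = 40.0

for label, rate in [
    ("without reasoning", gpt5_without_reasoning),
    ("with reasoning", gpt5_with_reasoning),
]:
    points, percent = reductions(previous_model, rate)
    print(f"{label}: {points:.0f} points lower, {percent:.1f}% relative reduction")
```

On these numbers the gain without reasoning is 5 percentage points, or just under a tenth in relative terms; with reasoning it is 12 points, a little under a quarter.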

Beth Barnes, founder of the non-profit artificial intelligence research organisation Metr, was quick to spot an inaccuracy in a GPT-5 answer explaining how aircraft work.

Programming

Many also cite progress in coding as one of GPT-5's most important achievements, one that closes the gap with Anthropic's Claude Sonnet (currently the most popular AI tool for programming). Data shared by OpenAI and relayed by the technical press show that the model scores higher on software-oriented benchmarks (SWE-Bench and similar) while using fewer tokens and fewer calls to external tools to solve the same problem. The gain is therefore twofold: the model is not only more accurate in producing useful code but also more efficient, which reduces usage costs at scale and makes it more attractive for commercial products that aim to automate part of the development cycle. Extensive testing will still be needed to gauge its real quality against competitors, both in practice and in integration with third-party systems.

Context window and multimodality

Less central to the debate, but not unimportant, are two other issues: the context window and multimodality. Technical analyses report that GPT-5 is designed to handle much larger contexts. The numbers vary depending on the source and configuration, but the direction is clear: working with long documents, multi-part projects or conversations with extensive memory becomes feasible without having to keep recapitulating information. Many experts (Tom's Hardware, PanelsAI) read this capability as an enabler for professional applications: contract reviews, continuous reporting and financial analyses that require consistency over hundreds of pages can now be run with less human intervention. At the same time, technical sources stress that the word 'multimodal' should be understood pragmatically: better integration of text, images and structured data is already in place; audio and video are still work in progress, and practical robustness depends on use cases and integration pipelines.

Agents

Another recurring thread in the specialised press concerns 'agentic' capabilities and the tools designed to build them. The technical press (TechCrunch, Digital Watch Observatory) has devoted in-depth coverage to the infrastructure that accompanies the model: the Responses API, the Agents SDK and routing systems that let the model decide whether to use a 'thinking' mode or a rapid response all turn GPT-5 into a platform for tailor-made agents rather than a simple endpoint for text completions. The experts explain that, thanks to these APIs and SDKs, developers and companies can orchestrate stacks of tools (web search, calls to internal databases, generation of artefacts such as slides, spreadsheets and code) with security controls and backups. The distance between prototype and production product is thus reduced.

Critical aspects: testing, pricing, security

Alongside the positive tones, however, the technical press keeps a critical and measured register: influential blogs and analysts call for independent verification and reproducible benchmarks before treating the release as a definitive 'breakthrough'. Platformer, Hacker News and other industry commentators point out that metrics shown in briefings or releases may be shaped by test sets chosen in advance and by tuning conditions that are not automatically replicated in every production environment. The open community and technical forums, where impromptu tests and bottom-up comparisons emerge, also note that perceived usefulness can vary radically by domain: what works well for writing code does not automatically carry over to clinical evaluation tasks or regulated processes. This call for independent measurement is a recurring refrain in the technical press.

Cost and access are another critical issue. Several articles (e.g. Platformer, The Verge) point out that OpenAI has chosen a multi-tier strategy: 'mini' and 'nano' models for low-cost, low-latency cases, a 'standard' version for heavy tasks, and direct integration into ChatGPT. Industry publications note that this move will broaden the user base. At the same time, engineers point out that the real economic parameter to monitor remains the price per token in the production pipeline: GPT-5's efficiency in generating responses with fewer tokens and fewer tool calls may translate into a competitive advantage, but the cost arithmetic depends heavily on the type of workload and usage patterns. Caution is therefore in order.
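The cost arithmetic the engineers refer to can be sketched in a few lines of Python. Every number below (prices, token counts, tool-call fee) is a hypothetical placeholder chosen for illustration, not an OpenAI price or a measured workload; the point is only that cost per task depends on both the per-token price and how many tokens and tool calls a task consumes.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Resources consumed by one task (all figures hypothetical)."""
    input_tokens: int
    output_tokens: int
    tool_calls: int


def cost_per_task(
    w: Workload,
    price_in_per_million: float,
    price_out_per_million: float,
    price_per_tool_call: float,
) -> float:
    """Estimate the cost of a single task in the same currency as the prices."""
    token_cost = (
        w.input_tokens * price_in_per_million
        + w.output_tokens * price_out_per_million
    ) / 1_000_000
    return token_cost + w.tool_calls * price_per_tool_call


# Hypothetical prices: 1.00 per million input tokens, 10.00 per million
# output tokens, 0.01 per tool call.
PRICES = (1.00, 10.00, 0.01)

# Same task solved two ways: the second uses fewer output tokens and tool calls.
baseline = Workload(input_tokens=20_000, output_tokens=4_000, tool_calls=6)
efficient = Workload(input_tokens=20_000, output_tokens=3_000, tool_calls=4)

for name, w in [("baseline", baseline), ("efficient", efficient)]:
    print(f"{name}: {cost_per_task(w, *PRICES):.2f} per task")
```

Changing the mix, for example a workload dominated by long inputs rather than long outputs, shifts the result, which is why the comparison has to be made on a company's own usage patterns.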

It is above all on security and governance that the trade press is cautious: the model's extended ability to generate complex artefacts and orchestrate actions on external resources calls for new audit tools, access limits and operational policies. Technical experts point out that the problem is not just reducing hallucinations, but managing dependencies between the model and business systems: how a response comes about, who is responsible for the output, and how the decision chain is traced when autonomous agents are involved. Technical discussions focus on practical issues: logging, testing in isolated environments, mandatory human approval of sensitive output, and clear criteria for blocking risky functionality.
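As a rough illustration of two of the practices listed above, logging and mandatory human approval, here is a minimal Python sketch. The action names, the `run_action` helper and the approver callback are all hypothetical; a real deployment would connect them to its own agent framework and audit storage.

```python
import json
import time
from typing import Callable

# Hypothetical list of actions that must never run without human sign-off.
SENSITIVE_ACTIONS = {"send_email", "update_database", "execute_payment"}


def run_action(
    name: str,
    payload: dict,
    approver: Callable[[str, dict], bool],
    audit_log: list[str],
) -> bool:
    """Gate an agent action behind approval and record the decision.

    Returns True if the action may proceed, False if it was blocked.
    """
    needs_approval = name in SENSITIVE_ACTIONS
    approved = approver(name, payload) if needs_approval else True
    # Every decision is logged, whether or not approval was required.
    audit_log.append(
        json.dumps(
            {
                "timestamp": time.time(),
                "action": name,
                "needs_approval": needs_approval,
                "approved": approved,
            }
        )
    )
    return approved


# Usage: an approver that rejects everything, standing in for a human reviewer.
log: list[str] = []
print(run_action("web_search", {"query": "AI Act"}, lambda n, p: False, log))
print(run_action("execute_payment", {"amount": 100}, lambda n, p: False, log))
print(len(log), "entries in the audit log")
```

The low-risk search goes through without review, the payment is blocked, and both decisions end up in the log, which is the traceability the experts are asking for.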

All of this sounds very familiar to us Europeans, since on 2 August the AI Act obligations for providers of general-purpose AI models (such as GPT-5) took effect, with implications for the companies that use them as well.