Artificial Intelligence

Our crisis in the face of the thinking machine paradox

Every advance in AI is devalued by increasingly rigid criteria, generating an endless cycle of expectations and disappointments

by Paolo Benanti

14 January 2026

3 min read

Translated by AI

In the history of technological progress, we are witnessing a psychological and cultural phenomenon that we could define as a chase towards a constantly receding horizon. This is the paradox of moving milestones, a perverse dynamic according to which the more sophisticated artificial intelligence becomes with respect to the parameters we ourselves have set, the less inclined we are to consider these parameters as valid proof of intelligence. We measure the capabilities of machines through milestones that promise to signal the passage from 'mere software' to a superior entity, but as soon as these boundaries are crossed, we decree their insufficiency, retroactively deciding that they were never true yardsticks.

The archetypal example of this epistemological instability is the Turing test. Proposed in 1950 as an 'imitation game' to circumvent the impossible task of defining thought, the test aimed to assess the mere functional substitutability of human and machine. For decades, it represented the symbolic pinnacle of AI, until, in 2014, a chatbot called Eugene Goostman managed to 'pass' it by fooling a third of the judges. That victory, however, turned out to be Pyrrhic, based not on deep understanding but on a strategy of theatrical diversion and built-in deception, exploiting the judges' indulgent expectations of a supposed foreign teenager. As Gary Marcus observed, winners of such challenges tend to rely on parlor tricks rather than demonstrate genuine intelligence, resorting to confabulation in order to meet the test criteria.

Faced with the hollowness of these symbolic victories, the scientific community has moved the bar towards formal and technical benchmarks, such as solving logical puzzles or writing code. Yet the same pattern of saturation and disillusionment is repeated here. Recent comparative data between generations of language models show performance jumps that, in another historical context, would have been hailed as miraculous. If we look at the raw scores on complex tests such as GPQA Diamond, we see that they went from 38.8% on GPT-4 to 85.7% on GPT-5, while on ARC-AGI-1 the jump was even more dramatic, from 4.5% to 65.7%. Although GPT-5 demonstrates a decisive focus on action and agentic reasoning, rather than on the mere accumulation of encyclopaedic knowledge, the surprise factor remains curiously attenuated.

The reason for this disenchantment lies in the speed with which frontier models follow one another: we no longer compare new iterations with the historical limits of the technology, but with versions from a few months earlier, so that each giant leap feels like nothing more than maintaining the status quo. Moreover, a fundamental definitional gap persists: there is no consensus on what Artificial General Intelligence (AGI) really is. We question whether it should mirror the architecture of human thought, possess a form of consciousness, or act as a universal problem-solver, but intelligence resists collapsing into a single numerical score, as decades of IQ debates demonstrate.

This resistance to quantification is not an anomaly exclusive to silicon, but is rooted in the very nature of human cognitive experience, which has always eluded exhaustive metric capture. Our intelligence is not a monolith reducible to processing speed or data storage capacity, but rather a fluid fabric of intuition, contextual adaptability and ambiguity management, dimensions that structurally escape the rigidity of standardised tests. If psychometrics has struggled for over a century to harness biological ingenuity in the tight meshes of the Intelligence Quotient without triggering fierce academic disputes, it is perhaps naive to expect that a battery of benchmarks can definitively map the boundaries of a synthetic mind without trivialising its complexity or reducing its essence to mere statistics.

Sam Altman recently labelled AGI a 'not very useful' term, and he may be right in suggesting that it does not accurately describe the usefulness or nature of today's Large Language Models. If we demanded that a system trained only on the knowledge available up to the time of Isaac Newton independently rediscover three centuries of later physics in order to be called 'intelligent', we would be setting an unattainable standard that GPT-5 cannot meet and was never intended to meet. With such a vague and shifting goal, any technological success will always seem provisional, leaving us trapped in a cycle where we redefine intelligence to exclude the very machines that begin to manifest it. On this ethical frontier, the real challenge is no longer technological, but philosophical: we must decide whether we are willing to share the podium of cognition or whether we will continue to move the goalposts in order to remain the only players on the field.
