Our crisis in the face of the thinking machine paradox
Every advance in AI is devalued by ever stricter criteria, generating an endless cycle of expectation and disappointment
In the history of technological progress, we are witnessing a psychological and cultural phenomenon that could be described as the pursuit of a constantly receding horizon. This is the paradox of moving milestones, a perverse dynamic in which the better artificial intelligence performs against the parameters we ourselves have set, the less inclined we are to accept those parameters as valid proof of intelligence. We measure the capabilities of machines through milestones that promise to mark the passage from 'mere software' to a superior entity, but as soon as these boundaries are crossed, we declare them insufficient, retroactively deciding that they were never true yardsticks.
The archetypal example of this epistemological instability is the Turing test. Proposed in 1950 as an 'imitation game' to sidestep the seemingly impossible task of defining thought, the test aimed to assess only whether a machine could functionally stand in for a human. For decades, it represented the symbolic pinnacle of AI, until, in 2014, a chatbot called Eugene Goostman managed to 'pass' it by fooling a third of the judges. That victory, however, proved hollow: it rested not on deep understanding but on a strategy of theatrical diversion and built-in deception, exploiting the judges' indulgent expectations of a supposed foreign teenager. As Gary Marcus observed, winners of such challenges tend to rely on parlor tricks rather than demonstrate genuine intelligence, resorting to confabulation in order to meet the test criteria.
Faced with the hollowness of these symbolic victories, the scientific community has moved the bar towards formal and technical benchmarks, such as solving logical puzzles or writing code. Yet the same pattern of saturation and disillusionment repeats itself here. Recent comparisons between generations of language models show performance jumps that, in another historical context, would have been hailed as miracles. On raw scores for demanding tests such as GPQA Diamond, results went from 38.8% with GPT-4 to 85.7% with GPT-5, while on ARC-AGI-1 the jump was even more dramatic, from 4.5% to 65.7%. Although GPT-5 shows a decisive focus on action and agentic reasoning, rather than on the mere accumulation of encyclopaedic knowledge, the surprise factor remains curiously muted.
The reason for this disenchantment lies in the speed with which frontier models follow one another: we no longer compare new iterations with the historical limits of the technology, but with versions from a few months earlier, so that each giant leap feels like merely maintaining the status quo. Moreover, a fundamental definitional gap persists: there is no consensus on what Artificial General Intelligence (AGI) really is. We debate whether it should mirror the architecture of human thought, possess a form of consciousness, or act as a universal problem-solver. Whatever the answer, intelligence resists collapsing into a single numerical score, as decades of IQ debates demonstrate.
This resistance to quantification is not an anomaly exclusive to silicon, but is rooted in the very nature of human cognitive experience, which has always eluded exhaustive measurement. Our intelligence is not a monolith reducible to processing speed or data storage capacity, but rather a fluid fabric of intuition, contextual adaptability and the handling of ambiguity, dimensions that structurally escape the rigidity of standardised tests. If psychometrics has struggled for over a century to capture biological intelligence in the tight mesh of the Intelligence Quotient without triggering fierce academic disputes, it is perhaps naive to expect that a battery of benchmarks can definitively map the boundaries of a synthetic mind without trivialising its complexity or reducing its essence to mere statistics.

