Is generative artificial intelligence really a bad worker?

by Alessandro Longo

17 November 2025

4 min read

Translated by AI

It is not yet clear how much AI currently helps productivity at work. But one thing is certain: if it were a worker, it would be a terrible one. Left on its own, in short, AI would wreak havoc, even on the simplest tasks.

This is what emerges from the new Remote Labor Index study by Scale AI, in collaboration with the Center for AI Safety (CAIS), the first to test the best artificial intelligence agents systematically on office tasks. It turns out that Manus, Grok, Claude, ChatGPT and Gemini are still far from being able to replace human workers in the online freelance labour market.

Only 3% of tasks completed

The test simulated a series of real assignments taken from platforms such as Upwork, ranging from graphics and video editing to game development and administrative tasks.

The result is unequivocal: even the most advanced models successfully completed less than 3% of the tasks. In total they earned $1,810 out of a potential $143,991, or about 1.3% of the value on offer.

According to Dan Hendrycks, director of CAIS, the main limitation is not the ability to generate text or code, but the lack of long-term memory, the inability to learn from experience and the difficulty of handling complex, multi-step processes. These three points are the main gaps in generative AI and its most significant differences from the human brain when it comes to cognitive work. The brain learns continuously, adapts and has a flexibility that allows us to find our way through complex tasks made up of several disparate actions. 'Agents know how to respond, but they don't know how to work,' Hendrycks summarises.

In contrast to the OpenAI benchmark

The results of the Remote Labor Index stand in stark contrast to other benchmarks, such as GDPval, developed by OpenAI, according to which state-of-the-art models such as GPT-5 now approach human performance in various office jobs, especially in writing and analysis tasks.

The difference, the authors of the CAIS study explain, lies in the context: GDPval measures linguistic and cognitive competence in controlled environments, whereas the Remote Labor Index assesses operational ability in realistic scenarios that require coordination, planning and adaptation.

Increased productivity, but not replacement

The message that emerges is twofold. On the one hand, AI can enhance the productivity of human freelancers by speeding up drafting, analysis and brainstorming.

On the other, its autonomy is still limited. Online work platforms, where success depends on complex interactions with customers, deadlines and feedback, remain for now an eminently human domain.

A study by MIT Sloan (2024) had already found a similar effect: the use of generative AI increases productivity by 14% among employees performing standardised cognitive tasks, but does not improve, and sometimes worsens, performance in jobs requiring autonomy or creativity.

In reality, there is still no clear picture of how much AI augments productivity on the worker's side. The well-known study 'The GenAI Divide: State of AI in Business 2025' by MIT's Project NANDA reports that only 5% of companies claim a measurable productivity return from the use of generative AI.

More recent research by Wharton-GBK, conducted among managers and C-level executives of large enterprises, tells a different story: 82% of business leaders say they use generative AI tools every week (46% daily), while 72% measure ROI and three out of four report positive results.

The divergence perhaps depends on the digital maturity of the companies involved: where processes are already digitised and staff are trained, AI tends to generate tangible value. The MIT study, too, attributes the gap between experimentation and real impact mainly to incomplete technology integration, poor staff training and a lack of clear ROI metrics.

In short, it is not clear how much of the productivity shortfall depends on the immaturity of the tool and how much on companies' inability to use it.

This issue will probably be resolved in the coming months as both the technology and its adoption evolve. In the meantime, some companies are racing ahead. Many have announced layoffs or hiring freezes, with junior staff and middle managers apparently the most affected. Amazon has announced 14,000 corporate redundancies, the first wave of a plan that will reach 30,000 cuts. JPMorgan Chase has just declared that it will hire more cautiously, as has Walmart, while Goldman Sachs also plans layoffs. All of them cite AI as the reason.

'If people become more productive, there is no need to hire more staff,' Airbnb CEO Brian Chesky said in an interview. 'I see a lot of companies preemptively holding the line, anticipating and hoping for a smaller workforce.'

According to MIT economist David Autor, speaking to NBC News, these may be companies that believe they can reap productivity gains that others miss; or some of them may be using AI as an excuse to make cuts for more mundane reasons, such as a downturn or poor past choices.

On this point, too, as you can see, there are no certainties: a constant in these turbulent times of the AI revolution.