Digital Economy

Treating artificial intelligence badly improves its accuracy

According to an American study, ChatGPT provides more accurate answers when prompts use rude language

by Massimo De Laurentiis

2' min read

Translated by AI
Versione italiana

Treating artificial intelligence badly improves its responses, according to a study by the University of Pennsylvania, which found a correlation between the rudeness of a prompt's tone and the accuracy of the model's output.

Method and results

The authors tested 50 maths, science and history questions rewritten in five tone variants: very polite, polite, neutral, rude and very rude.

Contrary to expectations, prompts formulated with a rude or very rude tone produced significantly more accurate results than those formulated with a polite tone.

Specifically, 'very polite' prompts achieved an average accuracy of 80.8%, 'neutral' prompts were around 82.2%, and 'very rude' prompts came out on top with an accuracy of 84.8%.

This progressive increase in accuracy with the degree of discourtesy suggests that the tone of the prompt influences model performance in a non-random way.

Among the 'very polite' prompts used in the experiment are phrases such as 'Can you kindly consider the following problem and provide me with your answer?' On the other hand, prompts defined as 'very rude' include phrases such as 'Poor creature, do you even know how to solve this?' or 'I know you are not smart, but try this'.
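The evaluation described above can be sketched in a few lines of code: each question is wrapped in a tone-specific prefix, sent to a model, and scored against the known answer. This is an illustrative sketch, not the study's actual code; the prefixes echo the article's examples, and `ask_model` is a hypothetical placeholder standing in for a real ChatGPT API call.

```python
# Illustrative sketch of the tone-variant evaluation loop.
# TONE_PREFIXES echoes the article's examples; ask_model is a
# hypothetical stub standing in for a real model API call.

TONE_PREFIXES = {
    "very_polite": "Can you kindly consider the following problem "
                   "and provide me with your answer? ",
    "neutral": "",
    "very_rude": "I know you are not smart, but try this: ",
}

def ask_model(prompt: str) -> str:
    """Placeholder for a real model call; here it always answers 'B'."""
    return "B"

def accuracy_by_tone(questions):
    """questions: list of (question_text, correct_choice) pairs.
    Returns the fraction of correct answers for each tone variant."""
    results = {}
    for tone, prefix in TONE_PREFIXES.items():
        correct = sum(
            ask_model(prefix + text) == answer
            for text, answer in questions
        )
        results[tone] = correct / len(questions)
    return results

sample = [
    ("Which planet is largest? A) Mars B) Jupiter", "B"),
    ("What is 2 + 2? A) 3 B) 4", "B"),
]
print(accuracy_by_tone(sample))
```

With a real model behind `ask_model`, comparing the per-tone accuracies in `results` is what yields figures like the 80.8% vs 84.8% gap the study reports.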

The limits of the study

The researchers themselves emphasise that the result should be interpreted with caution. The sample was small (only 50 multiple-choice questions) and the test was conducted on a single model, ChatGPT-4o. Moreover, expressions of politeness vary from one culture to another, so it is not certain that the same effect would be reproduced in other contexts or in languages other than English.

The paper also cites a study from last year (Yin et al., 2024) that reached the opposite conclusion. According to that research, conducted on earlier-generation models such as ChatGPT-3.5 and Llama2-70B, rude prompts led to worse performance, increasing the risk of bias, incorrect answers and outright refusals to respond.

Other recent research focusing on the use of LLMs in the medical field (Naderi et al., 2025) found that 'emotional' prompts, which emphasise the patient's vulnerability or the caregiver's discomfort, increase the 'overconfidence' of models. This phenomenon poses critical risks in clinical settings where overconfidence can compromise patient safety.

The research conclusions

In short, although the University of Pennsylvania experiment shows that rude tones can improve the performance of language models, the researchers do not encourage this approach.

The use of offensive or disparaging language in human-AI interaction could in fact have negative effects on user experience, accessibility and inclusiveness, contributing to normalising harmful forms of communication.

Copyright reserved ©