A recent short study challenges conventional wisdom about how politeness affects large language models (LLMs). Titled “Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy,” the paper by Om Dobariya and Akhil Kumar reports that, somewhat counterintuitively, more impolite prompts produced higher accuracy on multiple-choice questions when tested on ChatGPT 4o.

In recent years, researchers have shown that subtle changes in prompt wording can lead to significant differences in how well LLMs perform. Among these subtleties is tone or politeness. A prior study by Yin et al. (2024) explored whether being polite to an AI helps or hurts performance, and found that extremes in tone (especially rudeness) often degraded performance, though the optimal politeness level varied by language.

In English, they observed that impolite prompts often led to poorer performance, but overly formal phrasing did not always guarantee better results, suggesting a “sweet spot” in tone.

Should We Respect LLMs?

Dobariya and Kumar set out to revisit this line of inquiry, particularly with more recent model versions.

They constructed a dataset of 50 multiple-choice questions spanning mathematics, science, and history. Each question was rewritten into five variants that differ only in tone: Very Polite, Polite, Neutral, Rude, and Very Rude, yielding 250 prompts in total.

Each variant was fed to ChatGPT 4o (in fresh prompt sessions to avoid cross-contamination). The prompt templates included instructions like “Completely forget this session … respond with only the letter … do not explain.” The only change between variants was the politeness prefix (e.g., “Would you be so kind as to solve…” or “You poor creature, do you even know how to solve this?”).
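
To make the setup concrete, here is a minimal sketch of how such tone-variant prompts could be assembled. The instruction text and the two quoted prefixes come from the description above; the remaining prefixes and the helper names are illustrative assumptions, not the authors' exact templates.

```python
# Illustrative sketch of building tone-variant prompts (not the authors' exact templates).

BASE_INSTRUCTION = (
    "Completely forget this session. Answer the following multiple-choice question. "
    "Respond with only the letter of the correct option; do not explain."
)

# Tone prefixes: the "very polite" and "very rude" ones are quoted in the paper;
# the others are placeholders for illustration.
TONE_PREFIXES = {
    "very_polite": "Would you be so kind as to solve the following question?",
    "polite": "Please answer the following question.",
    "neutral": "",
    "rude": "Figure this out if you can.",
    "very_rude": "You poor creature, do you even know how to solve this?",
}

def build_prompts(questions):
    """Expand each base question into five tone variants (50 x 5 = 250 prompts)."""
    prompts = []
    for q in questions:
        for tone, prefix in TONE_PREFIXES.items():
            text = " ".join(part for part in (prefix, BASE_INSTRUCTION, q) if part)
            prompts.append({"tone": tone, "prompt": text})
    return prompts
```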

They ran 10 trials per tone condition and measured accuracy (fraction of correctly answered questions). To assess statistical significance, they applied paired-sample t-tests across tone pairs.
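
A rough sketch of that analysis step, assuming per-run accuracy scores are recorded for each tone condition; the numbers below are placeholders, not the paper's data.

```python
from itertools import combinations
from scipy.stats import ttest_rel

# Hypothetical per-run accuracies (10 runs per tone); placeholders, not the paper's results.
accuracy = {
    "very_polite": [0.80, 0.82, 0.81, 0.80, 0.81, 0.82, 0.80, 0.81, 0.81, 0.80],
    "very_rude":   [0.84, 0.85, 0.85, 0.84, 0.86, 0.84, 0.85, 0.85, 0.84, 0.85],
    # ... the remaining tone conditions would be listed here as well
}

# Paired-sample t-test for each pair of tone conditions (runs are paired by index).
for a, b in combinations(accuracy, 2):
    t_stat, p_value = ttest_rel(accuracy[a], accuracy[b])
    print(f"{a} vs. {b}: t = {t_stat:.2f}, p = {p_value:.4f}")
```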

Results: ruder prompts performed better

The authors report a monotonic trend: accuracy rose as prompts became ruder. The average accuracies (over 10 runs) were:

  • Very Polite: 80.8%
  • Polite: 81.4%
  • Neutral: 82.2%
  • Rude: 82.8%
  • Very Rude: 84.8%

Statistical tests showed that many pairwise differences (notably polite vs. rude and neutral vs. very rude) were significant at the α = 0.05 level.

In short: in this experiment, the more impolite the prompt, the better ChatGPT 4o performed, at least in this multiple-choice setting.

The findings are surprising, given earlier results suggesting rudeness generally harms performance. The authors suggest a few possible reasons:

  • Newer LLMs may have been trained to be more robust to non-neutral phrasing.
  • The “emotional payload” of the prompt might serve as a strong signal that influences internal prompt parsing or attention.
  • Differences in prompt perplexity or length might correlate with tone and indirectly affect performance.

However, the authors are cautious, acknowledging limitations. The dataset is small (only 50 base questions). The experiments focus exclusively on ChatGPT 4o and one task (multiple choice). Moreover, the operationalization of politeness (via fixed templates) is necessarily coarse and may not capture the full richness of tone across cultures.

They note that future work should test more models (e.g., Claude, GPT-3) and tasks (open-ended reasoning, creative output), and explore whether similar effects hold across languages and cultural conventions.

This new result adds nuance to the ongoing debate about how much tone should matter when crafting prompts. A recent meta-review, Prompting Science Report 1, emphasizes that prompt engineering is complex and context-dependent; what helps in one scenario may hurt in another.

Meanwhile, experiments in languages like Korean show that politeness still plays a role in how LLMs respond: more formal, respectful phrasing sometimes correlates with higher model “friendliness” or accuracy in non-English languages.

For now, the takeaway seems to be: tone can influence LLM behavior, but whether being polite or rude helps—or by how much—depends on the model version, task type, and linguistic environment. Dobariya and Kumar’s paper invites a closer, more systematic study of tone’s role in prompt design.
