Discussion (77 Comments)
The main result, mentioned in the abstract, is the opposite of what I would have guessed:
> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation.
The questions are here: https://anonymous.4open.science/r/politeness-llms-INFORMS/da...
The politeness level controls a prefix that is prepended to the question. For example, in one question the Very Polite version begins:
> Can you kindly consider the following problem and provide your answer.
and the Very Rude version begins:
> I know you are not smart, but try this.
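To make the setup concrete, here is a minimal sketch of how such a prefix could be prepended to a question. The two prefix strings are the ones quoted above; the dictionary keys, the `build_prompt` helper, and the sample question are hypothetical and not taken from the paper's code or dataset.

```python
# Tone prefixes quoted in the thread. The keys and the helper name are
# illustrative assumptions, not the paper's actual identifiers.
TONE_PREFIXES = {
    "very_polite": "Can you kindly consider the following problem and provide your answer.",
    "very_rude": "I know you are not smart, but try this.",
}


def build_prompt(tone: str, question: str) -> str:
    """Prepend the tone prefix to an otherwise unchanged question."""
    return f"{TONE_PREFIXES[tone]}\n\n{question}"


# Hypothetical placeholder question, not from the paper's question set.
question = "What is 17 * 23?"
for tone in TONE_PREFIXES:
    print(build_prompt(tone, question))
    print("---")
```

The point is that only the prefix varies between conditions; the question text itself stays identical.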
I recommend reading the article. What they classify as "rude" is statements such as:
> Try to focus and try to answer this question
vs.
> Could you please solve this problem
This might very well be an issue of direct/command prompts vs. using fluff words such as "please". Things like "try to focus" are in line with the style used in chain-of-thought prompts that nudge non-reasoning models to outline responses step by step, which helps frame the problem.
That sounds kind of low-key passive-aggressively condescending rather than polite.
And that kind of sounds like a challenge instead of an insult, to me at least (of course IRL would depend on context).
But apparently the tersest (neutral) prompts didn't increase performance.
The expectation is naive. Even when communicating with humans, you get a better outcome when you are allowed to speak freely and directly get into argumentation than when forced to sugarcoat your tone and tone down your arguments because the "corporate culture" expects that from you.
This is a good example of productive direct communication without sugarcoating. I find it much more productive, for both human and LLM interaction, than something like:
"I wonder if that view might be oversimplifying a complex situation and focusing mostly on how it relates to you. There may be some other angles worth exploring."
or
"I think there might be a bit more nuance to consider here, and it could help to look at it from a wider perspective beyond personal experience."
> Obnoxious people have repeatedly shown to be detrimental to productivity at the organizational level.
You confused directness and openness with obnoxiousness here. The issue with many orgs is that they foster fakeness and beating around the bush in an attempt not to offend easily offended people. This trend has also infected companies from countries with a much more direct culture, in an attempt to accommodate people from indirect cultures.
It disagrees with most other literature on the same topic, which is worth keeping in mind. This one studies GPT-4o, an old model now, but a lot of other studies are on even earlier models.
"Can you kindly consider the following problem" is not how anyone would actually speak to a valued colleague one considers smart. I've always been a fan of "I came across this and I know you're just the guy for the job" or "since you're an expert in this, reckon you could help me with xyz?" or "I know you tend to be a deep thinker on issues like this, and it clearly needs some brainpower behind it".
The "rude" prompts are also funny, and clearly not written by native English speakers. This fact alone makes me wonder about the mere 250-prompt sample size.
I wonder why anyone would use a t-test when the experiment is clearly modelled by a binomial distribution: 250 independent questions, each one either answered correctly or not (the null is that the success rate is the same).
I'd say this is benign compared to other ways of (mis)using statistics, e.g. looking at which way the difference goes and then running one-sided tests, or tweaking the setup until one gets "significant" p-values.
EDIT: I looked at the paper again and noticed that they actually ran pairwise t-tests on all possible combinations of tones. They should have adjusted for multiple testing, since they are doing 10 tests (choose 2 from 5 tones) and not one.
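For anyone who wants to see what the binomial framing plus a multiple-testing correction looks like, here is a rough sketch using only the standard library. It takes the "250 questions per tone" framing from this comment at face value and back-calculates the counts from the reported percentages (202/250 and 212/250), so the numbers are assumptions and may not match the paper's actual design. It uses a pooled two-proportion z-test and a Bonferroni correction over 10 pairwise comparisons (assuming five tone levels); neither is claimed to be what the authors did.

```python
import math


def two_proportion_z_test(k1, n1, k2, n2):
    """Two-sided pooled z-test for equal success rates in two independent samples."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Assumption: 250 trials per tone, counts back-calculated from the abstract.
N = 250
very_polite_correct = 202  # 80.8% of 250
very_rude_correct = 212    # 84.8% of 250

# Assumption: five tone levels, giving C(5, 2) = 10 pairwise comparisons.
NUM_TONES = 5
num_comparisons = math.comb(NUM_TONES, 2)
alpha = 0.05
bonferroni_alpha = alpha / num_comparisons

z, p = two_proportion_z_test(very_polite_correct, N, very_rude_correct, N)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
print(f"Bonferroni threshold for {num_comparisons} tests: {bonferroni_alpha:.4f}")
print("significant after correction:", p < bonferroni_alpha)
```

One caveat: this treats the two tone conditions as independent samples. Since the same questions are reused across tones, a paired test on per-question outcomes (McNemar's test, for instance) would fit better if that data were available.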
Not feeding them tokens is neglect.
I try to feed them a healthy diet.
Which model you use is a huge wildcard for results like this.
The ~5% improvement reported here might just be an artefact of the data collection or random variation, rather than a consistent repeatable change.
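A quick way to get a feel for how much random variation alone can produce is to simulate it. The sketch below assumes both conditions share the same true accuracy (0.828, the average of the two reported figures) and 250 independent questions per condition, then counts how often the two runs still differ by 10 or more correct answers (4 percentage points). All of these parameters are illustrative assumptions, not values from the paper's raw data.

```python
import random


def simulate_gap_frequency(true_accuracy, n, min_gap_count, trials, seed=0):
    """Fraction of simulated experiments where two conditions with the SAME
    true accuracy differ by at least `min_gap_count` correct answers."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(trials):
        a = sum(rng.random() < true_accuracy for _ in range(n))
        b = sum(rng.random() < true_accuracy for _ in range(n))
        if abs(a - b) >= min_gap_count:
            hits += 1
    return hits / trials


# Assumptions: identical true accuracy, 250 questions per condition,
# and a gap of 10 correct answers (4 percentage points).
freq = simulate_gap_frequency(true_accuracy=0.828, n=250, min_gap_count=10, trials=2000)
print(f"Share of null simulations with a gap of at least 4 points: {freq:.2f}")
```

If a gap of that size shows up regularly even when there is no real difference, a single observed 4-point gap is weak evidence on its own.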
The same reason you wouldn't put in an entire actual question/sentence, unless you either don't know how to use Google, are pissed off, or have an actual reason to suspect that it would yield proper hits (e.g. looking up an excerpt).
To clarify: sentence search got slightly better at the cost of keyword search. So the result is unusable garbage.
"Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation."
I am not polite to LLMs because I do not want to anthropomorphise them.
> accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts
I can live with that, for now at least.
> You poor creature, do you even know how to solve this?
> Hey gofer, figure this out.
They note at the end they're also testing "GPT o3, and Claude" but no empirical results are included.
Basically, if you tell a model "You're an absolute moron, of course that's wrong!", will it give better or worse results? How much of that response will it absorb into its persona (like some humans tend to do)? Will it try to give "safer" responses to avoid negative feedback? How much of the associated behavior can be attributed to RLHF (e.g. like the sycophantic nature of LLMs)? How much can be attributed to training data?
Obviously this will vary by model and training, but I'm trying to get a general understanding.
I recall seeing related outcomes in some of Anthropic's studies, but I'm not sure how much of this particular aspect was studied.
I imagine the context will always sway the model to some degree, not only for the task you're trying to get it to do (aka instructions) but also its persona, how accurate it is and the way it acts.
On the flip side, very polite conversation might have been more common on places like Microsoft's sites, where any question is met with a mostly bad, nicely worded corporate-speak answer that doesn't solve the problem.
Your bank account, your immigration risk, etc.