
Positional preferences, order effects, prompt sensitivity undermine AI judgments

Fully agree; I've found that LLMs aren't good at tasks that require evaluation.

Think about it: if they were good at evaluation, you could remove all humans from the loop and have recursively self-improving AGI.

Nice to see an article that makes a more concrete case.

shahbaby, an hour ago

Humans aren't good at validation either. We need tools, experiments, labs. Unproven ideas are a dime a dozen. Remember the hoopla about room temperature superconductivity? The real source of validation is external consequences.

visarga, 20 minutes ago

Human experts set the benchmarks, and LLMs cannot match them in most fields requiring sophisticated judgment.

They are very useful for some things, but sophisticated judgment is not one of them.

ken47, 11 minutes ago

LLMs are good at discovery: they know a lot, and they can retrieve that knowledge from a query that simpler (e.g. regex-based) search engines with the same knowledge couldn't handle. For example, an LLM that is given a case may discover an obscure law, or notice a pattern in past court cases that establishes precedent. So they can be helpful to a real judge.

Of course, the judge must check that the law or precedent isn't hallucinated and actually applies to the case in the way the LLM claims. They should also prompt other LLMs and use their own knowledge in case the cited law/precedent contradicts others.

There's a similar argument for scientists, mathematicians, doctors, investors, and other fields. LLMs are good at discovery but must be checked.

armchairhacker, 14 minutes ago

I see "panels of judges" mentioned once, but what is the weakness of this? Other than more resource.

Worst case, you end up with some multimodal distribution where two opinions are tied, which seems increasingly unlikely as the panel size grows. Or it could maybe happen in a case with exactly two outcomes (yes/no), but I'd be surprised if such a panel landed on a perfectly uniform distribution in its judgments/opinions (50% yes, 50% no).

TrackerFF, an hour ago

One method to get a better estimate is to extract the token log-probabilities of "YES" and "NO" from the final logits of the LLM and take a probability-weighted sum [1] [2]. If the LLM is calibrated for your task, a genuinely borderline sample should yield roughly a 50% chance of YES (scored 1) and a 50% chance of NO (scored 0), i.e. a score of 0.5.
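For reference, here's a minimal sketch of that extraction, assuming an OpenAI-style chat completions client that exposes top token log-probabilities (the model name, prompt, and YES/NO convention below are placeholders, not anything taken from the linked docs):

    import math
    from openai import OpenAI  # assumes the openai v1 SDK; any API exposing token logprobs works

    client = OpenAI()

    def yes_probability(question: str, model: str = "gpt-4o-mini") -> float:
        """Estimate P(YES) from the top token log-probabilities of the first output token."""
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question + "\nAnswer with YES or NO only."}],
            max_tokens=1,
            logprobs=True,
            top_logprobs=10,
        )
        top = resp.choices[0].logprobs.content[0].top_logprobs
        p_yes = sum(math.exp(t.logprob) for t in top if t.token.strip().upper() == "YES")
        p_no = sum(math.exp(t.logprob) for t in top if t.token.strip().upper() == "NO")
        # Weighted sum with YES=1, NO=0, renormalized over the two labels;
        # a calibrated judge on a genuinely borderline sample lands near 0.5.
        return p_yes / (p_yes + p_no) if (p_yes + p_no) > 0 else 0.5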

But generally you wouldn't use a binary outcome when you can have samples that are genuinely 50/50 pass/fail. Better to use a discrete scale of 1..3 or 1..5 and specify exactly what makes a sample a 2/5 vs. a 4/5, for example.
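For instance, a rubric with explicit anchors might look something like this (the anchors are invented for a summarization task, purely to illustrate spelling out what separates a 2 from a 4):

    # Illustrative 1-5 rubric; tailor the anchors to your own task and criteria.
    RUBRIC = """Rate the summary from 1 to 5:
    1 - contradicts the source or invents facts
    2 - on-topic but misses the main point
    3 - accurate but omits important details
    4 - accurate and covers the main points, with minor omissions
    5 - accurate, complete, and concise
    Reply with a single digit."""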

You are correct to question the weaknesses of a panel. This class of methods depends on diversity through high-temperature sampling, which can lead to spurious YES/NO responses that don't generalize well and are effectively noise.
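To make that concrete, a panel can be sketched as repeated sampled judgments aggregated by majority vote (or by averaging the probability-weighted scores above); `judge` here is a placeholder for whatever single-judge call you already have:

    from collections import Counter

    def panel_verdict(judge, sample, n_judges=7, temperature=1.0):
        """Sample the same judge several times at high temperature and take the majority label.

        `judge(sample, temperature=...)` is assumed to return "YES" or "NO";
        an odd panel size avoids exact ties on a binary outcome.
        """
        votes = Counter(judge(sample, temperature=temperature) for _ in range(n_judges))
        label, count = votes.most_common(1)[0]
        return label, count / n_judges  # majority label plus its share of the panel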

[1]: https://arxiv.org/abs/2303.16634
[2]: https://verdict.haizelabs.com/docs/concept/extractor/#token-...

nimitkalra, 35 minutes ago

We went from "impossible" to "unreliable". I like the direction as a techie, but I'm not so sure as a sociologist or an anthropologist.

sidcool, 26 minutes ago

> We call it 'prompt-engineering'

I prefer to call it "prompt guessing"; it's like some modern variant of alchemy.

tempodox, 42 minutes ago

"Prompt Whispering"?

BurningFrog, 37 minutes ago

Prompt divining

th0ma5, 21 minutes ago

I listen to online debates, especially political ones on various platforms, and man, the AI slop that people slap around at each other is beyond horrendous. I would not want an LLM to have the final say on something critical. I want the opposite: an LLM should identify things that need follow-up review by a qualified person. A person should still confirm the things that "pass", but they can then prioritize what to validate first.

giancarlostoro, an hour ago

I don't even trust LLMs enough to spot content that requires validation or nuance.

batshit_beaver, an hour ago

Can't wait for the new field of AI psychology

wagwang, 30 minutes ago

[flagged]

gizajob, an hour ago

Ok, but please don't post unsubstantive comments to Hacker News.

dang, 14 minutes ago

Ok sorry. I’ll go back to slashdot.

gizajob, 9 minutes ago

At least until the LLM judges otherwise.

tremon, an hour ago

[flagged]

andrepd, 26 minutes ago

Comments like this break the site guidelines, and not just a little. Can you please review https://news.ycombinator.com/newsguidelines.html and take the intended spirit of this site more to heart? Note these:

"Please don't fulminate."

"Don't be curmudgeonly. Thoughtful criticism is fine, but please don't be rigidly or generically negative."

"Please don't sneer, including at the rest of the community."

"When disagreeing, please reply to the argument instead of calling names. 'That is idiotic; 1 + 1 is 2, not 3' can be shortened to '1 + 1 is 2, not 3."

There's plenty of LLM skepticism on HN and that's fine, but like all comments here, it needs to be thoughtful.

(We detached this comment from https://news.ycombinator.com/item?id=44074957)

dang, 20 minutes ago

I’d argue real judges are unreliable as well.

The real question for me is: are they less reliable than human judges? Probably yes. But I'd rather see a measurement relative to humans than a blanket statement like that.

baxtr, an hour ago

I think the main difference is that an AI judge may provide three different rulings if you just ask it the same thing three times. A human judge is much less likely to be so "flip-floppy".

You can observe this using any of the present-day LLMs: ask it an architectural/design question, provide it with your thoughts, reasoning, constraints, etc., and see what it tells you. Then... click the "Retry" button and see how similar (or dissimilar) the answer is. Sometimes you'll get a complete 180 from the prior response.

Alupis, 34 minutes ago

Humans flip-flop all the time. This is a major reason why the Myers-Briggs Type Indicator does such a poor job of assigning the same person the same Myers-Briggs type on successive tests.

It can be difficult to observe this fact in practice because, unlike an LLM, we have memory: you can't just ask a human the exact same question three times in five seconds and get three different answers. But, as someone who works with human-labeled data, it's something I have to contend with on a daily basis. For the things I'm working on, if you give the same annotator the same thing to label at two different times, spaced far enough apart for them to forget that they have seen this thing before, the chance of them making the same call both times is only about 75%. If I do that with a prompted LLM annotator, I'm used to seeing more like 85%, and for some models it can be possible to get even better consistency than that with the right conditions and enough time spent fussing with the prompt.
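A rough sketch of that consistency measurement, assuming you have two rounds of labels from the same annotator (human or LLM) over the same items; the dict representation is just illustrative:

    def self_agreement(round1: dict, round2: dict) -> float:
        """Fraction of items that received the same label in both labeling rounds.

        round1/round2 map item id -> label for the same annotator labeling
        the same items on two separate occasions.
        """
        shared = round1.keys() & round2.keys()
        if not shared:
            return float("nan")
        return sum(round1[i] == round2[i] for i in shared) / len(shared)

    # e.g. ~0.75 for the human annotators described above, ~0.85 for a prompted LLM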

I still prefer the human labels when I can afford them because LLM labeling has plenty of other problems. But being more flip-floppy than humans is not one that I have been able to empirically observe.

bunderbunder, 21 minutes ago

I haven't studied domestic law much, but I asked a law professor:

"With anything gray, does the stronger/bigger party always win?"

He said:

"If you ask my students, nearly all of them would say Yes"

resource_waste, 42 minutes ago

> The real question for me is: are they less reliable than human judges?

It's not just about ratios, though: we must also ask whether the "shape" of their un/reliability is knowable or desirable.

For example, a robot may make far fewer mistakes in a chess game than a human, but with a human we are ridiculously confident they won't misidentify the opponent's finger as a game piece and rip it off.

For human mistakes, in contrast, we have tools, knowledge, and techniques so pervasive that we can literally forget they exist until someone points them out.

Would we accept a "judge" that is fairer on average, yet sometimes decides to give shoplifters the death penalty and we have no way of understanding why it happens or how to stop it?

Terr_, 25 minutes ago

I do think you've hit the heart of the question, but I don't think we can answer the second question.

We can measure how unreliable they are, or how susceptible they are to specific changes, precisely because we can reset them to the same state and run the experiment again. At least for now [1], we do not have that capability with humans, so there's no way to run a matching experiment on humans.

The best we can do is probably to run the limited experiments we can run on humans: comparing different judges' cross-referenced reliability to get an overall measure, plus some weak indicator of the reliability of a specific judge based on intra-judge agreement. But when running this on LLMs, they would have to keep the previous cases in their context window to get a fair comparison.

[1] https://qntm.org/mmacevedo

andrewla, 42 minutes ago

> The real question for me is: are they less reliable than human judges?

I've spent some time poking at this. I can't go into details, but the short answer is, "Sometimes yes, sometimes no, and it depends A LOT on how you define 'reliable'."

My sense is that, the more boring, mechanical and closed-ended the task is, the more likely an LLM is to be more reliable than a human. Because an LLM is an unthinking machine. It doesn't get tired, or hangry, or stressed out about its kid's problems at school. But it's also a doofus with absolutely no common sense whatsoever.

bunderbunder, 36 minutes ago

> Because an LLM is an unthinking machine.

Unthinking can be pretty powerful these days.

visarga, 12 minutes ago

There are technical quirks that make LLM judges particularly high variance, sensitive to artifacts in the prompt, and positively/negatively-skewed, as opposed to the subjectivity of human judges. These largely arise from their training distribution and post-training, and can be contained with careful calibration.
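As one example of that kind of calibration, a common mitigation for positional preference in pairwise judging is to run each comparison twice with the order swapped and keep only verdicts that survive the swap; a minimal sketch, with `judge_pair` standing in for whatever black-box judge call you use:

    def debiased_preference(judge_pair, response_a, response_b):
        """Judge (A, B) and (B, A); keep the verdict only if it survives the swap.

        `judge_pair(x, y)` is assumed to return "A" if it prefers the response shown
        first and "B" if it prefers the one shown second. Returns "A", "B", or "tie"
        when the judge flips with the ordering (a positional, not substantive, preference).
        """
        first = judge_pair(response_a, response_b)
        second = judge_pair(response_b, response_a)
        second_mapped = "B" if second == "A" else "A"  # map back to the original labels
        return first if first == second_mapped else "tie"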

nimitkalra, an hour ago

Can we stop with the "AI is unreliable just like people" line? It is demonstrably false at best and cult-like thought termination at worst.

th0ma5, 20 minutes ago

[dead]