Fully agree; I've found that LLMs aren't good at tasks that require evaluation.
Think about it: if they were good at evaluation, you could remove all humans from the loop and have recursively self-improving AGI.
Nice to see an article that makes a more concrete case.
Humans aren't good at validation either. We need tools, experiments, labs. Unproven ideas are a dime a dozen. Remember the hoopla about room temperature superconductivity? The real source of validation is external consequences.
Human experts set the benchmarks, and LLMs cannot match them in most fields requiring sophisticated judgment.
They are very useful for some things, but sophisticated judgment is not one of them.
Some other known distributional biases include self-preference bias (GPT-4o prefers GPT-4o generations over Claude generations, for example) and structured-output/JSON-mode bias [1]. Interestingly, some models also skew more positive or negative than others. This library [2] also provides some methods for calibrating/stabilizing them.
[1]: https://verdict.haizelabs.com/docs/cookbook/distributional-b... [2]: https://github.com/haizelabs/verdict
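A rough sketch of how one might probe for self-preference bias along these lines. This is not the linked library's API; `judge_score` is a hypothetical helper that asks the judge model to rate an answer blind (no model names in the prompt), and the gap it reports is purely illustrative.

```python
from statistics import mean
from typing import Callable

def self_preference_gap(
    judge_score: Callable[[str, str], float],  # (prompt, answer) -> score, blind to model identity
    own_answers: list[tuple[str, str]],        # (prompt, answer) pairs from the judge's own model family
    other_answers: list[tuple[str, str]],      # (prompt, answer) pairs from a different model
) -> float:
    """Positive value = the judge rates its own model's answers higher on average."""
    own = mean(judge_score(p, a) for p, a in own_answers)
    other = mean(judge_score(p, a) for p, a in other_answers)
    return own - other
```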
LLMs are good at discovery: they know a lot, and they can retrieve that knowledge from a query that simpler (e.g. regex-based) search engines over the same knowledge couldn't handle. For example, an LLM given a case as input may surface an obscure law, or notice a pattern in past court cases that establishes precedent. So they can be helpful to a real judge.
Of course, the judge must check that the law or precedent isn't hallucinated and actually applies to the case in the way the LLM claims. They should also prompt other LLMs and use their own knowledge in case the cited law or precedent contradicts others.
There's a similar argument for scientists, mathematicians, doctors, investors, and other fields. LLMs are good at discovery but must be checked.
I see "panels of judges" mentioned once, but what is the weakness of this? Other than more resource.
Worst case you end up with a multimodal distribution where two opinions are tied, which seems increasingly unlikely as the panel size grows. It could perhaps happen in a case with exactly two outcomes (yes/no), but I'd be surprised if such a panel landed on a perfectly uniform distribution in its judgments/opinions (50% yes, 50% no).
One method to get a better estimate is to extract the token log-probabilities of "YES" and "NO" from the final logits of the LLM and take a weighted sum [1] [2]. If the LLM is calibrated for your task, there should be roughly a 50% chance of sampling YES (1) and a 50% chance of NO (0), yielding an expected score of 0.5.
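A rough sketch of that trick, assuming the OpenAI Python SDK (v1.x) and a chat model that returns per-token logprobs; other providers expose similar options. Treat it as illustrative rather than the exact method from the linked references.

```python
import math
from openai import OpenAI

client = OpenAI()

def yes_no_score(question: str, model: str = "gpt-4o-mini") -> float:
    """Return a soft score in [0, 1]: P(YES) renormalized over {YES, NO}."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer with exactly one word: YES or NO."},
            {"role": "user", "content": question},
        ],
        max_tokens=1,
        logprobs=True,
        top_logprobs=10,
    )
    # Candidate first tokens and their log-probabilities.
    top = resp.choices[0].logprobs.content[0].top_logprobs
    p = {"YES": 0.0, "NO": 0.0}
    for t in top:
        key = t.token.strip().upper()
        if key in p:
            p[key] += math.exp(t.logprob)
    total = p["YES"] + p["NO"]
    if total == 0.0:
        raise ValueError("neither YES nor NO appeared in the top logprobs")
    # Weighted sum: YES counts as 1, NO as 0, renormalized over the two options.
    return p["YES"] / total
```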
But generally you wouldn't use a binary outcome when samples can genuinely be 50/50 pass/fail. Better to use a discrete scale of 1..3 or 1..5 and specify exactly what makes a sample a 2/5 vs. a 4/5, for example.
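For illustration, a minimal rubric-style judge prompt along those lines might look like the following; the criteria and wording here are hypothetical, and a real rubric should spell out what separates each score for your specific task.

```python
# Hypothetical 1-5 rubric prompt for an LLM judge; adapt the criteria to your task.
RUBRIC_PROMPT = """You are grading an answer for factual accuracy on a 1-5 scale.
5: fully correct, no unsupported claims.
4: correct overall, one minor omission or imprecision.
3: mostly correct, but at least one clear error.
2: significant errors; the main claim is unreliable.
1: mostly or entirely wrong.

Question: {question}
Answer: {answer}

Reply with a single integer from 1 to 5."""

def build_judge_prompt(question: str, answer: str) -> str:
    """Fill the rubric template for one (question, answer) pair."""
    return RUBRIC_PROMPT.format(question=question, answer=answer)
```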
You are correct to question the weaknesses of a panel. This class of methods depends on diversity through high-temperature sampling, which can lead to spurious YES/NO responses that don't generalize well and are effectively noise.
[1]: https://arxiv.org/abs/2303.16634 [2]: https://verdict.haizelabs.com/docs/concept/extractor/#token-...
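To make the panel idea being discussed here concrete, a minimal sketch under some assumptions: `ask_judge` is a hypothetical callable wrapping whatever judge model you use, votes are aggregated by simple majority, and an odd panel size sidesteps the exact-tie case raised above.

```python
from collections import Counter
from typing import Callable

def panel_verdict(
    ask_judge: Callable[[str, float], str],  # (prompt, temperature) -> "YES" or "NO"
    prompt: str,
    panel_size: int = 7,        # an odd size avoids exact ties on binary questions
    temperature: float = 0.8,   # higher temperature gives the panel its diversity
) -> tuple[str, float]:
    """Return (majority verdict, fraction of the panel that agreed with it)."""
    votes = [ask_judge(prompt, temperature).strip().upper() for _ in range(panel_size)]
    verdict, count = Counter(votes).most_common(1)[0]
    return verdict, count / panel_size
```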
We went from Impossible to Unreliable. I like the direction as a techie. But I'm not sure as a sociologist or an anthropologist.
> We call it 'prompt-engineering'
I prefer to call it “prompt guessing”; it's like some modern variant of alchemy.
"Prompt Whispering"?
Prompt divining
Meanwhile in Estonia, they just agreed to resolve child support disputes using AI... https://www.err.ee/1609701615/pakosta-enamiku-elatisvaidlust...
I listen to online debates, especially political ones on various platforms, and man, the AI slop that people slap around at each other is beyond horrendous. I would not want an LLM to have the final say on something critical. I want the opposite: an LLM should identify things that need follow-up review by a qualified person. A person should still confirm the things that "pass", but they can then prioritize what to validate first.
I don't even trust LLMs enough to spot content that requires validation or nuance.
Can't wait for the new field of AI psychology
[flagged]
Ok, but please don't post unsubstantive comments to Hacker News.
Ok sorry. I’ll go back to slashdot.
At least until the LLM judges otherwise.
[flagged]
Comments like this break the site guidelines, and not just a little. Can you please review https://news.ycombinator.com/newsguidelines.html and take the intended spirit of this site more to heart? Note these:
"Please don't fulminate."
"Don't be curmudgeonly. Thoughtful criticism is fine, but please don't be rigidly or generically negative."
"Please don't sneer, including at the rest of the community."
"When disagreeing, please reply to the argument instead of calling names. 'That is idiotic; 1 + 1 is 2, not 3' can be shortened to '1 + 1 is 2, not 3."
There's plenty of LLM skepticism on HN and that's fine, but like all comments here, it needs to be thoughtful.
(We detached this comment from https://news.ycombinator.com/item?id=44074957)
[flagged]
"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."
https://news.ycombinator.com/newsguidelines.html
(We detached this comment from https://news.ycombinator.com/item?id=44074957)
I’d argue real judges are unreliable as well.
The real question for me is: are they less reliable than human judges? Probably yes. But I favor a measurement relative to humans over a flat statement like that.
I think the main difference is an AI judge may provide three different rulings if you just ask it the same thing three times. A human judge is much less likely to be so "flip-floppy".
You can observe this using any of the present-day LLMs: ask one an architectural/design question, provide it with your thoughts, reasoning, constraints, etc., and see what it tells you. Then click the "Retry" button and see how similar (or dissimilar) the answer is. Sometimes you'll get a complete 180 from the prior response.
Humans flip-flop all the time. This is a major reason why the Myers-Briggs Type Indicator does such a poor job of assigning the same person the same Myers-Briggs type on successive tests.
It can be difficult to observe this in practice because, unlike an LLM, a human has memory: you can't just ask someone the exact same question three times in five seconds and get three different answers. But, as someone who works with human-labeled data, it's something I have to contend with on a daily basis. For the things I'm working on, if you give the same annotator the same thing to label at two different times, spaced far enough apart for them to forget that they have seen it before, the chance of them making the same call both times is only about 75%. If I do that with a prompted LLM annotator, I'm used to seeing more like 85%, and for some models it's possible to get even better consistency with the right conditions and enough time spent fussing with the prompt.
I still prefer the human labels when I can afford them because LLM labeling has plenty of other problems. But being more flip-floppy than humans is not one that I have been able to empirically observe.
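For concreteness, the repeat-agreement figure described above can be computed like this; a minimal sketch, with made-up example labels rather than the 75%/85% figures from the comment.

```python
def repeat_agreement(first_pass: list[str], second_pass: list[str]) -> float:
    """Fraction of items that received the same label on both passes."""
    if not first_pass or len(first_pass) != len(second_pass):
        raise ValueError("both passes must cover the same items, in the same order")
    matches = sum(a == b for a, b in zip(first_pass, second_pass))
    return matches / len(first_pass)

# Made-up labels, purely to show the calculation:
print(repeat_agreement(["pass", "fail", "pass", "pass"],
                       ["pass", "fail", "fail", "pass"]))  # 0.75
```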
I don't study domestic law enough, but I asked a professor of law:
"With anything gray, does the stronger/bigger party always win?"
He said:
"If you ask my students, nearly all of them would say Yes"
> The real question for me is: are they less reliable than human judges?
It's not just about ratios, though: we must also ask whether the "shape" of their un/reliability is knowable or desirable.
For example, a robot may make far fewer mistakes at a chess game than a human, but with a human we are ridiculously confident they won't misidentify the opponent's finger as a game piece and rip it off.
Contrast that with human mistakes: for those we have tools, knowledge, and techniques so pervasive that we can literally forget they exist until someone points them out.
Would we accept a "judge" that is fairer on average, yet sometimes decides to give shoplifters the death penalty and we have no way of understanding why it happens or how to stop it?
I do think you've hit the heart of the question, but I don't think we can answer the second question.
We can measure how unreliable they are, or how susceptible they are to specific changes, precisely because we can reset them to the same state and run the experiment again. At least for now [1], we do not have that capability with humans, so there's no way to run a matching experiment on humans.
The best we can do is probably to run the limited experiments we can do on humans: comparing different judges' cross-referenced reliability to get an overall measure, plus some weak indicator of the reliability of a specific judge based on intra-judge agreement. But when running this on LLMs, they would have to keep the previous cases in their context window to get a fair comparison.
[1] https://qntm.org/mmacevedo
> The real question for me is: are they less reliable than human judges?
I've spent some time poking at this. I can't go into details, but the short answer is, "Sometimes yes, sometimes no, and it depends A LOT on how you define 'reliable'."
My sense is that, the more boring, mechanical and closed-ended the task is, the more likely an LLM is to be more reliable than a human. Because an LLM is an unthinking machine. It doesn't get tired, or hangry, or stressed out about its kid's problems at school. But it's also a doofus with absolutely no common sense whatsoever.
> Because an LLM is an unthinking machine.
Unthinking can be pretty powerful these days.
There are technical quirks that make LLM judges particularly high-variance, sensitive to artifacts in the prompt, and positively/negatively skewed, as opposed to the subjectivity of human judges. These largely arise from their training distribution and post-training, and can be contained with careful calibration.
At least LLMs don't use penis pumps while on the job in court.
https://www.findlaw.com/legalblogs/legally-weird/judge-who-u...
https://www.subsim.com/radioroom/showthread.php?t=95174
Can we stop with the "AI is unreliable, just like people" line? It is demonstrably false at best and cult-like thought termination at worst.
[dead]