Fully agree; I've found that LLMs aren't good at tasks that require evaluation.
Think about it: if they were good at evaluation, you could remove all humans from the loop and have recursively self-improving AGI.
Nice to see an article that makes a more concrete case.
Humans aren't good at validation either. We need tools, experiments, labs. Unproven ideas are a dime a dozen. Remember the hoopla about room temperature superconductivity? The real source of validation is external consequences.
Human experts set the benchmarks, and LLMs cannot match them in most fields requiring sophisticated judgment.
They are very useful for some things, but sophisticated judgment is not one of them.
Some other known distributional biases include self-preference bias (GPT-4o prefers GPT-4o generations over Claude generations, for example) and structured-output/JSON-mode bias [1]. Interestingly, some models also skew more positive or negative than others. This library [2] also provides some methods for calibrating/stabilizing them.
[1]: https://verdict.haizelabs.com/docs/cookbook/distributional-b... [2]: https://github.com/haizelabs/verdict
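A rough sketch of how one might probe for self-preference bias along these lines. This is not the linked library's API; `judge_score` is a hypothetical helper that asks the judge model to rate an answer blind (no model names in the prompt), and the gap it reports is purely illustrative.

```python
from statistics import mean
from typing import Callable

def self_preference_gap(
    judge_score: Callable[[str, str], float],  # (prompt, answer) -> score, blind to model identity
    own_answers: list[tuple[str, str]],        # (prompt, answer) pairs from the judge's own model family
    other_answers: list[tuple[str, str]],      # (prompt, answer) pairs from a different model
) -> float:
    """Positive value = the judge rates its own model's answers higher on average."""
    own = mean(judge_score(p, a) for p, a in own_answers)
    other = mean(judge_score(p, a) for p, a in other_answers)
    return own - other
```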
LLMs are good at discovery: they know a lot, and they can retrieve that knowledge from a query that simpler (e.g. regex-based) search engines over the same knowledge couldn't handle. For example, an LLM given a case as input may surface an obscure law, or notice a pattern in past court cases that establishes precedent. So they can be helpful to a real judge.
Of course, the judge must check that the law or precedent isn't hallucinated and actually applies to the case in the way the LLM claims. They should also prompt other LLMs and use their own knowledge in case the cited law or precedent contradicts others.
There's a similar argument for scientists, mathematicians, doctors, investors, and other fields. LLMs are good at discovery but must be checked.
I see "panels of judges" mentioned once, but what is the weakness of this? Other than more resource.
Worst case you end up with a multimodal distribution where two opinions are tied, which seems increasingly unlikely as the panel size grows. It could perhaps happen in a case with exactly two outcomes (yes/no), but I'd be surprised if such a panel landed on a perfectly uniform distribution in its judgments/opinions (50% yes, 50% no).
One method to get a better estimate is to extract the token log-probabilities of "YES" and "NO" from the final logits of the LLM and take a weighted sum [1] [2]. If the LLM is calibrated for your task, there should be roughly a 50% chance of sampling YES (1) and a 50% chance of NO (0), yielding an expected score of 0.5.
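A rough sketch of that trick, assuming the OpenAI Python SDK (v1.x) and a chat model that returns per-token logprobs; other providers expose similar options. Treat it as illustrative rather than the exact method from the linked references.

```python
import math
from openai import OpenAI

client = OpenAI()

def yes_no_score(question: str, model: str = "gpt-4o-mini") -> float:
    """Return a soft score in [0, 1]: P(YES) renormalized over {YES, NO}."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer with exactly one word: YES or NO."},
            {"role": "user", "content": question},
        ],
        max_tokens=1,
        logprobs=True,
        top_logprobs=10,
    )
    # Candidate first tokens and their log-probabilities.
    top = resp.choices[0].logprobs.content[0].top_logprobs
    p = {"YES": 0.0, "NO": 0.0}
    for t in top:
        key = t.token.strip().upper()
        if key in p:
            p[key] += math.exp(t.logprob)
    total = p["YES"] + p["NO"]
    if total == 0.0:
        raise ValueError("neither YES nor NO appeared in the top logprobs")
    # Weighted sum: YES counts as 1, NO as 0, renormalized over the two options.
    return p["YES"] / total
```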
But generally you wouldn't use a binary outcome when samples can genuinely be 50/50 pass/fail. Better to use a discrete scale of 1..3 or 1..5 and specify exactly what makes a sample a 2/5 vs. a 4/5, for example.
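For illustration, a minimal rubric-style judge prompt along those lines might look like the following; the criteria and wording here are hypothetical, and a real rubric should spell out what separates each score for your specific task.

```python
# Hypothetical 1-5 rubric prompt for an LLM judge; adapt the criteria to your task.
RUBRIC_PROMPT = """You are grading an answer for factual accuracy on a 1-5 scale.
5: fully correct, no unsupported claims.
4: correct overall, one minor omission or imprecision.
3: mostly correct, but at least one clear error.
2: significant errors; the main claim is unreliable.
1: mostly or entirely wrong.

Question: {question}
Answer: {answer}

Reply with a single integer from 1 to 5."""

def build_judge_prompt(question: str, answer: str) -> str:
    """Fill the rubric template for one (question, answer) pair."""
    return RUBRIC_PROMPT.format(question=question, answer=answer)
```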
You are correct to question the weaknesses of a panel. This class of methods depends on diversity through high-temperature sampling, which can lead to spurious YES/NO responses that don't generalize well and are effectively noise.
[1]: https://arxiv.org/abs/2303.16634 [2]: https://verdict.haizelabs.com/docs/concept/extractor/#token-...
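To make the panel idea being discussed here concrete, a minimal sketch under some assumptions: `ask_judge` is a hypothetical callable wrapping whatever judge model you use, votes are aggregated by simple majority, and an odd panel size sidesteps the exact-tie case raised above.

```python
from collections import Counter
from typing import Callable

def panel_verdict(
    ask_judge: Callable[[str, float], str],  # (prompt, temperature) -> "YES" or "NO"
    prompt: str,
    panel_size: int = 7,        # an odd size avoids exact ties on binary questions
    temperature: float = 0.8,   # higher temperature gives the panel its diversity
) -> tuple[str, float]:
    """Return (majority verdict, fraction of the panel that agreed with it)."""
    votes = [ask_judge(prompt, temperature).strip().upper() for _ in range(panel_size)]
    verdict, count = Counter(votes).most_common(1)[0]
    return verdict, count / panel_size
```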
We went from Impossible to Unreliable. I like the direction as a techie. But I'm not sure as a sociologist or an anthropologist.
> We call it 'prompt-engineering'
I prefer to call it “prompt guessing”; it's like some modern variant of alchemy.
"Prompt Whispering"?
Prompt divining
Meanwhile in Estonia, they just agreed to resolve child support disputes using AI... https://www.err.ee/1609701615/pakosta-enamiku-elatisvaidlust...
I listen to online debates, especially political ones on various platforms, and man, the AI slop that people slap around at each other is beyond horrendous. I would not want an LLM to have the final say on something critical. I want the opposite: an LLM should identify things that need follow-up review by a qualified person. A person should still confirm the things that "pass", but they can then prioritize what to validate first.
I don't even trust LLMs enough to spot content that requires validation or nuance.
Can't wait for the new field of AI psychology
[flagged]
Ok, but please don't post unsubstantive comments to Hacker News.
Ok sorry. I’ll go back to slashdot.
At least until the LLM judges otherwise.
[flagged]
Comments like this break the site guidelines, and not just a little. Can you please review https://news.ycombinator.com/newsguidelines.html and take the intended spirit of this site more to heart? Note these:
"Please don't fulminate."
"Don't be curmudgeonly. Thoughtful criticism is fine, but please don't be rigidly or generically negative."
"Please don't sneer, including at the rest of the community."
"When disagreeing, please reply to the argument instead of calling names. 'That is idiotic; 1 + 1 is 2, not 3' can be shortened to '1 + 1 is 2, not 3."
There's plenty of LLM skepticism on HN and that's fine, but like all comments here, it needs to be thoughtful.
(We detached this comment from https://news.ycombinator.com/item?id=44074957)
[flagged]
"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."
https://news.ycombinator.com/newsguidelines.html
(We detached this comment from https://news.ycombinator.com/item?id=44074957)
I’d argue real judges are unreliable as well.
The real question for me is: are they less reliable than human judges? Probably yes. But I favor a measurement relative to humans over a flat statement like that.
I think the main difference is an AI judge may provide three different rulings if you just ask it the same thing three times. A human judge is much less likely to be so "flip-floppy".
You can observe this using any of the present-day LLMs: ask one an architectural/design question, provide it with your thoughts, reasoning, constraints, etc., and see what it tells you. Then click the "Retry" button and see how similar (or dissimilar) the answer is. Sometimes you'll get a complete 180 from the prior response.
Humans flip-flop all the time. This is a major reason why the Myers-Briggs Type Indicator does such a poor job of assigning the same person the same Myers-Briggs type on successive tests.
It can be difficult to observe this in practice because, unlike an LLM, a human has memory: you can't just ask someone the exact same question three times in five seconds and get three different answers. But, as someone who works with human-labeled data, it's something I have to contend with on a daily basis. For the things I'm working on, if you give the same annotator the same thing to label at two different times, spaced far enough apart for them to forget that they have seen it before, the chance of them making the same call both times is only about 75%. If I do that with a prompted LLM annotator, I'm used to seeing more like 85%, and for some models it's possible to get even better consistency with the right conditions and enough time spent fussing with the prompt.
I still prefer the human labels when I can afford them because LLM labeling has plenty of other problems. But being more flip-floppy than humans is not one that I have been able to empirically observe.
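For concreteness, the repeat-agreement figure described above can be computed like this; a minimal sketch, with made-up example labels rather than the 75%/85% figures from the comment.

```python
def repeat_agreement(first_pass: list[str], second_pass: list[str]) -> float:
    """Fraction of items that received the same label on both passes."""
    if not first_pass or len(first_pass) != len(second_pass):
        raise ValueError("both passes must cover the same items, in the same order")
    matches = sum(a == b for a, b in zip(first_pass, second_pass))
    return matches / len(first_pass)

# Made-up labels, purely to show the calculation:
print(repeat_agreement(["pass", "fail", "pass", "pass"],
                       ["pass", "fail", "fail", "pass"]))  # 0.75
```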
I don't study domestic law enough, but I asked a professor of law:
"With anything gray, does the stronger/bigger party always win?"
He said:
"If you ask my students, nearly all of them would say Yes"
> The real question for me is: are they less reliable than human judges?
It's not just about ratios, though: we must also ask whether the "shape" of their un/reliability is knowable or desirable.
For example, a robot may make far fewer mistakes at a chess game than a human, but with a human we are ridiculously confident they won't misidentify the opponent's finger as a game piece and rip it off.
Contrast that with human mistakes: for those we have tools, knowledge, and techniques so pervasive that we can literally forget they exist until someone points them out.
Would we accept a "judge" that is fairer on average, yet sometimes decides to give shoplifters the death penalty and we have no way of understanding why it happens or how to stop it?
I do think you've hit the heart of the question, but I don't think we can answer the second question.
We can measure how unreliable they are, or how susceptible they are to specific changes, precisely because we can reset them to the same state and run the experiment again. At least for now [1], we do not have that capability with humans, so there's no way to run a matching experiment on humans.
The best we can do is probably to run the limited experiments we can do on humans: comparing different judges' cross-referenced reliability to get an overall measure, plus some weak indicator of the reliability of a specific judge based on intra-judge agreement. But when running this on LLMs, they would have to keep the previous cases in their context window to get a fair comparison.
[1] https://qntm.org/mmacevedo
> The real question for me is: are they less reliable than human judges?
I've spent some time poking at this. I can't go into details, but the short answer is, "Sometimes yes, sometimes no, and it depends A LOT on how you define 'reliable'."
My sense is that, the more boring, mechanical and closed-ended the task is, the more likely an LLM is to be more reliable than a human. Because an LLM is an unthinking machine. It doesn't get tired, or hangry, or stressed out about its kid's problems at school. But it's also a doofus with absolutely no common sense whatsoever.
> Because an LLM is an unthinking machine.
Unthinking can be pretty powerful these days.
There are technical quirks that make LLM judges particularly high-variance, sensitive to artifacts in the prompt, and positively/negatively skewed, as opposed to the subjectivity of human judges. These largely arise from their training distribution and post-training, and can be contained with careful calibration.
At least LLMs don't use penis pumps while on the job in court.
https://www.findlaw.com/legalblogs/legally-weird/judge-who-u...
https://www.subsim.com/radioroom/showthread.php?t=95174
Can we stop with the "AI is unreliable, just like people" line? It is demonstrably false at best and cult-like thought termination at worst.
[dead]