[SWE-bench co-author here]
It seems like they run this test on a subset of 50 tasks, and that they only run the test once per day. So a lot of the movement in accuracy could be attributed to that.
I would run on 300 tasks and I'd run the test suite 5 or 10 times per day and average that score. Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded.
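To put rough numbers on that (a back-of-the-envelope sketch in plain Python; the 70% pass rate is just an illustrative figure, not the tracker's actual number):

    import math

    def ci_half_width(p: float, n: int) -> float:
        """Approximate 95% confidence half-width for a Bernoulli pass rate."""
        return 1.96 * math.sqrt(p * (1 - p) / n)

    p = 0.70  # illustrative pass rate
    print(f"50 tasks, 1 run/day:   +/- {ci_half_width(p, 50):.1%}")    # ~12.7 points
    print(f"300 tasks, 1 run/day:  +/- {ci_half_width(p, 300):.1%}")   # ~5.2 points
    print(f"300 tasks, 5 runs/day: +/- {ci_half_width(p, 1500):.1%}")  # ~2.3 points
    # Repeated runs on the same tasks aren't fully independent, so treat the
    # last line as optimistic; it still shows why 50 tasks once a day is noisy.

Daily swings of a few points are what the first line predicts even if nothing about the model changes.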
But degradation from servers being overloaded would be the type of degradation this SHOULD measure, no? Unless it's only intended to catch them quietly distilling models (which they claim not to do? idk for certain)
Load just makes LLMs behave less deterministically and likely degrade. See: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
They don't have to be malicious operators in this case. It just happens.
> malicious
It doesn't have to be malicious. If my workflow is to send a prompt once and hopefully accept the result, then degradation matters a lot. If degradation is causing me to silently get worse code output on some of my commits it matters to me.
I care about -expected- performance when picking which model to use, not optimal benchmark performance.
Non-determinism isn’t the same as degradation.
The non-determinism means that even with a temperature of 0.0, you can’t expect the outputs to be the same across API calls.
In practice people tend to index to the best results they’ve experienced and view anything else as degradation. In practice it may just be randomness in either direction from the prompts. When you’re getting good results you assume it’s normal. When things feel off you think something abnormal is happening. Rerun the exact same prompts and context with temperature 0 and you might get a different result.
this is about variance of daily statistics, so I think the suggestions are entirely appropriate in this context.
Explain this though. The code is deterministic, even if it relies on pseudo-random number generation. It doesn't just happen; someone has to make a conscious decision to force a different code path (or model) if the system is loaded.
Not deterministic. https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
It takes a different code path for efficiency, e.g. if (batch_size > 1024): kernel_x else: kernel_y
Floating point math isn't associative for operations that are associative in normal math.
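A minimal illustration of that point in plain Python (nothing here is specific to any provider's stack; it's just IEEE-754 behaviour):

    # Grouping changes the result, so the order in which a kernel reduces the
    # same numbers (which can depend on batch size) changes the result too.
    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0 -- same numbers, different grouping

    # A different accumulation order usually gives a (slightly) different sum:
    import random
    xs = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
    print(sum(xs) == sum(reversed(xs)))  # usually False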
noob question: why would increased demand result in decreased intelligence?
An operator at load capacity can either refuse requests, or move the knobs (quantization, thinking time) so requests process faster. Both of those things make customers unhappy, but only one is obvious.
This is intentional? I think delivering lower quality than what was advertised and benchmarked is borderline fraud, but YMMV.
Per Anthropic’s RCA, linked in the OP's post, about the September 2025 issues:
“… To state it plainly: We never reduce model quality due to demand, time of day, or server load. …”
So according to Anthropic, they are not tweaking quality settings due to demand.
And according to Google, they always delete data if requested.
And according to Meta, they always give you ALL the data they have on you when requested.
>And according to Google, they always delete data if requested.
However, the request form is on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard'.
What would you like?
An SLA-style contractually binding agreement.
That's about model quality. Nothing about output quality.
I guess I just don't know how to square that with my actual experiences then.
I've seen sporadic drops in reasoning skills that made me feel like it was January 2025, not 2026 ... inconsistent.
LLMs sample the next token from a conditional probability distribution; the hope is that dumb sequences are less probable, but they will still happen naturally.
I wouldn't doubt that these companies would deliberately degrade performance to manage load, but it's also true that humans are notoriously terrible at identifying random distributions, even with something as simple as a coin flip. It's very possible that what you view as degradation is just "bad RNG".
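A toy simulation of that "bad RNG" effect (plain Python; the 70% pass rate, 50 tasks, and the definition of a "bad streak" are all arbitrary choices for illustration):

    import random
    random.seed(0)

    TRUE_RATE, TASKS, DAYS = 0.70, 50, 30  # constant quality, 50-task daily eval

    def month_of_daily_scores():
        return [sum(random.random() < TRUE_RATE for _ in range(TASKS)) / TASKS
                for _ in range(DAYS)]

    def looks_degraded(scores, drop=0.06, run=3):
        # "Three consecutive days at least 6 points below the true rate"
        below = [s <= TRUE_RATE - drop for s in scores]
        return any(all(below[i:i + run]) for i in range(len(below) - run + 1))

    months = 2000
    hits = sum(looks_degraded(month_of_daily_scores()) for _ in range(months))
    print(f"'degraded-looking' months with zero actual change: {hits / months:.0%}")

Even with the model held perfectly constant, a noticeable fraction of simulated months contain a streak that would feel like degradation.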
yep stochastic fantastic
these things are by definition hard to reason about
Personally, I'd rather get queued up with a longer wait time. I mean, not ridiculously long, but I am OK waiting five minutes to get correct, or at least more correct, responses.
Sure, I'll take a cup of coffee while I wait (:
i’d wait any amount of time lol.
at least i would KNOW it’s overloaded and i should use a different model, try again later, or just skip AI assistance for the task altogether.
They don't advertise a certain quality. You take what they have or leave it.
If there's no way to check, then how can you claim it's fraud? :)
There is no level of quality advertised, as far as I can see.
> I think delivering lower quality than what was advertised and benchmarked is borderline fraud
welcome to Silicon Valley, I guess. Everything from Google Search to Uber is fraud. Uber is a classic example of this playbook, even.
If you aren't defrauding your customers you will be left behind in 2026
That number is a sliding window, isn't it?
I'd wager that lower tok/s vs lower quality of output would be two very different knobs to turn.
I've seen some issues with garbage tokens (seemed to come from a completely different session, mentioned code I've never seen before, repeated lines over and over) during high load. I suspect Anthropic has some threading bugs or race conditions in their caching/inference code that only happen during very high load.
It would happen if they quietly decide to serve up more aggressively distilled / quantised / smaller models when under load.
Or just reducing the reasoning tokens.
They advertise the Opus 4.5 model. Secretly substituting a cheaper one to save costs would be fraud.
Old school Gemini used to do this. It was super obvious because midday the model would go from stupid to completely brain dead. I have a screenshot of Google's FAQ on my PC from 2024-09-13 that says this (I took it to post to discord):
> How do I know which model Gemini is using in its responses?
> We believe in using the right model for the right task. We use various models at hand for specific tasks based on what we think will provide the best experience.
> We use various models at hand for specific tasks based on what we think will provide the best experience
... for Google :)
from what I understand this can come from the batching of requests.
So, a known bug?
I've personally witnessed large variability in behaviour even within a given session -- which makes sense, as there's nothing stopping Anthropic from shuttling your context/session around, load-balanced across many different servers, some of which might be quantized heavily to manage load and others not at all.
I don't know if they do this or not, but the nature of the API is such that you could absolutely load balance this way. The context sent at each point is not, I believe, "sticky" to any server.
TLDR you could get a "stupid" response and then a "smart" response within a single session because of heterogeneous quantization / model behaviour in the cluster.
I've defended opus in the last weeks but the degradation is tangible. It feels like it degraded by a generation tbh.
it's just extremely variable
Hope you don't mind the unrelated question:
How do you pay for those SWE-bench runs?
I am trying to run a benchmark but it is too expensive to run enough runs to get a fair comparison.
https://mafia-arena.com
Benchmarks can get costly to run; you can reach out to frontier model creators to try and get them to give you free credits, but usually they'll only agree to that once your benchmark is pretty popular.
so basically they know requests using your API key should be treated with care?
The last thing a proper benchmark should do is reveal its own API key.
That's a good thought I hadn't had, actually.
IMO it should need a third party running the LLM anyway. Otherwise the evaluated company could notice they're receiving the same requests daily and discover benchmarking that way.
yes I reached out to them but as you say it's a chicken-and-egg problem.
Thanks!
The degradation may be more significant within the day than at the same time every day.
Sure, but it's still useful insight to see how it performs over time. Of course, cynically, Anthropic could game the benchmark by routing this benchmark's specific prompts to an unadulterated instance of the model.
Sorry what?
"You can't measure my Cloud Service's performance correctly if my servers are overloaded"?
"Oh, you just measured me at bad times each day. On only 50 different queries."
So, what does that mean? I have to pick specific times during the day for Claude to code better?
Does Claude Code have office hours basically?
> Does Claude Code have office hours basically?
Yes. Now pay up or you will be replaced.
Verily, my vichyssoise of verbiage veers most verbose, so let me run that thing out of tokens fast.
Agreed, this benchmark would be much more useful run multiple times a day. That could reveal degradation in line with load patterns.
For CC, I suspect it would also need to test and label separate runs against subscription, public API, and Bedrock-served models?
It’s a terrific idea to provide this. ~Isitdownorisitjustme for LLMs would be the parakeet in the coalmine that could at least inform the multitude of discussion threads about suspected dips in performance (beyond HN).
What we could also use is similar stuff for Codex, and eventually Gemini.
Really, the providers themselves should be running these tests and publishing the data.
The availability status information is no longer sufficient to gauge service delivery, because the service is by nature non-deterministic.
> Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded.
Are you suggesting result accuracy varies with server load?
Stilll relevant over time.
"Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded"
Aha, so the models do degrade under load.
Would be interesting to see what scores it gets when the service is actually flagged as degraded on the status page. It gets degraded pretty often, so there's at least something to compare against, or a way to know at what point Anthropic declares degradation.
Why I do not believe this shows Anthropic serves folks a worse model:
1. The percentage drop is too low and oscillating; it goes up and down.
2. A baseline for Sonnet 4.5 (the obvious fallback for when their GPUs are busy with the next training run) should be established, to see whether Opus at some point drops to Sonnet level. This was not done, but we would likely see a much sharper decline on certain days / periods; the graph would look dominated by a "square wave" shape.
3. There are much better explanations for this oscillation: A) They have multiple checkpoints and are A/B testing; CC asks you for feedback about the session. B) Claude Code itself gets updated, as the exact tool versions the agent can use change. Part of it is the natural variability due to token sampling, which makes runs non-equivalent (sometimes the model makes suboptimal decisions compared to T=0) as well as non-deterministic, but this is the price to pay to have some variability.
I believe the science, but I've been using it daily and it's been getting worse, noticeably.
Any chance you’re just learning more about what the model is and is not useful for?
I dunno about everyone else but when I learn more about what a model is and is not useful for, my subjective experience improves, not degrades.
Not when the product is marketed as a panacea.
Is it possible that your expectations are increasing, not that the model is getting worse?
Possible, though you eventually run into types of issues that you recall the model just not having before. Like accessing a database or not following the SOP you have it read each time it performs X routine task. There are also patterns that are much less ambiguous like getting caught in loops or failing to execute a script it wrote after ten attempts.
4. The graph starts January 8.
Why January 8? Was that an outlier high point?
IIRC, Opus 4.5 was released in late November.
People were away for the holidays. What do you want them to do?
Or maybe, just maybe, that's when they started testing…
Wayback machine has nothing for this site before today, and article is "last updated Jan 29".
A benchmark like this ought to start fresh from when it is published.
I don't entirely doubt the degradation, but the choice of where they went back to feels a bit cherry-picked to demonstrate the value of the benchmark.
Which makes sense, you gotta wait until you get enough data before you can communicate on said data…
If anything it's consistent with the fact that they very likely didn't have data earlier than January the 8th.
> 1. The percentage drop is too low and oscillating, it goes up and down.
How do you define “too low”? They make sure to communicate the statistical significance of their measurements; what's the point if people can just claim it's “too low” based on personal vibes…
Lack of transparency around "thinking power" consistency is a big gripe of mine with LLM providers. It's even worse with ChatGPT and the like. E.g. I had to learn the hard way that at >45k input tokens, ChatGPT 5.2 Thinking Extended bumps its intelligence down so hard that it can't follow basic instructions (or it somehow truncates the input, losing the instructions). It sucks to lose confidence in an otherwise great tool. I would 100x prefer being forced to back off, or getting a straight no, over being silently downgraded. Transparency is a big deal.
Sounds like you ran into the Maximum Effective Context Window: https://arxiv.org/abs/2509.21361?context=cs.AI
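For what it's worth, that kind of claim is cheap to test directly. A rough sketch (plain Python; `ask_model` is a placeholder to wire up to whichever provider/SDK you use, and the filler sizes and instruction are arbitrary):

    def ask_model(prompt: str) -> str:
        # Placeholder: substitute a real call to your provider's client here.
        return "PINEAPPLE"

    INSTRUCTION = "Reply with exactly one word: PINEAPPLE"
    FILLER = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 50

    def follows_instruction(n_filler_blocks: int) -> bool:
        prompt = FILLER * n_filler_blocks + "\n\n" + INSTRUCTION
        return ask_model(prompt).strip().upper() == "PINEAPPLE"

    for n in (1, 20, 60, 120):  # roughly increasing input sizes
        passes = sum(follows_instruction(n) for _ in range(10))
        print(f"{n} filler blocks: {passes}/10 followed the instruction")

If adherence really falls off a cliff past some input size, a table like this makes it visible and repeatable instead of anecdotal.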
Simply search user prompts for curse words and then measure hostility sentiment. User hostility rises as agents fail to meet expectations.
Maybe I'm overlooking something obvious, but how do you 'simply' scan the content of Claude users' prompts?
GP was making a joke, but Anthropic could implement this if they wanted to. Not a bad metric actually if you can measure it cheaply enough.
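A toy version of that metric (plain Python, illustrative only; the word list, regex, and per-day structure are all made up, and a real version would need prompt data you're actually allowed to analyse plus a proper sentiment model):

    import re

    CURSES = {"damn", "hell", "wtf", "crap"}  # extend to taste

    def frustration_rate(prompts):
        """Fraction of prompts containing at least one curse word."""
        hits = 0
        for p in prompts:
            words = set(re.findall(r"[a-z']+", p.lower()))
            hits += bool(words & CURSES)
        return hits / max(len(prompts), 1)

    daily_prompts = {
        "mon": ["fix the failing test please", "add a retry to the client"],
        "tue": ["wtf, you deleted the whole feature again"],
    }
    print({day: round(frustration_rate(ps), 2) for day, ps in daily_prompts.items()})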
I uh might be skewing that as I generally just use a lot of curse words with Claude by default
I feel bad about it but sometimes it's so daft, I can't even xD
It's not my fault, they set high standards!
I'm glad I'm not the only one.
One time I cussed Claude out so hard that it actually quit its doom-loop and fixed the thing.
It's the only time cussing worked, though.
Or there are global events that stress people out .. or their expectations change over time. Not that simple ;)
there are many times where I just do it myself and it thinks it did well.
There was a moment about a week ago where Claude went down for about an hour. And right after it came back up it was clear a lot of people had given up and were not using it.
It was probably 3x faster than usual. I got more done in the next hour with it than I do in half a day usually. It was definitely a bit of a glimpse into a potential future of “what if these things weren’t resource constrained and could just fly”.
I had that exact same feeling during the US holidays where I got to enjoy 2x usage limits and everything just seemed to work well
I had terrible results during the holidays -- it wasn't slow but it was clear they were dealing with the load by quantizing in spots because there were entire chunks of days when the results from it were so terrible I gave up and switched to using Gemini or Codex via opencode.
Noticed the exact same thing a few days ago. So much so that I went on twitter and HN to search for “claude speed boost” to see if there was a known new release. Felt like the time I upgraded from a 2400 baud modem to a 14.4 as a kid - everything was just lightning fast (for a brief shining moment).
I would also regret it if they become that fast; right now I can really take a moment to enjoy the hard work the model is doing for me.
Wouldn't be surprised if they slowly start quantizing their models over time. Makes it easier to scale and reduce operational cost. Also makes a new release have more impact as it will be more notably "better" than what you've been using the past couple of days/weeks.
I don't think so. There are other knobs they can tweak to reduce load that affect quality less than quantizing. Like trimming the conversation length without telling you, reducing reasoning effort, etc.
It sure feels like they do this. They claim they don't, but when you're using it every day for 5-10 hours a day, you notice when something changes.
This last week it seems way dumber than before.
I would be surprised tbh.
Anthropic does not exactly act like they're constrained by infra costs in other areas, and noticeably degrading a product when you're in tight competition with 1 or 2 other players with similar products seems like a bad place to start.
I think people just notice the flaws in these models more the longer they use them. Aka the "honeymoon-hangover effect," a real pattern that has been shown in a variety of real world situations.
I haven't noticed much difference in Claude, but I swear gemini 3 pro preview was better in the first week or two and later started feeling like they quantized it down to hell.
Oooff yes I think that is exactly the kind of shenanigans they might pull.
Ultimately I can understand it: if a new model is coming in without as much optimization yet, then it'll add pressure on the older models achieving the same result.
Nice plausible deniability for a convenient double effect.
Benchmarks like ARC-AGI are strongly price-correlated and cheap to run. I think it's very easy to prove that the models are degrading.
FYI the MarginLab Claude Code degradation tracker is showing a statistically significant ~4% drop in SWE-Bench-Pro accuracy over the past month
I really like the idea, but a "±14.0% significance threshold" is meaningless here.
The larger monthly scale should be the default, or you should get more samples.
Could you elaborate what you think the problems are? I guess they should be using some form of multiple comparison correction?
The daily scale is not statistically significant and is meaningless.
You should tighten the confidence interval by either increasing the timescale or the number of evaluations.
Codex is doing better. Why is everyone silent on Codex? https://marginlab.ai/trackers/codex/
Does this use a Claude subscription or key, and has the account been used for anything else that day?
On HN a few days ago there was a post suggesting that Claude gets dumber throughout the day: https://bertolami.com/index.php?engine=blog&content=posts&de...
What makes the level they chose a “baseline,” against which it would be appropriate to do statistical tests?
I am using API mode, and it's clear that there are times when the Claude model just gives up. And it is very noticeable because the model just does the most dumb things possible.
"You have a bug in line 23." "Oh yes, this solution is bugged, let me delete the whole feature." That one-line fix I could make even with ChatGPT 3.5 can't just happen. Workflows that I use and are very reproducible start to flake and then fail.
After a certain number of tokens per day, it becomes unusable. I like Claude, but I don't understand why they would do this.
Robbing Peter to pay Paul. They are probably resource-constrained, and have determined that it's better to supply a worse answer to more people than to supply a good answer to some while refusing others. Especially knowing that most people probably don't need the best answer 100% of the time.
> Especially knowing that most people probably don't need the best answer 100% of the time.
More: probably don't know if they've got a good answer 100% of the time.
It is interesting to note that this trickery is workable only where the best answers are sufficiently poor. Imagine they ran almost any other kind of online service, such as email, stock prices or internet banking. Occasionally delivering only half the emails would trigger a customer exodus. But if normal service lost a quarter of emails, they'd have only customers who'd likely never notice half missing.
I encountered the same situation too; Claude has 'become lazy'.
This is why I run my own models. All the inference providers do sneaky things behind the scenes. They will limit the output tokens, turn off attention layers, lower reasoning, or just use a completely different model. I'm actually surprised that Claude Code experienced this, as I've experienced this the least from API and coding agents.
Does it benchmark the underlying model (Opus 4.5) or the Claude Code harness? If the latter, I would love to see CC versions involved.
I would be curious to see how it fares against a constant harness.
There were threads claiming that Claude Code got worse with 2.0.76, with some people going back to 2.0.62. https://github.com/anthropics/claude-code/issues/16157
So it would be wonderful to measure these.
Claude Code. They mention they are using Claude Code's CLI in the benchmark, and Claude Code changes constantly.
I wouldn't be surprised if what this is actually benchmarking is just Claude Code's constant system prompt changes.
I wouldn't really trust this to be able to benchmark Opus itself.
First off, this is a cool project; I look forward to some interesting insights.
I would suggest adding some clarification to note that longer measures like the 30-day pass rate are raw data only, while the statistically-significant labels apply only to changes.
Maybe something like:
"Includes all trials; significance labels apply only to confidence in change vs baseline."
This strategy seems inspired by TikTok's approach for retaining new uploaders.
TikTok used to give new uploaders a visibility boost (i.e., an inflated number of likes and comments) on their first couple of uploads, to get them hooked on the service.
In Anthropic/Claude's case, the strategy is (allegedly) to give new users access to the premium models on sign-up, and then increasingly cut the product with output from cheaper models.
Yes, but the difference is TikTok didn't sell a particular service version.
Anthropic did sell a particular model version.
I KNEW I WASNT CRAZY
I’m sure there is not enough data here for this to be statistically significant (it seems to oscillate too much and not show real trends or step changes) - BUT
If this measure were hardened up a little, it would be really useful.
It feels like an analogue to an employee’s performance over time - you could see in the graphs when Claude is “sick” or “hungover”, when Claude picks up a new side hustle and starts completely phoning it in, or when it’s gunning for a promotion and trying extra hard (significant parameter changes). Pretty neat.
Obviously the anthropomorphising is not real, but it is cool to think of the model’s performance as being a fluid thing you have to work with, and that can be measured like this.
I’m sure some people, most, would prefer that the model’s performance were fixed over time. But come on, this is way more fun.
Very interesting. I would be curious to understand how granular these updates are being applied to CC + what might be causing things like this. I feel like I can notice a very small degradation but have compensated with more detailed prompts (which I think, perhaps naively, is offsetting this issue).
> more detailed prompts (which I think, perhaps naively, is offsetting this issue).
Is exacerbating this issue ... if the load theory is correct.
The chart would benefit from having weekends highlighted. Or have another chart averaged by a weekday.
Why is this happening?
They're "optimizing" costs wherever possible - reducing compute allocations, quantizing models, doing whatever they can to reduce the cost per token, but vehemently insisting that no such things are occurring, that it's all in the users' heads, and using the weaseliest of corporate weasel speak to explain what's happening. They insist it's not happening, then they say something like "oh, it happened but it was an accident", then they say "yes, it's happening, but it's actually good!" and "we serve the same model day by day, and we've always been at war with Eastasia."
They should be transparent and tell customers that they're trying to not lose money, but that'd entail telling people why they're paying for service they're not getting. I suspect it's probably not legal to do a bait and switch like that, but this is pretty novel legal territory.
I have absolutely no inside knowledge, but I think it's not a bad assumption to have: it's costly to run the models, so when they release a new model they assume that cost and give each user more raw power; once they've captured the new users and the wow factor, they start reducing costs by reducing the capacity they provide to users. Rinse and repeat.
There are frequently claims that Anthropic is somehow diluting or dumbing down models in some subtle way. Unfortunately it’s tough to validate these claims without a body of regularly checked evals. This test set should hopefully help settle whether Anthropic is actually making changes under the hood or whether the changes are all in people’s heads.
It’s entirely possible it’s not happening, and this phenomenon of “model degradation” is just user hype meeting reality.
>>> We never reduce model quality due to demand, time of day, or server load. The problems our users reported were due to infrastructure bugs alone.
Just ignore the continual degradation of service day over day, long after the "infrastructure bugs" have reportedly been solved.
Oh, and I've got a bridge in Brooklyn to sell ya, it's a great deal!
> We never reduce model quality due to demand, time of day, or server load
Forgive me, but as a native English speaker, this sentence says exactly one thing to me: We _do_ reduce model quality, just not for these listed reasons.
If they don't do it, they could put a full stop after the fifth word and save some ~~tokens~~ time.
I have yet to experience any degradation in the coding tasks I use to evaluate Opus 4.5, but I did see a rather strange and reproducible worsening in prompt adherence on non-coding tasks since the third week of January.
Very simple queries, even those easily answered via regular web searching, have begun to consistently return inaccurate results with Opus 4.5, despite the same prompts previously yielding accurate results.
One of the tasks that I already thought was fully saturated, as most recent releases had no issues solving it, was to request a list of material combinations for fabrics used in bag construction that utilise a specific fabric base. In the last two weeks, Claude has consistently and reproducibly provided results which deviate from the requested fabric base, making the results inaccurate in a way that a person less familiar with the topic may not notice instantly. There are enough other queries of this type, on topics I am nerdily familiar with to a sufficient degree to notice such deviations from the prompt (motorcycle-history-specific queries, for example), that I can say this behaviour isn't limited to the topic of fabrics and bag construction.
Looking at the reasoning traces, Opus 4.5 even writes down the correct information, yet somehow provides an incorrect final output anyways.
What makes this so annoying is that in coding tasks, with extensive prompts that require far greater adherence to very specific requirements in a complex code base, Opus 4.5 does not show such a regression.
I can only speculate what may lead to such an experience, but for non-coding tasks I have seen regression in Opus 4.5, whereas for coding I did not. Not saying there is none, but I wanted to point it out, as such discussions are often primarily focused on coding, where I find it can be easier to see potential regressions where there are none as a project goes on and tasks become inherently more complex.
My coding benchmarks are a series of very specific prompts modifying a few existing code bases in some rather obscure ways, with which I regularly check whether a model severely deviates from what I'd seen previously. Each run starts with a fresh code base and some fairly simple tasks, then gets increasingly complex, with the later prompts not yet having been implemented successfully by any LLM I have gotten to test. Partly that originated from my subjective experience with LLMs early on: I found a lot of things worked very well, but then, as the project went on and I tried more involved things the model struggled with, I felt like the model was overall worse, when in reality what had changed were simply the requirements and task complexity as the project grew and the easier tasks had already been completed. In this type of testing, Opus 4.5 this week got as far as, and provided a result as good as, the model did in December. Of course, past regressions were limited to specific users, so I am not saying that no one is experiencing reproducible regressions in code output quality, merely that I cannot reproduce them in my specific suite.
I've noticed a degradation in Opus 4.5, and also with Gemini-3-Pro. For me, it was a sudden, rapid decline in adherence to specs in Claude Code. On an internal benchmark we developed, Gemini-3-Pro also dramatically declined, going from being clearly beyond every other model (as benchmarks would lead you to believe) to being quite mediocre: delivering mediocre results in chat queries and also missing the mark in coding.
I didn't "try 100 times" so it's unclear if this is an unfortunate series of bad runs on Claude Code and Gemini CLI or actual regression.
I shouldn't have to benchmark this sort of thing but here we are.
I definitely noticed a degradation, it feels regressed by a generation.
> We model tests as Bernoulli random variables and compute 95% confidence intervals around daily, weekly, and monthly pass rates. Statistically significant differences in any of those time horizons are reported.
Doesn't really work like that. I'd remove the "statistically significant" labelling because it's misleading.
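Concretely (plain Python, illustrative numbers): with ~30 daily comparisons against a baseline, 95%-confidence flags will fire regularly even if nothing ever changes, unless a multiple-comparison correction (as suggested upthread) is applied.

    # Daily tests at the 5% level, no real change, assumed independent.
    alpha, days = 0.05, 30
    print(f"expected false 'significant' days per month: {alpha * days:.1f}")   # 1.5
    print(f"chance of at least one false flag: {1 - (1 - alpha) ** days:.0%}")  # ~79%

    # A Bonferroni-style correction keeps the month-level false-alarm rate near 5%:
    print(f"per-day alpha after correction: {alpha / days:.4f}")  # 0.0017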
Would love to see this idea expanded to every alleged SoTA model currently in production. Any speculation as to why this degradation occurs?
Anecdote, I don't have any proof and it's just a feeling. But around afternoon in GMT+1 compared to the morning/midday, there seems to be a change in the quality of responses, which seems to line up with when the US wakes up. I consistently get (what feels like) worse responses in both Codex and Claude Code in the afternoon/night compared to morning/midday, so much that I usually give up then try the same prompt next morning and get better results. But I guess that might as well be about me being more tired in the night than morning too, as I said, haven't measured this.
It’s the afternoon slump. The AI needs a cup of coffee and to doomscroll for half an hour!
Or a load balancing technique :) Either way, it kicks me off to do other things so maybe it isn't so bad after all.
In medicine there is a concept of reporting adverse effects of medication or interventions, which are then collectively studied for public health [MedWatch][VAERS][EudraVigilance] and in academia. We should have something like that for all coding agents (and agents in other fields too), given how widely they're deployed and their effect on "health" in general (not only human). Call it the AI "health" of things benchmark.
I would imagine something with the hybrid qualities of volunteer efforts like Wikipedia, new problems like Advent of Code, and benchmarks like this. The goal? To study, as a collective effort, the effects of usage across the many areas where AI is used.
Pretty sure someone at Google, OpenAI, and Anthropic met up at a park, leaving their phones in their car, and had a conversation that January 2026, they were all going to silently degrade their models.
They were fighting an arms race that was getting incredibly expensive and realized they could get away with spending less electricity and there was nothing the general population could do about it.
Grok/Elon was left out of this because he would leak this idea at 3am after a binge.
Finally someone did it! We need this for all models.
It would be great if there were RSS support.
Any chance we can get something like this for Codex CLI? That'd be cool to compare.
My personal conspiracy theory is that they choose who to serve a degraded model to based on social graph analysis and sentiment analysis, maximizing for persuasion while minimizing compute.
IMO this strategy seems inspired by TikTok's approach for retaining new uploaders.
TikTok used to give new uploaders a visibility boost (i.e., an inflated number of likes and comments) on their first couple of uploads, to get them hooked on the service.
In Anthropic/Claude's case, the strategy is (allegedly) to give new users access to the premium models on sign-up, and then increasingly cut the product with output from cheaper models.
Of course, your suggestion (better service for users who know how to speak Proper English) would be the cherry on top of this strategy.
From what I've seen on HackerNews, Anthropic is all-in on social media manipulation and social engineering, so I suspect that your assumption holds water.
Sounds more like a sound business plan than a conspiracy theory.
It sounds like fraud to me
Does it say anywhere in their terms of service that they guarantee the quality of the model, or promise not to modify it?
This is probably entirely down to subtle changes to CC prompts/tools.
I've been using CC more or less 8 hrs/day for the past 2 weeks, and if anything it feels like CC is getting better and better at actual tasks.
Edit: Before you downvote, can you explain how the model could degrade WITHOUT changes to the prompts? Is your hypothesis that Opus 4.5, a huge static model, is somehow changing? Master system prompt changing? Safety filters changing?
Honest, good-faith question.
Is CC getting better, or are you getting better at using it? And how do you know the difference?
I'm an occasional user, and I can definitely see improvements in my prompts over the past couple of months.
I agree with you, it's personally hard to tell.
For me I've noticed it getting nothing but better over the past couple months, but I've been working on my workflows and tooling.
For example, I used to use plan mode and would put everything in a single file and then ask it to implement it in a new session.
Switching to the 'superpowers' plugin with its own skills to brainstorm and write plans and execute plans with batches and tasks seems to have made a big improvement and help catch things I wouldn't have before. There's a "get shit done" plugin that's similar that I want to explore as well.
The code output always looks good to me for the most part though, and I've never thought that it's getting dumber or anything, so I feel like a lot of the improvements I see are because of a skill issue on my part as I try to use everything. Obviously it doesn't help that there's a new way to do things every two weeks though.
Good-faith answer: I can't be certain. But I've been using CC since its release, and Cursor before that (and actually going all the way back to GPT3 to do codegen in the Playground). After getting used to the CC workflow, the way that I use it has been pretty consistent. To be specific, I use basically the same AGENTS.md with small modifications for each project, and I live almost exclusively in Plan mode and the best model (currently Opus 4.5).
My initial prompting is boilerplate at this point, and looks like this:
(Explain overall objective / problem without jumping to a solution)
(Provide all the detail / file references / past work I can think of)
(Ask it "what questions do you have for me before we build a plan?")
And then go back and forth until we have a plan.
Compared to my work with CC six months ago, it's just much more capable, able to solve more nuanced bugs, and less likely to generate spaghetti code.
The easiest way would be to quantize the model, and serve different quants based on the current demand. Higher volumes == worse quant == more customers served per GPU
That's why benchmarks are useful. We all suffer from the shortcomings of human perception.
Benchmarks' shortcomings are no worse... they inevitably measure something that is only close to the thing you actually care about, not the thing you actually care about. It's entirely plausible that this decreased benchmark score is because Anthropic's initial prompting of the model was overtuned to the benchmark, and as they gain more experience with real-world use they are changing the prompt to do better at that and consequently worse at the benchmark.
I wonder how best we can measure the usefulness of models going forward.
Thumbs up or down? (could be useful for trends)
Usage growth from the same user over time? (as an approximation)
Tone of user responses? (Don't do this... this is the wrong path... etc.)
Benchmarks measure what they measure. But your subjective experience also matters.
I was going to ask: are all other variables accounted for? Are we really comparing apples to apples here? Still worth doing obviously, as it serves as a good e2e evaluation, just for curiosity's sake.
[SWE-bench co-author here] It seems like they run this test on a subset of 50 tasks, and that they only run the test once per day. So a lot of the movement in accuracy could be attributed to that. I would run on 300 tasks and I'd run the test suite 5 or 10 times per day and average that score. Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded.
but degradation from servers being overloaded would be the type of degradation this SHOULD measure no? Unless it's only intended for measuring their quietly distilling models (which they claim not to do? idk for certain)
Load just makes LLMs behave less deterministically and likely degrade. See: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
They don't have to be malicious operators in this case. It just happens.
> malicious
It doesn't have to be malicious. If my workflow is to send a prompt once and hopefully accept the result, then degradation matters a lot. If degradation is causing me to silently get worse code output on some of my commits it matters to me.
I care about -expected- performance when picking which model to use, not optimal benchmark performance.
Non-determinism isn’t the same as degradation.
The non-determinism means that even with a temperature of 0.0, you can’t expect the outputs to be the same across API calls.
In practice people tend to index to the best results they’ve experienced and view anything else as degradation. In practice it may just be randomness in either direction from the prompts. When you’re getting good results you assume it’s normal. When things feel off you think something abnormal is happening. Rerun the exact same prompts and context with temperature 0 and you might get a different result.
this is about variance of daily statistics, so I think the suggestions are entirely appropriate in this context.
Explain this though. The code is deterministic, even if it relies on pseudo random number generation. It doesn't just happen, someone has to make a conscious decision to force a different code path (or model) if the system is loaded.
Not deterministic. https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
It takes a different code path for efficiency.
e.g
if (batch_size > 1024): kernel_x else: kernel_y
Floating point math isn't associative for operations that are associative in normal math.
noob question: why would increased demand result in decreased intelligence?
An operator at load capacity can either refuse requests, or move the knobs (quantization, thinking time) so requests process faster. Both of those things make customers unhappy, but only one is obvious.
This is intentional? I think delivering lower quality than what was advertised and benchmarked is borderline fraud, but YMMV.
Per Anthropic’s RCA linked in Ops post for September 2025 issues:
“… To state it plainly: We never reduce model quality due to demand, time of day, or server load. …”
So according to Anthropic they are not tweaking quality setting due to demand.
And according to Google, they always delete data if requested.
And according to Meta, they always give you ALL the data they have on you when requested.
>And according to Google, they always delete data if requested.
However, the request form is on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard'.
What would you like?
An SLA-style contractually binding agreement.
That's about model quality. Nothing about output quality.
I guess I just don't know how to square that with my actual experiences then.
I've seen sporadic drops in reasoning skills that made me feel like it was January 2025, not 2026 ... inconsistent.
LLMs sample the next token from a conditional probability distribution, the hope is that dumb sequences are less probable but they will just happen naturally.
I wouldn't doubt that these companies would deliberately degrade performance to manage load, but it's also true that humans are notoriously terrible at identifying random distributions, even with something as simple as a coin flip. It's very possible that what you view as degradation is just "bad RNG".
yep stochastic fantastic
these things are by definition hard to reason about
Personally, I'd rather get queued up on a long wait time I mean not ridiculously long but I am ok waiting five minutes to get correct it at least more correct responses.
Sure, I'll take a cup of coffee while I wait (:
i’d wait any amount of time lol.
at least i would KNOW it’s overloaded and i should use a different model, try again later, or just skip AI assistance for the task altogether.
They don't advertise a certain quality. You take what they have or leave it.
If there's no way to check, then how can you claim it's fraud? :)
There is no level of quality advertised, as far as I can see.
> I think delivering lower quality than what was advertised and benchmarked is borderline fraud
welcome to the Silicon Valley, I guess. everything from Google Search to Uber is fraud. Uber is a classic example of this playbook, even.
If you aren't defrauding your customers you will be left behind in 2026
That number is a sliding window, isn't it?
I'd wager that lower tok/s vs lower quality of output would be two very different knobs to turn.
I've seen some issues with garbage tokens (seemed to come from a completely different session, mentioned code I've never seen before, repeated lines over and over) during high load, suspect anthropic have some threading bugs or race conditions in their caching/inference code that only happen during very high load
It would happen if they quietly decide to serve up more aggressively distilled / quantised / smaller models when under load.
Or just reducing the reasoning tokens.
They advertise the Opus 4.5 model. Secretly substituting a cheaper one to save costs would be fraud.
Old school Gemini used to do this. It was super obvious because mid day the model would go from stupid to completely brain dead. I have a screenshot of Google's FAQ on my PC from 2024-09-13 that says this (I took it to post to discord):
> How do I know which model Gemini is using in its responses?
> We believe in using the right model for the right task. We use various models at hand for specific tasks based on what we think will provide the best experience.
> We use various models at hand for specific tasks based on what we think will provide the best experience
... for Google :)
from what I understand this can come from the batching of requests.
So, a known bug?
I've personally witnessed large variability in behaviour even within a given session -- which makes sense as there's nothing stopping Anthropic from shuttling your context/session around load balanced through many different servers, some of which might be quantized heavily to manage load and others not at all.
I don't know if they do this or not, but the nature of the API is such you could absolutely load balance this way. The context sent at each point is not I believe "sticky" to any server.
TLDR you could get a "stupid" response and then a "smart" response within a single session because of heterogeneous quantization / model behaviour in the cluster.
I've defended opus in the last weeks but the degradation is tangible. It feels like it degraded by a generation tbh.
it's just extremely variable
Hope you don't mind the unrelated question:
How do you pay for those SWE-bench runs?
I am trying to run a benchmark but it is too expensive to run enough runs to get a fair comparison.
https://mafia-arena.com
Benchmarks can get costly to run- you can reach out to frontier model creators to try and get them to give you free credits, but usually they'll only agree to that once your benchmark is pretty popular.
so basically they know requests using your API key should be treated with care?
[dead]
The last thing a proper benchmark should do is reveal it's own API key.
That's a good thought I hadn't had, actually.
IMO it should need a third party running the LLM anyway. Otherwise the evaluated company could notice they're receiving the same requests daily and discover benchmarking that way.
yes I reached out to them but as you say it's a chicken-and-egg problem.
Thanks!
The degradation may be more significant within the day than at the same time every day.
Sure, but it's still useful insight to see how it performs over time. Of course, cynically, Anthropic could game the benchmark by routing this benchmark's specific prompts to an unadulterated instance of the model.
Sorry what?
"You can't measure my Cloud Service's performance correctly if my servers are overloaded"?
"Oh, you just measured me at bad times each day. On only 50 different queries."
So, what does that mean? I have to pick specific times during the day for Claude to code better?
Does Claude Code have office hours basically?
> Does Claude Code have office hours basically?
Yes. Now pay up or you will be replaced.
Verily, my vichyssoise of verbiage veers most verbose, so let me run that thing out of tokens fast.
Agreed, this benchmark would be much more useful ran multiple times a day. That could reveal degredation in line with load patterns.
For CC, I suspect it also need to be testing and labeling separate runs against subscription, public API and Bedrock-served models?
It’s a terrific idea to provide this. ~Isitdownorisitjustme for LLMs would be the parakeet in the coalmine that could at least inform the multitude of discussion threads about suspected dips in performance (beyond HN).
What we could also use is similar stuff for Codex, and eventually Gemini.
Really, the providers themselves should be running these tests and publishing the data.
The availability status information is no longer sufficient to gauge the service delivery because it is by nature non-deterministic.
> Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded.
Are you suggesting result accuracy varies with server load?
Stilll relevant over time.
"Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded"
Aha, so the models do degrade under load.
would be interesting to see what scores it's get when it is actually degraded via the status page, it gets degraded pretty often, so there's at least something to compare or to know at what point Anthropic declares degradation
Why I do not believe this shows Anthropic serves folks a worse model:
1. The percentage drop is too low and oscillating, it goes up and down.
2. The baseline of Sonnet 4.5 (the obvious choice for when they have GPU busy for the next training) should be established to see Opus at some point goes Sonnet level. This was not done but likely we would see a much sharp decline in certain days / periods. The graph would look like dominated by a "square wave" shape.
3. There are much better explanations for this oscillation: A) They have multiple checkpoints and are A/B testing, CC asks you feedbacks about the session. B) Claude Code itself gets updated, as the exact tools version the agent can use change. In part it is the natural variability due to the token sampling that makes runs not equivalent (sometimes it makes suboptimal decisions compared to T=0) other than not deterministic, but this is the price to pay to have some variability.
I believe the science, but I've been using it daily and it's been getting worse, noticeably.
Any chance you’re just learning more about what the model is and is not useful for?
I dunno about everyone else but when I learn more about what a model is and is not useful for, my subjective experience improves, not degrades.
Not when the product is marketed as a panacea.
Is it possible that your expectations are increasing, not that the model is getting worse?
Possible, though you eventually run into types of issues that you recall the model just not having before. Like accessing a database or not following the SOP you have it read each time it performs X routine task. There are also patterns that are much less ambiguous like getting caught in loops or failing to execute a script it wrote after ten attempts.
4. The graph starts January 8.
Why January 8? Was that an outlier high point?
IIRC, Opus 4.5 was released late november.
People were away for the holidays. What do you want them to do?
Or maybe, juste maybe, that's when they started testing…
Wayback machine has nothing for this site before today, and article is "last updated Jan 29".
A benchmark like this ought to start fresh from when it is published.
I don't entirely doubt the degradation, but the choice of where they went back to feels a bit cherry-picked to demonstrate the value of the benchmark.
Which makes sense, you gotta wait until you get enough data before you can communicate on the said data…
If anything it's coherent with the fact that they very likely didn't have data earlier than January the 8th.
> 1. The percentage drop is too low and oscillating, it goes up and down.
How do you define “too low”, they make sure to communicate about the statistical significance of their measurements, what's the point if people can just claim it's “too low” based on personal vibes…
Lack of transparency as regards "thinking power"-consistency is a big gripe of mine with LLM providers. It's even worse with ChatGPT and the like. E.g. I had to learn the hard way that at >45k input tokens ChatGPT 5.2 Thinking Extended bumps its intelligence down so hard that it can't follow basic instructions (or it somehow truncates the input, losing the instructions). It sucks to lose confidence in an otherwise great tool. I would 100x prefer being forced to back-off, or getting a straight-no, than getting silently downgraded. Transparency is a big deal.
Sounds like you ran into the Maximum Effective Context Window: https://arxiv.org/abs/2509.21361?context=cs.AI
Simply search user prompts for curse words and then measure hostility sentiment. User hostility rises as agents fail to meet expectations.
Maybe im overlooking something obvious but how do you 'simply' scan the content of Claude users their prompts?
GP was making a joke, but Anthropic could implement this if they wanted to. Not a bad metric actually if you can measure it cheaply enough.
I uh might be skewing that as I generally just use a lot of curse words with Claude by default
I feel bad about it but sometimes it's so daft, I can't even xD
It's not my fault, they set high standards!
I'm glad I'm not the only one.
One time I cussed Claude out so hard that it actually quit his doom-loop and fixed the thing.
It's the only time cussing worked, though.
Or there are global events that stress people out .. or their expectations change over time. Not that simple ;)
there are many times where I just do it myself and it thinks it did well.
There was a moment about a week ago where Claude went down for about an hour. And right after it came back up it was clear a lot of people had given up and were not using it.
It was probably 3x faster than usual. I got more done in the next hour with it than I do in half a day usually. It was definitely a bit of a glimpse into a potential future of “what if these things weren’t resource constrained and could just fly”.
I had that exact same feeling during the US holidays where I got to enjoy 2x usage limits and everything just seemed to work well
I had terrible results during the holidays -- it wasn't slow but it was clear they were dealing with the load by quantizing in spots because there were entire chunks of days when the results from it were so terrible I gave up and switched to using Gemini or Codex via opencode.
Noticed the exact same thing a few days ago. So much so that I went on twitter and HN to search for “claude speed boost” to see if there was a known new release. Felt like the time I upgraded from a 2400 baud modem to a 14.4 as a kid - everything was just lightning fast (for a brief shining moment).
I would also regret it if they become that fast; right now I can really take a moment to enjoy the hard work the model is doing for me.
Wouldn't be surprised if they slowly start quantizing their models over time. Makes it easier to scale and reduce operational cost. Also makes a new release have more impact as it will be more notably "better" than what you've been using the past couple of days/weeks.
I don't think so. There are other knobs they can tweak to reduce load that affect quality less than quantizing. Like trimming the conversation length without telling you, reducing reasoning effort, etc.
It sure feels like they do this. They claim they don't, but using it every day for 5-10 hours a day. You notice when something changes.
This last week it seems way dumber than before.
I would be surprised tbh.
Anthropic does not exactly act like they're constrained by infra costs in other areas, and noticeably degrading a product when you're in tight competition with 1 or 2 other players with similar products seems like a bad place to start.
I think people just notice the flaws in these models more the longer they use them. Aka the "honeymoon-hangover effect," a real pattern that has been shown in a variety of real world situations.
I haven't noticed much difference in Claude, but I swear gemini 3 pro preview was better in the first week or two and later started feeling like they quantized it down to hell.
Oooff yes I think that is exactly the kind of shenanigans they might pull.
Ultimately I can understand if a new model is coming in without as much optimization then it'll add pressure to the older models achieving the same result.
Nice plausible deniability for a convenient double effect.
Benchmarks like ARG AGI are super price correlated and cheap to run. I think it's very easy to prove that the models are degrading.
FYI the MarginLab Claude Code degradation tracker is showing a statistically significant ~4% drop in SWE-Bench-Pro accuracy over the past month
I really like the idea, but a "±14.0% significance threshold" is meaningless here.
The larger monthly scale should be the default, or you should get more samples.
Could you elaborate what you think the problems are? I guess they should be using some form of multiple comparison correction?
The daily scale is not statistically significant and is meaningless. You should lower the confidence interval by either increasing the scale or the evaluations.
Codex is doing better. Why is everyone silent on Codex? https://marginlab.ai/trackers/codex/
Does this use a claude subscription or key, and has the account been used for anything else that day?
On HN a few days ago there was a post suggesting that Claude gets dumber throughout the day: https://bertolami.com/index.php?engine=blog&content=posts&de...
What makes the level they chose a “baseline,” against which it would be appropriate to do statistical tests?
I am using API mode, and it's clear that there are times when the Claude model just gives up. And it is very noticeable because the model just does the most dumb things possible.
"You have a bug in line 23." "Oh yes, this solution is bugged, let me delete the whole feature." That one-line fix I could make even with ChatGPT 3.5 can't just happen. Workflows that I use and are very reproducible start to flake and then fail.
After a certain number of tokens per day, it becomes unusable. I like Claude, but I don't understand why they would do this.
Robbing Peter to pay Paul. They are probably resource-constrained, and have determined that it's better to supply a worse answer to more people than to supply a good answer to some while refusing others. Especially knowing that most people probably don't need the best answer 100% of the time.
> Especially knowing that most people probably don't need the best answer 100% of the time.
More: probably don't know if they've got a good answer 100% of the time.
It is interesting to note that this trickery is workable only where the best answers are sufficiently poor. Imagine they ran almost any other kind of online service such email, stock prices or internet banking. Occasionally delivering only half the emails would trigger a customer exodus. But if normal service lost a quarter of emails, they'd have only customers who'd likely never notice half missing.
I encountered the same situation too; Claude has 'become lazy'.
This is why I run my own models. All the inference providers do sneaky things behind the scenes. They will limit the output tokens, turn off attention layers, lower reasoning, or just use a completely different model. I'm actually surprised that Claude Code experienced this, as I've experienced this the least from API and coding agents.
Does it benchmark the underlying model (Opus 4.5) or the Claude Code harness? If the latter, I would love to see CC versions involved.
I would be curious to see how it fares against a constant harness.
There were threads claiming that Claude Code got worse with 2.0.76, with some people going back to 2.0.62. https://github.com/anthropics/claude-code/issues/16157
So it would be wonderful to measure these.
Claude Code. They mention they are using Claude Code's CLI in the benchmark, and Claude Code changes constantly.
I wouldn't be surprised if what this is actually benchmarking is just Claude Code's constant system prompt changes.
I wouldn't really trust this to be able to benchmark Opus itself.
First off, this is a cool project, look forward to some interesting insights.
I would suggest adding some clarification to note that longer measures like the 30-day pass rate are raw data only, while the statistically-significant labels apply only to changes.
Maybe something like: "Includes all trials; significance labels apply only to confidence in change vs. baseline."
This strategy seems inspired by TikTok's approach for retaining new uploaders.
TikTok used to give new uploaders a visibility boost (i.e., an inflated number of likes and comments) on their first couple of uploads, to get them hooked on the service.
In Anthropic/Claude's case, the strategy is (allegedly) to give new users access to the premium models on sign-up, and then increasingly cut the product with output from cheaper models.
Yes, but the difference is TikTok didn't sell a particular service version.
Anthropic did sell a particular model version.
I KNEW I WASN'T CRAZY
I’m sure there is not enough data here for this to be statistically significant (it seems to oscillate too much and not show real trends or step changes) - BUT
If this measure were hardened up a little, it would be really useful.
It feels like an analogue to an employee’s performance over time - you could see in the graphs when Claude is “sick” or “hungover”, when Claude picks up a new side hustle and starts completely phoning it in, or when it’s gunning for a promotion and trying extra hard (significant parameter changes). Pretty neat.
Obviously the anthropomorphising is not real, but it is cool to think of the model’s performance as being a fluid thing you have to work with, and that can be measured like this.
I’m sure some people, most, would prefer that the model’s performance were fixed over time. But come on, this is way more fun.
Very interesting. I would be curious to understand how granular these updates are being applied to CC + what might be causing things like this. I feel like I can notice a very small degradation but have compensated with more detailed prompts (which I think, perhaps naively, is offsetting this issue).
> more detailed prompts (which I think, perhaps naively, is offsetting this issue).
Is exacerbating this issue ... if the load theory is correct.
The chart would benefit from having weekends highlighted. Or from another chart averaged by weekday.
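A weekday average would be easy to compute from the daily data; here's a minimal pandas sketch, assuming a hypothetical CSV of daily results with "date" and "pass_rate" columns (column names are mine, not the tracker's):

    # Hypothetical input: one row per daily run with columns date, pass_rate.
    import pandas as pd

    df = pd.read_csv("daily_pass_rates.csv", parse_dates=["date"])
    df["weekday"] = df["date"].dt.day_name()

    # Mean pass rate per weekday, Monday through Sunday.
    order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
    print(df.groupby("weekday")["pass_rate"].mean().reindex(order))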
Why is this happening?
They're "optimizing" costs wherever possible - reducing compute allocations, quantizing models, doing whatever they can to reduce the cost per token, but vehemently insisting that no such things are occurring, that it's all in the users' heads, and using the weaseliest of corporate weasel speak to explain what's happening. They insist it's not happening, then they say something like "oh, it happened but it was an accident", then they say "yes, it's happening, but it's actually good!" and "we serve the same model day by day, and we've always been at war with Eastasia."
They should be transparent and tell customers that they're trying not to lose money, but that'd entail telling people why they're paying for a service they're not getting. I suspect it's probably not legal to do a bait and switch like that, but this is pretty novel legal territory.
I have absolutely no inside knowledge, but I think it's not a bad assumption: it's costly to run the models, so when they release a new model they absorb that cost and give each user more raw power; once they've captured the new users and the wow factor, they start cutting costs by reducing the capacity they provide to users. Rinse and repeat.
There are frequently claims that Anthropic is somehow diluting or dumbing down models in some subtle way. Unfortunately it’s tough to validate these claims without a body of regularly checked evals. This test set should hopefully help settle whether Anthropic is actually making changes under the hood or whether the changes are all in people’s heads.
It’s entirely possible it’s not happening, and this phenomenon of “model degradation” is just user hype meeting reality.
https://www.anthropic.com/engineering/a-postmortem-of-three-...
>>> We never reduce model quality due to demand, time of day, or server load. The problems our users reported were due to infrastructure bugs alone.
Just ignore the continual degradation of service day over day, long after the "infrastructure bugs" have reportedly been solved.
Oh, and I've got a bridge in Brooklyn to sell ya, it's a great deal!
> We never reduce model quality due to demand, time of day, or server load
Forgive me, but as a native English speaker, this sentence says exactly one thing to me: we _do_ reduce model quality, just not for these listed reasons.
If they don't do it, they could put a full stop after the fifth word and save some ~~tokens~~ time.
I have yet to experience any degradation in the coding tasks I use to evaluate Opus 4.5, but I did see a rather strange and reproducible worsening in prompt adherence on non-coding tasks since the third week of January.
Very simple queries, even those easily answered via regular web searching, have begun to consistently return inaccurate results with Opus 4.5, despite the same prompts previously yielding accurate results.
One task I had thought was fully saturated, since most recent releases had no issues solving it, was to request a list of material combinations for fabrics used in bag construction that utilise a specific fabric base. In the last two weeks, Claude has consistently and reproducibly provided results which deviate from the requested fabric base, making the results inaccurate in a way that a person less familiar with the topic may not notice instantly. There are other queries of this type, on other topics I am nerdily familiar with to a sufficient degree to notice such deviations from the prompt (motorcycle-history queries, for example), so I can say this behaviour isn't limited to the topic of fabrics and bag construction.
Looking at the reasoning traces, Opus 4.5 even writes down the correct information, yet somehow provides an incorrect final output anyways.
What makes this so annoying is that in coding tasks, with extensive prompts that require far greater adherence to very specific requirements in a complex code base, Opus 4.5 does not show such a regression.
I can only speculate about what may lead to such an experience, but for non-coding tasks I have seen regression in Opus 4.5, whereas for coding I did not. Not saying there is none, but I wanted to point it out, as such discussions are often primarily focused on coding, where I find it can be easier to see potential regressions where there are none as a project goes on and tasks become inherently more complex.
My coding benchmarks are a series of very specific prompts modifying a few existing code bases in some rather obscure ways, with which I regularly check whether a model severely deviates from what I'd seen previously. Each run starts with a fresh code base and some fairly simple tasks, then gets increasingly complex, with the later prompts not yet being successfully implemented by any LLM I have gotten to test. Partly that originated from my subjective experience with LLMs early on: I found a lot of things worked very well, but then, as the project went on and I tried more involved things the model struggled with, I felt like the model was overall worse, when in reality what had changed were simply the requirements and task complexity as the project grew and the easier tasks had already been completed. In this type of testing, Opus 4.5 this week got as far as, and provided results as good as, it did in December. Of course, past regressions were limited to specific users, so I am not saying that no one is experiencing reproducible regressions in code output quality, merely that I cannot reproduce them in my specific suite.
I've noticed a degradation in Opus 4.5, and also in Gemini-3-Pro. For me, it was a sudden, rapid decline in adherence to specs in Claude Code. On an internal benchmark we developed, Gemini-3-Pro also declined dramatically, going from being clearly beyond every other model (as benchmarks would lead you to believe) to being quite mediocre: delivering mediocre results in chat queries and also missing the mark in coding.
I didn't "try 100 times" so it's unclear if this is an unfortunate series of bad runs on Claude Code and Gemini CLI or actual regression.
I shouldn't have to benchmark this sort of thing but here we are.
I definitely noticed a degradation; it feels like it regressed by a generation.
> We model tests as Bernoulli random variables and compute 95% confidence intervals around daily, weekly, and monthly pass rates. Statistically significant differences in any of those time horizons are reported.
Doesn't really work like that: if you run a 95% test every day, false positives pile up across days (multiple comparisons). I'd remove the "statistically significant" labelling because it's misleading.
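A toy simulation makes the point (assumed numbers, not the tracker's actual setup: a fixed true pass rate and an independent 50-task run each day):

    # Even with NO real change in the model, checking each day against a 95%
    # interval flags at least one "significant" day in most months.
    import math
    import random

    random.seed(0)
    TRUE_P, N_TASKS, Z = 0.6, 50, 1.96  # assumed constants for illustration

    def daily_pass_rate() -> float:
        return sum(random.random() < TRUE_P for _ in range(N_TASKS)) / N_TASKS

    half_width = Z * math.sqrt(TRUE_P * (1 - TRUE_P) / N_TASKS)

    TRIALS = 1000
    false_alarm_months = sum(
        any(abs(daily_pass_rate() - TRUE_P) > half_width for _ in range(30))
        for _ in range(TRIALS)
    )
    print(f"Months with a 'significant' day despite no change: {false_alarm_months / TRIALS:.0%}")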
Would love to see this idea expanded to every alleged SoTA model currently in production. Any speculation as to why this degradation occurs?
Anecdote, I don't have any proof and it's just a feeling. But in the afternoon in GMT+1, compared to the morning/midday, there seems to be a change in the quality of responses, which seems to line up with when the US wakes up. I consistently get (what feels like) worse responses in both Codex and Claude Code in the afternoon/night compared to morning/midday, so much so that I usually give up and try the same prompt the next morning, with better results. But I guess it might just as well be me being more tired at night than in the morning; as I said, I haven't measured this.
It’s the afternoon slump. The AI needs a cup of coffee and to doomscroll for half an hour!
Or a load balancing technique :) Either way, it kicks me off to do other things so maybe it isn't so bad after all.
In medicine there is a concept of reporting adverse effects of medication or interventions, which are then collectively studied for public health [MedWatch][VAERS][EudraVigilance] and in academia. We should have something like that for all coding agents (and agents in other fields too), given how widely they're deployed and their effect on "health" in general (not only human). Call it the AI "health of things" benchmark.
I would imagine a sort of hybrid of the qualities of volunteer efforts like Wikipedia, new problems like Advent of Code, and benchmarks like this. The goal? To study, as a collective effort, the effects of usage across the many areas where AI is used.
[MedWatch](https://www.fda.gov/safety/medwatch-fda-safety-information-a...)
[VAERS](https://www.cdc.gov/vaccine-safety-systems/vaers/index.html)
[EudraVigilance](https://www.ema.europa.eu/en/human-regulatory-overview/resea...)
Pretty sure someone at Google, OpenAI, and Anthropic met up at a park, leaving their phones in their cars, and had a conversation agreeing that in January 2026 they were all going to silently degrade their models.
They were fighting an arms race that was getting incredibly expensive and realized they could get away with spending less electricity and there was nothing the general population could do about it.
Grok/Elon was left out of this because he would leak this idea at 3am after a binge.
Finally someone did it! We need this for all models.
That will be great if there's RSS support.
Any chance we can get something like this for Codex CLI? That'd be cool to compare.
My personal conspiracy theory is that they choose who to serve a degraded model to based on social graph analysis and sentiment analysis, maximizing for persuasion while minimizing compute.
IMO this strategy seems inspired by TikTok's approach for retaining new uploaders.
TikTok used to give new uploaders a visibility boost (i.e., an inflated number of likes and comments) on their first couple of uploads, to get them hooked on the service.
In Anthropic/Claude's case, the strategy is (allegedly) to give new users access to the premium models on sign-up, and then increasingly cut the product with output from cheaper models.
Of course, your suggestion (better service for users who know how to speak Proper English) would be the cherry on top of this strategy.
From what I've seen on HackerNews, Anthropic is all-in on social media manipulation and social engineering, so I suspect that your assumption holds water.
Sounds more like a sound business plan than a conspiracy theory.
It sounds like fraud to me
Does it say anywhere in their terms of service that they guarantee the quality of the model, or promise not to modify it?
https://www.anthropic.com/legal/consumer-terms
https://www.anthropic.com/legal/commercial-terms
This is probably entirely down to subtle changes to CC prompts/tools.
I've been using CC more or less 8 hrs/day for the past 2 weeks, and if anything it feels like CC is getting better and better at actual tasks.
Edit: Before you downvote, can you explain how the model could degrade WITHOUT changes to the prompts? Is your hypothesis that Opus 4.5, a huge static model, is somehow changing? Master system prompt changing? Safety filters changing?
Honest, good-faith question.
Is CC getting better, or are you getting better at using it? And how do you know the difference?
I'm an occasional user, and I can definitely see improvements in my prompts over the past couple of months.
I agree with you, it's personally hard to tell.
For me I've noticed it getting nothing but better over the past couple months, but I've been working on my workflows and tooling.
For example, I used to use plan mode and would put everything in a single file and then ask it to implement it in a new session.
Switching to the 'superpowers' plugin with its own skills to brainstorm and write plans and execute plans with batches and tasks seems to have made a big improvement and help catch things I wouldn't have before. There's a "get shit done" plugin that's similar that I want to explore as well.
The code output always looks good to me for the most part, though, and I've never thought that it's getting dumber or anything, so I feel like a lot of the improvements I see come down to a skill issue on my part as I try to use everything. Obviously it doesn't help that there's a new way to do things every two weeks.
Good-faith answer: I can't be certain. But I've been using CC since its release, and Cursor before that (and actually going all the way back to GPT3 to do codegen in the Playground). After getting used to the CC workflow, the way that I use it has been pretty consistent. To be specific, I use basically the same AGENTS.md with small modifications for each project, and I live almost exclusively in Plan mode and the best model (currently Opus 4.5).
My initial prompting is boilerplate at this point, and looks like this:
(Explain overall objective / problem without jumping to a solution)
(Provide all the detail / file references / past work I can think of)
(Ask it "what questions do you have for me before we build a plan?")
And then go back and forth until we have a plan.
Compared to my work with CC six months ago, it's just much more capable, able to solve more nuanced bugs, and less likely to generate spaghetti code.
The easiest way would be to quantize the model, and serve different quants based on the current demand. Higher volumes == worse quant == more customers served per GPU
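As a toy illustration of the mechanism (not Anthropic's actual serving stack, just a minimal numpy sketch of symmetric int8 rounding), quantization trades a little per-weight precision for less memory and faster kernels:

    # Round-trip a made-up weight vector through int8 and measure the error.
    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # toy "weights"

    scale = np.abs(w).max() / 127.0                 # one scale for the whole tensor
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    w_dq = w_q.astype(np.float32) * scale           # what inference actually sees

    print("max abs error: ", float(np.abs(w - w_dq).max()))
    print("mean abs error:", float(np.abs(w - w_dq).mean()))

Whether any provider actually swaps quants under load is exactly the kind of thing a tracker like this could help surface.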
That's why benchmarks are useful. We all suffer from the shortcomings of human perception.
Benchmarks' shortcomings are no worse... they inevitably measure something that is only close to the thing you actually care about, not the thing itself. It's entirely plausible that this decreased benchmark score is because Anthropic's initial prompting of the model was overtuned to the benchmark, and as they gain more experience with real-world use they are changing the prompt to do better at that and consequently worse at the benchmark.
I wonder how best we can measure the usefulness of models going forward.
Thumbs up or down? (could be useful for trends) Usage growth from the same user over time? (as an approximation) Tone of user responses? (Don't do this... this is the wrong path... etc.)
Benchmarks measure what they measure. But your subjective experience also matters.
I was going to ask: are all other variables accounted for? Are we really comparing apples to apples here? Still worth doing, obviously, as it serves as a good e2e evaluation, just for curiosity's sake.