Cursor Introduces Composer 2.5

https://twitter.com/cursor_ai/status/2056415413077233983

Kudos to the team. Please consider making the model available via API!

> Composer 2.5 is built on the same open-source checkpoint as Composer 2, Moonshot's Kimi K2.5.

Really nice to see they're giving credit to the company, and I am optimistic that Kimi K open models will soon outperform Opus models.

Only because last time they tried to hide it lol

I kind of want to try it, to see if and how far they can take an open model and improve it, but I really don’t miss the Cursor user experience. Constant UI changes, half-baked features, smaller and smaller limits, useless AI change attribution; I think I’ll wait for others to report if it’s any good.

I noticed recently that they keep opening their “Agents” window even when the project was last opened in the VSCode fork window, in the hope that I’ll just continue working there, even though the UI is totally different and missing things I need.

For a professional tool, it’s getting egregious how little respect they have for my workflows and flow state, the way they keep moving things around, changing iconography and flipping UI switches.

It’s clearly being run by someone who comes from a social app or sales app growth-hacking background.

Yeah I have a soft spot for Cursor because it was my first tool that unlocked huge productivity with AI, but I avoid doing anything there now.

Should try their CLI!

Isn't there a CLI version of Cursor by now?

It's a bit better than the VSCode fork, but still much worse than competition:

- lags constantly,

- if you type while it's generating you'll get missed inputs,

- 'plan mode' doesn't clear context before starting work,

- you can't directly edit the plan, you can only ask the bot to do it,

- you can't immediately whitelist commands, only accept once or allow all.

Yes

https://cursor.com/cli

Damn, do I feel the UI changes being a pain point.

It’s a near-constant regression in my workflows. “Multiple agents” got destroyed recently, and the new interface for it, some sort of command, isn’t as good or reliable. Then you’ve got modals everywhere[1] and truncated bits (like long branch names) that make it insanely frustrating to use.

They’re constantly changing the UI without actually improving it at all. I’ll likely cancel it and use opencode for personal stuff with Deepseek and only use it at work because I have to. There was a time when I appreciated the harness but it’s becoming less useful, or at least noticeable, over time… all the while the actual UI becomes substantially more painful and awkward to use (like @ in the “agents” window being completely unable to find a file because it’s some sort of “global” scope).

One thing that surprises me about this whole segment is that JetBrains hasn’t eaten these folks’ lunch. Their IDEs are leagues better than VSCode, but their AI integration is awful by comparison (and the bar is low). I can’t even see how much of the context window I have left.

[1] It’s insane that I have to answer questions in a tiny input box I cannot resize. Let alone the fact that the text area I type prompts into cannot be resized either. It truly feels like the UI/UX is done by people without any experience.

> Truly feels like the UI/UX is done by people

To me it feels like it's done entirely by an LLM, starting from the product vision.

Good point.

One of the things I've come to appreciate about CLI tools like Codex or Claude Code is that the interface is so limited that every feature they release is constrained by the same UX limitations, whereas those "funkier" IDEs change from month to month, giving me further fatigue.

If these benches from their site hold up (they likely won't)...

Wouldn't this quickly compress AI revenue by something like 15x?

If they really have an Opus 4.7 (high) equivalent at 1/16 the cost, wouldn't this significantly affect all the current capex and planning?

Maybe they are getting Elon to cover the cost.

The way I have read their benchmark results is that they trained a model to work insanely well in their coding workflow. It’s not a general purpose model.

One of the surprisingly hardest problems to solve is to get a model to use the tools you give it access to.

AI revenue has been going up while the cost per token has been rapidly falling. The Jevons paradox applies here. The cheaper software is, the more software is written. There is not a finite demand for software.
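A minimal sketch of the Jevons-paradox argument, assuming a hypothetical constant-elasticity demand curve (quantity = k * price^(-elasticity)). The numbers are made up for illustration and are not measured AI market data. The point is only that falling price per token and rising total revenue are compatible when demand is elastic (elasticity > 1), and incompatible when it is inelastic.

```python
def revenue(price: float, elasticity: float, k: float = 1.0) -> float:
    """Total spend under a hypothetical constant-elasticity demand curve.

    Quantity demanded is k * price ** -elasticity, so
    revenue = price * quantity = k * price ** (1 - elasticity).
    """
    quantity = k * price ** -elasticity
    return price * quantity


# Illustrative only: halve the price per token and compare total spend.
old_price, new_price = 1.0, 0.5
for elasticity in (0.5, 1.0, 1.5):
    ratio = revenue(new_price, elasticity) / revenue(old_price, elasticity)
    print(f"elasticity={elasticity}: revenue changes by x{ratio:.2f}")
```

With elasticity 1.5, halving the price raises total spend by about 1.41x; with 0.5 it drops to about 0.71x. Whether real demand for software is elastic enough is exactly what the thread is arguing about.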

> AI revenue has been going up while the cost per token has been rapidly falling

Every model release since, what, GPT-4 has been a straight price increase? When was the last time a new flagship model decreased prices compared to the previous one?

1. GPT-4 has gotten 6x cheaper over its evolution (from initial release to Turbo to 4o). Maybe you meant "only since 4o, and only since its final release". Alas.

2. We are not interested in how different model naming schemes relate to prices; we are interested in the capabilities. So if you want to learn something about price development, you need comparable levels of capability, and then look at the prices. 4o is not comparable to 5.5 in the first regard. It is (according to the benchmarks) maybe more comparable to the current 5 nano, which is 98% cheaper.

Opus 4.5 became significantly cheaper directly per token

You are right, I forgot about that! I think my point still stands: price per token is not decreasing for frontier capabilities; in fact, it's increasing.

token efficiency

Not seeing that either. I really tried using Opus 4.7 today, and it ended up at $50 for the same kind of thing that came out to $25 last week with Opus 4.6.

Each model is different and nothing should be taken for granted; run your evals for your use cases. I'm not using Opus 4.7 for almost anything. I've seen very good improvements in GPTs since 5.2, and Opus 4.5 to 4.6 was quite an upgrade.

Models consume more tokens than ever for the same tasks.

This is conjecture. There is a reason both OpenAI and Anthropic refuse to comment on inference costs. If they were falling so much, they would use it to brag. I really don't understand why so many people keep repeating it without any actual data for the frontier models.

Apart from that, I'm not sure if focusing on tokens is even a good idea, because they are so different from model to model. I'd almost consider them a red herring now.

We could look at tasks instead. Is there anything even remotely suggesting that your typical task you give an LLM now costs less in inference than before?
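To make the "look at tasks, not tokens" point concrete, here is a minimal sketch of cost per task, assuming it is simply tokens consumed times price per token for input and output. Every token count and price below is a hypothetical placeholder, not a measurement of any real model.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One task run; all figures used below are hypothetical placeholders."""
    input_tokens: int
    output_tokens: int
    usd_per_m_input: float   # USD per million input tokens
    usd_per_m_output: float  # USD per million output tokens

    def cost_usd(self) -> float:
        # Cost per task = tokens consumed x price per token, for each direction.
        return (self.input_tokens * self.usd_per_m_input
                + self.output_tokens * self.usd_per_m_output) / 1_000_000


# Hypothetical: the newer model is 20% cheaper per token but burns 2.5x the tokens.
older = Run(input_tokens=2_000_000, output_tokens=100_000,
            usd_per_m_input=5.0, usd_per_m_output=25.0)
newer = Run(input_tokens=5_000_000, output_tokens=250_000,
            usd_per_m_input=4.0, usd_per_m_output=20.0)
print(f"older: ${older.cost_usd():.2f}, newer: ${newer.cost_usd():.2f}")
```

In this made-up case the per-token price falls, yet the task costs twice as much ($12.50 vs $25.00), which is why per-token price alone says little about what a typical task costs.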

The problem with this is that we do not know the actual cost. For all we know they might be pulling an Anthropic. Subsidizing costs to get users, then increasing them later on.

They're offering a model based on Kimi K2.5 for $0.50/M input and $2.50/M output, while the cheapest third-party provider on OpenRouter charges $0.40/M input and $1.90/M output (https://openrouter.ai/moonshotai/kimi-k2.5). Those third-party providers have little incentive to subsidize their customers, so Cursor probably has a margin >20% on their inference cost.
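The margin estimate above can be checked with a few lines of arithmetic. This sketch assumes, as the comment does, that the cheapest third-party OpenRouter price is an upper bound on Cursor's actual inference cost; the prices are the ones quoted in the comment and may have changed since.

```python
def gross_margin(our_price: float, cost_ceiling: float) -> float:
    """Fraction of our price left over if inference costs at most cost_ceiling."""
    return (our_price - cost_ceiling) / our_price


# USD per million tokens, as quoted in the comment (may be out of date).
cursor = {"input": 0.50, "output": 2.50}
cheapest_openrouter = {"input": 0.40, "output": 1.90}

for kind in ("input", "output"):
    margin = gross_margin(cursor[kind], cheapest_openrouter[kind])
    print(f"{kind}: at least {margin:.0%} margin")
```

That works out to roughly 20% on input and 24% on output. If third-party providers themselves run at a profit, Cursor's true cost is lower and its margin higher than these lower bounds.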

The real money furnace is the training, not just of models that get released, but also experimental training runs that fail to move benchmarks and are quietly thrown away. E.g. Cursor claim that 85% of the compute for Composer 2.5 comes from additional training on top of Kimi K2.5, where I'm not sure how they determined that, but it can't have been cheap. Then they say "Together with SpaceXAI, we're training a significantly larger model from scratch, using 10x more total compute."

So yes, they're probably attempting to replicate the Anthropic playbook of paying a large upfront cost for a very good model, and then rapidly acquiring paying customers, hoping that the inference margin will be enough to cover the training cost.

this thing is so awesome on fast mode, so far i am impressed, some of its observations feel similar to opus.

i use gpt 5.5 and opus 4.7 a lot every day, if i can get good results at this speed, hopefully the usage level holds up on my team plan haha

> compress ai revenue like 15x

that roughly just puts it on par with OpenAI and Anthropic subscriptions in terms of pricing per token

I'm not sure that's the case; it seems like bringing capabilities up and costs down merely serves to induce more demand.

I hope people soon wake up to the fact that they use user data for model fine tuning.

The model is (like Composer 2) based on Kimi K2.5 and they claim SOTA performance for 1/10th of the cost. The tweet also mentions that they've started a new model from scratch on Colossus 2 (xAI/SpaceX Cluster). Really impressive how they've made this jump from being called the vscode fork with no moat just a couple of months ago.

> Really impressive how they've made this jump from being called the vscode fork with no moat just a couple of months ago.

Impressive, yes. But they still don't have a moat...

I am not sure we should dismiss what they have today. Nobody has yet come close to a full-package IDE that works well for coding. Is that not a moat? It is easy for me to discount it in my head, thinking that I could build something myself, but between autocomplete and their workflow for agent use, it feels like they have some tangible moat emerging.

If we ignore cost (which is kinda hard to ignore), I feel Codex kinda does it for me. Sure, it's not really an editor, but I find I don't need that _that much_, and it's easy to launch an external editor (they actually have the feature).

The ironic thing is that half a year ago, after trying factory.ai, I thought a chat-first interface was a stupid idea that would never work.

Have you tried Zed?

I haven’t tried Cursor, so don’t know how they compare, but I like Zed a lot.

Anyway, would love to see a comparison from someone who has used a recent version of each.

A few years ago I tried Zed when it was still pretty early, but eventually settled on Cursor. I gave Zed another shot a few days ago because Cursor’s worktree support still feels pretty weak.

In my setup I use multiple agents like Claude Code and Codex, and Zed’s ACP support makes it pretty nice to manage them all as “threads” in one place. Worktree switching also feels much smoother.

Overall the experience was pretty good, but the way the agent and editor are integrated still feels a bit lacking, and tab completion is the big one for me. Cursor’s tab completion is still the best I’ve used.

So now I’m using both. For work that needs a lot of focus and careful iteration, I use Cursor. For things that are easy to split into worktrees and hand off to agents, I use Zed with Claude/Codex.

Isn't a large user base and the data collected from those users a moat of sorts?

A moat is when you have something others can't easily get.

Every MAG 7 / FAANG company already has more users and more data...

That's not a moat.

That's traction.

That's not X.

That's Y.

Been a bit out of the loop.

What's wrong with using very short sentences like 'That's not X. That's Y.'?

Commonly used phrase by LLMs. Gives people slop vibes these days.

I fear the day that large parts of perfectly valid English language and punctuation are off limits for humans to use because LLMs use them too (having learned them from humans), and somebody will always whine and post low effort "slop" comments that are much more annoying and less useful than the slop itself, or even incorrectly whine about human written text that happens to match your hyper-sensitive slop detector.

Plus you are always running the risk of being rude and insulting when incorrectly labeling text actually written by humans as slop — making a jackass of yourself — and opening yourself up to being trolled by humans purposefully inserting em-dashes and catch phrases just to trigger you. That's not clever. That's gullible.

How much cognitive and physical effort and time do you put into trying to figure out if everything you read is slop, then complaining about it? If that's your job or calling in life, you could be easily replaced with AI. Find something more creative to do with your time.

If you really object to low effort slop, and not just relish it as an opportunity to whine, then how about instead of posting low effort whines about slop, you put in the actual effort to do something about it, and rewrite the slop in a way that won't trigger your slop detector, then post that instead, to train AI not to write slop.

Is your problem that it's slop, or that it's AI generated? Because your whining about low effort AI generated slop without contributing to the conversation or addressing the point of the comment you're replying to is just low effort human generated slop.

Please don't post slop while complaining about slop.

Honestly, the data itself is probably worth heaps even if the company itself collapses. Early attention engineering when humans were still in the loop!!!

> Early attention engineering when humans were still in the loop

Exactly. Cursor was the first product used by tons of devs on real codebases. Just the signal "acceptance rate" is huge and can't be easily captured w/ synthetic data.

And it's still just a VSCode fork.

Cursor 3 is a complete rewrite; it's no longer a fork.

How much the RL they are doing really improves Kimi K2.5 remains to be seen. So, right now, the ground truth is that they combined what they had with a strong open-weights model. The RL improvement may be marginal (since many folks report strong results with vanilla K2.6) and may mostly bias the model towards coding tasks: when a model like this is trained to be a generalist, there is a tension between being good at one thing and being good at another, in terms of both SFT and RL. You can see this in the DeepSeek v4 Flash training report, for instance, but it is a known fact. So if you have the GPUs and a decent RL pipeline that does not ruin the model, you can indeed specialize it a bit more for a given task, at the expense of tasks people will not do inside Cursor. But, so far, the measurable reality is that Cursor uses an open-weights model like most could, and the RL story could be partially a marketing move to promote Composer 2.5 rather than a real, strong gain, given that there is no way to verify it and K2.5 was already strong. And we also know that they had to partner to do the training, which is also not good news.

Since the frontier is only 8 months ahead of DeepSeek, it is hard to see how model training can be a moat, as all the tricks are available from open labs in China. You really just need under $100M to bootstrap at this point.

This was the only way forward.

In my opinion cursor actually has one of the best harnesses again at the moment.

They are still a vscode fork with no moat? Like they lost about 70% of users in half a year which goes to show how there is not even the tiniest of moat.

I feel like they've been targeting enterprise pretty hard. I know my company uses them, and the companies that hire us also use Cursor.

All enterprises I know use GitHub Copilot, as they already have Office, Teams, … I wonder how that will change with the recent pricing changes.

I can tell you my company wants nothing to do with them.

Cursor will definitely win the enterprise for coding. Enterprises aren't going to trust a TUI

Why not? That makes no sense to me.

>Really impressive how they've made this jump from being called the vscode fork with no moat just a couple of months ago.

With so much money and compute from SpaceX, it's not so impressive.

why is that part impressive specifically? they got purchased by SpaceX, they have access to infinite compute and cash now.

& now they're still losing all of their users to Claude Code and Codex.

>& now they're still losing all of their users to Claude Code and Codex.

Why pay for Cursor when I can use GLM 5.1, Kimi K2.6, MiniMax M2.7, Xiaomi MiMo V2.5 Pro and Deepseek v4 for cheap and use whatever harness I want, including Claude Code.

It's not like Cursor harness is the best out there.

And even if I want to edit the code, I don't need to run the agent harness in an IDE.

It's still a VsCode fork just now with a Kimi fine tune and still no moat...

I won't debate that, though it turns out none of this mattered when it came to being a successful company, and it kinda makes anyone who tried to roll their own instead of forking look a little silly.

"No moat", well...

How I see this is that it's so important to bundle the model with the right tooling.

Like a racecar: having the best engine doesn't help if the rest of the car lacks other winning properties (reliability, aerodynamics, etc.).

So for Cursor, IMO, they put themselves in a strong position by having both a solid IDE __and__ a solid, cost-efficient model. Those two working great in combination for the task they are designed to solve (coding) is more important than benchmarks.


I doubt it's a brand new model. It's likely just Kimi K2.5 further trained on coding.

They didn't say it's a new model... in fact they said exactly what you just said.

They set themselves up for flak when they use whatever these evals are… They did the same for Composer 2, which was evaled in close competition with frontier models; spoiler alert, it wasn’t even close in practice.

So now 2.5 is supposed to compete with opus 4.7? Sure…

they say it themselves in the post - behavior dimensions "not well captured by existing benchmarks". that was the exact problem with composer 2. not dumber on individual tasks, just bad at session-level decisions like when to stop editing, how much context to carry forward, when to re-read a file vs assume. you don't catch any of that in an isolated eval.

As I have said before in prior composer threads. The proof is in the usage. I am inclined to somewhat believe the results as I use composer and also take the results for the given context. It’s not a general purpose sota model. It’s a model that runs inexpensively in their coding workflow that is creating results similar to opus or gpt.

Well, is that a statement about the quality of Opus 4.7 or about Composer 2.5? :P

A lot of people saying Cursor have no moat. Sure. Neither do OpenAI or Anthropic.

You could say they have a sort of anti-moat (drawbridge?) since you can use their product to create a competitor. But that's true of most dev tools, in a sense.


Seems like a promising and useful model, but it's probably scary how much customer data they fed into it to reach this performance.

It's a bit confusing to me why they'd make this 'fast' version the default, as it appears to be much more expensive than Composer 2. Wasn't it supposed to be a very cheap alternative to SOTA models?

Isn’t it a really cheap alternative to sota models (according to benchmarks)?

OK, this might be weird, but I've moved everyone in my 4-person team to our team plan, and costs seem to have skyrocketed compared to the individual plans. Where before most people spent 20-100 USD each, now the total bill is more like 1k USD. I haven't gone into the details, but it feels like I'm being scammed.

We moved off Cursor and onto Codex + Claude Code. Cost went from multiple thousand per engineer per month to about $500

My company is shifting us from Cursor to Claude due to increased costs.

I did some monitoring: 15 accounts, 300 million input tokens and 200k output tokens, and the 5h quota went to 0 in 7 hours, with 4 parallel tasks.

I think 300 million is too low. For reference, before I could do more than 1 billion under the same conditions.

Check which model you're using.

The fast version of Composer is the default now (which costs ~3x as much).

Keep in mind that I believe there is a larger buffer given to personal plans. If the personal plan gives you 50% extra, on the team plan you might now only get 25% extra.

My Cursor costs skyrocketed recently too.

Full details https://cursor.com/blog/composer-2-5

Thanks! Link belatedly changed above.

I wonder why they didn’t train off Kimi K2.6. I hope it is because they already had a good base, and not because they messed up that relationship.

> and not that they messed up that relationship.

There's nothing to mess up. The license is MIT w/ attribution, and the attribution clause can be easily sidestepped w/o any legal repercussions. The "drama" was simply content creators going nuts over some misunderstandings and poor comms from some kimi related devs.

That's 3.0


Surprised this got pushed off the front page so quickly! It’s exciting to see what the Cursor team has been able to do with significantly fewer resources than the frontier labs.

I do wish they weren’t joining xAI. Something tells me there will be a contingent of researchers that departs Cursor if that merger is consummated.

It set off the flamewar detector, a.k.a. the overheated discussion detector. We'll turn that off.

Thanks, dang! The blog post[1] might be a better source than the twitter thread. Also I regret my typo above (lab -> labs) but too late now!

[1] https://cursor.com/blog/composer-2-5

Thanks! I had been just about to add that maybe the link wasn't the most informative. We've switched it now from https://twitter.com/cursor_ai/status/2056415413077233983.

As for the typo, s's are cheap and I've added one :)

I think anybody will be much better by acquiring a coding plan from Kimi.com and using Kimi K2.6, with whatever harness they like, including Claude Code, instead of paying more for Cursor's version of Kimi K2.5.

Can you please train Qwen 3.5 (like 0.8B to 9B) using the same training techniques?

Will this be Cursor's last dance? LoL

It's always great that more companies are throwing their hat in the ring, especially focusing on value (latency + intelligence + cost)

this feels super bullish on cursor/spacexai's ability to train a frontier level model. could be truly SOTA on coding given that their RL data is this powerful

It's a bit odd that they're not comparing it against Sonnet

I don't think so. They're comparing it to the highest tier available models from Anthropic and OpenAI. Generally speaking, Opus is better than Sonnet in almost every way, so why have the redundancy?

Price to performance?

The tweet specifies that the new model is geared towards long-running tasks, which is what you'd use a model like Opus for anyway.

Did they just upgrade Kimi 2.5 to 2.6?

still uses 2.5

I don't know why their model isn't on OpenRouter yet. They must not have enough capacity to offer it.

Their previous Composer was already marketed as a cheap model capable of competing with SOTA on most tasks. The evals they shared back then backed this up but in my day-to-day usage it fell short across the board. Canceled my cursor subscription and switched to Claude Code a few weeks ago. It has its own shortcomings but in terms of model capability and UX quality Cursor will have a hard time competing in the long term. Elon Musk will be a very good way out for them.

Non-x link: https://cursor.com/blog/composer-2-5 (https://news.ycombinator.com/item?id=48182126)

Congratulations on the launch! I'm interested in trying Cursor but it's very confusing what I should buy. What does the Pro $20 plan get me in usage if I only use Composer 2.5? How fast is the model?

I have used the $20 plan on a daily basis for more than a year now and have yet to exhaust that limit. The plan includes $20 in API costs for non-Cursor premium models and $20 for the Composer and Auto models provided by Cursor themselves.

That said, I am a pretty old-fashioned coder and mostly use LLMs to overcome the blank page problem, which means I review and often rewrite LLM output by hand and avoid prompt loops for a single task.

People who are aiming to not read code any more might find this $20 plan lacking for their needs, however for my needs it fits perfectly.

The limits are probably even higher than that; I seem to get about $100+ of usage on Composer and about $45-50 on non-Composer models.
