What I've learned running multi-agent workflows:
>use the expensive models for planning/design and the cheaper models for implementation
>stick with small/tightly scoped requests
>clear the context window often and let the AGENTS.md files control the basics
> By 50,000 tokens, your conversation’s costs are probably being dominated by cache reads.
Yeah, it's a well-known problem. Every AI company is working on ways to deal with it, one way or another, with clever data center design, and/or clever hardware and software engineering, and/or with clever algorithmic improvements, and/or with clever "agentic recursive LLM" workflows. Anything that actually works is treated like a priceless trade secret. Nothing that can put competitors at a disadvantage will get published any time soon.
There are academics who have been working on it too, most notably Tri Dao and Albert Gu, the key people behind FlashAttention and SSMs like Mamba. There are also lots of ideas out there for compressing the KV cache. No idea if any of them work. I also saw this recently on HN: https://news.ycombinator.com/item?id=46886265 . No idea if it works but the authors are credible. Agentic recursive LLMs look most promising to me right now. See https://arxiv.org/abs/2512.24601 for an intro to them.
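To make the quadratic claim concrete, here is a rough cost model of an agent loop in which every turn re-sends the full history and the prior prefix is served from the prompt cache. The prices and per-turn token counts are made-up round numbers, not any provider's actual rates:

```python
# Hypothetical per-million-token prices and per-turn sizes, purely illustrative.
CACHE_READ = 0.30   # $/M tokens for cached prefix reads
INPUT = 3.00        # $/M tokens for uncached input
OUTPUT = 15.00      # $/M tokens for output

def conversation_cost(turns: int, new_in: int = 2_000, out: int = 1_000) -> float:
    """Cost of an agent loop where every turn re-sends the full history.

    Each turn adds `new_in` fresh input tokens and `out` output tokens;
    everything from earlier turns is assumed to hit the prompt cache.
    """
    cost, history = 0.0, 0
    for _ in range(turns):
        cost += history * CACHE_READ / 1e6   # re-read the whole cached prefix
        cost += new_in * INPUT / 1e6         # new tool output / user text
        cost += out * OUTPUT / 1e6           # model response
        history += new_in + out              # history grows linearly per turn...
    return cost                              # ...so total cost grows quadratically

for t in (10, 20, 40):
    print(t, "turns:", round(conversation_cost(t), 4))
# Doubling the turn count roughly quadruples the cache-read share of the bill.
```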
> Some coding agents (Shelley included!) refuse to return a large tool output back to the agent after some threshold. This is a mistake: it's going to read the whole file, and it may as well do it in one call rather than five.
Disagree with this: IMO the primary reason these still need to exist is for when the agent messes up (e.g. reads a file that is too large, like a bundle file), or when you run a grep command in a large codebase and end up hitting way too many files, overloading context.
Otherwise lots of interesting stuff in this article! Having a precise calculator was very useful for thinking about how many things we should be putting into an agent loop to get a cost optimum (and not just a performance optimum) for our tasks, which is something that's been pretty underserved.
I think that's reasonable, but then they should have the ability for the agent to, on the next call, override it. Even if it requires the agent to have read the file once or something.
In the absence of that you end up with what several of the harnesses ended up doing, where an agent will use a million tool calls to very slowly read a file in like 200-line chunks. I think they _might_ have fixed it now (or my agent harness might be fixing it for me), but Codex used to do this and it made it unbelievably slow.
You’re describing peek.
An agent needs to be able to peek before determining “Can I one shot this or does it need paging?”
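For concreteness, a minimal sketch of a peek-style tool: return size metadata plus a head sample so the agent can decide between a one-shot read and paging. The function shape is hypothetical, not any particular harness's API:

```python
from pathlib import Path

def peek(path: str, head_lines: int = 40, max_full_bytes: int = 64_000) -> dict:
    """Return enough metadata for the agent to decide: one-shot read or paging?

    Hypothetical tool shape -- not any specific harness's API.
    """
    data = Path(path).read_text(errors="replace")
    lines = data.splitlines()
    return {
        "path": path,
        "bytes": len(data),
        "lines": len(lines),
        "head": "\n".join(lines[:head_lines]),
        "fits_in_one_call": len(data) <= max_full_bytes,
    }

# The agent checks `fits_in_one_call` and either requests the whole file or asks
# for specific line ranges, instead of paging blindly in 200-line chunks.
```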
> when you run a grep command in a large codebase and end up hitting way too many files, overloading context.
On the other hand, I despise that it automatically pipes things through output-limiting things like `grep` with a filter, `head`, `tail`, etc. I would much rather it try to read a full grep and then decide to filter-down from there if the output is too large -- that's exactly what I do when I do the same workflow I told it to do.
Why? Because piping through output-limiting things can hide the scope of the "problem" I'm looking at. I'd rather see the scope of that first so I can decide if I need to change from a tactical view/approach to a strategic view/approach. It would be handy if the agents could do the same thing -- and I suppose they could if I'm a little more explicit about it in my tool/prompt.
In my experience this is what Claude 4.5 (and 4.6) basically does, depending on why it's grepping in the first place. It'll sample the header, do a line count, etc. This is because the agent can't backtrack mid-'try to read full file'. If you put the 50,000 lines into the context, they are now in the context.
The brain trims its context by forgetting details that do not matter.
LLMs will have to eventually cross this hurdle before they become our replacements
the quadratic curve makes sense but honestly what kills us more is the review cost - AI generates code fast but then you're stuck reading every line because it might've missed some edge case or broken something three layers deep. we burn more time auditing AI output than we save on writing it, and that compounds. the API costs are predictable at least
> AI generates code fast but then you're stuck reading every line because it might've missed some edge case or broken something three layers deep
If the abstraction that the code uses is "right", there will be hardly any edge cases, or anything to break three layers deep.
Even though I am clearly an AI-hater, for this very specific problem I don't see the root cause in these AI models, but in the programmers who don't care about code quality and thus fail to brutally reject code that is not of exceptional quality.
I mean in theory yes, good abstractions solve a lot - but in practice you're rarely starting from a clean slate. you're integrating with third-party APIs that have weird edge cases, working with legacy code that wasn't designed for what you're doing now, dealing with requirements that change mid-implementation. even with great abstractions the real world bleeds through. and AI doesn't know which abstractions are 'right' for your specific context, it just pattern-matches what looks reasonable. so you end up reviewing not just for bugs but to make sure it's not subtly incompatible with your architecture
What surprises me is that this obvious inefficiency isn't competed out of the market. I.e. this is clearly such a suboptimal use of time, and yet lots of companies do it and don't get competed out by other ones that don't do this.
I think the issue is everyone's stuck in the same boat - the alternative to using AI and spending time reviewing is just writing it yourself, which takes even longer. so even if it's not a net win, it's still better than nothing. plus a lot of companies aren't actually measuring the review overhead properly - they see 'AI wrote 500 lines in 2 minutes' and call it a productivity win without tracking the 3 hours spent debugging it later. the inefficiency doesn't get competed out because everyone has the same constraints and most aren't measuring it honestly
Short term gets faster more competitive results than long term.
> AI generates code fast but then you're stuck reading every line because it might've missed some edge case or broken something three layers deep
I imagine that in the future this will be tackled with a heavily test-driven approach and tight regulation of what the agent can and cannot touch. So frequent small PRs over big ones. Limit folder access to only those that need changing. Let it build the project. If it doesn't build, no PR submissions allowed. If a single test fails, no PR submissions allowed. And the tests will likely be the first if not the main focus in LLM PRs.
I use the term "LLM" and not "AI" because I notice that people have started attributing LLM related issues (like ripping off copyrighted material, excessive usage of natural resources, etc) to AI in general which is damaging for the future of AI.
yeah test-driven constraints help a lot - we've been moving that direction too, basically treating the agent like a junior dev who needs guard rails. the build+test gates catch the obvious stuff. but the trickier part is when tests pass but the code still isn't what you wanted - like it works but takes a fundamentally wrong approach, or adds unnecessary complexity. those are harder to catch with automation. re: LLM vs AI terminology - fair point, though I think the ship has sailed on general usage. most people just say AI to mean 'the thing that writes code' regardless of what's under the hood
[dead]
I disagree. I used to spend most of my time writing code, fixing syntax, thinking through how to structure the code, looking up documentation on how to use a library.
Now I first discuss with an AI Agent or ChatGPT to write a thorough spec before handing it off to an agent to code it. I don’t read every line. Instead, I thoroughly test the outcome.
Bugs that the AI agent would write, I would have also written. An example is unexpected data that doesn't match expectations. Can't fault the AI for those bugs.
I also find that the AI writes more bug free code than I did. It handles cases that I wouldn’t have thought of. It used best practices more often than I did.
Maybe I was a bad dev before LLMs but I find myself producing better quality applications much quicker.
> Example is unexpected data that doesn’t match expectations. Can’t fault the AI for those bugs.
I don't understand, how can you not fault AI for generating code that can't handle unexpected data gracefully? Expectations should be defined, input validated, and anything that's unexpected should be rejected. Resilience against poorly formatted or otherwise nonsensical input is a pretty basic requirement.
I hope I severely misunderstood what you meant to say because we can't be having serious discussions about how amazing this technology is if we're silently dropping the standards to make it happen.
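As a concrete illustration of "expectations should be defined, input validated, and anything unexpected rejected", a minimal sketch (the field names are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Order:
    sku: str
    quantity: int

def parse_order(payload: dict) -> Order:
    """Reject anything that doesn't match the declared expectations
    instead of silently passing it downstream. Field names are invented."""
    sku = payload.get("sku")
    quantity = payload.get("quantity")
    if not isinstance(sku, str) or not sku.strip():
        raise ValueError("sku must be a non-empty string")
    if not isinstance(quantity, int) or isinstance(quantity, bool) or quantity <= 0:
        raise ValueError("quantity must be a positive integer")
    return Order(sku=sku.strip(), quantity=quantity)
```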
yeah you're spot on - the whole "can't fault AI for bugs" mindset is exactly the problem. like, if a junior dev shipped code that crashed on malformed input we'd send it back for proper validation, why would we accept worse from AI? I keep seeing this pattern where people lower their bar because the AI "mostly works" but then you get these silent failures or weird edge case explosions that are way harder to debug than if you'd just written defensive code from the start. honestly the scariest bugs aren't the ones that blow up in your face, it's the ones that slip through and corrupt data or expose something three deploys later
> I don't understand, how can you not fault AI for generating code that can't handle unexpected data gracefully?
Because I, the spec writer, didn't think of it. I would have made the same mistake if I wrote the code.
If you'd write the code yourself, you'd be much more likely to remember to handle those cases as well.
> Now I first discuss with an AI Agent or ChatGPT to write a thorough spec before handing it off to an agent to code it. I don’t read every line. Instead, I thoroughly test the outcome.
This is likely the future.
That being said: "I used to spend most of my time writing code, fixing syntax, thinking through how to structure the code, looking up documentation on how to use a library.".
If you are spending a lot of time fixing syntax, have you looked into linters? If you are spending too much time thinking about how to structure the code, how about spending some days coming up with some general conventions, or simply using existing ones?
If you are getting so much productivity from LLMs, it is worth checking if you were simply unproductive relative to your average dev in the first place. If that's the case, you might want to think, what is going to happen to your productivity gains when everyone else jumps on the LLM train. LLMs might be covering for your unproductivity at the code level, but you might still be dropping the ball in non-code areas. That's the higher level pattern I would be thinking about.
I was a good dev but I did not love the code itself. I loved the outcome. Other devs would have done better on leetcode and they would have produced better code syntax than me.
I’ve always been more of a product/business person who saw code as a way to get to the end goal.
That elite coder who hates talking to business people and who cares more about the code than the business? Not me. I’m the opposite.
Hence, LLMs have been far better for me in terms of productivity.
> I’ve always been more of a product/business person who saw code as a way to get to the end goal.
That's what code always is. A description of how the computer can help someone get to the end goal faster. Devs care a little more about the description, because end goals change and rewriting the whole thing from scratch is costly and time-consuming.
> That elite coder who hates talking to business people and who cares more about the code than the business? Not me. I’m the opposite.
I believe that coder exists only in your imagination. All the good ones I know are great communicators. Clarity of thought is essential to writing good code.
> > That elite coder who hates talking to business people and who cares more about the code than the business? Not me. I’m the opposite.
> I believe that coder exists only in your imagination. All the good ones I know are great communicators. Clarity of thought is essential to writing good code.
Clarity of thought does not make you a good communicator when it comes to communicating with business people. People say about me, for example, that I am really good at communicating with people who are deeply in love with research, but when I present arguments of similar clarity to business people, my often somewhat abstract considerations typically go over their heads.
> I believe that coder exists only in your imagination. All the good ones I know are great communicators. Clarity of thought is essential to writing good code.
I don't think so. These coders exist everywhere. Plenty of great coders are great at writing the code itself but not at the business aspects. Many coders simply do not care about the business or customers part. To them, the act of coding, producing quality code, and the process of writing software is the goal. I.e., these people are most likely to decline building a feature that customers and the business desperately need because it might cause the code base to become harder to maintain. These people will also want to refactor more than build new features. In the past, these people had plenty of value. In the era of LLMs, I think they have less value than business/product-oriented devs.
> Many coders simply do not care about the business or customers part.
These coders may exist, but they are in my experience not that common. Most coders do care about the business or customers part, but think very differently about these aspects than business people, and thus come to very different conclusions how to handle these topics.
In my experience, it's precisely these programmers who are often in conflict with business people:
- because they care about such topics
- because they come to different conclusions than the business people, and
- because these programmers care so much about these business-related topics, they are very vocal and sometimes confrontational with their opinions.
In other words: coders who barely care about these business-related aspects are often much easier to handle for business-minded people.
You have way more trust in test suites than I do. How complex is the code you’re working with? In my line of work most serious bugs surface in complex interactions between different subsystems that are really hard to catch in a test suite. Additionally in my experience the bugs AI produces are completely alien. You can have perfect code for large functions and then somewhere in the middle absolutely nonsensical mistakes. Reviewing AI code is really hard because you can’t use your normal intuitions and really have to check everything meticulously.
If they're hard to catch with a comprehensive suite of tests, what makes you think you can catch them by hand coding?
My context window is larger than an LLM's and I remember more implicit contracts between different systems.
A great deal of thinking about the code, which you can only do if you're very familiar with it. Writing the code is trivial. I spend nearly all my work hours thinking about edge cases.
Even if I vibe coded an app without looking at every line, I'd still be very familiar with how the app works and should work. It's just a different level of abstraction. It doesn't mean I'm not thinking about edge cases if I'm vibe coding. In fact, I might think about them more, since I have more time to think about them without having to write the code.
The review cost problem is really an observability problem in disguise.
You shouldn't need to read every line. You should have test coverage, type checking, and integration tests that catch the edge cases automatically. If an AI agent generates code that passes your existing test suite, linter, and type checker, you've reduced the review surface to "does this do what I asked" rather than "did it break something."
The teams I've seen succeed with coding agents treat them like a junior dev with commit access gated behind CI. The agent proposes, CI validates, human reviews intent not implementation. The ones struggling are the ones doing code review line-by-line on AI output, which defeats the purpose entirely.
The real hidden cost isn't the API calls or the review time - it's the observability gap. Most teams have no idea what their agents are actually doing across runs. No cost-per-task tracking, no quality metrics per model, no way to spot when an agent starts regressing. You end up flying blind and the compounding costs you mention are a symptom of that.
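A minimal sketch of the kind of per-run tracking described above: one row per agent run, so cost-per-task and pass-rate-per-model can be queried later. The schema is an illustration, not a standard:

```python
import csv
from dataclasses import dataclass, asdict

@dataclass
class AgentRun:
    task_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    tests_passed: bool
    wall_clock_s: float

def log_run(run: AgentRun, path: str = "agent_runs.csv") -> None:
    """Append one row per agent run so cost-per-task and pass-rate-per-model
    can be aggregated later. The fields here are assumptions, not a standard."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(asdict(run).keys()))
        if f.tell() == 0:          # new/empty file: write the header once
            writer.writeheader()
        writer.writerow(asdict(run))

log_run(AgentRun("TASK-123", "model-a", 84_000, 6_200, 0.41, True, 312.5))
```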
> You should have test coverage, type checking, and integration tests that catch the edge cases automatically.
You should assume that if you are going to cover edge cases, your tests will be tens to hundreds of times as big as the code tested. It is the case for several database engines (MariaDB has 24M of C++ in the sql directory and 288M of tests in the mysql-test directory), and it was the case when I developed a VHDL/Verilog simulator. And not everything can be covered with type checking; many things, but not all.
AMD had hundreds of millions of test cases for its FPU, and formal modeling caught several errors [1].
[1] https://www.cs.utexas.edu/~moore/acl2/v6-2/INTERESTING-APPLI...
SQLite used to have 1100 LOC of tests per one LOC of C code; the multiplier is smaller now, but it is still big.
> You shouldn't need to read every line. You should have test coverage, type checking, and integration tests that catch the edge cases automatically.
Because tests are always perfect, catch every corner case, and even detect all unusual behaviour they are not testing for? Seems unrealistic. But it explains the sharp rise of AI slop and self-inflicted harm.
That's a lovely idea but it's just not possible to have tests that are guaranteed to catch everything. Even if you can somehow cover every single corner case that might ever arise (which you can't), there's no way for a test to automatically distinguish between "this got 2x slower because we have to do more work and that's an acceptable tradeoff" and "this got 2x slower because the new code is poorly written."
I'd absolutely want to review every single line of code made by a junior dev because their code quality is going to be atrocious. Just like with AI output.
Sure, you can go ahead and just stick your head in the sand and pretend all that detail doesn't exist, look only at the tests and the very high level structure. But, 2 years later you have an absolutely unmaintainable mess where the only solution is to nuke it from orbit and start from scratch, because not even AI models are able to untangle it.
I feel like there are really two camps of AI users: those who don't care about code quality and implementation, only intent. And those who care about both. And for the former camp, it's usually not because they are particularly pedantic personalities, but because they have to care about it. "Move fast and break things" webapps can easily be vibe coded without too much worry, but there are many systems which cannot. If you are personally responsible, in monetary and/or legal aspects, you cannot blame the AI for landing you in trouble, just as much as a carpenter cannot blame his hammer for doing a shit job.
Nice article. I think a key part of the conversation is getting people to start thinking in terms of evals [1] and observability but it's been quite tough to combat the hype of "but X magic product just solves what you mentioned as a concern for you".
You'd think cost is an easy talking point to help people care but the starting points for people are so heterogeneous that it's tough to show them they can take control of this measurement themselves.
I say the latter because the article is a point in time, and if they don't have a recurring observation around this, some aspects may radically change depending on the black-box implementations of the integrations they depend on (or even the pricing strategies).
[1] https://ai-evals.io/
The cache gets read at every token generated, not at every turn on the conversation.
Depends on which cache you mean. The KV Cache gets read on every token generated, but the prompt cache (which is what incurs the cache read cost) is read on conversation starts.
What's in the prompt cache?
The prompt cache caches KV Cache states based on prefixes of previous prompts and conversations. Now, for a particular coding agent conversation, it might be more involved in how caching works (with cache handles and so on), I'm talking about the general case here. This is a way to avoid repeating the same quadratic cost computing over the prompt. Typically, LLM providers have much lower pricing for reading from this cache than computing again.
Since the prompt cache is (by necessity, this is how LLMs work) a prefix of the prompt, if you have repeated API calls in some service, there are a lot of savings possible by organizing queries to have less commonly varying things first, and more varying things later. For example, if you included the current date and time as the first data point in your call, that would force a recomputation every time.
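A small sketch of that ordering principle: keep the stable, rarely-changing content at the front so the provider's prefix cache keeps hitting, and push volatile content (dates, per-request data) to the end. The prompt contents and function name are placeholders:

```python
import datetime

SYSTEM_PROMPT = "You are a coding agent. Follow the project conventions below."
PROJECT_RULES = "..."  # large, rarely-changing block: a good cache candidate

def build_messages(task: str) -> list[dict]:
    """Stable content first so the prefix cache keeps hitting across calls;
    volatile content (date, task) last so it only invalidates the tail."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT + "\n" + PROJECT_RULES},
        {"role": "user", "content": f"Current date: {datetime.date.today()}\nTask: {task}"},
    ]
```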
Way too much. This has got to be the most expensive and most lacking in common sense way to make software ever devised.
I'm not sure, but I think cached read costs are not the most accurately priced. If you consider your costs to be what you pay when consuming an API endpoint, then the answer will be 50k tokens, sure. But if you consider how much it costs the provider, cached tokens probably have a way higher margin than (the probably negative margin of) input and output inference tokens.
Most caching is done without hints from the application at this point, but I think some APIs are starting to take hints or explicit controls for keeping state associated with specific input tokens in memory, so these costs will go down. In essence, you don't really reprocess the input tokens at inference. If you own the hardware, it's quite trivial to infer one output token at a time with no additional cost: if you have 50k input tokens and you generate one output token, it's not like you have to "reinfer" the 50k input tokens before you output the second token.
To put it in simple terms, the time it takes to generate the millionth output token is the same as for the first output token.
This is relevant in an application I'm working on where I check the logprobs and don't always choose the most likely token (for example by implementing a custom logit_bias mechanism client-side), so you can infer one output token at a time. This is not quite possible with most APIs, but if you control the hardware and use (virtually) zero-cost cached tokens, you can do it.
So, bottom line: cached input tokens are virtually free naturally (unless you hold them for a long period of time); the price of cached-input APIs is probably due to the lack of API negotiation as to what inputs you want to cache. As APIs and self-hosted solutions evolve, we will likely see the cost of cached inputs massively drop to almost zero. With efficient application programming, the only accounting should be for output tokens and system prompts. Your output tokens shouldn't be charged again as inputs, at least not more than once.
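A rough sketch of the one-token-at-a-time pattern described here, with a client-side logit bias applied over returned logprobs. The `complete` function is a toy stand-in for a self-hosted completion endpoint, not a real SDK call:

```python
def complete(prompt: str) -> dict:
    """Toy stand-in for a self-hosted completion endpoint that returns
    next-token logprobs. Here it always proposes the same candidates."""
    return {"top_logprobs": {" yes": -0.4, " no": -1.2, " maybe": -2.5}}

def generate_with_bias(prompt: str, bias: dict[str, float], steps: int = 3) -> str:
    """Greedy decode one token per call, applying a client-side logit bias
    before picking. Only economical if cached prefix tokens are cheap."""
    out = prompt
    for _ in range(steps):
        scores = {t: lp + bias.get(t, 0.0)
                  for t, lp in complete(out)["top_logprobs"].items()}
        out += max(scores, key=scores.get)   # pick the biased argmax, not the raw one
    return out

print(generate_with_bias("Q: proceed?", bias={" no": +2.0}))
```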
While some efficiencies could be gained from better client-server negotiation, the cost will never be 0. It isn't 0 even in "lab conditions", so it can't be 0 at scale. There are a few misconceptions in your post.
> the time it takes to generate the Millionth output token is the same as the first output token.
This is not true, even if you have the kv cache "hot" in vram. That's just not how transformers work.
> cached input tokens are almost virtually free naturally
No, they are not in practice. There are pure engineering considerations here: how do you route, when do you evict the KV cache, where do you evict it to (RAM/NVMe), how long do you keep it, etc. At the scale of oAI/goog/anthropic these are not easy tasks, and the cost is definitely not 0.
Think about a normal session. A user might prompt something, wait for the result, re-prompt (you hit "hot" cache) and then go for a coffee. They come back 5 minutes later. You can't keep that in "hot" cache. Now you have to route the next message in that thread to a) a place where you have free "slots"; b) a place that can load the kv cache from "cold" storage and c) a place that has enough "room" to handle a possible max ctx request. These are not easy things to do in practice, at scale.
Now consider 100k users doing basically this, all day long. This is not free and can't become free.
Caching might be free, but I think making caching cost nothing at the API level is not a great idea either considering that LLM attention is currently more expensive with more tokens in context.
Making caching free would price "100000 token cache, 1000 read, 1000 write" the same as "0 token cache, 1000 read, 1000 write", whereas the first one might cost more compute to run. I might be wrong at the scale of the effect here though.
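A back-of-the-envelope illustration of why a hit on a large cache is not the same workload as a small context: each generated token still attends over the whole cached prefix. The bytes-per-token figure is a made-up round number, not any specific model's:

```python
def kv_read_per_output_token(cached_ctx: int, kv_bytes_per_token: int = 160_000) -> int:
    """Rough memory traffic per generated token: every output token attends over
    the whole cached context. kv_bytes_per_token is an illustrative round figure."""
    return cached_ctx * kv_bytes_per_token

for ctx in (0, 1_000, 100_000):
    gb = kv_read_per_output_token(ctx) / 1e9
    print(f"{ctx:>7} cached tokens -> ~{gb:.1f} GB of KV read per output token")
# A 100k-token cache hit still does ~100x the per-token attention work of a
# 1k-token context, which is one reason cache reads aren't priced at zero.
```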
GPU VRAM has an opportunity cost, so caching is never free. If that RAM is being used to hold KV caches in the hope that they'll be useful in future, but you lose that bet and you never hit that cache, you lost money that could have been used for other purposes.
This matches my experience running coding agents at scale. The cached token pricing is indeed somewhat artificial - in practice, for agent workflows with repeated context (like reading the same codebase across multiple tasks), you can achieve near-zero input costs through strategic caching. The real cost optimization isn't just token pricing but minimizing the total tokens flowing through the loop via better tool design.
Are you hosting your own infrastructure for coding agents? At least from first glance, sharing actual codebase context across compacts / multiple tasks seems pretty hard to pull off with good cost-benefit unless you have vertical integration from the inference all the way to the coding agent harness.
I'm saying this because the current external LLM providers like OpenAI tend to charge quite a bit for longer-term caching, plus the 0.1x cache read cost multiplied by the number of LLM calls. So I doubt context sharing would actually be that beneficial: you won't need all the repeated context every time, and caching context results in longer context for each agentic task, which might increase API costs by more overall than you save by caching.
Very awesome to see these numbers and to see this explored like this. Nice job, exe.dev.
TFA is talking about being quadratic in dollar cost as the conversation goes on, not quadratic in time complexity as n gets larger.
Edit to add: I see you are a new account and all your comments thus far are of a similar format, which seems highly suspicious. In the unlikely event you are a human, please read the hacker news guidelines https://news.ycombinator.com/newsguidelines.html
> Instead of feeding 500 lines of tool output back into the next prompt
Applies to everything with LLMs.
Somewhere along the way, most people seem to have got the idea that "more text == better understanding", whereas reality seems to be the opposite: the fewer tokens you can give the LLM, with only the absolute essentials, the better.
The trick is to find the balance, but the "more == better" assumption that many users seem to operate under seems to be making things worse, not better.
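As an illustration of "only the absolute essentials", here is a sketch of trimming a noisy tool output down to the lines the model actually needs before it goes back into context. The patterns and limits are arbitrary choices:

```python
import re

def trim_test_output(raw: str, max_lines: int = 40) -> str:
    """Keep only failure lines and summary lines from a noisy test run before
    putting it back into context. Patterns and limits are arbitrary choices."""
    keep = [
        line for line in raw.splitlines()
        if re.search(r"(FAIL|ERROR|Traceback|assert)", line) or line.startswith("=")
    ]
    if len(keep) > max_lines:
        keep = keep[:max_lines] + [f"... ({len(keep) - max_lines} more matching lines)"]
    return "\n".join(keep) or raw[-2_000:]  # fall back to the tail if nothing matched
```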
Another new LLM slop account on HN..
> Too little and the agent loses coherence.
Obviously you don't have to throw the data away: if the initial summary was missing some important detail, the agent can ask for additional information from a subthread/task/tool call.
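One way to wire that up, sketched with invented names: the subtask returns a short summary plus a handle, and the agent calls back with the handle only if the summary turns out to be missing a needed detail:

```python
import uuid

_ARTIFACTS: dict[str, str] = {}  # full outputs parked outside the context window

def finish_subtask(full_output: str, summary: str) -> dict:
    """Return the condensed result plus a handle for later retrieval."""
    handle = str(uuid.uuid4())[:8]
    _ARTIFACTS[handle] = full_output
    return {"summary": summary, "handle": handle}

def expand(handle: str, query: str | None = None) -> str:
    """Called by the agent only when the summary was missing a needed detail."""
    full = _ARTIFACTS[handle]
    if query:  # optionally filter the stored output instead of returning all of it
        return "\n".join(l for l in full.splitlines() if query.lower() in l.lower())
    return full
```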
128k tokens sounds great until you see the bill