I really enjoyed this article. I think the author is precisely right and I've been saying this for a long time. There's a ton of extremely interesting low hanging fruit that can vastly improve the effectiveness of even currently existing models hiding in how we design our agent harnesses; enough to — at least until we hit diminishing returns — make as much or more of a difference than training new models!
I think one of the things that this confirms, for me at least, is that it's better to think of "the AI" as not just the LLM itself, but the whole cybernetic system of feedback loops joining the LLM and its harness. Because, if the harness can make as much if not more of a difference, when improved, as improvements to the model itself, then they have to be really considered equally important. Not to mention the fact that models are specifically reinforcement learned to use harnesses and harnesses are adapted to the needs of models in general or specific models. So they necessarily sort of develop together in a feedback loop. And then in practice, as they operate, it is a deeply intertwined feedback loop where the entity that actually performs the useful work, and which you interact with, is really the complete system of the two together.
I think thinking like this could not only unlock quantitative performance improvements like the ones discussed in this blog post, but also help us conceive of the generative AI project as actually a project of neurosymbolic AI, even if the most capital-intensive and novel aspect is a neural network; and once we begin to think like that, it unlocks a lot of new options and more holistic thinking, and might increase research in the harness area.
If I remember correctly, both the Claude Code and OpenAI Codex "harnesses" have now been improved by their own models.
OpenAI used early versions of GPT-5.3-Codex to debug its own training process, manage its deployment and scaling, and diagnose test results and evaluation data.
The Claude Code team has shipped 22 PRs in a single day and 27 the day before, with 100% of the code in each PR generated entirely by Claude Code.
Also, yes, I'm aware that I use a lot of "it's not just X, it's Y." I promise you this comment is entirely human written. I'm just really tired and tend to rely on more rote rhetorical tropes when I am. Believe me, I wrote like this long before LLMs were a thing.
It didn’t read as AI to me :)
why the long -'s
Because I like them?
reminds me of that one guy complaining that everyone is calling them an AI when AI was trained on their grammar style.
This happened to the female speaker with her voice, which I find terrifying: https://www.youtube.com/watch?v=qO0WvudbO04
how do you make them?
2026 is the year of the harness.
Already made a harness for Claude that keeps plans read/write, not write-once like they are usually implemented. They can modify themselves as Claude works through the task at hand. It also relies on a collection of patterns for writing coding task plans, which evolves by reflection. Everything is designed so I could run Claude in yolo-mode in a sandbox for long stretches of time.
But will the harness build desktop Linux for us?
Once you begin to see the “model” as only part of the stack, you begin to realize that you can draw the line of the system to include the user as well.
That’s when the future really starts hitting you.
Aha! A true cybernetics enthusiast. I didn't say that because I didn't want to scare people off ;)
So deep, your comment. Asking for a friend: how did you manage to get the em dash — on your keyboard?
Em dashes are used often by LLMs because humans use them often. On Mac keyboards it's easily typed. I know this is oversimplifying the situation, but I don't see the usefulness of the constant witch-hunting for allegedly LLM-generated text. For text, we are long past the point where we can differentiate between human-generated and machine-generated. We're even at the point where it gets somewhat hard to identify machine-generated audio and visuals.
On a Mac, it's alt-dash in case you weren't being facetious
Extra pedantic: that’s the en dash, the em dash is option-shift-hyphen
Technically option-shift-dash. Option-dash is an en-dash.
https://joeldueck.com/manually-type-punctuation.html
https://joeldueck.com/ai-is-right-about-em-dashes.html
Does your friend have an iPhone? The default iOS keyboard has automatically converted double dashes into an em dash for at least seven years now.
I use Compose - - - on Linux and my cellphone (Unexpected Keyboard). Mac is Alt-_.
I implemented this hash (read and edit) approach in tilth if you want to test it out.
https://github.com/jahala/tilth
I made "tilth" a few days ago, since I'm consistently trying to get the LLMs to use tools more efficiently and spend less tokens doing it -- original tilth post from Monday: https://news.ycombinator.com/item?id=46952321
Great post. A few choice quotes:
> Often the model isn’t flaky at understanding the task. It’s flaky at expressing itself. You’re blaming the pilot for the landing gear.
> The model is the moat. The harness is the bridge. Burning bridges just means fewer people bother to cross. Treating harnesses as solved, or even inconsequential, is very short-sighted.
> The gap between “cool demo” and “reliable tool” isn’t model magic. It’s careful, rather boring, empirical engineering at the tool boundary.
You’re absolutely right! This isn’t your average engineering advice— it’s like painting the reader a vivid tapestry of the author’s mind.
Please stop; I just can't any more! Yes, I'm absolutely right.
My personal favorite: That’s not a threat. It’s free R&D.
During my first LLM experiments in Emacs using gptel, I also found that the LLM has considerable difficulties changing source code files with the Unix patch tool.
As Emacs has a built-in tree-sitter package, I implemented this same idea. I created gptel tools like tree_sitter_list_nodes, tree_sitter_get_nodes, tree_sitter_update_nodes, tree_sitter_insert_before_node and tree_sitter_insert_after_node. The "list" tool returns a list of AST nodes with first line number, first line content and node hash. The LLM can then use "get" to collect interesting nodes in their entirety and "update" to update a list of nodes identified by hash with new content (var/function bodies).
Worked like a charm.
Sounds interesting, do you have the code to share?
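(Not the commenter's actual Elisp, but for anyone curious what that tool surface can look like, here is a rough Python sketch of the same idea using the standard-library ast module instead of tree-sitter; all names and details below are illustrative assumptions.)

```python
# Illustrative sketch only: the real tools described above are Emacs Lisp
# gptel tools backed by tree-sitter. This approximates the idea for Python
# files with the standard-library `ast` module.
import ast
import hashlib


def _node_hash(source: str, node: ast.AST) -> str:
    """Short content hash identifying one top-level node."""
    segment = ast.get_source_segment(source, node) or ""
    return hashlib.sha1(segment.encode()).hexdigest()[:8]


def list_nodes(source: str):
    """Like a 'list nodes' tool: first line number, first line text, hash per node."""
    tree = ast.parse(source)
    lines = source.splitlines()
    return [{"line": node.lineno,
             "head": lines[node.lineno - 1],
             "hash": _node_hash(source, node)}
            for node in tree.body]


def update_node(source: str, node_hash: str, new_body: str) -> str:
    """Like an 'update nodes' tool: replace the node whose hash matches."""
    tree = ast.parse(source)
    lines = source.splitlines()
    for node in tree.body:
        if _node_hash(source, node) == node_hash:
            lines[node.lineno - 1:node.end_lineno] = new_body.splitlines()
            return "\n".join(lines) + "\n"
    # The file changed since it was listed, so fail instead of guessing.
    raise ValueError("stale or unknown node hash")
```

The point is simply that the model addresses whole syntactic nodes by a short content hash instead of emitting patches, so a stale hash fails loudly rather than silently corrupting the file.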
It’s funny to see where we are on model improvements.
Back when I was maintaining a coding harness around the time of Claude 3.5, we tried hash prefixes, line-number prefixes, and a lot of other approaches to making the model better at selecting edit blocks; ultimately, at least back then, fuzzy string matching won out.
Yes, very similar results here (http://brokk.ai)
We got lines-with-anchors working fine as a replacement strategy; the problem was that when you don't make the model echo what it's replacing, it's literally dumber at writing the replacement. We lost more in test failures + retries than we gained in faster outputs.
Makes sense when you think about how powerful the "think before answering" principle is for LLMs, but it's still frustrating
Shows how much room for improvement there is on the harness level.
Agents waste a lot of tokens on editing, sandboxes, passing info back and forth from tool calls and subagents.
Love the pragmatic mix of content based addressing + line numbers. Beautiful.
Indeed. The biggest waste might be the overuse of MCP for everything. Sure, it makes the initial development easier, but then for every connection you're using a hundred-billion-parameter model to decide how to make the call, when it's usually completely unnecessary and prone to random errors. MCP is the hammer that can make literally everything look like a nail...
I see this ranting against MCP all the time, and I don't get it, maybe I'm missing something. I'm currently using an MCP in Cursor to give agents read-only access to my staging and prod databases, as well as BugSnag's MCP so it can look up errors that happen in those environments. It works great. What should I be using for this if not MCP?
I haven't dug into the article, but your comment reminded me of the Claude Code Superpowers plugin. I find the plugin great, but it's quite "expensive": I use the pay-as-you-go account with CC because I've just been trying it out personally, and the Superpowers plugin spends a lot of money, relative to regular CC, with all the back and forth.
With CC you can do a /cost to see how much your session cost in dollar terms; that's a good benchmark IMO for plugins, .md files for agents, and so on. Minimize the LLM cost the way you'd minimize typical resource usage on a computer, like CPU, RAM, storage, etc.
you can actually go the other way and spend more tokens to solve more complex problems (multi-agent) by letting agents work with smaller problems
The harness matters far more than most people think. This post is about the CORE benchmark, where Opus' score almost doubled when they switched to Claude Code from their own harness. https://x.com/sayashk/status/1996334941832089732
Mario, the creator of the Pi terminal agent, has a great blog post[0]. He talks about how TerminalBench's highest scores come from using the Terminus 2 harness, which uses tmux under the hood.
When I was reading the Opus 4.6 launch post, they mentioned the same thing: their TerminalBench score was based on using Terminus 2 and not CC.
0. https://mariozechner.at/posts/2025-11-30-pi-coding-agent/
Which, IMHO, is why we should be able to change them freely or make our own. Being locked into a specific harness because you pay 20 bucks per month vs. pay-per-use ... is kinda dumb.
The reason Anthropic is pushing the closed harness is that they're not confident in their ability to win on model quality long term, so they're trying to build lock-in. They can capture some additional telemetry by owning the harness as well, but given the amount of data the agent loop already transmits, that borders on unethical spyware (which might be part of the reason they're afraid to open source it).
Ultimately the market is going to force them to open up and let people flex their subs.
> Being locked into a specific harness because you pay 20 bucks per month vs. pay-per-use ... is kinda dumb.
I’ll probably get downvoted for this, but am I the only one who thinks it’s kind of wild how much anger is generated by these companies offering discounted plans for use with their tools?
At this point, there would be less anger and outrage on HN if they all just charged us the same high per-token rate and offered no discounts or flat rate plans.
No, you're not the only one. The outraged entitlement is pretty funny tbh. How dare they dictate that they'll only subsidize your usage if you use their software!!
Also another place where having it change out from underneath you can drastically alter the quality of your work in unexpected ways.
Like most things - assume the "20/100/200" dollar deals that are great now are going to go down the enshittification route very rapidly.
Even if the "limits" on them stay generous, the product will start shifting to prioritize things the user doesn't want.
Tool recommendations are my immediate and near term fear - paid placement for dev tools both at the model level and the harness level seem inevitable.
---
The right route is open models and open harnesses, ideally on local hardware.
> Like most things - assume the "20/100/200" dollar deals that are great now are going to go down the enshittification route very rapidly.
I don’t assume this at all. In fact, the opposite has been happening in my experience: I try multiple providers at the same time and the $20/month plans have only been getting better with the model improvements and changes. The current ChatGPT $20/month plan goes a very long way even when I set it to “Extra High” whereas just 6 months ago I felt like the $20/month plans from major providers were an exercise in bouncing off rate limits for anything non-trivial.
Inference costs are only going to go down from here and models will only improve. I’ve been reading these warnings about the coming demise of AI plans for 1-2 years now, but the opposite keeps happening.
> Inference costs are only going to go down from here and models will only improve. I’ve been reading these warnings about the coming demise of AI plans for 1-2 years now, but the opposite keeps happening.
This also coincides with the frontier labs raising ever larger and larger rounds. If Anthropic IPOs (which I honestly doubt), then we may get a better sense of actual prices in the market, as it's unlikely the markets will keep letting them spend more and more money each year without a return.
At this point subsidizing Chinese open-weights vendors by paying for them is just the right thing to do. Maybe they too might go closed-weights when they become SotA, but they're now pretty close and haven't done it.
I am wondering what kinds of harness are best for GLM, Deepseek, Qwen, Kimi.
OpenCode is great in general. At least one of them is specifically trained on CC - I think it was Qwen - so for those, that should give the best results.
Claude Code works better than OpenCode with GLM models, for me.
The harness is effectively the agent's 'body'. Swapping the brain (model) is good, but if the body (tools/environment) is locked down or inefficient, the brain can't compensate. Local execution environments that standardize the tool interface are going to be critical for avoiding that lock-in.
My personal notes (not the author): it has been way faster performance-wise, which is honestly the biggest improvement over correctness. I've posted https://github.com/can1357/oh-my-pi before, but it didn't seem to gain traction. It's a great little agent.
I've just started messing around with pi, but haven't fully dug in yet. How would you compare oh-my-pi? I see it has a lot of other bells and whistles built in.
Are they portable bit by bit back to pi, or is there enough differences that they can't? how about normal pi extensions, can they be used in omp?
Some of the stuff definitely looks interesting.
The differences are documented, but it's mostly 1:1. I've never used normal pi, but omp is a night-and-day difference; don't forget to run "omp setup python".
I'm into it! This looks like an experimentation platform. OpenCode is beginning to feel like handcuffs. Let me hack!
The logical end state of this line of reasoning is a collective action problem that dooms the frontier lab establishment. You can't devote model capacity to having an attention transformer match nested delimiters or cope with bash and be maximally capable, and you can't mix authentication, authorization, control plane, and data plane into an ill-specified soup and be secure enough for anything that isn't a pilot or a toy.
If you run this out, you realize that the Worse is Better paradox has inverted, it's an arbitrage, and the race is on.
The harness is the model "body", it's weight the cognition. Like in nature they develop together and the iteration of natural selection works at both.
If smaller labs (Zai, Moonshot, DeepSeek, Mistral...) get together and embrace a harness, like OpenCode for example, as a consortium, then just by the power of "evolution across different environments" they might hit the jackpot earlier than the bigger labs.
But they rely on distilling the output of American leader models, which will probably train against their own harnesses.
Someone has to do the baseline training, development, and innovation. it can't be clones all the way down
Why not? Humans are (very nearly) clones all the way down.
Citation needed; SOTA labs surely have technical protections and legalese against using them for training. It's been done in the past, but what indicates this is still the case?
My experience as well. People worry our profession is being reduced to "prompt engineer", but actually I get the feeling that programming will soon be mainly about designing and building harnesses for specific tasks.
Personal opinion: LLMs are definitely not as magical as people think they are; they fill a specific niche of problem-solving, and harnesses are necessary to corral your problem into the niche that they are extremely good at solving.
The more I dive into this space, the more I think that developers will still be in heavy demand—just operating at a different level of abstraction most of the time. We will need to know our CS fundamentals, experience will still matter, juniors will still be needed. It's just that a lot of the time the actual code being generated will come from our little helper buddies. But those things still need a human in the seat to drive them.
I keep asking myself "could my friends and family be handed this and be expected to build what I'm building with it?" and the answer is an immediate "absolutely not". Could a non-technical manager use these tools to build what I'm building? Absolutely not. And when I think about it, it's for the exact same reason it's always been… they just aren't a developer. They just don't "think" in the way required to effectively control a computer.
LLMs are just another way to talk to a machine. They aren't magic. All the same fundamental principles that apply to properly telling a machine what to do still apply. It's just a wildly different mechanism.
That all being said, I think these things will dramatically speed up the pace at which software eats the world. Put LLMs into a good harness and holy shit, it's like a superpower… but to get those superpowers unlocked you still have to know the basics, same as before. I think this applies to all other trades too. If you are a designer, you still have to know what good design is and how to articulate it. Data scientists still need to understand the basics of their trade… these tools just give them superpowers.
Whether or not this assertion remains true in two or three years remains to be seen, but look at the most popular tool: Claude Code is a command line tool! Their GUI version is pretty terrible in comparison. Cursor is an IDE fork of VS Code.
These are highly technical tools requiring somebody who knows file systems, command lines, basic development concepts like compilers, etc. They require you to know a lot of stuff most people simply don't. The direction I think these tools will head is far closer to highly sophisticated dev tooling than general-purpose "magic box" stuff that your parents can use to… I dunno… vibe code the next hit todo app.
I believe you’re arriving at the wrong conclusion because you’re comparing to an opposite instead of to someone slightly worse than you. Will this enable people at the edge to perform like you? That’s the question. Will there be more developers? Will they compete with you?
> LLMs are just another way to talk to a machine. They aren’t magic.
I will still opt for a scriptable shell. A few scripts, and I have a custom interface that can be easily composed. And could be run on a $100 used laptop from ebay.
the harness bottleneck is real - I've been working on ai code security stuff and the biggest issue isn't model capability, it's that most tools treat the output as gospel. they'll take a suggested fix and apply it without checking if it even compiles, let alone if it introduces new vulns. I've seen fixes that patch one CVE but break auth logic entirely.
the edit tool point hits though. when you give the model a better interface to express changes (structured diffs vs free-form patches), error rates drop. but nobody talks about this because benchmarks measure "did it solve the problem" not "how many attempts" or "what's the blast radius when it fails". idk maybe I'm just jaded from debugging too many of these.
Yeah, I invented a similar method for information extraction attribution around 2022. I would place custom markers in a document, unique within the document, so the extraction model could reference them together with the answer and we could locate the source.
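(A minimal sketch of that marker idea as I understand it, not the commenter's actual 2022 implementation; the marker format and function names are made up.)

```python
# Tag each paragraph with a unique marker so an extraction model can cite the
# marker alongside its answer, and we can map the citation back to the exact
# source span. Purely illustrative.
import re


def add_markers(document: str):
    """Prefix each paragraph with a unique [[Mn]] marker; return marked text and an index."""
    index = {}
    marked_parts = []
    for i, para in enumerate(document.split("\n\n")):
        marker = f"[[M{i}]]"
        index[marker] = para
        marked_parts.append(f"{marker} {para}")
    return "\n\n".join(marked_parts), index


def locate(answer_with_citations: str, index: dict) -> list:
    """Resolve any [[Mn]] markers the model cited back to the source paragraphs."""
    cited = re.findall(r"\[\[M\d+\]\]", answer_with_citations)
    return [index[m] for m in cited if m in index]
```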
What was the point of Claude code or Gemini banning the OP? Why would they care about how IDEs use the underlying API?
When you buy a subscription plan, you’re buying use of the harness, not the underlying compute / tokens. Buying those on their own is way more expensive. This is probably because:
* Subscriptions are oversubscribed. They know how much an “average” Claude Code user actually consumes to perform common tasks and price accordingly. This is how almost all subscription products work.
* There is some speculation that there is cooperative optimization between the harness and backend (cache related etc).
* Subscriptions are subsidized to build market share; to some extent the harnesses are “loss leader” halo products which drive the sales of tokens, which are much more profitable.
He wasn't using the regular paid API (i.e., per-token pricing). He was using the endpoints for their subscribed customers (i.e., paid per month and heavily subsidized).
I assume he was using Gemini the same way as he was Claude when I make the following statement.
I don't believe it's exceptionally unique or new that companies will revoke access if you are using an unpublished API that their apps use. I don't see anything wrong with it myself. If you want, pay for normal token use on the published APIs. There is no expectation that you can use an application's APIs that are not explicitly published for external usage, even if you are a paid user.
Indeed, that's why Anthropic, OpenAI and other LLM providers are known to adhere to published APIs to gather the world's data, obeying licensing and ROBOTS.txt.
It's truly disgusting.
I was under the impression that they do obey robots.txt now? There are clearly a lot of dumb agents that don’t, but didn’t think it was the major AI labs.
After 3 years of pirating and scraping the entire world by doing the above, I guess they have everything that they now need or want.
So then it's better to start obeying robots.txt as a ladder pull, gaining a "nicely behaved" image advantage along the way.
Obeying robots.txt (now) is still better than not obeying it, regardless of what they did before.
The alternative is to say that bugs shouldn’t be fixed because it’s a ladder pull or something. But that’s crazy. What’s the point of complaining if not to get people to fix things?
Why does Google/Facebook et al arbitrarily enforce one human per account?
It’s because they want to study you.
They want the data!
>What was the point of Claude code or Gemini banning the OP? Why would they care about how IDEs use the underlying API?
Underscores the importance of sovereign models you can run on the edge, finetune yourself, and run offline. At State of Utopia, we're working on it!
On first principles it would seem that the "harness" is a myth. Surely a model like Opus 4.6/Codex 5.3, which can reason about complex functions and data flows across many files, wouldn't trip up over the top-level function signatures it needs to call?
I see a lot of evidence to the contrary though. Anyone know what the underlying issue here is?
Humans have a demonstrated ability to program computers by flipping switches on the front panel.
Like a good programming language, a good harness offers a better affordance for getting stuff done.
Even if we put correctness aside, tooling that saves time and tokens is going to be very valuable.
Isn't 'the harness' essentially just prompting?
It's completely understandable that prompting in better/more efficient ways would produce different results.
No, it's also a suite of tools beyond what's available in bash, tailored to context management.
One example of where raw IQ diverges from real world results is Gemini 3. The difference is post-training, which I argue is more of a product competency than a pure model IQ competency.
Gemini 3 is arguably the smartest model but no amount of smartness can reason about better agentic coding habits, if that makes sense?
The model's generalized "understanding" and "reasoning" is the real myth; that's what makes us take a step back and offload the process to deterministic computing and harnesses.
So the new implementation always operates at the line level, replacing one or more lines. That's not ideal for some refactorings like rename where search and replace is faster.
Edit: checking oh-my-pi, the model has access to str_replace too, so this is just an additional edit tool.
You forgot to mention your tool does worse for 8/16 LLMs compared to replace?
Problem is, replace has been around for so long, most LLMs are tuned for it now
One of the first things I add to my Claude instructions file is to stop using grep (it's awfully slow) and just use ripgrep instead: you can just type the word you're looking for from the project root and find it all in one shot. Claude likes to go folder by folder with grep and it drives me crazy.
"You're absolutely right!"
At this point I'd take a contract with Anthropic to have Claude code pick better tooling.
Getting banned from Gemini while attempting to improve Gemini is the most Googley thing ever :D Imagine letting your automated "trust and safety" systems run amok so that they ban the top 0.01% of your users with no recourse. Google really knows how to score an own goal.
I really don't understand what in his usage pattern would have triggered that obviously automated ban. Can somebody let me know what they think would be adversarial enough to be considered 'hacking' or similar by a bot?
Filed an issue for codex
https://github.com/openai/codex/issues/11601
I ran into this from the other direction. I built a small SRE agent for my cloud infra and just kind of walked into hand-rolling some of the tools rather than using what exists today. I provided an edit_file tool that felt like it was of reasonable capability, but in practice the agent was regularly 'trying' to do a one line change and submitting PRs that hallucinated 3/4s of the file.
Seeing how bad the results are when you're casually approaching something makes it very evident that it's a topic that can be optimized.
It's underrated how much improving harnesses, not just models, has had to do with the productive use of LLMs at tasks like coding in the last year.
Great work, but concurrency is lost.
With search-replace you could work on separate parts of a file independently with the LLM. Not to mention that with each edit all lines below are shifted, so you now need to provide the LLM with the whole content again.
Have you tested followup edits on the same files?
(not the author) It works fine most of the time; I've been using it alongside an active agent and haven't run into too many noticeable problems. The token savings alone are worth it.
Serializing writes is probably fine and the hashes should only change if you're updating the same line, right?
You probably don't want to use the line number though unless you need to disambiguate
But your write tool implementation can take care of that
I wonder if we'll get to "VI for LLMs" - if the model was trained on using that kind of text navigation and you show context around cursor when it navigates.
Would also be worth having special tokens for this kind of navigation.
I always thought ed would be a perfect match. Line-based instead of having to manage cursor movements.
I bet it’s good enough at VI already
> Treating harnesses as solved, or even inconsequential, is very short-sighted
Is it possible that burning extra tokens is the point, since they get paid more?
Given the fierce competition, I would imagine a better performing model generates more revenue than burning extra tokens
they have pretty fierce competition though, so i doubt this is intentional. my guess is they just have a million things to do and that isn't at the top of the list
That doesn't make sense with subscriptions.
Putting it out there: if any frontier model provider starts allowing any agent to use their $20/month plan, we will all switch to you. We don't want to be forced into 1 harness, we want OAuth, and we want respectable limits without excessive budgets.
Yep, this has been my experience with browser agents as well. One little change in the harness/agentic loop and the model suddenly becomes a whole lot smarter at navigating the web. I was also able to build a better browser agent than 'claude --chrome' in just a few afternoons, just by tweaking the harness.
I feel like Cursor's solution is still the best answer: let the model suggest edits in whatever format it prefers, using as few "extra" tokens as possible, and have a small model figure it out. I don't use Cursor anymore, but when I did, it was impressive how consistently it worked; I think there was a single time it failed. 70b might be overkill though...
Someone should try prompting the same LLM in use, to suggest an edit as a subagent.
Great article, and tbh I thought it would've already been implemented that way. It makes sense to hash and save mainly context; I don't expect them to care about token usage.
How about Kimi though? How can I play with it?
Arguably, I would think that the last year was mainly harness improvement rather than model improvement, but I could be wrong; it just feels like that to me.
I feel the baseline comparison should be relative to the intuitive and simple "line-numbers only" schema.
It's less token heavy than the proposed hash approach, and I don't think frontier LLMs hallucinate line numbers if each line in the context is prefixed with them.
The issue is when the file changed between when the LLM read the file and when it wrote to the file. Just using line numbers will clobber a file if that happens. The hashes prevent that from being an issue.
Point taken.
it starts writing to the wrong part of the file after multiple edits.
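(For readers skimming the thread, here is a rough sketch, mine and not the article's actual implementation, of how the hash-plus-line-number addressing discussed above can reject stale edits instead of clobbering the file; the 2-hex-character hash length is an assumption borrowed from the collision question further down.)

```python
# Sketch: each line is shown to the model with a short content hash, and an
# edit must name both the line number and the hash, so a stale view of the
# file is rejected instead of silently overwriting the wrong content.
import hashlib


def line_hash(line: str) -> str:
    return hashlib.sha1(line.encode()).hexdigest()[:2]  # 2 hex chars (assumption)


def render_for_model(text: str) -> str:
    """What the model reads, e.g. '12|a3| some line of code'."""
    return "\n".join(f"{i + 1}|{line_hash(line)}| {line}"
                     for i, line in enumerate(text.splitlines()))


def apply_edit(text: str, lineno: int, expected_hash: str, new_lines: list) -> str:
    """Replace line `lineno` only if its current hash matches what the model saw."""
    lines = text.splitlines()
    current = lines[lineno - 1]
    if line_hash(current) != expected_hash:
        raise ValueError("file changed since it was read; re-read before editing")
    lines[lineno - 1:lineno] = new_lines
    return "\n".join(lines) + "\n"
```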
This is very nicely done. We have seen the same issue at a higher level of getting separators right when generating multiple files in a single inference call.
My experience exactly! I’ve recently become so tired of the Claude harness that I switched to OpenCode (which is extremely good compared to Claude). However, OpenCode is also tedious to change, and it inherits all the “good stuff,” like treating agents as Markdown files and all the dancing around with hooks/plugins/skills scattered all over the place. Getting stuck again and again, I’ve ultimately come to the conclusion that this must be solved by writing my own damn coding agent, with extensibility that’s acceptable for real-world engineering.
Give Pi[1] a try. Comes pretty barebones out of the box, yet still provides a decent default experience. Extension points are all TypeScript if you want. There are a lot of examples[2] and some 3rd party extensions[3].
I'll point out that if you want permission prompts for certain behavior, you have to add that yourself. There's at least one example.
Edit: Just noticed the article's author is using a fork of Pi.
The harness is where open source should shine. It doesn't require millions of dollars of compute, but the search space is vast and explorable with limited budgets.
really enjoyed reading this, although I'm a dumb farmer and it took me a while to understand lol
Why not just use line numbers?
It forces you to read after every write. E.g., you edit line 15 into two lines; now you need arithmetic for later vs. earlier lines, or you need to read the full file to reindex by line number.
Good point!
I just wonder how unique these hashes will be if only 2 characters. It seems like the collision rate would be really high.
I was wondering the same thing.
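(A quick birthday-bound estimate for the curious; the post's exact hash alphabet isn't restated in this thread, so both a 2-hex-character and a 2-base62-character space are shown here as assumptions.)

```python
# Back-of-the-envelope collision check for 2-character line hashes.
import math


def collision_probability(n_lines: int, hash_space: int) -> float:
    """Birthday approximation: P(at least two of n_lines share a hash value)."""
    return 1 - math.exp(-n_lines * (n_lines - 1) / (2 * hash_space))


for n in (20, 100, 500):
    print(n,
          round(collision_probability(n, 16 ** 2), 3),   # 2 hex chars   -> 256 values
          round(collision_probability(n, 62 ** 2), 3))   # 2 base62 chars -> 3844 values
```

Across a whole file, 2-character hashes collide almost surely, which is presumably why the scheme pairs the hash with a line number: the hash only has to detect whether that particular line changed between read and write, not uniquely identify it within the file.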
I feel a lot of confusion about which coding harness is best and what options to use. Tbh I have mostly used standard aider, and I don't know what the consensus is on this tool.
I feel I want to write my own, and maybe in the future a lot of developers will have custom, highly personalized harnesses, as each user of these models wants to use them in a way that's unique to their brain; much like how Emacs is so great for customization, but one person's Emacs config is often not what another wants, or they only want a subset and then write their own features.
As an aside, what is the feeling on all the various AI coding tools? Does aider suck, are aider-ce/cecli better, or are the bespoke tools for each model, like Claude Code, better?
I use small models, and I like to give them a TOC more than lines; I wonder how that'd stack up against the hashline approach.
> Why bother, you ask? Opus may be a great model, but Claude Code to this day leaks raw JSONL from sub-agent outputs, wasting hundreds of thousands of tokens. I get to say, “fuck it, subagents output structured data now”.
This is why I find the banning of using Claude subscriptions in other harnesses is so heinous. Their harness that they're forcing onto everyone has tons of big issues including wasting massive numbers of tokens. Very much in line with intentionally refusing to adhere to standards in the most IE6 way possible.
I mean, they want to make money, right? CC is a cool tool, but obviously they want you to use the API eventually if you're even remotely a power user; $200/month for all-you-can-eat tokens (well, until some arbitrary limit of the day kicks in) just doesn't make sense when compared to API prices.
In other words, CC should be seen as a software subscription.
The token limit is the same whether used in CC or in other harnesses.
Sure, but then Anthropic loses the possibility to upsell, show ads, telemetry, brag about number of users and how long they use it etc etc. Not necessarily what’s in there today, but what can be in there tomorrow.
They also get the ability to fine-tune backoffs etc. much better, from a purely technical standpoint.
Is there a skill file I can use for these edits?
[dead]
[dead]
I agree with this article completely, nice to see it presented quantitatively.
>re "only" the harness changed
In our experience, AIs are like amnesiacs who can barely remember what they did three minutes ago (their last autonomous actions might still be in their context if you're lucky), with no chance of remembering what they did three days ago. As such, the "harness" determines their entire memory and is the single most important determinant of their outcome.
The best harness is a single self-contained, well-commented, obvious, and tiny code file followed by a plain explanation of what it does and what it's supposed to do, the change request, how you want it to do it (you have to say it with so much force and confidence that the AI is afraid of getting yelled at if it does anything else) and a large amount of text devoted to asking the AI not to break what is already working. Followed by a request to write a test that passes. Followed by asking for its judgment about whether it broke what was already working or not. All in one tiny crisp prompt.
With such a harness, it's able to not break the code one time in twenty. If you use reverse psychology and ask it to do the opposite of what you want, it rises to fifty-fifty odds you'll get what you're trying to do.
Don't believe me? You can watch the livestream (see my previous comments).
I really enjoyed this article. I think the author is precisely right and I've been saying this for a long time. There's a ton of extremely interesting low hanging fruit that can vastly improve the effectiveness of even currently existing models hiding in how we design our agent harnesses; enough to — at least until we hit diminishing returns — make as much or more of a difference than training new models!
I think one of the things that this confirms, for me at least, is that it's better to think of "the AI" as not just the LLM itself, but the whole cybernetic system of feedback loops joining the LLM and its harness. Because, if the harness can make as much if not more of a difference, when improved, as improvements to the model itself, then they have to be really considered equally important. Not to mention the fact that models are specifically reinforcement learned to use harnesses and harnesses are adapted to the needs of models in general or specific models. So they necessarily sort of develop together in a feedback loop. And then in practice, as they operate, it is a deeply intertwined feedback loop where the entity that actually performs the useful work, and which you interact with, is really the complete system of the two together.
I think thinking like this could not only unlock quantitative performance improvements like the ones discussed in this blog post, but also help us conceive of the generative AI project as actually a project of neurosymbolic AI, even if the most capital intensive and a novel aspect is a neural network; and once we begin to think like that, that unlocks a lot of new options and more holistic thinking and might increase research in the harness area.
If I remember, both Claude Code and OpenAI Codex "harnesses" improved themselves now.
OpenAI used early versions of GPT-5.3-Codex to: debug its own training process, manage its deployment and scaling and diagnose test results and evaluation data.
Claude Code have shipped 22 PRs in a single day and 27 the day before, with 100% of the code in each PR generated entirely by Claude Code.
Also, yes, I'm aware that I use a lot of "its not just X, its Y." I promise you this comment is entirely human written. I'm just really tired and tend to rely on more wrote rhetorical tropes when I am. Believe me, I wrote like this long before LLMs were a thing.
It didn’t read as AI to me :)
why the long -'s
Because I like them?
reminds me of that one guy complaining that everyone is calling them an AI when AI was trained on their grammar style.
This happened to the female speaker with her voice, which I find terrifying: https://www.youtube.com/watch?v=qO0WvudbO04
how do you make them?
2026 is the year of the harness.
Already made a harness for Claude to make R/W plans, not write once like they are usually implemented. They can modify themselves as they work through the task at hand. Also relying on a collection of patterns for writing coding task plans which evolves by reflection. Everything is designed so I could run Claude in yolo-mode in a sandbox for long stretches of time.
But will harness build desktop Linux for us?
Once you begin to see the “model” as only part of the stack, you begin to realize that you can draw the line of the system to include the user as well.
That’s when the future really starts hitting you.
Aha! A true cybernetics enthusiast. I didn't say that because I didn't want to scare people off ;)
So deep your comment. Asking for a friend, how did you manage to have the em dash — in your keyboard ?
Em dashes are used often by LLMs, because humans use them often. On mac keyboards its easily typed. I know this is oversimplifying the situation, but I don't see the usefulness of the constant witch-hunting for allegedly LLM-generated text. For text we are long beyond the point, where we can differenciate between human generated and machine generated. We're even at the point, where it gets somewhat hard to identify machine generated audio and visuals.
On a Mac, it's alt-dash in case you weren't being facetious
Extra pedantic: that’s the en dash, the em dash is option-shift-hyphen
Technically option-shift-dash. option-dash is an en-dash.
https://joeldueck.com/manually-type-punctuation.html
https://joeldueck.com/ai-is-right-about-em-dashes.html
Does your friend have an iPhone? The default iOS keyboard has automatically converted double dashes into an emdash for at least seven years now.
I use Compose - - - on Linux and my cellphone (Unexpected Keyboard). Mac is Alt-_.
I implemented this hash (read and edit) approach in tilth if you want to test it out.
https://github.com/jahala/tilth
I made "tilth" a few days ago, since I'm consistently trying to get the LLMs to use tools more efficiently and spend less tokens doing it -- original tilth post from Monday: https://news.ycombinator.com/item?id=46952321
Great post. A few choice quotes:
> Often the model isn’t flaky at understanding the task. It’s flaky at expressing itself. You’re blaming the pilot for the landing gear.
> The model is the moat. The harness is the bridge. Burning bridges just means fewer people bother to cross. Treating harnesses as solved, or even inconsequential, is very short-sighted.
> The gap between “cool demo” and “reliable tool” isn’t model magic. It’s careful, rather boring, empirical engineering at the tool boundary.
You’re absolutely right! This isn’t your average engineering advice— it’s like painting the reader a vivid tapestry of the author’s mind.
Please stop; I just can't any more! Yes, I'm absolutely right.
My personal favorite: That’s not a threat. It’s free R&D.
During my first LLM experiments in Emacs using gptel, I also found that the LLM has considerable difficulties changing source code files with the Unix patch tool.
As Emacs has a built-in tree-sitter package, I implemented this same idea. I created gptel tools like tree_sitter_list_nodes, tree_sitter_get_nodes, tree_sitter_update_nodes, tree_sitter_insert_before_node and tree_sitter_insert_after_node. The "list" tool returns a list of AST nodes with first line number, first line content and node hash. The LLM can then use "get" to collect interesting nodes in their entirety and "update" to update a list of nodes identified by hash with new content (var/function bodies).
Worked like a charm.
Sounds interesting, do you have the code to share.
It’s funny to see where we are on model improvements.
Back when I was maintaining a coding harness around the time of Claude 3.5 we tried hash prefixes we tried line number prefixes we tried a lot of different approaches to making the model better at selecting edit blocks and ultimately at-least then fuzzy string matching won out.
Yes, very similar results here (http://brokk.ai)
We got lines-with-anchors working fine as a replacement strategy, the problem was that when you don't make the model echo what it's replacing, it's literally dumber at writing the replacement; we lost more in test failures + retries than we gained in faster outputs.
Makes sense when you think about how powerful the "think before answering" principle is for LLMs, but it's still frustrating
Shows how much room for improvement there is on the harness level.
Agents waste a lot of tokens on editing, sandboxes, passing info back and forth from tool calls and subagents.
Love the pragmatic mix of content based addressing + line numbers. Beautiful.
Indeed. The biggest waste might be the overuse of MCP for everything. Sure it makes the initial development easier but then for every connection you're using a hundred billion dollar parameter model to decide how to make the call when it's usually completely unnecessary and then prone to random errors. MCP is the hammer that can make literally everything look like a nail...
I see this ranting against MCP all the time, and I don't get it, maybe I'm missing something. I'm currently using an MCP in Cursor to give agents read-only access to my staging and prod databases, as well as BugSnag's MCP so it can look up errors that happen in those environments. It works great. What should I be using for this if not MCP?
i haven't dug into the article but your comment reminded me about the ClaudeCode Superpowers plugin. I find the plugin great but it's quite "expensive", I use the pay-as-you-go account with CC because i've just been trying it out personally and the superpowers plugin spends a lot of money, relative to regular CC, with all the back and forth.
With CC you can do a /cost to see how much your session cost in dollar terms, that's a good benchmark IMO for plugins, .md files for agents, and so on. Minimize the LLM cost in the way you'd minimize typical resource usage on a computer like cpu, ram, storage etc.
you can actually go the other way and spend more tokens to solve more complex problems (multi-agent) by letting agents work with smaller problems
The harness matters far more than most people think. This post about the CORE benchmark where Opus’ score almost doubled when they switched to Claude Code from their own harness. https://x.com/sayashk/status/1996334941832089732
Mario, the creator of Pi terminal agent, has this great blog post[0]. He talks about how TerminalBench's highest scores comes from using the Terminus 2 harness which uses tmux under the hood.
When I was reading the Opus 4.6 launch post, they mentioned the same thing and their TerminalBench score was based on using Terminus 2 and not CC.
0. https://mariozechner.at/posts/2025-11-30-pi-coding-agent/
Which, IMHO, should be why we should be able to change them freely or make our own. Being locked into a specific harness because you pay 20 bucks per month vs. pay-per-use ... is kinda dumb.
The reason Anthropic is pushing on the closed harness is that they're not confident with their ability to win on model quality long term, so they're trying to build lock-in. They can capture some additional telemetry owning the harness as well, but given the amount of data the agent loop already transmits, that borders on unethical spyware (which might be part of the reason they're afraid to open source).
Ultimately the market is going to force them to open up and let people flex their subs.
> Being locked into a specific harness because you pay 20 bucks per month vs. pay-per-use ... is kinda dumb.
I’ll probably get downvoted for this, but am I the only one who thinks it’s kind of wild how much anger is generated by these companies offering discounted plans for use with their tools?
At this point, there would be less anger and outrage on HN if they all just charged us the same high per-token rate and offered no discounts or flat rate plans.
No, you're not the only one. The outraged entitlement is pretty funny tbh. How dare they dictate that they'll only subsidize your usage if you use their software!!
Also another place where having it change out from underneath you can drastically alter the quality of your work in unexpected ways.
Like most things - assume the "20/100/200" dollar deals that are great now are going to go down the enshitification route very rapidly.
Even if the "limits" on them stay generous, the product will start shifting to prioritize things the user doesn't want.
Tool recommendations are my immediate and near term fear - paid placement for dev tools both at the model level and the harness level seem inevitable.
---
The right route is open models and open harnesses, ideally on local hardware.
> Like most things - assume the "20/100/200" dollar deals that are great now are going to go down the enshitification route very rapidly.
I don’t assume this at all. In fact, the opposite has been happening in my experience: I try multiple providers at the same time and the $20/month plans have only been getting better with the model improvements and changes. The current ChatGPT $20/month plan goes a very long way even when I set it to “Extra High” whereas just 6 months ago I felt like the $20/month plans from major providers were an exercise in bouncing off rate limits for anything non-trivial.
Inference costs are only going to go down from here and models will only improve. I’ve been reading these warnings about the coming demise of AI plans for 1-2 years now, but the opposite keeps happening.
> Inference costs are only going to go down from here and models will only improve. I’ve been reading these warnings about the coming demise of AI plans for 1-2 years now, but the opposite keeps happening.
This time also crosses over with the frontier labs raising ever larger and larger rounds. If Anthropic IPO (which I honestly doubt), then we may get a better sense of actual prices in the market, as it's unlikely the markets will continue letting them spend more and more money each year without a return.
At this point subsidizing Chinese open-weights vendors by paying for them is just the right thing to do. Maybe they too might go closed-weights when they become SotA, but they're now pretty close and haven't done it.
I am wondering what kinds of harness are best for GLM, Deepseek, Qwen, Kimi.
OpenCode is great in general. At least one of them is specifically trained on CC - I think it was Qwen - so for those that should give best results.
Claude Code better than opencode for GLM models for me.
The harness is effectively the agent's 'body'. Swapping the brain (model) is good, but if the body (tools/environment) is locked down or inefficient, the brain can't compensate. Local execution environments that standardize the tool interface are going to be critical for avoiding that lock-in.
My personal notes (not the author): have been way faster performance wise which is honestly the biggest improvement over correctless. I've posted https://github.com/can1357/oh-my-pi before, but didn't seem to gain traction. It's a great little agent.
I've just started messing around with pi, but haven't fully dug in yet. How would you compare oh-my-pi? I see it has a lot of other bells and whistles built in.
Are they portable bit by bit back to pi, or is there enough differences that they can't? how about normal pi extensions, can they be used in omp?
Some of the stuff definitely looks interesting.
the differences are documented but it is mostly 1:1, never used normal pi, but night and day difference compared to omp, don't forget omp setup python.
I'm into it! This looks like an experimentation platform. OpenCode is beginning to feel like handcuffs. Let me hack!
The logical end state of this line of reasoning is a collective action problem that dooms the frontier lab establishment. You can't devote model capacity to having an attention transformer match nested delimiters or cope with bash and be maximally capable, you can't mix authentication, authorization, control plane, and data plane into an ill specified soup and be secure enough for any that isn't a pilot or toy ever.
If you run this out, you realize that the Worse is Better paradox has inverted, it's an arbitrage, and the race is on.
The harness is the model "body", it's weight the cognition. Like in nature they develop together and the iteration of natural selection works at both.
If smaller labs (Zai, Moonshot, deepseek, mistral..) get together and embrace a harness, like opencode for example, as a consortium just by the power of "evolution across different environments" they might hit jackpot earlier than bigger labs.
But they rely on distilling the output of american leader models. Which will probably train against their own harness.
Someone has to do the baseline training, development, and innovation. it can't be clones all the way down
Why not? Humans are (very nearly) clones all the way down.
Citation needed, SOTA labs surely has technical protection and legaleese against using them for training. It's been done in th past but what indicates this is still the case?
My experience as well. People worry our profession is being reduced to "prompt engineer", but actually I get the feeling that programming will soon be mainly about designing and building harnesses for specific tasks.
Personal opinion is that LLMs are definitely not as magical as people think they are, they fill a specific niche of problem-solving, and harnesses are necessary to corral your problem into the niche that they are extremely good at solving.
The more I dive into this space the more I think that developers will still be in heavy demand—just operating in a different level of abstraction most of the time. We will need to know our CS fundamentals, experience will still matter, juniors will still be needed. It’s just that a lot of time time the actual code being generated will come from our little helper buddies. But those things still need a human in the seat to drive them.
I keep asking myself “could my friends and family be handed this and be expected to build what I’m building on them” and the answer is an immediate “absolutely not”. Could a non technical manager use these tools do build what I’m building? Absolutely not. And when I think about it, it’s for the exact same reason it’s always been… they just aren’t a developer. They just don’t “think” in the way required to effectively control a computer.
LLMs are just another way to talk to a machine. They aren’t magic. All the same fundamental principles that apply to probably telling a machine what to do still apply. It’s just a wildly different mechanism.
That all being said, I think these things will dramatically speed up the pace that software eats the world. Put LLMs into a good harness and holy shit it’s like a superpower… but to get those superpowers unlocked you still have to know the basis, same as before. I think this applies to all other trades too. If you are a designer you still have to what good design is and how to articulate it. Data scientists still need to understand the basics of their trade… these tools just give them superpowers.
Whether or not this assertion remains true in two or three years remains to be seen but look at the most popular tool. Claude code is a command line tool! Their gui version is pretty terrible in comparison. Cursor is an ide fork of vscode.
These are highly technical tools requiring somebody that knows file systems, command lines, basic development like compilers, etc. they require you to know a lot of stuff most people simply don’t. The direction I think these tools will head is far closer to highly sophisticated dev tooling than general purpose “magic box” stuff that your parents can use to… I dunno… vibe code the next hit todo app.
I believe you’re arriving at the wrong conclusion because you’re comparing to an opposite instead of to someone slightly worse than you. Will this enable people at the edge to perform like you? That’s the question. Will there be more developers? Will they compete with you?
> LLMs are just another way to talk to a machine. They aren’t magic.
I will still opt for a scriptable shell. A few scripts, and I have a custom interface that can be easily composed. And could be run on a $100 used laptop from ebay.
the harness bottleneck is real - I've been working on ai code security stuff and the biggest issue isn't model capability, it's that most tools treat the output as gospel. they'll take a suggested fix and apply it without checking if it even compiles, let alone if it introduces new vulns. I've seen fixes that patch one CVE but break auth logic entirely.
the edit tool point hits though. when you give the model a better interface to express changes (structured diffs vs free-form patches), error rates drop. but nobody talks about this because benchmarks measure "did it solve the problem" not "how many attempts" or "what's the blast radius when it fails". idk maybe I'm just jaded from debugging too many of these.
Yeah I invented a similar method for information extraction attribution around 2022, I would place custom markers in a document so the extraction model can reference them together with the answer and be unique on the document to be able to locate it.
What was the point of Claude code or Gemini banning the OP? Why would they care about how IDEs use the underlying API?
When you buy a subscription plan, you’re buying use of the harness, not the underlying compute / tokens. Buying those on their own is way more expensive. This is probably because:
* Subscriptions are oversubscribed. They know how much an “average” Claude Code user actually consumes to perform common tasks and price accordingly. This is how almost all subscription products work.
* There is some speculation that there is cooperative optimization between the harness and backend (cache related etc).
* Subscriptions are subsidized to build market share; to some extent the harnesses are “loss leader” halo products which drive the sales of tokens, which are much more profitable.
He wasn't using the regular paid api (ie per token pricing). He was using the endpoints for their subscribed customers (ie paid per month and heavily subsidized).
I assume he was using Gemini the same way as he was Claude when I make the following statement.
I don’t believe it’s exceptionally unique or new that companies will revoke access if you are using an unpublished API that the apps use. I don’t see anything wrong with it myself. If you want, pay for normal token use on the published APIs. There is no expectation that you can use APIs for an application, even if you are a paid user, that are not published explicitly for usage.
Indeed, that's why Anthropic, OpenAI and other LLM providers are known to adhere to published APIs to gather the world's data, obeying licensing and ROBOTS.txt.
It's truly disgusting.
I was under the impression that they do obey robots.txt now? There are clearly a lot of dumb agents that don’t, but didn’t think it was the major AI labs.
After 3 years of pirating and scraping the entire world by doing the above, I guess they have everything that they now need or want.
So then it's better to start obeying ROBOTS.txt as a ladder pull through a "nicely behaved" image advantage.
Obeying robots.txt (now) is still better than not obeying it, regardless of what they did before.
The alternative is to say that bugs shouldn’t be fixed because it’s a ladder pull or something. But that’s crazy. What’s the point of complaining if not to get people to fix things?
Why does Google/Facebook et al arbitrarily enforce one human per account?
It’s because they want to study you.
They want the data!
>What was the point of Claude code or Gemini banning the OP? Why would they care about how IDEs use the underlying API?
Underscores the importance of sovereign models you can run on the edge, finetune yourself, and run offline. At State of Utopia, we're working on it!
On first principles it would seem that the "harness" is a myth. Surely a model like Opus 4.6/Codex 5.3 which can reason about complex functions and data flows across many files would trip up over top level function signatures it needs to call?
I see a lot of evidence to the contrary though. Anyone know what the underlying issue here is?
Humans have a demonstrated ability to program computers by flipping switches on the front panel.
Like a good programming language, a good harness offers a better affordance for getting stuff done.
Even if we put correctness aside, tooling that saves time and tokens is going to be very valuable.
Isn't 'the harness' essentially just prompting?
It's completely understandable that prompting in better/more efficient means would produce different results.
No, it's also a suite of tools beyond what's available in bash, tailored to context management.
One example of where raw IQ diverges from real world results is Gemini 3. The difference is post-training, which I argue is more of a product competency than a pure model IQ competency.
Gemini 3 is arguably the smartest model, but no amount of raw smartness lets it reason its way into better agentic coding habits, if that makes sense?
The model's generalized "understanding" and "reasoning" is the real myth; that's what makes us take a step back and offload parts of the process to deterministic computing and harnesses.
So the new implementation always operates at the line level, replacing one or more lines. That's not ideal for some refactorings like rename where search and replace is faster.
Edit: checking ohmypi, the model has access to str_replace too, so this is just an additional edit tool.
You forgot to mention your tool does worse for 8/16 LLMs compared to replace?
Problem is, replace has been around for so long, most LLMs are tuned for it now
One of the first things I add to my Claude instructions file is to stop using grep, it's awfully slow, just use ripgrep instead: you can just type the word you're looking for from the project root and find it all in one shot. Claude likes to go folder by folder with grep and it drives me crazy.
"You're absolutely right!"
At this point I'd take a contract with Anthropic to have Claude code pick better tooling.
Getting banned from Gemini while attempting to improve Gemini is the most Googley thing ever :D imagine letting your automated "trust and safety" systems run amok so that they ban the top 0.01% of your users with no recourse. Google really knows how to score an own-goal.
I really don't understand what in his usage pattern would have triggered that obviously automated ban. Can somebody tell me what might look adversarial enough for a bot to flag it as 'hacking' or similar?
Filed an issue for codex
https://github.com/openai/codex/issues/11601
I ran into this from the other direction. I built a small SRE agent for my cloud infra and just kind of walked into hand-rolling some of the tools rather than using what exists today. I provided an edit_file tool that felt reasonably capable, but in practice the agent was regularly 'trying' to do a one-line change and submitting PRs that hallucinated three quarters of the file.
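For what it's worth, a cheap guardrail against that failure mode is to reject "small" edits that actually rewrite most of the file. A sketch (the function name and threshold are my own, not from any existing tool):

    import difflib

    def check_edit(original: str, proposed: str, min_similarity: float = 0.5) -> None:
        # A ratio of 1.0 means identical; a genuine one-line change stays close to 1.0.
        similarity = difflib.SequenceMatcher(None, original, proposed).ratio()
        if similarity < min_similarity:
            raise ValueError(
                f"Edit rewrites too much of the file (similarity {similarity:.2f}); "
                "ask the model for a smaller, targeted change."
            )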
Seeing how bad the results are when you approach it casually makes it very evident that this is an area that can be optimized.
It's underrated how much of the progress in productive LLM use for tasks like coding over the last year has come from improving harnesses, not just models.
Great work, but concurrency is lost.
With search-replace you could work on separate parts of a file independently with the LLM. Not to mention that with each edit all the lines below are shifted, so you now need to provide the LLM with the whole content again.
Have you tested followup edits on the same files?
(not the author) It works fine most of the time; I've been using it alongside an active agent and haven't run into many noticeable problems. The token savings alone are worth it.
Serializing writes is probably fine and the hashes should only change if you're updating the same line, right?
You probably don't want to use the line number though unless you need to disambiguate
But your write tool implementation can take care of that
I wonder if we'll get to "VI for LLMs": a model trained on that kind of text navigation, where you show it the context around the cursor as it navigates.
Would also be worth having special tokens for this kind of navigation.
I always thought ed would be a perfect match. Line-based instead of having to manage cursor movements.
I bet it’s good enough at VI already
> Treating harnesses as solved, or even inconsequential, is very short-sighted
Is it possible that burning extra tokens is the point, since they get paid more?
Given the fierce competition, I would imagine a better performing model generates more revenue than burning extra tokens
they have pretty fierce competition though, so i doubt this is intentional. my guess is they just have a million things to do and that isn't at the top of the list
That doesn't make sense with subscriptions.
Putting it out there: if any frontier model provider starts allowing any agent to use their $20/month plan, we will all switch to you. We don't want to be forced into 1 harness, we want OAuth, and we want respectable limits without excessive budgets.
Yep this has been my experience with browser agents as well. One little change in the harness/agentic loop and the model suddenly becomes a whole lot smarter at navigating the web. I was also able to build a better browser agent than 'claude --chrome' in just a few afternoons, just by tweaking the harness.
I feel like Cursor's solution is still the best answer. Let the model suggest edits in whatever format it prefers, using as few "extra" tokens as possible, and have a small model figure out how to apply them. I don't use Cursor anymore, but when I did it was impressive how consistently it worked; I think it failed only once. 70B might be overkill though...
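Roughly this pattern, as I understand it (a sketch only; call_small_model stands in for whatever apply model you run, it's not a real API):

    APPLY_PROMPT = """You are a code-merging assistant.
    Current file:
    {original}

    Suggested edit (may be partial, may use '... existing code ...' markers):
    {suggestion}

    Output the full updated file and nothing else."""

    def apply_edit(original: str, suggestion: str, call_small_model) -> str:
        # The big model proposes a loose edit; a cheaper model merges it into the file.
        prompt = APPLY_PROMPT.format(original=original, suggestion=suggestion)
        return call_small_model(prompt)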
Someone should try prompting the same LLM that's already in use to apply the edit as a subagent.
Great article, and tbh I thought it would've already been implemented that way. It makes sense to hash, mainly to save context; I don't expect them to care about token usage.
How about Kimi though, how can I play with it?
Arguably, I'd say the last year was mainly harness improvement rather than model improvement, but I could be wrong; it just feels that way to me.
I feel the baseline comparison should be relative to the intuitive and simple "line-numbers only" schema.
It's less token heavy than the proposed hash approach, and I don't think frontier LLMs hallucinate line numbers if each line in the context is prefixed with them.
The issue is when the file changed between when the LLM read the file and when it wrote to the file. Just using line numbers will clobber a file if that happens. The hashes prevent that from being an issue.
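Conceptually it's a compare-and-swap on the line. A minimal sketch of the precondition (the hash length and function names are my guesses, not the article's actual tool):

    import hashlib

    def line_hash(line: str) -> str:
        # Short per-line hash; real tools may use a different length or encoding.
        return hashlib.sha1(line.encode()).hexdigest()[:2]

    def edit_line(path: str, line_no: int, expected_hash: str, new_text: str) -> None:
        lines = open(path).read().splitlines()
        current = lines[line_no - 1]
        if line_hash(current) != expected_hash:
            # The file changed since the model last read it: refuse instead of clobbering.
            raise ValueError("Stale edit; re-read the file first.")
        lines[line_no - 1] = new_text
        open(path, "w").write("\n".join(lines) + "\n")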
Point taken.
it starts writing to the wrong part of the file after multiple edits.
This is very nicely done. We have seen the same issue at a higher level of getting separators right when generating multiple files in a single inference call.
My experience exactly! I’ve recently become so tired of the Claude harness that I switched to OpenCode (which is extremely good compared to Claude). However, OpenCode is also tedious to change, and it inherits all the “good stuff,” like treating agents as Markdown files and all the dancing around with hooks/plugins/skills scattered all over the place. Getting stuck again and again, I’ve ultimately come to the conclusion that this must be solved by writing my own damn coding agent, with extensibility that’s acceptable for real-world engineering.
Give Pi[1] a try. Comes pretty barebones out of the box, yet still provides a decent default experience. Extension points are all TypeScript if you want. There are a lot of examples[2] and some 3rd party extensions[3].
I'll point out that if you want permission prompts for certain behavior, you have to add that yourself. There's at least one example.
Edit: Just noticed the article's author is using a fork of Pi.
[1]: https://shittycodingagent.ai/
[2]: https://github.com/badlogic/pi-mono/tree/main/packages/codin...
[3]: https://github.com/nicobailon
Before you build your own, try pi. It is what you are looking for.
[0] https://shittycodingagent.ai/
Harnesses are where open source should shine. They don't require millions of dollars of compute, yet the search space is vast and explorable on limited budgets.
really enjoyed reading this, although I'm a dumb farmer and it took me a while to understand lol
Why not just use line numbers?
It forces you to read after every write. E.g. you edit line 15 into two lines; now you need arithmetic to offset every later edit against earlier ones, or you need to re-read the full file to reindex by line number.
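A trivial illustration of the drift (nothing here is from the article, just making the arithmetic concrete):

    lines = ["a", "b", "c", "d"]
    lines[1:2] = ["b1", "b2"]        # "edit" line 2 into two lines
    print(lines.index("c") + 1)      # what used to be line 3 is now line 4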
Good point!
I just wonder how unique these hashes will be if they're only 2 characters. It seems like the collision rate would be really high.
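Back-of-the-envelope, assuming 2 hex characters per hash (256 possible values; the actual encoding may differ):

    # Birthday-style estimate: chance that at least two lines in an n-line file
    # share the same 2-hex-character hash (256 buckets).
    def collision_probability(n: int, buckets: int = 256) -> float:
        p_unique = 1.0
        for i in range(n):
            p_unique *= (buckets - i) / buckets
        return 1 - p_unique

    for n in (10, 20, 50, 100):
        print(n, round(collision_probability(n), 2))
    # roughly: 10 -> 0.16, 20 -> 0.53, 50 -> 0.99, 100 -> 1.0

So collisions within a file are likely, which is presumably why a hash alone wouldn't be enough as an address and the position would still be needed to disambiguate.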
I was wondering the same thing.
I feel a lot of confusion about which coding harness is best and what options to use. Tbh I have mostly used standard aider, and I don't know what the consensus on that tool is.
I feel like I want to write my own, and that maybe in the future a lot of developers will have custom, highly personalized harnesses, since each user of these models wants to use them in a way that's unique to their brain. Much like how emacs is great for customization, but one person's emacs config is often not what another wants, or they only want a subset and then write their own features.
As an aside, what's the feeling on all the various AI coding tools? Does aider suck, are aider-ce/cecli better, or are the bespoke tools for each model, like Claude Code and such, better?
I use small models; I like to give them a TOC rather than line numbers. I wonder how it'd stack up against the hashline approach.
read_toc tool: ...
update_content tool: { ...
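For what it's worth, here's roughly how I picture that pair of tools (a sketch under my own assumption of markdown-style headings; the tool and field names are illustrative):

    import re

    def read_toc(text: str) -> list[dict]:
        # Return section headings with stable ids the model can reference.
        toc = []
        for i, line in enumerate(text.splitlines()):
            if re.match(r"^#{1,6} ", line):
                toc.append({"id": f"sec-{len(toc)}", "line": i, "title": line})
        return toc

    def update_content(text: str, toc: list[dict], section_id: str, new_body: str) -> str:
        # Replace the body of one section, addressed by its TOC id.
        lines = text.splitlines()
        idx = next(n for n, s in enumerate(toc) if s["id"] == section_id)
        start = toc[idx]["line"] + 1
        end = toc[idx + 1]["line"] if idx + 1 < len(toc) else len(lines)
        return "\n".join(lines[:start] + new_body.splitlines() + lines[end:])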
Great article, recommend reading all of it.
> Why bother, you ask? Opus may be a great model, but Claude Code to this day leaks raw JSONL from sub-agent outputs, wasting hundreds of thousands of tokens. I get to say, “fuck it, subagents output structured data now”.
This is why I find the ban on using Claude subscriptions with other harnesses so heinous. The harness they're forcing onto everyone has plenty of big issues, including wasting massive numbers of tokens. Very much in line with intentionally refusing to adhere to standards in the most IE6 way possible.
I mean, they want to make money, right? CC is a cool tool, but obviously they want you on the API eventually if you're even remotely a power user; $200/month for all-you-can-eat tokens (well, until some arbitrary limit of the day kicks in) just doesn't make sense compared to API prices. In other words, CC should be seen as a software subscription.
The token limit is the same whether used in CC or in other harnesses.
Sure, but then Anthropic loses the possibility to upsell, show ads, telemetry, brag about number of users and how long they use it etc etc. Not necessarily what’s in there today, but what can be in there tomorrow. They also get the ability to much better fine tune backoffs etc from a purely technical side of things.
Is there a skill file I can use for these edits?
I agree with this article completely, nice to see it presented quantitatively.
>re "only" the harness changed
In our experience, AIs are like amnesiacs who can barely remember what they did three minutes ago (their last autonomous actions might still be in their context if you're lucky), with no chance of remembering what they did three days ago. As such, the harness determines their entire memory and is the single most important determinant of their outcome.
The best harness is a single self-contained, well-commented, obvious, and tiny code file, followed by a plain explanation of what it does and what it's supposed to do, the change request, how you want it done (stated with so much force and confidence that the AI is afraid of getting yelled at if it does anything else), and a large amount of text devoted to asking the AI not to break what is already working. Followed by a request to write a test that passes. Followed by asking for its judgment about whether it broke what was already working or not. All in one tiny crisp prompt.
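Spelled out as a template, that recipe looks something like this (my paraphrase of the recipe above; the field names are illustrative):

    PROMPT_TEMPLATE = """Here is the entire file (small, self-contained, well commented):
    {code}

    What it does and what it is supposed to do:
    {explanation}

    Change request, and exactly how to do it (no deviations):
    {change_request}

    Do NOT break anything that already works.
    Write a test that passes.
    Finally, state plainly whether you believe you broke any existing behavior."""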
With such a harness, it manages not to break the code about one time in twenty. If you use reverse psychology and ask it to do the opposite of what you want, it rises to fifty-fifty odds that you'll get what you're trying to do.
Don't believe me? You can watch the livestream (see my previous comments).
Baby steps toward Utopia.