Agents that run while I sleep

This _all_ (waves hands around) sounds like alot of work and expense for something that is meant to make programming easier and cheaper.

Writing _all_ (waves hands around various llm wrapper git repos) these frameworks and harnesses, built on top of ever changing models sure doesn't feel sensible.

I don't know what the best way of using these things is, but from my personal experience, the defaults get me a looong way. Letting these things churn away overnight, burning money in the process, with no human oversight seems like something we'll collectively look back at in a few years and laugh about, like using PHP!

> sounds like alot of work and expense for something that is meant to make programming easier and cheaper.

Not if you are an AI gold rush shovel salesman.

From the article:

> I've run Claude Code workshops for over 100 engineers in the last six months

I am not laughing about PHP. To this very day many of my best projects are built on PHP. And while last 7 years I have spent in full stack JavaScript/TypeScript environment it has never produced the same things I was actually able to do with PHP.

I actually feel that things I built 15 years ago in PHP were better than anything I am trying to achieve with modern things that gets outdated every 6 months.

what in God's Name could you do in PHP that you can't do in a modern framework?

Nothing; but PHP, in experienced hands, will be waaay more productive for small-to-medium things. One issue is that experienced hands are increasingly hard to come by. Truly big, complicated things, built by large teams or numbers of teams, teams with a lot of average brains or AIs trained on average brains, will be better off in something like Typescript/React. And everyone wants to work on the big complicated stuff. So the "modern frameworks" will continue to dominate while smaller, more niche shops will wonder why they waste their time.

I worked at a startup, they built their API in PHP because it was easy and fast. Now they're successful, app doesn't scale, high latency etc. What does their php code do? 95% of it is calling a DB.

You're telling me today with LLM power multiplier it's THAT much faster to write in PHP compared to something that can actually have a future?

by future do you mean Future<T> or metaphorical future? :)

You can build those things in modern frameworks, it will just be more headache and will feel outdated in 6 months.

Where are my backbone apps? In the trash? Me ember apps? Next to them. My create-react-apps? On top of those. My Next apps? Being trashed as we speak. My rails apps? Online and making money every year with minimal upgrade time. What the hell was I thinking.

6 years ago I was writing apps in typescript and react, if I was starting a new project today I'd write it in typescript and react.

People bicker about PHP and Javascript, sorry Typescript, like they aren't both mule languages peoppe pick up to get work done. They both matured really well through years of production use.

They are in the same group, similar pedigree. If you were programming purely for the art of it, you would have had time to discover much nicer languages than either, but that's not what most people are doing so it doesn't really matter. They're different but they're about as good as eachother.

Not have to "build" anything. You edit code and it is already deployed on your dev instance.

Deploying to production is just scp -rv * production:/var/www/

Beautifully simple. No npm build crap.

You trade having to compile for actually having code that can scale

Not sure what you’re talking about, I scaled to millions of users on a pair of boxes with PHP, and its page generation time absolutely crushed Rails/Django times. Apache with mod PHP auto scales wonderfully.

You can always tell claude to use red-green-refactor and that really is a step-up from "yeah don't forget to write tests and make sure they pass" at the end of the prompt, sure. But even better, tell it to create subagents to form red team, green team and refactor team while the main instance coordinates them, respecting the clean-room rules. It really works.

The trick is just not mixing/sharing the context. Different instances of the same model do not recognize each other to be more compliant.

> But even better, tell it to create subagents to form red team, green team and refactor team while the main instance coordinates them, respecting the clean-room rules. It really works.

It helps, but it definitely doesn't always work, particularly as refactors go on and tests have to change. Useless tests start grow in count and important new things aren't tested or aren't tested well.

I've had both Opus 4.6 and Codex 5.3 recently tell me the other (or another instance) did a great job with test coverage and depth, only to find tests within that just asserted the test harness had been set up correctly and the functionality that had been in those tests get tested that it exists but its behavior now virtually untested.

Reward hacking is very real and hard to guard against.

The trick is, with the setup I mentioned, you change the rewards.

The concept is:

Red Team (Test Writers), write tests without seeing implementation. They define what the code should do based on specs/requirements only. Rewarded by test failures. A new test that passes immediately is suspicious as it means either the implementation already covers it (diminishing returns) or the test is tautological. Red's ideal outcome is a well-named test that fails, because that represents a gap between spec and implementation that didn't previously have a tripwire. Their proxy metric is "number of meaningful new failures introduced" and the barrier prevents them from writing tests pre-adapted to pass.

Green Team (Implementers), write implementation to pass tests without seeing the test code directly. They only see test results (pass/fail) and the spec. Rewarded by turning red tests green. Straightforward, but the barrier makes the reward structure honest. Without it, Green could satisfy the reward trivially by reading assertions and hard-coding. With it, Green has to actually close the gap between spec intent and code behavior, using error messages as noisy gradient signal rather than exact targets. Their reward is "tests that were failing now pass," and the only reliable strategy to get there is faithful implementation.

Refactor Team, improve code quality without changing behavior. They can see implementation but are constrained by tests passing. Rewarded by nothing changing (pretty unusual in this regard). Reward is that all tests stay green while code quality metrics improve. They're optimizing a secondary objective (readability, simplicity, modularity, etc.) under a hard constraint (behavioral equivalence). The spec barrier ensures they can't redefine "improvement" to include feature work. If you have any code quality tools, it makes sense to give the necessary skills to use them to this team.

It's worth being honest about the limits. The spec itself is a shared artifact visible to both Red and Green, so if the spec is vague, both agents might converge on the same wrong interpretation, and the tests will pass for the wrong reason. The Coordinator (your main claude/codex/whatever instance) mitigates this by watching for suspiciously easy green passes (just tell it) and probing the spec for ambiguity, but it's not a complete defense.

This seems quite amazing really, thanks for sharing

What is the scope of projects / features you’ve seen this be successful at?

Do you have a step before where an agent verifies that your new feature spec is not contradictory, ambiguous etc. Maybe as reviewed with regards to all the current feature sets?

Do you make this a cycle per step - by breaking down the feature to small implementable and verifiable sub-features and coding them in sequence, or do you tell it to write all the tests first and then have at it with implementation and refactoring?

Why not refactor-red-green-refactor cycle? E.g. a lot of the time it is worth refactoring the existing code first, to make a new implementation easier, is it worth encoding this into the harness?

You guys are describing wonderful things, but I've yet to see any implementation. I tried coding my own agents, yet the results were disappointing.

What kind of setup do you use ? Can you share ? How much does it cost ?

We have a very uncomplicated setup with claude code. A CLAUDE.md with instructions and notes about the repo and how to run stuff. We also do code reviews with Claude Code, but in a separate session.

It works wonderfully well. Costs about $200USD per developer per month as of now.

rlm-workflow does all that TDD for you: https://skills.sh/doubleuuser/rlm-workflow/rlm-workflow

(I built it)

Why make powershell a requirement? I like powershell, but Python is very common and already installed on many dev systems.

Thanks for sharing. What does RLM stand for? Any idea why the socket security test fails?

Recursive language models: https://github.com/doubleuuser/rlm-workflow

If you are not spending 5-10k dollars a month for interesting projects, you likely won't see interesting results

Sounds a lot like paying for online ads, they don't work because you're not paying enough, when in reality bots, scrapers and now agents are just running up all the clicks.

You pay more to try and get above that noise and hope you'll reach an actual human.

The new "fast mode" that burns tokens at 6 times the rate is just scary because that's what everyone still soon say we all need to be using to get results.

It feels like everyone's gone mad.

Here I am mostly writing code by hand, with some AI assistant help. I have a Claude subscription but only use it occasionally because it can take more time to review and fix the generated code as it would to hand-write it. Claude only saves me time on a minority of tasks where it's faster to prompt than hand-write.

And then I read about people spending hundreds or thousands of dollars a month on this stuff. Doesn't that turn your codebase into an unreadable mess?

I can't really tell if this is sarcasm or not.

Check out Mike Pocock’s work, he’s done excellent work writing about red green refactor and has a GitHub repo for his skills. Read and take what you need from his tdd skill and incorporate it into your own tdd skill tailored for your project.

This is just ai slop. If you follow what the actual designers of Claude/GPT tell you it flys in the face of building out over engineered harnesses for agents.

I agree with this. There is not a lot of harnesses/wrapping needed for Claude Code.

You don't need a harness beyond Claude Code, but honestly it's foolish to think you shouldn't be building out extra skills to help your workflow. A TDD skill that does red-green-refactoring is using Claude Code exactly as how it's meant to be used. They pioneered skills.

Works better than standard claude / gpt, which doesn't do red-green-refactor. Doesn't seem like slop when it meaningfully changes the results for the better, consistently. Really is a game-changer. You should consider trying it.

I do do TDD but using skills in this way is an anti-pattern for a multitude of reasons.

I don't think just saying it's an anti-pattern for a multitude of reasons and then not naming any is sufficiently going to convince anyone it's an anti-pattern.

This is in fact precisely what skills is meant for and is the opposite of an anti-pattern, but more like best practice now. It's explicitly using the skills framework precisely how it was meant to be used.

This is very interesting, but like sibling comments, I'm very curious as to how you run this in practice. Do you just tell Claude/Copilot to do what you describe?

And do you have any prompts to share?

You don't need most of this. Prompts are also normally what you would say to another engineer.

* There is a lot of duplication between A & B. Refactor this.

* Look at ticket X and give me a root cause

* Add support for three new types of credentials - Basic Auth, Bearer Token and OAuth Client Creds

Claude.md has stuff like "Here's how you run the frontend. here's how u run backend. This module support frontend. That module is batch jobs. Always start commit messages with ticket number. Always run compile at the top level. When you make code changes, always add tests" etc etc

They never do.

Someone directly down from you suggested looking up Mike Postock's TDD skill, so I did:

https://github.com/mattpocock/skills/blob/main/tdd%2FSKILL.m...

Everything below quoted from that skill, and serves as a much better rebuttal than I had started writing:

DO NOT write all tests first, then all implementation. This is "horizontal slicing" - treating RED as "write all tests" and GREEN as "write all code."

This produces crap tests:

Tests written in bulk test imagined behavior, not actual behavior You end up testing the shape of things (data structures, function signatures) rather than user-facing behavior Tests become insensitive to real changes - they pass when behavior breaks, fail when behavior is fine

You outrun your headlights, committing to test structure before understanding the implementation

Correct approach:

Vertical slices via tracer bullets.

One test → one implementation → repeat. Each test responds to what you learned from the previous cycle. Because you just wrote the code, you know exactly what behavior matters and how to verify it.

How do you make sure Red Team doesn't just write subtly broken tests?

This seems like a tremendous amount of planning, babysitting, verification, and token cost just to avoid writing code and tests yourself.

It's assigning yourself the literal worst parts of the job - writing specs, docs, tests and reading someone else's code.

There's a real disconnect. I was talking to a junior developer and they were telling me how Claude is so much smarter than them and they feel inferior.

I couldn't relate. From my perspective as a senior, Claude is dumb as bricks. Though useful nonetheless.

I believe that if you're substantially below Claude's level then you just trust whatever it says. The only variables you control are how much money you spend, how much markdown you can produce, and how you arrange your agents.

But I don't understand how the juniors on HN have so much money to throw at this technology.

Yes with the reward of: I don't understand this code and didn't learn anything incrementally about the feature I "planned".

How do you define visibility rules? Is that possible for subagents?

AFAIK Claude doesn't support it, but if you're willing to go the extra mile, you can get creative with some bash script: https://pastebin.com/raw/m9YQ8MyS (generated this a second ago - just to get the point across )

To be clear, I don't do this. I never saw an agent cheat by peeking or something. I really did look through their logs.

I'd be very interested to see claude code and other tools support this pattern when dispatching agents to be really sure.

> To be clear, I don't do this.

How do you know that it works then? Are you using a different tool that does support it?

So what do you do? Do you define roles somewhere and tell the agent to assign these roles to subagents?

Fun to see you not on tildes.

Setting up a clean room is one of the only ways to do Evals on agentic harnesses. Especially prevalent with Windsurf which doesn’t have an easy CLI start.

So how? The easiest answer when allowed is docker. Literally new image per prompt. There’s also flags with Claude to not use memory and from there you can use -p to have it just be like a normal cli tool. Windsurf requires manual effort of starting it up in a new dir.

Sounds interesting, but I'm not quite getting the relevance for people writing code with an agent. Should I be doing evals?

Well I mean yes. I think people ought be aware for how the harnesses compare for their stacks. But clean room applies for this RGR situation too

you are replying to a bot, that's why.

> Reward hacking is very real and hard to guard against.

Is it really about rewards? Im genuinely curious. Because its not a RL model.

I'm noticing terms related to DL/RL/NLP are being used more and more informally as AI takes over more of the cultural zeitgeist and people want to use the fancy new terms of the era, even if inaccurately. A friend told me he "trained and fine tuned a custom agent" for his work when what he meant was he modified a claude.md file.

Respectfully, your friend doesn't know what he is talking about and is saying things that just "feel right" (vibe talking??). Which might be exactly how technical terms lose their meaning so perhaps you're exactly right.

There is a nontrivial amount of RL training (RLHF, RLVR, ...), so it would be reasonable to call it an RL model.

And with that comes reward hacking - which isn't really about looking for more reward but rather that the model has learned patterns of behavior that got reward in the train env.

That is, any kind of vulnerability in the train env manifests as something you'd recognize as reward hacking in the real world: making tests pass _no matter what_ (because the train env rewarded that behavior), being wildly sycophantic (because the human evaluators rewarded that behavior), etc.

> There is a nontrivial amount of RL training (RLHF, RLVR, ...), so it would be reasonable to call it an RL model.

Hm, as i understand it, parts of the training of e.g. ChatGPT could be called RL models. But the subject to be trained/fine tuned is still a seq2seq next token predictor transformer neural net.

RL is simply a broad category of training methods. It's not really an architecture per se: modern GPTs are trained first on reconstruction objective on massive text corpora (the 'large language' part), then on various RL objectives +/- more post-training depending on which lab.

> Is it really about rewards? Im genuinely curious. Because its not a RL model.

Ha, good point. I was using it informally (you could handwave and call it an intrinsic reward if a model is well aligned to completing tasks as requested), but I hadn't really thought about it.

Searching around, it seems like I'm not alone, but it looks like "specification gaming" is also sometimes used, like: https://deepmind.google/blog/specification-gaming-the-flip-s...

They probably meant goal hacking. (I just made that up)

I refer to it as ‘wanking’. It’s doing something that’s unproductive but that’s incentivised by its architecture.

A refactor should not affect the tests at all should it? If it does, it's more than a refactor.

It can if your refactor needs to deal with interface changes, like moving methods around, changing argument order etc... all these need to propagate to the tests

Your tests are an assertion that 'no matter what this will never change'. If your interface can change then you are testing implementation details instead of the behavior users care about.

the above is really hard. A lot of tdd 'experts' don't understand is and teach fragile tests that are not worth having.

Sure if you are changing your interfaces a lot you either are leaking abstractions or you aren't designing your interfaces well.

But things evolve with time. Not only your software is required to do things it wasn't originally designed to do, but your understanding of the domain evolve, and what once was fine becomes obsolete or insufficient.

https://www.hyrumslaw.com/

your implementation is your interface. its a bit naive or hating-your-users to assume your tests are what your users care about. theyre dealing with everything, regardless of what youve tested or not.

Hyrum's law is about the real consumers/users (inadvertently) depending on any observable behaviour they can get their hands on.

TDD/BDD tests are meant to define the intended contract of a system.

These are not the same thing.

Refactoring is changing the design of the code without affecting the behaviour.

You can change an interface and not change the behaviour.

I have rarely heard such a rigid interpretation such as this.

It depends on what you mean by "refactor" and how exactly you're testing, I guess, but that's not really at the heart of the point. red-green-refactor could also be used for adding new features, for instance, or an entire codebase, I guess.

I’m telling it to use red/green tdd [1] and it will write test that don’t fail and then says “ah the issue is already fixed” and then move on. You really have to watch it very closely. I’m having a huge problem with bad tests in my system despite a “governance model” that I always refer it to which requires red/green tdd.

[1] https://simonwillison.net/guides/agentic-engineering-pattern...

I've been able to encode Outside-in Test Driven Development into a repeatable workflow. Claude Code follows it to a T, and I've gotten great results. I've written about it more here, and created a repo people can use out of the box to try it out:

https://www.joegaebel.com/articles/principled-agentic-softwa... https://github.com/JoeGaebel/outside-in-tdd-starter

This sounds interesting. Can you go a bit deeper or provide references on how to implement the green/red/refactor subagent pattern?

It’s not an agentic pattern, it’s an approach to test driven development.

You write a failing test for the new functionality that you’re going to add (which doesn’t exist yet, so the test is red). You then write the code until the test passes (that is, goes green).

What has worked better for me is splitting authority, not just prompts. One agent can touch app code, one can only write failing tests plus a short bug hypothesis, and one only reviews the diff and test output. Also make test files read only for the coding agent. That cuts out a surprising amount of self-grading behavior.

How do you limit access like that?

I built rlm-workflow which has stage gating, TDD and sub-agent support: https://skills.sh/doubleuuser/rlm-workflow/rlm-workflow

That's the cool bit - you don't have to. CC is perfectly well aware and competent to implement it; just tell it to.

[deleted]

"So this is how liberty dies... with thunderous applause.” - Padmé Amidala

s/liberty/knowledge

So more stuff happens with this approach but how do you know what it generates is correct?

Good idea, and an improvement, but you still have that fundamental issue: you don't really know what code has been written. You don't know the refactors are right, in alignment with existing patterns etc.

How exactly do you set up your CC sessions to do this?

thats a great idea - i have been using codex to do my code reviews since i have it to give better critique on code written by claude but havent tried it with testing yet!

codex/gpt is a stubborn model, doubt it would accept claude reviews or counter it. have seen cases where claude is more willing to comply if shared feedback though its just sycophancy too.

Am I supposed to be impressed by this? I think people are now just using agents for the sake of it. I'm perfectly happy running two simple agents, one for writing and one for reviewing. I don't need to go be writing code at faster than light speed. Just focusing on the spec, and watching the agent as it does its work and intervening when it goes sideways is perfectly fine with me. I'm doing 5-7x productivity easily, and don't need more than that.

I also spend most of my time reviewing the spec to make sure the design is right. Once I'm done, the coding agent can take 10 minutes or 30 minutes. I'm not really in that much of a rush.

Yes I'm still not really understanding this "run agents overnight" thing. Most of the time if I use claude it's done in 5-20 minutes. I've never wanted to have work done for me overnight...tomorrow is already plenty of time for more work, it's not going anywhere, and my employer isn't paying me to produce overnight.

The only counter I have to this is that there are some workflows that have test environments, everything can't or shouldn't just run locally. Sometimes these test take time, and instead of babysitting the model to write code and run the build+deploy+test manually, you can send it off to work until the kinks are worked out.

Add to that I have worked on many projects that take more than 20 minutes to fully build and run tests... unfortunately. And I would consider that part of the job of implementing a feature, and to reduce cycles I have to take.

After the "green" signal I will manually review or send off some secondary reviews in other models. Is it wasteful? Probably. But its pretty damn fun (as long as I ignore the elephant in the room.)

> Am I supposed to be impressed by this?

No. But it is noteworthy. A lot of what one previously needed a SWE to do can now be brute forced well enough with AI. (Granted, everything SWEs complained about being tedious.)

From the customer’s perspective, waiting for buggy code tomorrow from San Francisco, buggy code tonight from India or buggy code from an AI at 4AM aren’t super different for maybe two thirds of use cases.

> A lot of what one previously needed a SWE to do can now be brute forced well enough with AI. (Granted, everything SWEs complained about being tedious.)

Only if you ignore everything they generate. Look at all the comments saying that the agent hallucinates a result, generates always-passing tests, etc. Those are absolutely true observations -- and don't touch on the fact that tests can pass, the red/green approach can give thumbs up and rocket emojis all day long, and the code can still be shitty, brittle and riddled with security and performance flaws. And so now we have people building elaborate castles in the sky to try to catch those problems. Except that the things doing the catching are themselves prone to hallucination. And around we go.

So because a portion of (IMO always bad, but previously unrecognized as bad) coders think that these random text generators are trustworthy enough to run unsupervised, we've moved all of this chaotic energy up a level. There's more output, certainly, but it all feels like we've replaced actual intelligent thought with an army of monkeys making Rube Goldberg machines at scale. It's going to backfire.

> coders think that these random text generators are trustworthy enough to run unsupervised, we've moved all of this chaotic energy up a level

But it works well enough for most use cases. Most of what we do isn’t life or death.

> But it works well enough for most use cases.

So does the code produced by any bad engineer.

So either we’re finally admitting that all of that leetcode screening and engineer quality gating was a farce, or it wasn’t, and you’re wrong.

I think the answer is in the middle, but the pendulum has swung too far in the “doesn’t matter” direction.

> we’re finally admitting that all of that leetcode screening and engineer quality gating was a farce, or it wasn’t, and you’re wrong

We’re admitting a bit of both. Offshoring just became more instantaneous, secure and efficient. There will still be folks who overplay their hand.

Macroeconomically speaking, I don’t see why we need more software engineers in the future than we have today, and that’s probably a conservative estimate.

What I want to know is, what has this increase in code generation led to? What is the impact?

I don't mean 'Oh I finally have the energy to do that side project that I never could'.

Afterall, the trade-offs have to be worth something... right? Where's the 1-person billion dollar firms at That Mr Altman spoke about?

The way I think of it is code has always been an intermediary step between a vision and an object of value. So is there an increase in this activity that yields the trade-offs to be a net benefit?

I went the same way. At first I was splitting off work trees and running all the agents that I could afford, then I realized I just can't keep up with it all, running few agents around one issue in one directory is fast enough. Way faster than before and I can still follow what's happening.

> off work trees and running all the agents that I could afford,

I still think that we, programmers, having to pay money in order to write code is a travesti. And I'm not talking about paying the license for the odd text editor or even for an operating system, I'm talking about day-to-day operations. I'm surprised that there isn't a bigger push-back against this idea.

What is strange about paying for tools that improve productivity? Unless you consider your own time worthless you should always be open to spending more to gain more.

No stock backed company will be paying developers more regardless of much more productive these tools make us. You'll be lucky if they pay for the proper Claude Max plan themselves considering most wouldn't even spring for IntelliJ.

Your own time is worthless if you’re not spending it doing something that makes more money. You don’t make more money increasing your productivity for work when you’re expected to work the same number of hours.

I've spent a fair amount of time contracting -- this issue is even more relevant here. While I wasn't spending very much on AI tools, what I did spent was worth every penny... for the company I was supporting :).

Fortunately, there was enough work to be done so productivity increases didn't decrease my billable hours. Even if it did, I still would have done it. If it helps me help others, then it's good for my reputation. Thats hard to put a price on, but absolutely worth what I paid in this case.

Are the jobs out there actually paying people more?

Dw, there's quite a lot of push back against AI in some of the communities I hang around in. It's just rarely seldom visible here on HN.

It's usually not about the price, but more about the fact that a few megacorps and countries "own" the ability to work this way. This leads to some very real risks that I'm pretty sure will materialize at some point in time, including but not limited to:

- Geopolitical pressure - if some ass-hat of a president hypothetically were to decide "nuh uh - we don't like Spain, they're not being nice to us!", they could forbid AI companies to deliver their services to that specific country.

- Price hikes - if you can deliver "$100 worth of value" per hour, but "$1000 worth of value" per hour with the help of AI, then provider companies could still charge up to $899 per hour of usage and it'd still make "business sense" for you to use them since you're still creating more value with them than without them.

- Reduction in quality - I believe people who were senior developers _before_ starting to use AI assisted coding are still usually capable of producing high quality output. However every single person I know who "started coding" with tools like Claude Code produce horrible horrible software, esp. from a security p.o.v. Most of them just build "internal tools" for themselves, and I highly encourage that. However others have pursued developing and selling more ambitious software...just to get bitten by the fact that it's much more to software development than getting semi-correct output from an AI agent.

- A massive workload on some open source projects. We've all heard about projects closing down their bug bounty programs, declining AI generated PRs etc.

- The loss of the joy - some people enjoy it, some people don't.

We're definitely still in the early days of AI assisted / AI driven coding, and no one really knows how it'll develop...but don't mistake the bubble that is HN for universal positivity and acclaim of AI in the coding space :).

China did users a solid and Qwen is a thing, so the scenario where Anthropic/OpenAI/Google collude and segment the market to ratchet prices in unison just isn’t possible. Amodei talking about value based pricing is a dream unless they buy legislation to outlaw competitors. Altman might have beat them to that punch with this admin, though. Most of us are operating on 10-40% margins. Usually on the low end when there aren’t legal barriers. The 80-99% margins or rent extraction rights SaaS people expect is just out of touch. The revenue the big 3 already pull in now has a lot more to do with branding and fear-mongering than product quality.

My old work machine used power quite aggressively - I was happy to pay for that (and turn it off at night!). This seems even more directly valuable.

It's silly, who wouldn't answer yes to the question "would you like to finish your task faster?". The real trick is to produce more but by putting less effort than before.

> who wouldn't answer yes to the question "would you like to finish your task faster?"

People who enjoy the process of completing the task?

Maybe we'd see "coding gyms" like how white collar workers have gyms for the physical exercise they're not getting from their work.

If you finish faster, you'll be given another task. You're not freeing yourself sooner or spending less effort, you're working the same number of hours for the same pay. Your reward is not joining the ranks of those laid off.

I would be impressed if I could say "here's $100 turn it into $1000" but you still gotta do the thinking.

yup, agree - i spend most of my time reviewing the spec. The highest leverage time is now deciding what to work on and then working on the spec. I ended up building the verify skill (https://github.com/opslane/verify) because I wanted to ensure claude follows the spec. I have found that even after you have the spec - it can sometimes not follow it and it takes a lot of human review to catch those issues.

I call this "Test Theatre" and it is real. I wrote about it last year:

https://benhouston3d.com/blog/the-rise-of-test-theater

You have to actively work against it.

Yeah, having your agent write 3x the code in exhaustive tests (I tried this recently and got 600 lines of tests for my 100 lines of code!) sure makes things look great, but when you actually look at the content of the tests they’re meaningless. Good tests validate the use of design patterns, ensure that dependencies hold, and are meaningful (e.g. shortcut debugging by setting up useful state) when they break.

I've found the best way to achieve that is to force the agent to do TDD. Better to get it to do Outside-in TDD. Even better to get it to run Outside-in TDD, then use mutation testing to ensure it has fully covered the logic.

I've written about this and have a POC here for those interested: https://www.joegaebel.com/articles/principled-agentic-softwa...

Test theatre is exactly the right framing. The tests are syntactically correct, they run, they pass but do they actually prove anything?

This was really good, and second leaning on property testing. I’ve had really good outcomes from setting up Schemathesis and getting blanket coverage for stuff like “there should be no request you can generate as logged in user A that let’s you do things as or see things that belong to user B”, as well as “there should be no request you can find to any API endpoint that can trigger a 5xx response”

Test theatre isn’t new. Most people writing tests do the exact same thing, testing implementation.

I've been impressed by Google Jules since the Gemini 3.1 Pro update. Sometimes it's been working on a task for 4h. I've now put it in a ralph loop using a Github Action to call itself and auto merge PRs after the linter, formatter and tests pass. It does still occasionally want my approval, but most of the time I just say Sounds great!

It's currently burning through the TESTING.md backlog: https://github.com/alpeware/datachannel-clj

> A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do.

I can't understand the mindset that would lead someone not to have realized this from the beginning.

Does anyone know what this guy is having his agents build? Bc I looked a bit and all I see him ship is linkedin posts about Claude.

Yeah maybe I'm just old but in 25 years in industry - not one company has needed this much code that fast. They may insist they do but then it sits while they figure out how to sell, or the inevitable "oh wait, we didn't think about that..."

So much of this - never would have guessed how much code I wouldn't write doing this as a career.

Hurry up and wait.

It's... really the same problem when you hire people to just write tests. A lot of time it just confirms that the code does what the code does. Having clear specs of what the code should do make things better and clearer.

Yep, tests written after the fact are just verifying tautologies.

> Most teams don't [write tests first] because thinking through what the code should do before writing it takes time they don't have.

It's astonishing to me how much our industry repeats the same mistakes over and over. This doesn't seem like what other engineering disciplines do. Or is this just me not knowing what it looks like behind the curtain of those fields?

When push comes to shove, software can usually be fudged. Unlike a building or a water treatment plant where the first fuck up could mean that people die.

I like to think that people writing actual mission critical software try their absolute best to get it right before shipping and that the rest our industry exists in a totally separate world where a bug in the code is just actually not that big of a deal. Yeah, it might be expensive to fix, but usually it can be reverted or patched with only an inconvenience to the user and to the business.

It’s like the fines that multinational companies pay when breaking the law. If it’s a cost of doing business, it’s baked into the price of the product.

You see this also in other industries. OSHA violations on a residential construction site? I bet you can find a dozen if you really care to look. But 99% of the time, there are no consequences big enough for people to care so nobody wears their PPE because it “slows them down” or “makes them less nimble”. Sound familiar?

Quite. We’re far more similar to construction workers than we are civil engineers, despite the lofty title we like to bestow upon ourselves.

That's probably dependent on your specific area of work. For most projects, It's okayish to deploy code with bugs. There will be future releases that fix those bugs and add improvements. Obviously that's not the case with high risk systems like space rockets software and similar.

With other engineering professions, all projects are like that. You cannot "deploy a bridge to production" to see what happens and fix it after a few have died

a lot of the value of tests is confirming that the system hasn't regressed beyond the behavior at the original release. It's bad if the original release is wrong, but a separate issue is if the system later accidentally stops behaving the way it did originally.

The issue I see is that the high test coverage created by having LLMs write tests results in almost all non-trivial changes breaking tests, even if they don't change behavior in ways that are visible from the outside. In one project I work, we require 100% test coverage, so people just have LLMs write tons of tests, and now every change I make to the code base always breaks tests.

So now people just ignore broken tests.

> Claude, please implement this feature.

> Claude, please fix the tests.

The only thing we've gained from this is that we can brag about test coverage.

My best unit tests are 3 lines, one of them whitespace, and they assert one single thing that's in the requirements.

These are the only tests I've witnessed people delete outright when the requirements change. Anything more complex than this, they'll worry that there's some secondary assertion being implied by a test so they can't just delete it.

Which, really is just experience telling them that the code smells they see in the tests are actually part of the test.

meanwhile:

    it("only has one shipping address", ...

is demonstrably a dead test when the story is, "allow users to have multiple shipping addresses", as is a test that makes sure balances can't go negative when we decide to allow a 5 day grace period on account balances. But if it's just one of six asserts in the same massive tests, then people get nervous and start losing time.

Unit tests vs acceptance tests. You shouldn't be afraid to throw away unit tests if the implementation changes, and acceptance tests should verify behavior at API boundaries, ignoring implementation details.

BDD helps with this as it can allow you to get the setup out of the tests making it even cheaper for someone to yeet a defunct test.

I feel it end up a massive drag on development velocity and makes refactoring to simpler designs incredibly painful.

But hey, we're just supposed to let the AIs run wild and rewrite everything every change so maybe that's a heretic view.

[deleted]

yup agree - i think have specs and then do verifications against the spec. I have heard that this is how a lot of consulting firms work - you have acceptance criterias and thats how work is validated.

I've been doing differential testing in Gemini CLI using sub-agents. The idea is:

1. one agent writes/updates code from the spec

2. one agent writes/updates tests from identified edge cases in the spec.

3. a QA agent runs the tests against the code. When a test fails, it examines the code and the test (the only agent that can see both) to determine blame, then gives feedback to the code and/or test writing agent on what it perceives the problem as so they can update their code.

(repeat 1 and/or 2 then 3 until all tests pass)

Since the code can never fix itself to directly pass the test and the test can never fix itself to accept the behavior of the code, you have some independence. The failure case is that the tests simply never pass, not that the test writer and code writer agents both have the same incorrect understanding of the spec (which is very improbable, like something that will happen before the heat death of the universe improbable, it is much more likely the spec isn't well grounded/ambiguous/contradictory or that the problem is too big for the LLM to handle and so the tests simply never wind up passing).

Where is the interface defined ? If it is just the coder reading the test it can hard code specific cases based on the test setup/fixture data.

There is a specification and the interface is defined from that. The coder never gets to see the test.

> Writing acceptance criteria is harder than writing a prompt, because it forces you to think through edge cases before you've seen them. Engineers resist it for the same reason they resisted TDD, because it feels slower at the start.

This resonates with my experience, and it is also a refreshing honest take: pushing back on heavy upfront process isn't laziness, it's just the natural engineers drive to build things and feel productive.

The hardest part of running agents autonomously is the data quality problem. When your agent runs unsupervised, every decision is only as good as the data it pulls. Having agents access authoritative structured sources (government APIs, international org datasets) rather than scraping random pages makes a huge difference. The real failure mode is not hallucination - it is the agent confidently acting on unreliable data.

Sounds like we've just gotten into lazy mode where we believe that whatever it spits out is good enough. Or rather, we want to believe it, and convince ourselves that some simple guardrail we put up will make it true, because God forbid we have to use our own brain again.

What if instead, the goal of using agents was to increase quality while retaining velocity, rather than the current goal of increasing velocity while (trying to) retain quality? How can we make that world come to be? Because TBH that's the only agentic-oriented future that seems unlikely to end in disaster.

You can't. To retain and improve quality requires care. Very few if any of the people setting stuff like this up truly care about delivering a quality result (any result is the real goal). Unless there's some incentive to care, quality will be found among the exceedingly rare people/businesses.

I guess to reach this point you have already decided you don't care what the code looks like.

Something I'm starting to struggle with is when agents can now do longer and more complex tasks, how do you review all the code?

Last week I did about 4 weeks of work over 2 days first with long running agents working against plans and checklists, then smaller task clean ups, bugfixes and refactors. But all this code needs to be reviewed by myself and members from my team. How do we do this properly? It's like 20k of line changes over 30-40 commits. There's no proper solution to this problem yet.

One solution is to start from scratch again, using this branch as a reference, to reimplement in smaller PRs. I'm not sure this would actually save time overall though.

If you haven't reviewed the code yet, how can you say it did 4 weeks of work in 2 days? You haven't verified the correctness, and besides reviewing the code is part of the work.

You’re not alone. I went from being a mediocre security engineer to a full time reviewer of LLM code reviews last week. I just read reports and report on incomplete code all day. Sometimes things get humorously worse from review to review. I take breaks by typing out the PoCs the LLMs spell out for me…

> Something I'm starting to struggle with is when agents can now do longer and more complex tasks, how do you review all the code?

Same as before. Small PRs, accept that you won't ship a month of code in two days. Pair program with someone else so the review is just a formality.

The value of the review is _also_ for someone else to check if you have built the right thing, not just a thing the right way, which is exponentially harder as you add code.

The proper solution is to treat the agent generated code like assembly... IE. don't review it. Agents are the compiler for your inputs (prompts, context, etc). If you care about code quality you should have people writing it with AI help, not the other way around.

I've been thinking a lot about this!

Redoing the work as smaller PRs might help with readability, but then you get the opposite problem: it becomes hard to hold all the PRs in your head at once and keep track of the overall purpose of the change (at least for me).

IMO the real solution is figuring out which subset of changes actually needs human review and focusing attention there. And even then, not necessarily through diffs. For larger agent-generated changes, more useful review artifacts may be things like design decisions or risky areas that were changed.

So you have become a reviewer instead of a programmer? Is that so? hones question. And if so, what is the advantage of looking a code for 12 hours instead of coding for 12.

Build features faster. Granted, this exposes the difference between people who like to finish projects and people who like to get paid a lot of money for typing on a keyboard.

It sounds like you know this but what happened is that you didn't do 4 weeks of work over 2 days, you got started on 4 weeks of work over 2 days, and now you have to finish all 4 weeks worth of work and that might take an indeterminate amount of time.

If you find a big problem in commit #20 of #40, you'll have to potentially redo the last 20 commits, which is a pain.

You seem to be gated on your review bandwidth and what you probably want to do is apply backpressure - stop generating new AI code if the code you previously generated hasn't gone through review yet, or limit yourself to say 3 PRs in review at any given time. Otherwise you're just wasting tokens on code that might get thrown out. After all, babysitting the agents is probably not 'free' for you either, even if it's easier than writing code by hand.

Of course if all this agent work is helping you identify problems and test out various designs, it's still valuable even if you end up not merging the code. But it sounds like that might not be the case?

Ideally you're still better off, you've reduced the amount of time being spent on the 'writing the PR' phase even if the 'reviewing the PR' phase is still slow.

yeah honestly thats what i am struggling with too and I dont have a a good solution. However, I do think we are going to see more of this - so it will be interesting to see how we are going to handle this.

i think we will need some kind of automated verification so humans are only reviewing the “intent” of the change. started building a claude skill for this (https://github.com/opslane/verify)

It's a nice idea, but how do you know the agent is aligned with what it thinks the intent is?

or moreso, what happens at compact boundaries where the agent completely forgets the intent

> how do you review all the code?

Code review is a skill, as is reading code. You're going to quickly learn to master it.

> It's like 20k of line changes over 30-40 commits.

You run it, in a debugger and step through every single line along your "happy paths". You're building a mental model of execution while you watch it work.

> One solution is to start from scratch again, using this branch as a reference, to reimplement in smaller PRs. I'm not sure this would actually save time overall though.

Not going to be a time saver, but next time you want to take nibbles and bites, and then merge the branches in (with the history). The hard lesson here is around task decomposition, in line documentation (cross referenced) and digestible chunks.

But if you get step debugging running and do the hard thing of getting through reading the code you will come out the other end of the (painful) process stronger and better resourced for the future.

Oh I didn't mean literally how do I review code. I meant, if an agent can write a lot of code to achieve a large task that seemingly works (from manual testing), what's the point if we haven't really solved code review? There's still that bottleneck no matter how fast you can get working code down.

>Last week I did about 4 weeks of work over 2 days first with long running agents working against plans and checklists, then smaller task clean ups, bugfixes and refactors. But all this code needs to be reviewed by myself and members from my team. How do we do this properly? It's like 20k of line changes over 30-40 commits. There's no proper solution to this problem yet.

Get an LLM to generate a list of things to check based on those plans (and pad that out yourself with anything important to you that the LLM didn't add), then have the agents check the codebase file by file for those things and report any mismatches to you. As well as some general checks like "find anything that looks incorrect/fragile/very messy/too inefficient". If any issues come up, ask the agents to fix them, then continue repeating this process until no more significant issues are reported. You can do the same for unit tests, asking the agents to make sure there are tests covering all the important things.

Pet peeve: this post misunderstands “TDD.” What it really describes is acceptance tests.

TDD is a tool for working in small steps, so you get continuous feedback on your work as you go, and so you can refine your design based on how easy it is to use in practice. It’s “red green refactor repeat”, and each step is only a handful of lines of code.

TDD is not “write the tests, then write the code.” It’s “write the tests while writing the code, using the tests to help guide the process.”

Thank you for coming to my TED^H^H^H TDD talk.

> TDD is a tool for working in small steps, so you get continuous feedback on your work as you go, and so you can refine your design based on how easy it is to use in practice.

I would like to emphasize that feedback includes being alerted to breaking something you previously had working in a seemly unrelated/impossible way.

Accidentally mutating an input is always a 'fun' way to trigger spooky action at a distance.

suggestion: TeDD talk.

Many times there is really no way of getting around some of the expert-human judgement complexity of the larger question of "How to get agents to build reliably".

One example I have been experimenting is using Learning Tests[1]. The idea is that when something new is introduced in the system the Agent must execute a high value test to teach itself how to use this piece of code. Because these should be high leverage i.e. they can really help any one understand the code base better, they should be exceptionally well chosen for AIs to use to iterate. But again this is just the expert-human judgement complexity shifted to identifying these for AI to learn from. In code bases that code Millions of LoC in new features in days, this would require careful work by the human.

[1] https://anthonysciamanna.com/2019/08/22/the-continuous-value...

The concept of long-running background agents sounds appealing, but the real challenge tends to be reliability and task definition rather than raw model capability.

If an agent runs unattended for hours, small errors compound quickly. Even simple misunderstandings about file structure or instructions can derail the whole process.

You can find approaches that improve things, but there's always going to be a chance that your code is terrible if you let an LLM generate it and don't review it with human eyes.

But review fatigue and resulting apathy is real. Devs should instead be informed if incorrect code for whatever feature or process they are working on would be high-risk to the business. Lower-risk processes can be LLM-reviewed and merged. Higher risk must be human-reviewed.

If the business you're supporting can't tolerate much incorrectness (at least until discovered), than guess what - you aren't going to get much speed increase from LLMs. I've written about and given conference talks on this over the past year. Teams can improve this problem at the requirements level: https://tonyalicea.dev/blog/entropy-tolerance-ai/

Solo founder here, shipping a real product built mostly with AI. The code review thing is real but my actual daily pain is different. AI lies about being done. It'll say "implemented" and what it actually did is add a placeholder with a TODO comment. Or it silently adds a fallback path that returns hardcoded data when the real API fails, and now your app "works" but nothing is real.

I've also given it explicit rules like "never use placeholder images, always generate real assets" — and it just... ignores them sometimes. Not always. Sometimes. Which is worse, because you can't trust it but you also can't not use it.

The 80% it writes is fine. The problem is you still have to verify 100% of it.

Have you tried using an additional agent to verify the outputs? It seems that can help if the supervising agent has a small context demand on it. (ie. run this command, make sure it returns 0, invoke main coding agent with error message if it doesn't)

Took a super intelligent AI for us to realize how important tests and TDD is.

Regarding the self-congratulation machine - I simply use a different claude code session to do the reviews. There is no self-congratulation, but overly critical at times. Works well.

Honestly, sometimes the harnesses, specs, some predefined structure for skills etc all feel over-engineering. 99% of the time a bloody prompt will do. Claude Code is capable of planning, spawning sub-agents, writing tests and so on.

Claude.md file with general guidelines about our repo has worked extraordinarily good, without any external wrappers, harnesses or special prompts. Even the MD file has no specific structure, just instructions or notes in English.

One thing I've been wrestling with building persistent agents is memory quality. Most frameworks treat memory as a vector store — everything goes in, nothing gets resolved. Over time the agent is recalling contradictory facts with equal confidence.

The architecture we landed on: ingest goes through a certainty scoring layer before storage. Contradictions get flagged rather than silently stacked. Memories that get recalled frequently get promoted; stale ones fade.

It's early but the difference in agent coherence over long sessions is noticeable. Happy to share more if anyone's going down this path.

Interesting. I’ve been playing with something similar, at the coding agent harness message sequence level (memory, I guess). I’m looking at human driven UX for compaction and resolving/pruning dead ends

All these macho men - I wonder what exactly are they shipping at that pace?

Not a rhetoric question. Trillion token burners and such.

[deleted]

It's an interesting problem that even though it's represented by just you as a single person, I think this is shared across the board with larger corporations at scale. I know for example they were seeing this with game devs in regards to the Godot engine. So many people were uploading work done by AI that has been unverified that people just can't keep up with it. And maybe some of it's good, but how do you vet all the crap out? No one knows what's being written anymore (and non-devs can code now too, which is amazing, but part of the problem that we introduced). I think in the future of being a developer will be more about verifying code integrity and working with AI to ensure it is meeting said standards. Rather than actually being in the driver's seat. Not sexy, but we're handing the keys over willingly, yet, AI is only interpreting the intent. It's going to get things wrong no matter what we do.

> At some point you're not reviewing diffs at all, just watching deploys and hoping something doesn't break.

To everyone who plan on automating themselves out of a job by taking the human element out- this is the endgame that management wants: replacing your (expensive and non-tax-optimized) labor with scalable Opex.

It's also delusional.

Our app is a desktop integration and last year we added a local API that could be hit to read and interact with the UI. This unlocked the same thing the author is talking about - the LLM can do real QA - but it's an example of how it can be done even in non-web environments.

Edit: I even have a skill called release-test that does manual QA for every bug we've ever had reported. It takes about 10 hours to run but I execute it inside a VM overnight so I don't care.

i got me a windows mcp setup running in a sandbox, so it can look at screenshots, see the UIA, and click things either by coordinate or by UIA.

i let it run overnight against a windows app i was working on, and that got it from mostly not working to mostly working.

the loop was

1. look at the code and specs to come up with tests 2. predict the result 3. try it 4. compare the prediction against rhe result 5. file bug report, or call it a success

and then switch to bug fixing, and go back around again. Worked really well in geminicli with the giant context window

Just don’t use the same model to write and vet the code. Use two or more different models to verify the code in addition to reading it yourself.

They're definitely inferior to proper tests, but even weak CC tests on top of CC code is an improvement over no tests. If CC does make a change that shifts something dramatically even a weak test may flag enough to get CC to investigate.

Even better though - external test suits. Recently made a S3 server of which the LLM made quick work for MVP. Then I found a Ceph S3 test suite that I could run against it and oh boy. Ended up working really good as TDD though.

yeah i have been hearing a lot more about this concept of “digital twins” - where you have high fidelity versions of external services to run tests against. You can ask the API docs of these external services and give it to Claude. Wonder if that is where we will be going more towards.

Isn’t this just an API sandbox? Many services have a test/sandbox mode. I do wish they were more common outside of fintech.

Honestly I think the "same AI checking same AI" concern is a bit overstated at this point. If the agents don't share context - separate conversations, no common memory - Opus is good enough that they don't really fall into the same patterns. At least at the micro level, like individual functions and logic. Maybe at the macro/architectural level there's still something there but in practice I'm not seeing it much anymore.

There seems lots of preparing, planning, token buying, set goal, and token cost just to niche target and related vibe coding.

He admits the real hole himself: "this doesn't catch spec misunderstandings. If your spec was wrong to begin with, the checks will pass."

But there's a second problem underneath that one. Acceptance criteria are ephemeral. You write them before prompting, Playwright runs against them, and then where do they go? A Notion doc. A PR comment. Nowhere permanent. Next time an agent touches that feature, it's starting from zero again.

The commit that ships the feature should carry the criteria that verified it. Git already travels with the code. The reasoning behind it should too.

Did AI write this?

Nope - though I’ll take it as a compliment either way. It’s a problem I’ve been sitting with for a while, so the answer came out more formed than I expected. You disagree?

It has that punchy, breathless cadence... shrugs

Its actually a pretty good idea/framework for writing commit descriptions, especially for smaller changes that don't have any nuances to note in the commit

Why only small changes tho? I think it can also work with larger changes if you commit more regularly. And with agentic coding or even with autonomous agentic coding, you need to do it regularly and create these contextual checkpoints, no?

To me the last paragraph was the highest value in the article. Write out your test in plain language first, and then write the prompt for the autonomous agent using your language and the test prompt not the auto-code.

Wasn't the best practice to run one model/coding agent that writes the code and another one that reviews it? E.g. Claude Code for writing the code, GPT Codex to review/critique it? Different reward functions.

even in one agent, a different starting prompt will have you tracing a very different path through the model.

maybe it still sends you to the same valley, but there's so many parameters and dimensions that i dont think its very likely without also being correct

I think people are misunderstanding reward functions and LLMs.

LLMs don't actually have a reward system like some other ML models.

They are trained with one, and when you look at DPO you can say they contain an implicit one as well.

It’s superstition that using a different slop generator to “review” the slop from a different brand of slop generator somehow makes things better. It’s slop all the way down.

https://github.com/karpathy/llm-council

https://ui.adsabs.harvard.edu/abs/2025arXiv250214815C/abstra...

https://www.arxiv.org/abs/2509.23537

https://www.aristeidispanos.com/publication/panos2025multiag...

https://arxiv.org/abs/2305.14325

https://arxiv.org/abs/2306.05685

https://arxiv.org/abs/2310.19740v1

[deleted]

I think the solution has to be end to end tests. Maybe first run by humans, then maybe agents can learn and replicate. I can't see why unit tests really help other than for the LLM to reason about its own code a little more.

It's amazing the length at which people who want to write code go, to not write code.

Don't get me wrong, I use agentic coding often, when I feel it's going to type it faster than me (e.g. a lot of scaffolding and filler code).

Otherwise, what's the point?

I feel the whole industry is having its "Look ma! no hands!" moment.

Time to mature up, and stop acting like sailing is going where the seas take you.

> When Claude writes tests for code Claude just wrote, it's checking its own work.

You can have Gemini write the tests and Claude write the code. And have Gemini do review of Claude's implementation as well. I routinely have ChatGPT, Claude and Gemini review each other's code. And having AI write unit tests has not been a problem in my experience.

yeah i have started using codex to do my code reviews and it helps to have “a different llm” - i think one of my challenges has been that unit tests are good but not always comprehensive. you still need functional tests to verify the spec itself.

I don't think that's necessary, just make sure the context is not shared. A pretty good model can handle both sides well enough.

How do people not understand this? LLMs are goal machines. You need to give them the specific goal if you want good results and continue to reenforce it. So of course this means speccing and design work.

People are so enamored with how fast the 20% part is now and yes it’s amazing. But the 80% part by time (designing, testing, reviewing, refactoring, repairing) still exists if you want coherent systems of non-trivial complexity.

All the old rules still apply.

> Changes land in branches I haven't read. A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do. I care about this. I don't want to push slop, and I had no real answer.

That’s really putting the cart before the horse. How do you get to “merging 50 PRs a week” before thinking “wait, does this do the right thing?”

Yeah just wanted to see what the bottlenecks would be as I started pushing the limits. Eventually made this into a verification skill(github.com/opslane/verify)

I wish there was a way to "freeze" the tests. I want to write the tests first (or have Claude do it with my review), and then I want to get Claude to change the code to get them to pass - but with confidence that it doesn't edit any of the test files!

I use devcontainers in all the projects I use claude code on. [1] With it you can have claude running inside a container with just the project's code in write access and also mount a test folder with just read permissions, or do the opposite. You can even have both devcontainers and run them at the same time.

[1] https://code.claude.com/docs/en/devcontainer

If you want to try it just ask Claude to set it up for your project and review it after.

1. Make tests 2. Commit them 3. Proceed with implementation and tell agent to use the tests but not modify them

It will probably comply, and at least if it does change the tests you can always revert those files to where you committed them

Are there really no ways to control read/write permissions in a smart way? I've not had to do this yet, but is it really only capable of either being advisory with you implementing all the code, or it having full control over the repo where you just hope nothing important is changed?

You could probably make a system-level restriction so the software physically can't modify certain files, but I'm not sure how well that's going to fly if the program fails to edit it and there's no feedback of the failure.

You can use a Claude PreToolUse command hook to prevent write (or even read) access to specific files.

With this approach you can enforce that Claude cannot access to specific files. It’s a guarantee and will always work, unlike a prompt or Claude.md which is just a suggestion that can be forgotten or ignored.

This post has an example hook for blocking access to sensitive files:

https://aiorg.dev/blog/claude-code-hooks#:~:text=Protect%20s...

No. I don't want the mental burden of auditing whether it modified the tests.

Then, run the agent vm-sandboxed, with tests mounted as a read-only directory, if your directory structure allows it.

Or, less securely, hash the tests and check the hash with a hook, post tool use. Or a commit hook.

You'd be surprised - I know I was - you can encode Test-Driven development into workflows that agents actually follow. I wrote an in-depth guide about this and have a POC for people to try over here: https://www.joegaebel.com/articles/principled-agentic-softwa...

Why can't you do just that? You can configure file path permissions in Claude or via an external tool.

Why not use a client-server infrastructure for tests? The server sends the test code, the client runs the code, sends the output to the server and this replies pass/not pass.

One could even make zero-knowledge test development this way.

[deleted]

yeah i agree - this is somewhat the approach I have been using more of. Write the tests first based on specs and then write code to make the tests pass. This works well for cases where unit tests are sufficient.

You can remove edit permissions on the test directory

I'm not up to speed on Claude's features. Can I, from the prompt, quickly remove those permissions and then re-add them (i.e. one command to drop, and one command to re-add)?

Yeah, you can type `/permisssions` and do it there. Or you can make a custom slash command, or just ask Claude to do it. You can also set it when you launch a claude session, there are a dozen ways to do anything.

"Add a config option preventing you from modifying files matching src/*_test.py."

Just tell it that the tests can't be changed. Honestly I'd be surprised if it tried to anyway. I've never had it do that through many projects where tests were provided to drive development.

Now, Someone has to review tests! Just shifting ownership! Claude has just released 'Code Review'. But I don't think you can leave either one on autopilot.

Code Review: https://news.ycombinator.com/item?id=47313787

The example given in the article is acceptance criteria for a login/password entry flow. This is fairly easy to spec-out in terms of AC and TDD.

I have been asking these tools to build other types of projects where it (seems?) much more difficult to verify without a human-in-the-loop. One example is I had asked Codex to build a simulation of the solar system using a Metal renderer. It produced a fun working app quickly.

I asked it to add bloom. It looped for hours, failing. I would have to manually verify — because even from images — it couldn't tell what was right and wrong. It only got it right when I pasted a how-to-write-a-bloom-shader-pass-in-Metal blog post into it.

Then I noticed that all of the planet textures were rotating oddly every time I orbited the camera. Codex got stuck in another endless loop of "Oh, the lookAt matrix is in column major, let me fix that <proceeds to break everything>." or focusing (incorrectly) on UV coordinates and shader code. Eventually Codex told me what I was seeing "was expected" and that I just "felt like it was wrong."

When I finally realised the problem was that Codex had drawn the planets with back-facing polygons only, I reported the error, to which Codex replied, "Good hypothesis, but no"

I insisted that it change the culling configuration and then it worked fine.

These tools are fun, and great time savers (at times), but take them out of their comfort zone and it becomes real hard to steer them without domain knowledge and close human review.

Anyone who wants a more programmatic version of this, check out cucumber / gherkin - very old school regex-to-code plain english kind of system.

I am afraid that we are heading to a world in which we simply give up on the idea of correct code as an aspiration to strive for. Of course code has always been bad, and of course good code has never been a goal in the whole startup ecosystem (for perfectly legitimate reasons!). But that real production code, for services that millions or even billions of people rely on, should be reliable, that if it breaks that's a problem, this is the whole _engineering_ part of software engineering. And we can say: if we give that up we're going to have a whole lot more outages, security issues, all those things we are meant to minimize as a profession. And the answer is going to be: so what? We save money overall. And people will get used to software being unreliable; which is to say, people will not have a choice but to get used to it.

I disagree. An analytics tool that's correct 99.9% of the time is not 0.1% less valuable than a tool that is always correct. It's 100% less valuable.

Outage is the easy failure mode. I can work around a service that's up 80% of the time, but is 100% correct. A service that's up 100% of the time but is 80% correct is useless.

Well hang-on, in this case it is _neither_ reliable in terms of availability _nor_ correctness. Worst of all worlds.

Feels like a whole bunch of us are converging on very similar patterns right now.

I've been building OctopusGarden (https://github.com/foundatron/octopusgarden), which is basically a dark software factory for autonomous code generation and validation. A lot of the techniques were inspired by StrongDM's production software factory (https://factory.strongdm.ai/). The autoissue.py script (https://github.com/foundatron/octopusgarden/blob/main/script...) does something really close to what others in this thread are describing with information barriers. It's a 6-phase pipeline (plan, review plan, implement, cold code review, fix findings, CI retry) where each phase only gets the context it actually needs. The code review phase sees only the diff. Not the issue, not the plan. Just the diff. That's not a prompt instruction, it's how the pipeline is wired. Complexity ratings from the review drive model selection too, so simple stuff stays on Sonnet and complex tasks get bumped to Opus.

On the test freezing discussion, OctopusGarden takes a different approach. Instead of locking test files, the system treats hand-written scenarios as a holdout set that the generating agent literally never sees. And rather than binary pass/fail (which is totally gameable, the specification gaming point elsewhere in this thread is spot on), an LLM judge scores satisfaction probabilistically, 0-100 per scenario step. The whole thing runs in an iterative loop: generate, build in Docker, execute, score, refine. When scores plateau there's a wonder/reflect recovery mechanism that diagnoses what's stuck and tries to break out of it.

The point about reviewing 20k lines of generated code is real. I don't have a perfect answer either, but the pipeline does diff truncation (caps at 100KB, picks the 10 largest changed files, truncates to 3k lines) and CI failures get up to 4 automated retry attempts that analyze the actual failure logs. At least overnight runs don't just accumulate broken PRs silently.

Also want to shout out Ouroboros (https://github.com/Q00/ouroboros), which comes at the problem from the opposite direction. Instead of better verification after generation, it uses Socratic questioning to score specification ambiguity before any code gets written. It literally won't let you proceed until ambiguity drops below a threshold. The core idea ("AI can build anything, the hard part is knowing what to build") pairs well with the verification-focused approaches everyone's discussing here. Spec refinement upstream, holdout validation downstream.

A short story: A developer let ClaudeCode manage his AWS infrastructure. The agent ran a Terraform destroy command... Gone: 2 websites, production database, all backups and 2.5 years of data The agent didn't make a mistake. It did exactly what it was allowed to do. That's the problem dude

I don't think this is right becasue it's talking about Claude like it's a entity in the world. Claude reviewing Claude generated code and framing it like a individual reviewing it's own code isn't the same.

red / green / refactor is a reasonable way through this problem

I appear to be in the minority here. Perhaps because I've been practicing TDD for decades, this reads like the blog equivalent of "water is wet."

Adversarial AI code gen. Have another AI write the tests, tell Codex that Claude wrote some code and to audit the code and write some tests. Tell Gemini that Codex wrote the tests. Have it audit the tests. Tell Codex that Gemini thinks its code is bad and to do better. (Have Gemini write out why into dobetter.md)

Even better, encode it into a workflow and have the subagents be adversarial to each other: https://www.joegaebel.com/articles/principled-agentic-softwa...

Do you really, honestly, have to be doing this stuff even when you sleep? To the point it hits you “wait is this even any good? Gee I don’t want to push out slop.”

If you don’t trust the agent to do it right in the first place why do you trust them to implement your tests properly? Nothing but turtles here.

None of this really answers the problem of all this slop is being produced at record pace and still requires absorption into the company, into the practices, and be reviewed by a human being.

I don't think AI will ever solve this problem. It will never be more than a tool in the arsenal. Probably the best tool, but a tool nonetheless.

[dead]

[flagged]

[dead]

[flagged]

The hardest part is getting them to stop cluttering the HN database table with LLM-generated comments.

Exactly. That’s why I’m skeptical about long running agent loops too.

The thing is, LLMs are probabilistic data structures, and the probability of incorrect final output is proportional to both amount of turns made and amount of agents run simultaneously. In practice it means you almost never end up with desired result after a long loop.

This account constantly posts LLM-generated comments.

If you don't like it, flag the comment as per the guidelines: https://news.ycombinator.com/newsguidelines.html#generated