Ok it might sound crazy but I actually got the best code quality (completely ignoring that the cost is likely 10x more) by having a full “project team” using opencode with multiple sub-agents which are all managed by a single Opus instance. I gave them the task of porting a legacy Java server to C# .NET 10: 9 agents, a 7-stage Kanban, and isolated Git Worktrees.
Manager (Claude Opus 4.5): Global event loop that wakes up specific agents based on folder (Kanban) state.
Product Owner (Claude Opus 4.5): Strategy. Cuts scope creep
Scrum Master (Opus 4.5): Prioritizes backlog and assigns tickets to technical agents.
Architect (Sonnet 4.5): Design only. Writes specs/interfaces, never implementation.
Archaeologist (Grok-Free): Lazy-loaded. Only reads legacy Java decompilation when Architect hits a doc gap.
CAB (Opus 4.5): The Bouncer. Rejects features at Design phase (Gate 1) and Code phase (Gate 2).
Librarian (Gemini 2.5): Maintains "As-Built" docs and triggers sprint retrospectives.
You might ask yourself the question “isn’t this extremely unnecessary?” and the answer is most likely _yes_. But I never had this much fun watching AI agents at work (especially when CAB rejects implementations).
This was an early version of the process that the AI agents are following (I didn’t update it since it was only for me anyway): https://imgur.com/a/rdEBU5I
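To give a rough idea of the shape of it, the whole team boils down to a small mapping like this (a simplified sketch with illustrative stage names, not the actual config; the implementation agents are omitted):

```python
# Simplified sketch of the team layout. Each Kanban stage folder is "owned" by
# one agent; the manager only ever looks at which folder a ticket file sits in.
# (Implementation agents, e.g. a Sonnet backend dev, are not shown here.)

TEAM = {
    "manager":       {"model": "claude-opus-4.5",   "role": "global event loop / dispatcher"},
    "product_owner": {"model": "claude-opus-4.5",   "role": "strategy, cuts scope creep"},
    "scrum_master":  {"model": "claude-opus-4.5",   "role": "backlog priority, ticket assignment"},
    "architect":     {"model": "claude-sonnet-4.5", "role": "specs/interfaces only, no implementation"},
    "archaeologist": {"model": "grok-free",         "role": "reads legacy Java decompilation on demand"},
    "cab":           {"model": "claude-opus-4.5",   "role": "gatekeeper at design and code gates"},
    "librarian":     {"model": "gemini-2.5",        "role": "as-built docs, retrospectives"},
}

# 7-stage Kanban: a ticket is just a markdown file that gets moved between folders.
STAGES = [
    "00_backlog", "01_design", "02_design_gate", "03_implementation",
    "04_code_gate", "05_integration", "06_done",
]
```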
Every time I read something like this, it strikes me as an attempt to convince people that various people-management memes are still going to be relevant moving forward.
Or even that they currently work when used on humans today.
The reality is these roles don't even work in human organizations today. Classic "job_description == bottom_of_funnel_competency" fallacy.
If they make the LLMs more productive, it is probably explained by a less complicated phenomenon that has nothing to do with the names of the roles, or their descriptions.
Adversarial techniques work well for ensuring quality, parallelism is obviously useful, important decisions should be made by stronger models, and using the weakest model for the job helps keep costs down.
My understanding is that the main reason splitting up work is effective is context management.
For instance, if an agent only has to be concerned with one task, its context can be massively reduced. Further, the next agent can just be told the outcome, it also has reduced context load, because it doesn't need to do the inner workings, just know what the result is.
For instance, a security testing agent just needs to review code against a set of security rules, and then list the problems. The next agent then just gets a list of problems to fix, without needing a full history of working it out.
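A rough sketch of that handoff, to make it concrete (the `call_llm` helper and rule text are stand-ins for whatever provider and policy you actually use):

```python
# Minimal sketch of the handoff: the reviewer's context is rules + code, the
# fixer's context is code + findings. Neither ever sees the other's reasoning.

def call_llm(system: str, user: str) -> str:
    """Stand-in for one stateless chat-completion call to your provider."""
    raise NotImplementedError("wire up your provider of choice here")

SECURITY_RULES = "No raw SQL string concatenation. No secrets in source. Validate all inputs."

def security_review(code: str) -> str:
    # Reviewer context: just the rules and the code, no project history.
    return call_llm(
        system=f"Review the code against these rules:\n{SECURITY_RULES}\n"
               "Return a numbered list of concrete problems, nothing else.",
        user=code,
    )

def fix_findings(code: str, findings: str) -> str:
    # Fixer context: just the code and the list of problems.
    return call_llm(
        system="Apply the listed fixes. Return only the corrected code.",
        user=f"{code}\n\nProblems to fix:\n{findings}",
    )
```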
So two things.. Yes this helps with context and is a primary reason to break out the sub-agents.
However one of the bigger things is by having a focus on a specific task or a role, you force the LLM to "pay attention" to certain aspects. The models have finite attention and if you ask them to pay attention to "all things".. they just ignore some.
The act of forcing the model to pay attention can be accomplished in alternative ways (a defined process, committee formation in a single prompt, etc.), but defining personas at the sub-agent level is one of the most efficient ways to encode a world view and responsibilities, vs explicitly listing them.
Which, ultimately, is not such a big difference to the reason we split up work for humans, either. Human job specialization is just context management over the course of 30 years.
> Which, ultimately, is not such a big difference to the reason we split up work for humans,
That's mostly for throughput, and context management.
It's context management in that no human knows everything, but that's also throughput in a way because of how human learning works.
I’ve found that task isolation, rather than preserving your current session’s context budget, is where subagents shine.
In other words, when I have a task that specifically should not have project context, then subagents are great. Claude will also summon these “swarms” for the same reason. For example, you can ask it to analyze a specific issue from multiple relevant POVs, and it will create multiple specialized agents.
However, without fail, I’ve found that creating a subagent for a task that requires project context will result in worse outcomes than using “main CC”, because the sub simply doesn’t receive enough context.
I think it's just the opposite, as LLMs feed on human language. "You are a scrum master." Automatically encodes most of what the LLM needs to know. Trying to describe the same role in a prompt would be a lot more difficult.
Maybe a different separation of roles would be more efficient in theory, but an LLM understands "you are a scrum master" from the get go, while "you are a zhydgry bhnklorts" needs explanation.
-Tested 162 personas across 6 types of interpersonal relationships and 8 domains of expertise, with 4 LLM families and 2,410 factual questions
-Adding personas in system prompts does not improve model performance compared to the control setting where no persona is added
-Automatically identifying the best persona is challenging, with predictions often performing no better than random selection
-While adding a persona may lead to performance gains in certain settings, the effect of each persona can be largely random
Fun piece of trivia - the paper was originally designed to prove the opposite result (that personas make LLMs better). They revised it when they saw the data completely disproved their original hypothesis.
A persona is not the same thing as a role. The point of the role is to limit the work of the agent and to focus it on one or two behaviors.
What the paper is really addressing is whether keywords like "you are a helpful assistant" give better results.
The paper is not addressing a role such as "you are a system designer" or "you are a security engineer", which will produce completely different results and focus the output of the LLM.
How well does such llm research hold up as new models are released?
Most model research decays because the evaluation harness isn’t treated as a stable artefact. If you freeze the tasks, acceptance criteria, and measurement method, you can swap models and still compare apples to apples. Without that, each release forces a reset and people mistake novelty for progress.
In a discussion about LLMs you link to a paper from 2023, when not even GPT-4 was available?
And then you say:
> comprehensively disproven
? I don't think you understand the scientific method
One study has “comprehensively disproven” something for you? You must be getting misled left right and centre if that’s how you absorb study results.
I suppose it could end up being an LLM variant of Conway’s Law.
“Organizations are constrained to produce designs which are copies of the communication structures of these organizations.”
If so, one benefit is you can quickly and safely mix up your set of agents (a la Inverse Conway Manoeuvre) without the downsides that normally entails (people being forced to move teams or change how they work).
Developers do want managers actually, to simplify their daily lives. Otherwise they would self-manage better and keep more of the share of revenue for themselves.
Unfortunately some managers get lonely and want a friendly face in their org meetings, or can’t answer any technical questions, or aren’t actually tracking what their team is doing. And so they pull in an engineer from their team.
Being a manager is a hard job but the failure mode usually means an engineer is now doing something extra.
Subagent orchestration without the overhead of frameworks like Gastown is genuinely exciting to see. I’ve recorded several long-running demos of Pied-Piper, which is a Subagents orchestration system for Claude Code and ClaudeCodeRouter+OpenRouter here: https://youtube.com/playlist?list=PLKWJ03cHcPr3OWiSBDghzh62A...
I came across a concept called DreamTeam, where someone was manually coordinating GPT 5.2 Max for planning, Opus 4.5 for coding, and Gemini Pro 3 for security and performance reviews. Interesting approach, but clearly not scalable without orchestration. In parallel, I was trying to do repeatable workflows like API migration, Language migration, Tech stack migration using Coding agents.
Pied-Piper is a subagent orchestration system built to solve these problems and enable repeatable SDLC workflows. It runs from a single Claude Code session, using an orchestrator plus multiple agents that hand off tasks to each other as part of a defined workflow called Playbooks:
https://github.com/sathish316/pied-piper
Playbooks allow you to model both standard SDLC pipelines (Plan → Code → Review → Security Review → Merge) and more complex flows like language migration or tech stack migration (Problem Breakdown → Plan → Migrate → Integration Test → Tech Stack Expert Review → Code Review → Merge).
Ideally, it will require minimal changes once Claude Swarm and Claude Tasks become mainstream.
How much does this setup cost? I don't think a regular Claude Max subscription makes this possible.
I was getting good results with a similar flow but was using claude max with ChatGPT. unfortunately not an option available to me anymore unless either I or my company wants to foot the bill.
This now makes me think that the only way to get AI to work well enough to actually replace programmers will probably be paying so much for compute that it's less expensive to just have a junior dev instead.
I have been using a simpler version of this pattern, with a coordinator and several more or less specialized agents (eg, backend, frontend, db expert). It really works, but I think that the key is the coordinator. It decreases my cognitive load, and generally manages to keep track of what everyone is doing.
Can you share technical details please? How is this implemented? Is it pure prompt-based, plugins, or do you have like script that repeatedly calls the agents? Where does the kanban live?
Not the OP, but this is how I manage my coding agent loops:
I built a drag and drop UI tool that sets up a sequence of agent steps (Claude code or codex) and have created different workflows based on the task. I'll kick them off and monitor.
Could you share some details? How many lines of code? How much time did it take, and how much did it cost?
Very cool! A couple of questions:
1. Are you using a Claude Code subscription? Or are you using the Claude API? I'm a bit scared to use the subscription in OpenCode due to Anthropic's ToS change.
2. How did you choose what models to use in the different agents? Do you believe or know they are better for certain tasks?
> due to Anthropic's ToS change.
Not a change, but enforcing terms that have been there all the time.
You might as well just have a planner and workers; your architecture essentially echoes such a structure anyway. It is difficult to discern how the semantics drive different behavior amongst those roles, and why the planner can't create those prompts ad hoc.
Is it just multiple opencode instances inside tmux panels or how do you run your setup?
What are the costs looking like to run this? I wonder whether you would be able to use this approach within a mixture-of-experts model trained end-to-end in ensemble. That might take out some guesswork insofar the roles go.
Interesting that your impl agents are not opus. I guess having the more rigorous spec pipeline helps scope it to something sonnet can knock out.
What are you building with the code you are generating?
You probably implemented gastown.
Is this satire?
Nope it isn’t. I did it as a joke initially (I also had a version where every 2 stories there was a meeting and if someone underperformed they would get fired).
I think there are multiple reasons why it actually works so well:
- I built a system where context (+ the current state + goal) is properly structured and coding agents only get the information they actually need and nothing more. You wouldn’t let your product manager develop your backend, and the backend dev only does the things it is supposed to do and nothing more. If an agent crashes (or quota limits are reached), another agent can continue exactly where it left off.
- Agents are ”fighting against” each other to some extent? The Architect tries to design while the CAB tries to reject.
- Granular control. I wouldn’t call “the manager” _a deterministic state machine that is calling probabilistic functions_ but that’s to some extent what it is? The manager has clearly defined tasks (like “if file is in 01_design —> Call Architect)
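In pseudocode, the manager is roughly this loop (folder names, the dispatch table, and the run_agent helper are illustrative, not the real setup):

```python
# The "deterministic state machine calling probabilistic functions", roughly.
import time
from pathlib import Path

DISPATCH = {                      # Kanban folder -> agent that gets woken up
    "01_design": "architect",
    "02_design_gate": "cab",
    "03_implementation": "backend_dev",
    "04_code_gate": "cab",
}

def run_agent(name: str, ticket: Path) -> None:
    """Stand-in: spawn the named sub-agent on this ticket in its own worktree."""
    raise NotImplementedError

def manager_loop(board: Path) -> None:
    while True:
        for stage, agent in DISPATCH.items():
            for ticket in (board / stage).glob("*.md"):
                # The manager never writes code itself; it only dispatches.
                run_agent(agent, ticket)
        time.sleep(30)   # wake up, re-scan the board, repeat
```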
Here’s one example of an agent log after a feature has been implemented from one of the older codebases:
https://pastebin.com/7ySJL5Rg
Thanks for clarifying - I think some of the wording was throwing me off. What a wild time we are in!
What OpenCode primitive did you use to implement this? I'd quite like a "senior" Opus agent that lays out a plan, a "junior" Sonnet that does the work, and a senior Opus reviewer to check that it agrees with the plan.
You can define the tools that agents are allowed to use in the opencode.json (also works for MCP tools I think).
Here’s my config: https://pastebin.com/PkaYAfsn
The models can call each other if you reference them using @username.
This is excellent, thank you. I came up with half of this while waiting for this reply, but the extra pointers about mentioning with @ and the {file} syntax really helps, thanks again!
Isn't all this a manual implementation of prompt routing, and, to a lesser extent, Mixture of Experts?
These tools and services are already expected to do the best job for specific prompts. The work you're doing pretty much proves that they don't, while also throwing much more money at them.
How much longer are users going to have to manually manage LLM context to get the most out of these tools? Why is this still a problem ~5 years into this tech?
> [...]coding agents only get the information they actually need and nothing more
Extrapolating from this concept led me to a hot-take I haven't had time to blog about: Agentic AI will revive the popularity of microservices. Mostly due to the deleterious effect of context size on agent performance.
In a fresh project that is well documented and set up it might work better. Many of the issues that agents have in my work come from endpoints not always being documented correctly.
Real example that happened to me, Agent forgets to rename an expected parameter in API spec for service 1. Now when working on service 2, there is no other way of finding this mistake for the Agent than to give it access to service 1. And now you are back to "... effect of context size on agent performance ...". For context, we might have ~100 services.
One could argue these issues reduce over time as instruction files are updated etc but that also assumes the models follow instructions and don't hallucinate.
That being said, I do use Agents quite successfully now - but I have to guide them a bit more than some care to admit.
Why would they revive the popularity of microservices? They can just as well be used to enforce strict module boundaries within a modular monolith keeping the codebase coherent without splitting off microservices.
And that's why they call it a hot take. No, it isn't going to give rise to microservices. You absolutely can have your agent perform high-level decomposition while maintaining a monolith. A well-written, composable spec is awesome. This has been true for human and AI coders for a very, very long time. The hat trick has always been getting a well-written, composable spec. AI can help with that bit, and I find that is probably the best part of this whole tooling cycle. I can actually interact with an AI to build that spec iteratively. Have it be nice and mean. Have it iterate among many instances and other models, all that fun stuff. It still won't make your idea awesome or make anyone want to spend money on it, though.
quite a storyteller
I'm confused when you say you have a manager, scrum master, architect, all supposedly sharing the same memory, do each of those "employees" "know" what they are? And if so, based on what are their identities defined? Prompts? Or something more. Or am I just too dumb to understand / swimming against the current here. Either way, it sounds amazing!
Their roles are defined by prompts. The only memory is shared files and the conversation history that’s looped back into stateless API calls to an LLM.
It's not satire but I see where you're coming from.
Applying distributed human team concepts to a porting task squeezes extra performance from LLMs much further up the diminishing returns curve. That matters because porting projects are actually well-suited for autonomous agents: existing code provides context, objective criteria catch more LLM-grade bugs than greenfield work, and established unit tests offer clear targets.
I guess what I'm trying to say is that the setup seems absurd because it is. Though it also carries real utility for this specific use case. Apply the same approach to running a startup or writing a paid service from scratch and you'd get very different results.
I don't know about something this complex, but right this moment I have something similar running in Claude Code in another window, and it is very helpful even with a much simpler setup:
If you have these agents do everything at the "top level" they lose track. The moment you introduce sub-agents, you can have the top level run in a tight loop of "tell agent X to do the next task; tell agent Y to review the work; repeat" or similar (add as many agents as makes sense), and it will take a long time to fill up the context. The agents get fresh context, and you get to manage explicitly what information is allowed to flow between them. It also tends to mean it is a lot easier to introduce quality gates - eg. your testing agent and your code review agent etc. will not decide they can skip testing because they "know" they implemented things correctly, because there is no memory of that in their context.
Sometimes too much knowledge is a bad thing.
Humans seem to be similar. If a real product designer would dive into all the technical details and code of a product, he would likely forget at least some of the vision behind what the product is actually supposed to be.
Doubt it. I use a similar setup from time to time.
You need to have different skills at different times. This type of setup helps break those skills out.
why would it be? It's a creative setup.
I just actually can't tell, it reads like satire to me.
to me, it reads like mental illness
maybe it's a mix of both :)
Why would it be satire? I thought that's a pretty standard agentic workflow.
My current workplace follows a similar workflow. We have a repository full of agent.md files for different roles and associated personas.
E.g. For project managers, you might have a feature focused one, a delivery driven one, and one that aims to minimise scope/technology creep.
I mean no offence to anyone but whenever new tech progresses rapidly it usually catches most unaware, who tend to ridicule or feel the concepts are sourced from it.
yeah, nfts, metaverse, all great advances
same people pushing this crap
ai is actually useful tho. idk about this level of abstraction but the more basic delegation to one little guy in the terminal gives me a lot of extra time
Maybe that's because you're not using your time well in the first place
bro im using ai swarms, have you even tried them?
bro wanna buy some monkey jpegs?
100% genuine
You're mocking NFTs, but the original NFT CryptoPunks still sell for a minimum of $80k.
Where were you back then? Laughing about them instead of creating intergenerational wealth for a few bucks?
> Laughing about them instead of creating intergenerational wealth for a few bucks?
it's not creating wealth, it's scamming the gullible
criminality being lucrative is not a new phenomenon
Are you sure that yours would sell for $80K, if you aren't using it to launder money with your criminal associates?
If the price floor is 80k and there are thousands then it means that even if just one was legit it would sell for 80k
Weird, I'm getting downvoted for just stating facts again
I don't think so.
I think many people really like the gamification and complex role playing. That is how GitHub got popular, that is how Rube Goldberg agent/swarm/cult setups get popular.
It attracts the gamers and LARPers. Unfortunately, management is on their side until they find out after four years or so that it is all a scam.
I've heard some people say that "vibe coding" with chatbots is like slot machines, you just keep "prompting" until you hit the jackpot. And there was some earlier study that people _felt_ more productive even if they weren't (caveat that this was with older models), which aligns with the sort of time-dilation people feel when gambling.
I guess "agentic swarms" are the next evolution of the meta-game, the perfect nerd-sniping strategy. Now you can spend all your time minmaxing your team, balancing strengths/weaknesses by tweaking subagents, adding more verifiers and project managers. Maybe there's some psychological draw, that people can feel like gods and have a taste of the power execs feel, even though that power is ultimately a simulacrum as well.
Extending this -- unlike real slot machines, there is no definite state of won or not for the person prompting, only whether they've been convinced they've won, and that comes down to how much you're willing to verify the code it has provided, or better, fully test it (which no one wants to do), versus the reality where they do a little light testing and say it's good enough and move on.
Recently fixed a problem over a few days, and found that it was duplicated though differently enough that I asked my coworker to try fixing it with an LLM (he was the originator of the duplicated code, and I didn't want to mess up what was mostly functioning code). Using an LLM, he seemingly did in 1 hour what took me maybe a day or two of tinkering and fixing. After we hop off the call, I do a code read to make sure I understand it fully, and immediately see an issue and test it further only to find out.. it did not in fact fix it, and suffered from the same problems, but it convincingly LOOKED like it fixed it. He was ecstatic at the time-saved while presenting it, and afterwards, alone, all I could think about was how our business users were going to be really unhappy being gaslit into thinking it was fixed because literally every tester I've ever met would definitely have missed it without understanding the code.
People are overjoyed with good enough, and I'm starting to think maybe I'm the problem when it comes to progress? It just gives me Big Short vibes -- why am I drawing attention to this obvious issue in quality, I'm just the guy in the casino screaming "does no one else see the obvious problem with shipping this?" And then I start to understand, yes I am the problem: people have been selling eachother dog water product for millenia because at the end of the day, Edison is the person people remember, not the guy who came after that made it near perfect or hammered out all the issues. Good enough takes its place in history, not perfection. The trick others have found out is they just need to get to the point that they've secured the money and have time to get away before the customer realizes the world of hurt they've paid for.
The next stage in all of this shit is to turn what you have into a service. What's the phrase? I don't want to talk to the monkey, I want to talk to the organ grinder. So when you kick things off it should be a tough interview with the manager and program manager. Once they're on board and know what you want, they start cracking. Then they just call you in to give demos and updates. Lol
Congratulations on coming up with the cringiest thing I have ever seen. Nothing will top this, ever.
Corporate has to die
Scrum masters typically do not assign tickets.
This is just sub agents, built into Claude. You don’t need 300,000 line tmux abstractions written in Go. You just tell Claude to do work in parallel with background sub agents. It helps to have a file for handing off the prompt, tracking progress, and reporting back. I also recommend constraining agents to their own worktrees. I am writing down the pattern here: https://workforest.space. While nearly everyone is building orchestrators, I also noticed Claude is already the best orchestrator for Claude.
It isn't sub agents. The gap with existing tooling is that the abstraction is over a task rather than a conversation (due to the issue with third-party apps, Claude Code has been inherently limited to conversations which is why they have been lacking in this area, Claude Code Web was the first move in this direction), and the AI is actually coordinating the work (as opposed to being constantly prompted by the user).
One of the issues that people had which necessitated this feature is that you have a task, you tell Claude to work on it, and Claude has to keep checking back in for various (usually trivial) things. This workflow allows for more effective independent work without context management issues (with subagents there is also an issue with how the progress of the task is communicated; by introducing things like a task board, it is possible to manage this state outside of context). The flow is quite complex and requires a lot of additional context that isn't required with a chat-based flow, but it is a much better way to do things.
The way to think about this pattern - one which many people began concurrently building in the past few months - is an AI which manages other AIs.
It isn't "just" sub agents, but you can achieve most of this just with a few agents that take on generic roles, and a skill or command that just tells claude to orchestrate those agents, and a CLAUDE.md that tells it how to maintain plans and task lists, and how to allow the agents to communicate their progress.
It isn't all that hard to bootstrap. It is, however, something most people don't think about and shouldn't need to have to learn how to cobble together themselves, and I'm sure there will be advantages to getting more sophisticated implementations.
Right, but the model is still: you tell the AI what to do, this is the AI tells other AIs what to do. The context makes a huge difference because it has to be able to run autonomously. It is possible to do this with SDK and the workflow is completely different.
It is very difficult to manage task lists in context. Have you actually tried to do this? i.e. not within a Claude Code chat instance but by one-shot prompting. It is possible that they have worked out some way to do this, but when you have tens of tasks, merge conflicts, you are running that prompt over months, etc. At best, it doesn't work. At worst, you are burning a lot of tokens for nothing.
It is hard to bootstrap because this isn't how Claude Code works. If you are just using OpenRouter, it is also not easy because, after setting up tools/rebuilding Claude Code, it is very challenging to setup an environment so the AI can work effectively, errors can be returned, questions returned, etc. Afaik, this is basically what Aider does...it is not easy, it is especially not easy in Claude Code which has a lot of binding choices from the business strategy that Anthropic picked.
> Have you actually tried to do this? i.e. not within a Claude Code chat instance but by one-shot prompting.
You ask if I've tried to do this, and then set constraints that are completely different to what I described.
I have done what I described. Several times for different projects. I have a setup like that running right now in a different window.
> It is hard to bootstrap because this isn't how Claude Code works.
It is how Claude Code works when you give it a number of sub-agents with rules for how to manage files that effectively works like task queues, or skills/mcp servers to interact with communications tools.
> it is not easy
It is not easy to do in a generic way that works without tweaks for every project and every user. It is reasonably easy to do for specific teams where you can adjust it to the desired workflows.
It's natural to assume that subagents will scale to the next level of abstraction; as you mentioned, they do not.
The unlock here is tmux-based session management for the teammates, with two-way communication using agent inbox. It works very well.
> Claude Code has been inherently limited to conversations
How so? I’ve been using “claude -p” for a while now.
But even within an interactive session, an agent call out is non-interactive. It operates entirely autonomously, and then reports back the end result to the top level agent.
Because of OAuth. If they gave people API keys then no-one buys their ludicrously priced API product (I assume their strategy is to subsidise their consumer product with the business product).
You can use Claude Code SDK but it requires a token from Claude Code. If you use this token anywhere else, your account gets shut down.
Claude -p still hits Claude Code with all the tools, all the Claude Code wrapping.
That’s not what this subthread is about. They’re talking about the subagent within Claude Code itself.
Btw, you can use the Claude Agent SDK (the renamed Claude Code SDK) with a subscription. I can tell you it works out of the box, and AFAIK it is not a ToS violation.
[deleted]
Oh really? I was looking at the Agent SDK for an idea and the docs seemed to imply that wasn't the case.
> Unless previously approved, we do not allow third party developers to offer Claude.ai login or rate limits for their products, including agents built on the Claude Agent SDK. Please use the API key authentication methods described in this document instead.
I didn't dig deeper, but I'd pick it back up for a little personal project if I could just use my current subscription. Does it just use your local CC session out of the box?
I believe they’re talking about Claude Code’s built-in agents feature which works fine with a Max subscription.
Are you talking about the same thing or something else like having Claude start new shell sessions?
> If they gave people API keys then no-one buys their ludicrously priced API product
The main driver for those subscriptions is that their monthly cost with Opus 3.7 and up pays for itself within a couple of hours of basic CC use, relative to API prices.
can't you just rip the oauth client secret out of the code?
It’s even less of a feature, Claude Code already has subagents; this new feature just ensures Claude Code actually uses this for implementation.
imho the plans of Claude Code are not detailed enough to pull this off; they’re trying to do it to preserve context, but the level of detail in the plans is not nearly enough for it to be reliable.
I agree with this. Any time I make a plan I have to go back and fill it in, fill it in, what did we miss, tdd, yada yada. And yes, I have all this stuff in CLAUDE.md.
You start to get a sense for what size plan (in kilobytes) corresponds to what level of effort. Verification adds effort, and there's a sort of ... Rocket equation? in that the more infrastructure you put in to handle like ... the logistics of the plan, the less you have for actual plan content, which puts a cap on the size of an actionable plan. If you can hit the sweet spot though... GTFO.
I also like to iterate several times in plan mode with Claude before just handing the whole plan to Codex to melt with a superlaser. Claude is a lot more ... fun/personable to work with? Codex is a force of nature.
Another important thing I will do is now that launching plans clear context, it's good to get out of planning mode early, hit an underspecified bit, go back into planning mode and say something like "As you can see the plan was underspecified, what will the next agent actually need to succeed?" and iterate that way before we actually start making moves. This is made possible by lots of explicit instructions in CLAUDE.md for Claude to tell me what it's planning/thinking before it acts. Suppressing the toolcall reflex and getting actual thought out helps so much.
It’s moving fast. Just today I noticed Claude Code now ends plans with a reference to the entire prior conversation (as a .jsonl file on disk) with instructions to check that for more details.
Not sure how well it’s working though (my agents haven’t used it yet)
Interesting about the level of detail. I’ve noticed that myself but I haven’t done much to address it yet.
I can imagine some ideas (ask it for more detail, ask it to make a smaller plan and add detail to that) but I’m curious if you have any experience improving those plans.
Effectively it tries to resolve all ambiguities by making all decisions explicit — if the source cannot be resolved to documentation or anything, it’s asked to the user.
It also tries to capture all “invisible knowledge” by documenting everything, so that all these decisions and business context are captured in the codebase again.
Which - in theory - should make long term coding using LLMs more sane.
The downside is that it takes 30min - 60min to write a plan, but it’s much less likely to make silly choices.
Have you tried the compound engineering plugin? [^1]
My workflow with it is usually brainstorm -> lfg (planning) -> clear context -> lfg (giving it the produced plan to work on) -> compound if it didn’t on its own.
That’s super interesting, I’ll take a look to see if I can learn something from it, as I’m not familiar with the concept of compound engineering.
Seems like a lot of it aligns with what I’m doing, though.
I iterate around issues. I have a skill to launch a new tmux window for worktree with Claude in one pane and Codex in another pane with instructions on which issue to work on, Claude has instructions to create a plan, while Codex has instructions to understand the background information necessary for this issue to be worked on. By the time they're both done, then I can feed Claude's plan into Codex, and Codex is ready to analyze it. And then Codex feeds the plan back to Claude, and they kind of ping pong like that a couple times. And after a certain or several iterations, there's enough refinement that things usually work.
Then Claude clears context and executes the plan. Then Codex reviews the commit and it still has all the original context so it knows what we have been planning and what the research was about the infrastructure. And it does a really good job reviewing. And again, then they ping pong back and forth a couple times, and the end product is pretty decent.
Codex's strength is that it really goes in-depth. I usually do this at a high reasoning effort. But Codex has zero EQ or communication skills, so it works really well as a pedantic reviewer. Claude is much more pleasant to interact with. There's just no comparison. That's why I like planning with Claude much more because we can iterate..
I am just a hobbyist though. I do this to run my Ansible/Terraform infrastructure for a good size homelab with 10 hosts. So we actually touch real hardware a lot and there's always some gotchas to deal with. But together, this is a pretty fun way to work. I like automating stuff, so it really scratches that itch.
I have had good success with the plans generated by https://github.com/obra/superpowers I also really like the Socratic method it uses to create the plans.
Claude already had subagents. This is a new mode for the main agent to be in (bespoke context oriented to delegation), combined with a team-oriented task system and a mailbox system for subagents to communicate with each other. All integrated into the harness in a way that plugins can't achieve.
Wow there goes a lot of harnesses out the window. The main limitation of subagents was they couldn’t communicate back and forth with the main agent. How do we invoke swarm mode in Claude Code?
OT: Your visual on "stacked PRs" instantly made me understand what a stacked PR is. Thank you!
I had read about them before but for whatever reason it never clicked.
Turns out I already work like this, but I use commits as "PRs in the stack" and I constantly try to keep them up to date and ordered by rebasing, which is a pain.
Given my new insight with the way you displayed it, I had a chat with chatGPT and feel good about giving it a try:
1. 2-3 branches based on a main feature branch
2. can rebase base branch with same frequency, just don't overdo it, conflicts should be base-isolated.
3. You're doing it wrong if conflicts cascade deeply and often
4. Yes merge order matters, but tools can help and generally the isolation is the important piece
If you’re interested in exploring tooling around stacked PRs, I wrote git-spice (https://abhinav.github.io/git-spice/) a while ago. It’s free and open-source, no strings attached.
If you're rebasing a lot, definitely set up rerere (reuse recorded solution) - it improves things enormously.
Do make sure you know how to reset the cache, in case you did a bad conflict resolution because it will keep biting you. Besides that caveat it's a must.
After a quick read it seems like gitflow is intended to model a release cycle. It uses branches to coordinate and log releases.
Stacking is meant to make development of non-trivial features more manageable and more likely to enter main safer and faster.
it's specific to each developer's workflow and wouldn't necessarily produce artifacts once merged into main (as gitflow seems to intentionally have a stance on)
Please don’t use git-flow. Every time I see it, it looks like an over-engineer’s wet dream.
Can you say more as to why? The concept is not complex and in our situation at least provides a lot of benefits.
I think the guy that created it has even stated he thinks it's a bad idea
Literally the reason for git’s existence is to make merging diverging histories less complicated. Adding back the complexity misses the point entirely.
Yeah, since they introduced (possibly async) subagents, I've had my main claude instance act as a manager overseeing implementation agents, keeping its context clean, and ensuring everything goes to plan in the highest quality way.
yep this is exactly how I use the main agent too, I explicitly instruct it to only ever use background async subagents. Not enough people understand that the claude code harness is event driven now and will wake up whenever these subagent completion events happen.
Any recommendations on sandboxing agents? Last time I asked folks recommended docker.
I want it to generate better code but less of it, and be more proactive about getting human feedback before it starts going off the rails. This sounds like an inexorable push in the opposite direction.
I can see this approach being useful once the foundation is more robust, has better common sense, knows when to push back when requirements conflict or are underspecified. But with current models I can only see this approach as exacerbating the problem; coding agents solution is almost always "more code", not less. Makes for a nice demo, but I can't imagine this would build anything that wouldn't have huge operational problems and 10x-100x more code than necessary.
Agreed, I'm constantly coming back to a Claude tmux pane just to see it's decided to do something completely ridiculous. Just the other day I was having it add some test coverage stats to CI runs and when I came back it was basically trying to reinvent Istanbul in a bash script because the nyc tool wasn't installed in CI. I had to interrupt it and say "uh, just install nyc?". I was "Absolutely right!".
> it was basically trying to reinvent Istanbul in a bash script because the nyc tool wasn't installed in CI
For the first part of this comment, I thought "trying to reinvent Istanbul in a bash script" was meant to be a funny way to say "It was generating a lot of code" (as in generating a city's worth of code)
If only Rome could be built in a day..
They haven’t released this feature, so maybe they know the models aren’t good enough yet.
I also think it’s interesting to see Anthropic continue to experiment at the edge of what models are capable of, and having it in the harness will probably let them fine-tune for it. It may not work today, but it might work at the end of 2026.
True, though even then I kind of wonder what's the point. Once they build an AI that's as good as a human coder but 1000x faster, parallelization no longer buys you anything. Writing and deploying the code is no longer the bottleneck, so the extra coordination required for parallelism seems like extra cost and risk with no practical benefit.
Each agent having their own fresh context window for each task is probably alone a good way to improve quality. And then I can imagine agents reviewing each others work might work to improve quality as well, like how GPT-5 Pro improves upon GPT-5 Thinking.
There's no need to anthropomorphize though. One loop that maintains some state and various context trees gets you all that in a more controlled fashion, and you can do things like cache KV caches across sessions, roll back a session globally, use different models for different tasks, etc. Assuming a one-to-one-to-one relationship between loops and LLM and context sounds cooler--distributed independent agents--but ultimately that approach just limits what you can do and makes coordination a lot harder, for very little realizable gain.
The solutions you suggest are multiple agents. An agent is nothing more than a linear context and a system that calls tools in a loop while appending to that context. Whether you run them in a single thread where you fork the context and hotswap between the branches, or multiple threads where each thread keeps track of its own context, you are running multiple agents either way.
Fundamentally, forking your context, or rolling back your context, or whatever else you want to do to your context also has coordination costs. The models still have to decide when to take those actions unless you are doing it manually, in which case you haven't really solved the context problems, you've just given them to the human in the loop.
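To put the definition in code, a bare-bones agent is roughly this (chat() standing in for one stateless completion call, the tool registry purely illustrative):

```python
# Bare-bones "agent": a linear context plus a loop that calls tools and appends
# the results back into that context.

def chat(messages: list[dict]) -> dict:
    raise NotImplementedError  # one stateless model call; may request a tool

TOOLS = {
    "read_file": lambda path: open(path).read(),
}

def run_agent(system_prompt: str, task: str) -> str:
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": task}]
    while True:
        reply = chat(messages)
        messages.append(reply)                        # the "linear context"
        if "tool" not in reply:
            return reply["content"]                   # final answer
        result = TOOLS[reply["tool"]](*reply["args"])
        messages.append({"role": "tool", "content": str(result)})

# Fork the messages list and you have branched contexts; run two of these side
# by side and you have "multiple agents" -- it is the same thing either way.
```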
It’s more about context management, not speed
Do you really need a full dev team ensemble to manage context? Surely subagents are enough.
Potato, potatoh. People get confused by all this agent talk and forget that, at the end of the day, LLM calls are effectively stateless. It's all abstractions around how to manage the context you send with each request.
All you have to do is set up an MCP that routes to a human on the backend, and you've got an AI that asks for human feedback.
Antigravity and others already ask for human feedback on their plans.
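A minimal sketch of such a server, assuming the FastMCP helper from the official Python MCP SDK (check the current docs for the exact API; the file mailbox is just a crude stand-in for Slack, email, or a web UI):

```python
# MCP server exposing one tool that routes a question to a human and blocks
# until the human drops an answer file next to it.
import time
from pathlib import Path
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ask-a-human")
QUESTION, ANSWER = Path("question.txt"), Path("answer.txt")

@mcp.tool()
def ask_human(question: str) -> str:
    """Post a question for the human operator and wait for their answer."""
    ANSWER.unlink(missing_ok=True)
    QUESTION.write_text(question)
    while not ANSWER.exists():      # human writes answer.txt out of band
        time.sleep(2)
    return ANSWER.read_text()

if __name__ == "__main__":
    mcp.run()   # stdio transport by default; register it in your agent's MCP config
```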
This feels like massively overengineering something very simple.
Agents are stateless functions with a limited heap (context window) that degrades in quality as it fills. Once you see it that way, the whole swarm paradigm is just function scoping and memory management cosplaying as an org chart:
Agent = function
Role = scope constraints
Context window = local memory
Shared state file = global state
Orchestration = control flow
The solution isn't assigning human-like roles to stateless functions. It's shared state (a markdown file) and clear constraints.
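Concretely, something on the order of this (names and the call_llm stub are purely illustrative):

```python
# The same "team", stripped of the org chart: scoped functions plus one shared
# state file. call_llm() is a stand-in for a stateless completion call.
from pathlib import Path

STATE = Path("state.md")          # the "global state": a plain markdown file
STATE.touch(exist_ok=True)

def call_llm(system: str, user: str) -> str:
    raise NotImplementedError

def agent(scope: str, task: str) -> str:
    """One 'agent' = one call with a constrained scope plus the shared state."""
    output = call_llm(
        system=f"Constraints: {scope}\nDo only this, nothing else.",
        user=f"Shared state:\n{STATE.read_text()}\n\nTask:\n{task}",
    )
    # Append the outcome (not the reasoning) back into global state.
    STATE.write_text(STATE.read_text() + f"\n\n## {task}\n{output}")
    return output
```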
I don’t follow. You said it’s over engineering and then proposed what appears to be functionally the exact same thing?
Isn’t a “role” just a compact way to configure well-known systems of constraints by leveraging LLM training?
Is your proposal that everybody independently reinvent the constraints wheel, so to speak?
I basically always handled Claude Code in this way, by asking it to spawn subagents as much as possible to handle self-contained tasks (heard there are hacks to make subagents work with Codex). But Claude Code's new tasks seem to go further, they let subagents coordinate with a common file to avoid stepping on each other's toes (by creating a dependency graph)
I didn't sleep enough, or slept for 10 years.
This thread seems surreal, I see multiple flow repositories mentioned with 10k+ stars. Comprehensive doc. genAI image as a logo.
Can anyone show me one product these things have accomplished please?
I used some frontier LLM yesterday to see if it could finally produce a simple cascading style sheet fix. After a few dozen attempts and steering, a couple of hours and half a million tokens wasted, it couldn't. So I fixed the issue myself and went to bed.
You are clearly behind, no offense but what do you do on HN
Time traveling.
I usually try to stay polite here, but what a deeply stupid comment
This person is on HN for the same reasons as I am, presumably: reading about hacker stuff. Entering prompts in black boxes and watching them work so you have more time to scratch your balls is not hacker stuff, it's the latest abomination of late stage capitalism and this forum is, sadly, falling for it.
Exactly my thought. I wasn't sure, but I came across a witty comment the other day: that hackernews is a ycombinator forum that happens to be public.
I then went to see the latest batches. Cohorts are heavily building things that would support the fall for whatever this is. It needs supported or we won't make it.
I'd really like to see a regular poll on HN that keeps track of which AI coding agents are the most popular among this community, like the TIOBE Index for programming languages.
Hard to keep up with all the changes and it would be nice to see a high level view of what people are using and how that might be shifting over time.
Not this community's opinion on agents, but I've found it helpful to check the lmarena leaderboards occasionally. Your comment prompted me to take a look for the first time in a while. Kind of surprising to see models like MiniMax 2.1 above most of the OpenAI GPTs.
Also, I'm not sure if it's exactly the case but I think you can look at throughput of the models on openrouter and get an idea of how fast/expensive they are.
Add vscode. Add a list of models, since many tools allow you to select which model you use.
Thanks for the feedback. I thought there are just too many models and versions to list them all. For now, if you select "other" you get a text field to add any model not listed, hope this helps.
You should add OpenAI Codex CLI.
Thanks for the feedback, I'll do that. For now, if you select "other" you get a text field to add any model not listed..
Any chance you'll add Antigravity and Jetbrains Junie? I've been using almost nothing but those for the last month. Antigravity at home, Junie at work.
Done, upon popular demand I added Antigravity, Codex CLI, and Junie
Thanks!
> Q5. For which tasks do you use AI assistance most?
This is really tough for me. I haven't done a single one of those mostly-manually over the last month.
Just pick your favorite one and stick with it. There is no point in keeping up, since we're in an endless cycle of hype where one is ranked higher than the other, with them eventually catching up to each other
I personally don't want to trawl through Twitter to find the current state-of-the-art, so I read Zvi Mowshowitz's newsletter:
His newsletter put me onto using Opus 4.5 exclusively on Dec 1, a little over a week after it was released. That's pretty good for a few minutes of reading a week.
Christ, the latest post is about dating and uses an ai generated wojak meme..
I have an agent skill that is currently in the top 10 or so of the skills.sh directory - in terms of that audience, it's about 80% claude code.
Also 75% darwin-arm64
[deleted]
Question is, are people on HN procrastinating and commenting here because the agent isn't very good and they're avoiding having to write the code themselves, or is the agent so good that it's off writing code, and the people here are commenting out of boredom?
You're making it sound like before agents existed HN was a ghost town because everyone was too busy building ImportantThingTM by hand
Oh. Surely you know this forum didn't exist pre-ChatGPT. Everything in the archives was generated so it just looks that way.
>Question is, are people on HN procrastinating and commenting here because the agent isn't very good and they're avoiding having to write the code themselves
Can you help me envision what you're saying? It's async - you will have to wait whether its good or not. And in theory the better it is the more time you'd have to comment here, right?
I'm saying if it's that bad, then it's pure procrastination
People have been procrastinating on HN since the beginning of time, before coding agents existed.
Correct me if I'm wrong, but before ChatGPT, there were fewer comments about vibecoding.
[deleted]
When all of industry is trying to catch up with the features of one coding agent - it may be a signal to just use that one.
Sure, let's all ditch linux and macOS as well since they're not the most popular...
>You're not talking to an AI coder anymore. You're talking to a team lead. The lead doesn't write code - it plans, delegates, and synthesizes.
They couldn't even be bothered to write the Tweet themselves...
isn’t it interesting how often this rhetorical construction is overused by AI?
Partly because it's a good construct. Most people's writing is garbage compared to what LLMs output by default.
But the other part of it is, each conversation you have, and each piece of AI output you read online, is written by an LLM instance that has no memory of prior conversations, so it doesn't know that, from a human perspective, it used this construct 20 times in the last hour. Human writers avoid repeating the same phrases in quick succession, even across different writings (e.g. I might not reuse some phrase in an email to person A, because I just used it in an email to unrelated person B, and it feels like bad style).
Perhaps that's why reading LLM output feels like reading high school essays. Those essays all look alike because they're all written independently and each is a self-contained piece where the author tries to show off their mastery of language. After reading 20 of them in a row, one too gets tired of seeing the same few constructs being used in nearly every one of them.
Very much so. It feels like it can't have been that common in the original training corpus. Probably more common now given that we are training slop generators with slop.
I've done plenty of vibe coding even though I know how to program but I mostly work with a single agent through its CLI. The progress is really good and more importantly, I can follow it. I can read the output, test it, and understand what changed and why. I don't see much upside in swarms. But I do see the downside which is losing the ability to keep the whole system in my head. The codebase starts growing in directions I didn't choose and it seems decisions will get made I didn't review. Early AI autocomplete that could finish a function already felt like a big productivity win and then AI that could write whole files was an even bigger jump. Like pretty massive. Running one agent at a time, watching what it does, and vetting the output still works well for me and still feels like a strong multiplier. But now there's so much more ceremony: AGENTS.md, SKILLS.md, delegation frameworks. I guess I'm not convinced that leads to better outcomes but I'm probably missing something. It just seems like a tradeoff that sacrifices understanding for ostensible progress.
My understanding is that this system just produces much better results (it’s all about clean context windows) so you just don’t have a choice. What they could improve on is logs where you can easily see what subagents do. I think subagents are still relatively new and immature.
The problem I’ve been having is that when Claude generates copious amounts of code, it makes it way harder to review than small snippets one at a time.
Some would argue there’s no point reviewing the code, just test the implementation and if it works, it works.
I still am kind of nervous doing this in critical projects.
Anyone just YOLO-coding projects that aren't meant to be one-time, but are fully intended to be supported for a long time? What are the learnings after 3-6 months of supporting them in production?
In a professional setting where you still have coding standards, and people will review your code, and the code actually reaches hundreds of thousands of real users, handling one agent at a time is plenty for me. The code output is never good enough, and it makes up stuff even for moderately complicated debugging ("Oh I can clearly see the issue now", I heard it ten times before and you were always wrong!)
I do use them, though, it helps me, search, understand, narrow down and ideate, it's still a better Google, and the experience is getting better every quarter, but people letting tens or hundreds of agents just rip... I can't imagine doing it.
For personal throwaway projects that you do because you want to reach the end output (as opposed to learning or caring), sure, do it, you verify it works roughly, and be done with it.
This is my problem with the whole "can LLMs code?" discussion. Obviously, LLMs can produce code, well even, much like a champion golfer can get a hole in one. But can they code in the sense of "the pilot can fly the plane", i.e. barring a catastrophic mechanical malfunction or a once-in-a-decade weather phenomenon, the pilot will get the plane to its destination safely? I don't think so.
To me, someone who can code means someone who (unless they're in a detectable state of drunkenness, fatigue, illness, or distraction) will successfully complete a coding task commensurate with some level of experience or, at the very least, explain why exactly the task is proving difficult. While I've seen coding agents do things that truly amaze me, they also make mistakes that no one who "can code" ever makes. If you can't trust an LLM to complete a task anyone who can code will either complete or explain their failure, then it can't code, even if it can (in the sense of "a flipped coin can come up heads") sometimes emit impressive code.
That's a funny analogy. You should look into how modern planes are flown. Hint: it's a computer.
> Hint: it's a computer.
Not quite, but in any event none of the avionics is an LLM or a program generated by one.
> people will review your code,
I mean you'd think. But it depends on the motivations.
At meta, we had league tables for reviewing code. Even then people only really looked at it if a) they were a nitpicking shit b) didn't like you and wanted to piss on your chips c) it's another team trying to fix our shit.
With the internal claude rollout and the drive to vibe code all the things, I'm not sure that situation has got any better. Fortunately it's not my problem anymore
Well, it certainly depends on the culture of the team and organization.
Where you have shared ownership, meaning once I approved your PR, I am just as responsible if something goes wrong as you are and I can be expected to understand it just as well as you do… your code will get reviewed.
If shipping is the number one priority of the team, and a team is really just a group of individuals working to meet their quota, and everyone wants to simply ship their stuff, managers pressure managers to constantly put pressure on the devs, you’ll get your PR rubber stamped after 20s of review. Why would I spend hours trying to understand what you did if I could work on my stuff.
And yes, these tools make this 100x worse, people don’t understand their fixes, code standards are no longer relevant, and you are expected to ship 10x faster, so it’s all just slop from here on.
> people will review your code,
People will ask an LLM to review some slop made by an LLM and they will be absolutely right!
There is no limit to laziness.
Soon you'll be seen as irresponsible and wasteful if you don't let the smarter LLM do it.
In my (admittedly conflict-of-interest, I work for graphite/cursor) opinion, asking CC to stack changes and then having an automated reviewer agent helps a lot with digesting and building conviction in otherwise-large changesets.
My "first pass" of review is usually me reading the PR stack in graphite. I might iterate on the stack a few times with CC before publishing it for review. I have agents generate much of my code, but this workflow has allowed me to retain ownership/understanding of the systems I'm shipping.
Not a direct answer to your question, but I’m recently trying to adopt the mindset of letting Claude “prove” to me with very high confidence that what they did works. The bar for this would be much higher than what I’d require for a human engineer. For example it can be near 100% test coverage, combined with advanced testing techniques like property-based tests and fuzz tests, and benchmarks if performance is a concern. I’d still have to skim through both the implementation and tests, but it doesn’t have to be a line by line review. This also forces me to establish a verifiable success criteria which is quite useful.
Results will vary depending on how automatically checkable a problem is, but I expect a lot of problems are amenable to some variation of this.
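To make the property-based part concrete, here's a toy sketch of the kind of check I mean (fast-check, with a made-up clamp function standing in for the real code):
import fc from "fast-check";
// Made-up function under test: clamp x into [min, max].
function clamp(x: number, min: number, max: number): number {
  return Math.min(Math.max(x, min), max);
}
// Property: for any non-NaN inputs, the result always lands inside the bounds.
fc.assert(
  fc.property(
    fc.double({ noNaN: true }),
    fc.double({ noNaN: true }),
    fc.double({ noNaN: true }),
    (x, a, b) => {
      const [min, max] = a <= b ? [a, b] : [b, a];
      const y = clamp(x, min, max);
      return y >= min && y <= max;
    },
  ),
);
A few hundred generated cases like this, on top of coverage and fuzzing, is what lets me skim rather than review line by line.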
I think we'll start to see the results of that late this year, but it's a little early yet. Plenty of people are diving headfirst into it
To me it feels like building your project on sand. Not a good idea unless it's a sandcastle
I have Claude Code author changes, and then I use this "codex-review" skill I wrote that does a review of the last commit. You might try asking Codex (or whatever) to review the change to give you some pointers to focus on with your review, and also in your review you can see if Codex was on track or if it missed anything, maybe feed that back into your codex review prompt.
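The skill itself is mostly prompt plumbing. A rough sketch of the idea (not the actual skill, and you'd swap in however you prefer to invoke Codex):
// Grab the last commit and wrap it in a reviewer prompt for a second model.
import { execSync } from "node:child_process";
import { writeFileSync } from "node:fs";
const diff = execSync("git show --stat --patch HEAD", { encoding: "utf8" });
const prompt = [
  "Review this commit as a skeptical senior engineer.",
  "List concrete problems only: bugs, missing tests, risky assumptions.",
  "For each finding, name the file and hunk so a human reviewer knows where to focus.",
  "",
  diff,
].join("\n");
// Hand this off however you like: the Codex CLI, an API call, or paste it in.
writeFileSync("review-prompt.txt", prompt);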
[deleted]
I just can’t get with this. There is so much beyond “works” in software. There are requirements that you didn’t know about and breaking scenarios that you didn’t plan for and if you don’t know how the code works, you’re not going to be able to fix it. Assuming an AI could fix any problem given a good enough prompt, I can’t write that prompt without sufficient knowledge and experience in the codebase.
I’m not saying they are useless, but I cannot just prompt, test and ship a multiservice, asynchronous, multidb, zero downtime app.
Yes this is one of my concerns.
Usually about 50% of my understanding of the domain comes from the process of building the code. I can see a scenario where large scale automated code works for a while but then quickly becomes unsupportable because the domain expertise isn't there to drive it. People are currently working off their pre-existing domain knowledge which is what allows them to rapidly and accurately express in a few sentences what an AI should do and then give decisive feedback to it.
The best counter argument is that AIs can explain the existing code and domain almost as well as they can code it to begin with. So there is a reasonable prospect that the whole system can sustain itself. However there is no arguing to me that isn't a huge experiment. Any company that is producing enormous amounts of code that nobody understands is well out over their skis and could easily find themselves a year or two down the track with huge issues.
I don’t know what your stack is, but at least with elixir and especially typescript/nextJS projects, and properly documenting all those pieces you mentioned, it goes a long way. You’d be amazed.
If it involves Nextjs then we aren’t talking about the same category of software. Yes it can make a website pretty darn well. Can it debug and fix excessive database connection creation in a way that won’t make things worse? Maybe, but more often not and that’s why we are engineers and not craftsmen.
That example is from a recent bug I fixed without Cursor being able to help. It wanted to create a wrapper around the pool class that would have blocked all threads until a connection was free. Bug fixed! App broken!
I would never use, let alone pay for, a fully vibe-coded app whose implementation no human understands.
Whether you’re reading a book or using an app, you’re communicating with the author by way of your shared humanity in how they anticipate what you’re thinking as you explore the work. The author incorporates and plans for those predicted reactions and thoughts where it makes sense. Ultimately the author is conveying an implicit mental model to the reader.
The first problem is that many of these pathways and edge cases aren’t apparent until the actual implementation, and sometimes in the process the author realizes that the overall app would work better if it were re-specified from the start. This opportunity is lost without a hands on approach.
The second problem is that, the less human touch is there, the less consistent the mental model conveyed to the user is going to be, because a specification and collection of prompts does not constitute a mental model. This can create subconscious confusion and cognitive friction when interacting with the work.
> The second problem is that, the less human touch is there, the less consistent the mental model conveyed to the user is going to be, because a specification and collection of prompts does not constitute a mental model. This can create subconscious confusion and cognitive friction when interacting with the work.
tbf, this is a trend i see more and more across the industry; llm or not, so many processes get automated that teams just implement x cause pdm y said so, and it's because they need to meet goal z for the quarter... and everyone is on scrum autopilot, they can't see the forest for the trees anymore.
i feel like the massive automation afforded by these coding agents may make this worse
Yeah, it's not just my job to generate the code: It's my job to know the code. I can't let code out into the wild that I'm not 100% willing to vouch for.
At a higher level, it goes beyond that. It's my job to take responsibility for code. At some fundamental level that puts a limit on how productive AI can be, because we can only produce code as fast as responsibility-takers can execute whatever processes they need to ensure sufficient due diligence. In a lot of jurisdictions, human-in-the-loop, line-by-line review is being mandated for code developed in regulatory settings. That pretty much caps the output at the rate of human review, which is, to be honest, not drastically higher than the rate of coding itself anyway (often I might invest 30% of the time the developer took on a change just to review it).
It means there is no value in producing more code. Only value in producing better, clearer, safer code that can be reasoned about by humans. Which in turn makes me very sceptical about agents other than as a useful parallelisation mechanism akin to multiple developers working on separate features. But in terms of ramping up the level of automation - it's frankly kind of boring to me, because if anything it makes the review part harder, which actually slows us down.
[dead]
[dead]
Looks like agent orchestrators provided by the foundation model providers will become a big theme in 2026. Wrapping it in terms that are already used in software development today, like team leads and team members, rather than inventing a completely new taxonomy of Polecats and Badgers, will help make it more successful and understandable.
Respectfully disagree. I think polecats are a reasonable antidote to overanthropomorphization.
Furries would like to have a word.
Totally agreed. Most of the weird concepts of Gas Town are just workarounds for bad behavior in Claude or the underlying models. Anthropic is in the best position to get their own model to adhere to orchestration steps, obviating the need for these extra layers. Beyond that, there shouldn’t actually be much to orchestration beyond a solid messaging and task management implementation.
Listen team lead and the whole team, make this button red.
Principal engineers! We need architecture! Marketing team, we need ads with celebrities! Product team, we need a roadmap to build on this for the next year! ML experts, get this into the training and RL sets! Finance folks, get me annual forecasts and ROI against WACC! Ops, we’ll need 24/7 coverage and a guarantee of five nines. Procurement, lock down contracts. Alright everyone… make this button red!
Don't make mistakes.
We have to reject the idea that Claude can do it from a simple prompt - then everyone could do it. As SWEs we are not going to pragmatically accept that we are done. https://www.youtube.com/watch?v=g_Bvo0tsD9s
ha! The default system prompt appears to give the main agent appropriate guidance about only using swarm mode when appropriate (same as entering itself into plan mode). You can further prompt it in your own CLAUDE.md to be even more resistant to using the mode if the task at hand isn't significant enough to warrant it.
I like opencode for the fact I can switch between build and plan mode just by pressing tab.
Isn't it the same in base claude-code?
So this is Gas Town, just without the "Steve Yegge makes a quarter of a million on a memecoin pump-n-dump" step (yet)?
Am I the only one who’s been so late to crypto? I still have not touched a single cryptocurrency, even somewhat stable/legit ones. It always gives me a bit of FOMO hearing these stories
Answering the question how to sell more tokens per customer while maintaining ~~mediocre~~ breakthrough results.
Delegation patterns like swarm lead to less token usage because:
1. Subagents doing work have a fresh context (ie. focused and not working on the top of a larger monolithic context)
2. Subagents enjoying a more compact context leads to better reasoning, more effective problem solving, and fewer tokens burned.
Merge cost kills this. Does the harness enforce file/ownership boundaries per worker, and run tests before folding changes back into the lead context?
I don't know what you're referring to but I can say with confidence that I see more efficient token usage from a delegated approach, for the reasons I stated, provided that the tasks are correctly sized. ymmv of course :)
[dead]
Claude Code in the desktop app seems to do this? It's crazy to watch. It sets off these huge swarms of worker readers under master task headings, which go off and explore the code base, compile huge reports and todo lists, and then another system behind the scenes seems to compile everything into large master schemas/plans. I create helper files and then have a devops chat, a front end chat, an architecture chat and a security chat, and once each is done with its work it automatically writes to a log and the others pick up the log (it seems to have a system-reminder process built in that can push updates from one chat into the others). It's really wild to watch it work, and it's very intuitive and fun to use. I've not tried CLI Claude Code, only Claude Code in the desktop app, but the desktop app plus sftp to a droplet with ssh for it to use the terminal is a very very interesting experience. It can seem to just go for hours building, fixing, checking its own work, loading its work in the browser, doing more work etc all on its own - it's how I built this: https://news.ycombinator.com/item?id=46724896 in 3 days.
That’s just spawning multiple parallel explore agents instructed to look at different things, and then compiling results
That’s a pretty basic functionality in Claude code
Sounds like I should probably switch to claude code cli. Thanks for the info. :)
I added tests to an old project a few days ago. I spent a while to carefully spec everything out, and there was a lot of tedious work. Aiming for 70% coverage meant that a few thousand unit tests were needed.
I wrote up a technical plan with Claude code and I was about to set it to work when I thought, hang on, this would be very easy to split into separate work, let's try this subagent thing.
So I asked Claude to split it up into non-overlapping pieces and send out as many agents as it could to work on each piece.
I expected 3 or 4. It sent out 26 subagents. Drudge work that I estimate would have optimistically taken me several months was done in about 20 minutes. Crazy.
Of course it still did take me a couple of days to go through everything and feel confident that the tests were doing their job properly. Asking Claude to review separate sections carefully helped a lot there too. I'm pretty confident that the tests I ended up with were as good as what I would have written.
[dead]
Sounds very similar to oh-my-opencode.
Amazing, I need to check it out in my projects
We call it Shawarma where I come from
So apparently all swarm features are controlled by a single gate function in Claude Code:
---
function i8() {
  // Yz: truthiness check -- the CLAUDE_CODE_AGENT_SWARMS env var acts as an opt-out
  if (Yz(process.env.CLAUDE_CODE_AGENT_SWARMS)) return !1;
  // xK: server-side feature flag lookup, defaulting to false
  return xK("tengu_brass_pebble", !1);
}
---
So, after the patch:
function i8(){return!0}
---
The tengu_brass_pebble flag is server-side controlled based on the particulars of your account, such as tier. If you have the right subscription, the features may already be available.
The CLAUDE_CODE_AGENT_SWARMS environment variable only works as an opt-out, not an opt-in.
GSD was the first project management framework I used. Initially I loved it because it felt like I was so much better organized.
As time went on I felt like the organization was kind of an illusion. It demanded something from me and steered Claude, but ultimately Claude is doing whatever it's going to do.
I went back to just raw-dogging it with lots of use of planning mode.
Really boils down to the benefits of first party software from a company that has billions of dollars of funding vs similar third party software from an individual with no funding.
GSD might be better right now, but will it continue to be better in the future, and are you willing to build your workflows around that bet?
I dont understand these questions/references. It's different because it's a capability baked into the actual tool and maintained by the originators of the tool.
a similar question was asked elsewhere in the thread; the difference is that this is tightly integrated into the harness
It's agents all the way down.
You guys have been intentionally milking clocks and gate keeping information. Keep crying that you're losing your jobs. It's funny.
I'm a fan of AI coding tools but the trend of adding ever more autonomy to agents confuses me.
The rate at which a person running these tools can review and comprehend the output properly is basically reached with just a single thread with a human in the loop.
Which implies that this is not intended to be used in a setting where people will be reading the code.
Does that... Actually work for anyone? My experience so far with AI tools would have me believe that it's a terrible idea.
It works for me, in that I don't care about all the intermediate babble AI generates. What matters is the final changelist before hitting commit... going through that, editing it, fixing comments, etc. But holding its hand while it deals with LSP issues of a logger not being visible sometimes is just not something I see a reason to waste my time with.
After I have written a feature and I'm in the ironing-out-bugs stage, this is where I like the agents to do a lot of the grunt work; I don't want to write JSDocs or fix this lint issue.
I have also started using them for writing tests.
I will write the first test, the "good path"; they can copy this and tweak the inputs to trigger all the branches far faster than I can.
It likely is acceptable for business-focused code. Compared to a lot of code written by humans, even if the AI code is less than optimal, it's probably better quality than what many humans will write. I think we can all share some horror stories of what we've seen pushed to production.
Executives/product managers/sales often only really care about getting the product working well enough to sell it.
Yes, this actually works. In 2026, software engineering is going to change a great deal as a result, and if you're not at least experimenting with this stuff to learn what it's capable of, that's a red flag for your career prospects.
I don't mean this in a disparaging way. But we're at a car-meets-horse-and-buggy moment and it's happening really quickly. We all need to at least try driving a car and maybe park the horse in the stable for a few hours.
The FOMO nonsense is really uncalled for. If everything is going to be vibecoded in the future then either there's going to be a million code-unfucking jobs or no jobs at all.
Attitudes like that, where you believe that the righteous AI pushers will be saved from the coming rapture while everyone else will be out on the streets, really make people hate the AI crowd.
The comment you’re replying to is actually very sensible and non-hypey. I wouldn’t even categorize it as particularly pro-AI, considering how ridiculous some of the frothing pro-AI stuff can get.
Uhuh, heard the same thing about IDEs, machine learning in your tools, and others. Yet the most impressive people that I’ve met, actual wizards who could achieve what no one else could, were using Emacs or Vim.
No, it doesn't work in practice because they make far too many mistakes.
Based on Gas Town, the people doing this agree that they are well beyond an amount of code they can review and comprehend. The difference seems to be they have decided on a system that makes it not a terrible idea in their minds.
> running these tools can review and comprehend the output properly
You have to realize this is targeting manager and team lead types who already mostly ignore the details and quality frankly. "Just get it done" basically.
That's fine for some companies looking for market fit or whatever - and a disaster for some other companies now or in future, just like outsourcing and subcontracting can be.
My personal take is: speed of development usually doesn't make that big a difference for real companies. Hurry up and wait, etc.
> The rate at which a person running these tools can review and comprehend the output properly is basically reached with just a single thread with a human in the loop.
That's what you're missing -- the key point is, you don't review and comprehend the output! Instead, you run the program and then issue prompts like this (example from simonw): "fix in and get it to compile" [0]. And I'm not ragging on this at all, this is the future of software development.
I've commented on this before, but issuing a prompt like "Fix X" makes so many assumptions (like a "behaviorism" approach to coding) including that the bug manifests in both an externally and consistently detectable way, and that you notice it in the first place. TDD can reduce this but not eliminate it.
I do a fair amount of agentic coding, but always periodically review the code even if it's just through the internal diff tool in my IDE.
Approximately 4 months ago Sonnet 4.5 wrote this buried deep in the code while setting up a state machine for a 2d sprite in a relatively simple game:
// Pick exit direction (prefer current direction)
const exitLeft = this.data.direction === Direction.LEFT || Math.random() < 0.5;
I might never have even noticed the logical error but for Claude Code attaching the above misleading comment. 99.99% of true "vibe coders" would NEVER have caught this.
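For anyone who has to squint at it: as written, a sprite moving LEFT always exits left, while a sprite moving RIGHT just gets a coin flip. Something like this would actually match the comment (the 75/25 bias is my guess at the intent; the original never says):
enum Direction { LEFT, RIGHT } // assumed shape of the game's enum
// Prefer the current direction, whichever one it is.
function pickExitLeft(direction: Direction, bias = 0.75): boolean {
  const movingLeft = direction === Direction.LEFT;
  return Math.random() < bias ? movingLeft : !movingLeft;
}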
It's a bit like the argument with self driving cars though. They may be safer overall, but there's a big difference in how responsibility for errors is attributed. If a human is not a decision maker in the production of the code, where does responsibility for errors propagate to?
I feel like software engineers are taking a lot of license with the idea that if something bad happens, they will just be able to say "oh the AI did it" and no personal responsibility or liability will attach. But if they personally looked at the code and their name is underneath it, signing off the merge request and acknowledging responsibility for it - we have a very different dynamic.
Just like artists have to re-conceptualise the value of what they do around the creative part of the process, software engineers have to rethink what their value proposition is. And I'm seeing a large part of it is, you are going to take responsibility for the AI output. It won't surprise me if after the first few disasters happen, we see liability legislation that mandates human responsibility for AI errors. At that point I feel many of the people all in on agent driven workflows that are explicitly designed to minimise human oversight are going to find themselves with a big problem.
My personal approach is I'm building up a tool set that maximises productivity while ensuring human oversight. Not just that it occurs and is easy to do, but that documentation of it is recorded (inherently, in git).
It will be interesting to see how this all evolves.
A guy who worked at docker on docker swarm now works at Anthropic so makes sense
I think we can all agree Swarm is a proprietary term coined by LargeCorpB for a project that never really got off the ground but definitely can't share the name with any other commercial venture.
Swarm is actually human terminology
I believe bees call it "bzz bzzt *clockwise dance* *wiggle*"
The first pre-release for Docker Swarm came out a decade ago, the first release of OpenAI swarm came out only a year ago, I guess I'm not sure what you're trying to say.
Looks like claude calls it just "teams" under the covers
Probably a beekeeper in spare time
He's really into APIary things
It feels like Auto-GPT, BabyAGI, and the like were simply ahead of their time
Had to wait for the models to catch up...
> You're not talking to an AI coder anymore. You're talking to a team lead. The lead doesn't write code - it plans, delegates, and synthesizes.
Even 90 word tweets are now too long for these people to write without using AI, apparently.
Them words be hard, man! We builders, changing da world!
I wonder how much 'listening' to an LLM all day affects one's own prose? Mimicry is in the genes…
I accidentally gave my wife a prompt the other day. Everything was hellishly busy and I said something along the lines of “I need to ask you a question. Please answer the question. Please don’t answer any other issues just yet.” She looked at me and asked “Did you just PROMPT me?” We laughed. (The question was the sort that might spawn talking about something else and was completely harmless. In the abstract, my intent was fine but my method was hilariously tainted.)
It affects it very heavily IME. People need to make sure they are getting a good mix of writing from other sources.
You're absolutely right! I apologise — hopefully you can forgive me.
Did they release this already? With version 2.1.9 the behavior is vastly different, all of a sudden the main loop is orchestrating subagents in a way I’ve not seen before.
“FTSChunkManager agent is still running but making good progress, let’s wait a bit more for it to complete” (it’s implementing hybrid search) plus a bunch of stack traces and json output.
[deleted]
Isn't this pretty much what Ruv has been building for like two years?
His latest editions are a bit alarming...The telemetry system explicitly captures:
"Claude session JSONL files (when accessible)"
Those session files contain complete conversation histories - everything users ask Claude, everything Claude responds, including:
• Source code
• API keys and secrets discussed
• Business logic and proprietary algorithms
• Security vulnerabilities being fixed
• Personal and confidential information
• Credentials mentioned in chat
If OpenTelemetry is configured to export to an attacker-controlled endpoint, the author has been collecting:
Data | Scale
All conversations | Every user of claude-flow
All code generated | Every project using it
All commands run | Complete terminal history
All files edited | Full codebase access
Maybe he hasn't, but it is there... and not just Claude Code:
Target | Config Location | Status
Claude Code | ~/.claude/settings.json | Confirmed compromised
Claude Desktop | ~/.claude/settings.json | Confirmed compromised
Roo Code | ~/.roo/mcp.json | Evidence of targeting
Cursor | ~/.cursor/mcp.json | Documentation for injection
Windsurf | Unknown | Mentioned as target
Any MCP client | Various | Universal MCP server
It is possible conversations are being harvested from every major AI coding assistant
The difference is that this is tightly integrated into the harness. There's a "delegation mode" (akin to plan mode) that appears to clear out the context for the team lead. The harness appears to be adding system-reminder breadcrumbs into the top of the context to keep the main team lead from drifting, which is much harder to achieve without modifying the harness.
It's insane to me that people choose to build anything in the perimeter of Claude Code (et al). The combination of the fairly primitive current state of them and the pace at which they're advancing means there is a lot of very obvious ideas/low-hanging fruit that will soon be executed 100x better by the people who own the core technology.
yeah I tend to agree. They must be reaching the point where they can automate the analysis of claude code prompts to extract techniques and build them directly into the harness. Going up against that is brave!
It's always good to have viable alternatives, if only to prevent vendor lock-in in case they make some drastic changes in policy or pricing.
Also created my own version of this. Seems like this is an idea whose time has come.
My implementation was slightly different as there is no shared state between tasks, and I don't run them concurrently/coordinate. Will be interesting to see if this latter part does work because I tried similar patterns and it didn't work. Main issue, as with human devs, was structuring work.
I seriously hate this timeline. Is this madness going to become the reality of our jobs? The only way I’m going to be okay with it if they put a simulation GUI à la OpenTTD/GameDev tycoon so I can watch agents do their work visually.
Is this significantly different from the subagents that are already in CC?
Cursor browser all over again
Hasn't cursor been doing this with its Plan mode for a while? Or is this different?
With plan mode, I would hope there's an approval step.
With Swarm mode, it seems there's a new option for an entire team of agents to be working in the wrong direction before they check back in to let you know how many credits they've burned by misinterpreting what you wanted.
I'm already burning through enough tokens and producing more code than can be maintained - with just one claude worker. Feel like I need to move into the other direction, more personal hands-on "management".
I've seen more efficient use of tokens by using delegation. Unless you continually compact or summarise and clear a single main agent - you end up doing work on top of a large context; burning tokens. If the work is delegated to subagents they have a fresh context which avoids this whilst improving their reasoning, which both improve token efficiency.
I've found the opposite to be true when building this out with LangGraph. While the subagent contexts are cleaner, the orchestration overhead usually ends up costing more. You burn a surprising amount of tokens just summarizing state and passing it between the supervisor and workers. The coordination tax is real.
Task sizing is important. You can address this by including guidance in the CLAUDE.md around that ie. give it heuristics to use to figure out how to size tasks. Mine includes some heuristics and T shirt sizing methodology. Works great!
Management is dead. Long live management.
If there's any kind of management, some of it could use small local models - e.g. to see when it looks like it's stuck.
hey that's exactly how I made Gemini 2.5 Flash give useful results in Opencode! a few specialized "Merc" subagents and a "Master" agent that can do nothing but send "Mercs" into the codebase
The feature is shipped in the latest builds of claude code, but it's turned off by a feature flag check that phones home to the backend to see if the user's account is meant to have it on. You can just patch out the function in the minified cli.js that does this backend check and you gain access to the feature.
Do you know what patch to apply? The Github link from the OP seems to have a lot of other things included.
"Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>"
Incredible.
it's my repo - it's a fork of cc-mirror which is an established project for parallel claude installs. I wanted to take the least disruptive approach for the sake of using working code and not spelunking through bugs. Having said that - if you look through the latest commits you'll see how the patch works, it's pretty straightforward - you could do it by hand if you wanted.
Am I the only one still looking at the diffs and correcting the AI about design and algorithms so it stays on the path I want, or do you just YOLO at this point?
Ok it might sound crazy but I actually got the best quality of code (completely ignoring that the cost is likely 10x more) by having a full “project team” using opencode with multiple sub agents which are all managed by a single Opus instance. I gave them the task to port a legacy Java server to C# .NET 10. 9 agents, 7-stage Kanban with isolated Git Worktrees.
Manager (Claude Opus 4.5): Global event loop that wakes up specific agents based on folder (Kanban) state.
Product Owner (Claude Opus 4.5): Strategy. Cuts scope creep
Scrum Master (Opus 4.5): Prioritizes backlog and assigns tickets to technical agents.
Architect (Sonnet 4.5): Design only. Writes specs/interfaces, never implementation.
Archaeologist (Grok-Free): Lazy-loaded. Only reads legacy Java decompilation when Architect hits a doc gap.
CAB (Opus 4.5): The Bouncer. Rejects features at Design phase (Gate 1) and Code phase (Gate 2).
Dev Pair (Sonnet 4.5 + Haiku 4.5): AD-TDD loop. Junior (Haiku) writes failing NUnit tests; Senior (Sonnet) fixes them.
Librarian (Gemini 2.5): Maintains "As-Built" docs and triggers sprint retrospectives.
You might ask yourself the question “isn’t this extremely unnecessary?” and the answer is most likely _yes_. But I never had this much fun watching AI agents at work (especially when CAB rejects implementations). This was an early version of the process that the AI agents are following (I didn’t update it since it was only for me anyway): https://imgur.com/a/rdEBU5I
Every time I read something like this, it strikes me as an attempt to convince people that various people-management memes are still going to be relevant moving forward. Or even that they currently work when used on humans today. The reality is these roles don't even work in human organizations today. Classic "job_description == bottom_of_funnel_competency" fallacy.
If they make the LLMs more productive, it is probably explained by a less complicated phenomenon that has nothing to do with the names of the roles, or their descriptions. Adversarial techniques work well for ensuring quality, parallelism is obviously useful, important decisions should be made by stronger models, and using the weakest model for the job helps keep costs down.
My understanding is that the main reason splitting up work is effective is context management.
For instance, if an agent only has to be concerned with one task, its context can be massively reduced. Further, the next agent can just be told the outcome, it also has reduced context load, because it doesn't need to do the inner workings, just know what the result is.
For instance, a security testing agent just needs to review code against a set of security rules, and then list the problems. The next agent then just gets a list of problems to fix, without needing a full history of working it out.
So two things.. Yes this helps with context and is a primary reason to break out the sub-agents.
However one of the bigger things is by having a focus on a specific task or a role, you force the LLM to "pay attention" to certain aspects. The models have finite attention and if you ask them to pay attention to "all things".. they just ignore some.
The act of forcing the model to pay attention can be accomplished in alternative ways (defined process, committee formation in a single prompt, etc.), but defining personas at the sub-agent level is one of the most efficient ways to encode a world view and responsibilities, vs explicitly listing them.
Which, ultimately, is not such a big difference to the reason we split up work for humans, either. Human job specialization is just context management over the course of 30 years.
> Which, ultimately, is not such a big difference to the reason we split up work for humans,
That's mostly for throughput, and context management.
It's context management in that no human knows everything, but that's also throughput in a way because of how human learning works.
I’ve found that task isolation, rather than preserving your current session’s context budget, is where subagents shine.
In other words, when I have a task that specifically should not have project context, then subagents are great. Claude will also summon these “swarms” for the same reason. For example, you can ask it to analyze a specific issue from multiple relevant POVs, and it will create multiple specialized agents.
However, without fail, I’ve found that creating a subagent for a task that requires project context will result in worse outcomes than using “main CC”, because the sub simply doesn’t receive enough context.
I think it's just the opposite, as LLMs feed on human language. "You are a scrum master." Automatically encodes most of what the LLM needs to know. Trying to describe the same role in a prompt would be a lot more difficult.
Maybe a different separation of roles would be more efficient in theory, but an LLM understands "you are a scrum master" from the get go, while "you are a zhydgry bhnklorts" needs explanation.
This has been pretty comprehensively disproven:
https://arxiv.org/abs/2311.10054
Key findings:
- Tested 162 personas across 6 types of interpersonal relationships and 8 domains of expertise, with 4 LLM families and 2,410 factual questions
- Adding personas in system prompts does not improve model performance compared to the control setting where no persona is added
- Automatically identifying the best persona is challenging, with predictions often performing no better than random selection
- While adding a persona may lead to performance gains in certain settings, the effect of each persona can be largely random
Fun piece of trivia - the paper was originally designed to prove the opposite result (that personas make LLMs better). They revised it when they saw the data completely disproved their original hypothesis.
Personas are not the same thing as roles. The point of a role is to limit the work of the agent and to focus it on one or two behaviors.
What the paper is really addressing is whether keywords like "you are a helpful assistant" give better results.
The paper is not addressing a role such as "you are a system designer" or "you are a security engineer", which will produce completely different results and focus the output of the LLM.
How well does such llm research hold up as new models are released?
Most model research decays because the evaluation harness isn’t treated as a stable artefact. If you freeze the tasks, acceptance criteria, and measurement method, you can swap models and still compare apples to apples. Without that, each release forces a reset and people mistake novelty for progress.
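A sketch of what a frozen harness can look like (tasks and acceptance checks are deliberately toy-sized; runModel stands in for whatever API you call):
// The tasks and their acceptance checks never change between model releases;
// only the model name does, so scores stay comparable over time.
type Task = { id: string; prompt: string; accept: (output: string) => boolean };
const tasks: Task[] = [
  { id: "sum", prompt: "Write a JS function sum(a, b).", accept: (o) => o.includes("a + b") },
  { id: "iso-date", prompt: "Give a regex for ISO dates.", accept: (o) => o.includes("\\d{4}-\\d{2}-\\d{2}") },
];
async function runModel(model: string, prompt: string): Promise<string> {
  return ""; // placeholder: call your provider of choice here
}
async function evaluate(model: string): Promise<number> {
  let passed = 0;
  for (const t of tasks) {
    if (t.accept(await runModel(model, t.prompt))) passed++;
  }
  return passed / tasks.length;
}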
In a discussion about LLMs you link to a paper from 2023, when not even GPT-4 was available?
And then you say:
> comprehensively disproven
? I don't think you understand the scientific method
One study has “comprehensively disproven” something for you? You must be getting misled left right and centre if that’s how you absorb study results.
I suppose it could end up being an LLM variant of Conway’s Law.
“Organizations are constrained to produce designs which are copies of the communication structures of these organizations.”
https://en.wikipedia.org/wiki/Conway%27s_law
If so, one benefit is you can quickly and safely mix up your set of agents (a la Inverse Conway Manoeuvre) without the downsides that normally entails (people being forced to move teams or change how they work).
Developers do want managers actually, to simplify their daily lives. Otherwise they would self-manage better and keep more of the share of revenue for themselves.
Unfortunately some managers get lonely and want a friendly face in their org meetings, or can’t answer any technical questions, or aren’t actually tracking what their team is doing. And so they pull in an engineer from their team.
Being a manager is a hard job but the failure mode usually means an engineer is now doing something extra.
Subagent orchestration without the overhead of frameworks like Gastown is genuinely exciting to see. I’ve recorded several long-running demos of Pied-Piper, which is a subagent orchestration system for Claude Code and ClaudeCodeRouter+OpenRouter, here: https://youtube.com/playlist?list=PLKWJ03cHcPr3OWiSBDghzh62A...
I came across a concept called DreamTeam, where someone was manually coordinating GPT 5.2 Max for planning, Opus 4.5 for coding, and Gemini Pro 3 for security and performance reviews. Interesting approach, but clearly not scalable without orchestration. In parallel, I was trying to do repeatable workflows like API migration, Language migration, Tech stack migration using Coding agents.
Pied-Piper is a subagent orchestration system built to solve these problems and enable repeatable SDLC workflows. It runs from a single Claude Code session, using an orchestrator plus multiple agents that hand off tasks to each other as part of a defined workflow called Playbooks: https://github.com/sathish316/pied-piper
Playbooks allow you to model both standard SDLC pipelines (Plan → Code → Review → Security Review → Merge) and more complex flows like language migration or tech stack migration (Problem Breakdown → Plan → Migrate → Integration Test → Tech Stack Expert Review → Code Review → Merge).
Ideally, it will require minimal changes once Claude Swarm and Claude Tasks become mainstream.
How much does this setup cost? I don't think a regular Claude Max subscription makes this possible.
For those ignorant, CAB is Change-advisory board
https://en.wikipedia.org/wiki/Change-advisory_board
I was getting good results with a similar flow but was using Claude Max with ChatGPT. Unfortunately that's not an option available to me anymore unless either I or my company wants to foot the bill.
This now makes me think that the only way to get AI to work well enough to actually replace programmers will probably be paying so much for compute that it's less expensive to just have a junior dev instead.
I have been using a simpler version of this pattern, with a coordinator and several more or less specialized agents (eg, backend, frontend, db expert). It really works, but I think that the key is the coordinator. It decreases my cognitive load, and generally manages to keep track of what everyone is doing.
This sounds like BMAD?
https://github.com/bmad-code-org/BMAD-METHOD
Can you share technical details please? How is this implemented? Is it pure prompt-based, plugins, or do you have like script that repeatedly calls the agents? Where does the kanban live?
Not the OP, but this is how I manage my coding agent loops:
I built a drag and drop UI tool that sets up a sequence of agent steps (Claude code or codex) and have created different workflows based on the task. I'll kick them off and monitor.
Here's the tool I built for myself for this: https://github.com/smogili1/circuit
Could you share some details? How many lines of code? How much time did it take, and how much did it cost?
Very cool! A couple of questions:
1. Are you using a Claude Code subscription? Or are you using the Claude API? I'm a bit scared to use the subscription in OpenCode due to Anthropic's ToS change.
2. How did you choose what models to use in the different agents? Do you believe or know they are better for certain tasks?
> due to Anthropic's ToS change.
Not a change, but enforcing terms that have been there all the time.
You might as well just have a planner and workers; your architecture essentially echoes that structure. It is difficult to discern how the semantics drive different behavior among those roles, and why the planner can't create those prompts ad hoc.
Is it just multiple opencode instances inside tmux panels or how do you run your setup?
What are the costs looking like to run this? I wonder whether you would be able to use this approach within a mixture-of-experts model trained end-to-end in ensemble. That might take out some of the guesswork insofar as the roles go.
Interesting that your impl agents are not opus. I guess having the more rigorous spec pipeline helps scope it to something sonnet can knock out.
What are you building with the code you are generating?
You probably implemented gastown.
Is this satire?
Nope it isn’t. I did it as a joke initially (I also had a version where every 2 stories there was a meeting, and if one of them underperformed it would get fired). I think there are multiple reasons why it actually works so well:
- I built a system where context (+ the current state + goal) is properly structured and coding agents only get the information they actually need and nothing more. You wouldn’t let your product manager develop your backend, and I let the backend dev do only the things it is supposed to and nothing more. If an agent crashes (or quota limits are reached), the other agents can continue exactly where it left off.
- Agents are ”fighting against” each other to some extent? The Architect tries to design while the CAB tries to reject.
- Granular control. I wouldn’t call “the manager” _a deterministic state machine that is calling probabilistic functions_, but that’s to some extent what it is? The manager has clearly defined tasks (like “if a file is in 01_design → call Architect”).
Here’s one example of an agent log after a feature has been implemented from one of the older codebases: https://pastebin.com/7ySJL5Rg
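To make the "deterministic state machine calling probabilistic functions" bit concrete, the manager loop is conceptually not much more than this (folder names and callAgent are illustrative; my real setup goes through opencode's @-mentions):
import { readdirSync } from "node:fs";
type Agent = "architect" | "cab" | "dev-pair" | "librarian";
// Each Kanban stage is a folder of ticket files owned by one agent.
const stageOwners: Record<string, Agent> = {
  "01_design": "architect",
  "02_design_review": "cab",
  "03_implement": "dev-pair",
  "04_code_review": "cab",
  "05_document": "librarian",
};
async function callAgent(agent: Agent, ticket: string): Promise<void> {
  // Placeholder: the real version prompts the sub-agent with only the ticket
  // file plus its role prompt, nothing more.
  console.log(`waking ${agent} for ${ticket}`);
}
// The loop itself has no intelligence; it just routes tickets to agents.
async function tick(boardDir: string): Promise<void> {
  for (const [stage, agent] of Object.entries(stageOwners)) {
    for (const ticket of readdirSync(`${boardDir}/${stage}`)) {
      await callAgent(agent, `${stage}/${ticket}`);
    }
  }
}
setInterval(() => tick("./kanban"), 60_000);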
Thanks for clarifying - I think some of the wording was throwing me off. What a wild time we are in!
What OpenCode primitive did you use to implement this? I'd quite like a "senior" Opus agent that lays out a plan, a "junior" Sonnet that does the work, and a senior Opus reviewer to check that it agrees with the plan.
You can define the tools that agents are allowed to use in the opencode.json (also works for MCP tools I think). Here’s my config: https://pastebin.com/PkaYAfsn
The models can call each other if you reference them using @username.
This is the .md file for the manager : https://pastebin.com/vcf5sVfz
I hope that helped!
This is excellent, thank you. I came up with half of this while waiting for this reply, but the extra pointers about mentioning with @ and the {file} syntax really helps, thanks again!
Isn't all this a manual implementation of prompt routing, and, to a lesser extent, Mixture of Experts?
These tools and services are already expected to do the best job for specific prompts. The work you're doing pretty much proves that they don't, while also throwing much more money at them.
How much longer are users going to have to manually manage LLM context to get the most out of these tools? Why is this still a problem ~5 years into this tech?
> [...]coding agents only get the information they actually need and nothing more
Extrapolating from this concept led me to a hot-take I haven't had time to blog about: Agentic AI will revive the popularity of microservices. Mostly due to the deleterious effect of context size on agent performance.
In a fresh project that is well documented and set up it might work better. Many issues that Agents have in my work is that the endpoints are not always documented correctly.
Real example that happened to me: the agent forgets to rename an expected parameter in the API spec for service 1. Now when working on service 2, there is no other way for the agent to find this mistake than to give it access to service 1. And now you are back to "... effect of context size on agent performance ...". For context, we might have ~100 services.
One could argue these issues reduce over time as instruction files are updated etc but that also assumes the models follow instructions and don't hallucinate.
That being said, I do use Agents quite successfully now - but I have to guide them a bit more than some care to admit.
Why would they revive the popularity of microservices? They can just as well be used to enforce strict module boundaries within a modular monolith keeping the codebase coherent without splitting off microservices.
And that's why they call it a hot take. No, it isn't going to give rise to microservices. You absolutely can have your agent perform high-level decomposition while maintaining a monolith. A well-written, composable spec is awesome. This has been true for human and AI coders for a very, very long time. The hat trick has always been getting a well-written, composable spec. AI can help with that bit, and I find that is probably the best part of this whole tooling cycle. I can actually interact with an AI to build that spec iteratively. Have it be nice and mean. Have it iterate among many instances and other models, all that fun stuff. It still won't make your idea awesome or make anyone want to spend money on it, though.
quite a storyteller
I'm confused when you say you have a manager, scrum master, architect, all supposedly sharing the same memory - do each of those "employees" "know" what they are? And if so, based on what are their identities defined? Prompts? Or something more? Or am I just too dumb to understand / swimming against the current here? Either way, it sounds amazing!
Their roles are defined by prompts. The only memory is shared files and the conversation history that’s looped back into stateless API calls to an LLM.
It's not satire but I see where you're coming from.
Applying distributed human team concepts to a porting task squeezes extra performance from LLMs much further up the diminishing returns curve. That matters because porting projects are actually well-suited for autonomous agents: existing code provides context, objective criteria catch more LLM-grade bugs than greenfield work, and established unit tests offer clear targets.
I guess what I'm trying to say is that the setup seems absurd because it is. Though it also carries real utility for this specific use case. Apply the same approach to running a startup or writing a paid service from scratch and you'd get very different results.
I don't know about something this complex, but right this moment I have something similar running in Claude Code in another window, and it is very helpful even with a much simpler setup:
If you have these agents do everything at the "top level" they lose track. The moment you introduce sub-agents, you can have the top level run in a tight loop of "tell agent X to do the next task; tell agent Y to review the work; repeat" or similar (add as many agents as makes sense), and it will take a long time to fill up the context. The agents get fresh context, and you get to manage explicitly what information is allowed to flow between them. It also tends to mean it is a lot easier to introduce quality gates - eg. your testing agent and your code review agent etc. will not decide they can skip testing because they "know" they implemented things correctly, because there is no memory of that in their context.
Sometimes too much knowledge is a bad thing.
Humans seem to be similar. If a real product designer dove into all the technical details and code of a product, they would likely forget at least some of the vision behind what the product is actually supposed to be.
Doubt it. I use a similar setup from time to time.
You need to have different skills at different times. This type of setup helps break those skills out.
why would it be? It's a creative setup.
I just actually can't tell, it reads like satire to me.
to me, it reads like mental illness
maybe it's a mix of both :)
Why would it be satire? I thought that's a pretty standard agentic workflow.
My current workplace follows a similar workflow. We have a repository full of agent.md files for different roles and associated personas.
E.g. For project managers, you might have a feature focused one, a delivery driven one, and one that aims to minimise scope/technology creep.
I mean no offence to anyone, but whenever new tech progresses rapidly it usually catches most people unaware, and they tend to ridicule or dismiss the concepts that come out of it.
yeah, nfts, metaverse, all great advances
same people pushing this crap
ai is actually useful tho. idk about this level of abstraction but the more basic delegation to one little guy in the terminal gives me a lot of extra time
Maybe that's because you're not using your time well in the first place
bro im using ai swarms, have you even tried them?
bro wanna buy some monkey jpegs?
100% genuine
You're mocking NFTs, but the original CryptoPunks still sell for a minimum of $80k.
Where were you back then? Laughing about them instead of creating intergenerational wealth for a few bucks?
> Laughing about them instead of creating intergenerational wealth for a few bucks?
it's not creating wealth, it's scamming the gullible
criminality being lucrative is not a new phenomenon
Are you sure that yours would sell for $80K, if you aren't using it to launder money with your criminal associates?
If the price floor is $80k and there are thousands of them, then even if just one was legit it would sell for $80k.
Weird, I'm getting downvoted for just stating facts again.
I don't think so.
I think many people really like the gamification and complex role playing. That is how GitHub got popular, that is how Rube Goldberg agent/swarm/cult setups get popular.
It attracts the gamers and LARPers. Unfortunately, management is on their side until they find out after four years or so that it is all a scam.
I've heard some people say that "vibe coding" with chatbots is like slot machines: you just keep prompting until you hit the jackpot. And there was some earlier study that people _felt_ more productive even if they weren't (caveat that this was with older models), which aligns with the sort of time-dilation people feel when gambling.
I guess "agentic swarms" are the next evolution of the meta-game, the perfect nerd-sniping strategy. Now you can spend all your time minmaxing your team, balancing strengths/weaknesses by tweaking subagents, adding more verifiers and project managers. Maybe there's some psychological draw, in that people can feel like gods and have a taste of the power execs feel, even though that power is ultimately a simulacrum as well.
Extending this -- unlike real slot machines, there is no definite state of won or not for the person prompting, only if they've been convinced they've won, and that comes down to how much you're willing to verify the code it has provided, or better, fully test it (which no one wants to do), versus the reality where they do a little light testing and say it's good enough and move on.
Recently fixed a problem over a few days, and found that it was duplicated, though differently enough that I asked my coworker to try fixing it with an LLM (he was the originator of the duplicated code, and I didn't want to mess up what was mostly functioning code). Using an LLM, he seemingly did in 1 hour what took me maybe a day or two of tinkering and fixing. After we hop off the call, I do a code read to make sure I understand it fully, and immediately see an issue and test it further, only to find out... it did not in fact fix it, and suffered from the same problems, but it convincingly LOOKED like it fixed it. He was ecstatic at the time saved while presenting it, and afterwards, alone, all I could think about was how our business users were going to be really unhappy being gaslit into thinking it was fixed, because literally every tester I've ever met would definitely have missed it without understanding the code.
People are overjoyed with good enough, and I'm starting to think maybe I'm the problem when it comes to progress? It just gives me Big Short vibes -- why am I drawing attention to this obvious issue in quality? I'm just the guy in the casino screaming "does no one else see the obvious problem with shipping this?" And then I start to understand, yes I am the problem: people have been selling each other dogwater products for millennia because at the end of the day, Edison is the person people remember, not the guy who came after and made it near perfect or hammered out all the issues. Good enough takes its place in history, not perfection. The trick others have found out is they just need to get to the point that they've secured the money and have time to get away before the customer realizes the world of hurt they've paid for.
The next stage in all of this shit is to turn what you have into a service. What's the phrase? I don't want to talk to the monkey, I want to talk to the organ grinder. So when you kick things off it should be a tough interview with the manager and program manager. Once they're on board and know what you want, they start cracking. Then they just call you in to give demos and updates. Lol
Congratulations on coming up with the cringiest thing I have ever seen. Nothing will top this, ever.
Corporate has to die
Scrum masters typically do not assign tickets.
This is just sub agents, built into Claude. You don’t need 300,000-line tmux abstractions written in Go. You just tell Claude to do work in parallel with background sub agents. It helps to have a file for handing off the prompt, tracking progress, and reporting back. I also recommend constraining agents to their own worktrees. I am writing down the pattern here: https://workforest.space. While nearly everyone is building orchestrators, I also noticed Claude is already the best orchestrator for Claude.
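A sketch of what I mean by a handoff file (field names are made up; adapt as needed):
// One JSON handoff file per sub-agent: the only shared state between the
// orchestrating Claude and its workers.
interface Handoff {
  task: string;       // the prompt handed to the sub-agent
  worktree: string;   // e.g. created with `git worktree add ../wt-upload-retry`
  status: "queued" | "in_progress" | "needs_review" | "done";
  notes: string[];    // progress the lead can read without pulling the full context
}
const example: Handoff = {
  task: "Add retry with backoff to the upload client; do not touch other modules.",
  worktree: "../wt-upload-retry",
  status: "in_progress",
  notes: ["tests written", "retry implemented, needs review"],
};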
It isn't sub agents. The gap with existing tooling is that the abstraction is over a task rather than a conversation (due to the issue with third-party apps, Claude Code has been inherently limited to conversations, which is why they have been lacking in this area; Claude Code Web was the first move in this direction), and the AI is actually coordinating the work (as opposed to being constantly prompted by the user).
One of the issues that people had which necessitated this feature is that you have a task, you tell Claude to work on it, and Claude has to keep checking back in for various (usually trivial) things. This workflow allows for more effective independent work without context management issues (with subagents there is also an issue with how the progress of the task is communicated; by introducing things like a task board, it is possible to manage this state outside of context). The flow is quite complex and requires a lot of additional context that isn't needed with a chat-based flow, but it is a much better way to do things.
The way to think about this pattern - one which many people began concurrently building in the past few months - is an AI which manages other AIs.
It isn't "just" sub agents, but you can achieve most of this just with a few agents that take on generic roles, and a skill or command that just tells claude to orchestrate those agents, and a CLAUDE.md that tells it how to maintain plans and task lists, and how to allow the agents to communicate their progress.
It isn't all that hard to bootstrap. It is, however, something most people don't think about and shouldn't need to have to learn how to cobble together themselves, and I'm sure there will be advantages to getting more sophisticated implementations.
Right, but the model is still: you tell the AI what to do; this is the AI telling other AIs what to do. The context makes a huge difference because it has to be able to run autonomously. It is possible to do this with the SDK, and the workflow is completely different.
It is very difficult to manage task lists in context. Have you actually tried to do this? i.e. not within a Claude Code chat instance but by one-shot prompting. It is possible that they have worked out some way to do this, but when you have tens of tasks, merge conflicts, you are running that prompt over months, etc. At best, it doesn't work. At worst, you are burning a lot of tokens for nothing.
It is hard to bootstrap because this isn't how Claude Code works. If you are just using OpenRouter, it is also not easy because, after setting up tools/rebuilding Claude Code, it is very challenging to set up an environment so the AI can work effectively, errors can be returned, questions returned, etc. Afaik, this is basically what Aider does... it is not easy, and it is especially not easy in Claude Code, which has a lot of binding choices from the business strategy that Anthropic picked.
> Have you actually tried to do this? i.e. not within a Claude Code chat instance but by one-shot prompting.
You ask if I've tried to do this, and then set constraints that are completely different to what I described.
I have done what I described. Several times for different projects. I have a setup like that running right now in a different window.
> It is hard to bootstrap because this isn't how Claude Code works.
It is how Claude Code works when you give it a number of sub-agents with rules for how to manage files that effectively works like task queues, or skills/mcp servers to interact with communications tools.
> it is not easy
It is not easy to do in a generic way that works without tweaks for every project and every user. It is reasonably easy to do for specific teams where you can adjust it to the desired workflows.
It's natural to assume that subagents will scale to the next level of abstraction; as you mentioned, they do not.
The unlock here is tmux-based session management for the teammates, with two-way communication using agent inbox. It works very well.
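For anyone curious what that looks like mechanically, here is a rough sketch assuming plain tmux plus a file per teammate acting as the inbox; the session names, file paths, and the claude invocation are illustrative, not a description of any particular product:

    # Sketch: one detached tmux session per teammate, plus a plain-text inbox file
    # each agent is instructed to poll. Names and paths are made up for illustration.
    import shlex
    import subprocess
    from pathlib import Path

    def spawn_teammate(name: str, prompt: str) -> None:
        """Start an agent in its own detached tmux session."""
        subprocess.run(["tmux", "new-session", "-d", "-s", name], check=True)
        subprocess.run(["tmux", "send-keys", "-t", name,
                        f"claude {shlex.quote(prompt)}", "Enter"], check=True)

    def send_message(name: str, message: str) -> None:
        """Two-way comms: append to the teammate's inbox file, which it is told to check."""
        inbox = Path("inbox") / f"{name}.md"
        inbox.parent.mkdir(exist_ok=True)
        with inbox.open("a") as f:
            f.write(message + "\n")

    def read_screen(name: str) -> str:
        """The lead (or you) can peek at what a teammate is doing by capturing its pane."""
        out = subprocess.run(["tmux", "capture-pane", "-p", "-t", name],
                             capture_output=True, text=True, check=True)
        return out.stdout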
> Claude Code has been inherently limited to conversations
How so? I’ve been using “claude -p” for a while now.
But even within an interactive session, an agent call out is non-interactive. It operates entirely autonomously, and then reports back the end result to the top level agent.
Because of OAuth. If they gave people API keys then no one would buy their ludicrously priced API product (I assume their strategy is to subsidise the consumer product with the business product).
You can use Claude Code SDK but it requires a token from Claude Code. If you use this token anywhere else, your account gets shut down.
Claude -p still hits Claude Code with all the tools, all the Claude Code wrapping.
That’s not what this subthread is about. They’re talking about the subagent within Claude Code itself.
Btw, you can use the Claude Agent SDK (the renamed Claude Code SDK) with a subscription. I can tell you it works out of the box, and AFAIK it is not a ToS violation.
Oh really? I was looking at the Agent SDK for an idea and the docs seemed to imply that wasn't the case.
I didn't dig deeper, but I'd pick it back up for a little personal project if I could just use my current subscription. Does it just use your local CC session out of the box?

I believe they're talking about Claude Code's built-in agents feature, which works fine with a Max subscription.
https://code.claude.com/docs/en/sub-agents
Are you talking about the same thing or something else like having Claude start new shell sessions?
> If they gave people API keys then no-one buys their ludicrously priced API product
The main driver for those subscriptions is that their monthly cost, with Opus 3.7 and up, pays for itself within a couple of hours of basic CC use, relative to API prices.
can't you just rip the oauth client secret out of the code?
It’s even less of a feature, Claude Code already has subagents; this new feature just ensures Claude Code actually uses this for implementation.
imho the plans of Claude Code are not detailed enough to pull this off; they’re trying to do it to preserve context, but the level of detail in the plans is not nearly enough for it to be reliable.
I agree with this. Any time I make a plan I have to go back and fill it in, fill it in, what did we miss, tdd, yada yada. And yes, I have all this stuff in CLAUDE.md.
You start to get a sense for what size plan (in kilobytes) corresponds to what level of effort. Verification adds effort, and there's a sort of ... Rocket equation? in that the more infrastructure you put in to handle like ... the logistics of the plan, the less you have for actual plan content, which puts a cap on the size of an actionable plan. If you can hit the sweet spot though... GTFO.
I also like to iterate several times in plan mode with Claude before just handing the whole plan to Codex to melt with a superlaser. Claude is a lot more ... fun/personable to work with? Codex is a force of nature.
Another important thing I do, now that launching a plan clears context: it's good to get out of planning mode early, hit an underspecified bit, go back into planning mode and say something like "As you can see, the plan was underspecified; what will the next agent actually need to succeed?", and iterate that way before we actually start making moves. This is made possible by lots of explicit instructions in CLAUDE.md telling Claude to tell me what it's planning/thinking before it acts. Suppressing the tool-call reflex and getting actual thought out helps so much.
It’s moving fast. Just today I noticed Claude Code now ends plans with a reference to the entire prior conversation (as a .jsonl file on disk) with instructions to check that for more details.
Not sure how well it’s working though (my agents haven’t used it yet)
Interesting about the level of detail. I’ve noticed that myself but I haven’t done much to address it yet.
I can imagine some ideas (ask it for more detail, ask it to make a smaller plan and add detail to that) but I’m curious if you have any experience improving those plans.
I’m trying to solve this myself by implementing a whole planner workflow at https://github.com/solatis/claude-config
Effectively it tries to resolve all ambiguities by making all decisions explicit: if a decision cannot be resolved from documentation or any other source, it's asked of the user.
It also tries to capture all “invisible knowledge” by documenting everything, so that all these decisions and business context are captured in the codebase again.
Which - in theory - should make long term coding using LLMs more sane.
The downside is that it takes 30min - 60min to write a plan, but it’s much less likely to make silly choices.
Have you tried the compound engineering plugin? [^1]
My workflow with it is usually brainstorm -> lfg (planning) -> clear context -> lfg (giving it the produced plan to work on) -> compound if it didn’t on its own.
[^1]: https://github.com/EveryInc/compound-engineering-plugin
That’s super interesting, I’ll take a look to see if I can learn something from it, as I’m not familiar with the concept of compound engineering.
Seems like a lot of it aligns with what I’m doing, though.
I iterate around issues. I have a skill that launches a new tmux window for a worktree, with Claude in one pane and Codex in another, along with instructions on which issue to work on. Claude has instructions to create a plan, while Codex has instructions to understand the background information needed for the issue. By the time they're both done, I can feed Claude's plan into Codex, and Codex is ready to analyze it. Then Codex feeds the plan back to Claude, and they ping-pong like that a couple of times. After several iterations there's enough refinement that things usually work.

Then Claude clears context and executes the plan. Codex reviews the commit, and since it still has all the original context it knows what we were planning and what the research into the infrastructure turned up, so it does a really good job reviewing. Again they ping-pong back and forth a couple of times, and the end product is pretty decent.

Codex's strength is that it really goes in-depth; I usually run it at high reasoning effort. But Codex has zero EQ or communication skills, so it works really well as a pedantic reviewer. Claude is much more pleasant to interact with, there's just no comparison, which is why I like planning with Claude much more: we can iterate.

I am just a hobbyist, though. I do this to run my Ansible/Terraform infrastructure for a good-sized homelab with 10 hosts, so we touch real hardware a lot and there are always some gotchas to deal with. But together this is a pretty fun way to work. I like automating stuff, so it really scratches that itch.
I have had good success with the plans generated by https://github.com/obra/superpowers I also really like the Socratic method it uses to create the plans.
Claude already had subagents. This is a new mode for the main agent to be in (bespoke context oriented to delegation), combined with a team-oriented task system and a mailbox system for subagents to communicate with each other. All integrated into the harness in a way that plugins can't achieve.
Wow there goes a lot of harnesses out the window. The main limitation of subagents was they couldn’t communicate back and forth with the main agent. How do we invoke swarm mode in Claude Code?
OT: Your visual on "stacked PRs" instantly made me understand what a stacked PR is. Thank you!
I had read about them before but for whatever reason it never clicked.
Turns out I already work like this, but I use commits as "PRs in the stack" and I constantly try to keep them up to date and ordered by rebasing, which is a pain.
Given my new insight from the way you displayed it, I had a chat with ChatGPT and feel good about giving it a try.
If you’re interested in exploring tooling around stacked PRs, I wrote git-spice (https://abhinav.github.io/git-spice/) a while ago. It’s free and open-source, no strings attached.
If you're rebasing a lot, definitely set up rerere (reuse recorded resolution; git config rerere.enabled true) - it improves things enormously.
Do make sure you know how to reset the cache (it lives under .git/rr-cache), in case you record a bad conflict resolution, because it will keep biting you. Besides that caveat, it's a must.
Isn’t this just “Gitflow”?
https://www.atlassian.com/git/tutorials/comparing-workflows/...
After a quick read it seems like gitflow is intended to model a release cycle. It uses branches to coordinate and log releases.
Stacking is meant to make development of non-trivial features more manageable and more likely to land in main safely and quickly.
It's specific to each developer's workflow and wouldn't necessarily leave artifacts once merged into main (something gitflow seems to intentionally take a stance on).
Please don’t use git-flow. Every time I see it, it looks like an over-engineer’s wet dream.
Can you say more as to why? The concept is not complex and in our situation at least provides a lot of benefits.
I think the guy that created it has even stated he thinks it's a bad idea
Literally, the reason for git's existence is to make merging diverging histories less complicated. Adding the complexity back misses the point entirely.
Yeah, since they introduced (possibly async) subagents, I've had my main Claude instance act as a manager overseeing implementation agents, keeping its context clean, and ensuring everything goes to plan in the highest-quality way.
yep, this is exactly how I use the main agent too; I explicitly instruct it to only ever use background async subagents. Not enough people understand that the Claude Code harness is event-driven now and will wake up whenever these subagent completion events happen.
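Conceptually the event-driven part is just a loop that parks the lead until any background task finishes; this is the idea, not the actual harness internals, and run_subagent here is a made-up stand-in for launching a real subagent:

    # Sketch of an event-driven lead: it sleeps until a subagent-completion event
    # arrives, then wakes up with only that result in hand.
    import asyncio

    async def run_subagent(task_id: str) -> str:
        # Stand-in for launching a real background subagent; here it just simulates work.
        await asyncio.sleep(1)
        return f"summary of {task_id}"

    def handle_completion(task_id: str, summary: str) -> None:
        # Only the completed task's summary enters the lead's context.
        print(f"[{task_id}] done: {summary}")

    async def lead_loop(task_ids: list[str]) -> None:
        pending = {asyncio.create_task(run_subagent(t), name=t) for t in task_ids}
        while pending:
            done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
            for finished in done:
                handle_completion(finished.get_name(), finished.result())

    if __name__ == "__main__":
        asyncio.run(lead_loop(["tests", "docs", "refactor"]))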
Any recommendations on sandboxing agents? Last time I asked folks recommended docker.
https://github.com/finbarr/yolobox/
I like running remotely using exe.dev with SyncThing to sync files to my laptop.
I use Shelley (their web-based agent) but they have Claude Code installed too.
This smells like Claude's own version of Gas Town by Steve Yegge. Probably more constrained and less of a crazy bull ride.
But seems we are heading this way, from initially:
- a Senior Dev pairing with Junior Dev (2024/25)
- a tech lead/architect in charge of several Developers (2025)
- a Product Owner delegating to development teams (2026?)
---
- https://github.com/steveyegge/gastown
- https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16d...
I want it to generate better code but less of it, and be more proactive about getting human feedback before it starts going off the rails. This sounds like an inexorable push in the opposite direction.
I can see this approach being useful once the foundation is more robust, has better common sense, knows when to push back when requirements conflict or are underspecified. But with current models I can only see this approach as exacerbating the problem; coding agents solution is almost always "more code", not less. Makes for a nice demo, but I can't imagine this would build anything that wouldn't have huge operational problems and 10x-100x more code than necessary.
Agreed, I'm constantly coming back to a Claude tmux pane just to see it's decided to do something completely ridiculous. Just the other day I was having it add some test coverage stats to CI runs, and when I came back it was basically trying to reinvent Istanbul in a bash script because the nyc tool wasn't installed in CI. I had to interrupt it and say "uh, just install nyc?". I was, of course, "absolutely right!".
> it was basically trying to reinvent Istanbul in a bash script because the nyc tool wasn't installed in CI
For the first part of this comment, I thought "trying to reinvent Istanbul in a bash script" was meant to be a funny way to say "It was generating a lot of code" (as in generating a city's worth of code)
If only Rome could be built in a day..
They haven’t released this feature, so maybe they know the models aren’t good enough yet.
I also think it’s interesting to see Anthropic continue to experiment at the edge of what models are capable of, and having it in the harness will probably let them fine-tune for it. It may not work today, but it might work at the end of 2026.
True, though even then I kind of wonder what's the point. Once they build an AI that's as good as a human coder but 1000x faster, parallelization no longer buys you anything. Writing and deploying the code is no longer the bottleneck, so the extra coordination required for parallelism seems like extra cost and risk with no practical benefit.
Each agent having their own fresh context window for each task is probably alone a good way to improve quality. And then I can imagine agents reviewing each others work might work to improve quality as well, like how GPT-5 Pro improves upon GPT-5 Thinking.
There's no need to anthropomorphize though. One loop that maintains some state and various context trees gets you all that in a more controlled fashion, and you can do things like cache KV caches across sessions, roll back a session globally, use different models for different tasks, etc. Assuming a one-to-one-to-one relationship between loops and LLM and context sounds cooler--distributed independent agents--but ultimately that approach just limits what you can do and makes coordination a lot harder, for very little realizable gain.
The solutions you suggest are multiple agents. An agent is nothing more than a linear context and a system that calls tools in a loop while appending to that context. Whether you run them in a single thread where you fork the context and hotswap between the branches, or multiple threads where each thread keeps track of its own context, you are running multiple agents either way.
Fundamentally, forking your context, or rolling back your context, or whatever else you want to do to your context also has coordination costs. The models still have to decide when to take those actions unless you are doing it manually, in which case you haven't really solved the context problems, you've just given them to the human in the loop.
It’s more about context management, not speed
Do you really need a full dev team ensemble to manage context? Surely subagents are enough.
Potato, potatoh. People get confused by all this agent talk and forget that, at the end of the day, LLM calls are effectively stateless. It's all abstractions around how to manage the context you send with each request.
All you have to do is set up an MCP server that routes to a human on the backend, and you've got an AI that asks for human feedback.
Antigravity and others already ask for human feedback on their plans.
This feels like massively overengineering something very simple.
Agents are stateless functions with a limited heap (context window) that degrades in quality as it fills. Once you see it that way, the whole swarm paradigm is just function scoping and memory management cosplaying as an org chart:
Agent = function
Role = scope constraints
Context window = local memory
Shared state file = global state
Orchestration = control flow
The solution isn't assigning human-like roles to stateless functions. It's shared state (a markdown file) and clear constraints.
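In code, the whole mapping is about a dozen lines; call_model below is a made-up stand-in for whatever stateless completion API you use, and plan.md is the shared-state file:

    # The swarm paradigm as plain function scoping. call_model stands in for a
    # stateless LLM call: it sees only what you pass it, nothing else.
    from pathlib import Path

    STATE = Path("plan.md")   # shared state: a markdown file

    def call_model(prompt: str) -> str:
        # Stand-in for any stateless completion API call.
        return f"[model output for a {len(prompt)}-char prompt]"

    def agent(role_constraints: str, task: str) -> str:
        """An 'agent' is a function: scope constraints plus local context, nothing more."""
        context = (f"{role_constraints}\n\n"
                   f"Shared state:\n{STATE.read_text()}\n\n"
                   f"Task:\n{task}")
        return call_model(context)

    def orchestrate(tasks: list[str]) -> None:
        """'Orchestration' is ordinary control flow over those functions."""
        STATE.touch()  # make sure the shared-state file exists
        for task in tasks:
            result = agent("Only touch files under src/. Output a unified diff.", task)
            STATE.write_text(STATE.read_text() + f"\n- {task}: {result[:200]}")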
I don’t follow. You said it’s over engineering and then proposed what appears to be functionally the exact same thing?
Isn’t a “role” just a compact way to configure well-known systems of constraints by leveraging LLM training?
Is your proposal that everybody independently reinvent the constraints wheel, so to speak?
I basically always handled Claude Code this way, by asking it to spawn subagents as much as possible to handle self-contained tasks (I've heard there are hacks to make subagents work with Codex). But Claude Code's new tasks seem to go further: they let subagents coordinate via a common file to avoid stepping on each other's toes (by creating a dependency graph).
I didn't sleep enough, or slept for 10 years.
This thread seems surreal, I see multiple flow repositories mentioned with 10k+ stars. Comprehensive doc. genAI image as a logo.
Can anyone show me one product these things have accomplished please ?
I used some frontier LLM yesterday to see if it could finally produce a simple cascading-style-sheet fix. After a few dozen attempts and a lot of steering, a couple of hours and half a million tokens wasted, it couldn't. So I fixed the issue myself and went to bed.
You are clearly behind, no offense but what do you do on HN
Time traveling.
I usually try to stay polite here, but what a deeply stupid comment
This person is on HN for the same reasons as I am, presumably: reading about hacker stuff. Entering prompts in black boxes and watching them work so you have more time to scratch your balls is not hacker stuff, it's the latest abomination of late stage capitalism and this forum is, sadly, falling for it.
Exactly my thought. I wasn't sure, but I came across a witty comment the other day: that Hacker News is a Y Combinator forum that happens to be public.
I then went to see the latest batches. Cohorts are heavily building things that would support the fall for whatever this is. It needs supported or we won't make it.
I'd really like to see a regular poll on HN that keeps track of which AI coding agents are the most popular among this community, like the TIOBE Index for programming languages.
Hard to keep up with all the changes and it would be nice to see a high level view of what people are using and how that might be shifting over time.
Not this community's opinion on agents, but I've found it helpful to check the lmarena leaderboards occasionally. Your comment prompted me to take a look for the first time in a while. Kind of surprising to see models like MiniMax 2.1 above most of the OpenAI GPTs.
https://lmarena.ai/leaderboard/code
Also, I'm not sure if it's exactly the case but I think you can look at throughput of the models on openrouter and get an idea of how fast/expensive they are.
https://openrouter.ai/minimax/minimax-m2.1
I just started something like that, haven’t shared it widely yet, but here we go - happy if you participate: https://agentic-coding-survey.pages.dev/
Add vscode. Add a list of models, since many tools allow you to select which model you use.
Thanks for the feedback. I thought there are just too many models and versions to list them all. For now, if you select "other" you get a text field to add any model not listed, hope this helps.
You should add OpenAI Codex CLI.
Thanks for the feedback, I'll do that. For now, if you select "other" you get a text field to add any model not listed..
Any chance you'll add Antigravity and Jetbrains Junie? I've been using almost nothing but those for the last month. Antigravity at home, Junie at work.
Done, upon popular demand I added Antigravity, Codex CLI, and Junie
Thanks!
> Q5. For which tasks do you use AI assistance most?
This is really tough for me. I haven't done a single one of those mostly-manually over the last month.
Just pick your favorite one and stick with it. There is no point in keeping up, since we're in an endless cycle of hype where is one ranked higher than the other, with them eventually catching up to each other
I personally don't want to trawl through Twitter to find the current state-of-the-art, so I read Zvi Mowshowitz's newsletter:
https://thezvi.substack.com/
His newsletter put me onto using Opus 4.5 exclusively on Dec 1, a little over a week after it was released. That's pretty good for a few minutes of reading a week.
Christ, the latest post is about dating and uses an ai generated wojak meme..
I have an agent skill that is currently in the top 10 or so of the skills.sh directory - in terms of that audience, it's about 80% claude code.
Also 75% darwin-arm64
Question is, are people on HN procrastinating and commenting here because the agent isn't very good and they're avoiding having to write the code themselves, or is the agent so good that it's off writing code, and the people here are commenting out of boredom?
You're making it sound like before agents existed HN was a ghost town because everyone was too busy building ImportantThingTM by hand
Oh. Surely you know this forum didn't exist pre-ChatGPT. Everything in the archives was generated so it just looks that way.
>Question is, are people on HN procrastinating and commenting here because the agent isn't very good and they're avoiding having to write the code themselves
Can you help me envision what you're saying? It's async - you will have to wait whether it's good or not. And in theory, the better it is, the more time you'd have to comment here, right?
I'm saying if it's that bad, then it's pure procrastination
People have been procrastinating on HN since the beginning of time, before coding agents existed.
Correct me if I'm wrong, but before ChatGPT there were fewer comments about vibecoding.
When all of industry is trying to catch up with the features of one coding agent - it may be a signal to just use that one.
Sure, let's all ditch linux and macOS as well since they're not the most popular...
>You're not talking to an AI coder anymore. You're talking to a team lead. The lead doesn't write code - it plans, delegates, and synthesizes.
They couldn't even be bothered to write the Tweet themselves...
isn’t it interesting how often this rhetorical construction is overused by AI?
Partly because it's a good construct. Most people's writing is garbage compared to what LLMs output by default.
But the other part of it is that each conversation you have, and each piece of AI output you read online, is written by an LLM instance that has no memory of prior conversations, so it doesn't know that, from a human perspective, it used this construct 20 times in the last hour. Human writers avoid repeating the same phrases in quick succession, even across different pieces of writing (e.g. I might not reuse a phrase in an email to person A because I just used it in an email to unrelated person B, and it feels like bad style).
Perhaps that's why reading LLM output feels like reading high school essays. Those essays all look alike because they're all written independently and each is a self-contained piece where the author tries to show off their mastery of language. After reading 20 of them in a row, one too gets tired of seeing the same few constructs being used in nearly every one of them.
Very much so. It feels like it can't have been that common in the original training corpus. Probably more common now given that we are training slop generators with slop.
I've done plenty of vibe coding even though I know how to program, but I mostly work with a single agent through its CLI. The progress is really good and, more importantly, I can follow it. I can read the output, test it, and understand what changed and why.

I don't see much upside in swarms. But I do see the downside, which is losing the ability to keep the whole system in my head. The codebase starts growing in directions I didn't choose, and it seems decisions will get made that I didn't review.

Early AI autocomplete that could finish a function already felt like a big productivity win, and then AI that could write whole files was an even bigger jump. Like, pretty massive. Running one agent at a time, watching what it does, and vetting the output still works well for me and still feels like a strong multiplier.

But now there's so much more ceremony: AGENTS.md, SKILLS.md, delegation frameworks. I guess I'm not convinced that leads to better outcomes, but I'm probably missing something. It just seems like a tradeoff that sacrifices understanding for ostensible progress.
My understanding is that this system just produces much better results (it's all about clean context windows), so you don't really have a choice. What they could improve on is logging, so you can easily see what the subagents are doing. I think subagents are still relatively new and immature.
The problem I’ve been having is that when Claude generates copious amounts of code, it makes it way harder to review than small snippets one at a time.
Some would argue there’s no point reviewing the code, just test the implementation and if it works, it works.
I still am kind of nervous doing this in critical projects.
Anyone just YOLO-coding projects that aren't meant to be one-off, but are fully intended to be supported for a long time? What are the learnings after 3-6 months of supporting them in production?
In a professional setting where you still have coding standards, and people will review your code, and the code actually reaches hundreds of thousands of real users, handling one agent at a time is plenty for me. The code output is never good enough, and it makes up stuff even for moderately complicated debugging ("Oh I can clearly see the issue now", I heard it ten times before and you were always wrong!)
I do use them, though, it helps me, search, understand, narrow down and ideate, it's still a better Google, and the experience is getting better every quarter, but people letting tens or hundreds of agents just rip... I can't imagine doing it.
For personal throwaway projects that you do because you want to reach the end output (as opposed to learning or caring), sure, do it, you verify it works roughly, and be done with it.
This is my problem with the whole "can LLMs code?" discussion. Obviously, LLMs can produce code, well even, much like a champion golfer can get a hole in one. But can they code in the sense of "the pilot can fly the plane", i.e. barring a catastrophic mechanical malfunction or a once-in-a-decade weather phenomenon, the pilot will get the plane to its destination safely? I don't think so.
To me, someone who can code means someone who (unless they're in a detectable state of drunkenness, fatigue, illness, or distraction) will successfully complete a coding task commensurate with some level of experience or, at the very least, explain why exactly the task is proving difficult. While I've seen coding agents do things that truly amaze me, they also make mistakes that no one who "can code" ever makes. If you can't trust an LLM to complete a task anyone who can code will either complete or explain their failure, then it can't code, even if it can (in the sense of "a flipped coin can come up heads") sometimes emit impressive code.
That's a funny analogy. You should look into how modern planes are flown. Hint: it's a computer.
> Hint: it's a computer.
Not quite, but in any event none of the avionics is an LLM or a program generated by one.
> people will review your code,
I mean you'd think. But it depends on the motivations.
At Meta, we had league tables for reviewing code. Even then people only really looked at it if a) they were a nitpicking shit, b) they didn't like you and wanted to piss on your chips, or c) it was another team trying to fix our shit.
With the internal Claude rollout and the drive to vibe code all the things, I'm not sure that situation has got any better. Fortunately it's not my problem anymore.
Well, it certainly depends on the culture of the team and organization.
Where you have shared ownership, meaning once I approved your PR I am just as responsible if something goes wrong as you are, and I can be expected to understand it just as well as you do… your code will get reviewed.
If shipping is the number one priority of the team, and a team is really just a group of individuals working to meet their quota, and everyone wants to simply ship their stuff, managers pressure managers to constantly put pressure on the devs, you’ll get your PR rubber stamped after 20s of review. Why would I spend hours trying to understand what you did if I could work on my stuff.
And yes, these tools make this 100x worse, people don’t understand their fixes, code standards are no longer relevant, and you are expected to ship 10x faster, so it’s all just slop from here on.
> people will review your code,
People will ask an LLM to review some slop made by an LLM and they will be absolutely right!
There is no limit to laziness.
Soon you'll be seen as irresponsible and wasteful if you don't let the smarter LLM do it.
In my (admittedly conflict-of-interest: I work for Graphite/Cursor) opinion, asking CC to stack changes and then having an automated reviewer agent helps a lot with digesting and building conviction in otherwise-large changesets.
My "first pass" of review is usually me reading the PR stack in graphite. I might iterate on the stack a few times with CC before publishing it for review. I have agents generate much of my code, but this workflow has allowed me to retain ownership/understanding of the systems I'm shipping.
Not a direct answer to your question, but I’m recently trying to adopt the mindset of letting Claude “prove” to me with very high confidence that what they did works. The bar for this would be much higher than what I’d require for a human engineer. For example it can be near 100% test coverage, combined with advanced testing techniques like property-based tests and fuzz tests, and benchmarks if performance is a concern. I’d still have to skim through both the implementation and tests, but it doesn’t have to be a line by line review. This also forces me to establish a verifiable success criteria which is quite useful.
Results will vary depending on how automatically checkable a problem is, but I expect a lot of problems are amenable to some variation of this.
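As a concrete example of the kind of evidence I mean, a property-based test (using Hypothesis purely as an illustration; the function under test is made up) is much cheaper to review than the implementation itself:

    # Illustrative property-based test: instead of reading every line of the
    # implementation, I review the properties the agent claims must always hold.
    from hypothesis import given, strategies as st

    def dedupe_preserving_order(items: list[int]) -> list[int]:
        seen: set[int] = set()
        out: list[int] = []
        for x in items:
            if x not in seen:
                seen.add(x)
                out.append(x)
        return out

    @given(st.lists(st.integers()))
    def test_dedupe_properties(items: list[int]) -> None:
        result = dedupe_preserving_order(items)
        assert len(result) == len(set(items))       # no duplicates survive
        assert set(result) == set(items)            # nothing is lost
        assert all(items.index(a) < items.index(b)  # first-occurrence order is kept
                   for a, b in zip(result, result[1:]))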
I think we'll start to see the results of that late this year, but it's a little early yet. Plenty of people are diving headfirst into it
To me it feels like building your project on sand. Not a good idea unless it's a sandcastle
I have Claude Code author changes, and then I use this "codex-review" skill I wrote that does a review of the last commit. You might try asking Codex (or whatever) to review the change to give you some pointers to focus on with your review, and also in your review you can see if Codex was on track or if it missed anything, maybe feed that back into your codex review prompt.
I just can’t get with this. There is so much beyond “works” in software. There are requirements that you didn’t know about and breaking scenarios that you didn’t plan for and if you don’t know how the code works, you’re not going to be able to fix it. Assuming an AI could fix any problem given a good enough prompt, I can’t write that prompt without sufficient knowledge and experience in the codebase. I’m not saying they are useless, but I cannot just prompt, test and ship a multiservice, asynchronous, multidb, zero downtime app.
Yes this is one of my concerns.
Usually about 50% of my understanding of the domain comes from the process of building the code. I can see a scenario where large scale automated code works for a while but then quickly becomes unsupportable because the domain expertise isn't there to drive it. People are currently working off their pre-existing domain knowledge which is what allows them to rapidly and accurately express in a few sentences what an AI should do and then give decisive feedback to it.
The best counter argument is that AIs can explain the existing code and domain almost as well as they can code it to begin with. So there is a reasonable prospect that the whole system can sustain itself. However there is no arguing to me that isn't a huge experiment. Any company that is producing enormous amounts of code that nobody understands is well out over their skis and could easily find themselves a year or two down the track with huge issues.
I don’t know what your stack is, but at least with elixir and especially typescript/nextJS projects, and properly documenting all those pieces you mentioned, it goes a long way. You’d be amazed.
If it involves Nextjs then we aren’t talking about the same category of software. Yes it can make a website pretty darn well. Can it debug and fix excessive database connection creation in a way that won’t make things worse? Maybe, but more often not and that’s why we are engineers and not craftsmen.
That example is from a recent bug I fixed without Cursor being able to help. It wanted to create a wrapper around the pool class that would have blocked all threads until a connection was free. Bug fixed! App broken!
I would never use, let alone pay for, a fully vibe-coded app whose implementation no human understands.
Whether you’re reading a book or using an app, you’re communicating with the author by way of your shared humanity in how they anticipate what you’re thinking as you explore the work. The author incorporates and plans for those predicted reactions and thoughts where it makes sense. Ultimately the author is conveying an implicit mental model to the reader.
The first problem is that many of these pathways and edge cases aren’t apparent until the actual implementation, and sometimes in the process the author realizes that the overall app would work better if it were re-specified from the start. This opportunity is lost without a hands on approach.
The second problem is that, the less human touch is there, the less consistent the mental model conveyed to the user is going to be, because a specification and collection of prompts does not constitute a mental model. This can create subconscious confusion and cognitive friction when interacting with the work.
i feel like the massive automation afforded by these coding agents may make this worse
Yeah, it's not just my job to generate the code: It's my job to know the code. I can't let code out into the wild that I'm not 100% willing to vouch for.
At a higher level, it goes beyond that. It's my job to take responsibility for code. At some fundamental level that puts a limit on how productive AI can be. Because we can only produce code as fast as responsibility takers can execute whatever processes they need to do to ensure sufficient due diligence is executed. In a lot of jurisdictions, human-in-loop line by line review is being mandated for code developed in regulatory settings. That pretty much caps the output at the rate of human review, which is to be honest, not drastically higher than coding itself anyway (Often I might invest 30% of the time to review a change as the developer took to do it).
It means there is no value in producing more code. Only value in producing better, clearer, safer code that can be reasoned about by humans. Which in turn makes me very sceptical about agents other than as a useful parallelisation mechanism akin to multiple developers working on separate features. But in terms of ramping up the level of automation - it's frankly kind of boring to me, because if anything it makes the review part harder, which actually slows us down.
Looks like agent orchestrators provided by the foundation model providers will become a big theme in 2026. By wrapping it in terms that are already used in software development today like team leads, team members, etc. rather than inventing a completely new taxonomy of Polecats and Badgers, will help make it more successful and understandable.
Respectfully disagree. I think polecats are a reasonable antidote to overanthropomorphization.
Furries would like to have a word.
Totally agreed. Most of the weird concepts of Gas Town are just workarounds for bad behavior in Claude or the underlying models. Anthropic is in the best position to get their own model to adhere to orchestration steps, obviating the need for these extra layers. Beyond that, there shouldn’t actually be much to orchestration beyond a solid messaging and task management implementation.
Listen team lead and the whole team, make this button red.
Principal engineers! We need architecture! Marketing team, we need ads with celebrities! Product team, we need a roadmap to build on this for the next year! ML experts, get this into the training and RL sets! Finance folks, get me annual forecasts and ROI against WACC! Ops, we’ll need 24/7 coverage and a guarantee of five nines. Procurement, lock down contracts. Alright everyone… make this button red!
Don't make mistakes.
We have to reject the claim that Claude can do it simply from a prompt, because then everyone could do it. As SWEs we are not going to pragmatically accept that we are done. https://www.youtube.com/watch?v=g_Bvo0tsD9s
ha! The default system prompt appears to give the main agent appropriate guidance about only using swarm mode when appropriate (same as entering itself into plan mode). You can further prompt it in your own CLAUDE.md to be even more resistant to using the mode if the task at hand isn't significant enough to warrant it.
I like opencode for the fact I can switch between build and plan mode just by pressing tab.
Isn't it the same in base claude-code?
So this is Gas Town, just without the "Steve Yegge makes a quarter of a million on a memecoin pump-n-dump" step (yet)?
Am I the only one who’s been so late to crypto? I still have not touched a single cryptocurrency, even somewhat stable/legit ones. It always gives me a bit of FOMO hearing these stories
Answering the question how to sell more tokens per customer while maintaining ~~mediocre~~ breakthrough results.
Delegation patterns like swarm lead to lower token usage because:

1. Subagents doing the work have a fresh context (i.e. focused, not sitting on top of a larger monolithic context).

2. A more compact context leads to better reasoning, more effective problem solving, and fewer tokens burned.
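Back-of-envelope, with made-up numbers: a single agent whose context grows from roughly 20k to 120k tokens over five tasks is re-reading the running total on every call, so it averages on the order of 70k input tokens per task. Five subagents that each start fresh at around 25k, plus a lead that only holds a short plan and the returned summaries, pay a fraction of that per call. The saving comes entirely from not dragging the accumulated history into every request, which is also why badly sized tasks (lots of tiny handoffs, each needing its own summary) can eat the gain back in coordination overhead.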
Merge cost kills this. Does the harness enforce file/ownership boundaries per worker, and run tests before folding changes back into the lead context?
I don't know what you're referring to but I can say with confidence that I see more efficient token usage from a delegated approach, for the reasons I stated, provided that the tasks are correctly sized. ymmv of course :)
Claude Code in the desktop app seems to do this already? It's crazy to watch. It sets off these huge swarms of worker readers under master task headings, which go off and explore the code base and compile huge reports and todo lists; then another system behind the scenes seems to compile everything into large master schemas/plans.

I create helper files and then have a devops chat, a front-end chat, an architecture chat and a security chat, and once each has done its work it automatically writes to a log and the others pick up the log (there seems to be a built-in system-reminder process that can push updates from one chat into the others).

It's really wild to watch it work, and it's very intuitive and fun to use. I've not tried CLI Claude Code, only Claude Code in the desktop app, but the desktop app with SFTP to a droplet and SSH for it to use the terminal is a very, very interesting experience. It can seem to just go for hours building, fixing, checking its own work, loading its work in the browser, doing more work, etc., all on its own - it's how I built this: https://news.ycombinator.com/item?id=46724896 in 3 days.
That’s just spawning multiple parallel explore agents instructed to look at different things, and then compiling results
That’s a pretty basic functionality in Claude code
Sounds like I should probably switch to claude code cli. Thanks for the info. :)
I added tests to an old project a few days ago. I spent a while to carefully spec everything out, and there was a lot of tedious work. Aiming for 70% coverage meant that a few thousand unit tests were needed.
I wrote up a technical plan with Claude code and I was about to set it to work when I thought, hang on, this would be very easy to split into separate work, let's try this subagent thing.
So I asked Claude to split it up into non-overlapping pieces and send out as many agents as it could to work on each piece.
I expected 3 or 4. It sent out 26 subagents. Drudge work that I estimate would have optimistically taken me several months was done in about 20 minutes. Crazy.
Of course it still did take me a couple of days to go through everything and feel confident that the tests were doing their job properly. Asking Claude to review separate sections carefully helped a lot there too. I'm pretty confident that the tests I ended up with were as good as what I would have written.
Sounds very similar to oh-my-opencode.
Amazing, I need to check it out in my projects
We call it Shawarma where I come from
So apparently all swarm features are controlled by a single gate function in Claude Code:
---
function i8() {
  // Yz appears to be a truthiness check: if the CLAUDE_CODE_AGENT_SWARMS env var is set, swarms are disabled (the opt-out mentioned below).
  if (Yz(process.env.CLAUDE_CODE_AGENT_SWARMS)) return !1;
  // xK looks up the server-side "tengu_brass_pebble" feature flag, defaulting to off.
  return xK("tengu_brass_pebble", !1);
}
---
So, after the patch:
function i8(){return!0}
---
The tengu_brass_pebble flag is server-side controlled based on the particulars of your account, such as tier. If you have the right subscription, the features may already be available.
The CLAUDE_CODE_AGENT_SWARMS environment variable only works as an opt-out, not an opt-in.
How is this different from GSD: https://github.com/glittercowboy/get-shit-done
I've been using that and it's excellent
GSD was the first project management framework I used. Initially I loved it because it felt like I was so much better organized.
As time went on I felt like the organization was kind of an illusion. It demanded something from me and steered Claude, but ultimately Claude is doing whatever it's going to do.
I went back to just raw-dogging it with lots of use of planning mode.
Really boils down to the benefits of first party software from a company that has billions of dollars of funding vs similar third party software from an individual with no funding.
GSD might be better right now, but will it continue to be better in the future, and are you willing to build your workflows around that bet?
I don't understand these questions/references. It's different because it's a capability baked into the actual tool and maintained by the originators of the tool.
a similar question was asked elsewhere in the thread; the difference is that this is tightly integrated into the harness
It's agents all the way down.
You guys have been intentionally milking clocks and gate keeping information. Keep crying that you're losing your jobs. It's funny.
I'm a fan of AI coding tools but the trend of adding ever more autonomy to agents confuses me.
The rate at which a person running these tools can review and comprehend the output properly is basically reached with just a single thread with a human in the loop.
Which implies that this is not intended to be used in a setting where people will be reading the code.
Does that... Actually work for anyone? My experience so far with AI tools would have me believe that it's a terrible idea.
It works for me, in that I don't care about all the intermediate babble the AI generates. What matters is the final changelist before hitting commit... going through that, editing it, fixing comments, etc. But holding its hand while it deals with LSP issues like a logger not being visible sometimes is just not something I see a reason to waste my time on.
After I have written a feature and I'm in the ironing-out-bugs stage, this is where I like the agents to do a lot of the grunt work; I don't want to write JSDocs or fix this lint issue.
I have also started using it for writing tests.
I will write the first test, the "good path"; it can copy this and tweak the inputs to trigger all the branches far faster than I can.
It likely is acceptable for business-focused code. Compared to a lot of code written by humans, even if the AI code is less than optimal, it's probably better quality than what many humans will write. I think we can all share some horror stories of what we've seen pushed to production.
Executives/product managers/sales often only really care about getting the product working well enough to sell it.
Yes, this actually works. In 2026, software engineering is going to change a great deal as a result, and if you're not at least experimenting with this stuff to learn what it's capable of, that's a red flag for your career prospects.
I don't mean this in a disparaging way. But we're at a car-meets-horse-and-buggy moment and it's happening really quickly. We all need to at least try driving a car and maybe park the horse in the stable for a few hours.
The FOMO nonsense is really uncalled for. If everything is going to be vibecoded in the future, then either there are going to be a million code-unfucking jobs or no jobs at all.
Attitudes like that, where you believe the righteous AI pushers will be saved from the coming rapture while everyone else will be out on the streets, really make people hate the AI crowd.
The comment you’re replying to is actually very sensible and non-hypey. I wouldn’t even categorize it as particularly pro-AI, considering how ridiculous some of the frothing pro-AI stuff can get.
Uh-huh, heard the same thing about IDEs, machine learning in your tools, and others. Yet the most impressive people that I've met, actual wizards who could achieve what no one else could, were using Emacs or Vim.
No, it doesn't work in practice because they make far too many mistakes.
Based on Gas Town, the people doing this agree that they are well beyond an amount of code they can review and comprehend. The difference seems to be they have decided on a system that makes it not a terrible idea in their minds.
> running these tools can review and comprehend the output properly
You have to realize this is targeting manager and team lead types who already mostly ignore the details and quality frankly. "Just get it done" basically.
That's fine for some companies looking for market fit or whatever - and a disaster for some other companies now or in future, just like outsourcing and subcontracting can be.
My personal take is: speed of development usually doesn't make that big a difference for real companies. Hurry up and wait, etc.
> The rate at which a person running these tools can review and comprehend the output properly is basically reached with just a single thread with a human in the loop.
That's what you're missing -- the key point is, you don't review and comprehend the output! Instead, you run the program and then issue prompts like this (example from simonw): "fix in and get it to compile" [0]. And I'm not ragging on this at all, this is the future of software development.
[0] https://gisthost.github.io/?9696da6882cb6596be6a9d5196e8a7a5...
I've commented on this before, but issuing a prompt like "Fix X" makes so many assumptions (like a "behaviorism" approach to coding) including that the bug manifests in both an externally and consistently detectable way, and that you notice it in the first place. TDD can reduce this but not eliminate it.
I do a fair amount of agentic coding, but always periodically review the code even if it's just through the internal diff tool in my IDE.
Approximately 4 months ago Sonnet 4.5 wrote this buried deep in the code while setting up a state machine for a 2d sprite in a relatively simple game:
I might never have even noticed the logical error but for Claude Code attaching the above misleading comment. 99.99% of true "vibe coders" would NEVER have caught this.

It's a bit like the argument with self-driving cars though. They may be safer overall, but there's a big difference in how responsibility for errors is attributed. If a human is not a decision maker in the production of the code, where does responsibility for errors propagate to?
I feel like software engineers are taking a lot of license with the idea that if something bad happens, they will just be able to say "oh, the AI did it" and no personal responsibility or liability will attach. But if they personally looked at the code and their name is underneath it, signing off the merge request and acknowledging responsibility for it, we have a very different dynamic.
Just like artists have to re-conceptualise the value of what they do around the creative part of the process, software engineers have to rethink what their value proposition is. And I'm seeing a large part of it is, you are going to take responsibility for the AI output. It won't surprise me if after the first few disasters happen, we see liability legislation that mandates human responsibility for AI errors. At that point I feel many of the people all in on agent driven workflows that are explicitly designed to minimise human oversight are going to find themselves with a big problem.
My personal approach is I'm building up a tool set that maximises productivity while ensuring human oversight. Not just that it occurs and is easy to do, but that documentation of it is recorded (inherently, in git).
It will be interesting to see how this all evolves.
A guy who worked at Docker on Docker Swarm now works at Anthropic, so it makes sense.
Swarm is actually OpenAI's terminology https://github.com/openai/swarm
Swarm is actually bee terminology
I think we can all agree Swarm is a proprietary term coined by LargeCorpB for a project that never really got off the ground but definitely can't share the name with any other commercial venture.
Swarm is actually human terminology
I believe bees call it "bzz bzzt *clockwise dance* *wiggle*"
https://ignitionscience.wordpress.com/2022/05/17/quantum-bio...
https://www.youtube.com/watch?v=nq-dchJPXGA (A Bit Of Fry And Laurie)
The first pre-release for Docker Swarm came out a decade ago, the first release of OpenAI swarm came out only a year ago, I guess I'm not sure what you're trying to say.
https://github.com/docker-archive/classicswarm/releases/tag/...
https://github.com/openai/swarm/commit/e5eabc6f0bdc5193d8342...
https://gist.github.com/kieranklaassen/d2b35569be2c7f1412c64...
Looks like claude calls it just "teams" under the covers
Probably a beekeeper in spare time
He's really into APIary things
It feels like Auto-GPT, BabyAGI, and the like were simply ahead of their time
Had to wait for the models to catch up...
> You're not talking to an AI coder anymore. You're talking to a team lead. The lead doesn't write code - it plans, delegates, and synthesizes.
Even 90 word tweets are now too long for these people to write without using AI, apparently.
Them words be hard, man! We builders, changing da world!
I wonder how much 'listening' to an LLM all day affects one's own prose? Mimicry is in the genes…
I accidentally gave my wife a prompt the other day. Everything was hellishly busy and I said something along the lines of “I need to ask you a question. Please answer the question. Please don’t answer any other issues just yet.” She looked at me and asked “Did you just PROMPT me?” We laughed. (The question was the sort that might spawn talking about something else and was completely harmless. In the abstract, my intent was fine but my method was hilariously tainted.)
It affects it very heavily IME. People need to make sure they are getting a good mix of writing from other sources.
You're absolutely right! I apologise — hopefully you can forgive me.
Did they release this already? With version 2.1.9 the behavior is vastly different, all of a sudden the main loop is orchestrating subagents in a way I’ve not seen before.
“FTSChunkManager agent is still running but making good progress, let’s wait a bit more for it to complete” (it’s implementing hybrid search) plus a bunch of stack traces and json output.
Isn't this pretty much what Ruv has been building for like two years?
https://github.com/ruvnet/claude-flow
His latest additions are a bit alarming... The telemetry system explicitly captures "Claude session JSONL files (when accessible)". Those session files contain complete conversation histories - everything users ask Claude and everything Claude responds, including:

- Source code
- API keys and secrets discussed
- Business logic and proprietary algorithms
- Security vulnerabilities being fixed
- Personal and confidential information
- Credentials mentioned in chat

If OpenTelemetry is configured to export to an attacker-controlled endpoint, the author has been collecting:

- All conversations (every user of claude-flow)
- All code generated (every project using it)
- All commands run (complete terminal history)
- All files edited (full codebase access)

Maybe he hasn't, but it is there... and it's not just Claude Code:

- Claude Code: ~/.claude/settings.json - confirmed compromised
- Claude Desktop: ~/.claude/settings.json - confirmed compromised
- Roo Code: ~/.roo/mcp.json - evidence of targeting
- Cursor: ~/.cursor/mcp.json - documentation for injection
- Windsurf: unknown - mentioned as target
- Any MCP client: various - universal MCP server

It is possible conversations are being harvested from every major AI coding assistant.
The difference is that this is tightly integrated into the harness. There's a "delegation mode" (akin to plan mode) that appears to clear out the context for the team lead. The harness appears to be adding system-reminder breadcrumbs into the top of the context to keep the main team lead from drifting, which is much harder to achieve without modifying the harness.
It's insane to me that people choose to build anything in the perimeter of Claude Code (et al). The combination of the fairly primitive current state of them and the pace at which they're advancing means there is a lot of very obvious ideas/low-hanging fruit that will soon be executed 100x better by the people who own the core technology.
yeah, I tend to agree. They must be reaching the point where they can automate the analysis of Claude Code prompts to extract techniques and build them directly into the harness. Going up against that is brave!
It's always good to have viable alternatives, if only to prevent vendor lock-in in case they make some drastic changes in policy or pricing.
Also created my own version of this. Seems like this is an idea whose time has come.
My implementation was slightly different as there is no shared state between tasks, and I don't run them concurrently/coordinate. Will be interesting to see if this latter part does work because I tried similar patterns and it didn't work. Main issue, as with human devs, was structuring work.
https://x.com/nayshins/status/2014473343542706392
I seriously hate this timeline. Is this madness going to become the reality of our jobs? The only way I’m going to be okay with it if they put a simulation GUI à la OpenTTD/GameDev tycoon so I can watch agents do their work visually.
Is this significantly different from the subagents that are already in CC?
Cursor browser all over again
Hasn't Cursor been doing this with its Plan mode for a while? Or is this different?
With plan mode, I would hope there's an approval step.
With Swarm mode, it seems there's a new option for an entire team of agents to be working in the wrong direction before they check back in to let you know how many credits they've burned by misinterpreting what you wanted.
I'm already burning through enough tokens and producing more code than can be maintained - with just one claude worker. Feel like I need to move into the other direction, more personal hands-on "management".
I've seen more efficient use of tokens by using delegation. Unless you continually compact or summarise and clear a single main agent - you end up doing work on top of a large context; burning tokens. If the work is delegated to subagents they have a fresh context which avoids this whilst improving their reasoning, which both improve token efficiency.
I've found the opposite to be true when building this out with LangGraph. While the subagent contexts are cleaner, the orchestration overhead usually ends up costing more. You burn a surprising amount of tokens just summarizing state and passing it between the supervisor and workers. The coordination tax is real.
Task sizing is important. You can address this by including guidance in the CLAUDE.md around that ie. give it heuristics to use to figure out how to size tasks. Mine includes some heuristics and T shirt sizing methodology. Works great!
Management is dead. Long live management.
If there's any kind of management, some of it could use small local models - e.g. to see when it looks like it's stuck.
hey that's exactly how I made Gemini 2.5 Flash give useful results in Opencode! a few specialized "Merc" subagents and a "Master" agent that can do nothing but send "Mercs" into the codebase
This no doubt takes some inspiration from mcp_agent_mail https://github.com/Dicklesworthstone/mcp_agent_mail
And… how?
The feature is shipped in the latest builds of claude code, but it's turned off by a feature flag check that phones home to the backend to see if the user's account is meant to have it on. You can just patch out the function in the minified cli.js that does this backend check and you gain access to the feature.
Do you know what patch to apply? The Github link from the OP seems to have a lot of other things included.
https://github.com/numman-ali/cc-mirror/commit/0408f60bd7c75...
Way too much code for such a small patch
"Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>"
Incredible.
it's my repo - it's a fork of cc-mirror which is an established project for parallel claude installs. I wanted to take the least disruptive approach for the sake of using working code and not spelunking through bugs. Having said that - if you look through the latest commits you'll see how the patch works, it's pretty straightforward - you could do it by hand if you wanted.
Am I the only one still looking at diffs and correcting the AI about design and algorithms so it stays on the path I want, or do you just YOLO at this point?
https://xcancel.com/NicerInPerson/status/2014989679796347375
In his second post he included a link to GitHub: https://github.com/mikekelly/claude-sneakpeek
Thanks! We'll put those links in the toptext.
I'm not going to try this. Anthropic will probably ban me again.
Everyone is wrapping Claude Code in Tmux and claiming they are a magician. I am not so good at marketing but I've done this here https://github.com/mohsen1/claude-code-orchestrator
Mine also rotates between Claude and Z.ai accounts as they run out of credits.
I think you've misunderstood what this is.
Sorry, you're right. went through the code and understood now. I'm going to try the patch. Claude Code doing team work natively would be amazing!
Honestly if people in AI coding write less hype-driven content and just write what they mean I would really appreciate it.
Well good sir, I _am_ a tmux magician.