Ask HN: Do you have any evidence that agentic coding works?

I've been trying to get agentic coding to work, but the dissonance between what I'm seeing online and what I'm able to achieve is doing my head in.

Is there real evidence, beyond hype, that agentic coding produces net-positive results? If any of you have actually got it to work, could you share (in detail) how you did it?

By "getting it to work" I mean: * creating more value than technical debt, and * producing code that’s structurally sound enough for someone responsible for the architecture to sign off on.

Lately I’ve seen a push toward minimal or nonexistent code review, with the claim that we should move from “validating architecture” to “validating behavior.” In practice, this seems to mean: don’t look at the code; if tests and CI pass, ship it. I can’t see how this holds up long-term. My expectation is that you end up with "spaghetti" code that works on the happy path but accumulates subtle, hard-to-debug failures over time.

When I tried using Codex on my existing codebases, with or without guardrails, half of my time went into fixing the subtle mistakes it made or the duplication it introduced.

Last weekend I tried building an iOS app for pet feeding reminders from scratch. I instructed Codex to research and propose an architectural blueprint for SwiftUI first. Then, I worked with it to write a spec describing what should be implemented and how.

The first implementation pass was surprisingly good, although it had a number of bugs. Things went downhill fast, however. I spent the rest of my weekend getting Codex to make things work, fix bugs without introducing new ones, and research best practices instead of making stuff up. Although I made it record new guidelines and guardrails as I found them, things didn't improve. In the end I just gave up.

I personally can't accept shipping unreviewed code. It feels wrong. The product has to work, but the code must also be high-quality.

Bear in mind that there is a lot of money riding on LLMs leading to cost savings, and development (seen as expensive and a common bottleneck) is a huge opportunity. There are paid (micro) influencer campaigns going on and what not.

Also bear in mind that a lot of folks want to be seen as being on the bleeding edge, including famous people. They get money from people booking them for courses and consulting, buying their books, products and stuff. A "personal brand" can have a lot of value. They can't be seen as obsolete. They're likely to talk about what could or will be, more than about what currently is. Money isn't always the motive, for sure; people also want to be considered useful, and they genuinely want to play around and see where things are going.

All that said, I think your approach is fine. If you don't inspect what the agent is doing, you're down to faith. Is it the fastest way to get _something_ working? Probably not. Is it the best way to build an understanding of the capabilities and pitfalls? I'd say so.

This stuff is relatively new, I don't think anyone has truly figured out how to best approach LLM assisted development yet. A lot of folks are on it, usually not exactly following the scientific method. We'll get evidence eventually.

a day agofhd2

> There are paid (micro) influencer campaigns going on and what not.

Extremely important to keep in mind when you read about LLMs, agents and what not both here, on reddit and elsewhere.

Just the other day I got offered 200 USD if I posted about some new version of an "agentic coding platform" on HN, which is obviously too little for me to compromise my ethics and morals, but it makes it very clear how much of this must be going on, if I, some random user, get offered money to just post about their platform. If I'd been offered that 15-20 years ago when I was broke and cleaning hotels, I'd probably have taken them up on their offer.

a day agoembedding-shape

Why limit it to LLMs and agents?

This is true about everything you see advertised to you.

When I went to an open-source conference (or whatever it was called) in San Diego ~8 years ago, there were so many Kubernetes people. When you talked with them, it turned out nobody was actually using k8s in production; they were clearly devrel/paid people.

Now it seems to be everywhere... so be careful with what you ignore too

a day agoaprdm

Because they are pitching to replace 50% of white collar jobs...

Doesn't seem like it's that necessary to run marketing campaigns if that's the case... people would just do it if it were possible

21 hours agowhattheheckheck

Do you think that won't happen?

As a very silly example, my father-in-law has a school in Mexico, and he is now using ChatGPT to generate all of the visual materials that they used to pay someone to do in the past.

They also used to pay someone to take school pictures for the books to look professional, now they use AI to make it look good/professional.

My father-in-law has no background in technology, yet he uses ChatGPT daily to do professional work for his school. That's already two jobs gone.

People must be hiding under a rock if they don't think this will have big consequences to society

20 hours agoaprdm

> People must be hiding under a rock if they don't think this will have big consequences to society

People are not paying much mind to these freelance style jobs. It absolutely will have an impact on society.

17 hours agoOccamsMirror

How can we be sure you aren't paid by bear cartel to push this post?

a day agoblackoil

Bear cartel is a bunch of curmudgeons without tons of VC money. If they are being unfair, it's purely for love of the game.

a day agowoooooo

> which obviously is too little for me to compromise my ethics and morals

What would be enough to compromise your ethics and morals? I'm sure they can accommodate.

a day agobuckwheatmilk

Hah, after submitting my comment I actually thought about it, because I knew someone would eventually ask :)

I'm fortunate enough to live a very comfortable life after working myself to death, so I think for 20,000,000 USD I'd do it, happily so. 2,000,000 would be too little. So somewhere between those sits the real price to purchase my morals and ethics :)

a day agoembedding-shape

It wasn't a shot at you personally; the point was that AI companies are flush with money, desperate to show any kind of growth, and willing to spend money to do that. I'm sure they are finding people who have some social following and will happily pocket a couple of extra green bills to present AI products in a positive light with little to no actual proof.

a day agobuckwheatmilk

> It wasn't a shot at you personally

No shots fired, as far as I'm aware, so np :)

> they are finding people who have some social following and will happily pocket a couple of extra green bills to present AI products in a positive light with little to no actual proof.

No doubt about it. I don't think people realize how pervasive this really is, though; people still sometimes tell me they trusted something on HN/Reddit just because it was the most upvoted answer, or that they chose a product based on what was mentioned the most, etc.

a day agoembedding-shape

Best I can do is t'ree fiddy.

I can definitely be bought for much, much less, but only because I'm pretty sure I could rave about some AI platform while still being honest about it. Why do you draw such a hard ethical line? I agree with your sentiment: AI is currently a net negative on the world that I care about. But I also believe people when they say it helps them, despite their inability to articulate anything useful to me, or anything that I can understand.

a day agograyhatter

[flagged]

a day agojsksdkldld

Parent didn't mention Simon Willison, and neither I nor the parent appear to imply that _all_ people posting positively about LLMs are paid influencers; that'd be a ridiculous claim. It's just that there _are_ paid influencers, at every level, down to non-famous people getting a few bucks, and that's worth knowing.

Here's one thing I quickly found on one of Anthropic's campaigns on LinkedIn: https://www.favikon.com/blog/inside-anthropic-influencer-mar...

a day agofhd2

Stop what? Posting on HN? :| What does this have to do with me?

a day agoembedding-shape

> This stuff is relatively new, I don't think anyone has truly figured out how to best approach LLM assisted development yet.

Exactly. But as you say, there are so many people riding the hype wave that it is difficult to come to a sober discussion. LLMs are a new tool that is a quantum leap but they are not a silver bullet for fully autonomous development.

It can be a joy to work with LLMs if you have to write the umpteenth javascript CRUD boilerplate. And it can be deeply frustrating once projects are more complex.

Unfortunately I think benchmaxxing and lm-arena are currently pushing in the wrong direction. But trillions in VC money are at stake, and leaning back, digesting, reflecting and discussing things is not an option right now.

a day agodust42

Even for CRUD I'm finding it quite frustrating. The question is no longer whether AI can write the code you specify: it can.

It just writes terrible code I'd never want to maintain. Can I refactor and have it cleaned up by the AI as well? Sure... but then I need to specify exactly how it should go about it, and ugh, should I just be writing this code myself?

It really excels when there are existing conventions within the app it can use as examples.

21 hours agoAstroBen

> But as you say, there are so many people riding the hype wave that it is difficult to come to a sober discussion. LLMs are a new tool that is a quantum leap but they are not a silver bullet for fully autonomous development.

While I agree with the latter, I think the former point (that hype is making sober discussion impossible) is directionally incorrect. Like a lot of people I speak to privately, I'm making a lot of money directly from software largely written by LLMs (roadmaps compressed from 1-2 years down to months since Claude Code was released), but the company has never mentioned LLMs or AI in any marketing, client communications, or public releases. We're all very aware that we need to be able to retire before LLMs swamp or obsolete our niche, and we don't want to invite competition.

Outside of tech companies, I think this is extremely common.

> It can be a joy to work with LLMs if you have to write the umpteenth javascript CRUD boilerplate.

There is so much latent demand for slightly customised enterprise CRUD apps. An enormous swathe of corporate jobs are humans performing CRUD and task management. Even if LLMs top out here, the economic disruption from this alone is going to be immense.

a day agoclosewith

It is delusional to believe the current frontier models can only write CRUD apps.

I would think someone would have to only write CRUD apps themselves to believe this.

It doesn't matter anyway what a person "believes". If anything, I am having the opposite experience: conversing with people is becoming a bigger and bigger waste of time compared to just talking to Gemini. It is not Gemini that is hallucinating all kinds of nonsense versus the average person; it is the opposite.

a day agofatherwavelet

There is a critical failure in education: people are not being taught how to debate without degenerating into a dominance contest or an echo chamber of talking points. It's a real problem; people literally do not understand that a question is not an opportunity to show off or to dominate, but a request for an exchange of information.

And that problem is not just between people; this lack of communication skill carries over into people's internal self-conversations. Many a bully personality is that way because they bully and terrorize themselves.

It's no wonder that people struggle to use AI, given how poorly they communicate. So the cluster of nonsense that is all the shallow thinkers directing people down incorrect paths is completely understandable. They are learning by doing, which with any other technology would be fine, but with AI, learning how to use it by using it can seriously damage one's cognitive ability, as well as leave a junkyard of failed projects behind.

a day agobsenftner

I blame chats and online forums for banning anything off-topic and/or political/religious discussions.

Which led to:

- echo chambers
- forgetting how to debate and discuss things

You needed to go all out, guns blazing, to get seen before you vanished and the discussion got locked or deleted.

a day agoBombthecat

I’m not sure you read my comment. I didn’t claim LLMs have reached a ceiling – I’m very bullish on them.

The point I was making is about the baseline capability that even sceptics tend to concede: if LLMs were “only” good at CRUD and task automation (which I don’t think is their ceiling), that alone is already economically and socially transformative.

A huge share of white-collar work is effectively humans doing CRUD and coordination. Compressing or automating that layer will have second- and third-order effects on productivity, labour markets, economics, and politics globally for decades.

a day agoclosewith

> This stuff is relatively new, I don't think anyone has truly figured out how to best approach LLM assisted development yet. A lot of folks are on it, usually not exactly following the scientific method. We'll get evidence eventually.

I try to think about other truly revolutionary things.

Was there evidence that GUIs would dramatically increase productivity / accessibility at first? I guess probably not. But the first time you used one, you would understand its value on some kind of intuitive level.

Having the ability to start OpenCode, give it an issue, add a little extra context, and have the issue completed without writing a single line of code?

The confidence of being able to dive into an unknown codebase and becoming productive immediately?

It's obvious there's something to this even if we can't quantify it yet. The wildly optimistic takes end with developers completely eliminated, but the wildly pessimistic ones - if clear eyed - should still acknowledge that this is a massive leap in capabilities and our field is changed forever.

a day agoJeremyNT

> The confidence of being able to dive into an unknown codebase and becoming productive immediately?

I don't think there's any public evidence of this happening, except for the debacles with LLM-generated pull requests (which is evidence against, not for this happening).

I could be wrong, feel free to cite anything.

7 hours agotripzilch

> Having the ability to start OpenCode, give it an issue, add a little extra context, and have the issue completed without writing a single line of code?

Is this a good thing? I'm asking why you said it like this, I'm not asking you to defend anything. I'm genuinely curious about your rationale/reasoning/context for why you used those words specifically.

I ask because I wouldn't willingly phrase it like this. I enjoy writing code. The expression of the idea, while not even close to the value I assign to fixing the thing, still has meaning.

E.g. I would happily share code my friend wrote that fixed something, but I wouldn't take any pride in it. Is that difference irrelevant to you, or do you still feel that sense of significance when an LLM emits the code for you?

> should still acknowledge that this is a massive leap in capabilities and our field is changed forever.

Equally, I don't think I have to agree with this. Our field has likely changed, arguably for the worse if the default IDE now requires a monthly rent payment. But I have only found examples of AI generating boilerplate. If it's not able to copy the code from some other existing source, it's unable to emit anything functional. I wouldn't agree that's a massive leap. Boilerplate has always been the least significant portion of code, no?

a day agograyhatter

We are paid to solve business problems and make money.

People who enjoy writing code can still do so, just not in a business context if there's a more optimal way.

a day agoaprdm

> We are paid to solve business problems and make money.

> People who enjoy writing code can still do so, just not in a business context if there's a more optimal way

Do you mean optimal, or expedient?

I hate working with people whose idea of solving problems is punting them down the road for the next person to deal with. While I do see people do this kind of thing often, I refuse to be someone who claims credit for "fixing" some problem knowing I'm only creating a worse, or different, problem for the next guy. If you're working on problems that require collaboration, creating more problems for the next guy is unlikely to give you an optimal outcome, because soon no one will willingly work with you. It's possible to fix business problems and maintain your ethics; it just feels easier to abandon them.

a day agograyhatter

Cards on the table: this stuff saps the joy from something I loved doing, and turns me into a manager of robots.

I feel like it's narrowly really bad for me. I won't get rich and my field is becoming something far from what I signed up for. My skills long developed are being devalued by the second.

I hate that using these tools increases wealth inequality and concentrates power with massive corporations.

I wish it didn't exist. But it does. And these capabilities will be used to build software with far less labor.

Is that trade-off worth the negatives to society and the art of programming? Hard to say really. But I don't get to put this genie back in the bottle.

a day agoJeremyNT

> Cards on the table: this stuff saps the joy from something I loved doing, and turns me into a manager of robots.

Pick two non-trivial tasks where you feel you can make a half-reasonable estimate on the time it should take, then time yourself. I'd be willing to bet that you don't complete it significantly faster with AI. And if you're not faster using AI, maybe ignore it like I and many others. If you enjoy writing code, keep writing code, and ignore the people lying because they need to spread FUD so they can sell something.

> But I don't get to put this genie back in the bottle.

Sounds like you've already bought into the meme that AI is actually magical, and can do everything the hype train says. I'm unconvinced. Just because there's smoke coming from the bottle doesn't mean it's a genie. What's more likely, magic is real? Or someone's lying to sell something?

a day agograyhatter

> Sounds like you've already bought into the meme that AI is actually magical, and can do everything the hype train says. I'm unconvinced. Just because there's smoke coming from the bottle doesn't mean it's a genie. What's more likely, magic is real? Or someone's lying to sell something?

There are a lot of lies and BS out there in this moment, but it doesn't have to do everything the hype train says to have enough value that it will be adopted.

Over my (getting to be long) career, there's been one constant about software development: higher-level abstractions win out, because they either enable people to work faster, or they enable people who can't "grok" lower-level abstractions to do things they couldn't before.

The output I can get from these tools today exceeds what I could've ever gotten from a junior developer before their existence, and it will never be worse than it is right now.

2 hours agoJeremyNT

You're absolutely right!

Sorry, couldn't resist :P But I do, in fact, agree based on my anecdotal evidence and feeling. And I'm bullish that even if we _haven't_ cracked how to use LLMs in programming well, we will, in the form of quite different tools maybe.

Point is, I don't believe anyone is at the local maximum yet; models have changed too much over the last few years to really settle on something stable.

And I'm also willing to leave some doubt that my impression/feeling might be off. Measuring short term productivity is one thing. Measuring long term effects on systems is much harder. We had a few software crises in the past. That's not because people back then were idiots, they just followed what seemed to work. Just like we do today. The feedback loop for this stuff is _long_. Short term velocity gains are just one variable to watch.

Anyway, all my rambling aside, I absolutely agree that LLMs are both revolutionary and useful. I'm just careful to prematurely form a strong opinion on where/how exactly.

a day agofhd2

A principal engineer at Google posted on Twitter that Claude Code did in an hour what the team couldn’t do in a year.

Two days later, after people freaked out, context was added. The team had built multiple versions in that year, each with its own trade-offs. All that context was given to the AI, and it was able to produce a "toy" version. I can only assume it had similar trade-offs.

https://xcancel.com/rakyll/status/2007659740126761033#m

My experience has been similar to yours, and I think a lot of the hype is from people like this Google engineer who play into the hype and leave out the context. This sets expectations way out of line from reality and leads to frustration and disappointment.

2 days agoal_borland

That's because getting promoted requires thought leadership and fulfilling AI mandates. Hence the tweet from this PE at Google, another from one at Microsoft wanting to rewrite the entire C++ codebase in Rust, and a few other projects, also from MS, all about getting the right Markdown files, etc.

a day agoanother_twist

> A principal engineer at Google posted on Twitter that Claude Code did in an hour what the team couldn’t do in a year.

I’ll bring the tar if you bring the feathers.

That sounds hyperbolic, but how can someone say something so outrageously false?

a day agokeybored

As someone who worked at the company, I understood the meaning behind the tweet without the additional clarification. I think she assumed too much shared context when making the tweet.

a day agothornewolf

A principal engineer at Google made a public post on the World Wide Web and assumed some shared Google/Claude-context. Do you hear yourself?

a day agokeybored

Do you think people who work at Google are perfect?

an hour agoxboxnolifes

Working in a large-scale org gets you accustomed to general problems in decision making that aren't that obvious. I totally understood what she meant, and in my head I nodded with “yeah, that tracks”.

a day agotokioyoyo

Maybe it helps them sleep at night.

a day agokeybored

People make mistakes, it's not that deep. The right incentive to encourage is admitting mistakes, and understanding and forgiving when necessary, because you don't want to push people to hide mistakes out of shame. That only makes things worse.

Especially considering that forgetting the delta between your shared context and someone else's is extremely common, and about the least egregious mistake you can make when writing an untargeted promo post.

a day agograyhatter

My bad. I will be more mindful tomorrow when someone at a big tech company yet again makes a mistake in the same direction of AI hyping. Maybe with a later addendum. Like journalists that write about a Fatal Storm In Houston and you read down to the eighth paragraph and it turns out the fatalities were among pigeons.

> when writing an untargeted promo post.

lol.

a day agokeybored

> My bad. I will be more mindful tomorrow when someone at a big tech company yet again makes a mistake in the same direction of AI hyping.

Are you mad at them for playing the game, or mad that that's the game they have to play to advance at their company?

> Like journalists that write about a Fatal Storm In Houston and you read down to the eighth paragraph and it turns out the fatalities were among pigeons.

I don't know; I guess I hold people who post on twitter so they can self-promo, or who have attention because they work at $company, to a slightly different standard than I would hold a journalist writing a news article?

a day agograyhatter

> I don't know; I guess I hold people who post on twitter so they can self-promo, or who have attention because they work at $company, to a slightly different standard than I would hold a journalist writing a news article?

I know. One of them has a higher salary.

21 hours agokeybored

Who are you referring to here? If you follow the link, you will see that the Google engineer did not say that.

a day agoJimDabell

I am quoting the person that I responded to. Which linked to this: https://xcancel.com/rakyll/status/2007659740126761033#m

> I’m not joking and this isn’t funny. We have been trying to build distributed agent orchestrators at Google since last year. There are various options, not everyone is aligned... I gave Cloud Code a description of the problem, it generated what we built last year in an hour.

So I see one error. GP said “couldn’t do”. The engineer really said “matched”.

a day agokeybored

The key words in the quote are "not everyone is aligned". It's not about execution ability.

a day agowoooooo

[flagged]

2 days agodysleixc

From the very beginning everyone tells us “you are using the wrong model”. Fast forward a year: the free models have become as good as last year's premium models, the results are still bad, and you still hear the same message, “you are not using the latest model”… I just stopped bothering to try the new shiny model each month and simply re-evaluate the state of the art once a year, for my sanity. Or maybe my expectations are simply too high for these tools.

a day agococoto

Are you sure you haven't moved the goalposts? The context here is "agentic coding" i.e. it does it all, while in the past the context was, to me anyway, "you describe the code you want and it writes it and you check it's what you asked for". The latter does work on free models now.

a day agocoryrc

When one is not happy with LLM output, an agentic workflow rarely improves quality, even though it may improve functionality. Now, instead of you making sure the LLM is on track at each step, it goes down a rabbit hole, at which point it's impossible to review the work, let alone make it do things your way.

a day agolaserlight

This discussion is a request for positive examples to demonstrate any of the recent grandiose claims about AI-assisted development. Attempting to switch instead to attacking the credentials of posters only seems to supply evidence that there are no positive examples, only hype. It doesn't add to the conversation.

a day agoamoss

There are people spending 5k a month on tokens. If your work generates 7-8 figures per year, that's peanuts, and companies will happily pay that per engineer.

a day agoaprdm

> would call out the AI hype bubble

Which is what it is: a tool that needs thousands of dollars and years of time in learning fees, while being described as something that "replaces devs" in an instant. It is a tool, and when used sparingly by well-trained people, it works, to the extent that any large statistical text predictor would.

a day agoconsp

I’ve mostly used the 20-a-month Cursor plan and I’ve gotten to the point where I can code huge things with rarely any need to do anything manually.

a day agoDoesntMatter22

Yeah, that was bullshit (like most AI-related crap... lies, damn lies, statistics, AI benchmarks). Like saying my 5-year-old said words that would solve the Greenland issue in an hour. But the words were not put to the test, lol, just put on a screen, and everyone says woah!!! AI can't ship. That still needs humans.

2 days agohahahahhaah

That, uh, says a lot about Google, doesn't it?

a day agotw061023

Humans regularly design all of Uber, Google, YouTube, Twitter, WhatsApp, etc. in 45 minutes in system design interviews. So AI designing some toy version is meh.

2 days agosrcport56445

You're choosing to focus on specific hype posts (which were actually just misunderstandings of the original confusingly-worded Twitter post).

While ignoring the many, many cases of well-known and talented developers who give more context and say that agentic coding does give them a significant speedup (like Antirez (creator of Reddit), DHH (creator of RoR), Linus (creator of Linux), Steve Yegge, Simon Willison).

2 days agoedanm

Why not, in that case, provide an example to rebut and contribute, as opposed to knocking someone else's example, even if it was against the use of agentic coding?

2 days agoNoPicklez

Serious question - what kind of example would help at this point?

Here is a sample of (IMO) extremely talented and well-known developers who have expressed that agentic coding helps them: Antirez (creator of Reddit), DHH (creator of RoR), Linus (creator of Linux), Steve Yegge, Simon Willison. This is just randomly off the top of my head; you can find many more. None of them claim that agentic coding does a year's worth of work for them in an hour, of course.

In addition, pretty much every developer I know has used some form of GenAI or agentic coding over the last year, and they all say it gives them some form of speed up, most of them significant. The "AI doesn't help me" crowd is, as far as I can tell, an online-only phenomenon. In real life, everyone has used it to at least some degree and finds it very valuable.

2 days agoedanm

Those are some high profile (celebrity) developers.

I wonder if they have measured their results? I believe that the perceived speed-up of AI coding is often different from reality. The following paper backs this idea: https://arxiv.org/abs/2507.09089 . Can you provide data that contradicts this view, based on these (celebrity) developers or otherwise?

a day agotrashb

Almost off-topic, but got me curious: How can I measure this myself? Say I want to put concrete numbers to this, and actually measure, how should I approach it?

My naive approach would be to just implement it twice, once together with an LLM and once without, but that has obvious flaws, the most obvious being that the order in which you do them impacts the results too much.

So how would I actually go about and be able to provide data for this?

a day agoembedding-shape

> My naive approach would be to just implement it twice, once together with an LLM and once without, but that has obvious flaws, the most obvious being that the order in which you do them impacts the results too much.

You'd get a set of 10-15 projects, and a set of 10-15 developers. Then each developer would implement the solution with LLM assistance and without such assistance. You'd ensure that half the developers did LLM first, and the others traditional first.

You'd only be able to detect large statistical effects, but that would be a good start.

If it's just you then generate a list of potential projects and then flip a coin as to whether or not to use the LLM and record how long it takes along with a bunch of other metrics that make sense to you.

a day agodisgruntledphd2

The initial question was:

> wonder if they have measured their results?

Which seems to indicate that there would be a suitable way for a single individual to be able to measure this by themselves, which is why I asked.

What you're talking about is a study and beyond the scope of a single person, and also doesn't give me the information I'd need about myself.

> If it's just you then generate a list of potential projects and then flip a coin as to whether or not to use the LLM and record how long it takes along with a bunch of other metrics that make sense to you.

That sounds like I can just go by "yeah, feels like I'm faster", which I thought was exactly what the parent wanted to avoid...

a day agoembedding-shape

> That sounds like I can just go by "yeah, feels like I'm faster", which I thought was exactly what the parent wanted to avoid...

No it doesn't, but perhaps I assumed too much context. Like, you probably want to look up the Quantified Self movement, as they do lots of social-science-like research on themselves.

> Which seems to indicate that there would be a suitable way for a single individual to be able to measure this by themselves, which is why I asked.

I honestly think pick a metric you care about and then flip a coin to use an LLM or not is the best you're gonna get within the constraints.

a day agodisgruntledphd2

> Like, you probably want to look up the Quantified Self movement, as they do lots of social-science-like research on themselves.

I guess I was looking for something a bit more concrete, that one could apply themselves, which would answer the "if they have measured their results? [...] Can you provide data that contradicts this view" part of the parent's comment.

> then flip a coin to use an LLM or not is the best you're gonna get within the constraints.

Do you think trashb who made the initial question above would take the results of such evaluation and say "Yeah, that's good enough and answers my question"?

a day agoembedding-shape

> I guess I was looking for something a bit more concrete, that one could apply themselves, which would answer the "if they have measured their results? [...] Can you provide data that contradicts this view" part of the parent's comment.

This stuff is really, really hard. Social science is very difficult as there's a lot of variance in human ability/responses. Added to that is the variance surrounding setup and tool usage (claude code vs aider vs gemini vs codex etc).

Like, there's a good reason why social scientists try to use larger samples from a population, and get very nerdy with stratification et al. This stuff is difficult otherwise.

The gold standard (rather like the METR study) is multiple people with random assignment to tasks with a large enough sample of people/tasks that lots of the random variance gets averaged out.

On a 1 person sample level, it's almost impossible to get results as good as this. You can eliminate the person level variance (because it's just one person), but I think you'd need maybe 100 trials/tasks to get a good estimate.

Personally, that sounds really implausible, and even if you did accomplish this, I'd be sceptical of the results as one would expect a learning effect (getting better at both using LLM tools and side projects in general).

The simple answer here (to your original question) is no, you probably can't measure this yourself as you won't have enough data or enough controls around the collection of this data to make accurate estimates.

To get anywhere near a good estimate you'd need multiple developers and multiple tasks (and a set of people to rate the tasks such that the average difficulty remains constant).

Actually, I take that back. If you work somewhere with lots and lots of non-leetcode interview questions (take homes etc) you could probably do the study I suggested internally. If you were really interested in how this works for professional development, then you could randomise at the level of interviewee and track those that made it through and compare to output/reviews approx 1 year later.

But no, there's no quick and easy way to do this because the variance is way too high.

> Do you think trashb who made the initial question above would take the results of such evaluation and say "Yeah, that's good enough and answers my question"?

I actually think trashb would have been OK with my original study, but obviously that's just my opinion.

8 hours agodisgruntledphd2

To wrap this up, what I was trying to say is that the feeling of being faster may not align with reality. Even for people who have a good understanding of the matter, it may be difficult to estimate. So I would say be skeptical of claims like this and try to somehow quantify it in a way that matters for the tasks you do. This is something managers of software projects have been trying to tackle for a while now.

There is no exact measurement in this case, but you could get an idea by testing certain types of implementations: for example, whether you finish similar tasks on average 25% faster over a longer testing period with AI than without. Just the act of timing yourself doing tasks with or without AI may already give a crude indication of the difference.

You could also run a trial implementing coding tasks like LeetCode, however you will introduce some kind of bias due to having done them previously. And additionally, the tasks may not align with your daily activities.

A trial with multiple developers working on the same task pool with or without AI could lead to more substantial results, but you won't be able to do that by yourself.

7 hours agotrashb

So there seems to be a shared understanding of how difficult "measure your results" would be in this case, so could we also agree that asking someone:

> I wonder if they have measured their results? [...] Can you provide data that contradicts this view, based on these (celebrity) developers or otherwise?

isn't really fair? Because not even you or I really know how to do so in a fair and reasonable manner, unless we start to involve trials with multiple developers and so on.

3 hours agoembedding-shape

A lot of comments read like a knee-jerk reaction to the Twitter crowd claiming they vibe code apps making $1M in 2 weeks.

As a designer I'm having a lot of success vibe coding small use cases, like an alternative to lovable to prototype in my design system and share prototypes easily.

All the devs I work with use Cursor; one of them (frontend) told me most of the code is written by AI. In the real world, agentic coding is used massively.

a day agoAdrig

I think it is a mix of ego and fear: basically "I'm too smart to be replaced by a machine" and "what am I gonna do if I'm replaced?".

The second part is something I think a lot about now after playing around with Claude Code, OpenCode, Antigravity and extrapolating where this is all going.

a day agomargorczynski

I agree it's about the ego. As for the other part, I am also trying to project a few scenarios in my head.

Wild guess nr.1: a large majority of software jobs will be complemented (mostly replaced) by AI agents, reducing the need for as many people doing the same job.

Wild guess nr.2: demand for creating software will increase but the demand for software engineers creating that software will not follow the same multiplier.

Wild guess nr.3: we will have the smallest teams ever, with only a few people on board, perhaps leading to more companies being started than ever.

Wild guess nr.4: in the near future, the pool of software engineers as we know them today will be drastically downsized, and only the ones who can demonstrate they bring substantial value over using the AI models will remain relevant.

Wild guess nr.5: getting the job in software engineering will be harder than ever.

a day agomenaerus

Nit: s/Reddit/Redis/

Though it is fun to imagine using Reddit as a key-value store :)

a day agoakoboldfrying

That is hilarious.... and to prove the point of this whole comment thread, I created reddit-kv for us. It seems to work against a mock, I did not test it against Reddit itself as I think it violates ToS. My prompts are in the repo.

https://github.com/ConAcademy/reddit-kv/blob/main/README.md

a day agoneomantra

Typo-Driven Development!

21 hours agoakoboldfrying

Aaarg I was typing quickly and mistyped. :face-palm:

Thanks for the correction.

a day agoedanm

You haven't provided a sample either... But sure, let's dig in.

> Antirez

When I first read his recent article, https://antirez.com/news/158 (don't buy into the anti-AI hype), I found the whole thing uncompelling. But I gave it a second chance and re-read it. I'm going to have to resist going line by line, because I find some of it outright objectionable.

> Whatever you believe about what the Right Thing should be, you can't control it by refusing what is happening right now. Skipping AI is not going to help you or your career.

Setting aside the rhetorical/argumentative deficiencies, and the fact that this is just FUD (he next suggests that if you disagree, you should just keep trying it every few months, which suggests to me even he knows it's BS): he writes that in the context of the ethical and moral objections he raises. So he's suggesting that the best way to advance in your career is to ignore the social and ethical concerns and just get on board?

Gross.

Individual careers aside, I'm not impressed by the correctness of the code emitted by AI and committed by most AI users. I'm unconvinced that AI will improve the industry, and its reputation as a whole.

But the topic is supposed to be specific examples of code, so let's do that. He mentions adding UTF-8 support to his toy terminal input project -> https://github.com/antirez/linenoise/commit/c12b66d25508bd70... It's a very useful feature to add, without a doubt! His library is better than it was before. But parsing UTF-8 is something that's very easy to implement carelessly or incompletely, i.e. something that's very easy to trip over if you're not careful, and the implementation specifics are fairly described as a solved problem. It's been done so many times that, if you're willing to re-implement from another existing source, it wouldn't take very long to do this without AI. (And if you're not willing, why are you using AI? I'm ethically opposed to the laundered provenance of source material.) Then again, it would absolutely take more time to verify that the generated code is correct than it would to write it by hand: the thing everyone keeps telling me is that I have to ensure the AI hasn't made a mistake, so either I trust the vibes, or I'm still spending that time. Even Simon Willison agrees with me[1].

> Simon Willison

Is another one suggested, so he's perfect to go next. I would normally exclude someone who's clearly best known as an AI influencer, but he's without a doubt an engineer too, so fair game. Especially given he answered a similar question just recently: https://news.ycombinator.com/item?id=46582192 I've been searching for a counterpoint to my personal anti-AI hype, so I was eager to see what the experts are making... and it's all boilerplate. I don't mean to say there's nothing valuable or nothing useful there. Only that the vast majority of the code in these repos is boilerplate that has no use out of context. The real value is just a few lines of code, something that I believe would only take 30 minutes if you wrote the code without AI for the project you were already working on. It'd take a few hours to make any of this myself (assuming I'm even good enough to figure it out).

And I do admit, 10 minutes on BART vs 3-4 hours on a weekend is a very significant time delta. But also, I like writing code. So what was I really gonna do with that time? Make shareholder value go up, no doubt!

> Linus Torvalds

I can't find a single source where he's an advocate for AI. I've seen the commit, and while some of the GitHub comments are gold, I wasn't able to draw any meaningful conclusions from the commit in isolation. Especially not when, the last I read about it, he used it because he doesn't write Python code. So I don't know what conclusions I can pull from this commit, other than that AI can emit code. I knew that.

I don't have enough context to comment on the opinions of Steve Yegge or his AI-generated output. I simply don't know enough, and after a quick search nothing other than "AI influencer" jumped out at me.

And I try to care about who I give my time and attention to, and who I associate with, so this is the end of the list.

I contrast these examples with all the hype that's proven over and over to be a miscommunication if I'm being charitable, or an outright lie if I'm not. I also think it's important to consider the incentives leading to these "miscommunications" when evaluating how much good faith you assign them.

On top of that, there are the countless examples of AI confidently lying to me about something. Explaining my fundamental, concrete objection to being lied to would take another hour I shouldn't spend on an HN comment.

What specific examples of impressive things/projects/commits/code am I missing? What output makes all the downsides of AI a worthwhile trade-off?

> In addition, pretty much every developer I know has used some form of GenAI or agentic coding over the last year, and they all say it gives them some form of speed up

I remember reading something that when tested, they're not actually faster. Any source on this other than vibes?

[1]: https://simonwillison.net/2025/Dec/18/code-proven-to-work/

a day agograyhatter

Citation needed. Talk, especially in the 'agentic age', is cheap.

2 days agoNBJack

It really depends on what you mean by "it works". A retrospective of the last 6 months:

I've had great success coding infra (Terraform). It at least 10x'd the generation of easily verifiable and tedious-to-write code. The results were audited to death, as the client was highly regulated.

Professional feature dev is hit and miss for sure, although it's getting better and better. We're nowhere near full agentic coding. However, by reinvesting the speed gains from not writing boilerplate into devex and tests/security, I bring to life much better quality software, maintainable and a joy to work with.

I suddenly have the homelab of my dreams, all the ideas previously in the "too long to execute" category now get vibe coded while watching TV or doing other stuff.

As an old jaded engineer, everything code-related was getting a bit boring and repetitive (so many REST APIs). I guess you get the most value out of it when you know exactly what you want.

Most importantly though, and I've heard this from a few other seniors: I've found joy in making cool fun things with tech again. I like that new way of creating stuff at the speed of thought, and I guess for me that counts as "it works"

a day agoxsh6942

Same experience here.

On some tasks like build scripts, infra and CI stuff, I am getting a significant speedup. Maybe I am 2x faster on these tasks, when measured from start to PR.

I am working on an HPC project[1] that requires more careful architectural thinking. Trying to let the LLM do the whole task most often fails, or produces low-quality code (even with top models like Opus 4.5).

What works well, though, is "assisted" coding. I usually write the interface code (e.g. headers in C++) with some help from the agent, then let the LLM do the actual implementation of these functions/methods, and then I do the final adjustments. Writing a good AGENTS.md helps a lot. I might be 30% faster on these tasks.

It seems to match what I see from the PRs I am reviewing: we are getting these slightly more often than before.

---

[1] https://github.com/finos/opengris-scaler

a day agoraphaelj

> I guess you get the most value out of it when you know exactly what you want.

Oh yes. I have been amateur-developing for 35 years, and when I vibe code I let the basic, generic stuff happen and then tell the AI to refactor the way I want. It usually works.

I had the same "too boring to code" approach and AI was a revelation. It takes away the typing but, when used correctly, leaves room for the creative part. I love this.

a day agoBrandoElFollito

The OP question was about agentic utility specifically. I've also gotten great side-project utility from AI codegen, without having to marry my project to CC or give up on looking at code, by simply prompting whatever LLM when I need something.

Nothing wrong with CC, but I keep hearing about the same kinds of apps being built: home automation, side-project CRUD.

What I'm deeply skeptical of is the ability of agentic workflows to integrate with a team maintaining and shipping a critical offering. If you're using LLMs for one-off PRs, great, but then agentic seems like a band-aid for memory etc.

Meanwhile, if you're full CC/agentic, it seems like a team would get out of sync.

a day agospopejoy

> I suddenly have the homelab of my dreams, all the ideas previously in the "too long to execute" category now get vibe coded while watching TV or doing other stuff.

This is the true game changer.

I have a large-ish NAS that's not very well organised (I'm trying; it's a consolidated mess of different sources from two decades, but at least they're all in the same place now).

It was faster to ask Claude to write me a search database backend + frontend than to click through the directories and wait for the slow SMB shares to update to find that one file I knew was in there.

Now I have a Go backend that crawls my NAS every night, indexes files to a FTS5 sqlite database with minimal metadata (size + mimetype + mtime/ctime) and a simple web frontend I can use to query the database
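
For anyone curious, the crawler part is roughly this shape (a simplified sketch, not the actual code; the FTS5 schema and the mattn/go-sqlite3 driver, built with the sqlite_fts5 tag, are assumptions here):

    package main

    import (
        "database/sql"
        "io/fs"
        "log"
        "mime"
        "path/filepath"

        _ "github.com/mattn/go-sqlite3" // needs the sqlite_fts5 build tag
    )

    func main() {
        db, err := sql.Open("sqlite3", "index.db")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Only the path is searchable text; the rest is metadata for display.
        // (A real nightly run would also clear out or dedupe old rows; omitted here.)
        if _, err := db.Exec(`CREATE VIRTUAL TABLE IF NOT EXISTS files
            USING fts5(path, mimetype UNINDEXED, size UNINDEXED, mtime UNINDEXED)`); err != nil {
            log.Fatal(err)
        }

        err = filepath.WalkDir("/mnt/nas", func(path string, d fs.DirEntry, err error) error {
            if err != nil || d.IsDir() {
                return err
            }
            info, ierr := d.Info()
            if ierr != nil {
                return nil // skip entries we can't stat
            }
            _, ierr = db.Exec(`INSERT INTO files(path, mimetype, size, mtime) VALUES (?, ?, ?, ?)`,
                path, mime.TypeByExtension(filepath.Ext(path)), info.Size(), info.ModTime().Unix())
            return ierr
        })
        if err != nil {
            log.Fatal(err)
        }
    }

The web frontend (and the CLI tool) is then basically a `SELECT path FROM files WHERE files MATCH ?` away.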

...actually I kinda want a cli search tool that uses the same schema. Brb.

Done.

AI might be a bubble etc., but I'll still have that search tool (and two dozen other utilities) in 5 years, when a Claude monthly subscription costs 2000€ and includes the right to harvest your organs on non-payment.

a day agotheshrike79

Same here. You have to slice things small enough for the agent to execute effectively, but beyond that, it’s magic.

a day agodonw

I honestly find AI quite poor at writing good, well-thought-through tests, potentially because:

1. writing testable code is part of writing good tests

2. testing is actually poorly done in all the training data because humans are also bad at writing tests

3. tests should be more focused around business logic and describing the application than arbitrarily testing things in an uncanny valley of AI slop

a day agoandy_ppp

When vibe coding/engineering, I don't think of tests in the same way as when testing human-written code.

I use unit tests to "lock down" current behavior so an agent rummaging around feature F doesn't break features A and B and will get immediate feedback if that happens.

I'm not trying to match every edge case, but to focus more on end-to-end tests where input and output are locked golden files. "If this comes in, this exact thing must come out the other end" type of thing.
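
In Go that's mostly the stock testing package plus an -update flag, something like this (a sketch: RenderReport and the file names are made up, the pattern is the point):

    package report

    import (
        "bytes"
        "flag"
        "os"
        "testing"
    )

    // "go test ./... -update" rewrites the golden files when a change is intentional.
    var update = flag.Bool("update", false, "rewrite golden files")

    func TestRenderReport(t *testing.T) {
        input, err := os.ReadFile("testdata/report_input.json")
        if err != nil {
            t.Fatal(err)
        }

        got := RenderReport(input) // hypothetical function under test, returns []byte

        golden := "testdata/report_output.golden"
        if *update {
            if err := os.WriteFile(golden, got, 0o644); err != nil {
                t.Fatal(err)
            }
        }
        want, err := os.ReadFile(golden)
        if err != nil {
            t.Fatal(err)
        }
        if !bytes.Equal(got, want) {
            t.Errorf("output differs from %s; rerun with -update if the change is intended", golden)
        }
    }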

The AI can figure out what went wrong if the tests fail.

a day agotheshrike79

Yeah, I need to start accepting to some degree that the world has changed. In the past, when I wanted to understand a system I'd have read the tests, but with AI I can just ask Cursor to explain what the code is doing, and it's fairly good at explaining the functionality to me.

I'm not sure I feel truly comfortable yet with huge blocks of code that are not cleanly understood by humans but it's happening whether I like it or not.

a day agoandy_ppp

I think one fatal flaw is letting the agent build the app from scratch. I've had huge success with agents, but only on existing apps that were architected by humans and have established conventions and guardrails. Agents are really bad at architecture, but quite good at following suit.

Other things that seem to contribute to success with agents are:

- Static type systems (not tacked-on like Typescript)

- A test suite where the tests cover large swaths of code (i.e. not just unit testing individual functions; you want e2e-style tests, but not the flaky browser kind)

With all the above boxes ticked, I can get away with only doing "sampled" reviews. I.e. I don't review every single change, but I do review some of them. And if I find anything weird that I had missed from a previous change, I tell it to fix it and give the fix a full review. For architectural changes, I plan the change myself, start working on it, then tell the agent to finish.

2 days agoresonious

C# works great for agents, but it works due to established patterns, a strict compiler and strong typing, the "treat warnings as errors" compiler flag, and an .editorconfig with many rules and enforcement of them. You have to tell it to use async where possible, to do proper error handling and logging, to put XML comments above complex methods, and so on. It works really well once you've got it figured out. It also helps to give it separate but focused tasks, so I have a todo.txt file that it can read to keep track of tasks. Basically you have to be strict with it. I cannot imagine how people trust outputs for Python/JavaScript, as there is no strong typing or compiler involved, maybe just some linting rules that can save you. Maybe TypeScript with strict mode can work, but then you have to be a purist about it and watch it like a hawk, which will drain you fast. C# + Claude Code works really well.
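
For reference, the project-file side of that is just a handful of standard MSBuild properties, roughly this (the exact set is a matter of taste):

    <PropertyGroup>
      <Nullable>enable</Nullable>
      <TreatWarningsAsErrors>true</TreatWarningsAsErrors>
      <EnforceCodeStyleInBuild>true</EnforceCodeStyleInBuild>
      <GenerateDocumentationFile>true</GenerateDocumentationFile>
    </PropertyGroup>

with the actual style rules living in .editorconfig.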

a day agoBatteryMountain

Upvote.

That's my experience too. Agent coding works really well for existing codebases that are well-structured and organized. If your codebase is mostly spaghetti—without clear boundaries and no clear architecture in place—then agents won't be of much help. They'll also suffer working in those codebases and produce mediocre results.

Regarding building apps and systems from scratch with agents, I also find it more challenging. You can make it work, but you'll have to provide much more "spec" to the agent to get a good result (and "good" here is subjective). Agents excel at tasks with a narrower scope and clear objectives.

The best use case for coding agents is tasks that you'd be comfortable coding yourself, where you can write clear instructions about what you expect, and you can review the result (and even make minor adjustments if necessary before shipping it). This is where I see clear efficiency gains.

a day agotcgv

TypeScript is a great type system for agents to use. It's expressive, and the compiler is much faster than Rust's, so turnaround is much quicker.

I'm slowly accepting that Python's optional typing is a mistake with AI agents, especially with human coders too. It's too easy for a type to be wrong and if someone doesn't have typechecking turned on that mistake propagates.

a day agonl

> I'm slowly accepting that Python's optional typing is a mistake with AI agents

Don't make it optional, then. Use pyright or mypy in strict mode. Make it part of your lint task, have the agent run lint often, forbid it from using `type: ignore`, and review every `Any` and `cast` usage.

If you're using CI, make a type error cause the job to fail.

It's not the same as using a language with a proper type system (e.g. Rust), but it's a big step in the right direction.
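
Concretely, assuming a pyproject.toml, the strict setup is only a few lines (mypy shown; pyright's equivalent is the typeCheckingMode entry):

    [tool.mypy]
    strict = true
    warn_unused_ignores = true      # flag "type: ignore" comments that no longer do anything
    disallow_any_explicit = true    # force a review of every explicit Any

    [tool.pyright]
    typeCheckingMode = "strict"

In CI, run mypy (or pyright) as its own step so any type error fails the job.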

a day agomaleldil

Whenever I have an agent use TypeScript, they always cast things to `any` and circumvent the types wherever convenient. And sometimes they don't even compile it; they just run it through Bun or similar.

I know I can configure tools and claude.md to fix this stuff but it's a drag when I could just use a language that doesn't have these problems to begin with.

a day agoresonious

You should not be using Python types without a type checker in use to enforce them.

With a type checker on, types are fantastic for catching missed cases early.

a day agodavidfstr

Same for TypeScript: by default you still get `any`. The best case (for humans and LLMs) is a strict linter that will give you feedback on what is wrong. But then (and I've seen this a couple of times with non-experienced devs) you or the AI has to know about it: write a strict linter config and use it. As someone without much coding knowledge, you may not be familiar with it and thus never ask for it.

a day agoK0IN

> I'm slowly accepting that Python's optional typing is a mistake with AI agents, especially with human coders too. It's too easy for a type to be wrong and if someone doesn't have typechecking turned on that mistake propagates.

How would you end up using types but not have any type checking? What's the point of the types?

a day agoembedding-shape

I’m currently experimenting (alongside working as usual) with a reasonably non-trivial Rust project that will be designed, “project managed”[0], built, and tested by LLM agents (mostly Claude, via OpenCode), based on me providing high-level requirements and then prompting it to complete things, as well as course correcting (rule: I don’t edit the code, specifications, or tasks directly).

It’s too early to tell how it will work out but things are going better than I expected. It’s probably 20% built after a couple of days, in which I’ve mostly done other work, and it’s working for quite long periods without input from me.

When I do have to provide input, the prompt is often just “Continue working according to the project standards and rules”.

I have no idea if it’ll meet the requirements. I didn’t expect it to get this far, but a month or two ago I didn’t think the chances were high enough to even make it worth trying.

[0] I asked it to create additional documentation for project standards and rules to refer to only when needed (referenced from AGENTS.md). This included the git workflow, maintaining a set of specifications, and an overall ROADMAP.md, as well as TASKS.md (detailed next steps from the roadmap) and STATUS.md (status of each of the tasks).

a day agobarnabee

I've found Go to be the most efficient language to use with LLMs.

The language is "small": very few keywords, and it hasn't changed much in a decade. It also has a built-in testing system with well-known patterns for how to use it properly.

Along with robust linters I can be pretty confident LLMs can't mess up too badly.

They do tend to overcomplicate structures a bit and require a fresh context and "see if you can simplify this" or "make those three implement a generic interface" type of prompts to tear down some of the repetition and complexity - but again it's pretty easy with a simple language.

a day agotheshrike79

I've found good results with Clojure and Elixir despite them being dynamic and niche.

a day agomalloryerik

Not really production level or agentic, but I've been impressed with LLMs for Haskell.

I think that while these langs are "niche" they still have quality web resources and codebases available for training.

I worry about new languages though. I guess maybe model training with synthetic data will become a requirement?

a day agospopejoy

> I worry about new languages though. I guess maybe model training with synthetic data will become a requirement?

I read a (rather pessimistic) comment here yesterday claiming that the current generation of languages is most likely going to be the last, since the already existing corpus of code for training is going to trump any other possible feature the new language might introduce, and most of the code will be LLM generated anyways.

a day agodysoco

I've wondered to myself here and there if new languages wouldn't be specifically written for LLM agentic coding, and what that might look like.

9 hours agomalloryerik

I use Augment with Claude Opus 4.5 every day at my job. I barely ever write code by hand anymore. I don't blindly accept the code that it writes; I iterate with it. We review code at my work. I have absolutely found a lot of benefit from my tools.

I've implemented several medium-scale projects that I anticipate would have taken 1-2 weeks manually, and took a day or so using agentic tools.

A few very concrete advantages I've found:

* I can spin up several agents in parallel and cycle between them. Reviewing the output of one while the others crank away.

* It's greatly improved my ability in languages I'm not expert in. For example, I wrote a Chrome extension which I've maintained for a decade or so. I'm quite weak in Javascript. I pointed Antigravity at it and gave it a very open-ended prompt (basically, "improve this extension") and in about five minutes it vastly improved the quality of the extension (better UI, performance, removed dependencies). The improvements may have been easy for someone expert in JS, but I'm not.

Here's the approach I follow that works pretty well:

1. Tell the agent your spec, as clearly as possible. Tell the agent to analyze the code and make a plan based on your spec. Tell the agent to not make any changes without consulting you.

2. Iterate on the plan with the agent until you think it's a good idea.

3. Have the agent implement your plan step by step. Tell the agent to pause and get your input between each step.

4. Between each step, look at what the agent did and tell it to make any corrections or modifications to the plan you notice. (I find that it helps to remind them what the overall plan is because sometimes they forget...).

5. Once the code is completed (or even between each step), I like to run a code-cleanup subagent that maintains the logic but improves style (factors out magic constants, helper functions, etc.)

This works quite well for me. Since these are text-based interfaces, I find that clarity of prose makes a big difference. Being very careful and explicit about the spec you provide to the agent is crucial.

2 days agodefatigable

This. I use it for coding in a Rails app when I'm not a Ruby expert. I can read the code, but writing it is painful, and so having the LLM write the code is beneficial. It's definitely faster than if I was writing the code, and probably produces better code than I would write.

I've been a professional software developer for >30 years, and this is the biggest revolution I've seen in the industry. It is going to change everything we do. There will be winners and losers, and we will make a lot of mistakes, as usual, but I'm optimistic about the outcome.

a day agomarcus_holmes

Agreed. In the domains where I'm an expert, it's a nice productivity boost. In the domains where I'm not, it's transformative.

As a complete aside from the question of productivity, these coding tools have reawakened a love of programming in me. I've been coding for long enough that the nitty gritty of everyday programming just feels like a slog - decrypting compiler errors, fixing type checking issues, factoring out helper functions, whatever. With these tools, I get to think about code at a much higher level. I create designs and high level ideas and the AI does all the annoying detail work.

I'm sure there are other people for whom those tasks feel like an interesting and satisfying puzzle, but for me it's been very liberating to escape from them.

a day agodefatigable

> In the domains where I'm an expert, it's a nice productivity boost. In the domains where I'm not, it's transformative.

Is it possible that the code you are writing isn't good, but you don't know it because you're not an expert?

a day agoZitchDog

No, I'm quite confident that I'm very strong in these languages. Certainly not world-class but I write very good code and I know well-written code when I see it.

If you'd like some evidence, I literally just flipped a feature flag to change how we use queues to orchestrate workflows. The bulk of this new feature was introduced in a 1300-line PR, touching at least four different services, written in Golang and Python. It was very much AI agent driven using the flow I described. Enabling the feature worked the first time without a hiccup.

(To forestall the inevitable quibble, I am aware that very large PRs are against best practice and it's preferable to use smaller, stacked PRs. In this case for clarity purposes and atomicity of rollbacks I judged it preferable to use a single large PR.)

a day agodefatigable

> I've implemented several medium-scale projects that I anticipate would have taken 1-2 weeks manually

A 1-week project is a medium-scale project?! That's tiny, dude. A medium project for me is like 3 months of 12h days.

2 days agojesse__

You are welcome to use whatever definition of "small/medium/large" you like. Like you, 1-2 weeks is also far from the largest project I've worked on. I don't think that's particularly relevant to the point of my post.

The point that I'm trying to emphasize is that I've had success with it on projects of some scale, where you are implementing (e.g.) multiple related PRs in different services. I'm not just using it on very tightly scoped tasks like "implement this function".

a day agodefatigable

I mean, if it's working for you, great.

The observation I was trying to make is that at the scope of one week, there's very little you actually get done, and it's likely mostly mechanical work. Given that, I suppose I'm unsurprised LLMs are proving useful. Seems like that's the type of thing they're excelling at.

a day agojesse__

That's not my experience. I agree that a project of any real size takes quite a bit longer than a week. But it's composed of lots of, well, week or two long subprojects. And if the AI coding tool is condensing week long projects into a day, that's a huge benefit.

Concretely speaking (well, as concretely as I feel like being without piercing pseudonymity), at my last job I worked on a multi-year rewrite of one of our core services. Within that rewrite were a ton of much smaller projects that were a few weeks to a month long: refactor this algorithm, improve the load balancing, add a new sharding strategy, etc. An AI tool would definitely not have sped up the whole process. It's not going to, say, speed up figuring out and handling intra-team dependencies or figuring out product design. But speeding up those smaller coding subprojects would have been a huge benefit.

I'm not making any strong claims in my post. I don't have the experience of AI projects allowing me to one shot large projects. But OP asked if anyone has concrete experience with AI coding tools speeding up development, and the answer is yes, I do.

a day agodefatigable

Well a medium project for me takes 3 years, so obviously I am the best out of everyone /s

a day agodrewstiff

Points 1 and 2, i.e. creating a spec which is the source of truth (or spec-driven development), are key to getting anything production grade, in our experience.

a day agomonkeydust

Yes. This was the key thing I learned that let me set the agents loose on larger tasks. Before I started iterating on specs with them, I mostly had them doing very small scale, refactor-this-function style tasks.

The other advice I've read that I haven't yet internalized as much is to use an "adversarial" approach with the LLMs: i.e. give them a rigid framework that they have to code against. So, e.g., generate tests that the code has to work against, or sample output that the code has to perfectly match. My agents do write tests as part of their work, and I use them to verify correctness, but I haven't updated my flow to emphasize that the agents should start with those, and iterate on them before working on the main implementation.

a day agodefatigable

Great advice.

> Tell the agent your spec, as clearly as possible.

I have recently added a step before that when beginning a project with Claude Code: invoke the AskUserQuestionTool and have it ask me questions about what I want to do and what approaches I prefer. It helps to clarify my thinking, and the specs it then produces are much better than if I had written them myself.

I should note, though, that I am a pure vibe coder. I don't understand any programming language well enough to identify problems in code by looking at it. When I want to check whether working code produced by Claude might still contain bugs, I have Gemini and Codex check it as well. They always find problems, which I then ask Claude to fix.

None of what I produce this way is mission-critical or for commercial use. My current hobby project, still in progress, is a Japanese-English dictionary:

https://github.com/tkgally/je-dict-1

https://www.tkgje.jp/

2 days agotkgally

Great idea! That's actually the very next improvement I was planning on making to my coding flow: building a sub agent that is purely designed to study the codebase and create a structured implementation plan. Every large project I work on has the same basic initial steps (study the codebase, discuss the plan with me, etc) so it makes sense to formalize this in an agent I specialize for the purpose.

a day agodefatigable

Is it just me, or does every post starting with "Great Idea!" or "Great point!" or "You're so right!" or similar just sound like an LLM is posting?

Or is this a new human linguistic tic that is being caused by prolonged LLM usage?

Or is it just me?

a day agomarcus_holmes

:-) I feel you. Perhaps I should have ended my post with "Would you like me to construct a good prompt for your planning agent?" to really drive us into the uncanny valley?

(My writing style is very dry and to the point, you may have noticed. I looked at my post and thought, "Huh, I should try and emotionally engage with this poster, we seem like we're having a shared experience." And so I figured, heck, I'll throw in an enthusiastic interjection. When I was in college, my friends told me I had "bonsai emotions" and I suppose that still comes through in my writing style...)

a day agodefatigable

Excellent reply :) And yes, maybe that's it, that the LLM emotion feels forced so any forced emotion now feels like an LLM wrote it.

19 hours agomarcus_holmes

I wouldn't consider the proposed workflow agentic. When you review each step and give feedback after each step, it's simply development with LLMs.

a day agolaserlight

Interesting. What would make the workflow "agentic" in your mind? The AI implementing the task fully autonomously, never getting any human feedback?

To me "agentic" in this context essentially that the LLM has the ability to operate autonomously, so execute tools on my behalf, etc. So for example my coding agents will often run unit tests, run code generation tools, etc. I've even used my agents to fix issues with git pre-commit hooks, in which case they've operated in a loop, repeatedly trying to check in code and fixing errors they see in the output.

So in that sense they are theoretically capable of one-shot implementing any task I set them to, their quality is just not good enough yet to trust them to. But maybe you mean something different?

a day agodefatigable

IMHO, agentic workflow is the autonomous execution of a detailed plan. Back-and-forth between LLM and developer is fine in the planning stage. Then, the agent is supposed to overcome any difficulties or devise solutions to unplanned situations. Otherwise, Cursor had been able to develop in a tight loop of writing and running tests, followed by fixing bugs, before “agentic” became a buzzword.

Perhaps “agentic” initially referred to this simple loop, but the milestone was achieved so quickly that the meaning shifted. Regardless, I could be wrong.

a day agolaserlight

Yeah, I have no idea what the consensus definition of the term is, and I suppose I can't say for sure what OP meant. I haven't used Cursor. My understanding was that it exercises IDE functions but does not execute arbitrary shell commands, maybe I'm wrong. I've specifically had good experiences with the tools being able to run arbitrary commands (like the git debugging example I mentioned).

In my experience reading discussions like this, people seem to be saying that they don't believe that Claude Code and similar tools provide much of a productivity boost on relatively open ended domains (i.e. the AI is driving the writing of the code, not just assisting you in writing your own code faster). And that's certainly not my experience.

I agree with you that success with the initial milestone ("agent operates in a self-contained loop and can execute arbitrary commands") was achieved pretty quickly. But in my experience a lot of people don't believe this. :-)

a day agodefatigable

[dead]

2 days agoTechDebtDevin

[flagged]

2 days agosolaris2007

"Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes"

a day agodjmips

That's a very good point.

The OP is "quite weak at JavaScript" but their AI "vastly improved the quality of the extension." Like, my dude, how can you tell? Does the code look polished, it looks smart, the tests pass, or what?! How can you come forward and be the judge of something you're not an expert in?

I mean, at this point, I'm beginning to be skeptical about half the content posted online. Anybody can come up with any damn story and make it credible. Just the other day I found out about reddit engagement bots, and I've seen some in the wild myself.

I'm waiting for the internet bubble to burst already so we can all go back to our normal lives, where we've left it 20 years or so ago.

2 days agomolteanu

How can I tell? Yes, the code looks quite a bit more polished. I'm not expert enough in JS to, e.g., know the cleanest method to inspect and modify the DOM, but I can look at code that does and tell if the approach it's using is sensible or not. Surely you've had the experience of a domain where you can evaluate the quality of the end product, even if you can't create a high quality product on your own?

Concretely in this case, I'd implemented an approach that used jQuery listeners to listen for DOM updates. Antigravity rewrote it to an approach that avoided the jQuery dependency entirely, using native MutationObservers. The code is sensible. It's noticeably more performant than the approach I crafted by hand. Antigravity allowed me to easily add a number of new features to my extension that I would have found tricky to add by hand. The UI looks quite a bit nicer than before I used AI tools to update it. Would these enhancements have been hard for an expert in Chrome extensions to implement? Probably not. But I'm not that expert, and AI coding tools allowed me to do them.

That was not actually the main thrust of my post, it's just a nice side benefit I've experienced. In the main domain where I use coding tools, at work, I work in languages where I'm quite a bit more proficient (Golang/Python). There, the quality of code that the AI tools generate is not better than I write by hand. The initial revisions are generally worse. But they're quite a bit faster than I write by hand, and if I iterate with the coding tools I can get to implementations that are as good as I would write by hand, and a lot faster.

I understand the bias towards skepticism. I have no particular dog in this fight, it doesn't bother me if you don't use these tools. But OP asked for peoples' experiences so I thought I'd share.

2 days agodefatigable

JavaScript isn't the only programming language around. I'm not the strongest around with JS either but I can figure it out as necessary -- knowing C/C++/Java/whatever means you can still grok "this looks better than that" for most cases.

2 days agoachierius

Yep. I have plenty of experience in languages that use C-style syntax, enough to easily understand code written in other languages that occur nearby in the syntactical family tree. I'm not steeped in JS enough to know the weird gotchas of the type system, or know the standard library well, etc. But I can read the code fine.

If I'd asked an AI coding tool to write something up for me in Haskell, I would have no idea if it had done a good job.

a day agodefatigable

I don't think so. Imagine it was vice versa, someone saying they knew JS and were weak at C/C++/Java.

a day agoesailija

This doesn't sound right to me. If someone who were expert in JS looked at a relatively simple C++ program, I think they could reasonably well tell if the quality of the code were good or not. They wouldn't be able to, e.g., detect bugs from default value initialization, memory leaks, etc. But so long as the code didn't do any crazy templating stuff, they'd be able to analyze it at a rough "this algorithm seems sensible" level.

Analogously I'm quite proficient at C++, and I can easily look at a small JS program and tell if it's sensible. But if you give me even a simple React app I wouldn't be able to understand it without a lot of effort (I've had this experience...)

I agree with your broad point: C/C++/Java are certainly much more complex than JS and I would expect someone expert in them to have a much easier time picking up JS than the reverse. But given very high overlap in syntax between the four I think anyone who's proficient in one can grok the basics of the others.

a day agodefatigable

I've never had a job where writing Javascript has been the primary language (so far it's been C++/Java/Golang). The JS Chrome Extension is a fun side project. Using Augment in a work context, I'm primarily using it for Golang and Python code, languages where I'm pretty proficient but AI tools give me a decent efficiency boost.

I understand the emotional satisfaction of letting loose an easy snarky comment, of course, but you missed the mark I'm afraid.

2 days agodefatigable

[flagged]

a day agosolaris2007

> If you are any good with those four languages, you are leagues ahead of anyone who does Javascript full time.

That is a priggish statement, and comes across as ignorant.

I’ve been paid to program in many different languages over the years. Typescript is what I choose for most tasks these days. I haven’t noticed any real difference between my past C#, C++, C, Java, Ruby, etc programming peers and my current JavaScript ones.

a day agochristophilus

> That is a priggish statement

A cursory glance at the definition of "prig" shows that what I wrote there is categorically not. You should at least try to look up that word and if you look it up and still don't get it then what you have is a reading comprehension issue.

> Typescript is what I choose for most tasks these days.

So you're smart on this, at least. Cantrill said it really well, Typescript brought "fresh water" to Javascript.

> haven’t noticed any real difference between my past C#, C++, C, Java, Ruby, etc programming peers and my current JavaScript ones.

You might still be on their level. I see that you didn't mention Rust or at least GoLang. Given the totality of your responses, you're certainly not writing any safe C (not ever).

a day agosolaris2007

The only approach I've tried that seems to work reasonably well, and consistently, was the following:

Make a commit.

Give Claude a task that's not particularly open ended; the closer to pure "monkey work" boilerplate nonsense the task is, the better (which is also the sort of code I don't want to deal with myself).

Preferably it should be something that only touches a file or two in the codebase unless it is a trivial refactor (like changing the same method call all over the place)

Make sure it is set to planning mode and let it come up with a plan.

Review the plan.

Let it implement the plan.

If it works, great, move on to review. I've seen it one-shot some pretty annoying tasks like porting code from one platform to another.

If there are obvious mistakes (program doesn't build, tests don't pass, etc.) then a few more iterations usually fix the issue.

If there are subtle mistakes, make a branch and have it try again. If it fails, then this is beyond what it can do, abort the branch and solve the issue myself.

Review and clean up the code it wrote; it's usually a lot messier than it needs to be. This also allows me to take ownership of the code. I now know what it does and how it works.

I don't bother giving it guidelines or guardrails or anything of the sort, it can't follow them reliably. Even something as simple as "This project uses CMake, build it like this" was repeatedly ignored as it kept trying to invoke the makefile directly and in the wrong folder.

This doesn't save me all that much time since the review and cleanup can take long, but it serves as a great unblocker.

I also use it as a rubber duck that can talk back, and as a documentation source. It's pretty good for that.

This idea of having an army of agents all working together on the codebase is hilarious to me. Replace "agents" with "juniors I hired on fiverr with anterograde amnesia" and it's about how well it goes.

2 days agosirwhinesalot

+1 for the Rubber duck, and as an unblocker.

My personal use is very much one function at a time. I know what I need something to do, so I get it to write the function which I then piece together.

It can even come back with alternatives I may not have considered.

I might give it some context, but I'm mainly offloading a bunch of typing. I usually debug and fix its code myself rather than trying to get it to do better.

2 days agodwd

TBH I think the greatest benefit is on the documentation/analysis side. The "write the code" part is fine when it sits in the envelope of things that are 100% conventional boilerplate. Like, as a frontend to ffmpeg you can get a ton of value out of LLMs. As soon as things go open-ended and design-centric, brace yourself.

I get the sense that the application of armies of agents is actually a scaled-up Lisp curse - Gas Town's entire premise is coding wizardry, the emphasis on abstract goals and values, complete with cute, impenetrable naming schemes. There's some corollary with "programs are for humans to read and computers to incidentally execute" here. Ultimately the program has to be a person addressing another person, or nature, and as such it has to evolve within the whole.

2 days agocrq-yml

> I don't bother giving it guidelines or guardrails or anything of the sort

Where do you give these guardrails? In the chat or CLAUDE.md?

Basic information like how to build and test the project belongs in CLAUDE.md; it knows to re-check that now and then.

a day agotheshrike79

Yeah, CLAUDE.md. Sometimes it just ignores what was in there after the context window gets big enough (as it tends to with planning mode).

a day agosirwhinesalot

That's the way.

2 days agolaylower

I'll bite.

Here's my realtime Bluetooth heart rate monitor for linux, with text output and web interface.

   https://github.com/lowrescoder/BlueHeart
This was 100% written by Claude Code; my input was mostly limited to accepting Claude's suggestions, except for a couple of cases where I made suggestions to speed up development (skipping some tests I knew would work).

Particularly interesting because I didn't expect this to work, let alone without my writing any code. Note that I limited it to pure C with limited dependencies; the initial prompt was just to get text output ("Heart Rate 76bpm"), and when it got to that point I told Claude to add a web interface, followed by creating a realtime graph to show the interface in use.

Every file is claude generated. AMA.

edit: this was particularly interesting as it had to test against the HRM sensor I was wearing during development, and to cope with bluetooth devices appearing and disappearing all the time. It took about a day for the whole thing and cost around $25.

further edit: I am by no means an expert with Claude (haven't even got to making a claude.md file); the one real objective here was to get a working example of using dBus to talk to blueZ in C, something I've failed at (more than once) before.

a day agozh3

It's a good demonstration of how agents still don't get everything right even when you put things into Markdown documentation. You have to be really vigilant and verify everything from top to bottom if you want to control how things are implemented to that degree; otherwise the agent will still take shortcuts where it can.

In https://github.com/lowrescoder/BlueHeart/blob/68ab2387a0c44e... for example, it doesn't actually do SSE at all. Instead it queues up a complete HTTP response each time, returns once and then closes the stream, so it's basically a normal HTTP endpoint "labeled" as an SSE one. SSE is mentioned a bunch of times in the docs, and the files/types/functions are labeled as such, but that doesn't seem to be what's going on internally, from what I could understand. Happy to stand corrected though!
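
For reference, an SSE endpoint is one long-lived response that keeps pushing `data:` frames; something like this toy Python sketch (purely to illustrate the protocol shape, nothing to do with the C code in the repo):

    from http.server import BaseHTTPRequestHandler, HTTPServer
    import json, time

    class SSEHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            self.send_response(200)
            self.send_header("Content-Type", "text/event-stream")
            self.send_header("Cache-Control", "no-cache")
            self.end_headers()
            while True:  # one open connection, many frames; never "respond once and close"
                frame = f"data: {json.dumps({'bpm': 76})}\n\n"
                self.wfile.write(frame.encode())
                self.wfile.flush()
                time.sleep(1)

    HTTPServer(("127.0.0.1", 8000), SSEHandler).serve_forever()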

a day agoembedding-shape

Yes, I haven't even read most of the files, just threw it up there as an example for the OP (I too am tired of the lack of examples, so stepped up to the plate on this one).

It was a personal bit of development last weekend. I can see inconsistencies myself, some of which result from scope creep during development (starting with the idea of a text-only app and then grafting on the web side); it literally only started because I wanted a working example of Bluetooth and dBus in C, and the rest of it just joined the ride.

As for the SSE, no expert on that myself, however if you watch the messages in the browser console it appears to push updates with sporadic notes about using polling instead.

a day agozh3

> Yes, I haven't even read most of the files, just threw it up there as an example for the OP (I too am tired of the lack of examples, so stepped up to the plate on this one).

Right, kind of like an LLM skimming and missing the core points :)

OP didn't ask for "Anything you've vibe-coded" but explicitly asked for code written with LLMs that is high quality and structurally sound, and "creates more value than it creates technical debt". That's why I felt like reviewing the code in the first place, and why I gave the feedback.

I understand now that maybe my impromptu code review felt like it came out of nowhere, but I thought you were actually trying to give OP an accurate sample, so sorry about that :)

a day agoembedding-shape

NP, and the exact definition of vibe-coding is, I think, yet to be determined. This wasn't a yolo, it was read all the prompts and generally accept them. Overall I'd say the code and web page are at least of a quality I've seen in many commercial settings; the code itself looks reasonable and if I was to do anything to it for a real 'release', I'd update the documentation which has suffered due to the extensive scope creep during implementation.

a day agozh3

> the exact definition of vibe-coding is, I think, yet to be determined

Huh? No, that's been established since Karpathy coined the term; you don't review the code, only use the agent and don't care about how it was done, just about the results.

The actual interesting stuff is how to use LLMs together with a human, to build high quality code. More "augmenting the human intellect" rather than "autonomous robots building for you".

Overall I'd say if someone handed you a specification that named SSE specifically, you created files with SSE in the name, and the implementation talks about doing SSE, yet it doesn't actually do SSE in the end, it's pretty much on par with code in commercial settings, yeah :) But maybe our bar should be slightly above the ground at least? :)

a day agoembedding-shape

> Huh? No, that's been established since Karpathy coined the term; you don't review the code, only use the agent and don't care about how it was done, just about the results.

However, nowadays it is used as a synonym for everything that is somehow generated by an LLM. Regardless of whether it is a spec-driven, carefully reviewed and iterative piece of software or some yolo-style one-prompter with no idea how it was done.

a day agomaerch

Yes, by people who don't actually understand what they're talking about. That doesn't mean we need to fall to the lowest common denominator here on HN too.

Most people understand "hacking" differently than we do, but we've made that work; we can talk about hacking here without other HN users believing we're cracking passwords. Why not the same for other terms?

a day agoembedding-shape

Yes, an understanding of sockets and timing of interprocess communication & networking seems to be a weak point of current models.

a day agoclbrmbr

Have you reviewed the code? What were the problems with it? Where did it do things better than you'd expect of humans? Have you compared the effort of making changes to it to the effort required for similar, human-written software?

I don't think anyone says it's not possible to get the LLM to write code. The problem OP has with them is that the code they write starts out good but then quickly devolves when the LLMs get stuck in the weird ruts they tend to fall into.

a day agokqr

Far short of a proper review, however I have scanned the code. Bear in mind this was a purely personal project, never intended to see the light of day and initially just done to create a small but operable chunk of dbus/blueZ glue code for another project.

I have no doubt that a C developer with sufficient knowledge of dBus, Bluetooth, the HRM profile and Linux could have written the C code in a day. Adding the HTTP server would again be easy if the developer also had experience of that (n.b. there was a minor compiler error when I tried it on another system due to a slightly different version of libmicrohttpd). Adding the API would be straightforward (but tedious) and similarly the web page (the web page was a one-shot after Claude wrote the API, viz. "Create a web page to display a real time plot with history using the API").

So overall I'd answer that the human developers who could have pulled that off in a day are few and far between (and likely to cost a lot more than $25 plus a day of my time).

And do I think the code is good enough? Yes, more than good enough. I could take it and run with it, but because it ended up 100% AI-generated I feel a bit like leaving it as a monument to "pure AI".

After all, I never intended to release it - it was this thread that made me throw it up on GitHub as an example for the OP.

a day agozh3

Thank you for actually posting an example that people can look at; I think most other responders misunderstood the post as asking for more pointless anecdotes filled with superlatives and "trust me bro" sentiments.

a day ago59nadir

Pretty neat!

Is there a name for the UI style of the web server page? I've noticed several web apps have a similar style to that.

a day agobobmcnamara

I didn't ask for a style, it's just what Claude came up with by default. Here's what happened just now when I asked it:-

   What was the inspiration for the CSS styling of this web page?                                                                    

  ● Looking at the CSS styling in live_plot.html, the inspiration appears to be Tailwind CSS and modern minimalist design trends.
a day agozh3

Just as a heads up, LLMs don't actually understand why they do what they do. Asking them about it will make them reason about why it happened, but it's not the "motivation"; it's essentially guesses with no anchoring to reality.

Just thought I'd clarify as I've seen prompts like this and people thinking this is the actual motivation from the "inside the LLM" or whatever, which is a bit far away from the truth.

a day agoembedding-shape

Fair enough. I did ask for "inspiration" though, rather than "motivation", mainly because I recall a comment on here a few days ago that LLMs are carefully trained to never reveal where the training material came from. So the prompt was aimed at working around that.

a day agozh3

Yeah, inspiration, motivation, justification, etc. are synonyms in this case. The point I was trying to make was something like "LLMs don't know why they do what they do": asking them to provide it will make them come up with it on the spot afterwards, not actually share what the inspiration/motivation/justification was at the time the tokens were sampled.

a day agoembedding-shape

I have the same experience despite using Claude every day. As a funny anecdote:

Someone I know wrote the code and the unit tests for a new feature with an agent. The code was subtly wrong; fine, it happens. But worse, the 30 or so tests they added put 10 minutes on the test run time, and they all essentially amounted to `expect(true).to.be(true)` because the LLM had worked around the code not working in the tests.

2 days agoedude03

There was an article on HN last week (?) which described this exact behaviour in the newer models.

Older, less "capable", models would fail to accomplish a task. Newer models would cheat, and provide a worthless but apparently functional solution.

Hopefully someone with a larger context window than myself can recall the article in question.

2 days agomonooso

I think that article was basically wrong. They asked the agent not to provide any commentary, then gave an unsolvable task, and wanted the agent to state that the task was impossible. So they were basically testing which instructions the agent would refuse to follow.

Purely anecdotally, I've found agents have gotten much better at asking clarifying questions, stating that two requirements are incompatible and asking which one to change, and so on.

https://spectrum.ieee.org/ai-coding-degrades

2 days agoSatvikBeri

From my experience: TDD helps here - write (or have AI write) tests first, review them as the spec, then let it implement.
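
A tiny sketch of the shape of that workflow (hypothetical names, Python just for illustration): the test is written and reviewed as the spec before the agent is allowed to touch the implementation, and the agent is told to make it pass without editing the test file.

    # tests/test_discount.py -- reviewed by a human first; this *is* the spec.
    import pytest
    from pricing import apply_discount  # hypothetical module the agent will create

    def test_discount_is_capped_at_50_percent() -> None:
        assert apply_discount(price=100.0, percent=80) == 50.0

    def test_negative_percent_is_rejected() -> None:
        with pytest.raises(ValueError):
            apply_discount(price=100.0, percent=-5)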

But when I use Claude code, I also supervise it somewhat closely. I don't let it go wild, and if it starts to make changes to existing tests it better have a damn good reason or it gets the hose again.

The failure mode here is letting the AI manage both the implementation and the testing. May as well ask high schoolers to grade their own exams. Everyone got an A+, how surprising!

2 days agosReinwald

> TDD helps here - write (or have AI write) tests first, review them as the spec

I agree, although I think the problem usually comes in writing the spec in the first place. If you can write detailed enough specs, the agent will usually give you exactly what you asked for. If your spec is vague, it's hard to eyeball whether the tests or even the implementation of the tests matches what you're looking for.

2 days agoedude03

This happens with me every time I try to get claude to write tests. I've given up on it. Instead I will write the tests if I really care enough to have tests.

2 days agojermaustin1

> they all essentially amounted to `expect(true).to.be(true)` because the LLM had worked around the code not working in the tests

A very human solution

2 days agoantonvs

I wonder if Volkswagen would've blamed AI if they got caught with Dieselgate nowadays...

In PR-lese: "To improve quality and reduce costs, we used AI to program some test code. Unfortunately the test code the AI generated fell below our standards, and it was missed during QA.".

Then again they got their supplier Bosch to program the "defeat device" and lied to them that "Oh don't worry, it's just for testing, we won't deploy it to production". (The "device" (probably just an algorithm) detects whether the steering wheel was being moved or not as the throttle is pushed, and if not, it assumes the car was undergoing emissions testing, and it runs the engine in the environmentally friendlier mode).

2 days agonetsharc

I've been programming for 20 years, and I've always been under-estimating how long things will take (no, not pressured by anyone to give firm estimates, just talking about informally when prioritizing work order together).

The other day I gave an estimate to my co-worker and he said "but how long is it really going to take, because you always finish a lot quicker than you say, you say two weeks and then it takes two days".

The LLMs will just make me finish things a lot faster and my gut feel estimation for how long things will take still is not yet taking that into account.

(And before people talk about typing speed: No that isn't it at all. I've always been the fastest typer and fastest human developer among my close co-workers.)

Yes, I need to review the code and interact with the agent. But it's doing a lot better than a lot of developers I've worked with over the years, and if I don't like the style of the code, it takes very few words and the LLM will "get it" and improve it.

Some commenters are comparing the LLM to a junior. In some sense that is right in that the work relationship may be the same as towards a (blazingly fast) junior; but the communication style and knowledge area and how few words I can use to describe something feels more like talking to a senior.

(I think it may help that latest 10 years of my career a big part of my job was reviewing other people's code, delegating tasks, being the one who knew the code base best and helping others into it. So that means I'm used to delegating not just coding. Recently I switched jobs and am now coding alone with AI.)

a day agodagss

I see your point in that you can use advanced terms with the LLM, which makes it more like pair programming with a senior instead of a junior.

> "but how long is it really going to take, because you always finish a lot quicker than you say, you say two weeks and then it takes two days"

However, this statement just kinda makes your comment smell of r/thatHappened, since it is such a tremendous speed-up.

Therefore I am intrigued: what kind of problems are you working on? Do they require a lot of boilerplate code or a lot of manually adjusting settings?

a day agotrashb

I obviously don't know that my past two days of work would have taken two weeks in the alternative route, but it's my feeling for this particular work:

I'm implementing a drawing tool on top of maps for fire departments (see demo.syncmap.no -- it's only in Norwegian for now though, plan to launch in English and Show HN it in some months). Typescript, Svelte, Go, Postgres.

This week I have been making the drawing tools more powerful (not deployed publicly yet).

* Gesture recognition to turn wobbly lines into straight lines in some conditions

* Auto-fill closed shapes: Vector graphics graph algorithms to segment the graph and compute the right fill regions that feel natural in the UI (default SVG fill regions were not right, took some trial and error to find something that just feels natural enough)

* Splines to make smoother curves: fitting Catmull-Rom, converting those to other splines for the SVG representation, etc. (see the sketch after this list)

* Constraints when dragging graph nodes around that shapes don't intersect when I don't want them to etc
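
(Roughly what the spline conversion amounts to, as an illustrative sketch of mine rather than the project's code: a uniform Catmull-Rom segment rewritten as the cubic Bézier that SVG paths can represent directly.)

    Point = tuple[float, float]

    def catmull_rom_to_bezier(p0: Point, p1: Point, p2: Point, p3: Point) -> tuple[Point, Point, Point, Point]:
        # Uniform Catmull-Rom segment through p1..p2 (p0 and p3 are the neighbours);
        # returns start point, two control points and end point for an SVG "C" command.
        c1 = (p1[0] + (p2[0] - p0[0]) / 6.0, p1[1] + (p2[1] - p0[1]) / 6.0)
        c2 = (p2[0] - (p3[0] - p1[0]) / 6.0, p2[1] - (p3[1] - p1[1]) / 6.0)
        return p1, c1, c2, p2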

I haven't been working all that much with polygon graphics before, so the LLM is very helpful in a) explaining the concepts to me and b) providing robust implementations for whatever I need.

And I've had many dead ends that didn't feel natural in the UI that I could discard after trying them out in full, without losing a huge investment.

These are all things that are very algorithm and formula intensive and where I would have had to do a lot of reading and research to do things right myself. (I could deal with it, but it takes a lot of time to read up on it.)

I review to see that it "looks sensible", not every single addition and division in the spline interpolations, or every step of the graph segmentation algorithms used to compute fill regions. I review function signatures and overall architecture, not the small details (in frontend -- obviously the backend authorization code is reviewed line by line..)

a day agodagss

Fresh example:

I described a problem on the UI level. The LLM suggested the Ramer-Douglas-Peucker algorithm to solve it, which I had never heard of before. It implemented it. Works perfectly. It is 40 lines of code (of which I only really need to review the function signature, and note the fact that it's a recursive bisection algorithm). I would have spent a very long time trying to figure out what to do here otherwise, and the LLM handed me the solution.
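
For reference, the shape of the algorithm (my own rough sketch for illustration, not the code the LLM produced):

    import math

    Point = tuple[float, float]

    def point_line_distance(p: Point, a: Point, b: Point) -> float:
        # Perpendicular distance from p to the line through a and b.
        if a == b:
            return math.dist(p, a)
        (x, y), (x1, y1), (x2, y2) = p, a, b
        return abs((x2 - x1) * (y1 - y) - (x1 - x) * (y2 - y1)) / math.dist(a, b)

    def rdp(points: list[Point], epsilon: float) -> list[Point]:
        # Ramer-Douglas-Peucker: keep the endpoints, find the interior point farthest
        # from the chord; if it's within epsilon drop the middle, otherwise bisect and recurse.
        if len(points) < 3:
            return points
        dists = [point_line_distance(p, points[0], points[-1]) for p in points[1:-1]]
        i = max(range(len(dists)), key=dists.__getitem__) + 1
        if dists[i - 1] <= epsilon:
            return [points[0], points[-1]]
        return rdp(points[: i + 1], epsilon)[:-1] + rdp(points[i:], epsilon)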

a day agodagss

Yes this kind of work will be sped up a lot by AI since you are not familiar with the intricacies of the subject matter. Especially with well documented but complex formats it can assist (vector graphics are not necessarily intuitive). Additionally in my experience UI design is quite pattern and boilerplate heavy.

The suggesting of algorithms sounds good, I don't know how you got there but I would ask for several algorithms that fit the bill and narrow it down myself (the first suggestion isn't always optimal).

Thank you for taking the time to shine some light onto what you're doing as I can see how you get that kind of speedup from using AI in this scenario.

7 hours agotrashb

I was expecting that evidence part that OP asked for in the top comments.

a day agokeybored

[dead]

a day agonjhnjh

I used Claude Opus 4.5 inside Cursor to write RISC-V Vector/SIMD code. Specifically Depthwise Convolution and normal Convolution layers for a CNN.

I started out by letting it write a naive C version without intrinsics, and validated it against the PyTorch version.
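
The reference check was essentially of this shape (an illustrative Python sketch with made-up sizes, not the actual harness; depthwise convolution is just conv2d with groups equal to the channel count):

    import numpy as np
    import torch
    import torch.nn.functional as F

    def naive_depthwise(x: np.ndarray, w: np.ndarray) -> np.ndarray:
        # One kH x kW kernel per channel, each channel convolved independently.
        c, h, wd = x.shape
        _, kh, kw = w.shape
        out = np.zeros((c, h - kh + 1, wd - kw + 1), dtype=np.float32)
        for ch in range(c):
            for i in range(out.shape[1]):
                for j in range(out.shape[2]):
                    out[ch, i, j] = np.sum(x[ch, i:i + kh, j:j + kw] * w[ch])
        return out

    x = np.random.rand(8, 16, 16).astype(np.float32)
    w = np.random.rand(8, 3, 3).astype(np.float32)
    ref = F.conv2d(torch.from_numpy(x)[None], torch.from_numpy(w)[:, None], groups=8)
    assert np.allclose(naive_depthwise(x, w), ref[0].numpy(), atol=1e-4)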

Then I asked it (and two other models, Gemini 3.0 and GPT 5.1) to come up with some ideas on how to make it faster using SIMD vector instructions and write those down as markdown files.

Finally, I started the agent loop by giving Cursor those three markdown files, the naive C code and some more information on how to compile the code, and also an SSH command where it can upload the program and test it.

It then tested a few different variants, ran them on the target (RISC-V SBC, OrangePI RV2) to check whether they improved the runtime, and continued from there. It did this 10 times, until it arrived at the final version.

The final code is very readable, and faster than any other library or compiler that I have found so far. I think the clear guardrails (output has to match exactly the reference output from PyTorch, performance must be better than before) makes this work very well.

2 days agofotcorn

I am really surprised by this. While I know it can generate correct SIMD code, getting a performant version is non-trivial, especially for RVV, where the instruction choices and the underlying microarchitecture significantly impact the performance.

IIRC, depthwise is memory bound, so the bar might be lower. Perhaps you can try something with higher compute intensity, like a matrix multiply. I have observed that it trips up with the columnar accesses for SIMD.

2 days agosifar

I think the ability to actually run the code on the target helped a lot with understanding and optimizing for the specific micro architecture. Quite a few of the ideas turned out to not to be optimal and were discarded.

Also important to have a few test cases the agent can quickly check against, it will often generate wrong code, but if that is easily detectable the agent can fix it and continue quickly.

7 hours agofotcorn

can you share the code?

2 days agocamel-cdr

You fundamentally misunderstand AI assisted coding if you think it does the work for you, or that it gets it right, or that it can be trusted to complete a job.

It is an assistant not a team mate.

If you think that getting it wrong, or bugs, or misunderstandings, or lost code, or misdirections, are AI "failing", then yes you will fail to understand or see the value.

The point is that a good AI assisted developer steers through these things and has the skill to make great software from the chaotic ingredients that AI brings to the table.

And this is why articles like this one "just don't get it", because they are expecting the AI to do their job for them and holding it to the standards of a team mate. It does not work that way.

2 days agowewewedxfgdf

What is the actual value of using agentic LLMs (rather than just LLM-powered autocomplete in your IDE) if it requires this much supervision and handholding? When is it actually faster / more effective?

a day agoummonk

Why use a nailgun instead of a hammer, if the nailgun still requires supervision and handholding?

Example: Say I discover a problem in the SPA design that can be fixed by tuning some CSS.

Without LLM: Dig around the code until I find the right spot. If it's been some months since I was there this can easily cost five minutes.

With LLM: Explain what is wrong. Perhaps description is vague ("cancel button is too invisible, I need another solution") or specific ("1px more margin here please"). The LLM makes a best effort fix within 30 secs. The diff points to just the right location so you can fine tune it.

a day agodagss

The primary value is accrued by the AI labs. You pay hundreds or thousands of dollars a month to train their AI models. While you probably do increase your productivity saving time typing all the code, the feedback that you give the agent after it has produced mediocre or poor code is extremely valuable to the companies, because they train their reinforcement learning models with them. Now while you're happy you have such a great "assistant" that helps you type out code, you will at some point realize that your architectural/design skills really weren't all that special in the first place. All the models lacked to be good at that was sufficient data containing the correct rewards. Thankfully software engineers are some of the most naive people in the world, and they gave them that data by actually paying for it.

a day agoFiberBundle

That’s not what I meant. What I’m asking is whether there’s any evidence that the latest “techniques” (such as Ralph) can actually lead to high quality results both in terms of code and end product, and if so, how.

2 days agoterabytest

I used Ralph recently, in Claude Code. We had a complex SQL script that crunched large amounts of data and was slow to run even on tables that are normalized, have indexes on the right columns, etc. We, the humans, spent a significant amount of time tweaking it. We were able to get some performance gains, but eventually hit a wall. That is when I let Ralph take a stab at it. I told it to create a baseline benchmark, and I gave it the expected output. I told it to keep iterating on the script until there was at least a 3x improvement in performance while the output stayed identical. I set the iteration limit to 50. I let it loose and went to dinner. When I came back, it had found a way to get 3x performance and had stopped on the 20th iteration.
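
The guardrail was essentially a script of this shape (an illustrative sketch with made-up file names, not our actual harness; connection details are assumed to come from the usual PG* environment variables):

    import subprocess
    import time

    def run_script(path: str) -> tuple[str, float]:
        # Run the SQL file via psql, capture its output and how long it took.
        start = time.monotonic()
        out = subprocess.run(
            ["psql", "--no-align", "--tuples-only", "-f", path],
            capture_output=True, text=True, check=True,
        ).stdout
        return out, time.monotonic() - start

    expected, baseline_secs = run_script("baseline.sql")
    candidate, candidate_secs = run_script("candidate.sql")

    assert candidate == expected, "output changed"
    assert candidate_secs * 3 <= baseline_secs, f"only {baseline_secs / candidate_secs:.1f}x faster"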

Is there another human that could get me even better performance given the same parameters? Probably yes. In the same amount of time? Maybe, but unlikely. In any case, we don't have anybody on our team who can think of 20 different ways to improve a large and complex SQL script and try them all in a short amount of time.

These tools do require two things before you can expect good results:

1. An open mind.

2. Experience. Lots of it.

BTW, I never trust the code an AI agent spits out. I get other AI agents, different LLMs, to review all work, create deterministic tests that must be run and must pass before the PR is ever generated. I used to do a lot of this manually. But now I create Claude skills that automate a lot of this away.

a day agocheema33

I don't understand what kind of evidence you expect to receive.

There are plenty of examples from talented individuals, like Antirez or Simonw, and an ocean of examples from random individuals online.

I can say to you that some tasks that would take me a day to complete are done in 2h of agentic coding and 1h of code review, with the additional benefit that during the 2h of agentic coding I can do something else. Is this the kind of evidence you are looking for?

a day agogbalduzzi

"You're holding it wrong"

a day agoxg15

[dead]

2 days agoTechDebtDevin

"I bought a subscription to Claude and it didn't write a perfectly coded application for me while I watched a game of baseball. AI sucks!"

2 days agowewewedxfgdf

Given the claims that AI is replacing jobs left and right, that there’s no more need for software developers or computer science education, then it had jolly well better be able to code a perfect application while I watch baseball.

2 days agotjr

As long as it makes senior engineers working alone as fast as they'd be in a team with 3 juniors, it can lead to replacing jobs even without producing code that doesn't need review.

2 days agodagss

I used the Junie AI coding agent by JetBrains with Claude and ChatGPT engines to create a utility web page and service to track PRs by devs across multiple repos and tied to our ticketing system.

I did it as an experiment with my constraint being that I refused to edit code, but I did review the code it made and made it make fixes.

I didn’t do it as a one shot. Roughly, I:

* I sketched out a layout on paper and photographed it (very rough)

* I made a list of requirements and had the AI review and augment them

* I asked ChatGPT outside of the IDE to come up with an architecture and guidelines I could give to the agent

* I presented all of that info to the AI as project guidelines and requirements

* I then created individual tasks and had it complete them one by one: create a UI with stubbed API calls and fake data, create the service that talks to AzureDevOps and test it, create my Node service, hook it all up, add features and fix bugs.

Result, fairly clean code, very attractive and responsive UI, all requirements met.

My other developers loved it and immediately started asking for new features. Each new feature was another agentic task, completed over 1-3 iterations.

So it wasn’t push button automatic, but I wrote 0% of it (code wise) and probably invested 6-8 total hours. My web dev skills are rusty, so I think the same thing would have taken 4-5 days and would not have looked as nice.

3 hours agocmpalmer52

Learning how to drive the models is a legit skill - and I don't mean "prompt engineering". There are absolutely techniques that help and because things are moving fast there is little established practice to draw from. But it's also been interesting seeing experienced coders struggle - I've found my time as a manager has been more help to me than my time as a coder. How to keep people on task and focused etc is very similar to managing humans. I suspect much of the next 5 years will be people rediscovering existing human and project management techniques and rebranding them as AI something.

Some techniques I've found useful recently:

- If the agent struggled on something, once it's done I'll ask it "you were struggling here, think about what happened and whether there is anything you learned. Put this into a learnings document and reference it in agents.md so we don't get stuck next time"

- Plans are a must. Chat to the agent back and forth to build up a common understanding of the problem you want solved. Make sure to say "ask me any follow-up questions you think are necessary". This chat is often the longest part of the project, so don't skimp on it. You are building the requirements, and if you've ever done any dev work you understand how important good requirements are to the success of the work. Then ask the model to write up the plan into an implementation document with steps. Review this thoroughly. Then use a new agent to start work on it: "Implement steps 1-2 of this doc". Having the work broken down into steps lets you do the work in more pieces (new context windows). This part is the more mindless part and where you get to catch up on reading HN :)

- The GitHub Copilot chat agent is great. I don't get the TUI folks at all. The Pro+ plan is a reasonable price and you can do a lot with it (Sonnet, Codex, etc. are all available). Being able to see the diffs as it works is helpful (but not necessary) for catching problems earlier.

2 days agoeverfrustrated

+1 for generating plans and then clearing context. I typically have a skill and an agent. I use the skill to generate an initial plan for an atomic unit of work, clear context and then use the agent to review said plan. Finally clear context and use the skill to implement the plan phase by phase, ensuring to review each phase for consistency with the next phase and the overall plan. I've had moderate success with this.

2 days agomarwamc

Another important thing to do is to instruct the agent to keep a <plan-name>-NOTES.md file where it tracks its progress and keeps implementation notes. The notes are usually short with Opus 4.5 but very helpful, especially when you need to reset mid-phase and restart it with a fresh context.

If you keep the notes around in repo, you can instruct future plan writers to review implementation notes from relevant plans to keep continuity.

2 days agothrowup238

I think I have evidence it works for me in that a bunch of unfinished projects suddenly finished themselves and work for me in the way I want them to. So whatever delta there was between my ideas and my execution, it has been closed for me.

If I'm being honest, the people who get utility out of this tool don't need any tutorials. The smattering of ideas that people mention is sufficient. The people who don't get utility out of this tool are insistent that it is useless, which isn't particularly inspiring to the kind of person who would write a good tutorial.

Consequently, you're probably going to have to pay someone if you want a handholding. And at the end you might believe it wasn't worth it.

a day agoarjie

So this is like Jesus.

a day agokeybored

One thing to call out: in my experience, coding agents are bad at SwiftUI. Compared to doing JS frontends or any Python backend work, the difference is obvious.

I just don't think there's enough Swift in the LLMs' corpus, and the "right way" to do things in Swift has changed a few times in the last few years, which I imagine compounds the sparse signal.

For SwiftUI work I'd personally recommend using Opus 4.5 with Axiom. Anytime you're designing something, refer to Axiom; Claude needs its skills and agents to steer designs.

3 hours agogeooff_

I have tried full-on agentic coding twice in the last month.

1) I needed a tool to consolidate *.dylib on macOS into the app bundle. I wanted this tool to be in JS because of some additional minor logic which would be a hassle to implement in pure bash.

2) I needed a simple C wrapper to parallelize /usr/bin/codesign over cores: split the list of binaries into batches and run X parallel codesigns, one per batch (see the sketch below).

Arguably, both tools are junior-level tasks.
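
(For scale, the second tool boils down to roughly this; a Python sketch of mine purely for illustration, with a placeholder signing identity, not the code the agent produced and not the C version linked below.)

    import subprocess
    import sys
    from concurrent.futures import ThreadPoolExecutor

    def chunks(items: list[str], n: int) -> list[list[str]]:
        # Round-robin the binaries into at most n batches.
        batches = [items[i::n] for i in range(n)]
        return [b for b in batches if b]

    def sign_batch(batch: list[str]) -> int:
        # "IDENTITY" is a placeholder for the real signing identity.
        return subprocess.run(
            ["/usr/bin/codesign", "--force", "--sign", "IDENTITY", *batch]
        ).returncode

    def main(paths: list[str], workers: int = 8) -> int:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(sign_batch, chunks(paths, workers)))
        return 0 if all(code == 0 for code in results) else 1

    if __name__ == "__main__":
        sys.exit(main(sys.argv[1:]))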

I have used Claude Code and Opus 4.5. I have used AskUserTool to interview me and create a detailed SPEC.md. I manually reviewed and edited the final spec. I then used the same model to create the tool according to that very detailed spec.

The first tool, the dylib consolidation one, was broken horrendously. It recursed into subdirs where no folder structure was expected or needed and did not recurse into folders where it was needed. It created a lot of in-memory structures which were never read. Unused parameters in functions. Unused functions. Incredible, illogical code that is impossible to understand. Quirks, "clever code". Variable scope all over the place. It appeared to work, but only in one single case on my dev workstation, and failed on almost every requirement in the spec. I ended up rewriting it from scratch, because the only things worth saving from this generated code were one-liners for string parsing.

The second tool did not even work. You know this quirk of AI models that once they find a wrong solution they keep coming back at it, because the context was poisoned? So, this. Completely random code, not even close. I rewrote the thing from scratch [1].

Curiously, the second tool took way more time and tokens to create despite being much simpler.

So yeah. We're definitely at most 6 month away from replacing programmers with AI.

[1] https://github.com/egorFiNE/codesign-parallel

a day agoegorfine

After you created a spec did you ask the Claude to break down the spec into epics and tasks?

I've found that helps a lot.

a day agorahimnathwani

I believe that's something it does automatically now.

In the past it did help a lot, but now it's the default behavior.

a day agoegorfine

I've seen it create plans of 6-8 steps. I have not seen it automatically decide to break down a spec into 30+ tasks.

a day agorahimnathwani

Ah, this.

Of course, but if your single request to Claude Code requires 30+ tasks then it's certainly way too big for an LLM to code from scratch.

a day agoegorfine

This is false.

I've had Claude Code successfully work on a spec that requires 30+ tasks, with my only input being 'great, keep going'.

a day agorahimnathwani

Excellent!

If LLMs are capable of delivering a solution that spans that many tasks completely automatically, then surely in 6 months from now programmers will be out of a job.

/s

a day agoegorfine

I have three uses of agentic coding at this time. All save me time.

1) low risk code

Let's say that we're building an MVP for something, and at this moment we just wanna get something working to get some initial feedback. So for example, the front-end code is not going to stick around; we just want something there to provide the functionality and a feel, but it doesn't have to be perfect. AI is awesome at creating that kind of front-end code that will just live for a short time before it's probably all thrown out.

2) fast iterations and experimentation

In the past, if you had to build something and you were thinking "maybe I can try this thing", you were gonna spend hours or days getting it up and working just to find out if it's even a good idea in the first place. With AI, I find that I can ask it to quickly get a working feature up, realize "no, this is not the best way to do it", remove everything and start over. I could not do that in the past with limited time to spend, repeating the same thing over and over again with different libraries or different solutions. But with AI, I can. And then, when you have something that you like, you can go back and do it correctly.

3) typing for me.

And lastly, even when I write my own code, I don't really type it. But I don't use the AI to say "hey, build me a to-do app"; instead I use it to just give me the building blocks, more like a very advanced snippet tool. So I might say "Can you give me a gen server that takes in this and that and returns this and that?" And then of course I review the result.

a day agovictorbjorklund

#2 is a big thing

I have an actual work service that uses a specific rule engine, which has some performance issues.

I could just go to Codex Web and say "try library A and library B as replacements for library X, benchmark all three solutions and give me a summary markdown file of the results"

Then I closed the browser tab and came back later, next day I think, and checked out the results.

That would've been a full day's work from me, maybe a bit more, that was now compressed to 5 minutes of active work.
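The harness it wrote was essentially this shape (my own rough Python sketch of the idea; the library functions and workload are placeholders, not the real rule engine):

    import time
    import statistics

    def evaluate_rules_x(payload): ...  # current library (placeholder)
    def evaluate_rules_a(payload): ...  # candidate A (placeholder)
    def evaluate_rules_b(payload): ...  # candidate B (placeholder)

    CANDIDATES = {
        "library X (current)": evaluate_rules_x,
        "library A": evaluate_rules_a,
        "library B": evaluate_rules_b,
    }

    def bench(fn, payload, runs=50):
        # Median of repeated runs to smooth out noise.
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            fn(payload)
            timings.append(time.perf_counter() - start)
        return statistics.median(timings)

    def write_summary(payload, path="benchmark-summary.md"):
        # The agent's deliverable: a small markdown table of results.
        lines = ["| implementation | median seconds |", "|---|---|"]
        for name, fn in CANDIDATES.items():
            lines.append(f"| {name} | {bench(fn, payload):.4f} |")
        with open(path, "w") as f:
            f.write("\n".join(lines) + "\n")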

a day agotheshrike79

This is a pretty good summary how it works for me, too. My main use case being the "advanced autocomplete" or what you call "typing for me".

But to answer the OP's question: I am in the same boat as you. I think the use cases are very limited and the productivity gains are often significantly overestimated by engineers who are hyping it up.

a day agoshafyy

Hang in there. Yes it is possible; I do it every day. I also do iOS and my current setup is: Cursor + Claude Opus 4.5.

You still need to think about how you would solve the problem as an engineer and break down the task into a right-sized chunk of work. i.e. If 4 things need to change, start with the most fundamental change which has no other dependencies.

Also it is important to manage the context window. For a new task, start a new "chat" (new agent). Stay on topic. You'll be limited to about five back-and-forths before performance starts to suffer. (Cursor shows a visual indicator of this in the form of the circle/wheel icon.)

For larger tasks, tap the Plan button first, and guide it to the correct architecture you are looking for. Then hit build. Review what it did. If a section of code isn't high-quality, tell Claude how to change it. If it fails, then reject the change.

It's a tool that can make you 2 - 10x more productive if you learn to use it well.

2 days agocwoolfe

I've been running an autonomous agent on a single codebase for about 5 weeks now, and my experience matches yours in some ways but diverges in others.

Where it matches:

- First passes are often decent; long sessions degrade

- The "fixing fixes" spiral is real

- Quality requires constant human oversight

Where it diverges (what worked for me):

1. Single-project specialization beats generalist use. The agent works on ONE codebase with accumulated context (a CLAUDE.md file, handoff notes, memory system). It's not trying to learn your codebase fresh each session - it reads what previous sessions wrote. This changes the dynamic significantly.

2. Structured handoffs over raw context injection. Instead of feeding thousands of lines of history, I have the agent write structured state to files at session end. Next session reads those files and "recognizes" the project state rather than trying to "remember" it. Much more reliable than context bloat. (See the sketch after this list.)

3. Autonomous runs work better for specific task types. Mechanical refactors, test generation, documentation, infrastructure scripts - these work well. Novel feature design still needs human involvement.

4. Code review is non-negotiable. I agree completely - unreviewed AI code is technical debt waiting to happen. The agent commits frequently in small chunks specifically so diffs are reviewable.
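To make #2 concrete, the handoff is just structured fields written at session end and read at session start. Something like this sketch (the field names are illustrative, not my exact format):

    import json
    from datetime import datetime, timezone
    from pathlib import Path

    HANDOFF = Path("handoff.json")

    def write_handoff(completed, in_progress, decisions, next_steps):
        # Abstracted state, not a transcript: what was done, what's open, why.
        state = {
            "written_at": datetime.now(timezone.utc).isoformat(),
            "completed": completed,        # e.g. ["migrated auth tests"]
            "in_progress": in_progress,    # e.g. ["refactor of billing module"]
            "decisions": decisions,        # e.g. ["kept sync API, async was flaky"]
            "next_steps": next_steps,      # e.g. ["wire up CI for e2e tests"]
        }
        HANDOFF.write_text(json.dumps(state, indent=2))

    def read_handoff():
        # The next session starts by reading this instead of replaying history.
        return json.loads(HANDOFF.read_text()) if HANDOFF.exists() else None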

My evidence: ~675 journal entries, ~300k words of documentation, working infrastructure, and a public site - all built primarily through autonomous agent sessions with review.

The key shift for me was treating it less like "a tool that writes code" and more like "a very persistent junior developer who needs structure but never forgets what you taught it."

a day agolighthouse1212

I've found it useful for getting features started and fixing bugs, but it depends on the feature. I use Claude Sonnet 4.5 and it usually does a pretty good job on well-known problems like setting up web sockets and drag-and-drop UIs, which would take me much longer to do by hand. It also seems to follow examples of existing patterns in my codebase well, like router/service/repository implementations. I've struggled to get it to work well for messy, complicated problems like parsing text into structured objects that have thousands of edge cases and in which the complexity gets out of hand very quickly if you're not careful. In these cases I write almost all the code by hand. I also use it for writing ad-hoc scripts I need to run once and that are not safety critical, in which case I use its code as-is after a cursory review that it is correct. Sometimes I build features I would otherwise be too intimidated to try if doing them by hand. I also use it to write tests, but I usually don't like its style and tend to simplify them a lot. I'm sure my usage will change over time as I refine what works and what doesn't for me.

a day agomvanzoest

The thread keeps circling back to memory. Agents don't learn.

Everyone's building the same workarounds. CLAUDE.md files. Handoff docs. Learnings folders. Developer logs. All manual. All single-user. All solving the same problem: how do I stop re-teaching the agent things it should already know?

What nobody seems to ask: what if the insight that helped me debug a PayPal API timeout yesterday could help every developer who hits that bug tomorrow?

Stack Overflow was multiplayer. A million developers contributing solutions that benefited everyone. We replaced it with a billion isolated sessions that benefit no one else.

The "junior developer that never grows" framing is right. But it's worse - it's a junior who forgets everything at 5pm and shows up tomorrow needing the same onboarding. And there's no way for your junior's hard-won knowledge to help anyone else's.

We're building Memco to work on this. Shared memory layer for agents. Not stored transcripts - abstracted insights. When one agent figures something out, every agent benefits.

Still early. Curious if others are thinking about this or have seen attempts at it.

11 hours agost-msl

> "Stack Overflow was multiplayer. A million developers contributing solutions that benefited everyone. We replaced it with a billion isolated sessions that benefit no one else".

This.

(Thank you!)

11 hours agocx42net

When you first began learning how to program were you building and shipping apps the next day? No.

Agentic programming is a skill-set and a muscle you need to develop just like you did with coding in the past.

Things didn’t just suddenly go downhill after an arbitrary tipping point - what happened is you hit a knowledge gap in the tooling and gave up.

Reflect on what went wrong and use that knowledge next time you work with the agent.

For example, investing the time in building a strong test suite and testing strategy ahead of time which both you and the agent can rely on.

Being able to manage the agent and getting quality results on a large, complex codebase is a skill in itself, it won’t happen over night.

It takes practice and repetition with these tools to level up, just like anything else.

2 days agolinesofcode

Your point is fair, but it rests on a major assumption I'd question: that the only limit lies with the user, and the tooling itself has none. What if it’s more like “you can’t squeeze blood from a stone”? That is, agentic coding may simply have no greater potential than what I've already tried. To be fair I haven't gone all the way in trying to make it work but, even if some minor workarounds exist, the full promise being hyped might not be realistically attainable.

2 days agoterabytest

How can one judge potential without fully understanding or having used it to its full potential?

I don’t think agentic programming is some promised land of instant code without bugs.

It’s just a force multiplier for what you can do.

2 days agolinesofcode

My experience is the same. In short, agents cannot plan ahead, or plan at a high level. This means they have a blind spot for design. Since they cannot design properly, it limits the kind of projects that are viable to smaller scopes (not sure exactly how small, but in my experience extremely small and simple). Anything that exceeds this abstract threshold has a good chance of being a net negative, with most of the code being unmaintainable, inextensible, and unreliable.

Anyone who claims AI is great is not building a large or complex enough app, and when it works for their small project, they extrapolate to all possibilities. So because their example was generated from a prompt, it's incorrectly assumed that any prompt will also work. That doesn't necessarily follow.

The reality is that programming is widely underestimated. The perception is that it's just syntax on a text file, but it's really more like a giant abstract machine with moving parts. If you don't see the giant machine with moving parts, chances are you are not going to build good software. For AI to do this, it would require strong reasoning capabilities, that lets it derive logical structures, along with long term planning and simulation of this abstract machine. I predict that if AI can do this then it will be able to do every single other job, including physical jobs as it would be able to reason within a robotic body in the physical world.

To summarize, people are underestimating programming, using their simple projects to incorrectly extrapolate to any possible prompt, and missing the hard part of programming which involves building abstract machines that work on first principles and mathematical logic.

2 days agoproc0

>Anyone who claims AI is great is not building a large or complex enough app

I can't speak for everyone, but lots of us fully understand that the AI tooling has limitations and realize there's a LOT of work that can be done within those limitations. Also, those limitations are expanding, so it's good to experiment to find out where they are.

Conversely, it seems like a lot of people are saying that AI is worthless because it can't build arbitrarily large apps.

I've recently used the AI tooling to make a docusign-like service and it did a fairly good job of it, requiring about a day's worth of my attention. That's not an amazingly complex app, but it's not nothing either. Ditto for a calorie tracking web app. Not the most complex app, but companies are making legit money off them, if you want a tangible measure of "worth".

2 days agolinsomniac

Right, it has a lot of uses. As a tool it has been transformative on many levels. The question is whether it can actually multiply productivity across the board for any domain and at production level quality. I think that's what people are betting on, and it's not clear to me yet that it can. So far that level looks more like a tradeoff. You can spend time orchestrating agents, gaining some speedup at the cost of quality, or you can use it more like a tool and write things "manually" which is a lot higher quality.

2 days agoproc0

> Anyone who claims AI is great is not building a large or complex enough app

That might be true for agentic coding (caveat below), but AI in the hands of expert users can be very useful - "great" - in building large and complex apps. It's just that it has to be guided and reviewed by the human expert.

As for agentic coding, it may depend on the app. For example, Steve Yegge's "beads" system is over a quarter million lines of allegedly vibe-coded Go code. But developing a CLI like that may be a sweet spot for LLMs, it doesn't have all the messiness of typical business system requirements.

2 days agoantonvs

> For example, Steve Yegge's "beads" system is over a quarter million lines of allegedly vibe-coded Go code. But developing a CLI like that may be a sweet spot

Is that really a success? I was just reading an article talking about how sloppy and poorly implemented it is: https://lucumr.pocoo.org/2026/1/18/agent-psychosis/

I guess it depends on what you’re looking to get out of it.

2 days agoznsksjjs

I haven't looked into it deeply, but I've seen people claiming to find it useful, which is one metric of success.

Agentic vibe coding maximalists essentially claim that code quality doesn't matter if you get the desired functionality out of it. Which is not that different from what a lot of "move fast and break things" startups also claim, about code that's written by humans under time, cost, and demand pressure. [Edit: and I've seen some very "sloppy and poorly implemented" code in those contexts, as well as outside software companies, in companies of all sizes. Not all code is artisanally handcrafted by connoisseurs such as us :]

I'm not planning to explore the bleeding edge of this at the moment, but I don't think it can be discounted entirely, and of course it's constantly improving.

2 days agoantonvs

I'd say it is a success at being useful, but yeah it does seem like the code itself has been a bit of a mess.

I've used a version that had a bd stats and a bd status that both had almost the same content in slightly different formats. Later versions appear to have made them an alias for the same thing. I've also had a version where the daemon consistently failed to start and there were no symptoms other than every command taking 5 seconds. In general, the optimization with the daemon is a questionable choice. It doesn't really need to be _that_ fast.

And yet, even after all of that it still has managed to be useful and generally fairly reliable.

2 days agojsight

Anything above a simple app and it becomes a tradeoff that needs to be carefully tuned so that you get the most out of it and it doesn't end up being a waste of time. For many use cases and domain combinations this is a net positive, but it's not yet consistent across everything.

From my experience it's better at some domains than others, and also better at certain kinds of app types. It's not nearly as universal as it's being made out to be.

2 days agoproc0

Sure, here are my own examples:

* I came up with a list of 9 performance improvement ideas for an expensive pipeline. Most of these were really boring and tedious to implement (basically a lot of special cases) and I wasn't sure which would work, so I had Claude try them all. It made prototypes that had bad code quality but tested the core ideas. One approach cut the time down by 50%, I rewrote it with better code and it's saved about $6,000/month for my company.

* My wife and I had a really complicated spreadsheet for tracking how much we owed our babysitter – it was just complex enough to not really fit into a spreadsheet easily. I vibecoded a command line tool that's made it a lot easier.

* When AWS RDS costs spiked one month, I set Claude Code to investigate and it found the reason was a misconfigured backup setting

* I'll use Claude to throw together a bunch of visualizations for some data to help me investigate

* I'll often give Claude the type signature for a function, and ask it to write the function. It generally gets this about 85% right
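To make that last one concrete, the stub I hand over is usually just a typed signature plus a docstring - something like this (a hypothetical example, not from my codebase, shown with the kind of body that comes back):

    from collections import defaultdict

    def group_by_account(transactions: list[dict], amount_key: str = "amount") -> dict[str, float]:
        """Sum transaction amounts per account_id, skipping refunded rows."""
        # The body below is the sort of thing the model fills in from the
        # signature and docstring alone.
        totals: dict[str, float] = defaultdict(float)
        for tx in transactions:
            if tx.get("refunded"):
                continue
            totals[tx["account_id"]] += float(tx[amount_key])
        return dict(totals)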

2 days agoSatvikBeri

>My wife and I had a really complicated spreadsheet for tracking how much we owed our babysitter – it was just complex enough to not really fit into a spreadsheet easily. I vibecoded a command line tool that's made it a lot easier.

Ok, please help me understand. Or is this more of a nanny?

2 days agosauwan

Not technically a nanny, but not dissimilar. In this case, they do several types of work (house cleaning, watching 1-3 kids, daytime and overnights, taking kids out.) They are very competent – by far the best we've found in 3 years – and charge different rates for the different types of work. We also need to track mileage etc. for reimbursement.

They had a spreadsheet for tracking but I found it moderately annoying – it was taking 5-10 minutes a week, so normally I wouldn't have bothered to write a different tool, but with vibe coding it was fairly trivial.
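The tool itself is just this kind of arithmetic (a sketch - the categories, rates and multipliers here are invented, not our real ones):

    # Invented rates/categories; the real CLI just does this kind of arithmetic.
    RATES = {"cleaning": 22.0, "childcare_day": 25.0, "childcare_overnight": 30.0}
    EVENING_MULTIPLIER = 1.25   # applied to evening hours, say
    MILEAGE_RATE = 0.67         # reimbursement per mile (placeholder)

    def line_total(category: str, hours: float, evening: bool = False) -> float:
        rate = RATES[category]
        if evening:
            rate *= EVENING_MULTIPLIER
        return round(rate * hours, 2)

    def weekly_total(entries: list[tuple[str, float, bool]], miles: float = 0.0) -> float:
        # entries: (category, hours, evening?) tuples logged during the week
        return round(sum(line_total(*e) for e in entries) + miles * MILEAGE_RATE, 2)

    # e.g. weekly_total([("childcare_day", 6, False), ("cleaning", 3, True)], miles=12)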

2 days agoSatvikBeri

How did you give Claude access to AWS?

2 days agoabrookewood

It does OK using the AWS CLI

2 days agomickeyr


Just awscli

2 days agoSatvikBeri

Why is your babysitting bill so complicated?

2 days agomrdependable

There are several different types of work they can do, each one of which has a different hourly rate. The time of day affects the rate as well, and so can things like overtime.

It's definitely a bit of an unusual situation. It's not extremely complicated, but it was enough to be annoying.

2 days agoSatvikBeri

Jesus, are you ok? Can’t you just, like, give em a 20 when you get home?

I find it quite funny you’ve invented this overly complex payment structure for your babysitter and then find it annoying. Now you’ve got a CLI tool for it.

2 days agowhackernews

why assume the billing model is being imposed by the customer rather than the service provider?

2 days agomcpeepants

GP has provided an anecdote with no supporting evidence, nor any code examples. So it is as fair to assume the story is a fabrication as it is to assume it has any truth to it.

2 days agoirlnanny

I am really shocked at the response this trivial anecdote has gotten.

I could state it much more generically: we had an annoying Excel sheet that took ~10 minutes a week, I vibe coded a command line tool that brought it down to ~1 minute a week. I don't think this is unusual or hard to believe in any way.

2 days agoSatvikBeri

Yes! You should absolutely always assume a random stranger on HN is outright lying about a trivial anecdote to farm meaningless karma.

2 days agogarciasn

Or instigating conflict?

2 days agofn-mote

What...what conflict do you think I'm instigating, exactly? Whether the command line is a better interface than Excel?

2 days agoSatvikBeri

I didn't choose the payment structure, and the point is that a CLI is not a high bar. Something that we used to spend ~10 minutes a week on with spreadsheets is now ~1 minute/week.

2 days agoSatvikBeri

Why didn’t you work out a more manageable billing structure with them?! Or to put it another way: if it took you 10 minutes a week with spreadsheets to even figure out what their bill is, how on earth did they verify your invoices were even correct? And if they couldn’t—or if it took more than 10 minutes each week—why wouldn’t they prefer a billing system they could verify they were being paid correctly?

2 days agomcphage

Jesus! Is this HN or a personal finance forum? Who cares why they do it a certain way. Did they ask for your advice?

2 days agojryle70

If you work like this in a company, you'll end up with an overcomplicated mess.

Now, people with Claude Code are ready to produce a big pile of shit in a short time.

2 days agomdavid626

I’m not trying to give advice, I’m just curious about their arrangement. When I did consulting, I hated billing, and would have wanted a system that was as easy as possible.

a day agomcphage

Are you serious?

“Most of these were really boring and tedious to implement (basically a lot of special cases) and I wasn't sure which would work, so I had Claude try them all.”

I doubt you verified the boring edge cases.

2 days agomdavid626

I mean, as I said, I literally had Claude prototype them, and then I rewrote the working one from scratch. I didn't commit any of the code written by Claude.

a day agoSatvikBeri

Language is the network of abstractions that exists between humans. A model is a tool for predicting abstract or unobservable features in the world. So an LLM is a tool that explores the network of abstractions built into our languages.

Because the network of abstractions that is a human awareness (the ol' meat suit pilot model) is unique to each of us, we cannot directly share components of our internal networks. Thus, we all interact through language and we all use language differently. While it's true that compute is fundamentally the same for all of us (we have to convert complex human abstractions into computable forms and computers don't vary that much), programming languages provide general mappings from diverse human abstractions back to basic compute features.

And so, just like with coding, the most natural path for interacting with a LLM is also unique to all of us. Your assumptions, your prior knowledge, and your world perspective all shape how you interact with the model. Remember you're not just getting code back though... LLMs represent a more comprehensive world of ideas.

So approach the process of learning about large language models the same way that you approach the process of learning a new language in general: pick a hello world project (something that's hello world for you) and walk through it with the model, paying attention to what works and what doesn't. You'd do something similar if you were handed a team of devs that you didn't know.

For general use, I start by having the model generate a req document that 1) I vet thoroughly. Then I have the model make TODO lists at all levels of abstraction (think procedural decomposition for the whole project) down to my code, which 2) I vet thoroughly. Then I require the model to complete the TODO tasks. There are always hiccups, same as when working with people. I know the places where I can count on solid, boilerplate results and require fewer details in the TODOs. I do not release changes to the TODO files without 3) review. It's not fire-and-forget, but the process is modular and understandable, and 4) errors arising from system design are mine to identify and address in the req and TODOs.

Good luck and have fun!

4 hours agoenknee1

1. Start with a plan. Get AI to help you make it, and edit.

2. Part of the plan should be automated tests. AI can make these for you too, but you should spot check for reasonable behavior.

3. Use Claude 4.5 Opus

4. Use Git, get the AI to check in its work in meaningful chunks, on its own git branch.

5. Ask the AI to keep an append-only developer log as a markdown file, and to update it whenever its state significantly changes, it makes a large discovery, or it is "surprised" by anything.
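For #5, the mechanism can be as simple as appending timestamped entries (a sketch; the file name and entry format are just one possible convention - the agent can also append to the file directly):

    from datetime import datetime, timezone

    def log_entry(kind: str, message: str, path: str = "DEVLOG.md") -> None:
        # Append-only: never rewrite history, just add a new dated section.
        stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
        with open(path, "a") as f:
            f.write(f"\n## {stamp} [{kind}]\n{message}\n")

    # e.g. log_entry("surprise", "tests pass locally but fail in CI due to TZ handling")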

2 days agolukebechtel

> Use Claude 4.5 Opus

In my org we are experimenting with agentic flows, and we've noticed that model choice matters especially for autonomy.

GPT-5.2 performed much better for long-running tasks. It stayed focused, followed instructions, and completed work more reliably.

Opus 4.5 tended to stop earlier and take shortcuts to hand control back sooner.

2 days agobaal80spam

a ralph loop can make claude go til the end, or to a rate limit at least.

opus closes the task and ralph opens it right back up again.

i imagine there's something to the harness for that, too

2 days ago8note

Interesting! Was kinda disappointed with Codex last time I tried it ~2m ago, but things change fast.

2 days agolukebechtel

So review the code. Our rule is that if your name is on the PR, you own the code; someone else will review it and expect you to be able to justify its contents. And we don't accept AI commits.

What this means in workflow terms is that the bottleneck has moved, from writing the code to reviewing it. That's forward progress! But the disparity can be jarring when you have multiple thousands of lines of code generated every day and people are used to a review cycle based on tens or hundreds.

Some people try to make the argument that we can accept standards of code from AI that we wouldn't accept from a human, because it's the AI that's going to have to maintain it and make changes. I don't accept that: whether you're human or not it's always possible to produce write-only code, and even if the position is "if we get into difficulty we'll just have the agent rewrite it" that doesn't stop you getting into a tarpit in the first place. While we still have a need to understand how the systems we produce work, we need humans to be able to make changes and vouch for their behaviour, and that means producing code that follows our standards.

a day agoregularfry

Totally agree. If I don’t understand the code as if I’d written it myself, then I haven’t reviewed it properly. And during that review I’m often trimming and moving things around to simplify and clarify as much as possible.

This helps both me and the next agent.

Using these tools has made me realise how much of the work we (or I) do is editing: simplifying the codebase to the clearest boundaries, focusing down the APIs of internal modules, actual testing (not just unit tests), managing emerging complexity with constant refactoring.

Currently, I think an LLM struggles with the subtlety and taste aspects of many of these tasks, but I’m not confident enough to say that this won’t change.

a day agotcldr

I think it will change, and it might be possible now with the right prompting, to some degree. But the average project won't be there for a while.

a day agoregularfry

For me, the only metric that matters is wall-time between initial idea and when it's solid enough that you don't have to think about it.

Agentic coding is very similar to frameworks in this regard:

1. If the alignment is right, you have saved time.

2. If it's not right, it might take longer.

3. You won't have clear evidence of which of these cases applies until changing course becomes too expensive.

4. Except, in some cases, this doesn't apply and it's obvious... Probably....

I have a (currently dormant) project https://onolang.com/ that I need to get back to that tries to balance these exact concerns. It's like half written. Go to the docs part to see the idea.

2 days agokristopolous

A loop I've found that works pretty well for bugs is this:

- Ask Claude to look at my current in-progress task (from Github/Jira/whatever) and repro the bug using the Chrome MCP.

- Ask it to fix it

- Review the code manually, usually it's pretty self-contained and easy to ensure it does what I want

- If I'm feeling cautious, ask it to run "manual" tests on related components (this is a huge time-saver!)

- Ask it to help me prepare the PR: This refers to instructions I put in CLAUDE.md so it gives me a branch name, commit message and PR description based on our internal processes.

- I do the commit operations, PR and stuff myself, often tweaking the messages / description.

- Clear context / start a new conversation for the next bug.

On a personal project where I'm less concerned about code quality, I'll often do the plan->implementation approach. Getting pretty in-depth about your requirements obviously leads to a much better plan. For fixing bugs it really helps to tell the model to check its assumptions, because that's often where it gets stuck and creates new bugs while fixing others.

All in all, I think it's working for me. I'll tackle 2-3 day refactors in an afternoon. But obviously there's a learning curve and having the technical skills to know what you want will give you much better results.

2 days agoemilecantin

> When I tried using Codex on my existing codebases, with or without guardrails, half of my time went into fixing the subtle mistakes it made or the duplication it introduced.

If you want to get good at this, when it makes subtle mistakes or duplicates code or whatever, revert the changes and update your AGENTS.md or your prompt and try again. Do that until it gets it right. That will take longer than writing it yourself. It's time invested in learning how to use these and getting a good setup in your codebase for them.

If you can't get it to get it right, you may legitimately have something it sucks at. Although as you iterate you might also gain some other insights into why it keeps getting it wrong, and can maybe change something more substantial about your setup to make it able to get it right.

For example I have a custom xml/css UI solution that draws inspiration from both XML and SwiftUI, and it does an OK job of making UIs for it. But sometimes it gets stuck in ways it wouldn't if it were using HTML or some known (and probably higher quality/less buggy) UI library. I noticed it keeps trying things, adding redundant markup to both the xml and css, using unsupported attributes that it thinks should exist (because they do in HTML/CSS), and never cleaning up along the way.

Some amount of fixing up its context made it noticeably better at this but it still gets stuck and makes a mess when it does. So I made it write a linter and now it uses the linter constantly which keeps it closer to on the rails.
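The linter is conceptually tiny - something like this (a simplified Python sketch; the allowed-attribute table here is made up, the real one comes from my UI schema):

    import sys
    import xml.etree.ElementTree as ET

    # Made-up element/attribute whitelist standing in for the real UI schema.
    ALLOWED_ATTRS = {
        "panel": {"id", "width", "height", "padding"},
        "label": {"id", "text", "font-size", "color"},
        "button": {"id", "text", "on-click"},
    }

    def lint(path: str) -> list[str]:
        errors = []
        for elem in ET.parse(path).getroot().iter():
            allowed = ALLOWED_ATTRS.get(elem.tag)
            if allowed is None:
                errors.append(f"unknown element <{elem.tag}>")
                continue
            for attr in elem.attrib:
                if attr not in allowed:
                    errors.append(f"<{elem.tag}> has unsupported attribute '{attr}'")
        return errors

    if __name__ == "__main__":
        problems = lint(sys.argv[1])
        print("\n".join(problems) or "ok")
        sys.exit(1 if problems else 0)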

Your pet feeding app isn't in this category. You can get a substantial app pretty far these days without running into a brick wall. Hitting a wall that quickly just means you're early on the learning curve. You may have needed to give it more technical guidance from the start, and have it write tests for everything, make sure it makes the app observable to itself in some way so it can see bugs itself and fix them, stuff like that.

2 days agofuryofantares

I worked for years as a backend and desktop software programmer, then in gamedev, and now I'm back to SaaS development, mostly backend. I didn't have much success with agentic coding with "agents", but had great success with LLM code generation while keeping all the code in context with Google Gemini.

For gamedev you can really build quite a complex 2D game prototype in Pygame or Unity rapidly, since 20-50KLOC is enough for a lot of indie games. And it allows you to iterate and try different ideas much faster.

Most features are either one-shots, doing all changes across the codebase in one prompt, or require only a few fixing prompts.

It really helps to isolate simulation from all else with mandatory CQRS for gamestate.
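By CQRS for gamestate I mean roughly this shape (a toy Python sketch, not code from an actual project): commands are the only things allowed to mutate the state, and everything else goes through read-only queries.

    from dataclasses import dataclass, field

    @dataclass
    class GameState:
        gold: int = 0
        units: dict[str, int] = field(default_factory=dict)

    # Commands: the only code allowed to mutate state.
    def apply_spawn_unit(state: GameState, unit_type: str, cost: int) -> None:
        if state.gold < cost:
            raise ValueError("not enough gold")
        state.gold -= cost
        state.units[unit_type] = state.units.get(unit_type, 0) + 1

    # Queries: pure reads, safe for UI/rendering/AI code to call anywhere.
    def query_army_size(state: GameState) -> int:
        return sum(state.units.values())

    def query_can_afford(state: GameState, cost: int) -> bool:
        return state.gold >= cost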

It also helps to generate markdown readmes along the way for all major systems and keep feature checklists in the header of each file. This way the LLM doesn't lose track of what is being generated.

Basically I generated in 2-3 weeks projects that would take 2-3 months to implement in a team, simply because there is much less delay between the idea of a feature and testing it in some form.

Yes - occasionally you will fail to write a proper spec, or the LLM will fail to generate working code, but then usually it means you revert everything, rewrite the specification, and try again.

So LLMs of today are certainly suitable when "good enough" is sufficient. So they are good for prototyping. Then if you want better architecture you just guide the LLM to refactor the completed code.

LLMs are also good for small, self-contained projects or microservices where all relevant information fits into context.

a day agoSXX

Shipping unreviewed, untested code into production is irresponsible. But not all code is production code, and in my experience most valuable work happens long before anything deserves architectural sign-off / big commercial commitment.

Exploratory scripts, glue code—what I think of as digital duct tape between systems—scaffolding, probes, and throwaway POCs have always been messy and lightly governed. That’s kind of normal.

What’s different now is that more people can participate in that phase, and we’re judging that work by the same norms and processes we use for production systems. I know more designers now who are unafraid to code than ever before. That might be problematic or fantastic.

Where agentic coding does work for me is explicitly in those early modes, or where my own knowledge is extremely thin or I don’t have the luxury of writing everything myself (time etc). Things that simply wouldn’t get made otherwise: feasibility checks, bridging gaps between systems, or showing a half-formed idea fast enough to decide whether it’s worth formalising.

In those contexts, technical debt isn’t really debt, because the artefact isn’t meant to live long enough to accrue interest or be used in anger.

So I don’t think the real question is "does agentic coding work?" It’s whether teams are willing to clearly separate exploratory agency from production authority, and enforce a hard line between the two. ( I dont think they'll know the difference sadly) and without that, you’re right—you just get spaghetti that passes today and punishes EVERYONE six months later.

a day agoorganised

When I was in Product, one of my favourite roles was at a company where there was a massive distinction between prototyping code and live code.

To the extent that no prototype could EVER end up in live - it had to be rewritten.

This allowed prototypes to move at brilliant speed, using whatever tech you wanted (I saw plenty of paper, PowerPoint and Flash prototypes). Once you proved the idea (and the value), it was iteratively rebuilt 'properly'.

At other companies I have seen things hacked together as a proof of concept, live years later, and barely supported.

I can see agentic working great for prototyping, especially in the hands of those with limited technical knowledge.

a day agoljf

Never forget that the majority of what you see online is biased towards edge framing: the subject matter gets presented as either incredible or terrible. Just as people curate their online profiles to make their lives appear more "appealing" than they actually are, they do the same curating in other areas.

a day agoiamsaitam

> edge framing

is this a term of art? I interpreted it as "people only show off the best of the best or the worst of the worst, while the averages don't post online", though I've never heard the term "edge framing" before

a day agointernet_points

I'm still experimenting with it and finding out what works and what doesn't, but I have made some side projects with Claude, including a web framework that doesn't require a build step/npm dependencies (great for my personal website so I don't have to depend on npm supply chain nightmares), a fully featured music player server, and also a tool that lets the agent review its past conversations and update documentation based on patterns such as frequent mistakes, re-explored code, etc.

Web framework (includes basic component library, optional bundler/optimizer, tutorial/docs, e2e tests, and demos): https://github.com/iwalton3/vdx-web

Music player web app (supports large music libraries, pwa offline sync, parametric eq, crossfade, crossfeed, semantic feature-based music search/radio, milkdrop integration, and other interesting features): https://github.com/iwalton3/mrepo-web

Documentation update script (also allows exporting Claude conversations to markdown): https://github.com/iwalton3/cl-pprint

Regarding QC: these are side projects, so I validate them based on code review of key components, e2e testing, and manual testing where applicable. I find that having the agent be able to check its work is the single biggest factor in reducing rework, but I make no promises about these projects being completely free of bugs.

15 hours agoiwalton3

Just for fun, I built a first person shooter game in UE5 from scratch using agentic coding. I've only spent a couple of months on it in my free time so far, and it isn't complete yet, but it's close enough that I could definitely release an early access version with another month or so of work. The most time consuming tasks have actually been tasks that agentic coding hasn't been able to help out with, like animations and mapping. The game is mostly written in C++ and sometimes the agent makes some bad decisions, but with a bit of extra guidance and being smart about my git commits so that I can revert and try again if necessary, I've always been able to make it work the way I want. I most definitely would not have been able to build this on my own in any reasonable amount of time.

FWIW it seems like it heavily depends on the agent + model you're using. I've had the most success with Claude Code (Sonnet), and only tried Opus 4.5 for more complex things. I've also tried Codex which didn't seem very good by comparison, plus a handful of other local models (Qwen3, GLM, Minimax, etc.) through OpenCode, Roo, and Cline that I'm able to run on my 128 GB M4 Max. The local ones can work for very simple agentic tasks, albeit quite slow.

a day agoloh

The best way to think about Codex is an outsourced contractor.

You give it a well-defined task, it'll putter away quietly and come back with results.

I've found it to be pretty good at code reviews or large refactoring operations, not so much building new features.

a day agotheshrike79

Did you encounter any issues related to the fact that Unreal Engine uses a specific / custom flavor of C++?

a day agoVladimirGolovin

Not really, no, at least not with Claude. It seems to already understand the UE5 way of doing things, but there were a couple of edge cases for new features beyond its cutoff date where I had to refer Claude to the UE5 documentation. Once it read the documentation however, it understood and continued without issue. Also, for any compilation errors, I just copy and paste the error messages into Claude Code and it usually fixes it immediately.

a day agoloh

I echo your experience, and the best use I've found is to have it generate that first implementation, which is often surprisingly good, and then take it manually from there, because getting an LLM to fix its own mistakes is an exercise in frustration ...

I treat it like a little jump off platform, for my own initial velocity, any more and it goes off the rails like you describe

7 hours agotripzilch

It works for me but I do it incrementally. I use codex. I ignore the hypesters because I was around the last time self driving cars were just few quarters away.

What I do is: I write a skeleton. Then I write a test suite (nothing fancy, just 1 or 2 sanity tests). I'll usually start with some logic that I want to implement and break it down into X, Y, Z steps. One thing to note here - TDD is very useful. If it makes your head hurt, it means the requirements aren't very clear. Otherwise it's relatively easy to write test cases. Second thing: if your code isn't testable in parts, it probably needs some abstraction and refactoring. I typically test at the level of abstraction boundaries, e.g. if something needs to write to a database I'll write a data layer abstraction (standard stuff) and test that layer by whatever means are appropriate.

Once the spec reaches a level where it's a matter of typing, I'll add annotations in the code and add todos for Codex. Then I instruct it with some context - by this time it's much easier to write the context since TDD clears out the brain fog. And I tell it to finish the todos and only the todos. My most used prompt is "DONT CHANGE ANYTHING APART FROM THE FUNCTIONS MARKED AS TODO." I also have an AGENTS.md file listing any common library patterns to follow. And if the code isn't correct, I'll ask Codex to redo it until it gets to a shape I understand. Most of the time it gets things right the 2nd time around, aka iteration is easier than starting from ground zero. Usually it takes me a day to finish a part, or at least I plan it that way. For me, Codex does save me a whole bunch of time, but only because of the upfront investment.
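Concretely, the skeleton-plus-TODO pattern looks something like this (an invented example, not real work code):

    from typing import Protocol

    # Data layer abstraction: this boundary is what I test.
    class OrderStore(Protocol):
        def get_total(self, order_id: str) -> float: ...
        def save_discount(self, order_id: str, amount: float) -> None: ...

    def discount_for_total(total: float) -> float:
        # TODO(codex): 5% off totals over 100, 10% off totals over 500,
        # never more than 50 in absolute terms. Do not touch anything else.
        raise NotImplementedError

    def apply_discount(store: OrderStore, order_id: str) -> float:
        # TODO(codex): look up the total, compute the discount via
        # discount_for_total, persist it with save_discount, return it.
        raise NotImplementedError

    # Sanity test written before handing the TODOs over (fails until filled in):
    def test_discount_for_total():
        assert discount_for_total(50) == 0
        assert discount_for_total(200) == 10
        assert discount_for_total(1000) == 50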

You personally should just ignore the YouTubers; most of them are morons. If you'd like to check out AI coding flows, check out the ones from the masters like Antirez and Mitchell H. That's a better way of learning the right tricks.

a day agoanother_twist

You are asking two very different questions here.

i.e. You are asking a question about whether using agents to write code is net-positive, and then you go on about not reviewing the code agents produce.

I suspect agents are often net-positive AND one has to review their code. Just like most people's code.

2 days agodamnitbuilds

It seems that people feel code review is a cost, but time spent writing code is not a cost because it feels productive.

2 days agospolitry

I don't think that's quite it - review is a recurring cost which you pay on every new PR, whereas writing code is a cost you pay once.

If you are continually accumulating technical debt due to an over-enthusiastic junior developer (or agent) churning out a lot of poorly-conceived code, then the recurring costs will sink you in the long run

2 days agoswiftcoder

"review is a recurring cost which you pay on every new PR, whereas writing code is a cost you pay once."

Huh ? Every new PR is new code which is a new cost ?

2 days agodamnitbuilds

> Every new PR is new code which is a new cost ?

Every new PR interacts with existing code, and the complexity of those interactions increases steadily over time

2 days agoswiftcoder

A key thing I noticed in sincere anecdotes about LLM code is that it always seems to be outside of the author's area of expertise.

I work with an infrastructure team that are old school sysadmins, not really into coding. They are now prodigiously churning out apps that "work" for a given task. It is producing a ton of technical debt and slowing down new feature development, but this team doesn't really get it because they don't know enough software engineering to understand.

Likewise the recent example of an LLM "coding a browser" where the result didn't compile and wasn't useful. If you took it at face value you'd think "wow that's a hard task I couldn't do, and an LLM did it alone". In fact they spent a ton of effort on manually herding the LLM only for it to produce something pretty useless.

a day agojrjeksjd8d

I've been having good results lately/finally with Opus 4.5 in Cursor. It still isn't one-shotting my entire task, but the 90% of the way it gets me is pretty close to what I wanted, which is better than in the past. I feel more confident in telling it to change things without it making it worse. I only use it at work so I can't share anything, but I can say I write less code by hand now that it's producing something acceptable.

For sysops stuff I have found it extremely useful. Once it has MCPs into all relevant services, I use it as the first place I go to ask what is happening with something specific on the backend.

2 days agoDustinBrett

I use agentic coding daily and rarely write any code by hand.

Here's what works for me:

Spend a lot of time working out plans. If you have a feature, get Claude Opus to build a plan, then ask it "How many github issues should this be", and get it to create those issues.

Then for each issue ask it to plan the implementation, then update the issue.

Then get it to look at all the issues for the feature and look for inconsistencies.

Once this is done, you can add your architectural constraints. If you think one issue looks like it could potentially reinvent something, edit that issue to point it at the existing implementation.

Once you are happy with the plan, assign to your agents and wait.

Optionally you can watch them - I find this quite helpful because you do see them go offtrack sometimes and can correct.

As they finish, run a separate review agent. Again, if you have constraints make sure the agent enforces them.

Finally, do an overall review of the feature. This should be initially AI assisted.

Don't get frustrated when it does the wrong thing - it will! Just tell it how to do the correct thing, and add that to your AGENTS.md so next time it will do it. Consider adding it to your issue template manually too.

In terms of code review, I manually review critical calculations line-by-line, and do a broad sweep review over the rest. That broad sweep review looks for duplicate functionality (which happens a lot) and for bad test case generation.

I've found this methodology speeds up the coding task around 5-10x what I could do before. Tasks that were 5-10 days of work are now doable in around 1 day.

(Overall my productivity increase is a lot higher because I don't procrastinate dealing with issues I want to avoid).

a day agonl

I had some successes refactoring one instance of a pattern in our codebase, along with all the class' call sites, and having Codex identify all the other instances of said pattern and refactor them in parallel by following my initial refactor.

Similarly, I had it successfully migrate a third (so far) of our tests from an old testing framework to a new one, one test suite at a time.

We also had a race condition, and given both the unsymbolicated trace and the build's symbols, Claude Code successfully symbolicated the trace and identified the cause. When prompted, it identified most of the similar instances of the responsible pattern in our codebase (the one it missed was an indirect one).

I didn’t care much about the suggested fixes on that last one, but consider it a success too, especially since I could just keep working on other stuff while it chugged along.

a day agoElFitz

I manually wrote a "bad" spec, asked it for feedback, and improved the spec until the problem, the solution, and the overall implementation design were clear, highly detailed, and aimed at exactly what I needed. All the thinking, reading and manual editing helped me understand the problem way better than where I began.

New session: Fed the entire spec, asked to build generic scaffolding only.

New session: Fed the entire spec, asked to build generic TEST scaffolding.

New session: Extract features to implement out of spec doc into .md files.

New session: Perform research on codebase with the problem statement "in mind", write results to another .md. Performed manual review of every .md.

New session(s): Fed research and feature .md and asked for ONE task at a time, ensuring tests were written as per spec and keep iterating until they passed. Code reviewed beginning with test assertions, and asked for modifications if required. Before commit, asked to update progress on .md.

Ended up with very solid large project including a technology I wasn't an expert on but familiar, that I would feel confident evolving without an agent if I had to, learned a lot in the process. It would've taken me at least 2 weeks to read docs about it and at least another 3 to implement by hand; I was done in 2 total.

a day ago6thbit

I think it depends on what you consider structurally sound. I built https://skipthestamp.co.uk/ this weekend. Not only did I not write any code, I did it all from my phone! I'm in the middle of writing a blog post about the process.

This is obviously a _very_ simple website, but in my opinion there's no arguing that agentic coding works.

You're in control of the sandbox. If you don't set any rules or guidelines to follow, the LLM will produce code you're not happy with.

As with anything, it's process. If you're building a feature and it has lots of bugs, there's been a misstep. More testing required before moving onto feature 2.

What makes you say "unreviewed code"? Isn't that your job now you're no longer writing it?

a day agoStrangeSound

My latest attempt at vibe coding used two directories: the first one for specifications, which is a small markdown file with the requirements and a UML diagram; the second is the actual code. I use AI to plan the requirements first, then a second agent to implement after the specification is complete and approved by me. So, I do check the requirements, specifications and architecture, but (almost) never check the code.

13 hours agoagent3bood

More than a year ago I built my own coding agent called Claudine. I also made the agentic anthropic-sdk-kotlin and a few other AI libraries for the ecosystem. Maybe this low-level experience is what allows me to use these tools to deliver in 2 days what would have taken 2 months before.

My advice - embrace TDD. Work with AI on tests, not implementation: your implementation is disposable, to be regenerated; the tests fully specify your system through contracts. This is trickier for UI than for logic. Embracing architectures that allow testing the view model in isolation might help. In general, anything that reduces cognitive load at inference time is worth doing.
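A toy example of what I mean by tests as contracts (a pytest sketch; the `tags` module and `normalize_tag` function are hypothetical - the point is that the implementation behind them can be regenerated at will while the tests stay fixed):

    import pytest
    from tags import normalize_tag  # hypothetical module the agent (re)writes

    # These tests pin down the behaviour; the implementation is disposable.
    def test_lowercases_and_trims():
        assert normalize_tag("  Machine Learning ") == "machine-learning"

    def test_collapses_internal_whitespace():
        assert normalize_tag("deep   learning") == "deep-learning"

    def test_rejects_empty_input():
        with pytest.raises(ValueError):
            normalize_tag("   ")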

20 hours agomorisil

I had a fairly big custom Python 2 static website generator ( github.com/csplib/csplib ), which I'd about given up on porting to Python 3 after a couple of aborted attempts. My main issue was that the libraries I was using didn't have Python 3 versions.

An AI managed to do basically the whole transfer. One big help is that I said "The website output of the current version should be identical", so I had an easy way to test for correctness (assuming it didn't try cheating by saving the website of course, but that's easy for me to check for).

2 days agoCJefferson

I was shoehorned into a dev role after an acquisition and it really sucked because it was not what I had been doing at my previous company. My boss was too involved in everyone's code and went over every line in every PR. It got much worse over time because he started to get the toxic corporate jitters of being removed from his post if he didn't deliver on his initiatives.

Long story short, since Claude 3.7 I haven’t written a single line of code and have had great success. I review it for cleanliness, anti-patterns, and good abstraction.

I was in charge of a couple of full system projects and refactors, and I put Claude Code on my work machine, which no one seemed to care about because of the top-down "you should use AI or else you aren't a team player". Before I left in November I basically didn't work: I was in meetings all the time while also being expected to deliver code, and I started moonlighting for the company I work at now.

My philosophy is: any tool can be powerful if you learn how to use it effectively. Something something 10,000 hours, something something.

Edit: After leaving this post I came across this and it is spot on to my point about needing time. https://www.nibzard.com/agentic-handbook

a day agofathermarz

https://arxiv.org/abs/2507.09089

“Before starting tasks, developers forecast that allowing AI will reduce completion time by 24%. After completing the study, developers estimate that allowing AI reduced completion time by 20%. Surprisingly, we find that allowing AI actually increases completion time by 19%--AI tooling slowed developers down.”

a day agosockopen

I am working on a game side project where I started to force myself not to look at the code (the foundation was written by hand). I just recently launched a big update, all written with agentic coding. I really enjoy it and do believe it's the future... plug for reference: https://thefakeborzi.itch.io/tower-chess/devlog/1321604/the-...

a day agoborzi

Yes. The company I am working at currently is using it extensively and I have seen first hand what their senior people are producing with the AI, and I rarely have any comments. It's adding huge value, and increases the velocity of delivery.

I think it depends on your tooling, your code-base, your problem space, and your ability to intelligently inject context. If all four are aligned (in my case they are) it's the real deal.

a day agodeanc

Do you happen to have any information about which services and workflows they use?

a day agoSpacemolte

Everyone is using Cursor, model preference varies. The best performers have large context in their repository, make very detailed plans using Planning mode and then execute. Use different models to individually review the work.

a day agodeanc

I'm building a 2D and 3D CAD in React. I have built an almost complete replacement of SolidWorks Basic in the cloud using the Claude Code CLI and GLM 4.7 unlimited in China. I have been doing full agent testing in Antigravity and bug fixing without writing a line of code. Total cost is around 300 dollars and 10 hours of time, with probably another 20 hours of debugging and small refactoring by hand. I have been vibe coding for 9 months, and a dev for 30 plus years. It is not easy to just pick up and run with, but it gives me super powers when you understand the tools and limitations. Check out Steve Yegge on YouTube!

21 hours agojbbryant

Can you update the title? "Do you have any evidence that agentic coding works?" sounds like you're seeking verifiable data, but in your post you're soliciting advice for how to use these tools. I clicked hoping for empiricism and got the same hamster wheel of "you're holding it wrong" that I see in every social media comment section about AI tooling.

a day agojdauriemma

My colleague coded a feature with Claude Code in a day. The code looks good and also seemingly works. The code was reviewed and pushed out to production.

The problem: there is no way he verified the code. The business logic behind the feature would probably take a few days to check for correctness. But if it looks good -> done. Let the customer check it. Of course, he claims "he reviewed it".

It feels to me, we just skip doing half the things proper senior devs did, and claim we’re faster.

2 days agomdavid626

I've had the opposite experience, but keep in mind that a lot of what gets worked on in large companies is glorified legacy CRUD apps. What I mean is, these applications have already been built with little thought about architecture, best practices and testing. These apps already have design flaws and bugs galore.

In these types of applications, there's already a lot of low hanging fruit to be had from working with an LLM.

If you're on a greenfield app where you get to make those decisions at the start, then I think I would still use the LLMs but I would be mindful of what you check into the code base. You would be better off setting up the project structure yourself and coding some things as examples of how you want the app to work. Once you have some examples in place, then you can use the LLMs to repeat the process for new screens/features.

a day agohalis

In my case, I use JetBrains Junie (it uses various models underneath) and it mostly works fine, but I don't vibe code entire products with it; I just give it easy, neatly defined tasks: improve a readme, re-implement something using the same approach as X, create a script that does Y, etc. It's fairly good at making one-off tools I need. We messed up something and need a tool to, e.g., fill up the db with missing records? I'll just make a console app that does this. We need a simple app that will just run somewhere and do one thing (like listen for new files and insert something into a db)? It's perfectly fine for that. I wouldn't trust it with the day-to-day job, features, or anything more advanced (I do mostly backend work). I also have to verify thoroughly what it ends up doing; it tends to mess up sometimes, but since most of what it does is non-critical, I don't mind. I too don't believe the claims of people who just straight up Claude their way through a codebase to do 'grand' things, but I have a small sample.

a day agoatraac

I code powershell, and what really worked for me was defining a highly detailed and specific rules file that outlines exactly what kind of output I want. This includes coding style, design patterns, example function structure, and a whole other bunch of requirements.

In augment code (or any other IDE agent integration), I can just @powershell-advanced-function-design at the top so the agent references my rule file, and then I list requirements after.

Things like:

- Find any bugs or problems with this function and fix them.

- Optimize the function for performance and to reduce redundant operations.

- Add additional comments to the code in critical areas.

- Add debug and verbose output to the function.

- Add additional error handling to the function if necessary.

- Add additional validation to the function if necessary.

It was also essential for me to enable the "essential" MCP servers like sequential thinking, context7, fetch, filesystem, etc.

Powershell coding isn't particularly complex, so this might not work out exactly how you want if you're dealing with a larger codebase with very advanced logic.

Another tangent: Figma Make is actually extremely impressive. I tried it out yesterday to create a simple prompt manager application, and over a period of ~30min I had a working prototype with:

- An integrated markdown editor with HTML preview and syntax highlighting for code fences.

- A semi-polished UI with a nice category system for organizing prompts.

- All required modals / dialogs were automatically created and functioned properly.

I really think agentic coding DOES work. You just have to be very explicit with your instructions and planning.

YMMV.

a day agofmdv

I had the same question recently! So I did an experiment to see if I could create something of value using agentic coding.

I made the world's fastest and most accurate JSON Schema validator.

https://github.com/sberan/tjs

a day agoZitchDog

Why do you say it's the fastest and most accurate? I don't see any raw stats, just AI generated readmes.

you also don't compare it to the top result on Google https://github.com/Stranger6667/jsonschema was that intentional?

a day agograyhatter

Here are benchmark results vs AJV: https://github.com/sberan/tjs/blob/main/benchmarks/results/B...

There are raw stats in the main graph image of the readme (op/s)

I don't compare it to that validator because these are JS only - I'll update to specify the language preference. Thanks for the feedback.

a day agoZitchDog

> with the claim that we should move from “validating architecture” to “validating behavior.” In practice, this seems to mean: don’t look at the code; if tests and CI pass, ship it.

The thing that people don't seem to understand is that these are two separate processes with separate goals. You don't do code reviews to validate behaviour, nor do you test to validate code.

Code reviews are for maintainability and non-functional requirements. Maintainability is something that every longer term software project has run into, to the point where applications have been rewritten from scratch because the old code was unmaintainable.

In theory you can say "let the LLM handle it", but how much do you trust it? It's practically equivalent to using a 3rd party library: most people treat them as a black box with an API - the code details don't matter. And it can work, I'm sure, but do you trust it?

a day agoCthulhu_

Does it work? Absolutely. It just depends how you use it and what for.

I had 12 legacy node apps running node 4, with jQuery front ends. Outdated dependencies and best practices all over. No build pipeline. No tests. Express 3. All of it worked but it's aging to the point of no return. And the upgrade work is boring, with very little ROI.

In a month, without writing any code, I've got them all upgraded to Node 22, with updated dependencies, removed jQuery completely, newer version of express, better logging, improved UI.

It's work that would have taken me a year of my free time and been soul crushing and annoying.

Did it with codex as a way of better learning the tooling. It felt more like playing a resource sim game than coding. Pretty enjoyable. Was able to work on multiple tasks at once while doing some other work.

It worked really well for that.

a day agostormcode

> producing code that’s structurally sound enough for someone responsible for the architecture to sign off on

1. It helps immensely if YOU take responsibility for the architecture. Just tell the agent not only what you want but also how you want it.

2. Refactoring with an agent is fast and cheap.

0. ...given you have good tests

---

Another thing: The agents are really good at understanding the context.

Here's an example of a prompt for a refactoring task that I gave to codex. It worked out great and took about 15 minutes. It mentions a lot of project-specific concepts, but codex could make sense of it.

""" we have just added a test backdoor prorogate to be used in the core module.

let's now extract it and move it into a self-contained entrypoint in the testing module (adjust the exports/esbuilds of the testing module as needed and in accordance with the existing patterns in the core and package-system modules).

this entrypoint should export the prorogate and also create its environment

refactor the core module to use it from there then also adjust the ui-prototype and package system modules to use this backdoor for cleanup """

a day agoGarlef

I've spent a lot of time on my workflows so that the AI generates tests along with code, then it will try the tests, fix mistakes, and the re-try the tests, and so on, until it works. That will usually prevent it from making small mistakes, and your biggest problem then is if it misunderstood the problem and generated the wrong tests. You can ask it to generate documentation for you to review at that point, and then use that documentation to generate code and tests.
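For illustration, here's a minimal sketch of that kind of generate-test-fix loop in Python. The `ask_agent` helper is a placeholder for however you invoke your coding tool (nothing here is specific to any particular agent), and `pytest` stands in for whatever test runner your project uses.

    import subprocess

    def ask_agent(prompt: str) -> None:
        """Placeholder: send `prompt` to your coding agent and let it edit files."""
        raise NotImplementedError("wire this up to the agent you actually use")

    def run_tests() -> tuple[bool, str]:
        # Run the test suite and capture output to feed back to the agent.
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        return result.returncode == 0, result.stdout + result.stderr

    def implement(task: str, max_rounds: int = 5) -> bool:
        ask_agent(f"Implement this, along with tests: {task}")
        for _ in range(max_rounds):
            passed, output = run_tests()
            if passed:
                # Tests are green; the remaining risk is that the tests check
                # the wrong thing, which the documentation-review step covers.
                return True
            ask_agent(f"The tests failed. Fix the code (not the tests):\n{output}")
        return False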

Vibe coding is a farce, but there is real value if you have the experience to set up a decent workflow.

a day agoseanmcdirmid

I still think it's useful, but you have to make heavy use of the 'plan' mode. I still ask the new hires to avoid doing more than just the plan (or at most generating test cases), so they can understand the codebase before generating new code inside it.

Basically my point of view is that if you don't feel comfortable reviewing your coworkers code, you shouldn't generate code with AI, because you will review it badly and then I will have to catch the bugs and fix it (happened 24 hours ago). If you generate code, you better understand where it can generate side effects.

2 days agoorwin

My experience has been that it does pretty well at writing a "rough draft" with sufficiently good instructions (in particular, telling it a general direction of how to implement it, rather than just telling it what the end goal is). Then maybe do one or two passes at having the agent improve on that draft, then fix the rest by hand.

2 days agothayne

We suddenly have many, many GitHub workflows. I made some with AI and other developers also made many. Today I found out we suddenly have API call tests against the beta deployment of the API. This was made in a day or so; it's something like 200 API calls. Figuring out how GitHub Actions works for this one task wasn't really worth the trouble before, but now it's click click go!

21 hours agojurschreuder

I pointed Claude at my docker compose setup with ~15 different applications and told it to convert it to an opentofu IAC setup.

Did it in one go, WAY better than I ever could have. Creates directories, generates configs from templates, all secrets are encrypted and managed etc.

I've iterated on it a bit more to optimise some bits, mostly for ergonomics, but the basic structure is still there.

---

Did a similar thing with Ansible. I gave claude a way to access my local computer (Mac) and a server I have (Linux), told it to create an ansible setup that sets up whatever is installed on both machines + configurations.

Again, it managed it faster and way better than I ever could have.

I even added an Arch Linux VM to the mix just to complicate things, that went faster than I could've done it myself too.

a day agotheshrike79

My experience of 100% agentic coding has been roughly the same as yours. That said, starting an agent off on a task has been the single most productive step I have introduced to my workflow in a while.

95% of the time the code doesn't even build but it gets all the jigsaw pieces in place and it's a million times easier to start deleting and moving pieces around than to start from scratch.

21 hours agojeromechoo

Googler; opinions are my own.

If agentic coding worked as well as people claimed on large codebases, I would be seeing a massive shift at my job... I'm really not seeing it.

We have access to pretty much all the latest and greatest internally at no cost and it still seems the majority of code is still written and reviewed by people.

AI assisted coding has been a huge help to everyone but straight up agentic coding seems like it does not scale to these very large codebases. You need to keep it on the rails ALL THE TIME.

2 days ago3vidence

Yup, same experience here at a much smaller company. Despite management pushing AI coding really hard for at least 6 months and having unlimited access to every popular model and tool, most code still seems to be produced and reviewed by humans.

I still mostly write my own code, and I've seen our Claude Code usage: just asking it questions and generating occasional boilerplate and one-off scripts puts me in the top quartile of users. There are some people who are all in and have it write everything for them, but it doesn't seem like there's any evidence they're more productive.

2 days agostrange_quark

As a second anecdote, at Amazon last summer things swapped from nobody using LLMs to almost everyone using them in ~2 months, after a fantastic tech talk and a bunch of agent scripts being put together.

Said scripts are kinda available in Kiro now, see https://github.com/ghuntley/amazon-kiro.kiro-agent-source-co... - specifically the specs, requirements, design, and exec tasks scripts.

That plus Serena MCP to replace all of Gemini CLI's agent tools actually gets it to work pretty well.

Maybe Google's choice of a super monorepo is worse for agentic dev than Amazon's billions of tiny, highly patterned packages?

2 days ago8note

I would think this is reasonable. My general understanding at Amazon is that things are expected to work via API boundaries (not quite the case at Google).

2 days ago3vidence

Yes, agentic coding works and has massive value. No, you can't just deploy code unreviewed.

Still takes much less time for me to review the plan and output than write the code myself.

2 days agostavros

> much less time for me to review the plan and output

So typing was a bottleneck for you? I’ve only found this true when I’m a novice in an area. Once I’m experienced, typing is an inconsequential amount of time. Understanding the theory of mind that composes the system is easily the largest time sink in my day to day.

2 days agoznsksjjs

I don't need to understand the theory of mind, I just tell it what to compose. Writing the actual lines after that takes longer than not writing them!

2 days agostavros

What do you mean you don’t need to understand? So what do you do when there’s a bug that an LLM can’t fix?

If your bottleneck is typing the code, you must be a junior programmer.

2 days agoolig15

I don't need to understand the theory of mind because I don't have the LLM design the code, I tell it what the design is. If I need something, I can read the functions I told it to implement, which is really simple.

a day agostavros

They said that they don't need to understand the LLM's theory of mind. I think that's crystal clear.

If there is a bug, it's vastly more likely that Opus 4.5 will spot it before I can.

Do you know one of the primary signifiers of a senior developer? Effective delegation.

Typing speed has nothing to do with any of this.

2 days agopeteforde

It's not about understanding the LLM's theory of mind - the direct quote was

> Understanding the theory of mind that composes the system

i.e. the logic underpinning how the system works

a day agominingape

You are one of several people in this thread who clearly skipped their Descartes readings.

a day agopeteforde

My professional workflow with Claude Code goes as follows.

I call it "moonwalk" because, when throwing away the intermediate vibe-coded prototype code in the middle, it feels like walking backwards while looking forward.

- Check out a spike branch

- Vibe code until prototype feels right.

- Turn prototype into markdown specification

- Throw away vibe'd code, keep specification

- Rebase specification into main, check out main

- Feed specification to our XP/TDD agents

- Wait, review a few short iterations if any

- Ship to production

This allows me to get the best of vibe-coding (exploring, fast iterating and dialing-in on the product experience) and writing production-grade code (using our existing XP practices via dedicated CC sub-agents and skills.)

a day agointerleave

Do not blame the tools. Given a clear description (overall design, the various methods to add, inputs, outputs), Google Antigravity often writes better zero-shot code than an average human engineer: consistent checks for special cases, local optimizations, extensive comments, thorough test coverage. As for reviews, the real focus is reviewing your own code no matter which tools you used to write it, vi or an agentic AI IDE, not someone else reviewing your code. The latter is a safety/mentorship tool in the best circumstances, and all too often just an excuse for senior architects to assert their dominance and justify their own existence at the expense of causing unnecessary stress and delaying getting things shipped.

Now in terms of using AI, the key is to view yourself as a technical lead, not a people manager. You don't stop coding completely or treat underlying frameworks as a black box, you just do less of it. But at some point fixing a bug yourself is faster than writing a page of text explaining exactly how you want it fixed. Although when you don't know the programming language, giving pseudocode or sample code in another language can be super handy.

2 days agocat_plus_plus

When LLMs transition from predatory pricing to rent seeking, the enthusiasm will evaporate.

9 hours agopatsplat

That is actually quite similar to my experience using multiple AI coding agents, including Codex and Claude Code. There is an initial phase where things go very well, but then things start to get slower and I feel like I'm stuck in a loop trying to get the agents to fix things without breaking anything else.

I find them most useful for making prototypes to show clients who are unimpressed with a presentation/document, but I end up doing most of the implementation myself. Which is fine.

a day agomomobassit

I have had similar questions, and am still evaluating here. However, I've been increasingly frustrated with the sheer volume of anecdotal evidence from yea- and naysayers of LLM-assisted coding. I have personally felt increased productivity at times with it, and frustration at others.

In order to research this better, I built (ironically, mostly vibe coded) a tool to run structured "self-experiments" on my own usage of AI. The idea is that I set up a bunch of hypotheses I have around my own productivity/fulfillment/results with AI-assisted coding. The tool lets me establish those, then run "blocks" where I test a particular strategy for a time period (default 2 weeks). So for example, I might have a "no AI" block followed by a "some AI" block followed by a "full agent, all-in AI" block.

The tool is there to make doing check-ins easier, basically a tiny CLI wrapper around journaling that stays out of my way. It also does some static analysis on commit frequency, code produced, etc. but I haven't fleshed out that part of it much and have been doing manual analysis at the end of blocks.

For me this kind of self-tracking has been more helpful than hearsay, since I can directly point to periods where it was working well and try to figure out why or what I was working on. It's not fool-proof, obviously, but for me the intentionality has helped me get clearer answers.

Whether those results translate beyond a single engineer isn't a question I'm interested in answering and feels like a variant of developer metrics-black-hole, but maybe we'll get more rigorous experiments in time.

The tool is open source here (there may be bugs, I've only been using it a few weeks): https://github.com/wellwright-labs/devex

2 days agodevalexwells

I had the same question recently! So I did an experiment to see if I could create something of value using agentic coding.

I made the world's fastest and most accurate JSON Schema parser.

https://github.com/sberan/tjs

But, nobody seems to care. The repo only has 18 stars and my Show HN post got no upvotes. I'm not sure what to take away from that.

a day agoZitchDog

The short answer to your question is “yes”. I’ve been a part of projects that largely leveraged agentic coding tools and produced net positive results. BUT, we didn’t skip code review and we didn’t lower the bar for what good code looks like.

In my personal experience (working as part of a team and not a solo dev), good documentation and well-documented/enforced practices can produce great results. That said, it’s not 100% perfect but neither are humans.

a day agoetothet

> with the claim that we should move from “validating architecture” to “validating behavior.” In practice, this seems to mean: don’t look at the code; if tests and CI pass, ship it.

But... tests and CI are also code. It may be buggy, it may not cover enough, etc. It is also likely written by an LLM in this scenario. So, it's more like a move from “validating architecture” to “LLM-based self-validating”

a day agofreetonik

My personal experience, agentic coding produces brittle code. Often it works but it violates every opinionated fiber of coding style in me as a developer, even when tuned to my style. It does not give me code that I want to maintain long term. But did I mention it often works. If you treat it like code, written by a petulant toddler, that needs review and refactoring then it works well. It’s also pretty good at code reviewing your code, finding subtle mistakes.

a day agoJ8K357R

I can just tell you, that my whole website with the complex logic and systems (based on Astro), which produce more scientific metadata in html code than Nature journal, was spec-coded by me using Github Copilot. But this is my pet project, not an enterprise battle tested one with thousands of users, but it still works: https://david-osipov.vision/en/

a day agoDavid_Osipov

Depending on the risk profile of the project, it absolutely works, with amazing productivity gains. And the agent of today is the worst agent it will ever be, because tomorrow it's going to be even better. I am finding amazing results with the ideate -> explore -> plan -> code -> test loop.

2 days agojasondigitized

I have started to use it to write small throwaway things. Like a standalone debug shader that can display all this state on top of this image in real time. Not in a million years would I have spent the time to mess with fonts in a shading language or bring in an immediate-mode GUI framework or such. Codex could one-shot that kind of thing, and the blast radius is one file that is not part of the project. Or a separate Python program that implements this core logic and double-checks my thinking. I am not a professional programmer though.

a day agoplastic3169

It works on my repos ¯\_(ツ)_/¯

an hour agofullstick

Do you need evidence that autocomplete makes you productive? It's stochastically useful. That also means a certain percentage of it will lead to garbage that could easily negate all the benefits. The lower you go down the stack, the trickier it gets. We as humans are so obsessed with AI right now it's unbelievable. Lots of attention could be spent on other areas of innovation instead of obsessing over LLMs.

a day agoconqrr

Just watch the first 2 mins of this and try to keep a straight face.

https://www.youtube.com/watch?v=4OmlGpVrVtM

You'll have your answer.

PS: don't try to use his website/apps, half of it is broken... and he has generated a 'jobs' page on the main app's website which made me laugh so hard I got a coughing fit.

a day agokeyle

Yep, it works. Like anything, getting the most out of these tools is its own (human) skill.

With that in mind, a couple of comments - think of the coding agents as personalities with blind spots. A code review by all of them and a synthesis step is a good idea. In fact currently popular is the “rule of 5” which suggests you need the LLM to review five times, and to vary the level of review, e.g. bugs, architecture, structure, etc. Anecdotally, I find this is extremely effective.
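As a rough illustration of that "rule of 5" idea (not anyone's official workflow), the loop can be as simple as running the same diff through several narrow review passes and then a synthesis pass; the `review` function below is a placeholder for whichever model or CLI you actually call, and the focus list is just an example.

    def review(diff: str, focus: str) -> str:
        """Placeholder: ask a model to review `diff` with one narrow focus."""
        raise NotImplementedError("call your review model/CLI of choice here")

    FOCUSES = [
        "correctness and bugs",
        "architecture and module boundaries",
        "tests: coverage, assertions, missing edge cases",
        "security and input handling",
        "naming, duplication, readability",
    ]

    def rule_of_five(diff: str) -> str:
        # Several narrow passes tend to catch more than one broad "review this" prompt.
        findings = [f"[{focus}]\n{review(diff, focus)}" for focus in FOCUSES]
        # Final pass merges the reports and drops duplicates / false positives.
        return review("\n\n".join(findings),
                      "synthesize these reviews into one prioritized list")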

Right now, Claude is in my opinion the best coding agent out there. With Claude code, the best harnesses are starting to automate the review / PR process a bit, but the hand holding around bugs is real.

I also really like Yegge’s beads for LLMs keeping state and track of what they’re doing — upshot, I suggest you install beads, load Claude, run ‘!bd prime’ and say “Give me a full, thorough code review for all sorts of bugs, architecture, incorrect tests, specification, usability, code bugs, plus anything else you see, and write out beads based on your findings.” Then you could have Claude (or codex) work through them. But you’ll probably find a fresh eye will save time, e.g. give Claude a try for a day.

Your ‘duplicated code’ complaint is likely an artifact of how codex interacts with your codebase - codex in particular likes to load smaller chunks of code in to do work, and sometimes it can get too little context. You can always just cat the relevant files right into the context, which can be helpful.

Finally, iOS is a tough target — I’d expect a few more bumps. The vast bulk of iOS apps are not up on GitHub, so there’s less facility in the coding models.

And any front end work doesn’t really have good native visual harnesses set up, (although Claude has the Claude chrome extension for web UIs). So there’s going to be more back and forth.

Anyway - if you’re a career engineer, I’d tell you - learn this stuff. It’s going to be how you work in very short order. If you’re a hobbyist, have a good time and do whatever you want.

2 days agovessenes

I still don't get why beads needs a daemon, or a db. After a while of using 'bd --no-daemon --no-db' I got sick of it and switched to beans, and my agents seem to be able to make use of it much better: on the one hand it's directly editable by them as it's just markdown, on the other hand the CLI still gives them structure and makes the thing queryable.

2 days agoCjHuber

Steve runs beads across like 100 coding environments simultaneously. So, you need some sort of coordination, whether that's your db or a daemon. Realistically with 100 simultaneous connections, I would probably reach for both myself. I haven't tried beans, thanks for the reference.

13 hours agovessenes

Yeah, that does make sense: these choices are related to it being a big part of gastown. Still, I feel it would be much more sensible to have a different abstraction separating beads' core features from the coordination layer.

7 hours agoCjHuber

Personally I use "agent mode" in Cecli for almost everything - I don't know about other AI coding agents but you can easily set up tests to run and validate the output.

Since MCP came out, the quality of the code has improved, because there is always context7 and fetch to look up syntax.

But yes at some point you need to look at the code yourself just to be sure

a day agotomjuggler

I am in the same boat as you.

The only positive agentic coding experience I had was using it as a "translator" from some old unmaintained shell + C code to Go.

I gave it the old code, told it to translate to Go. I pre-installed a compiled C binary and told it to validate its work using interop tests.

It took about four hours of what the vibecoding lovers call "prompt engineering" but at the end I have to admit it did give me a pretty decent "translation".

However for everything else I have tried (and yes, vibecoders, "tried" means very tightly defined tasks) all I have ever got is over-engineered vibecoding slop.

The worst part of it is that because the typical cut-off window is anywhere between 6–18 months prior, you get slop that is full of deprecated code, because there is almost always a newer/more efficient way to do things. Even in languages like Go. The difference between an AI-slop answer for Go 1.20 and a human-coded Go 1.24/1.25 one can be substantial.

2 days agotraceroute66

Over the past 3 months I have written a piece of software I had wanted to build for the past 7-8 years. I have over 6000 pages of conversations between me and ChatGPT, Claude and Gemini, and I'm hoping to get a patent soon. It consists of over 260k LOC, works well, is architected to support many different industries with little more than configuration changes, and has very good headed and headless QA coverage. I have spent about 16-18 hours a day on it because I am so bought into the idea and the outcome I am getting. My patent lawyer suggested getting a provisional patent on my work. So for me it works.

2 days agobigcloud1299

The 6000 pages have been converted into 1 million+ lines of product specs and granularly broken-down work in phases and tasks. All tracked in the repo.

2 days agobigcloud1299

What does it do?

2 days agobwestergard

A coding agent is a perfect simulation of a junior developer working under you: a developer who will tell you “yes, I can do that” about any language and any problem, and will never ask you any questions, trying very hard to appear competent.

Your job is to put them in constraints and give granular and clear tasks. Be aware that this junior developer has only a very basic knowledge of architecture.

The good part is that it does not simulate the bit where the developer tries to shift blame or pin it on you, because you’re to blame at all times.

2 days agodostick

It also doesn't simulate the part where the junior actually learns and is less clueless 6 months from now, unfortunately.

2 days agoyarn_

That's not quite true, actually: context windows have increased and models have definitely gotten smarter over the last year. So in a way that part is being simulated too.

2 days agodysleixc


I've been using agentic coding tools for the past year and a half, and the pattern I've observed is that they work best when they're treated as a very fast, very knowledgeable junior developer, not as a completely autonomous engineer.

When I try to give agents broad architectural tasks, they flounder. When I constrain them to small, well-defined units of work within an existing architecture, they can produce clean, correct code surprisingly often.

2 days agokr1shna4garwal

In my experience agentic coding is still bad at context management.

I can get amazing results from a chatbot-based workflow where I manually provide any context needed; if the agent can happily pull files into context on its own, it tends to pull in too much or completely irrelevant stuff, and the quality of the output suffers.

It's still significantly faster in many cases than writing the code by hand myself.

a day agojpc0

It works in the sense that there are lots of professional (as in they earn money from software engineering) developers out there who do work of exactly the same quality. I would even bet they are the majority (or at least were prior to late 2024).

2 days agolostmsu

General AI skeptic and manager reporting in.

Have had success at work, real value, real results.

Example: extracting a bunch of data from a tool we’re required to use at my company for getting a bunch of performance metrics. The data is useful but the interface is awful and it’s impossible to pull out trends and spot the real information I need from it. So, I threw Claude at it after months of dreaming of being able to better use the data. It generated for me in a few minutes all the data I could hope for in a CSV I was able to load into another tool that gave me deep insights almost immediately and allowed me to go make some different decisions I otherwise wouldn’t have.

What I did:

1. I have created and curated a set of sub-agents and commands/workflows for building things for me.

2. I used my build command, which details a workflow for refining, planning, implementing, code reviewing, testing, then conducting a final “product review” to determine if original requirements were met.

3. I then review the code myself before running it.

The code was solid (I’m also a very strong engineer and have tailored my agents and workflows to generate code I’d be comfortable with).

Another example: one of my teams went on a journey to convert one of our internal legacy frontend applications to a newer shared component library and eliminate old cruft that we inherited when we inherited the codebase.

The team was able to get this massive UI rewrite done in under two weeks, the updated code was better than the original code (it was all React to React, TypeScript to TypeScript), and we eliminated (literally) hundreds of thousands of lines of old hand-written over-abstracted code. Bundle sizes dramatically down, higher performance, more modern UX, and the thing is in production and working. Real value: faster product iteration in this now far smaller and easier-to-work-with codebase, far less technical debt, and faster builds, etc.

The team only used GitHub Copilot for this and it required a bunch of iteration and starting over with different instructions, but they got there and still managed to save a ridiculous amount of time; hand-writing the UI migration would have been one of those multi-month projects that went over schedule (I’ve seen and lived that movie many times before).

I’m still very skeptical of all the hype but I’ve seen very real, very valuable results out of this stuff.

edit: formatting

a day agojodacola

Since we are on this topic, how would I make an agent that does this job:

I am writing automation software that interfaces with a legacy Windows CAD program. Depending on the automation, I just need a picture of the part. Sometimes I need part thickness. Sometimes I need to delete parts. Etc... It's very much interacting with the CAD system and checking the CAD file or output for desired results.

I was considering something that would take screenshots and send them back for checks. Not sure what platforms can do this. I am stumped by how Visual Studio works with this; there are a bunch of pieces like servers, agents, etc...

Even a how-to link would work for me. I imagine this would be extremely custom.

2 days agoPlatoIsADisease

No joke you should ask one of the latest thinking models to plan this out with you.

2 days agodjeastm

What controls the legacy CAD app? Are you using AutoLISP? or VB scripting? Or something else?

2 days agoWillAdams

I'm using VB.net with visual studio.

2 days agoPlatoIsADisease

As far as I can tell, there are exactly 3 use cases that have demonstrably worked with AI, in the sense that their stakeholders (not the AI companies, the users) swear it works.

1. training a RAG on support questions for chat or documentation, w/good material

2. people doing GTM work in marketing, for things like email automation

3. people using a combination of expensive tools - Claude + Cursor + something else (maybe n8n, maybe a custom coding service) - to make greenfield apps

2 days agojulianeon

I know of a 4th: line-level inference. I love how it works in IDEA; 30% of the time I just slap TAB to accept generated code, and if it is wrong I just continue writing as usual. It avoids most drawbacks: you don't pay for it (it runs on the local CPU), and the "review" is instant while still providing a boost to productivity.

a day agozvqcMMV6Zcr

A $200/month Cursor plan spent on Opus 4.5 calls is not expensive compared to the silly amount of work it will do if you make proper use of plan/agent/debug cycles.

2 days agopeteforde

I have a small-ish vertical SaaS that is used heavily by ~700 retail stores. I have enabled our customer success team to fix bugs using GitHub copilot. I approve the PRs, but they have fixed a surprising number of issues.

2 days agojaxn

Define "works"

Easiest way to get value is building tests. These don't ship.

You can get value from LLM as an additional layer of linting. Reviews don't ship either.

You can use LLMs for planning. They can quickly scan across the codebase, catch side effects of proposed changes, or do gap analysis against the desired state.

Arguing that agentic coding must be all on or all off seems very limiting.

2 days agoavereveard

I think it is great for experimenting, and proving concepts. Alphas and personal projects, not shipped code.

I've been working on wasm sandboxing and automatic verification that code doesn't have the lethal trifecta and got something working in a couple of days.

I'd like to do a clean rewrite at some point.

a day agorando77

The way I see it, is that for non-trivial things you have to build your method piece by piece. Then things start to improve. It's a process of... developing a process.

Write a good AGENTS.md (or CLAUDE.md) and you'll see that code is more idiomatic. Ask it to keep a changelog. Have the LLM write a plan before starting code. Ask it to ask you questions. Write abstraction layers it (along with the fellow humans of course) can use without messing with the low-level detail every time.

In a way you have to develop a framework to guide the LLM behavior. It takes time.

2 days agotacone

For me it's a major change for personal projects. That said, since about 3 months ago VS Code GitHub Copilot has been remarkably stable in working with an existing code base, and I could implement changes to those projects that would otherwise have taken me substantially longer. So at least for this use case it's there. Hidden game changers are Gradio/Streamlit for easy UIs.

2 days agojsemrau

If you're building something new, stick with languages/problems/projects that have plenty of analogues in the opensource world and keep your context windows small, with small changes.

One-shotting an application that is very bespoke and niche is not going to go well, and same goes for working on an existing codebase without a pile of background work on helping the model understand it piece by piece, and then restricting it to small changes in well-defined areas.

It's like teaching an intern.

2 days agoikidd

All of this is subjective. What does it mean for code to be high quality?

If you can express that in a form that can be easily tested, you can just instruct an agentic coding tool to do something about it. Most of my experience is with codex. Every time I catch it doing something I don't like, I try to codify it in a skill, in my Agents.md, or in a test. I've been using codex specifically to work on addressing technical debt in my own code bases. There's a lot of stuff I never got around to fixing that I'm now actually addressing, because it stopped being a monster project that would take weeks. You can actually nudge a code base in the right direction with agentic coding tools.

The same things that make it hard for people to iterate on code bases (complexity, technical debt, poor architectural decisions, etc.) also make it hard for LLMs to work on code bases. So, as soon as you start working on making those things better, you might get better results.

If you have a lot of regressions when iterating with an LLM, you don't have good enough regression tests. If code produces runtime type errors, maybe use something with a better type checker that can remove those bugs before they happen. If you see a lot of duplication, tell it to do something about it and/or use code quality tools that flag such issues and tell it to address those issues. This stuff requires a bit of discipline and skill. But they are fixable things. And the usual excuse that you can't be bothered doesn't apply here; just make the coding tools fix this for you.
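To make that concrete, here is a hypothetical sketch of turning one such preference into a cheap regression test the agent has to keep green; the src/domain layout and the "no direct HTTP calls from domain code" rule are invented for illustration, not a claim about any particular codebase.

    import pathlib
    import re

    # Made-up rule for illustration: domain code must not call HTTP directly.
    FORBIDDEN = re.compile(r"^\s*(import requests|from requests\b)", re.MULTILINE)

    def test_domain_layer_does_not_talk_http():
        offenders = [
            str(path)
            for path in pathlib.Path("src/domain").rglob("*.py")
            if FORBIDDEN.search(path.read_text(encoding="utf-8"))
        ]
        assert not offenders, f"domain modules importing an HTTP client: {offenders}"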

As for evidence, the amount of dollars being spent by well respected people in the industry on these tools is increasing. That might not be the evidence you like but it's a clear indication that people are getting some value out of these tools.

I'm definitely getting more predictable results. I find myself merging most proposed changes after a few iterations. The percentage is trending up in the last months. I can only speak for myself. But essentially everybody I know and respect is using this stuff at this point. With very mixed results. But people are getting shit done. I think there are lots of things to improve with these tools. I'd like them to be faster and require less micro management. I'd like them to work across multiple repositories and issue trackers instead of suffering from perpetual tunnel vision. Mostly when I get bad results, it's a context problem. Some of these things are frustrating to fix. But in the end this is about good feedback loops, not about models magically getting what you want.

a day agojillesvangurp

I’ve heard coding agents best described as a fleet of junior developers available to you 24/7 and I think that’s about right. With the added downside that they don’t really learn as they go so they will forever be junior developers (until models get better).

There are projects where throwing a dozen junior developers at the problem can work but they’re very basic CRUD type things.

2 days agoafavour

Or you give them all specific little tasks that you think out. And then review their work, of course. So yeah, you still need to do a lot of work.

2 days agogitaarik


Personally, I've had great gains in terms of small personal tools built on top of CLIs, and Emacs config. I'm also able to deliver PRs that demonstrate a general principle to someone I manage even if they work outside of my most comfortable stacks. But that's not what you asked about.

I don't have direct evidence of exactly what you're looking for (particularly the part about "someone responsible for the architecture to sign off"). Sticking strictly to that last caveat may prevent you from receiving some important signal.

> the claim that we should move from “validating architecture” to “validating behavior.”

I think these people are on the right track and the reason I think that is because of how I work with people right now.

I manage the work of ~10 developers pretty closely and am called on for advice to another ~10, while also juggling demanding stakeholders. For a while now, I've only been able to do spot checks on PRs myself. I don't consider that a major part of my job anymore. The management that is most valuable is:

1) Teaching developers about quality so that they start with better code, and give better reviews to each other

2) Teaching people to focus and move in small steps

3) Investing in guardrails

4) Metrics, e.g. it doesn't matter what code is merged, it doesn't matter if a "task" is "shipped", what matters is if the metrics say that we've had the result we expected.

As I acknowledge how flimsy my review process is, my impulse is to worry about architecture and security. But metrics and guardrails exist for those things too. Opinionated stacks help, for instance SQL injection opportunities look different enough from "normal" Rails to mean that there are linters that can catch many problems, and the linters are better than I am at this job.

Some of these tools are available for agents just as they are for humans. Some of them are woefully bad or lack good ergonomics for agents, but I wouldn't bet against them becoming better.

I agree that agentic coding changes code review, but I don't think that has to inevitably / long-term mean worse.

> half of my time went into fixing the subtle mistakes it made or the duplication it introduced

A cold hard evaluation of the effectiveness of agentic coding doesn't care about what percentage of time went into fixing bad code; it cares about the total time.

That said, I find that making an agent move in many small steps (just how I would advise a human) creates far less rework.

a day agozingar

I can't speak for other use cases, but it's been ok (not great, but tolerable) for unit test generation and documentation.

a day agonotepad0x90

I feel there are two contradictory statements in this post.

> Is there evidence that agentic coding works?

Yes, plenty, tons, and growing every single day: people are producing code and tooling that works for them and is providing them value. My LinkedIn is even starting to show me non-programmers knocking up web front ends for us, and my brother, who is a builder, is now drawing up requirement specs for software!

> Is the code high-quality

Only if you are really careful, and constantly have the human in the loop guiding the agent, and it's not easy. This is where domain expertise and experience come in.

Do I think it's possible? Yes. Do I think there are a ton of good examples out there? Absolutely not.

a day agoCurleighBraces

I've used Mistral via duck.ai for my hobby project a few times to convert C++ code snippets to C, which it did well and made it more clear what the code was actually doing.

a day agoJamesTRexx

Try harnesses other than Codex.

I've had more success with review tools, rather than expecting the agent to get the code quality right the first time.

Current workflow:

1. Specs/requirements/design, outputting tasks

2. Implementation, outputting code and tests

3. Run review scripts/debug loops, outputting tasks

4. Implement tasks

5. Go back to 3

The quality of the specs, tasks, and review scripts makes a big difference.

One of the biggest things that improves the results is getting a feedback loop from what the app actually does back to the agent: good logs, being able to interact and take screenshots a la Playwright, etc.
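A minimal sketch of that kind of feedback loop for a web UI, using Playwright for Python (the URL and file name are placeholders); the screenshot and any captured output then go straight back into the agent's next prompt.

    from playwright.sync_api import sync_playwright

    def capture_app_state(url: str = "http://localhost:3000") -> str:
        # Render the running app, grab a screenshot and the page title so the
        # agent can check its change against what the app actually does.
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url)
            page.screenshot(path="after_change.png", full_page=True)
            title = page.title()
            browser.close()
        return title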

Guidelines and guardrails are best if they're tools that the agent runs, or that run automatically to give feedback.

2 days ago8note

Take a look at Github accounts of people pushing it, and you will see a clear pattern - it works, apparently, for a _very_ specific subset of people.

a day agotw061023

You are confusing concepts. Does a book or textbook make the code better? An LLM is a book, only more flexible.

a day agoWestern0

I've had AI write ~100% of my code for the last 7 months, but I acted as the "agent" so the AI had very high levels of direction, and I approved all code changes at every step, including during debugging.

Mostly Gemini Pro 2.5 (and now Gemini Pro 3) and mostly Clojure and/or Java, with some JavaScript/Python. I require Gemini's long context size because my approach leans heavily on in context learning to produce correct code.

I've recently found Claude Code with Opus 4.5 can relieve me of some of the "agent" stuff I've done, allowing the AI to work for 10-20 minutes at a time on its own. But for anything difficult, I still do it the old way, intervening every 1-3 minutes.

Each interaction with the AI costs at least $1, usually more (except Claude Code, where I use the $200/month plan), so my workflow is not cheap. But it 100% works, and I developed more high-quality code in 2025 than in any previous year.

5 hours agoerichocean

Don't use it myself. But I have a client who uses it. The bugs it creates are pretty funny. Constantly replacing parts of code with broken or completely incorrect things. Breaking things that previously worked. Deleting random things.

2 days ago7777332215

I think of coding agents more like "typing assistants" than programmers. If you know exactly what and how to do what you want, you can ask them to do it with clear instructions and save yourself the trouble of typing the code out.

Otherwise, they are bad.

2 days agonathan_compton

well I made a machine to help my machine's work on their machine building through my knowledge management machine while being prioritized from another machine

but the machines keep going down, and we haven't cleaned enough of the bugs in the meta machine to open source it yet.

so...kinda?

a day agothedudeabides5

I’ve started asking people to show me their products. I don’t get good answers so far. Maybe that’s just my small sample size / bubble of people I interact with.

My thinking is that showcasing / interacting with products built by LLMs will tell you a lot about code quality AND maintenance.

I mean it’s easy to spin up static websites. It’s a whole another thing to create, maintain and iteratively edit/improve an actual digital product over time. That’s where the cracks will show up.

That also might be the core problem of agentic code: it’s fairly fresh so you won’t see products that have been maintained for long.

Thus, my current summary is: it’s great for prototyping and probably for specific tasks like test case generation but it’s not something you want to use when working on a multiyear product/project.

a day agobaxtr

> The product has to work, but the code must also be high-quality.

For me this has completely changed. The code needs to work. Bonus points when it is easy for a human to follow and has comments. But I don't do a full review anymore. I skim it and if I don't see an obvious flaw, it's lgtm. I also don't look into the bytecode / assembly to check whether the compiler did a good job.

a day ago_ink_

yes, +1 for the assembly analogy. I absolutely feel the same about it

a day agostubbi

Any real senior devs here using agentic coding?

2 days agomdavid626

Yes, constantly.

I don’t know what I do differently, but I can get Cursor to do exactly what I want all the time.

Maybe it’s because it takes more time and effort, and I don’t connect to GitHub or actual databases, nor do I allow it to run terminal commands 99% of the time.

I have instructions for it to write up readme files of everything I need to know about what it has done. I’ve provided instructions and created an allow list of commands so it creates local backups of files before it touches them, and I always proceed through a plan process for any task that is slightly more complicated, followed by plan cleanup, and execution. I’m super specific about my tech stack and coding expectations too. Tests can be hard to prompt, I’ll sometimes just write those up by hand.
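For what it's worth, the "local backups before it touches them" step can be as small as a helper script on that allow list. This is just a sketch with made-up paths, not anything Cursor provides out of the box.

    import pathlib
    import shutil
    import time

    def backup(path: str, backup_dir: str = ".agent_backups") -> pathlib.Path:
        # Copy the file aside (with a timestamp) before the agent edits it,
        # so any change can be diffed or rolled back by hand.
        src = pathlib.Path(path)
        dest_dir = pathlib.Path(backup_dir)
        dest_dir.mkdir(exist_ok=True)
        dest = dest_dir / f"{src.name}.{int(time.time())}.bak"
        shutil.copy2(src, dest)  # copy2 preserves timestamps/permissions
        return dest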

Also, I’ve never had to pay over my $60 a month pro plan price tag. I can’t figure out how others are even doing this.

At any rate, I think the problem appears to be the blind commands of “make this thing, make it good, no bugs” and “this broke. Fix!” I kid you not, I see this all the time with devs. Not at all saying this is what you do, just saying it’s out there.

And “high quality code” doesn’t actually mean anything. You have to define what that means to you. Good code to me may be slop to you, but who knows unless it is defined.

2 days agodpcan

Works pretty great for me, especially spec-driven development using OpenSpec:

- Cleaner code

- Easily 5x speed minimum

- Better docs, designs

- Focus more on the product than the mechanics

- More time for family

2 days agorecroad

Really interested in your workflow using OpenSpec. How do you start off a project with it? And what does a typical code change look like?

2 days agokitd

Honestly, I only use coding agents when I feel too lazy to type lots of boilerplate code.

As in "Please write just this one for me". Even still, I take care to review each line produced. The key is making small changes at a time.

Otherwise, I type out and think about everything being done when in ‘Flow State’. I don't like the feeling of vibe coding for long periods. It completely changes the way work is done, it takes away agency.

On a bit of a tangent, I can't get in Flow State when using agents. At least not as we usually define it.

2 days agohighspeedbus

I did the same experiment as you, and this is what I learned:

https://www.linkedin.com/pulse/concrete-vibe-coding-jorge-va...

The bottom line is this:

* The developer stops being a developer and becomes a product designer with high technical skills.

  * This is a different set of skills than a developer or a product owner currently has. It is a mix of both, and the expectations of how agentic development works need to be adjusted.
* Agents will behave like junior developers: they can type very fast and produce something that has a high probability of working. Their priority will be to make it work, not maintainability, scalability, etc. Agents can achieve that if you detail how to produce it.

  * Working with an agent feels more like mentoring the AI than ask-and-receive.
* When I start to work on a product that will be vibe coded, I need to have clear in my head all the user stories, the code architecture, the whole system; then I can start to tell the agent what to build, and correct and annotate the code quality decisions in the md files so it remembers them.

* Use TDD: ask the agent to create the tests, and then code to the test. Don't correct the bugs yourself; make the agent correct them and explain why each is a bug, especially with code design decisions. Store those in an AGENTS.md file at the root of the project.
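To make the test-first hand-off concrete, here is the kind of failing test you might ask the agent to write first from a single spec line, before it writes any implementation. The reminder-scheduling function and the `feeding` module are hypothetical, just to show the shape of "test first, then code to the test".

    from datetime import datetime, timedelta

    # Hypothetical module the agent is then asked to implement.
    from feeding import next_feeding_time

    def test_next_feeding_is_one_interval_after_the_last_one():
        last_fed = datetime(2025, 1, 1, 8, 0)
        assert next_feeding_time(last_fed, interval_hours=6) == last_fed + timedelta(hours=6)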

There are more things that can be done to guide the agent, but I need to have the direction of the coding clear in an articulable way. On the other hand, I don't worry about implementation details like how to use libraries and APIs that I am not familiar with; the agent just writes and I test.

Currently I am working on a product and I can tell you, working no more than 10 hours a week (2 hours here, 3 there, leaving the agent working while I am having dinner with family), I am progressing, I would say, 5 to 10 times faster than without it. So yeah, it works, but I had to adjust how I do my job.

2 days agojorgeleo

> Scaling long-running autonomous coding https://news.ycombinator.com/item?id=46624541

2 days agosaikatsg

This is exactly the issue I have with what I'm seeing around: lots of "here's something impressive we did" but nearly nothing in terms of how it was actually achieved in clear, reproducible detail.

2 days agoterabytest

I'm not sure OP is looking for evidence like this. There are many optimistic articles from people or organizations who are selling AI products, AI courses, or AI newsletters.

2 days agorzmmm

Yes. Can I share it? No, sadly. It definitely works - but I think sometimes expectations are too high is all.

2 days agotom_m

> The product has to work, but the code must also be high-quality.

I understand and admire your commitment to code quality. I share similar ideals.

But it's 2026 and you're asking for evidence that agentic coding works. You're already behind. I don't think you're going to make it. Your competitors are going to outship you.

In most cases, your customers don't care about your code. They only want something that works right.

2 days agorunjake

My main rule is never to commit code you don’t understand because it’ll get away from you.

I employ a few tricks:

1- I avoid auto-complete and always try to read what it does before committing. When it is doing something I don’t want, I course correct before it continues

2- I ask the LLM questions about the changes it is making and why. I even ask it to make me HTML schema diagrams of the changes.

3- I use my existing expertise. So I am an expert Swift developer, and I use my Swift knowledge to articulate the style of what I want to see in TypeScript, a language I have never worked in professionally.

4- I add the right testing and build infrastructure to put guardrails on its work.

5- I have an extensive library of good code for it to follow.

2 days agojulianozen

"Agentic coding works" and "ship without review" are two different claims. The first is true for constrained tasks, the second is Silicon Valley brain rot. I use Claude Code daily for DevOps automation and data migrations. Every output gets reviewed. It saves me hours, not judgment.

12 hours agovarshith17

Yes. Three complex websites in production, with basically 99.99% of the code written by AI. One of them is a Google-like vertical web crawler, parser and indexer, with all the complexity and challenges that come with it. AI coding works.

a day agogeldedus

> Lately I’ve seen a push toward minimal or nonexistent code review, with the claim that we should move from “validating architecture” to “validating behavior.”

This is very bad. A programmer's output is code; when we begin to measure the output of the code as the primary product of a programmer's work, we will fall into a death spiral of unmaintainable code. I still strongly believe we write code for other humans, not for machines. Yes, yes, we need to solve problems, and you may argue that is the true measure of code. However, this argument ignores maintenance and continued development requirements.

21 hours agogreazy

Treat it as a pair programmer. Ask it questions like "How do I?", "When I do X, Y happens, why is that?", "I think Z, prove me wrong" or "I want to do P, how do you think we should do it?"

Feed it little tasks (30 s-5 min) and if you don't like this or that about the code it gives you either tell it something like

   Rewrite the selection so it uses const, ? and :
or edit something yourself and say

   I edited what you wrote to make it my own,  what do you think about my changes?
If you want to use it as a junior dev who gets sent off to do tickets and comes back with a patch three days later that will fail code review be my guest, but I greatly enjoy working with a tight feedback loop.
2 days agoPaulHoule

When you have a hammer, everything looks like a nail. Ad nauseam.

AI has made it possible for me to build several one-off personal tools in a matter of a couple of hours and has improved my non-tech life as a result. Before, I wouldn't even have considered such small projects because of the effort needed. It's been a relief not to have to even look at the code, assuming you can describe your needs in a good prompt. On the other hand, I've seen vibe-coded codebases with excessive layers of abstraction and performance issues that came from a possibly lax engineering culture of not doing enough design work upfront before jumping into implementation. It's a classic mistake that is amplified by AI.

Yes, average code itself has become cheap, but good code still costs, and amazing code, well, you might still have an edge there for now, but eventually, accept that you will have to move up the abstraction stack to remain valuable when pitted against an AI.

What does this mean? Focus on core software engineering principles, design patterns, and understanding what computer is doing at a low level. Just because you're writing TypeScript doesn't mean you shouldn't know what's happening at the CPU level.

I predict the rise in AI slop cleanup consultancies, but they'll be competing with smarter AIs who will clean up after themselves.

2 days agoammmir

> Last weekend I tried building an iOS app for pet feeding reminders from scratch.

Just start smaller. I'm not sure why people try to jump immediately to creating an entire app when they haven't even gotten any net-positive results at all yet. Just start using it for small time saving activities and then you will naturally figure out how to gradually expand the scope of what you can use it for.

2 days agodlandis

The default output from AI is much like the default output from experienced devs prioritizing speed over architecture to meet business objectives. Just like experienced devs, LLMs accept technical debt as leverage for velocity. This isn't surprising - most code in the world carries technical debt, so that's what the models trained on and learned to optimize for.

Technical debt, like financial debt, is a tool. The problem isn't its existence, it's unmanaged accumulation.

A few observations from my experience:

1. One-shotting - if you're prompting once and shipping, you're getting the "fast and working" version, not the "well-architected" version. Same as asking an experienced dev for a quick prototype.

2. AI can output excellent code - but it takes iteration, explicit architectural constraints, and often specialized tooling. The models have seen clean code too; they just need steering toward it.

3. The solution isn't debt-free commits. The solution is measuring, prioritizing, and reducing only the highest risk tech debt - the equivalent of focusing on bottlenecks with performance profiling. Which code is high-risk? Where's the debt concentrated? Poorly-factored code with good test coverage is low-risk. Poorly-tested code in critical execution paths is high-risk. Your CI pipeline needs to check the debt automatically for you just like it needs to lint and check your tests pass.

I built https://github.com/iepathos/debtmap to solve this systematically for my projects. It measures technical debt density to prioritize risk, but more importantly for this discussion: it identifies the right context for an LLM to understand a problem without looking through the whole codebase. The output is designed to be used with an LLM for automated technical debt reduction. And because we're measuring debt before and after, we have a feedback loop - enabling the LLM to iterate effectively and see whether its refactoring had a positive impact or made things worse. That's the missing piece in most agentic workflows: measurement that closes the loop.
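
To make the before/after measurement concrete, here is a minimal sketch of the shape of that loop. The commands are placeholders: `measure-debt`, `run-agent`, and `./run-tests.sh` stand in for whatever debt scorer, coding agent, and test runner you wire in, and the `.total_score` field is a hypothetical report field.

```bash
#!/usr/bin/env bash
# Sketch of a measure -> refactor -> re-measure loop.
# `measure-debt`, `run-agent`, and `./run-tests.sh` are placeholders for
# whatever debt scorer, coding agent, and test runner you wire in.
set -euo pipefail

target=${1:-src/}

# 1. Measure debt before the change and save the report for the agent.
measure-debt "$target" > debt-before.json
before=$(jq '.total_score' debt-before.json)   # hypothetical report field

# 2. Let the agent refactor the highest-risk items listed in the report.
run-agent "Refactor the highest-risk items in debt-before.json without changing behavior."

# 3. Confirm behavior is unchanged, then re-measure.
./run-tests.sh
after=$(measure-debt "$target" | jq '.total_score')

# 4. Close the loop: only keep the change if measured debt actually went down.
echo "debt before: $before, after: $after"
if ! jq -en --argjson a "$after" --argjson b "$before" '$a < $b' > /dev/null; then
  echo "No measurable improvement; reverting."
  git checkout -- "$target"
fi
```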

To your specific concern about shipping unreviewed code: I agree it's risky, but the review focus should shift from "is every line perfect" to "where are the structural risks, and are those paths well-tested?" If your code has low complexity everywhere, is well tested (always review tests), and passes everything, then ask yourself what you actually gain at that point from investing further time in over-engineering the lesser tech debt away. You can't eliminate all tech debt, but you can keep it from compounding in the places that matter.

a day agoiepathos

For simple, standalone things it works.

a day agonextlevelwizard

My gf manages to get paid using Cursor/Copilot, despite not being able to branch herself out of a loop

In my experience Copilots work expertly at CRUD'ing inside a well structured project, and for MVPs in languages you aren't an expert on (Rust, C/C++ in my case)

The biggest demerit is that agents are increasingly trying to be "smart", using PowerShell search/replace or writing scripts to skimp on tokens, with results that make me unreasonably angry

I tried adding i18n to an old React project, and Copilot used all my credits + 10 USD because it kept shitting everything up with its maddening, idiotic use of search/replace

If it had simply ingested each file and modified them only once, it would have been cheaper

As you can tell, I am still salty about it

2 days agod0100

news.ycombinator.com/item?id=46670279

Recently there was this post, which was largely generated by Claude Code. Read it.

a day agodrcxd

1. Work in sh-tty and buggy codebases to begin with

2. Then you can't see the AI slop in mountains of existing spaghetti

3. Profit

_For more life hacks like and subscribe_

12 hours ago_rwo

I've built multiple new apps with it and manage two projects that I wrote. I barely write any code other than frontend, copy, etc.

One is a VSCode extension and has thousands of downloads across different flavors of the IDE -- won't plug it here to spare the downvotes ;)

Been a developer professionally for nearly 20 years. It is 100% replacing most of the things I used to code.

I spend most of my time while it's working testing what it's built to decide on what's next. I also spend way more time on DX of my own setup, improving orchestration, figuring out best practice guidance for the Agent(s), and building reusable tools for my Agents (MCP).

2 days agocyrusradfar

Product work and high-quality code: they don't conflict.

a day agoGo7hic

This is a WILD thread.

I had no idea so many people still clung to these lines!

Personally, I see huge value. I’ve built more in the past year than I ever have before. Ideas that used to go on a list, now get implemented in a weekend. The ones that are good, I’ve shipped. Others get played with, tested, and set aside. Not due to lack of quality, but the idea wasn’t right. Some need more thought so I’ll keep testing, and tweaking over time.

Evidence? Those projects would not exist if not for these tools. Not because I couldn’t write them, but because I WOULDN’T have written them.

But I’ve given up trying to convince people. If you don’t think it works, great. Don’t use it. That’s fine with me!

I’d urge you to keep trying though. Hopefully you have the moment as well. And if you think, “it’s ok for some things, but not what I do”, well…just don’t wait a year before trying again. Keep an open mind, and integrate the tools into your workflow everyday. Not as a one off “ok fine, I’ll test this stuff” grumpy kind of way, but in an honest iterative way where you are using the tools on a daily basis, even in a small way. For docs. Whatever. Just keep using them. Eventually you’ll see. Eventually you’ll have the moment.

One thing I think is lost in a lot of these conversations is how much FUN I am having. I’ve been coding for…well, if you count copying BASIC from books into a Timex Sinclair, about 40+ years. Maybe professionally for 30+? I haven’t had this much FUN building since back when I copied those games from that little red book of BASIC programs. These tools make even mundane work fun for me, because you need to have your workflow right. So coding becomes more like Factorio or something. Keep optimizing your setup, then simplify, then optimize, then simplify.

Syntax was always something that felt like a hurdle. I always rolled my eyes at people who get lost in the minutiae of a language. Like I remember the Perl neckbeards from way back. lol.

I learned what I needed to get the task done, then moved on. Likely forgetting 90% of the details anyway, so I could keep an eye on what I considered important. Architecture. Functionality. Separation of Concerns. Could I dive back in and figure out the details again if needed? Sure. No problem.

NOW these tools let me work at this abstraction level even more.

And I’m here for it.

Just amazing this is where we are at as an industry.

a day agopj4533

I am with you on this, although I was able to ship with Aider before, as it uses a less autonomous approach than the current wave of agentic tools.

I don't even care about abstract code quality. To me code quality means maintainability. If the agents are able to maintain the mess they are spewing out, that's quality code to me. We are decidedly not there yet though.

a day agoentropyneur

> The product has to work, but the code must also be high-quality.

I think in most cases the speed at which AI can produce code outweighs technical debt, etc.

2 days agokoakuma-chan

But the thing with debt is that it has to be paid eventually.

2 days agoGazoche

The project also has to pay off financially. We have been here before: startups used to go fast and break things so that once the MVP was validated they could slow down and fix things, or even rewrite to a new tech/architecture. Now you can validate an idea even faster with AI. And there is probably a lot of code that you write for one-time use, or throwaway internal tools, etc.

2 days agopzo

not if you get acquired

2 days agocdelsolar

Is your argument that it's now someone else's problem? That it must be paid, just by someone else? Thanks, I hate it.

2 days agoBoxxed

You will probably be able to just keep throwing AI at it in the coming years, as memory systems improve, if not already.

2 days agokoakuma-chan
[deleted]
2 days ago

This is anecdotal and maybe reflects what other people are seeing.

If you know the field you want it to work in, then it can augment what you do very well.

Without that they all tend to create hot garbage that looks cool to a layperson.

I would also avoid getting it to write the whole thing up front. Creating a project plan and requirements can help ground them somewhat.

a day agoEagnaIonat

Your metric for “getting it to work” is wrong. Developing software is a means to an end, not a goal in and of itself.

The simplest metric you should be tracking is: has it generated income?

In this sense my use of agentic coding has performed very well. People asking for evidence and repos are a little naive to how capitalism works in the real world. If I’m making money on something I’m not going to let you copy it, and I’m sure as hell not going to devalue it by publicising that it was built by AI.

a day agomatt3D

Claude Cowork was apparently completely written by Claude Code. So this appears to yet again be a skill issue.

2 days agovonneumannstan

> apparently completely written by Claude Code

https://www.promptarmor.com/resources/claude-cowork-exfiltra...

> Claude Cowork Exfiltrates files

That explains it

2 days agoyunwal

OMG that's right, no human has ever written vulnerable code! Shut it down yall the AI thing is over!! This guy nailed it!

a day agovonneumannstan

You need to perturb the token distribution by overlaying multiple passes. Any strategy that does this would work.

2 days agoallisdust

No, it's clear that we are being lied to; you are not crazy. Ironically, one of the only uses for these LLMs is to create fake astroturf bots just believable enough to trick the masses who don't understand the technology. It works sometimes, but as you know, sometimes is not good enough for most code that people actually need to, you know, work.

18 hours agodiamond559

My anecdote: an entire start to finish app. I didn’t write a single line.

> I personally can't accept shipping unreviewed code. It feels wrong. The product has to work, but the code must also be high-quality.

What’s the definition of high-quality? For the project I was working on, I just needed it to work without any obvious bugs. It’s not an app for an enterprise business critical purpose, life critical (it’s not a medical device or something), or regulated industry. It’s just a consumer app for convenience and novelty.

The app is fast, smaller than 50MB, doesn’t have any bugs that the AI couldn’t fix for my test users. Sounds like the code is high quality to me.

I literally don’t give a shit what the code looks like. You gotta remember that code is just one of many methods to implement business logic. If we didn’t have to write code to achieve the result of making apps and websites it would have no value and companies wouldn’t hire software engineers.

I don’t write all my apps this way, but in this specific case letting Jesus take the wheel made sense and saved me a ton of time.

a day agodangus

There's an impedance mismatch between some people and LLMs, and I think one of the major reasons boils down to having preconceived notions about how it should get things done, then being frustrated and disappointed when it doesn't. If you explore how to get the best out of it by trying many different ways of interacting with it, you'll have much more success.

I have had similar problems with colleagues who couldn't abide others solving problems in ways they disagreed with, and who would only be agreeable coworkers if everyone thought the same way they did.

a day agocolechristensen

Yes. Over the last month, I've made heavy use of agentic coding (a bit of Junie and Amp, but mostly Antigravity) to ship https://www.ratatui-ruby.dev from scratch. Not just the website... the entire thing.

The main library (rubygem) has 3,662 code lines and 9,199 comment lines of production Ruby and 4,933 code lines and 710 comment lines of Rust. There are a further 6,986 code lines and 2,304 comment lines of example applications code using the library as documentation, and 4,031 lines of markdown documentation. Plus, 11,902 code lines and 2,164 comment lines of automated tests. Oh, and 4,250 lines in bin/ and tasks/ but those are lower-quality "internal" automation scripts and apps.

The library is good enough that Sidekiq is using it to build their TUI. https://github.com/sidekiq/sidekiq/issues/6898

But that's not all I've built over this timeframe. I'm also a significant chunk of the way through an MVU framework, https://rooibos.run, built on top of it. That codebase is 1,163 code lines and 1,420 comment lines of production Ruby, plus 4,749 code lines and 521 comment lines of automated tests. I still need to add to the 821 code lines and 221 comment lines of example application code using the framework as documentation, and to the 2,326 lines of markdown documentation.

It's been going so well that the plan is to build out an ecosystem: the core library, an OOP and an FP library, and a set of UI widgets. There are 6,192 lines of markdown in the wiki about it: mailing list archives, AI chat archives, current design & architecture, etc.

For context, I am a long-time hobbyist Rubyist but I cannot write Rust. I have very little idea of the quality of the Rust code beyond what static analyzers and my test suite can tell me.

It's all been done very much in public. You can see every commit going back to December 22 in the git repos linked from the "Sources" tab here: https://sr.ht/~kerrick/ratatui_ruby/ If you look at the timestamps you'll even notice the wild difference between my Christmas vacation days, and when I went back to work and progress slowed. You can also see when I slowed down to work on distractions like https://git.sr.ht/~kerrick/ramforge/tree and https://git.sr.ht/~kerrick/semantic_syntax/tree.

If it keeps going as well as it has, I may be able to rival Charm's BubbleTea and Bubbles by summertime. I'm doing this to give Rubyists the opportunity to participate in the TUI renaissance... but my ultimate goal is to give folks who want to make a TUI a reason to learn Ruby instead of Go or Rust.

a day agoKerrick

Yes.

Caveat: it can't be pure vibes. It needs ownership, care, review, and a willingness to git reset and try again when needed. It needs a lot of tests.

Caveat: greenfield.

2 days agohahahahhaah

I review it as I generate it, for quality. I guide it to be self-testing: create unit tests and integration tests according to my standards.

2 days agodionian

This is about a third of the response to a short prompt asking for implementation options for moving GitHub Actions runners from a broken server to GitHub Enterprise Cloud:

# EC2-Based GitHub Actions Self-Hosted Runners - Complete Implementation

## Architecture Overview

This solution deploys auto-scaling GitHub Actions runners on EC2 instances that can trigger your existing AWS CodeBuild pipelines. Runners are managed via Auto Scaling Groups with automatic registration and health monitoring.

## Prerequisites

- AWS CLI configured with appropriate credentials
- GitHub Enterprise Cloud organization admin access
- Existing CodeBuild project(s)
- VPC with public/private subnets

## Solution Components

### 1. CloudFormation Template

### 2. GitHub Workflow for CodeBuild Integration

## Deployment Steps

### Step 1: Create GitHub Personal Access Token

1. Navigate to GitHub → Settings → Developer settings → Personal access tokens → Fine-grained tokens
2. Create token with these permissions:
   - *Repository permissions:*
     - Actions: Read and write
     - Metadata: Read
   - *Organization permissions:*
     - Self-hosted runners: Read and write

```bash
# Store token securely
export GITHUB_PAT="ghp_xxxxxxxxxxxxxxxxxxxx"
export GITHUB_ORG="your-org-name"
```

### Step 2: Deploy CloudFormation Stack

```bash
# Set variables
export AWS_REGION=us-east-1
export STACK_NAME=github-runner-ec2
export VPC_ID=vpc-xxxxxxxx
export SUBNET_IDS="subnet-xxxxxxxx,subnet-yyyyyyyy"

# Deploy stack
aws cloudformation create-stack \
  --stack-name $STACK_NAME \
  --template-body file://github-runner-ec2-asg.yaml \
  --parameters \
    ParameterKey=VpcId,ParameterValue=$VPC_ID \
    ParameterKey=PrivateSubnetIds,ParameterValue=\"$SUBNET_IDS\" \
    ParameterKey=GitHubOrganization,ParameterValue=$GITHUB_ORG \
    ParameterKey=GitHubPAT,ParameterValue=$GITHUB_PAT \
    ParameterKey=InstanceType,ParameterValue=t3.medium \
    ParameterKey=MinSize,ParameterValue=2 \
    ParameterKey=MaxSize,ParameterValue=10 \
    ParameterKey=DesiredCapacity,ParameterValue=2 \
    ParameterKey=RunnerLabels,ParameterValue="self-hosted,linux,x64,ec2,aws,codebuild" \
    ParameterKey=CodeBuildProjectNames,ParameterValue="" \
  --capabilities CAPABILITY_NAMED_IAM \
  --region $AWS_REGION

# Wait for completion (5-10 minutes)
aws cloudformation wait stack-create-complete \
  --stack-name $STACK_NAME \
  --region $AWS_REGION

# Get stack outputs
aws cloudformation describe-stacks \
  --stack-name $STACK_NAME \
  --query 'Stacks[0].Outputs' \
  --region $AWS_REGION
```

### Step 3: Verify Runners

```bash
# Check Auto Scaling Group
ASG_NAME=$(aws cloudformation describe-stacks \
  --stack-name $STACK_NAME \
  --query 'Stacks[0].Outputs[?OutputKey==`AutoScalingGroupName`].OutputValue' \
  --output text)

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names $ASG_NAME \
  --region $AWS_REGION

# List running instances
aws ec2 describe-instances \
  --filters "Name=tag:aws:autoscaling:groupName,Values=$ASG_NAME" \
  --query 'Reservations[].Instances[].[InstanceId,State.Name,PrivateIpAddress]' \
  --output table

# Check CloudWatch logs
aws logs tail /github-runner/instances --follow
```

### Step 4: Verify in GitHub

Navigate to: `https://github.com/organizations/YOUR_ORG/settings/actions/r...`

You should see your EC2 runners listed as "Idle" with labels: `self-hosted, linux, x64, ec2, aws, codebuild`

## Using One Runner for Multiple Repos & Pipelines

### Organization-Level Runners (Recommended)

EC2 runners registered at the organization level can serve all repositories automatically.

*Benefits:*

- Centralized management
- Cost-efficient resource sharing
- Simplified scaling
- Single point of monitoring

*Configuration in CloudFormation:* The template already configures organization-level runners via the UserData script:

```bash
./config.sh --url "https://github.com/${GitHubOrganization}" ...
```

### Multi-Repository Workflow Examples

### Advanced: Runner Groups for Access Control

### Label-Based Runner Selection Strategy

*Create different runner pools with specific labels:*

```bash
# Production runners
RunnerLabels: "self-hosted,linux,ec2,production,high-performance"

# Development runners
RunnerLabels: "self-hosted,linux,ec2,development,general"

# Team-specific runners
RunnerLabels: "self-hosted,linux,ec2,team-platform,specialized"
```

*Use in workflows:*

```yaml
jobs:
  prod-deploy:
    runs-on: [self-hosted, linux, ec2, production]

  dev-test:
    runs-on: [self-hosted, linux, ec2, development]

  platform-build:
    runs-on: [self-hosted, linux, ec2, team-platform]
```

## Monitoring and Maintenance

### Monitor Runner Health

```bash
# Check Auto Scaling Group health
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names $ASG_NAME \
  --query 'AutoScalingGroups[0].[DesiredCapacity,MinSize,MaxSize,Instances[].[InstanceId,HealthStatus,LifecycleState]]'

# View instance system logs
INSTANCE_ID=$(aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names $ASG_NAME \
  --query 'AutoScalingGroups[0].Instances[0].InstanceId' \
  --output text)

aws ec2 get-console-output --instance-id $INSTANCE_ID

# Check CloudWatch logs
aws logs get-log-events \
  --log-group-name /github-runner/instances \
  --log-stream-name $INSTANCE_ID/runner \
  --limit 50
```

### Connect to Runner Instance (via SSM)

```bash
# List instances
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names $ASG_NAME \
  --query 'AutoScalingGroups[0].Instances[].[InstanceId,HealthStatus]' \
  --output table

# Connect via Session Manager (no SSH key needed)
aws ssm start-session --target $INSTANCE_ID

# Once connected, check runner status
sudo systemctl status actions.runner.*
sudo journalctl -u actions.runner.* -f
```

### Troubleshooting Common Issues

## Advanced Scaling Configuration

### Lambda-Based Dynamic Scaling

For more sophisticated scaling based on GitHub Actions queue depth:

### Deploy Scaling Lambda

```bash
# Create Lambda function
zip function.zip github-queue-scaler.py

aws lambda create-function \
  --function-name github-runner-scaler \
  --runtime python3.11 \
  --role arn:aws:iam::ACCOUNT_ID:role/lambda-execution-role \
  --handler github-queue-scaler.lambda_handler \
  --zip-file fileb://function.zip \
  --timeout 30 \
  --environment Variables="{ ASG_NAME=$ASG_NAME, GITHUB_ORG=$GITHUB_ORG, GITHUB_TOKEN=$GITHUB_PAT, MAX_RUNNERS=10, MIN_RUNNERS=2 }"

# Create CloudWatch Events rule to trigger every 2 minutes
aws events put-rule \
  --name github-runner-scaling \
  --schedule-expression 'rate(2 minutes)'

aws events put-targets \
  --rule github-runner-scaling \
  --targets "Id"="1","Arn"="arn:aws:lambda:REGION:ACCOUNT:function:github-runner-scaler"
```

## Cost Optimization

### 1. Use Spot Instances

Add to Launch Template in CloudFormation:

```yaml
LaunchTemplateData:
  InstanceMarketOptions:
    MarketType: spot
    SpotOptions:
      MaxPrice: "0.05"  # Set max price
      SpotInstanceType: one-time
```

### 2. Scheduled Scaling

Scale down during off-hours:

```bash
# Scale down at night (9 PM)
aws autoscaling put-scheduled-action \
  --auto-scaling-group-name $ASG_NAME \
  --scheduled-action-name scale-down-night \
  --recurrence "0 21 * * *" \
  --desired-capacity 1

# Scale up in morning (7 AM)
aws autoscaling put-scheduled-action \
  --auto-scaling-group-name $ASG_NAME \
  --scheduled-action-name scale-up-morning \
  --recurrence "0 7 * * MON-FRI" \
  --desired-capacity 3
```

### 3. Instance Type Mix

Use multiple instance types for better availability and cost:

```yaml
MixedInstancesPolicy:
  InstancesDistribution:
    OnDemandBaseCapacity: 1
    OnDemandPercentageAboveBaseCapacity: 25
    SpotAllocationStrategy: price-capacity-optimized
  LaunchTemplate:
    Overrides:
      - InstanceType: t3.medium
      - InstanceType: t3a.medium
      - InstanceType: t2.medium
```

## Security Best Practices

1. *No hardcoded credentials* - Using Secrets Manager for GitHub PAT
2. *IMDSv2 enforced* - Prevents SSRF attacks
3. *Minimal IAM permissions* - Scoped to specific CodeBuild projects
4. *Private subnets* - Runners not directly accessible from internet
5. *SSM for access* - No SSH keys needed
6. *Encrypted secrets* - Secrets Manager encryption at rest
7. *CloudWatch logging* - All runner activity logged

## References

- [GitHub Self-hosted Runners Documentation](https://docs.github.com/en/actions/hosting-your-own-runners/...)
- [GitHub Runner Registration API](https://docs.github.com/en/rest/actions/self-hosted-runners)
- [AWS Auto Scaling Documentation](https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-i...)
- [AWS CodeBuild API Reference](https://docs.aws.amazon.com/codebuild/latest/APIReference/We...)
- [GitHub Actions Runner Releases](https://github.com/actions/runner/releases)
- [AWS Systems Manager Session Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide...)

This solution provides a production-ready, cost-effective EC2-based runner infrastructure with automatic scaling, comprehensive monitoring, and multi-repository support for triggering CodeBuild pipelines.

2 days agoTomWizOverlord

Care to share the pet feeder's code, what the bugs are, and how it went off the rails? Seems like a perfect scenario for us to see how much is prompting skill, how much is setup, how much is just patience for the thing, and how much is hype/lies.

2 days agofragmede

I haven't (yet) tried Claude but have good experiences with Codex CLI the last few weeks.

Previously I tried Aider and OpenAI about 6 or 7 months ago and it was a terrible mess. I went back to pasting snippets into the browser chat window until a few weeks ago and thought agents were mostly hype (I was wrong).

I keep a browser chat window open to talk about the project at a higher level. I'll post command line output like `ls` and `cat` to the higher level chat and use Codex strictly for coding. I haven't tried to one shot anything. I just give it a smallish piece of work at a time and check as it goes in a separate terminal window. I make the commits and delete files (if needed) and anything administrative. I don't have any special agent instructions. I do give Codex good hints on where to look or how to handle things.
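
For illustration, the "paste `ls` and `cat` output into the higher-level chat" step can be wrapped in a tiny shell helper like this (just a sketch; the `ctx` name and layout are made up, not a real tool):

```bash
# Bundle a directory listing plus selected files into one paste-able blob.
# Usage: ctx src/main.py src/db.py > context.txt
ctx() {
  echo "## Directory listing"
  ls -la
  for f in "$@"; do
    printf '\n## %s\n' "$f"
    cat "$f"
  done
}
```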

It's probably a bit slower than what some people are doing but it's still very fast and so far has worked well. I'm a bit cautious because of my previous experience with Aider which was like roller skating drunk while juggling open straight razors and which did nothing but make a huge mess (to be fair I didn't spend much time trying to tame it).

I'm not sold on Codex or openAI compared to other models and will likely try other agents later, but so far it's been good.

2 days agomythrwy

they don't have reasoning yet

numerous attempts are being made by edge researchers to fix it but they're just throwing stuff at the wall

I expect gradual reasoning improvements... not agi to pop out of a box and surprise everyone

and... CEO means "liar"

a day agotoddmorrow

I've been increasingly removing myself from the typing part since August. For the last few months, I haven't written a single line of code, despite producing a lot more.

I'm using Claude Code. I've been building software as a solo freelancer for the last 20+ years.

My latest workflow

- I work on "regular" web apps, C#/.NET on backend, React on web.

- I'm using 3-8 sessions in parallel, depending on the tasks and the mental bandwidth I have, all visible on an external display.

- I have markdown rule files & documentation, 30k lines in total. Some of them describe how I want the agent to work (rule files), some of them describe the features/systems of the app.

- Depending on what I'm working on, I load relevant rule files selectively into the context via commands. I have a /fullstack command that loads @backend.md, @frontend.md and a few more. I have similar /frontend, /backend, /test commands with a few variants. These are the load-bearing columns of my workflow. Agents take a lot more time and produce more slop without these. Each one is written by agents too, with my guidance. They evolve based on what we encounter.

- Every feature in the app, and every system, has a markdown document that's created by the implementing agent, describing how it works, what it does, where it's used, why it's created, main entry points, main logic, gotchas specific to this feature/system etc. After every session, I have /write-system, /write-feature commands that I use to make the agent create/update those, with specific guidance on verbosity, complexity, length.

- Each session I select a specific task for a single system. I reference the relevant rule files and feature/system doc, and describe what I want it to achieve and start plan mode. If there are existing similar features, I ask the agent to explore and build something similar.

- Each task is specifically tuned to be planned/worked in a single session. This is the most crucial role of mine.

- For work that would span multiple sessions, I use a single session to create the initial plan, then plan each phase in depth in separate sessions.

- After it creates the plan, I examine, do a bit of back and forth, then approve.

- I watch it while it builds. Usually I have 1-2 main tasks and a few subtasks going in parallel. I pay close attention to the main tasks and intervene when required. Subtasks rarely require intervention due to their scope.

- After the building part is done, I go through the code via editor, test manually via UI, while the agent creates tests for the thing we built, again with specific guidance on what needs to be tested and how. Since the plan is pre-approved by me, this step usually goes without a hitch.

- Then I make the agent create/update the relevant documents.

- Last week I built another system to enhance that flow. I created a /devlog command. With the assist of some CLI tools and Claude log parsing, it creates a devlog file with some metadata (tokens, length, files updated, docs updated, etc.), and the agent fills it with a title, a summary of the work, key decisions, and lessons learned. The first prompt is also copied there. These also get added to the relevant feature/system document automatically as changelog entries. So, for every session, I have a clear document about what got done, how long it took, what the gotchas were, what went right, what went wrong, etc. (a rough sketch of the idea is below). This proved to be invaluable even with just a week's worth of devlogs, and allows me to further refine my workflows.

This looks convoluted at first glance, but it's evolved over the months and works great. The code quality is almost the same as what I would have written myself. All because of existing code to use as examples, and the rule files guiding the agents. I was already a fast builder before, but with agents it's a whole new level.

And this flow really unlocked with Opus 4.5. Sonnet 3.5/4/4.5 also worked OK, but required a lot more handholding, steering, and correction. Parallel sessions weren't really possible without producing slop. Opus 4.5 is significantly better.

More technical/close-to-hardware work will most likely require a different set of guidance & flow to create non-slop code. I don't have any experience there.

You need to invest in improving the workflow. The capacity is there in the models. The results all depend on how you use them.
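
To make the /devlog idea above concrete, here is a minimal sketch of the kind of metadata such a helper could collect before handing the file to the agent. It is not my actual command; the paths, filenames, and sections are invented for illustration, and it assumes a git repo:

```bash
#!/usr/bin/env bash
# Hypothetical /devlog helper: collect session metadata from git into a
# markdown file and leave the narrative sections for the agent to fill in.
set -euo pipefail

mkdir -p docs/devlogs
out="docs/devlogs/$(date +%Y-%m-%d-%H%M).md"

{
  echo "# Devlog $(date +%F)"
  echo
  echo "## Metadata"
  echo "- Branch: $(git rev-parse --abbrev-ref HEAD)"
  echo "- Files changed this session:"
  git diff --stat HEAD | sed 's/^/  /'
  echo
  echo "## Summary of work"
  echo "_(filled in by the agent)_"
  echo
  echo "## Key decisions / lessons learned"
  echo "_(filled in by the agent)_"
} > "$out"

echo "Created $out"
```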

2 days agohakanderyal

Let me give a concrete example. I had a tool I built ten years ago on Rails 5.2. It's decent, mildly complex for a 1-man project, and I wanted to refresh it. Current Rails is 8. I've done upgrades before and it's...rough going more than one version up. It's _such_ a pain to get it right.

I pointed Claude Code at it, and a few hours later, it had done all of the hard work.

I babysat it, but I was doing other things while it worked. I didn't verify all the code changes (although I did skim the resultant PR, especially for security concerns) but it worked. It rewrote my extensive hand-rolled Coffeescript into modern JavaScript, which was also nice; it did it perfectly. The tests passed, and it even uncovered some issues that I had it fix afterwards. (Places where my security settings weren't as good as they should have been, or edge cases I hadn't thought of ten years ago.)

Now could I have done this? Yes, of course. I've done it before with other projects. But it *SUCKS* to do manually. Some folks suggest that you should only use these tools for tasks you COULD do, but would be annoyed to do. I kind of like that metric, but I bet my bar for annoyance will go down over time.

My experience with these systems is that they aren't significantly faster, ultimately, but I hate the sucky parts of my job VASTLY less. And there are a lot of sucky parts to even the code-creation side of programming. I *love* my career and have been doing it for 36 years, but like anything that you're very experienced in, you know the parts that suck.

Like some others, it helps that my most recent role was Staff Software Engineer, and so I was delegating and looking over the results of other folks work more than hand-rolling code. So the 'suggest and review' pattern is one that I'm very comfortable with, along with clearly separate small-scale plan and execute steps.

Ultimately I find these tools reduce cognitive load, which makes me happier when I'm building systems, so I don't care as much if I'm strictly faster. If at the end of the day I made progress and am not exhausted, that's a win. And the LLM coding tools deliver that for me, at least.

One of the things I've also had to come to terms with _in large companies_ is that the code is __never__ high quality. If you drill into almost any part of a huge codebase, you're going to start questioning your sanity (obligatory 'Programming Sucks' reference). Whether it's a single complex 750 line C++ function at the heart of a billion dollar payment system, or 2,000 lines in a single authentication function in a major CRM tool, or a microservice with complex deployment rules that just exists to unwrap a JWT, or 13 not-quite-identical date time picker libraries in one codebase, the code in any major system is not universally high quality. But it works. And there are always *very good reasons* why it was built that way. Those are the forces that were on the development team when it was built, and you don't usually know them, and you mustn't be a jerk about it. Many folks new to a team don't get that, and create a lot of friction, only to learn Chesterton's Fence all over again.

Coming to terms with this over the course of my career has also made coming to terms with the output of LLMs being functional, but not high quality, easier. I'm sure some folks will call this 'accepting mediocrity' and that's okay. I'd rather ship working code. (_And to be clear, this is excepting security vulnerabilities and things that will lose data. You always review for those kinds of errors, but even for those, reviews are made somewhat easier with LLMs._)

N.b. I pay for Claude Code, but I regularly test local coding models on my ML server in my homelab. The local models and tooling are getting surprisingly good...but not there yet.

a day agocypherfox

I’ve had a major conversion on this topic within the last month.

I’m not exactly a typical SWE at the moment. The role I’m in is a lot of meeting with customers, understand their issues, and whip up demos to show how they might apply my company’s products to their problem.

So I’m not writing production code, but I am writing code that I want to be maintainable and changeable, so I can stash a demo for a year and then spin it up quickly when someone wants to see it, or update/adapt it as products/problems change. Most of my career has been spent writing aircraft SW, so I am heavily biased toward code quality and assurance. The demos I am building are not trivial or common in the training data. They’re highly domain specific and pretty niche, performance is very important, and they usually span low-level systems code all the way up to a decent-looking GUI. As a made-up example, it wouldn’t be unusual for me to have a project to write a medical imaging pipeline from scratch that employs modern techniques from recent papers, etc.

Up until very recently, I only thought coding agents were useful for basic CRUD apps, etc. I said the same things a lot of people on this thread are saying, e.g. people on Twitter are all hype, their experience doesn't match mine, they must be working on easy problems or be really bad at writing code.

I recently decided to give into the hype and really try to use the tooling and… it’s kind of blown my mind.

Cursor + Opus 4.5 high are my main tools, and their ability to one-shot major changes across many files and hundreds of lines of code, encompassing low-level systems stuff, GPU-accelerated stuff, networking, etc., is remarkable.

It’s seriously altering my perception of what software engineering is and will be and frankly I’m still kind of recoiling from it.

Don’t get me wrong, I don’t believe it fundamentally eliminates the need for SWEs. It still takes a lot of work on my part to come up with a spec (though I do have it help me with that part), correct things that I don’t like in its planning, or catch it doing the wrong thing in real time and redirect it. And it will make strange choices that I need to correct on the back end sometimes. But it has legitimately allowed me to build 10x faster than I probably could on my own.

Maybe the most important thing about it is what it enables you to build that would not have been worth the trouble before: stuff like wrapping tools in really nice flexible TUIs, creating visualizations/dashboards/benchmarks, slightly altering how an application works to cover a use case you hadn’t thought of before, wrapping an interface so it’s easy to swap libs/APIs later, etc.

If you are still skeptical, I would highly encourage you to immerse yourself in the SOTA tools right now and just give in to the hype for a bit, because I do think we’re rapidly going to reach a point where, if you aren’t using these tools, you won’t be employable.

2 days agofourthrigbt

I’m honestly kind of amazed that more people aren’t seeing the value, because my experience has been almost the opposite of what you’re describing.

I agree with a lot of your instincts. Shipping unreviewed code is wrong. “Validate behavior not architecture” as a blanket rule is reckless. Tests passing is not the same thing as having a system you can reason about six months later. On that we’re aligned.

Where I diverge is the conclusion that agentic coding doesn’t produce net-positive results. For me it very clearly does, but perhaps it's very situation or condition dependent?

For me, I don’t treat the agent as a junior engineer I can hand work to and walk away from. I treat it more like an extremely fast, extremely literal staff member who will happily do exactly what you asked, including the wrong thing, unless you actively steer it. I sit there and watch it work (usually have 2-3 agents working at the same time, ideally on different codebases but sometimes they overlap). I interrupt it. I redirect it. I tell it when it is about to do something dumb. I almost never write code anymore, but I am constantly making architectural calls.

Second, tooling and context quality matter enormously. I’m using Claude Code. The MCP tools I have installed make a huge difference: laravel-boost, context7, and figma (which in particular feels borderline magical at converting designs into code!).

I often have to tell the agent to visit GitHub READMEs and official docs instead of letting it hallucinate “best practices”; the agent will oftentimes guess and get stuck, so if it’s doing that, you’ve already lost.

Third, I wonder if perhaps starting from scratch is actually harder than migrating something real. Right now I’m migrating a backend from Java to Laravel and rebuilding native apps into KMP and Compose Multiplatform. So the domain and data are real, and I can validate against a previous (if buggy) implementation. In that environment, the agent is phenomenal. It understands patterns, ports logic faithfully, flags inconsistencies, and does a frankly ridiculous amount of correct work per hour.

Does it make mistakes? Of course. But they’re few and far between, and they’re usually obvious at the architectural or semantic level, not subtle landmines buried in the code. When something is wrong, it’s wrong in a way that’s easy to spot if you’re paying attention.

That’s the part I think gets missed. If you ask the agent to design, implement, review, and validate itself, then yes, you’re going to get spaghetti with a test suite that lies to you. If instead you keep architecture and taste firmly in human hands and use the agent as an execution engine, the leverage is enormous.

My strong suspicion is that a lot of the negative experiences come from a mismatch between expectations and operating model. If you expect the agent to be autonomous, it will disappoint you. If you expect it to be an amplifier for someone who already knows what “good” looks like, it’s transformative.

So while I guess plenty of hype exists, for me at least the hype is justified. I’m shipping way (WAY!) more, with better consistency, and with less cognitive exhaustion than ever before in my 20+ years of doing dev work.

2 days agolostsock

[dead]

20 hours agoath3nd

[dead]

2 days agoNedF

1. give it toy assignment which is a simplified subcomponent of your actual task

2. wait

3. post on LinkedIn about how amazing AI now is

4. throw away the slop and write proper code

5. go home, to repeat this again tomorrow

2 days agodmitrygr

lol no. it is all fomo clown marketing. they make outlandish claims and all fall short of producing anything more than noise.