I think a lot of the problem with the current discourse is how black-and-white it is. Either you're a luddite or "ai pilled".
In most cases, LLMs can get you 80-95% of the way, sometimes less, sometimes more. And heck, sometimes, it just gets you somewhere wrong.
But it seems everyone is arguing about whether LLMs can be perfect software engineers in isolation running in a closet, and using that to say that LLMs do not have a massive potential in other scenarios.
Sometimes, I like to imagine how much more productive most organizations could be from the things that the internet gave us, even to this day. Most companies never really do even a fraction of what is possible. That helps to ground my view of LLMs as well.
The fault dear Brutus isn't in our language models, but in ourselves.
It's funny, but the more I know about the true Luddites, the more I see their point of view.
" the original Luddites were primarily protesting against machinery used to "fraudulently and deceitfully" manufacture inferior goods, bypass labor standards, and strip skilled artisans of their livelihoods."
Goods are usually (although not always) inferior when made by a machine. A hand-crafted solid wood table is still superior to something from Ikea.
Of course hand made tables are expensive. They service a sliver of the market. Ikea serves the rest of us who'd prefer not to eat off the floor.
Fundamentally, Luddites didn't like being replaced by a machine. They were skilled workers, who used to have very desirable skills. Most people didn't need their standard of quality (but customers had no choice.)
Their name is well known today because we never stopped replacing people with machines. Every industry has been "optimized" over and over again since the Luddite times.
AI is the first threat to the Artisans of today (ie programmers). We are just the most recent in a long history of Luddites.
In every change of this nature, some move on embracing the change, others do not. Some will find other jobs, possibly new jobs, others won't. Carriage drivers became Chauffeurs, some grooms became mechanics.
So sure, I'm a Luddite - I don't want to see my skills become cheap - but I'm also pragmatic. The change is here. I'd rather adapt than die.
>Of course hand made tables are expensive. They service a sliver of the market. Ikea serves the rest of us who'd prefer not to eat off the floor.
And yet my working class grandparents didn't eat off the floor, they had great quality tables.
Mine is made of disguised cardboard.
This is a big part of the problem, there is zero trust that any potential improvement in cost or access will reach consumers. Companies don't even bother telling us it will.
We will just be slowly moved into accepting a degradation as the new normal.
And yet, clothes would have remained very expensive if we kept doing fabric by hand. Even destitute people in the poorest countries can have clothes these days, the meaning of “who wears the pants in this house” has lost its original (一条裤子) scarcity meaning.
>the meaning of “who wears the pants in this house”
That's pants as opposed to skirts. It's a gender implication, not a scarcity one.
I think you are misrepresenting the Luddites. They were not against technological progress. Many of them were the ones who invented the machines. What they were fighting for were labour rights and distribution of power. They were fighting against enshittification by some guy who stole their collective inventions and kicked them out by force.
They were not against more clothes for everyone - quite the opposite. They were against fast fashion bad quality clothes made in horrible conditions by people (or children) who had no other choice.
I think the case of Luddites shows just how strong anti-worker and anti-union propaganda is - making it a mock word was very effective at preventing uneducated people from understanding what they fought for.
Doesn't that just say "a pair of pants"? Or literally one line pants.
I completely agree with your sentiment of “black or white.” I believe it comes from social media with primarily “radical” perspectives being the ones in the spotlight. Just not an environment that promotes nuance or friendly discussion
>I think a lot of the problem with the current discourse is how black-and-white it is.
There is too much money involved for any rational debate.
Only on the pro-AI side. The "it is bad" side is diverse on the reasons why, but being overwhelmed by bad content isn't a monetary concern.
Some "ai is bad" arguments come from the threat of losing one's livelihood, which is a monetary concern.
Which is why I phrased it the way I did.
Black and white thinking is not *limited to* monetary concerns.
> There is too much money involved for any rational debate.
For the Sam Altmans of this world, sure, but how much money is the average AI booster commenting on HN actually standing to make?
If you are just invested in an index fund, a lot. If you are an HN commenter who is more likely invested heavily in tech stocks, much more than a lot.
The other side is the stability of your job or job prospects, and we are adversely affected by that instead.
> In most cases, LLMs can get you 80-95% of the way, sometimes less, sometimes more.
That's my experience too, but it's 60-95% solutions in my case[1], with about 120-140% of lines of code required. I wish there was a harness that would let me mask code it should/n't change, because prompt-based refactors fail from the same over-eagerness.
1. I try faster, smaller models first.
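For what it's worth, the "mask code it shouldn't change" idea can be approximated outside the model, as a check on the result rather than a constraint on generation. A minimal sketch, assuming you have the file before and after the agent's edit; `protected_lines_touched` and the line-range format are hypothetical names of my own, not part of any existing harness:

```python
import difflib


def protected_lines_touched(original: str, modified: str,
                            protected: list[tuple[int, int]]) -> list[int]:
    """Return 1-based line numbers of the original file that fall inside a
    protected range and were replaced or deleted by the edit."""
    old = original.splitlines()
    new = modified.splitlines()
    # Expand inclusive (start, end) ranges into a set of guarded line numbers.
    guarded = {n for start, end in protected for n in range(start, end + 1)}
    touched = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # Only "replace" and "delete" alter existing original lines.
        # Pure insertions next to a protected line are not flagged here.
        if tag in ("replace", "delete"):
            touched.extend(n for n in range(i1 + 1, i2 + 1) if n in guarded)
    return touched


before = "a\nb\nc\nd\n"
after = "a\nB\nc\nd\ne\n"
print(protected_lines_touched(before, after, [(2, 3)]))
```

If the list is non-empty, the harness could reject the change and re-prompt. It doesn't stop the over-eagerness, it only catches it.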
We had the same issue until we created a review skill that we run after an LLM is done implementing a feature. We give it a list of things to check, based on the problems we have observed previously (like writing overly verbose code), and ask it to report on issues and suggest improvements. The developer can then give feedback and let the LLM fix the issues, or just address them manually. It’s still early, but I’ve been much happier with the results. It also makes it much easier for humans to review, since there’s a report about what the change is about, why, things to keep an eye on, etc. This is something you can do with any harness you may be using and there’s nothing to buy, just a suggestion from someone trying to make the best use of this insane technology.
I think that's geohot's point as well. They're advocating against being fully "ai pilled". Saying we should be using AI as a tool, not for being a luddite.
Yes, exactly, it's 'us' not the AI, which is great.
Why on earth would we ever remotely compare a 'tool' to 'a software engineer' ?
The 'great delusion' is not that 'AI can't code' - because obviously it can, and very well.
The problem is the 'anthropomorphism' and all this AGI nonsense.
If we called it 'Stochastic Mechanisms' and did not 'personalize' our prompts, refer to them as 'chat' or give them 'personalities' but remained in the domain of 'Stochastic Language CLI' ... then our metaphors would probably not cloud our judgments.
Let the philosophers argue about AGI.
You are a tool. You're a human resource, from the perspective of the organization. That pushes buttons on bunch of other tools. That's why you compare it.
Edit: I don't mean tool as a pejorative.
I don't know why you are getting downvoted. Perhaps because people don't like the sentiment. But it's true, people are hired as tools to write programs.
Both people and AI make mistakes. Perhaps the AI makes more, a lot more, but it's so fast, works around the clock, and has no ego, so there is a chance that the benefits outweigh the costs.
The saner philosophers don’t need to argue about AGI because we’re absolutely nowhere near it.
So currently there are people who are buying grey market peptides[1], marked "not for human consumption" and injecting themselves with them based on dubious anecdotes and vibes, to make their skin clearer, build muscle mass, and so on.
Are they all suddenly turning into zombies? No. Do they have any real idea what that is going to do to their body a few years down the line? Also no. Could it be catastrophic? Maybe!
I think about this when I think about how violently much of the industry has pivoted into AI being the primary generator of code in the last 6ish months. AI is the peptide, your codebase[2] is the body. Literally no one knows how maintainable this approach is, because there simply hasn't been enough time to find out. It could be fine. It could be a complete mess, with your entire engineering team falling asleep at the wheel, lulled into thinking they understand what is being built when they don't, completely impotent to fix or maintain it once the LLM is no longer able to.
[2] Well, _their_ codebase. I've stopped doing it with my own personal codebases, unless I genuinely don't care about maintainability or longevity
I think smart developers will be building isolated modules, so if your AI generated module keeps failing, you can amputate it and make a fresh one.
With the level of ability that AI is at right now, I've found it useful personally to think of it something like a very good search over existing knowledge. Another step up in searchability in the lineage of reference books, stack overflow, GitHub etc.
Programmers are rewriting and reinventing the same techniques more often than any other vocation I can think of, and so we were primed for a really good search over prior art. The fact that AI can also adapt that prior art to your particular use case makes it even more powerful.
Much like how great success never came from cobbling together various bits of copy-pasted code from Stack Overflow though, current AI can't really build your whole project.
> Programmers are rewriting and reinventing the same techniques more often than any other vocation I can think of
And the answer to that is clearly a tool that makes rewriting/reinventing cheaper than actually packaging nice reusable libraries
And on other hand I really do not understand how basic project boilerplate templating wasn't already a fully solved issue. Surely it should have been doable...
I'm all for packaging nice reusable libraries, but someone has to actually do it. A lot of them just don't exist (yet).
"The fact that AI can also adapt that prior art to your particular use case makes it even more powerful."
Well that's what everyone is claiming anyway
Yes, I don't have anything important to say other than I 100% agree with this comment. AI in its current state is akin to Stack Overflow and Google on steroids, but from my experience, it doesn't do well building out full-scale applications other than perhaps some initial scaffolding.
If I were to use it against a legacy, rather poorly written codebase, where the code may be hard to understand without some in-depth analysis. I could certainly ask an AI agent to read the code (How does application X do Y, for example), but I wouldn't have it start hammering out features or have it do any type of refactoring. That would cause far too many commits and confusion amongst the development team, leading to even more slop than whatever we'd already be dealing with.
Just leaving this comment here so I can come back to your comment. I've been getting a bit discouraged by AI lately, but this sums up my experience with it well enough.
> Yes, I don't have anything important to say other than I 100% agree with this comment. AI in its current state is akin to Stack Overflow and Google on steroids, but from my experience, it doesn't do well building out full-scale applications other than perhaps some initial scaffolding.
We're currently using it to build out a full-scale application. It does as well as you care to coax into doing tbh. You have to invest heavily in harness engineering, and at least my experience has been that as you do that, the results improve.
>It does as well as you care to coax into doing tbh. You have to invest heavily in harness engineering, and at least my experience has been that as you do that, the results improve.
That is also my experience.
When starting a project, I observe how the agent fails, add new rules to the harness to prevent it from failing that way again, and repeat the process until I am happy with the output.
I'm unfamiliar with harness engineering. Is there any good documentation about the subject you could point me to?
These were some of the first major articles on it. It's becoming a popular topic, so there's more content on it all the time.
I can't point you to a good complete documentation, because the field is changing very fast as people make new discoveries.
I learned by reading articles, success stories, and failure stories, but mostly by doing: trying stuff, seeing how it works, adjusting it, and burning a lot of tokens along the way.
What I would do in your shoes, I would ask an AI chat to find new articles on the matter (including on HN), explain how Codex, Claude, Pi are managing agents.
My compressed view is: you need to have a great specification both business and architecture wise that doesn't leave anything important for the model to guess because chances are it will make the wrong choices. That comprehensive spec should not be in one huge chunk. Have your plan divided in phases that each fit in a context window and have the spec for each phase.
Use TDD, strive for 100% coverage.
Force the model to behave: if it doesn't do what it is supposed to, give it feedback and ask it to retry, and don't allow it to progress to the next stage unless everything is perfect.
I also like to write comprehensive integration tests before building anything. The agents are not allowed to touch or read the integration tests, only run them and they will get feedback where the tests fail. I like to build the integration tests in a different language than the software I am building, to make sure there isn't something platform specific that the tests rely on. I use C#, Go, Rust and Zig for development and Python for the integration tests.
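To make the "don't allow it to progress" part concrete, here is a rough sketch of a phase gate. Everything in it is an assumption for illustration: `run_phase`, `implement` and `run_tests` are hypothetical names, and in practice `implement` would call whatever agent you use while `run_tests` would run the hidden integration suite and return the names of failing tests.

```python
from typing import Callable


def run_phase(implement: Callable[[str], None],
              run_tests: Callable[[], list[str]],
              spec: str,
              max_attempts: int = 3) -> bool:
    """Ask the agent to implement one phase and retry with test feedback.

    Returns True only when the test run reports no failures, so the caller
    can refuse to start the next phase otherwise.
    """
    prompt = spec
    for _ in range(max_attempts):
        implement(prompt)
        failures = run_tests()
        if not failures:
            return True
        # The agent only sees which tests failed, never the test sources.
        prompt = spec + "\n\nFailing tests:\n" + "\n".join(failures)
    return False


def run_plan(phases: list[str],
             implement: Callable[[str], None],
             run_tests: Callable[[], list[str]]) -> int:
    """Run phases in order and return how many passed before the first failure."""
    completed = 0
    for spec in phases:
        if not run_phase(implement, run_tests, spec):
            break
        completed += 1
    return completed
```

The retry cap is there so a stuck phase stops and comes back to a human instead of burning tokens indefinitely.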
For now, to get good results, I can't just copy and paste the setup from a project to another, I have to work a lot to tailor the process for each new codebase.
And that's why I am working on an agent harness to try to force the agents to do the right things in most common development scenarios without wasting many tokens. "Common development scenarios" is a large goal; right now I am working towards backend web development and microservices.
Sounds like bag pipes to me LOL
In my experience, you’ll eventually hit a context window issue and it will just start spouting gibberish/doing wrong things, and nothing will significantly improve it. But hey, maybe it’s improved.
Well, auto-compaction is a thing in Claude Code now. Plus we have /goal command and some automated review stuff, so you can kinda just get it to loop until the automated reviews are satisfied and CI is passing. Does most of the heavy lifting.
One under-discussed phenomenon here, I think:
The hardest thing in software engineering is solving the right problem. The ability to identify the right problem to solve, is IMO, what distinguishes the top senior engineers. And we could have endless discussions about what constitutes the right problem, but for the sake of this discussion, let's reduce it to: the problem whose resolution adds the most value to the product for the amount of complexity and afferent costs that it incurs.
Once upon a time, long ago, I worked on a Web product whose original junior designer had figured it would be neat to be able to manage the backend with LDAP tools. So the database schema and structure that the product used mimicked that of OpenLDAP, with compound CN keys, and the entire codebase had to deal with that structure whenever reading from or writing to the DB. LDAP compatibility was not the right problem to solve when designing the DB schema.
But software that solves the right problems can be hard to identify because, quite often, how it does things seems so obvious that it's not readily apparent what other designs might have been chosen.
Now, the thing that usually keeps the blast radius of wrong-problem designs limited over time, is the very friction that they introduce. Development slows down, including the development of more wrong-problem designs. It's a self-limiting phenomenon.
And that's one major thing which worries me about LLM coding agents:
They paper over this friction. They don't repair it; they just make it so its cost is deferred.
So you gradually end up with codebases that grow unboundedly complex for the value they provide, with no controlling mechanisms.
You end up with juniors who never face the feedback loop from which they'd develop the engineering instincts and the taste for what makes a problem the right problem to solve in a given design.
At scale, as a field, you might end up forgetting there ever was such a thing as solving the right problem.
And I don't know what to do about that. Plan for an early retirement, maybe.
I'm in the "haven't written any code in a while" boat ATM. I'd love to see examples of issues that are so big that they warrant reverting to manual coding.
My main issue has been the inconsistent quality across model releases and the tendency to insert older APIs or documentation, especially with command line tools.
I can understand if the model struggles with a million line monolithic codebase with a decade of cruft but can't think of why it'd be too much of a pain with new codebases.
When every prompt produces a thousand line PR, you’re not very far from another million line monolith.
I’m a little more hopeful than the author though. I feel like it’s possible to manage the process so that does not happen.
It's not difficult to avoid the 1000 lines per PR thing: depending on what kind of thing I am adding, the plan might also include an instruction to value making as small a code change as possible. It still requires judgement, as on something big, the smallest possible code change is not necessarily the most readable, but this is the kind of thing one can decide with some experience and little work.
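If you'd rather not rely on the prompt alone, a crude size budget can be checked mechanically too. A minimal sketch, assuming you feed it unified diff text (for example the output of `git diff`); the function names and the budget are my own invention, and skipping every line that starts with `+++` or `---` is a simplification that can miscount a changed line whose content itself begins with `++` or `--`:

```python
def changed_line_count(unified_diff: str) -> int:
    """Count added and removed lines in a unified diff, ignoring file headers."""
    count = 0
    for line in unified_diff.splitlines():
        if line.startswith(("+++", "---")):
            continue  # treated as file header lines, not content changes
        if line.startswith(("+", "-")):
            count += 1
    return count


def within_budget(unified_diff: str, max_changed_lines: int) -> bool:
    """True when the diff stays at or under the agreed size budget."""
    return changed_line_count(unified_diff) <= max_changed_lines


sample = """--- a/app.py
+++ b/app.py
@@ -1,3 +1,4 @@
 import os
-x = 1
+x = 2
+y = 3
 print(x)
"""
print(changed_line_count(sample))  # 3
```

It won't tell you whether the change is readable, which is the judgement part, but it does flag the thousand-line surprise before a human has to open it.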
I've also managed to use LLMs to cut a lot of manual duplication in code where we typically didn't do enough investment: "Claude, evaluate code duplication in the functional test suite" will have no problem finding things like insufficient helpers, or tests that are testing simpler things as prerequisites, so they can rely on each other. So I am not seeing my codebases growing all that much. There's some risks of functional changes that before would be rejected due to cost which now are not, but I am not all that sure of how much that is controllable without being relatively antagonistic with management.
Monolith is an ordinary form of software btw - big ball of mud is the problem.
> manage the process so that does not happen
This is the gold, right here.
It doesn't engineer. It writes code. Enthusiastically. Usually without thinking about the bigger picture, the design, the architecture, the trade-offs, etc.
It's up to us to manage that process.
It's why senior engineers are finding LLMs a really useful tool - because we've learned to think about all that other stuff before opening the text editor. Writing the actual code was always the easy (and least valuable) bit.
What type of projects do you work on? In particular, how rich are they in novelty, non-googlable data points and non-trivial project-specific deviations from industry standards?
Even if agents failed at that I'd wager that's a very small percentage of software projects anyway.
> I'm in the "haven't written any code in a while" boat ATM
How long do you think it will be before you can't write any code because you're out of practice?
One of the dangers of engineering management is that it can turn you into a person that can no longer do the thing.
Does that even matter?
That's fair. In all honesty I'm already feeling challenged, but given how much time I save I can set aside some time to keep myself sharp. I can learn more languages. Additionally, as pointed out by others, I'm trading coding effort for design and strategy, which generally control business outcomes a lot more.
Having said that, I won't use AI for production system if I don't understand the programming constructs in enough detail.
> I'd love to see examples of issues that are so big that they warrant reverting to manual coding
Ah I see your org hasn't yet had an outage caused by a bad LLM code push.
This shouldn't actually change virtually anything. We had this happen recently, and were able to rollback within minutes. Devs hand-coding stuff breaks things too. If you already have good observability, fast rollback processes, and feature flag new changes plus do % based rollouts to limit the blast-radius, then it's more or less the same.
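For anyone who hasn't set this up, the %-based rollout part is small. A minimal sketch of one common approach (deterministic hash bucketing), not a description of our actual system; the `in_rollout` name and the flag and user ID formats are assumptions for illustration:

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: float) -> bool:
    """Decide deterministically whether a user is inside a percentage rollout.

    The flag name is mixed into the hash so different flags get independent
    buckets. A user keeps the same bucket for a given flag, so raising
    `percent` only ever adds users and never removes them.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


# Hypothetical call site: send 5% of users to the new code path.
if in_rollout("new-checkout", "user-123", 5):
    pass  # new code path
else:
    pass  # existing code path
```

Setting the percentage back to 0 is the "rollback within minutes" lever, and it works the same whether a human or an LLM wrote the change behind the flag.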
sounds like bad deployment practices - canaries, guardrails, fast rollbacks, ring based promotions, cell based architecture, blah blah etc... humans write bad code too, there should be systems in place to protect it from releasing
I think people spend way too much time trying to say that LLMs are bad / shouldn't be used / etc because the LLM can't get it right the first time and/or makes mistakes. I think this is because we all hope that software/computers work like this in an ideal world, and LLMs are software.
This is the wrong mental model.
The way to think about an LLM is like a human: prone to following bad examples if it sees them, needs guardrails to catch mistakes, needs code review. It also needs access to what "correct" looks like: architectural design documents, skills that explain each type of change, etc. It needs prompting/skills telling it to follow a safe workflow, telling it to consider how a safe rollout would work, what a safe rollback would look like, what the performance implications are - just like a human.
The nice thing is that you now have a very knowledgeable assistant that can help write additional guardrails that would have always ended at the bottom of your backlog. Perhaps it used to take many hours to research and understand how to write a custom linter to catch a specific coding pattern. Today, ask Claude to do it and an hour later you'll have a custom linter rule for your language of choice, guaranteeing the same mistake can't happen again because CI will block it.
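To give a sense of scale, a custom rule can be this small. A minimal sketch using Python's standard `ast` module, with bare `except:` clauses standing in for whatever pattern bit you; the rule choice and the `lint` helper are illustrative assumptions, not something taken from a real codebase:

```python
import ast


class BareExceptRule(ast.NodeVisitor):
    """Collect line numbers of `except:` clauses that name no exception type."""

    def __init__(self) -> None:
        self.violations: list[int] = []

    def visit_ExceptHandler(self, node: ast.ExceptHandler) -> None:
        if node.type is None:  # a bare `except:` has no exception type
            self.violations.append(node.lineno)
        self.generic_visit(node)  # keep walking into nested try blocks


def lint(source: str) -> list[int]:
    """Return the line numbers in `source` that violate the rule."""
    rule = BareExceptRule()
    rule.visit(ast.parse(source))
    return rule.violations


code = "try:\n    risky()\nexcept:\n    pass\n"
print(lint(code))  # [3]
```

Wire it into CI by exiting non-zero when `lint` returns anything, and the pattern is blocked for humans and agents alike.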
"Ah I see your org hasnt yet had an outage caused by a bad LLM code push"
"We went back to shovelling by hand because someone ran over the pole with the front-loader, even though he had no experience driving it."
This is definitely user error; obviously it's a hard tool to wrangle but it's entirely possible to use it safely.
Seems like AI isn't really solving complex bugs and issues (that it itself created) in my MAUI project over the last 18 months.
Seems like it is completely hopeless at doing anything netcode consistency and performance related in game dev. Seems like unique game mechanics it doesn't do well either.
Seems like asking it specific UI stylistic changes is basically like throwing darts at a board and hoping it sticks.
Even with relatively simple things, frontier models get me about 90% of the way - and this is without evaluating how good that 90% actually is. It's the last 10% that the model fucking sucks at. And it's often the simplest things. It takes a lot of tokens and a lot of time to cajole the AI to get that last 10% working. And even then, I've just given up and had to go read the slop and fix the bug myself because it became so frustrating.
Part of my job is working on trying to make these models productive for the large corporation I work for. It's a lot of throwing tomatoes at a wall, and to a degree I see the issue he is talking about, with output seemingly having a certain ceiling.
At the same time in no part of his post is any code snippet or anything to latch on to of "the model performed poorly here when it should have done this" - this style of criticism seems to be a pattern of most of these "the LLMs will never work" style posts on blogs and twitter.
They obviously can perform better than autocomplete, and in my own day-to-day development they build out huge portions of a codebase at the level I would have expected of a junior or mid-level engineer.
How are we really supposed to grasp their actual capabilities when no one will actually cite specifically what mistakes they are making.
> How are we really supposed to grasp their actual capabilities when no one will actually cite specifically what mistakes they are making.
The mistakes they make are pretty subtle. Coding with LLMs can be like that scene in Whiplash – <excellent drumming>, not quite my tempo, <excellent drumming>, downbeat on 18, <excellent drumming>, you’re rushing, <excellent drumming>, dragging, …
Like yeah it produces working code almost always and the code usually does what you asked. And yet it makes you want to throw a chair because it’s not quite right in frustrating ways and it doesn’t even have the taste to know how it’s wrong.
Yeah. It does this. Pretty consistently and replicably depending on the issue, in fact! Yet I can point exactly where it fails.
Why are we not showing the bad choices? On my computer I have hundreds of diffs stored by my agent code review tool that point to style/architecture failures (and in the end, the result of that iteration on the AI output)
I'm not quite sure how people are generating unsalvageable outputs. I'd never ship the result of a first AI pass, either. I review all the code and the architecture, within reason (eg: in Rust I don't preoccupy myself anymore with precisely scoping pub, or whatever, unless I'm making a library crate). I send a "changes requested" prompt+json to my agent, and it interactively fixes everything (even style, even comments with manual patches with my in-review-tool editor)
This is an excellent point, and as a novice using LLMs for projects I could never previously dream of doing I find myself looking for the same, examples or citations of what exactly agents are writing incorrectly and how would the human do it better. I'm sure they're out there, maybe someone can refer some good content showing such examples.
I have no doubt the top nth percent of coders could write circles around Claude or Codex, but how much worse are they than your average schnook?
Reality: the top nth percent of coders are seeing absurd, dramatic gains in productivity using LLMs. See: antirez, Simon Willison, Steve Yegge.
The more experience you bring to the table, the more value you get from these tools.
Look, about 12 years ago articles about how if you're not pair programming you're doing it wrong were on HN's home page every day. Doing well prompted plan -> agent -> debug cycles is like pair programming with someone that knows every SDK and API intuitively and doesn't have to pick up their kids from daycare at 4pm.
antirez is famous for creating Redis, which took a dump in quality and everyone switched to a fork called Valkey.
Yeah, "everyone".
The name of your 35 day old account is appropriate. antirez has taken dumps that have accomplished more than you.
The problem is what they do to large existing systems: subtle misunderstandings mean subtle bugs are constantly being introduced, and very few shops have adequate systems in place to receive reports of subtle issues at the rates they occurred 10 years ago, let alone today. And don't even get me started on llm-assisted support that some might suggest as a solution.
When people write blog posts about how LLMs failed for some particular task, the responses from boosters invariably fall along the lines of "just use this other model/just tweak your prompt like so/you're just not skilled enough—you can't make fundamental arguments about AI by citing specific examples."
So we can't make arguments by citing specific examples, and also can't make arguments by not citing specific examples. Whelp, I guess that's the ball game.
(yes yes, I'm committing a group attribution error, but still)
AFL didn't find more vulnerabilities than LLMs. AFL and skilled practitioners found vulnerabilities. AFL triggers faults, many (most?) of which aren't exploitable, and humans (or, now, agents) have to triage and evaluate them. And they did so in a pre-AFL corpus of memory-unsafe software. The heyday of AFL was a decade ago. Every target is harder now.
My guess is the models just continue to get better and better
When I got into agentic coding a year or two ago I was sure it was only good at autocomplete. Something happened earlier this year where the models hit a new level of capability.
Everyone I know now just does agentic coding, and it’s really amazing. I think we should just try pushing this as far as we can possibly go, it really feels like the acceleration of the human race is upon us.
We're already hitting some logistical limits. Even if transformers don't have an inherent capability plateau, we only have so many GPUs and so much power to improve them, and we're finding it very difficult to expand that infrastructure. Something like 6 GW of new DCs have been announced over the past 2 years of which less than 1 GW has actually been turned on and started serving, and the deliverable dates for the rest just keep slipping. (Not to mention that the DCs are all talking as if the chips in them will last 6 years, which is turning out to be a stretch.)
Sounds like we just need a Dyson sphere.
Besides, I have been hearing "this is the limit" since the doomers of "this is just a markov chain and can't be useful".
Yet the limits keep being broken.
Most of your comment history is defending LLM psychosis
Most of your comment history is less than a month old
what if we're accelerating to a brick wall?
More like neo-feudalism by way of breadlines.
It's hard to care much about how software is written when superpowers are invading their neighbors, democracy itself is under attack, the food supply chain is failing, and the world is hotter than it's ever been.
software plays a big role in enabling 2 (arguably 3) out of those. Least we could do is not make it worse. But it seems most of us are still just children. They give us something shiny that makes pretty lights and we happily burn the world to keep it.
Acceleration of the human race is the biggest cope I’ve read all year.
>... I was sure it was only good at autocomplete. Something happened earlier this year where the models hit a new level of capability.
Yes, something happened, it got better at autocomplete. What else could it be? The underlying model hasn't changed.
>acceleration of the human race
Please just stop with this bullshit. Nobody's curing cancer, climate change, inequality or whatever important real problem there is with LLMs. Nobody.
If this tech is good enough to make you more productive, it is just because you're not working on anything new or cutting edge or innovative. The only reason an LLM knows how to do your job is because that code has been literally written before enough times to appear in the training data. Try to use LLMs to write C++26, some HDL or in any niche stack and you'll get a nice reality check about LLMs.
I’m sure I would be just as useless as an LLM in the “niche stack” examples that you cited.
Why do you think that is actually a good argument against?
Most “business” problems have already been solved in some way and the times I had to write really novel code in my career have been very very few.
Also sure, LLMs haven’t solved cancer or inequality in the few years they have existed - but humans also failed here in the last couple thousand
Not reviewing outputs, which is my main issue, is a one-way road to a subpar experience. No amount of "make it right" will fix that.
I hope that professionalism still matters, as these new ways of doing things strike me as unprofessional as f...
Yeah, the next macOS will be worse... time to place bet on prediction market
ai agents can program, in fact, in our current time with current models, i'd say they program better than most people in the industry (an industry where people were literally copying and pasting from stack overflow for years prior)
being able to program is not the only skill required to be a successful software engineer, so no ai agents cannot be software engineers
very important distinction - i personally like the radiologist example - looking at scans is a part of a radiologist job, AI can do it better than most of them, but looking at scans is a small part of the job, most of it communicating with doctors to help their patients
Data from six months of production from one SaaS codebase provides a more limited response. Maintainability doesn't depend on the level of AI usage. Maintainability depends on the discipline during diff reviews.
Good sessions: One topic per session; scope defined prior to the agent starting; all diffs read prior to committing. Poor sessions: Broad scope; undefined constraints; rubber-stamped results.
The quality of the codebase decays precisely at the rate you stop reading the results. This is not an issue of AI writing the code. This is an issue of unreviewed code. geohot's issue is entirely valid. This problem does exist. But this isn't dependent on the generation phase.
I agree that I can write better code than an agent.
But it can write working code much faster than I can.
And in a lot of cases, unfortunately, faster beats better.
> Unfortunately, faster beats better.
I think you have just written the epitaph for corporate software.
I agree it can write faster in a language I don’t know very well
At a granular level, it's almost guaranteed that you cannot write better code than an agent.
Agents now are writing extremely consistent, normalized canonical code, that usually compiles the first time.
Right out of the 'textbook'.
For what it's trying to do - it writes nearly perfect code.
The only thing you could nominally disagree with are some of the conventions and idioms.
It 'writes a perfect novel, in perfect prose'.
What it will not do however, is 'write the novel that's in your head'.
And that's the crux of it.
It's not even your job to 'write code' at this point, but rather to be the storyteller - and a very good editor who has enough taste and grasp of grammar to be able to know when it's going awry.
It will make mostly what you tell it to, the quality of the output is the quality of your guidance, but at the lowest levels it's generating extremely high quality syntactic prose.
I don't think LLMs inherently do anything perfectly. They can make sure it compiles and passes tests and they can be trained to do an enormous array of tasks, but the code it generates isn't perfect, it's selecting one of many possible outputs based off of some numbers it came up with after a few matrix multiplications and ReLU activations.
Those matrix multiplications aren't a divine perfect thing. They suffer from floating point precision issues and training data issues and there's still debate if adversarial examples are just an unsolvable property of our linear-algebra based neural network architecture.
Can they do things way faster than a human? No doubt. Can they do very complex tasks? Yes. Do they do things with perfection? Not by our human definition of perfect.
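To make the floating point remark concrete, here is a minimal sketch of the kind of precision issue being referred to. The values are hypothetical and chosen only to make the effect visible; it says nothing about how much this matters for model output quality.

```python
import math

# Hypothetical values picked so that rounding is easy to see in 64-bit floats.
values = [1e16, 1.0, -1e16]

# Summing left to right: 1.0 is too small to change 1e16, so it is lost.
left_to_right = (values[0] + values[1]) + values[2]

# Same numbers, different order: the large terms cancel first, so 1.0 survives.
reordered = (values[0] + values[2]) + values[1]

print(left_to_right)      # 0.0
print(reordered)          # 1.0
print(math.fsum(values))  # 1.0 (fsum tracks the lost low-order bits)
```

Floating point addition is not associative, so the order of accumulation inside a large matrix multiplication can change the low-order bits of the result.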
"Those matrix multiplications aren't a divine perfect thing. They suffer from floating point precision issues " - this is not the right intuition.
"Not by our human definition of perfect."?
'Human definition' has nothing to do with it.
Your job is to define what you want, to the extent you can do that, the AI does really well at a certain scale, at the 'functional' scale, nearly perfectly.
Syntax is the least of my concerns.
Modelling a problem is what I'm concerned about. And I'm currently better than any AI agent at doing that, given enough time.
> the quality of the output is the quality of your guidance
I wish you just started with the copout.
>When people see an artifact, they make assumptions about the process that was used to create it. Without even thinking about it, they assume the creator had a basically human state of mind. This assumption is no longer true. Things can be broken in ways that weren’t previously possible, and old proxies of underlying quality like syntax and grammar are useless. AI produced artifacts are not produced by the same process as human ones, and this difference, while extremely subtle in statistics, makes itself obvious when you try to interact with and build on the artifact in human ways.
Once humans just had oral language, and we could use words to pass ideas from one human mind to another. Then with writing, ideas could pass to minds that weren't immediately close together in space or time, and with this we made a complex, globe-spanning civilization. When words just become noise, and one has to be suspicious of each one as to whether it's coming from another human mind or just a statistical process, can this civilization even survive?
But each time I suspected I could have done it better and faster manually
There is a class of tasks that can't be done faster manually, unless you're some sort of colour-smells-like-chicken-and-numbers-have-taste genius. And there is another class (my suspicion now is any non-standard task+framework) that are slower than using agents. So I can imagine you have excellent experience with some tasks like USB hacking and would do it faster than an LLM. On the other hand for me, as a Java developer, hacking a USB is finally possible with an LLM. Otherwise I'd need to stop-and-learn for some time, which I wouldn't, so either I'd buy more expensive hardware that fulfills my requirements, or put the USB reverse engineering project on my 100 acre todo list
It all depends on the code quality bar. If it's high, a lot of tasks will not be completed much faster. The main speed comes from trusting the LLM's output. When you review each change and reprompt the LLM to make the code look the way you want, things suddenly become much slower, and the reviews and reprompts are very mentally exhausting.
While I was reading this post, Anthropic sent me an email with the subject line "Your account has been suspended". What a coincidence :D
I wish these posts that talk about non-human mistakes that agents make would post some examples. They would be interesting to see.
Wonder if LLMs in autoresearch loops would be able to complete tasks geohot has in mind in say 100x average token budget.
If the answer is yes, the argument doesn’t matter: you just run the loop and wait for llm analog of moore’s law to get costs down.
Loops make the code even worse. The more local the changes, the better LLMs are at it.
When a blog like this goes completely black or white on a topic I get skeptical. Nothing in life is 0 or 1, and neither is AI. It has some good to it and some obvious issues. All not that big of a deal. Ppl try to position themselves on the edges bc that’s what polarizes and engages conversation…
I think geohot is somewhat of a clown but I think he is speaking reason here and I'm happy to see voices address this. Most seniors I work with agree.
For context: the author is George "geohot" Hotz, who has a long list of exploits, likely the best known of which is basically vibe coding (I mean that in the nicest possible way) comma.ai for autonomous cars on a shoestring budget long before actual AI vibe coding was a thing.
When digital cameras replaced traditional ones, we thought it would make photography more democratic: each of us would be Helmut Newton for 15 minutes. But it didn't give us the beautiful portraits and inspired landscapes we expected, only millions of pictures of food.
How much will it take for AI agents to pass from distilling decades of collective wisdom to copying each other's worst mistakes?
>But it didn't give us the beautiful portraits and inspired landscapes we expected, only millions of pictures of food.
Here's a sample of my work using digital cameras, not a food picture in sight.
The thing about having the ability to take effectively free photographs is that it really lets you experiment and learn the edges of what's possible.
I was inspired by Stanford's camera array, and wound up doing virtual focus synthetic aperture photography. I'm hoping to build a rig to do it in near real time, instead of the manual process I used to do on my train rides to and from work.
Sure, the removal of cost led to a flood of the mundane, but it also means we can capture our lives in ways that even kings couldn't afford in the past. I have thousands of good photos, and even some video, of friends and family.
"Things can be broken in ways that weren’t previously possible" and also "Things can work in ways that weren’t previously possible". It all depends on what you use the tool for. If you're a bad carpenter, you're going to do a bad job regardless of whether you have a fancy hammer or a basic one. If you're an expert, you'll do the work with a basic hammer, and you'll do the same with a fancy one, perhaps a little faster (or not).
I don't think Geohot has a good idea about LeCun and Hutter's views on the limitations of LLMs. I think that on abstract, textual domains, LLMs perform superbly, and they would agree. I am not too well-informed about LeCun and Hutter's views either, but I think that:
LeCun thinks that LLMs are a bad fit for AI that understands the physical, dynamical systems that we inhabit, and that understanding this is necessary for AGI/ASI.
I don't know that Hutter is bearish on LLMs, but Hutter is interested in AI that can reason exceptionally well given infinite compute, and approximations of such a reasoning AI. I think he is open to the idea that LLMs can be such an approximation.
From the article:
> Without fully endorsing all their ideas, I’m now in the LeCun/Marcus camp on LLMs.
I'm pretty sure he means "Yann LeCun and Gary Marcus" not "Yann LeCun and Marcus Hutter".
A problem that's impossible to detect is indistinguishable from it working. Hence it works. Hence it's not a problem.
> and it’s taking longer and longer to realize that they can’t
For something to take "longer and longer" to realise, doesn't that imply that it's been realised at least once before or that there was an expected deadline for the realisation?
Okay, that's a nitpick.
I read it as "agents can't program, and with each new generation of agents it's taking longer and longer to realize that that specific iteration can't". Maybe taking the Principle of Charity too far, I dunno.
Nah, that's fair, I was just being a bit tongue-in-cheek.
I don't think you can go completely hands-off for quality products but you can relax and let the agent do as much as possible. It does enable things that probably wouldn't have happened otherwise.
If you are already comfortable with letting other devs work on features then it's easier, because it's similar (arguably you have more control with AI, because what you say goes regardless of hierarchy).
> It’s definitely a better Google for most searches
I can't agree with this. You tend to get one point of view, often without any actual resources and references so you have to go look it up yourself, on [insert search engine]. Plus, what does it say when we consider an AI the one stop for our data intakes.
I find that it's typically better than Google search has been for a while, but not better than it's ever been.
More so tells me just how bad Google search has become and just how bad content in general has become.
how do you measure if google’s engineering org is more productive than meta’s?
What about comparing 2 startups/small teams?
I think the discussion about methods (coding agents included) depends on answering those questions. Seems pointless to claim these agents [dont] make you more productive.
Although, at a first glance, the productivity increase does seem like nothing I’ve seen before. Even more than the transition of making webapps in plain js -> jquery -> frameworks or going from something like Flask to using Rails.
Problem is this is not evidence based. I just feel prototyping has sped up 100x. So the number of iterations/attempts has gone up. Transforming specs into a test suite takes a fraction of the time. Dunno, feels weird not to be able to be overall more productive (do more with less time) if you have these new tools.
Smart guy but whoever eventually actually fixes X search will probably use AI coding assistance to do it.
This post hits the nail at a bit of an angle.
The AI agents are great, and any expert can prompt them correctly to get good code. LLMs occasionally pick wrong patterns and start digging a hole, but this is why an expert is required. The code itself is just not worth writing when a detailed prompt can get you the same code typing 20x less text.
Where I agree with the post is:
The adoption of AI agents into software engineering is a problem. Solo projects are great, but our teams have not adjusted to the speed-of-change to a mental model of a project. So I see orgs making a choice to either: slow down or forgo the shared mental model.
Anybody choosing to forgo the mental model is building crooked legacy slop at scale.
You can and should save the mental model to an AGENTS.md, but devs need it in their brain to prevent the digging a hole behavior.
To be fair the digging a hole behavior is something humans do just as well. But in teams you'd communicate enough to catch it - hopefully^1. It's the combination of higher speeds and teams that's creating a bit of a disaster.
I'm not sure what a good solution is either. There is a case for solo devs running for 2-month sprints with much more freedom. Perhaps we'll have an "AI Agile manifesto" within a year.
[1] Though you should not underestimate the amount of poor code being created before LLMs. There are enough teams for whom LLMs are practically all upsides. Stay very far away from those.
‘If it hurts, stop’ vs ‘if it hurts, do more of it’. Organizations have a choice, some slow down to avoid, some speed up in hope to make issues… non-issues. If the go-fast orgs find workflows that actually truly speed things up without loss of quality, it’s like hitting the jackpot - you’ve found a way to run away from competition without them even realizing it’s possible (for a while anyway, until they notice they’re grossly outpaced).
Eh but statistical models are obviously useful, because statistically 99% of your codebase won't involve new idea invention. Tools that write all the boilerplate code used to have names and job titles.
I hate how both the for and against case for LLMs are just so bloody terrible at addressing these things.
This is a good take. The most effective combination of AI and skilled practitioner is using AI to amplify the abilities of the skilled practitioner. And in particular, max benefit comes from exploiting comparative advantage. AIs are really good at boilerplate -- in many cases better than humans because humans will optimize the process by doing copy/paste and often inject errors in the process -- whereas humans are better at abstract and critical reasoning. There's a very real and valuable use case for AI, but it's not replacing humans, it's taking the things that humans don't like doing (and that a computer can do well already) off the human's plate, so humans can focus more exclusively on the things that they do better than the AI. And at least with the current architecture of AI models, there will _always_ be higher-level reasoning that humans do better than the machine.
This. A ~staff software engineer designing big changes at one level above the raw implementation details using Opus 4.7 + superpowers today can genuinely ship multiple times more at the same quality level than pre-AI. The level of what a whole team could ship before.
You have to use something like superpowers, the key is that the humans need to make the important decisions.
You have to review the code - just like you had to review the code humans wrote. There will be iterations.
You have to give the LLM skills and patterns to follow, access to architectural documents, etc, just like humans needed to be onboarded at a company and do the same.
If you get all of these right with today's LLMs, you will never write code at all because it is so obviously not the best use of your time. If you feel that you are still better at writing the code manually, you have not done the above right, fix your workflow and try again.
If nothing else, Eternal Sloptember is a term that seems obvious once you have it. I can’t believe this is the first time I’m seeing it.
If you were on Usenet before '93 the words still haunt you
Joined Usenet in 1990 and for a few years it was great.
I always wonder whether HN suffers from periodic influxes of newbies who don't get it yet and rile up the regulars.
There was the Reddit Exodus of 2023 and there seems to be a fair few new accounts posting recently, which feels like an Attempted Agentic Takeover.
It's one of my favorite parts of online trivia.
I was very much unborn in '93, could you share the folk history for the record?
> The bottom performers won’t have that self check. They are the ones producing 10x output with the agents. What do you think is happening to the average output of that organization?
Nailed it!
At my last place this was encouraged (by non-technical leadership driving the AI adoption policies, as well as setting salaries) and seen as a huge win.
The "step change in number of created PRs" was celebrated (cult-style), and praised by one of the (co-)CEOs as a paradigm shift of the same magnitude as the personal computer. Meanwhile, I was stuck finding insta-reject level bugs in pull requests from people one-shotting 6000-line PRs "finally solving" long-standing issues from the backlog.
Needless to say I left.
Why sloptember when it's May
It's a reference to Eternal September, the name applied to the cultural shift of the internet as the general public started gaining access to it en masse in the 90s.
Sloptember is clearly a reference to this - the similarity being that masses of AI generated content, from social media posts to open source contributions are replacing the human internet. In a way this is related to the "dead internet theory", an idea I previously found hard to believe, but these days could easily be true.
If the history of the internet interests you, both these are worth looking up.
> It is a golden era for buckets and buckets of slop, and a dark age for gems of quality.
I mean, this has been the trend for decades really, before LLMs were a thing. The incentive is skewed toward quantity rather than quality. The new tools just add more fuel to the fire.
Code quality is also really lacking in much of the industry. The truth is, these LLM models, as limited as they are, program at a level above that of the median junior programmer.
People misunderstand how AI is used in coding in normal work environments. New feature requirement comes - maybe you need a new service or some new classes. You need to do some research first.
You guide the AI with some prompts and give it some guidance on how to scenario-test it. It makes some classes, test methods. Maybe ~2000 lines and you do a quick verification, check if the overall idea looks okay. Ask it to fix a few design things and then merge it.
It's much easier than doing it yourself with all the boilerplate and understanding each esoteric language-specific thing. Which library do I use for UDP communication in golang? The agent might have made a good assumption. These kinds of things are where it speeds things up.
If you don’t know what library to use in your specific language, do you think you know enough to have an LLM generate most of it?
I just joined a new team and have been using copilot with opus models.
We have our core code in a weird dialect of C and Rust. C I know well, but not Rust. Our tests are in Python. The pipeline descriptions are in YAML.
Outside of the core code there are so many arcana to learn. Writing syntactically and semantically correct YAML/Python test code would be a nightmare. The agents have flaws, but they provide a huge leg up in improving the tests.
And they are great at providing a first pass review of the core code before bothering a human reviewer. Lastly we run some of our test failures through AI triage, which often enough finds the root cause or rules out simple failures.
This shows up in a higher checkin rate. I'm curious to see whether this will lead to quality end product since we have more support for the more manually written and reviewed core product code.
YES. This line of thought is exactly why people are still skeptical of LLMs.
LLMs are directionally right and if their answer "fits" then I take it at face value.
As an example: I asked the LLM "synonym for "provides" that also means "places" on you" and it gave me 5 answers and I immediately knew the right one was "confers". How? It just fits. Just like most things.
People are skeptical of LLMs because of the experiences they’ve had with LLMs. You can’t blog your way out of those experiences.
I’m skeptical because I’ve seen this exact situation and I’ve seen the result be something that anyone experienced wouldn’t do.
Sometimes I think folks having ‘experiences’ should play more poker.
The point is, it’s a game of chance and yet good players beat bad players in the long run. Your job in the new era of software engineering is to design the process so LLMs doing your code monkeying avoid the losses (including discarding bad changes) and take the wins. Win often enough and you’ll come out ahead.
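The poker framing can be put in expected-value terms. Below is a rough sketch, assuming made-up numbers for the chance a generated change is good, the payoff of a good merge, the cost of a bad one, and how many bad changes a review step discards. None of these figures come from the thread; `expected_value` and its parameters are hypothetical.

```python
def expected_value(p_good, gain, loss, catch_rate=0.0, review_cost=0.0):
    """Expected payoff of one agent-generated change.

    p_good:      probability the change is good (assumed)
    gain:        payoff from merging a good change
    loss:        cost of merging a bad change
    catch_rate:  fraction of bad changes discarded before merge
    review_cost: cost paid on every change for the review itself
    """
    p_bad = 1.0 - p_good
    merged_bad = p_bad * (1.0 - catch_rate)  # bad changes that slip through
    return p_good * gain - merged_bad * loss - review_cost


# Hypothetical numbers: 70% of changes are good, a bad merge costs 4x a good one gains.
no_gate = expected_value(p_good=0.7, gain=1.0, loss=4.0)
with_gate = expected_value(p_good=0.7, gain=1.0, loss=4.0,
                           catch_rate=0.9, review_cost=0.2)

print(round(no_gate, 2))    # -0.5
print(round(with_gate, 2))  # 0.38
```

With these assumed numbers the same "hand" loses money when merged blindly and wins when bad changes are mostly discarded, even after paying for review on every change.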
that's a very dangerous analogy, because you would be considered the domain expert and you are just asking for synonyms for something you already know but may not remember off-hand.
now, what if you asked for the synonym for "provides" in a language that has gender differences (e.g. spanish/portuguese) as well as societal nuances (e.g. japanese) and it gives you "confers", how would you now know that's correct?
ah, so you say you tell it to take into consideration gender differences, as well as societal nuances. What are those, if you were not already familiar with the language?
Yes, just throw 2kloc over the wall for some feature. Your coworkers must love you.
My coworkers don’t care, they’re doing the exact same thing.
"It’s definitely a better Google for most searches"
This is dangerously incorrect. AI summaries of search results consistently return incorrect information and grossly oversimplified and thus misleading summaries, neither of which are detectable unless one either has prior domain knowledge or spends time drilling into search results to validate the AI output.
My experience with ChatGPT as a search engine - it is totally paranoid about checking and re-checking its answers by referencing them in multiple places (I usually read its thinking output). I have not seen an outright hallucination for at least a year. (It is of course a different situation with Google's "AI summary" which is wrong half of the time.)
Ironically I quit using ChatGPT a while back. I decided to run it through its paces and asked it some rather detailed questions about a range of topics that I have significant domain knowledge on. Without exception the responses I got back were glibly superficial to the point the responses were almost totally devoid of meaningful information. The AI summary on Google search results is so bad it represents an assault on reason.
> And whenever you need a quick prototype and don’t care about polish, it is absurdly fast. But is it a software engineer? Not close to the bar at any company I have worked at.
This line which he wrote, will override any quality gaps, because the cost to produce that shitty software will be lower than the cost to produce good software.
I don't think LeCun is saying they won't be able to program. I think he says we won't hit AGI. Programming does not require AGI; it's a pretty specific skill!
--
I think this article is COPE, if I'm being quite honest. I thought of putting cute analogies, like the C programmers saying the Python and Javascript programmers are not "hardcore" enough... but the truth should be obvious to anyone using LLMs effectively.
--
Current AI is a much better programmer than 100% of people and when directed by someone in that top 10%, it's a force majeure.
This is my experience. Though I've been writing LLM harnesses, agents, tooling, etc for 5 years now and believe it requires several hundred hours of experience before understanding how to consistently outperform at scale.
preach it
There's a time and a place for assembly language programming. Of course, I knew someone who would say there's a time and a place for machine language programming (improved it by reprogramming a device by flipping the 17 switches on the front panel)
Wake me up.... when sloptember ends.
It won‘t..
It really feels like a mass psychosis. I'm not an AI sceptic insofar as I fully expect to get replaced by some future AI system. But what we have now isn't it.
To use a Geohot-inspired analogy, what we have now is like the Google self-driving car of 2010. It works most of the time, yet sometimes fails in unpredictable ways. So you need a safety driver behind the wheel to constantly watch what it's doing (the code review).
A real AI agent would not need a safety driver. We don't have that but many people are basically saying "fuck it, I'm just going to set this car off on its own and see what happens". And sure if you're prototyping it's not dangerous. But for production systems that is dangerous.
That is a fair analogy.
There is some very cool tech; it just needs continued refinement. There is a path forward even if it isn't always the clearest. This is happening but it is taking years and a lot of work to get done.
The more specific your work is, the more these LLMs seem to struggle.
If your work was previously googling Stack Overflow, it can be incredibly useful at working through that. Which, let’s face it, is what a lot of us do.
The horse is better than a car!
Good News! The horse and the car can coexist
Bad news! The horse population declined by 85% after the widespread introduction of the car.
Good News! I learned something today.
[deleted]
Bro claims to write good code.
He got fired <4 weeks from twitter.
AI is hyped but code isnt that bad.
spill the tea, fired or quit?
On the other hand we see success stories such as antirez using agents to work on Redis and Deepseek v4 flash inference.
> They are a highly sophisticated statistical model designed to mimic the distribution of programming
Are we really still doing this?
Author here. I have never said that phrase before this blog post and certainly understand the absurdity of it. I certainly don't mean that you need something biological or whatever consciousness might or might not be.
However there's still a distinction. Unless I'm responding to an LLM, you had a childhood. You learned about the world and space and agency before you ever learned how to program. And you didn't learn it from billions of examples, you learned from a few examples, some self directed experiments, some feedback from teachers, etc...
I'm saying that's what matters. The process matters. You didn't learn to mimic a distribution, you learned to program. Of course in the perfect mathematical limit it's the same, but in practice it's not.
For a lot (most) of what we do with programming, the process actually doesn't matter. I understand you are a real ass dude who is in this shit for the love of the game. I respect that. You are a true artisan and exist in a kind of rarified space. There will always be a place for people like you and in some senses you are correct - you are not replaceable by any AI as they currently function today.
However, 99.9999% of coding is not like that. Non-coders don't care about the code at all. They just care about outcomes. People don't care if it's "slop" if it works. Similar to bug prevalence, the optimal level of slop is not zero and will be decided by the market, not by coders.
Yes, until everyone gets it through their head.
I'm confused what else you think they are?
It's fundamentally how LLMs work.
"Magic"
Well, since the fundamental underlying structure is still the same, yes.
It's not exactly what it is; they now model an incredibly complex Markov process, with harnesses that control how that thinking is done.
Is this any different than how a PM gets a programmer to work on a project? They think, then they deliver. If given more time, maybe they deliver something better. Maybe they consult some text and try to apply a design pattern.
The LLM in this use case is perfect because almost everything involved is text based, and the model is able to take in all the expressiveness of language.
> Is this any different than how a PM gets a programmer to work on a project?
Yes, it's very different. You seem to be suggesting that the current frontier LLMs, when tied to their tools and harnesses, have emergent properties that are similar to human consciousness. If you truly believe that, I'm not sure how to have a productive discussion here.
It's not just that, but the core is just that, even with reasoning models. Harness can only get you closer to the good result, but can't save you from every pitfall.
As for PM analogy - don't forget that models don't learn and keep doing same stupid stuff they were doing a month ago.
Agents are perfectly capable of learning. Why would the model need to learn? The harness and tooling are all that matter.
But it's not useful because even humans are like that - a bunch of neurons slapped together. Overall a tired analogy that is more suited to stay in 2024 where it belongs. Right now it is clear that it is _much_ more than a statistical model semantically. It is misleading to claim it is _just_ that just like a human is _just_ a statistical model.
Neurons? Go lower. Just atoms. Dumb, senseless atoms.
It's the mantra of the AI skeptics. Sounds so clever because it's technically true. Just like humans are just piles of oxygen, hydrogen, nitrogen and carbon atoms, along with maybe a dozen or so other trace elements, none of which have any intelligence or will or desire to do anything - hence humans cannot possibly be intelligent or have any free will.
That's a very good way to describe what they do (better than 'AI') but ironically, it explains the mechanism really well, and how they are in fact able to 'code so well', which is contrary to the author's own premise.
Agents code extremely well.
They're not particularly good at 'architecture' and I think that's where his specific concerns about 'not being able to see the problems' arise - the issues are almost never in the syntax, because the AI writes perfect code. The issue is that it's not doing exactly what you intended.
Instead of 'missing the target' ... it's 'hit the wrong target perfectly'.
Any senior developer working with AI daily should be able to have a baseline intuition for all of this, and would therefore reject the hyperbole of the premise 'it can't code!'.
Of course it's producing gargantuan amounts of slop - that's not because 'it can't code', that's something else entirely.
Yes, the beatings will continue until morale improves!
To me this sounds like an old cobbler complaining that machines aren't producing good shoes if left unsupervised and that the old process of making shoes completely by hand is far superior.
So what is he telling us? That agents are not infallible, that they are not capable of one-shotting complex software, and that they do not produce perfect code?
We know that, and the solution is to use agents for what they are good at, work around their limitations, and keep a human in the loop.
>not some RLVR shit that comments out the failing test and tells you all the tests are now passing
That's what harnesses should be about: detect when the agent is misbehaving and force it to take the right approach.
This example in particular should be easy to solve if we generated the tests before coding and we have a workflow or state machine that doesn't allow the agent to disable tests and doesn't allow it to reach the next stage unless all tests are passing.
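A minimal sketch of that kind of gate, assuming the harness can see which tests ran and whether they passed. The class and method names (`GatedWorkflow`, `freeze_tests`, `submit_results`) are hypothetical and not taken from any real agent framework.

```python
from enum import Enum, auto


class Stage(Enum):
    WRITE_TESTS = auto()
    IMPLEMENT = auto()
    DONE = auto()


class GatedWorkflow:
    """Hypothetical harness gate: tests are frozen before coding starts."""

    def __init__(self):
        self.stage = Stage.WRITE_TESTS
        self.frozen_tests = frozenset()

    def freeze_tests(self, test_names):
        # Tests are generated first, then locked; this set cannot change later.
        if self.stage is not Stage.WRITE_TESTS:
            raise RuntimeError("tests can only be frozen before implementation")
        if not test_names:
            raise ValueError("at least one test is required")
        self.frozen_tests = frozenset(test_names)
        self.stage = Stage.IMPLEMENT

    def submit_results(self, results):
        """results maps test name -> True (passed) or False (failed)."""
        if self.stage is not Stage.IMPLEMENT:
            raise RuntimeError("no implementation in progress")
        missing = self.frozen_tests - set(results)
        if missing:
            # A commented-out or deleted test shows up as missing and blocks progress.
            return f"blocked: tests removed or skipped: {sorted(missing)}"
        failing = sorted(t for t in self.frozen_tests if not results[t])
        if failing:
            return f"blocked: failing tests: {failing}"
        self.stage = Stage.DONE
        return "advanced"


wf = GatedWorkflow()
wf.freeze_tests(["test_login", "test_logout"])
print(wf.submit_results({"test_login": True}))  # blocked: test_logout is missing
print(wf.submit_results({"test_login": True, "test_logout": True}))  # advanced
```

The point of the design is that "all reported tests pass" is not enough: the reported set has to cover every test that was frozen up front, so disabling a test cannot move the workflow to the next stage.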
LLM proponents always use some language like "these old, stuck up dinosaurs with their manual labor vs us cool smart kids with automated labor", but they forget one thing - with automated labor the performance and cost difference was easily measurable in favor of the automation. With LLMs it's neither measurable nor visible (no better software, no faster delivery overall in the industry), and the costs are pretty bad. Besides personal anecdotes of someone toiling away at yet another AI harness project on GitHub.
Right now, to get some good results from AI and save time, you will have to spend a lot of tokens and money. Maybe in the future, the things will get better, I don't know.
Saving money is the wrong reason to use AI now. AI is expensive if you want good results.
But what AI is good for is that it allows you to build fast.
Also, I don't see everything being automated. To get good results you have to drive the AI.
The factories still have workers supervising the process and doing some high value manual processes even if most of the production is done by machines.
Nah this person is dead wrong. Let's come back in 2 years and check on it. I'm willing to make a reasonable bet on these terms: companies will go even more AI native, will use even more tokens and spend even more money.
EDIT: To people downvoting me, please come up with a reasonable bet and let's try to work it out.
EDIT 2: $500 bet paid to your account on whether LLMs are going to still be used productively or not. No one?
EDIT 3: Any bet that would express the author's argument in a way that can be disproven in the future
The author is not arguing against your bet. I think the author wouldn't be surprised if you won that bet. But that doesn't change the argument he does make.
You could both be right.
No, I don't think people will be spending even more money on AI if it is not becoming productive. 2 years is a long time to get used to it.
Coders underestimate the utility of AI in so many boring day to day tasks. If you freelance, that’s where the money is at, not in creating a startup that fills holes in AI offerings or in creating generic slop while hoping for ad money.
The amount of domain specific apps that will be created will likely make excel look like yesterday’s news.
I heard that a year ago, I guess we still need to wait a bit more. Thought agents were fast!
People have been trying to replace Excel for the last 40 years...
We all remember cryptocurrency. Everyone in tech proclaimed fiat was dead, every office buzzed with talk of every possible way that cryptocurrency could be used, billions of dollars flooded in to projects losing money hand over fist. The cynics reacted to the froth with outright rejection of the idea. And today… cryptocurrency exists, it has some use, but it didn’t take over the world, it didn’t kill fiat, it was useful in some areas and worthless in others. AI will be the same. The noisiest proponents will be over exaggerating. The most cynical cynics will be underestimating. The result will be somewhere in the middle. Success will not be predicated on adoption of the technology. We, nerds, are bad at predicting the impact of technology.
Almost nobody used crypto, everyone is using AI, every day for doing productive work and almost nobody would give it up.
I don't understand how it's remotely reasonable to try to make the comparison.
I wish people would stop comparing AI with cryptocurrency. The hype/perception was the only thing that was similar between them. The fundamental usefulness of the respective technologies is not comparable.
Two other similarities: they both rely on burning huge amounts of electricity, and have driven up costs of GPUs around the world.
The cognitive dissonance around this is astounding.
First of all define productive. Would someone using AI to build software at a startup which is likely to fail be considered productive? What if there is already similar software available that solves the same problems? What about the broad use of LLMs to draft emails or make silly memes?
It’s funny how everyone’s concerns around climate change just disappeared when they realised AI was useful to them.
One of the reasons it gets compared with crypto is because the same people who were all in on crypto schemes are now all in on AI schemes.
That's the fallacy. You think technology usefulness dictates the outcomes. There are billions of people living in poverty the world over, starving every day, we have the technology and resources to solve that right now, we have for decades, but we don't because we don't want to.
Could AI technology change the world? Sure. Will it? That depends on so much more than what the technology can do. Why are we all still working 40 hours a week? Why are people still hungry? We could have radically changed our world with the technology we have had for decades. Yet, we have not, we have continued, nothing has really changed.
The internet is a great example. What is the most impactful part of the internet today? Social media. Social media has radically changed our culture. What is social media? A database, a few endpoints and an app? The technology is the least consequential part, the consequence comes from how we use it.
Nerds focus on what is possible with the technology, not what society is likely to do with it. What evidence is there that AI is going to change the world? What change is going to come from... being able to generate plausible sounding text? From being able to instruct agents? How many companies are using garbage software from 20 years ago despite dozens of revolutionarily better equivalents being available out there today that could have drastically reshaped their workforce? What are agents if not better macros? How many businesses have hundreds of employees doing the same tasks over and over again that could have been replaced by a few macros? How much of the code that you and I have written in our careers has already been written before?
The fundamental usefulness is the least important part of a technology when discussing how that technology will impact the world.
What "society is likely to do" with a technology drops to zero for things that technology cannot do. That's a technical property. On the other hand, people quite reliably find unexpected uses for technology, beyond the initial hype, again, due to inherent technical properties. You're vastly overcorrecting against "nerds".
It’s a matter of timeframes too.
Cryptocurrency is more popular and intertwined with the financial system than it ever was so while the claim isn’t currently true it doesn’t mean it won’t be on a long enough timeframe.
If you are old enough then you would be aware that similar claims were made about email, but only one country that I know of (the Netherlands) no longer processes mail. Still, if we had to guess I would say that we are still early and email will replace the world's postal systems.
You are predicting the weather by saying tomorrow will be just like today.
Which has a good track record of being right, most of the time.
I agree!
Geohot's next venture will be writing a book titled "Fear & Trembling".
" Agents cannot program, and it’s taking longer and longer to realize that they can’t. They are a highly sophisticated statistical model designed to mimic the distribution of programming"
In other words - they can program, and probably better than you.
I don't like being too critical but this is a really superficial post - as if either 'AI is a Software Engineer - or - It must be Fraud'
It's an extremely powerful tool that is very 'pattern oriented' and with guidance can absolutely write great code - and even across modules given the right basis.
It's also great at so many other tasks - finding bugs in big code bases, doing migrations etc.
It's not going to make very good architectural decisions for you, and if you're doing anything novel you have to read most of the code ... but that's to be expected.
In fact, he’s done several things that are truly hard, and has a well-deserved engineering reputation.
Well he doesn't write CRUD apps, which plenty of us do, and with a decent harness, agents can do decent enough work on them.
The author is absurdly wrong.
It's ridiculous to suggest that 'AI can't code' - when the entire development world has moved into agentic coding, including all of the best developers in the world, and it's yielding positive results in most scenarios.
It's a callow 'bad twitter take' the length of an article.
He's not wrong to suggest that AI is a 'stochastic mechanism' over all the code that's ever been written, but that's evidence of the mechanism, and frankly, describes how it is able to code.
And yes - organizations will misappropriate AI at scale as they do with everything.
His premise is so far out of proportion and misguided, it's tantamount to 'fake moon landing' conspiracy theory.
> when the entire development world has moved into agentic coding
I think a lot of the problem with the current discourse is how black-and-white it is. Either you're a luddite or "ai pilled".
In most cases, LLMs can get you 80-95% of the way, sometimes less, sometimes more. And heck, sometimes, it just gets you somewhere wrong.
But it seems everyone is arguing about whether LLMs can be perfect software engineers in isolation running in a closet, and using that to say that LLMs do not have a massive potential in other scenarios.
Sometimes, I like to imagine how much more productive most organizations could be from the things that the internet gave us, even to this day. Most companies never really do even a fraction of what is possible. That helps to ground my view of LLMs as well.
The fault dear Brutus isn't in our language models, but in ourselves.
It's funny, but the more I know about the true Luddites, the more I see their point of view.
" the original Luddites were primarily protesting against machinery used to "fraudulently and deceitfully" manufacture inferior goods, bypass labor standards, and strip skilled artisans of their livelihoods."
Goods are usually (although not always) inferior when made by a machine. A hand-crafted solid wood table is still superior to something from Ikea.
Of course hand made tables are expensive. They service a sliver of the market. Ikea serves the rest of us who'd prefer not to eat off the floor.
Fundamentally, Luddites didn't like being replaced by a machine. They were skilled workers, who used to have very desirable skills. Most people didn't need their standard of quality (but customers had no choice.)
Their name is well known today because we never stopped replacing people with machines. Every industry has been "optimized" over and over again since the Luddite times.
AI is the first threat to the Artisans of today (ie programmers). We are just the most recent in a long history of Luddites.
In every change of this nature, some move on embracing the change, others do not. Some will find other jobs, possibly new jobs, others won't. Carriage drivers became Chauffeurs, some grooms became mechanics.
So sure, I'm a Luddite - I don't want to see my skills become cheap - but I'm also pragmatic. The change is here. I'd rather adapt than die.
>Of course hand made tables are expensive. They service a sliver of the market. Ikea serves the rest of us who'd prefer not to eat off the floor.
And yet my working class grandparents didn't eat off the floor, they had great quality tables.
Mine is made of disguised cardboard.
This is a big part of the problem, there is zero trust that any potential improvement in cost or access will reach consumers. Companies don't even bother telling us it will.
We will just be slowly moved into accepting a degradation as the new normal.
And yet, clothes would have remained very expensive if we kept doing fabric by hand. Even destitute people in the poorest countries can have clothes these days, the meaning of “who wears the pants in this house” has lost its original (一条裤子) scarcity meaning.
>the meaning of “who wears the pants in this house”
That's pants as opposed to skirts. It's a gender implication, not a scarcity one.
I think you are misrepresenting the luddites. They were not against technological progress. Many of them were the ones who invented the machines. What they were fighting for were labour rights and distribution of power. They were fighting against enshittification by some guy who stole their collective inventions and kicked them out by force.
They were not against more clothes for everyone - quite the opposite. They were against fast fashion bad quality clothes made in horrible conditions by people (or children) who had no other choice.
I think the case of Luddites shows just how strong anti-worker and anti-union propaganda is - making it a mock word was very effective at preventing uneducated people from understanding what they fought for.
Doesn't that just say "a pair of pants"? Or literally one line pants.
I completely agree with your sentiment of “black or white.” I believe it comes from social media with primarily “radical” perspectives being the ones in the spotlight. Just not an environment that promotes nuance or friendly discussion
>I think a lot of the problem with the current discourse is how black-and-white it is.
There is too much money involved for any rational debate.
Only on the pro-AI side. The "it is bad" side is diverse on the reasons why, but being overwhelmed by bad content isn't a monetary concern.
Some "ai is bad" arguments come from the threat of losing one's livelihood, which is a monetary concern.
Which is why I phrased it the way I did.
Black and white thinking is not *limited to* monetary concerns.
> There is too much money involved for any rational debate.
For the Sam Altmans of this world, sure, but how much money is the average AI booster commenting on HN actually standing to make?
If you are just invested in an index fund, a lot. If you are an HN commenter who is more likely invested heavily in tech stocks, much more than a lot.
The other side is the stability of your job or job prospects, and we are adversely affected by that instead.
> In most cases, LLMs can get you 80-95% of the way, sometimes less, sometimes more.
That's my experience too, but it's 60-95% solutions in my case[1], with about 120-140% of lines of code required. I wish there was a harness that would let me mask code it should/n't change, because prompt-based refactors fail from the same over-eagerness.
1. I try faster, smaller models first.
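For what it's worth, a crude version of that mask can be checked after the fact. This is a sketch under assumptions, not a feature of any existing harness: the `# LOCKED-BEGIN` / `# LOCKED-END` comment markers and both function names are invented here. The idea is to reject any edit where the text between the markers differs from the original.

```python
def locked_blocks(source, begin="# LOCKED-BEGIN", end="# LOCKED-END"):
    """Return the text of every marker-delimited block, in file order."""
    blocks, current = [], None
    for line in source.splitlines():
        marker = line.strip()
        if marker == begin:
            current = []                      # start collecting a block
        elif marker == end and current is not None:
            blocks.append("\n".join(current))
            current = None                    # block closed
        elif current is not None:
            current.append(line)
    return blocks


def edit_respects_locks(before, after):
    """True only if every locked block survived the edit byte for byte."""
    return locked_blocks(before) == locked_blocks(after)
```

A harness could run this on each file the agent touched and revert the change when it returns `False`. It also catches the agent deleting the markers themselves, because the list of blocks no longer matches.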
We had the same issue until we created a review skill that we run after a LLM is done implementing a feature. We give it a list of things to check that is based on the problems we have observed previously, like writing too verbose code, and ask it to report on issues and suggest improvements. The developer can then give feedback and let the LLM fix the issues, or just address them manually. It’s still early but I’ve been much happier now with the results. It makes it much easier as well for humans to review since there’s a report about what the change is about, why, things to keep an eye on etc. This is something you can do with any harness you may be using and there’s nothing to buy, just a suggestion from someone trying to make the best use of this insane technology.
I think that's geohot's point as well. They're advocating against being fully "ai pilled". They're saying we should be using AI as a tool, not arguing for being a luddite.
Yes, exactly, it's 'us' not the AI, which is great.
Why on earth would we ever remotely compare a 'tool' to 'a software engineer' ?
The 'great delusion' is not that 'AI can't code' - because obviously it can, and very well.
The problem is the 'anthropomorphism' and all this AGI nonsense.
If we called it 'Stochastic Mechanisms' and did not 'personalize' our prompts, refer to them as 'chat' or give them 'personalities' but remained in the domain of 'Stochastic Language CLI' ... then our metaphors would probably not cloud our judgments.
Let the philosophers argue about AGI.
You are a tool. You're a human resource, from the perspective of the organization. That pushes buttons on bunch of other tools. That's why you compare it.
Edit: I don't mean tool as a pejorative.
I don't know why you are getting downvoted. Perhaps because people don't like the sentiment. But it's true, people are hired as tools to write programs.
Both people and AI make mistakes. Perhaps the AI makes more, a lot more, but it's so fast, and works around the clock, and has no ego, so there is a chance that the benefits outweigh the costs.
The saner philosophers don’t need to argue about AGI because we’re absolutely nowhere near it.
So currently there are people who are buying grey market peptides[1], marked "not for human consumption" and injecting themselves with them based on dubious anecdotes and vibes, to make their skin clearer, build muscle mass, and so on.
Are they all suddenly turning into zombies? No. Do they have any real idea what that is going to do to their body a few years down the line? Also no. Could it be catastrophic? Maybe!
I think about this when I think about how violently much of the industry has pivoted into AI being the primary generator of code in the last 6ish months. AI is the peptide, your codebase[2] is the body. Literally no one knows how maintainable this approach is, because there simply hasn't been enough time to find out. It could be fine. It could be a complete mess, with your entire engineering team falling asleep at the wheel, lulled into thinking they understand what is being built when they don't, completely impotent to fix or maintain it once the LLM is no longer able to.
[1] https://www.bbc.co.uk/news/articles/cdr268m5pxro
[2] Well, _their_ codebase. I've stopped doing it with my own personal codebases, unless I genuinely don't care about maintainability or longevity
I think smart developers will be building isolated modules, so if your AI generated module keeps failing, you can amputate it and make a fresh one.
With the level of ability that AI is at right now, I've found it useful personally to think of it something like a very good search over existing knowledge. Another step up in searchability in the lineage of reference books, stack overflow, GitHub etc.
Programmers are rewriting and reinventing the same techniques more often than any other vocation I can think of, and so we were primed for a really good search over prior art. The fact that AI can also adapt that prior art to your particular use case makes it even more powerful.
Much like how great success never came from cobbling together various bits of copy-pasted code from Stack Overflow though, current AI can't really build your whole project.
> Programmers are rewriting and reinventing the same techniques more often than any other vocation I can think of
And the answer to that is clearly a tool that makes rewriting/reinventing cheaper than actually packaging nice reusable libraries
And on other hand I really do not understand how basic project boilerplate templating wasn't already a fully solved issue. Surely it should have been doable...
I'm all for packaging nice reusable libraries, but someone has to actually do it. A lot of them just don't exist (yet).
"The fact that AI can also adapt that prior art to your particular use case makes it even more powerful."
Well that's what everyone is claiming anyway
Yes, I don't have anything important to say other than I 100% agree with this comment. AI in its current state is akin to Stack Overflow and Google on steroids, but from my experience, it doesn't do well building out full-scale applications other than perhaps some initial scaffolding.
If I were to use it against a legacy, rather poorly written codebase, where the code may be hard to understand without some in-depth analysis, I could certainly ask an AI agent to read the code (How does application X do Y, for example), but I wouldn't have it start hammering out features or have it do any type of refactoring. That would cause far too many commits and confusion amongst the development team, leading to even more slop than whatever we'd already be dealing with.
Just leaving this comment here so I can come back to your comment. I've been getting a bit discouraged by AI lately, but this sums up my experience with it well enough.
> Yes, I don't have anything important to say other than I 100% agree with this comment. AI in its current state is akin to Stack Overflow and Google on steroids, but from my experience, it doesn't do well building out full-scale applications other than perhaps some initial scaffolding.
We're currently using it to build out a full-scale application. It does as well as you care to coax into doing tbh. You have to invest heavily in harness engineering, and at least my experience has been that as you do that, the results improve.
>It does as well as you care to coax into doing tbh. You have to invest heavily in harness engineering, and at least my experience has been that as you do that, the results improve.
That is also my experience.
When starting a project I observe how the agent fails, I add new rules to the harness to prevent it from failing that way again, and I repeat the process until I am happy with the output.
I'm unfamiliar with harness engineering. Is there any good documentation about the subject you could point me to?
https://openai.com/index/harness-engineering/
https://www.anthropic.com/engineering/harness-design-long-ru...
https://www.anthropic.com/engineering/effective-harnesses-fo...
These were some of the first major articles on it. It's becoming a popular topic, so there's more content on it all the time.
I can't point you to a good complete documentation, because the field is changing very fast as people make new discoveries.
I learned by reading articles, success stories and failure stories, and mostly by doing: trying stuff, seeing how it works, adjusting it, and burning a lot of tokens along the way.
What I would do in your shoes, I would ask an AI chat to find new articles on the matter (including on HN), explain how Codex, Claude, Pi are managing agents.
My compressed view is: you need to have a great specification both business and architecture wise that doesn't leave anything important for the model to guess because chances are it will make the wrong choices. That comprehensive spec should not be in one huge chunk. Have your plan divided in phases that each fit in a context window and have the spec for each phase. Use TDD, strive for 100% coverage. Force the model to behave: if it doesn't do what is supposed to, give it feedback and ask it to retry and don't allow it to progress to the next stage unless everything is perfect. I also like to write comprehensive integration tests before building anything. The agents are not allowed to touch or read the integration tests, only run them and they will get feedback where the tests fail. I like to build the integration tests in a different language than the software I am building, to make sure there isn't something platform specific that the tests rely on. I use C#, Go, Rust and Zig for development and Python for the integration tests.
For now, to get good results, I can't just copy and paste the setup from a project to another, I have to work a lot to tailor the process for each new codebase.
And that's why I am working on an agent harness to try to force the agents to do the right things in most common development scenarios without wasting many tokens. Covering all common development scenarios is a large goal; right now I am working towards backend web development and microservices.
Sounds like bag pipes to me LOL
In my experience, you’ll eventually hit a context window issue and it will just start spouting gibberish/doing wrong things, and nothing will significantly improve it. But hey, maybe it’s improved.
Well, auto-compaction is a thing in Claude Code now. Plus we have /goal command and some automated review stuff, so you can kinda just get it to loop until the automated reviews are satisfied and CI is passing. Does most of the heavy lifting.
One under-discussed phenomenon here, I think:
The hardest thing in software engineering is solving the right problem. The ability to identify the right problem to solve is, IMO, what distinguishes the top senior engineers. And we could have endless discussions about what constitutes the right problem, but for the sake of this discussion, let's reduce it to: the problem whose resolution adds the most value to the product for the amount of complexity and associated costs that it incurs.
Once upon a time, long ago, I worked on a Web product whose original junior designer had figured it would be neat to be able to manage the backend with LDAP tools. So the database schema and structure that the product used mimicked that of OpenLDAP, with compound CN keys, and the entire codebase had to deal with that structure whenever reading from or writing to the DB. LDAP compatibility was not the right problem to solve when designing the DB schema.
But software that solves the right problems can be hard to identify because, quite often, how it does things seems so obvious that it's not readily apparent what other designs might have been chosen.
Now, the thing that usually keeps the blast radius of wrong-problem designs limited over time, is the very friction that they introduce. Development slows down, including the development of more wrong-problem designs. It's a self-limiting phenomenon.
And that's one major thing which worries me about LLM coding agents:
They paper over this friction. They don't repair it; they just make it so its cost is deferred.
So you gradually end up with codebases that grow unboundedly complex for the value they provide, with no controlling mechanisms.
You end up with juniors who never face the feedback loop from which they'd develop the engineering instincts and the taste for what makes a problem the right problem to solve in a given design.
At scale, as a field, you might end up forgetting there ever was such a thing as solving the right problem.
And I don't know what to do about that. Plan for an early retirement, maybe.
I'm in the "haven't written any code in a while" boat ATM. I'd love to see examples of issues that are so big that they warrant reverting to manual coding.
My main issue has been the inconsistent quality across model releases and the tendency to insert older APIs or documentation, especially with command line tools.
I can understand if the model struggles with a million line monolithic codebase with a decade of cruft but can't think of why it'd be too much of a pain with new codebases.
Here's one that hit the frontpage recently:
https://blog.k10s.dev/im-going-back-to-writing-code-by-hand/
When every prompt produces a thousand line PR, you’re not very far from another million line monolith.
I’m a little more hopeful than the author though. I feel like it’s possible to manage the process so that does not happen.
It's not difficult to avoid the 1000 lines per PR thing: Depending on what kind of thing I am adding, the plan might also receive as instructions to value making as small a code change as possible. It still requires judgement, as on something big, the smallest possible code base is not necessarily the most readable, but this is the kind of thing one can decide with some experience and little work.
I've also managed to use LLMs to cut a lot of manual duplication in code where we typically didn't do enough investment: "Claude, evaluate code duplication in the functional test suite" will have no problem finding things like insufficient helpers, or tests that are testing simpler things as prerequisites, so they can rely on each other. So I am not seeing my codebases growing all that much. There's some risks of functional changes that before would be rejected due to cost which now are not, but I am not all that sure of how much that is controllable without being relatively antagonistic with management.
Monolith is an ordinary form of software btw - big ball of mud is the problem.
> manage the process so that does not happen
This is the gold, right here.
It doesn't engineer. It writes code. Enthusiastically. Usually without thinking about the bigger picture, the design, the architecture, the trade-offs, etc.
It's up to us to manage that process.
It's why senior engineers are finding LLMs a really useful tool - because we've learned to think about all that other stuff before opening the text editor. Writing the actual code was always the easy (and least valuable) bit.
What type of projects do you work on? In particular, how rich are they in novelty, non-googlable data points and non-trivial project-specific deviations from industry standards?
Even if agents failed at that I'd wager that's a very small percentage of software projects anyway.
> I'm in the "haven't written any code in a while" boat ATM
How long do you think it will be before you can't write any code because you're out of practice?
One of the dangers of engineering management is that it can turn you into a person that can no longer do the thing.
Does that even matter?
That's fair. In all honesty I'm already feeling challenged, but given how much time I save I can set aside some time to keep myself sharp. I can learn more languages. Additionally, as pointed out by others, I'm trading coding effort for design and strategy, which generally control business outcomes a lot more.
Having said that, I won't use AI for production system if I don't understand the programming constructs in enough detail.
> I'd love to see examples of issues that are so big that they warrant reverting to manual coding
Ah I see your org hasnt yet had an outage caused by a bad LLM code push.
This shouldn't actually change virtually anything. We had this happen recently, and were able to rollback within minutes. Devs hand-coding stuff breaks things too. If you already have good observability, fast rollback processes, and feature flag new changes plus do % based rollouts to limit the blast-radius, then it's more or less the same.
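As a rough illustration of the "% based rollouts" part, here is one common way to do it: hash the flag name and user ID into a stable bucket from 0 to 99. This is a sketch, not how any particular feature-flag service works, and `in_rollout` plus the flag name in the usage note are hypothetical.

```python
import hashlib


def in_rollout(flag, user_id, percent):
    """Deterministically place a user in one of 100 buckets for this flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    # Buckets below `percent` see the new code path.
    return bucket < percent
```

Calling `in_rollout("new-checkout", user_id, 5)` exposes roughly 5% of users. Because the bucket is stable per user, raising the percentage only adds users, and setting it to 0 acts as the fast rollback.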
sounds like bad deployment practices - canaries, guardrails, fast rollbacks, ring based promotions, cell based architecture, blah blah etc... humans write bad code too, there should be systems in place to protect it from releasing
I think people spend way too much time trying to say that LLMs are bad / shouldn't be used / etc because the LLM can't get it right the first time and/or makes mistakes. I think this is because we all hope that software/computers work like this in an ideal world, and LLMs are software.
This is the wrong mental model.
The way to think about an LLM is like a human: prone to following bad examples if it sees them, needs guardrails to catch mistakes, needs code review. It also needs access to what "correct" looks like: architectural design documents, skills that explain each type of change, etc. It needs prompting/skills telling it to follow a safe workflow, telling it to consider how a safe rollout would work, what a safe rollback would look like, what the performance implications are - just like a human.
The nice thing is that you now have a very knowledgeable assistant that can help write additional guardrails that would have always ended at the bottom of your backlog. Perhaps it used to take many hours to research and understand how to write a custom linter to catch a specific coding pattern. Today, ask Claude to do it and an hour later you'll have a custom linter rule for your language of choice, guaranteeing the same mistake can't happen again because CI will block it.
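To give a sense of how small such a rule can be, here is a sketch of a custom check built on Python's standard `ast` module. The pattern it flags (bare `except:` clauses) is just an arbitrary example, and `find_bare_excepts` is a name invented for this sketch. A real setup would wrap it as a plugin for whatever linter the project already runs.

```python
import ast


def find_bare_excepts(source, filename="<string>"):
    """Return (line, message) pairs for every bare 'except:' in the source."""
    tree = ast.parse(source, filename=filename)
    findings = [
        (node.lineno, "bare 'except:' swallows every exception")
        for node in ast.walk(tree)
        # A handler with no exception type is a bare 'except:'.
        if isinstance(node, ast.ExceptHandler) and node.type is None
    ]
    return sorted(findings)
```

In CI, a small script would run this over the changed files and exit non-zero when the list is not empty.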
"Ah I see your org hasnt yet had an outage caused by a bad LLM code push"
"We went back to shovelling by hand because someone ran over the pole with the front-loader, even though he had no experience driving it."
This is definitely user error; obviously it's a hard tool to wrangle but it's entirely possible to use it safely.
Seems like AI isn't really solving complex bugs and issues (that it itself created) in my MAUI project over the last 18 months.
Seems like it is completely hopeless at doing anything netcode consistency and performance related in game dev. Seems like unique game mechanics it doesn't do well either.
Seems like asking it specific UI stylistic changes is basically like throwing darts at a board and hoping it sticks.
Even with relatively simple things, frontier models get me about 90% of the way - and this is without evaluating how good that 90% actually is. It's the last 10% that the model fucking sucks at. And it's often the simplest things. It takes a lot of tokens and a lot of time to cajole the AI to get that last 10% working. And even then, I've just given up and had to go read the slop and fix the bug myself because it became so frustrating.
Part of my job is trying to make these models productive for the large corporation I work for. It's a lot of throwing tomatoes at a wall, and to a degree I see the issue he is talking about, with output seemingly having a certain ceiling.
At the same time in no part of his post is any code snippet or anything to latch on to of "the model performed poorly here when it should have done this" - this style of criticism seems to be a pattern of most of these "the LLMs will never work" style posts on blogs and twitter.
They obviously can perform better than autocomplete, and in my own day-to-day development they build out huge portions of a codebase at the level I would have expected from a junior or mid-level engineer.
How are we really supposed to grasp their actual capabilities when no one will actually cite specifically what mistakes they are making.
> How are we really supposed to grasp their actual capabilities when no one will actually cite specifically what mistakes they are making.
The mistakes they make are pretty subtle. Coding with LLMs can be like that scene in Whiplash – <excellent drumming >, not quite my tempo, <excellent drumming >, downbeat on 18, <excellent drumming>, you’re rushing, <excellent drumming>, dragging, …
Like yeah it produces working code almost always and the code usually does what you asked. And yet it makes you want to throw a chair because it’s not quite right in frustrating ways and it doesn’t even have the taste to know how it’s wrong.
Yeah. It does this. Pretty consistently and replicably depending on the issue, in fact! Yet I can point exactly where it fails.
Why are we not showing the bad choices? On my computer I have hundreds of diffs stored by my agent code review tool that point to style/architecture failures (and in the end, the result of that iteration on the AI output)
I'm not quite sure how people are generating unsalvageable outputs. I'd never ship the result of a first AI pass, either. I review all the code and the architecture, within reason (eg: in Rust I don't preoccupy myself anymore with precisely scoping pub, or whatever, unless I'm making a library crate). I send a "changes requested" prompt+json to my agent, and it interactively fixes everything (even style, even comments with manual patches from my in-review-tool editor)
This is an excellent point, and as a novice using LLMs for projects I could never previously dream of doing I find myself looking for the same, examples or citations of what exactly agents are writing incorrectly and how would the human do it better. I'm sure they're out there, maybe someone can refer some good content showing such examples.
I have no doubt the top nth percent of coders could write circles around Claude or Codex, but how much worse are they than your average schnook?
Reality: the top nth percent of coders are seeing absurd, dramatic gains in productivity using LLMs. See: antirez, Simon Willison, Steve Yegge.
The more experience you bring to the table, the more value you get from these tools.
Look, about 12 years ago articles about how if you're not pair programming you're doing it wrong were on HN's home page every day. Doing well prompted plan -> agent -> debug cycles is like pair programming with someone that knows every SDK and API intuitively and doesn't have to pick up their kids from daycare at 4pm.
antirez is famous for creating Redis, which took a dump in quality and everyone switched to a fork called Valkey.
Yeah, "everyone".
The name of your 35 day old account is appropriate. antirez has taken dumps that have accomplished more than you.
The problem is what they do to large existing systems: subtle misunderstandings mean subtle bugs are constantly being introduced, and very few shops have adequate systems in place to receive reports of subtle issues at the rates they occurred 10 years ago, let alone today. And don't even get me started on llm-assisted support that some might suggest as a solution.
When people write blog posts about how LLMs failed for some particular task, the responses from boosters invariably fall along the lines of "just use this other model/just tweak your prompt like so/you're just not skilled enough—you can't make fundamental arguments about AI by citing specific examples."
So we can't make arguments by citing specific examples, and also can't make arguments by not citing specific examples. Whelp, I guess that's the ball game.
(yes yes, I'm committing a group attribution error, but still)
This article goes into quite a lot of detailed examples that include code snippets that demonstrate poor architecture: https://blog.k10s.dev/im-going-back-to-writing-code-by-hand/
AFL didn't find more vulnerabilities than LLMs. AFL and skilled practitioners found vulnerabilities. AFL triggers faults, many (most?) of which aren't exploitable, and humans (or, now, agents) have to triage and evaluate them. And they did so in a pre-AFL corpus of memory-unsafe software. The heyday of AFL was a decade ago. Every target is harder now.
My guess is the models just continue to get better and better
When I got into agentic coding a year or two ago I was sure it was only good at autocomplete. Something happened earlier this year where the models hit a new level of capability.
Everyone I know now just does agentic coding, and it’s really amazing. I think we should just try pushing this as far as we can possibly go, it really feels like the acceleration of the human race is upon us.
We're already hitting some logistical limits. Even if transformers don't have an inherent capability plateau, we only have so many GPUs and so much power to improve them, and we're finding it very difficult to expand that infrastructure. Something like 6 GW of new DCs have been announced over the past 2 years of which less than 1 GW has actually been turned on and started serving, and the deliverable dates for the rest just keep slipping. (Not to mention that the DCs are all talking as if the chips in them will last 6 years, which is turning out to be a stretch.)
Sounds like we just need a Dyson sphere.
Besides, I have been hearing "this is the limit" since the doomers of "this is just a markov chain and can't be useful".
Yet the limits keep being broken.
Most of your comment history is defending LLM psychosis
Most of your comment history is less than a month old
what if we're accelerating to a brick wall?
More like neo-feudalism by way of breadlines.
It's hard to care much about how software is written when superpowers are invading their neighbors, democracy itself is under attack, the food supply chain is failing, and the world is hotter than it's ever been.
software plays a big role in enabling 2 (arguably 3) out of those. Least we could do is not make it worse. But it seems most of us are still just children. They give us something shiny that makes pretty lights and we happily burn the world to keep it.
Acceleration of the human race is the biggest cope I’ve read all year.
>... I was sure it was only good at autocomplete. Something happened earlier this year where the models hit a new level of capability.
Yes, something happened, it got better at autocomplete. What else could it be? The underlying model hasn't changed.
>acceleration of the human race
Please just stop with this bullshit. Nobody's curing cancer, climate change, inequality or whatever important real problem there is with LLMs. Nobody.
If this tech is good enough to make you more productive, it is just because you're not working on anything new or cutting edge or innovative. The only reason an LLM knows how to do your job is because that code has been literally written before enough times to appear in the training data. Try to use LLMs to write C++26, some HDL or in any niche stack and you'll get a nice reality check about LLMs.
I’m sure I would be just as useless as an LLM in the “niche stack” examples that you cited.
Why do you think that is actually a good argument against? Most “business” problems have already been solved in some way and the times I had to write really novel code in my career have been very very few.
Also sure LLMs haven’t solved cancer or inequality in the few years they have existed - but humans also failed here in the last couple thousand
Not reviewing outputs, which is my main issue, is one way to a subpar experience. No amount of "make it right" will fix that.
I hope that professionalism still matters as these new ways of doing things strike me as unprofessional as f...
Yeah, the next macOS will be worse... time to place bet on prediction market
ai agents can program, in fact, in our current time with current models, i'd say they program better than most people in the industry (an industry where people were literally copying and pasting from stack overflow for years prior)
being able to program is not the only skill required to be a successful software engineer, so no ai agents cannot be software engineers
very important distinction - i personally like the radiologist example - looking at scans is a part of a radiologist job, AI can do it better than most of them, but looking at scans is a small part of the job, most of it communicating with doctors to help their patients
Data from six months of production from one SaaS codebase provides a more limited response. Maintainability doesn't depend on the level of AI usage. Maintainability depends on the discipline during diff reviews. Good sessions: One topic per session; scope defined prior to the agent starting; all diffs read prior to committing. Poor sessions: Broad scope; undefined constraints; rubber-stamped results.
The quality of the codebase decays precisely at the rate you stop reading the results. This is not an issue of AI writing the code. This is an issue of unreviewed code. geohot's issue is entirely valid. This problem does exist. But this isn't dependent on the generation phase.
I agree that I can write better code than an agent.
But it can write working code much faster than I can.
And in a lot of cases, unfortunately, faster beats better.
> Unfortunately, faster beats better.
I think you have just written the epitaph for corporate software.
I agree it can write faster in a language I don’t know very well
At a granular level, it's almost guaranteed that you cannot write better code than an agent.
Agents now are writing extremely consistent, normalized canonical code, that usually compiles the first time.
Right out of the 'textbook'.
For what it's trying to do - it writes nearly perfect code.
The only thing you could nominally disagree with are some of the conventions and idioms.
It 'writes a perfect novel, in perfect prose'.
What it will not do however, is 'write the novel that's in your head'.
And that's the crux of it.
It's not even your job to 'write code' at this point, but rather to be the storyteller - and a very good editor who has enough taste and grasp of grammar to be able to know when it's going awry.
It will make mostly what you tell it to; the quality of the output is the quality of your guidance, but at the lowest levels it's generating extremely high quality syntactic prose.
I don't think LLMs inherently do anything perfectly. They can make sure it compiles and passes tests and they can be trained to do an enormous array of tasks, but the code it generates isn't perfect, it's selecting one of many possible outputs based off of some numbers it came up with after a few matrix multiplications and ReLU activations.
Those matrix multiplications aren't a divine perfect thing. They suffer from floating point precision issues and training data issues and there's still debate if adversarial examples are just an unsolvable property of our linear-algebra based neural network architecture.
Can they do things way faster than a human? No doubt. Can they do very complex tasks? Yes. Do they do things with perfection? Not by our human definition of perfect.
"Those matrix multiplications aren't a divine perfect thing. They suffer from floating point precision issues " - this is not the right intuition.
"Not by our human definition of perfect."?
'Human definition' has nothing to do with it.
Your job is to define what you want, to the extent you can do that, the AI does really well at a certain scale, at the 'functional' scale, nearly perfectly.
Syntax is the least of my concerns.
Modelling a problem is what I'm concerned about. And I'm currently better than any AI agent at doing that, given enough time.
> the quality of the output is the quality of your guidance
I wish you just started with the copout.
>When people see an artifact, they make assumptions about the process that was used to create it. Without even thinking about it, they assume the creator had a basically human state of mind. This assumption is no longer true. Things can be broken in ways that weren’t previously possible, and old proxies of underlying quality like syntax and grammar are useless. AI produced artifacts are not produced by the same process as human ones, and this difference, while extremely subtle in statistics, makes itself obvious when you try to interact with and build on the artifact in human ways.
Once humans just had oral language, and we could use words to pass ideas from one human mind to another. Then with writing, ideas could pass to minds that weren't immediately close together in space or time, and with this we made a complex, globe-spanning civilization. When words just become noise, when one has to be suspicious of each one as to whether they're coming from another human mind or just a statistical process, can this civilization even survive?
It all depends on the code quality bar. If it's high, a lot of tasks will not be completed much faster. The main speed comes from trusting the LLM's output. When you review each change and reprompt the LLM to make the code look like you want, suddenly things become much slower, and reviews/reprompts are very mentally exhausting.
While I was reading this post, Anthropic sent me an email with the subject line "Your account has been suspended". What a coincidence :D
I wish these posts that talk about non-human mistakes that agents make would post some examples. They would be interesting to see.
Wonder if LLMs in autoresearch loops would be able to complete the tasks geohot has in mind with, say, 100x the average token budget.
If the answer is yes, the argument doesn’t matter: you just run the loop and wait for llm analog of moore’s law to get costs down.
Loops make the code even worse. The more local the changes, the better LLMs are at them.
When a blog like this goes completely black or white on a topic I get skeptical. Nothing in life is 0 or 1. Neither is AI. It has some good to it and some obvious issues. All not that big of a deal. Ppl try to position themselves on the edges bc that’s what polarizes and engages conversation…
I think geohot is somewhat of a clown but I think he is speaking reason here and I'm happy to see voices address this. Most seniors I work with agree.
For context: the author is George "geohot" Hotz, who has a long list of exploits, likely the best known of which is basically vibe coding (I mean that in the nicest possible way) comma.ai for autonomous cars on a shoestring budget long before actual AI vibe coding was a thing.
https://en.wikipedia.org/wiki/George_Hotz
When digital cameras replaced traditional ones, we thought it would make photography more democratic: each of us would be Helmut Newton for 15 minutes. But it didn't give us the beautiful portraits and inspired landscapes we expected, only millions of pictures of food.
How much will it take for AI agents to pass from distilling decades of collective wisdom to copying each other's worst mistakes?
>But it didn't give us the beautiful portraits and inspired landscapes we expected, only millions of pictures of food.
Here's a sample of my work using digital cameras, not a food picture in sight.
https://flickr.com/photos/---mike---/albums/7217772029640662...
The thing about having the ability to take effectively free photographs is that it really lets you experiment and learn the edges of what's possible.
I was inspired by Stanford's camera array, and wound up doing virtual focus synthetic aperture photography. I'm hoping to build a rig to do it on near real time, instead of the manual process I used to do on my train rides to and from work.
Sure, the removal of cost led to a flood of the mundane, but it also means we can capture our lives in ways that even kings couldn't afford in the past. I have thousands of good photos, and even some video, of friends and family.
"Things can be broken in ways that weren’t previously possible" and also "Things can work in ways that weren’t previously possible". It all depends on what you use the tool for. If you're a bad carpenter, you're going to do a bad job regardless of whether you have a fancy hammer or a basic one. If you're an expert, give them a basic hammer and they'll do the work, give them a fancy hammer and they'll do the same, perhaps a little faster (or not).
I don't think Geohot has a good idea about LeCun and Hutter's views on the limitations of LLMs. I think that on abstract, textual domains, LLMs perform superbly, and they would agree. I am not too well-informed about LeCun and Hutter's views either, but I think that:
LeCun thinks that LLMs are a bad fit for AI that understands the physical, dynamical systems that we inhabit, and that understanding this is necessary for AGI/ASI.
I don't know that Hutter is bearish on LLMs, but Hutter is interested in AI that can reason exceptionally well given infinite compute, and approximations of such a reasoning AI. I think he is open to the idea that LLMs can be such an approximation.
From the article:
> Without fully endorsing all their ideas, I’m now in the LeCun/Marcus camp on LLMs.
I'm pretty sure he means "Yann LeCun and Gary Marcus" not "Yann LeCun and Marcus Hutter".
A problem that's impossible to detect is indistinguishable from it working. Hence it works. Hence it's not a problem.
> and it’s taking longer and longer to realize that they can’t
For something to take "longer and longer" to realise, doesn't that imply that it's been realised at least once before or that there was an expected deadline for the realisation?
Okay, that's a nitpick.
I read it as "agents can't program, and with each new generation of agents it's taking longer and longer to realize that that specific iteration can't". Maybe taking the Principle of Charity too far, I dunno.
Nah, that's fair, I was just being a bit tongue-in-cheek.
I don't think you can go completely hands-off for quality products but you can relax and let the agent do as much as possible. It does enable things that probably wouldn't have happened otherwise.
If you are already comfortable with letting other devs work on features then it's easier, because it's similar (arguably you have more control with AI, because what you say goes regardless of hierarchy).
> It’s definitely a better Google for most searches
I can't agree with this. You tend to get one point of view, often without any actual resources and references so you have to go look it up yourself, on [insert search engine]. Plus, what does it say when we consider an AI the one stop for our data intakes.
I find that it's typically better than Google search has been for a while, but not better than it's ever been.
More so tells me just how bad Google search has become and just how bad content in general has become.
how do you measure if google’s engineering org is more productive than meta’s? What about comparing 2 startups/small teams?
I think the discussion about methods (coding agents included) depends on answering those questions. Seems pointless to claim these agents [don't] make you more productive.
Although, at a first glance, the productivity increase does seem like nothing I’ve seen before. Even more than the transition of making webapps in plain js -> jquery -> frameworks or going from something like Flask to using Rails.
Problem is this is not evidence based. I just feel prototyping has sped up 100x. So the number of iterations/attempts has gone up. Transforming specs into a test suite takes a fraction of the time. Dunno, feels weird not to be able to be overall more productive (do more with less time) if you have these new tools.
Smart guy but whoever eventually actually fixes X search will probably use AI coding assistance to do it.
This post hits the nail at a bit of an angle.
The AI agents are great, and any expert can prompt them correctly to get good code. LLMs occasionally pick wrong patterns and start digging a hole, but this is why an expert is required. The code itself is just not worth writing when a detailed prompt can get you the same code typing 20x less text.
Where I agree with the post is:
The adoption of AI agents into software engineering is a problem. Solo projects are great, but our teams have not adjusted to how fast the mental model of a project now changes. So I see orgs making a choice to either slow down or forgo the shared mental model.
Anybody choosing to forgo the mental model is building crooked legacy slop at scale. You can and should save the mental model to an AGENTS.md, but devs need it in their brain to prevent the digging a hole behavior.
To be fair the digging a hole behavior is something humans do just as well. But in teams you'd communicate enough to catch it - hopefully^1. It's the combination of higher speeds and teams that's creating a bit of a disaster.
I'm not sure what a good solution is either. There is a case for solo devs running for 2-month sprints with much more freedom. Perhaps we'll have an "AI Agile manifesto" within a year.
[1] Though you should not underestimate the amount of poor code being created before LLMs. There are enough teams for whom LLMs are practically all upsides. Stay very far away from those.
‘If it hurts, stop’ vs ‘if it hurts, do more of it’. Organizations have a choice, some slow down to avoid, some speed up in hope to make issues… non-issues. If the go-fast orgs find workflows that actually truly speed things up without loss of quality, it’s like hitting the jackpot - you’ve found a way to run away from competition without them even realizing it’s possible (for a while anyway, until they notice they’re grossly outpaced).
Eh but statistical models are obviously useful, because statistically 99% of your codebase won't involve new idea invention. Tools that write all the boilerplate code used to have names and job titles.
I hate how both the for and against case for LLMs are just so bloody terrible at addressing these things.
This is a good take. The most effective combination of AI and skilled practitioner is using AI to amplify the abilities of the skilled practitioner. And in particular, max benefit comes from exploiting comparative advantage. AIs are really good at boilerplate -- in many cases better than humans because humans will optimize the process by doing copy/paste and often inject errors in the process -- whereas humans are better at abstract and critical reasoning. There's a very real and valuable use case for AI, but it's not replacing humans, it's taking the things that humans don't like doing (and that a computer can do well already) off the human's plate, so humans can focus more exclusively on the things that they do better than the AI. And at least with the current architecture of AI models, there will _always_ be higher-level reasoning that humans do better than the machine.
This. A ~staff software engineer designing big changes at one level above the raw implementation details using Opus 4.7 + superpowers today can genuinely ship multiple times more at the same quality level than pre-AI. The level of what a whole team could ship before.
You have to use something like superpowers, the key is that the humans need to make the important decisions.
You have to review the code - just like you had to review the code humans wrote. There will be iterations.
You have to give the LLM skills and patterns to follow, access to architectural documents, etc, just like humans needed to be onboarded at a company and do the same.
If you get all of these right with today's LLMs, you will never write code at all because it is so obviously not the best use of your time. If you feel that you are still better at writing the code manually, you have not done the above right, fix your workflow and try again.
If nothing else, Eternal Sloptember is a term that seems obvious once you have it. I can’t believe this is the first time I’m seeing it.
If you were on Usenet before '93 the words still haunt you
Joined Usenet in 1990 and for a few years it was great.
I always wonder whether HN suffers from periodic influxes of newbies who don't get it yet and rile up the regulars.
There was the Reddit Exodus of 2023 and there seems to be a fair few new accounts posting recently, which feels like an Attempted Agentic Takeover.
It's one of my favorite parts of online trivia.
I was very much unborn in '93, could you share the folk history for the record?
https://en.wikipedia.org/wiki/Eternal_September
> The bottom performers won’t have that self check. They are the ones producing 10x output with the agents. What do you think is happening to the average output of that organization?
Nailed it!
At my last place this was encouraged (by non-technical leadership driving the AI adoption policies, as well as setting salaries) and seen as a huge win.
The "step change in number of created PR's" was celebrated (cult-style), and by one of the (co) CEO's praised as a paradigm shift of the same magnitude as the personal computer. Meanwhile, I was stuck finding insta-reject level bugs in pull requests from people one-shotting 6000 line PR's "finally solving" long-standing issues from the backlog. Needless to say I left.
Why sloptember when it's May?
It's a reference to Eternal September, the name applied to the cultural shift of the internet as the general public started gaining access to it en masse in the 90s.
Sloptember is clearly a reference to this - the similarity being that masses of AI generated content, from social media posts to open source contributions are replacing the human internet. In a way this is related to the "dead internet theory", an idea I previously found hard to believe, but these days could easily be true.
If the history of the internet interests you, both these are worth looking up.
https://en.wikipedia.org/wiki/Eternal_September
> It is a golden era for buckets and buckets of slop, and a dark age for gems of quality.
I mean, this has been the trend for decades really, before LLMs were a thing. The incentive is skewed toward quantity rather than quality. The new tools just add more fuel to the fire.
Code quality is also really lacking in much of the industry. The truth is, these LLM models, as limited as they are, program at a level above that of the median junior programmer.
People misunderstand how AI is used in coding in normal work environments. New feature requirement comes - maybe you need a new service or some new classes. You need to do some research first.
You guide the AI with some prompts and give it some guidance on how to scenario-test it. It makes some classes, test methods. Maybe ~2000 lines and you do a quick verification, check if the overall idea looks okay. Ask it to fix a few design things and then merge it.
It's much easier than doing it yourself with all the boilerplate and understanding each esoteric language-specific thing. Which library do I use for UDP communication in golang? The agent might have made a good assumption. These kinds of things are where it speeds things up.
If you don’t know what library to use in your specific language, do you think you know enough to have an LLM generate most of it?
I just joined a new team and have been using copilot with opus models.
We have our core code in a weird dialect of C and rust. C I know well, but not rust. Our tests are in Python. The pipeline descriptions are in Yaml.
Outside of the core code there are so many arcana to learn. Writing syntactically and semantically correct yaml/Python test code would be a nightmare. The Agents have flaws, but they provide a huge leg up in improving the tests.
And they are great at providing a first pass review of the core code before bothering a human reviewer. Lastly we run some of our test failures through AI triage, which often enough finds the root cause or rules out simple failures.
This shows up in a higher checkin rate. I'm curious to see whether this will lead to quality end product since we have more support for the more manually written and reviewed core product code.
YES. This line of thought is exactly why people are still skeptical of LLM's.
LLM's are directionally right and if their answer "fits" then I take it at face value.
I wrote a blog detailing the computational difference between "generation" and "verification" and why it matters for LLM's: https://simianwords.bearblog.dev/the-generation-vs-verificat...
As an example: I asked the LLM "synonym for "provides" that also means "places" on you" and it gave me 5 answers and I immediately knew the right one was "confers". How? It just fits. Just like most things.
People are skeptical of LLMs because of the experiences they’ve had with LLMs. You can’t blog your way out of those experiences.
I’m skeptical because I’ve seen this exact situation and I’ve seen the result be something that anyone experienced wouldn’t do.
Sometimes I think folks having ‘experiences’ should play more poker.
The point is, it’s a game of chance and yet good players beat bad players in the long run. Your job in the new era of software engineering is to design the process so LLMs doing your code monkeying avoid the losses (including discarding bad changes) and take the wins. Win often enough and you’ll come out ahead.
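One way to read "discard the losses, take the wins" is as an accept/reject gate in front of generated changes. The sketch below is only an illustration under that assumption; `Candidate`, `first_passing`, and the two example checks are hypothetical names, and a real gate would run a test suite and a human review rather than string checks.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Candidate:
    """One proposed change, e.g. a diff produced by a single agent run."""
    label: str
    diff: str


Check = Callable[[Candidate], bool]


def first_passing(candidates: Iterable[Candidate], checks: List[Check]) -> Optional[Candidate]:
    """Return the first candidate that passes every check, or None.

    Generation is treated as cheap and unreliable; the checks are the part
    you control. Anything that fails a check is discarded, not patched up.
    """
    for candidate in candidates:
        if all(check(candidate) for check in checks):
            return candidate
    return None


# Hypothetical stand-ins for real gates such as a test suite or a size limit.
def small_enough(c: Candidate) -> bool:
    return len(c.diff.splitlines()) <= 400


def has_no_todo(c: Candidate) -> bool:
    return "TODO" not in c.diff
```

Returning `None` when nothing passes is deliberate: in this framing, walking away from a bad hand is a valid outcome.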
that's a very dangerous analogy, because you would be considered the domain expert and you are just asking for synonyms for something you already know but may not remember off-hand.
now, what if you asked for the synonym for "provides" in a language that has gender differences (e.g. spanish/portuguese) as well as societal nuances (e.g. japanese) and it gives you "confers", how would you now know that's correct?
ah, so you say you tell it to take into consideration gender differences, as well as societal nuances. What are those, if you were not already familiar with the language?
Yes, just throw 2kloc over the wall for some feature. Your coworkers must love you.
My coworkers don’t care, they’re doing the exact same thing.
"It’s definitely a better Google for most searches"
This is dangerously incorrect. AI summaries of search results consistently return incorrect information and grossly oversimplified and thus misleading summaries, neither of which are detectable unless one either has prior domain knowledge or spends time drilling into search results to validate the AI output.
My experience with ChatGPT as a search engine - it is totally paranoid about checking and re-checking its answers by referencing them in multiple places (I usually read its thinking output). I have not seen an outright hallucination for at least a year. (It is of course a different situation with Google's "AI summary" which is wrong half of the time.)
Ironically, I quit using ChatGPT a while back. I decided to run it through its paces and asked it some rather detailed questions about a range of topics that I have significant domain knowledge on. Without exception, the responses I got back were glibly superficial, to the point that they were almost totally devoid of meaningful information. The AI summary on Google search results is so bad it represents an assault on reason.
> And whenever you need a quick prototype and don’t care about polish, it is absurdly fast. But is it a software engineer? Not close to the bar at any company I have worked at.
This line, which he wrote, will override any quality gaps, because the cost to produce that shitty software will be lower than the cost to produce good software.
I don't think LeCun is saying they won't be able to program. I think he says we won't hit AGI. Programming does not require AGI; it's a pretty specific skill!
-- I think this article is COPE, if I'm being quite honest. I thought of putting cute analogies, like the C programmers saying the Python and Javascript programmers are not "hardcore" enough... but the truth should be obvious to anyone using LLMs effectively.
-- Current AI is a much better programmer than 100% of people and when directed by someone in that top 10%, it's a force majeure.
This is my experience. Though I've been writing LLM harnesses, agents, tooling, etc. for 5 years now and believe it requires several hundred hours of experience before understanding how to consistently outperform at scale.
preach it
There's a time and a place for assembly language programming. Of course, I knew someone who would say there's a time and a place for machine language programming (improved it by reprogramming a device by flipping the 17 switches on the front panel)
Wake me up.... when sloptember ends.
It won't..
It really feels like a mass psychosis. I'm not an AI sceptic insofar as I fully expect to get replaced by some future AI system. But what we have now isn't it.
To use a Geohot-inspired analogy, what we have now is like the Google self-driving car of 2010. It works most of the time, yet sometimes fails in unpredictable ways. So you need a safety driver behind the wheel to constantly watch what it's doing (the code review).
A real AI agent would not need a safety driver. We don't have that but many people are basically saying "fuck it, I'm just going to set this car off on its own and see what happens". And sure if you're prototyping it's not dangerous. But for production systems that is dangerous.
That is a fair analogy.
There is some very cool tech; it just needs continued refinement. There is a path forward, even if it isn't always the clearest. This is happening, but it is taking years and a lot of work to get done.
The more specific your work is, the more these LLMs seem to struggle.
If your work was previously googling Stack Overflow, it can be incredibly useful at working through that. Which, let’s face it, is what a lot of us do.
The horse is better than a car!
Good News! The horse and the car can coexist
Bad news! The horse population declined by 85% after the widespread introduction of the car.
Good News! I learned something today.
Bro claims to write good code. He got fired from Twitter in <4 weeks. AI is hyped, but the code isn't that bad.
spill the tea, fired or quit?
On the other hand we see success stories such as antirez using agents to work on Redis and Deepseek v4 flash inference.
> They are a highly sophisticated statistical model designed to mimic the distribution of programming
Are we really still doing this?
Author here. I have never said that phrase before this blog post and certainly understand the absurdity of it. I certainly don't mean that you need something biological or whatever consciousness might or might not be.
However there's still a distinction. Unless I'm responding to an LLM, you had a childhood. You learned about the world and space and agency before you ever learned how to program. And you didn't learn it from billions of examples, you learned from a few examples, some self directed experiments, some feedback from teachers, etc...
I'm saying that's what matters. The process matters. You didn't learn to mimic a distribution, you learned to program. Of course in the perfect mathematical limit it's the same, but in practice it's not.
For a lot (most) of what we do with programming, the process actually doesn't matter. I understand you are a real ass dude who is in this shit for the love of the game. I respect that. You are a true artisan and exist in a kind of rarified space. There will always be a place for people like you and in some senses you are correct - you are not replaceable by any AI as they currently function today.
However, 99.9999% of coding is not like that. Non-coders don't care about the code at all. They just care about outcomes. People don't care if it's "slop" if it works. Similar to bug prevalence, the optimal level of slop is not zero and will be decided by the market, not by coders.
Yes, until everyone gets it through their head.
I'm confused what else you think they are?
It's fundamentally how LLMs work.
"Magic"
Well, since the fundamental underlying structure is still the same, yes.
It's not exactly what it is; they now model an incredibly complex Markov process, with harnesses that control how that thinking is done.
Is this any different than how a PM gets a programmer to work on a project? They think, then they deliver. If given more time, maybe they deliver something better. Maybe they consult some text and try to apply a design pattern.
The LLM in this use case is perfect because almost everything involved is text based, and the model is able to take in all the expressiveness of language.
> Is this any different than how a PM gets a programmer to work on a project?
Yes, it's very different. You seem to be suggesting that the current frontier LLMs, when tied to their tools and harnesses, have emergent properties that are similar to human consciousness. If you truly believe that, I'm not sure how to have a productive discussion here.
It's not just that, but the core is just that, even with reasoning models. A harness can only get you closer to a good result; it can't save you from every pitfall. As for the PM analogy - don't forget that models don't learn, and they keep doing the same stupid stuff they were doing a month ago.
Agents are perfectly capable of learning. Why would the model need to learn? The harness and tooling are all that matter.
But it's not useful, because even humans are like that - a bunch of neurons slapped together. Overall a tired analogy that is more suited to stay in 2024 where it belongs. Right now it is clear that it is _much_ more than a statistical model semantically. It is misleading to claim it is _just_ that, just like a human is _just_ a statistical model.
Neurons? Go lower. Just atoms. Dumb, senseless atoms.
It's the mantra of the AI skeptics. Sounds so clever because it's technically true. Just like humans are just piles of oxygen, hydrogen, nitrogen and carbon atoms, along with maybe a dozen or so other trace elements, none of which have any intelligence or will or desire to do anything - hence humans cannot possibly be intelligent or have any free will.
That's a very good way to describe what they do (better than 'AI'), but ironically, it explains the mechanism really well, and how they are in fact able to 'code so well', which is contrary to the author's own premise.
Agents code extremely well.
They're not particularly good at 'architecture' and I think that's where his specific concerns about 'not being able to see the problems' arise - the issues are almost never in the syntax, because the AI writes perfect code. The issue is that it's not doing exactly what you intended.
Instead of 'missing the target' ... it's 'hit the wrong target perfectly'.
Any senior developer working with AI daily should be able to have a baseline intuition for all of this, and would therefore reject the hyperbole of the premise 'it can't code!'.
Of course it's producing gargantuan amounts of slop - that's not because 'it can't code', that's something else entirely.
Yes, the beatings will continue until morale improves!
To me this sounds like an old cobbler complaining that machines aren't producing good shoes if left unsupervised and that the old process of making shoes completely by hand is far superior.
So what is he telling us? That agents are not infallible, that they are not capable of one-shotting complex software, and that they do not produce perfect code?
We know that, and the solution is to use agents for what they are good at, work around their limitations, and keep a human in the loop.
>not some RLVR shit that comments out the failing test and tells you all the tests are now passing
That's what harnesses should be about: detect when the agent is misbehaving and force it to take the right approach.
This example in particular should be easy to solve if we generate the tests before coding and have a workflow or state machine that doesn't allow the agent to disable tests and doesn't allow it to reach the next stage unless all tests are passing.
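A rough sketch of what such a gate could look like, assuming the tests live in a dict of file name to contents and that the harness supplies its own test runner. `Workflow`, `freeze_tests`, `submit` and `run_tests` are hypothetical names for illustration, not any real harness API:

```python
import hashlib
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Dict


class Stage(Enum):
    WRITE_TESTS = auto()
    IMPLEMENT = auto()
    DONE = auto()


def fingerprint(files: Dict[str, str]) -> str:
    """Hash test file names and contents so any edit or deletion is detectable."""
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode())
        digest.update(b"\0")
        digest.update(files[name].encode())
        digest.update(b"\0")
    return digest.hexdigest()


@dataclass
class Workflow:
    # Hypothetical hook: returns True only if every test passes.
    run_tests: Callable[[Dict[str, str]], bool]
    stage: Stage = Stage.WRITE_TESTS
    frozen: str = ""

    def freeze_tests(self, test_files: Dict[str, str]) -> None:
        """Lock the tests before any implementation work starts."""
        if self.stage is not Stage.WRITE_TESTS:
            raise RuntimeError("tests are already frozen")
        self.frozen = fingerprint(test_files)
        self.stage = Stage.IMPLEMENT

    def submit(self, test_files: Dict[str, str]) -> bool:
        """Advance to DONE only if the tests are untouched and all pass."""
        if self.stage is not Stage.IMPLEMENT:
            return False
        if fingerprint(test_files) != self.frozen:
            return False  # a test was edited, commented out, or removed
        if not self.run_tests(test_files):
            return False
        self.stage = Stage.DONE
        return True
```

The agent never gets to call `freeze_tests` again, so the only way forward is `submit` with the original tests passing. Commenting out a failing test changes the fingerprint and the submission is simply rejected.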
LLM proponents always use some language like "these old, stuck up dinosaurs with their manual labor vs us cool smart kids with automated labor", but they forget one thing - with automated labor the performance and cost difference was easily measurable in favor of the automation. With LLMs it's neither measurable nor visible (no better software, no faster delivery overall in the industry), and the costs are pretty bad. Besides personal anecdotes of someone toiling away at yet another AI harness project on GitHub.
Right now, to get good results from AI and save time, you have to spend a lot of tokens and money. Maybe in the future things will get better, I don't know.
Saving money is the wrong reason to use AI now. AI is expensive if you want good results.
But what AI is good for is that it lets you build fast.
Also, I don't see everything being automated. To get good results you have to drive the AI.
The factories still have workers supervising the process and doing some high value manual processes even if most of the production is done by machines.
Nah, this person is dead wrong. Let's come back in 2 years and check on it. I'm willing to make a reasonable bet on these terms: companies will go even more AI native, will use even more tokens and spend even more money.
EDIT: To people downvoting me, please come up with a reasonable bet and let's try to work it out.
EDIT 2: $500 bet paid to your account on whether LLMs are going to still be used productively or not. No one?
EDIT 3: Any bet that would express the author's argument in a way that can be disproven in the future
The author is not arguing against your bet. I think the author wouldn't be surprised if you won that bet. But that doesn't change the argument he does make.
You could both be right.
No, I don't think people will be spending even more money on AI if it is not becoming productive. 2 years is a long time to get used to it.
Coders underestimate the utility of AI in so many boring day to day tasks. If you freelance, that’s where the money is at, not in creating a startup that fills holes in AI offerings or in creating generic slop while hoping for ad money.
The number of domain-specific apps that will be created will likely make Excel look like yesterday’s news.
I heard that a year ago, I guess we still need to wait a bit more. Thought agents were fast!
People have been trying to replace Excel for the last 40 years...
We all remember cryptocurrency. Everyone in tech proclaimed fiat was dead, every office buzzed with talk of every possible way that cryptocurrency could be used, billions of dollars flooded into projects losing money hand over fist. The cynics reacted to the froth with outright rejection of the idea. And today… cryptocurrency exists, it has some use, but it didn’t take over the world, it didn’t kill fiat, it was useful in some areas and worthless in others. AI will be the same. The noisiest proponents will be exaggerating. The most cynical cynics will be underestimating. The result will be somewhere in the middle. Success will not be predicated on adoption of the technology. We, nerds, are bad at predicting the impact of technology.
Almost nobody used crypto. Everyone is using AI every day for productive work, and almost nobody would give it up.
I don't understand how it's remotely reasonable to try to make the comparison.
I wish people would stop comparing AI with cryptocurrency. The hype/perception was the only thing that was similar between them. The fundamental usefulness of the respective technologies is not comparable.
Two other similarities: they both rely on burning huge amounts of electricity, and have driven up costs of GPUs around the world.
The cognitive dissonance around this is astounding.
First of all define productive. Would someone using AI to build software at a startup which is likely to fail be considered productive? What if there is already similar software available that solves the same problems? What about the broad use of LLMs to draft emails or make silly memes?
It’s funny how everyone’s concerns around climate change just disappeared when they realised AI was useful to them.
One of the reasons it gets compared with crypto is because the same people who were all in on crypto schemes are now all in on AI schemes.
That's the fallacy. You think technology usefulness dictates the outcomes. There are billions of people living in poverty the world over, starving every day, we have the technology and resources to solve that right now, we have for decades, but we don't because we don't want to.
Could AI technology change the world? Sure. Will it? That depends on so much more than what the technology can do. Why are we all still working 40 hours a week? Why are people still hungry? We could have radically changed our world with the technology we have had for decades. Yet, we have not, we have continued, nothing has really changed.
The internet is a great example. What is the most impactful part of the internet today? Social media. Social media has radically changed our culture. What is social media? A database, a few endpoints and an app? The technology is the least consequential part, the consequence comes from how we use it.
Nerds focus on what is possible with the technology, not what society is likely to do with it. What evidence is there that AI is going to change the world? What change is going to come from... being able to generate plausible sounding text? From being able to instruct agents? How many companies are using garbage software from 20 years ago despite dozens of revolutionarily better equivalents being available out there today that could have drastically reshaped their workforce? What are agents if not better macros? How many businesses have hundreds of employees doing the same tasks over and over again that could have been replaced by a few macros? How much of the code that you and I have written in our careers has already been written before?
The fundamental usefulness is the least important part of a technology when discussing how that technology will impact the world.
What "society is likely to do" with a technology drops to zero for things that technology cannot do. That's a technical property. On the other hand, people quite reliably find unexpected uses for technology, beyond the initial hype, again, due to inherent technical properties. You're vastly overcorrecting against "nerds".
It’s a matter of timeframes too.
Cryptocurrency is more popular and intertwined with the financial system than it ever was, so while the claim isn’t currently true, that doesn’t mean it won’t be on a long enough timeframe.
If you are old enough, you will be aware that similar claims were made about email, but only one country that I know of (the Netherlands) no longer processes mail. Still, if we had to guess, I would say that we are still early and email will replace the world's postal systems.
You are predicting the weather by saying tomorrow will be just like today.
Which has a good track record of being right, most of the time.
I agree!
Geohot's next venture will be writing a book titled "Fear & Trembling".
" Agents cannot program, and it’s taking longer and longer to realize that they can’t. They are a highly sophisticated statistical model designed to mimic the distribution of programming"
In other words - they can program, and probably better than you.
I don't like being too critical but this is a really superficial post - as if either 'AI is a Software Engineer - or - It must be Fraud'
It's an extremely powerful tool that is very 'pattern oriented' and with guidance can absolutely write great code - and even across modules given the right basis.
It's also great at so many other tasks - finding bugs in big code bases, doing migrations etc.
It's not going to make very good architectural decisions for you, and if you're doing anything novel you have to read most of the code ... but that's to be expected.
It’s not like the author is a noob.
https://en.wikipedia.org/wiki/George_Hotz
In fact, he’s done several things that are truly hard, and has a well-deserved engineering reputation.
Well he doesn't write CRUD apps, which plenty of us do, and with a decent harness, agents can do decent enough work on them.
The author is absurdly wrong.
It's ridiculous to suggest that 'AI can't code' - when the entire development world has moved into agentic coding, including all of the best developers in the world, and it's yielding positive results in most scenarios.
It's a callow 'bad twitter take' the length of an article.
He's not wrong to suggest that AI is a 'stochastic mechanism' over all the code that's ever been written, but that's evidence of the mechanism, and frankly, it describes how it is able to code.
And yes - organizations will misappropriate AI at scale as they do with everything.
His premise is so far out of proportion and misguided, it's tantamount to a 'fake moon landing' conspiracy theory.
> when the entire development world has moved into agentic coding
Careful, your bubble is showing.