9 january 2026 - agents are the wrong idiom
I don't think "agentic AI" is the right approach for a bunch of reasons [1].
The most obvious and popular reason is maintainability over time. As the generated code grows and evolves, your understanding of it, and your ability to fix issues, drop precipitously. Eventually you have a mess of incomprehensible code that no human dev can untangle, and that might be too big for your own agent to fix (it could literally exceed the context window). I'm not sure that throwing ever-increasing compute at the problem is a practical solution, since you'd almost certainly need to keep scaling compute exponentially. The usual response is "you should be reviewing all the code it produces," but as anyone who's ever done code review knows, it is nearly impossible to gain a perfect understanding of a sizable amount of code you did not write. Furthermore, there is knowledge gained in the process of writing a large amount of code that isn't necessarily communicated in a PR.
Another big issue is internal software, which the LLM is not trained on. RAG is clearly not sufficient: there is a search bottleneck and a "putting together totally novel systems" bottleneck. Internal frameworks aren't necessarily similar to Internet code; they often have their own syntax, idioms, and patterns. Again, you can *maybe* fix this by continuing to scale compute, but my gut sense is that it stops being economical after a certain point.
There's also the problem of underspecified intent: a developer may think they can fully describe what they want, but in fact would have discovered what they actually want through the process of writing the code themselves. LLMs don't tend to have these "aha" moments; they don't push back against your wishes, nor do they necessarily think about the big picture. Unless we radically change how these models are post-trained, I don't see a clear solution.
There is also a real tension between code quality and code correctness: the current RL paradigm seems to push models to write highly correct code (e.g. through extensive error checks) but not necessarily high-quality code. Presumably this is an effect of rewards based on correctness, which is easy to verify, whereas quality is much more subjective and fuzzy. Now, the Twitter startup guys would argue that code quality doesn't matter, but quality principles are rarely unmotivated: generally they exist for the sake of maintainability, readability (see point 1), preventing logic errors, and future-proofing.
Finally, my biggest reason is that writing code was never the difficult part. Professional devs aren't limited by how fast they can type code; the vast majority of the job is spent debugging, agreeing on specs, communicating, etc. I had plenty of days working at Apple where I'd write little to no code, because my effort was spent elsewhere. This is why I'm not surprised agent coding tools are so popular amongst hobbyists, for whom going from zero to lots of code is the critical step.
With that in mind, there are plenty of cases where agents make sense: hobby and personal projects, basic web dev, little tools, and easily-verifiable code. But for large-scale enterprise development, I am not convinced these problems are solvable with finite resources.
So what should we do instead? My favorite way to use AI is as a hyper-intelligent debugger. You can point it at a bug and quickly get a decent guess at what the issue is. The guess can then be easily verified, and the cost of an incorrect one is zero. You can use current agents like this, but I would love to see an AI tool designed specifically for debugging. Imagine a passive agent that constantly checked new code for semantic errors, footguns, and subtle bugs. Sort of like Copilot for PRs, but integrated way earlier into the development cycle. The AI debugger then decreases bugs without increasing cognitive load, freeing devs to do the fun part: developing new features.
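To make that concrete, here is a minimal sketch of what a passive checker could look like, written as a git pre-commit hook: it grabs the staged diff and asks a model to flag suspected bugs, printing them as warnings. The `ask_model` function is a hypothetical placeholder for whatever backend you'd actually wire in; everything else is ordinary git plumbing.

```python
#!/usr/bin/env python3
"""Sketch of a passive "AI debugger" pre-commit hook.

On each commit, send the staged diff to a model and surface suspected
semantic errors, footguns, and subtle bugs as warnings. The hook is
advisory only: it always exits 0, so it never blocks a commit.
"""

import subprocess
import sys

PROMPT = (
    "You are reviewing a diff before it is committed. List any likely "
    "semantic errors, footguns, or subtle bugs, one per line, or reply "
    "'OK' if you see none.\n\n"
)


def staged_diff() -> str:
    """Return the diff of the changes about to be committed."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--unified=3"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ask_model(prompt: str) -> str:
    """Hypothetical placeholder for an LLM call; returns 'OK' (a no-op)
    until you wire it up to your model of choice."""
    return "OK"


def main() -> int:
    diff = staged_diff()
    if not diff.strip():
        return 0  # nothing staged, nothing to check
    findings = ask_model(PROMPT + diff)
    if findings.strip() != "OK":
        # Warn on stderr but never block: the point is to catch bugs
        # without adding cognitive load or friction.
        print("ai-debugger warnings:\n" + findings, file=sys.stderr)
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Dropping something like this into `.git/hooks/pre-commit` (or a CI step) keeps it out of the way: it only ever warns, and the human stays the one writing and understanding the code.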
mg
[1] Of course, I may eat my words at some point.