The Messiest Codebase in the Universe (so far)

On the tension between readability and evolvability.

[Image: Waddington's epigenetic landscape]

Nothing in biology makes sense except in the light of evolution.

Theodosius Dobzhansky

The most widely adopted firmware in the world has an unbelievably ugly codebase. Most of the code is completely ineffective; the whole thing has been repeatedly duplicated over the years; entire segments consist basically of a few lines repeated over and over, hundreds of times. Sometimes the duplicated sections have been tinkered with to develop new features; others just ensure that code gets executed the correct number of times. There are portions of code that run only to copy-paste themselves around. It goes without saying that there is not a single docstring, nor any truly descriptive variable names. Even worse: analysts have been piling up crazy nicknames to talk about sections of the codebase. And yet, its success is undeniable: it powers the most successful ecosystem. It constitutes the kernel of the only entities whose agency nobody would deny.

This is obviously the genetic code, where code is to some degree a misnomer that would, if taken literally, draw a frown from more than one biologist. Still, it is fair to recognise some very loose analogy with computer code. Both are durable registries whose changes, once enacted through the intermediation of an interpreter, translate into changes of a system and its ability to interact with the environment. (The term “genetic code” is mostly meant as a cypher to be solved. But note that Schrödinger used “code-script” with the same meaning I am implying here.)

The way genomic evolution has produced the mind-boggling variety of life forms we know today sounds so miraculous that a significant fraction of humanity still struggles to believe it. DNA has just accumulated random mutations over millions, billions of years, and the differential reproductive success produced by the interactions between the organism and the environment has led to the adaptations and diversity we see today.

A central tenet of evolution is that it selects at the surface: when replicating, genomes accumulate mutations, so there is a constant source of genetic variance. Variants that produce phenotypes with a higher fitness in a given environment are more likely to propagate their genomes. Evolution does not care how a given trait is implemented, as long as it is expressed.

The secret sauce of blind evolution: evolvability #

A famous image compares the probability that a random mutation produces an adaptive change to the chance that shooting at a car engine would improve the motor. This is actually an incredibly misleading mental picture: the car engine is a delicate, engineered system where each component was designed, and its interactions with the rest of the system optimised, under the assumption that reasonable users would not actually shoot at the hood to make the car go faster. Genomes and organisms are a very different story. Ever since life has existed and self-replicated, adaptations to the environment could only be driven through DNA mutations. So even though each single mutation is not in itself adaptive, the system over time acquires features that make it more likely that further mutations actually are. The technical term here is evolvability: the ability of a system to change adaptively to environmental pressures through evolution. Evolvability solves the conundrum: organisms are infinitely more evolvable via random mutations than cars are via hood shooting. (For digging into evolvability, a friend guarantees this book is a great resource.)

There is a long list of traits that increase evolvability by making the system more prone to benefit from further random mutations. The modularity of the design of organisms, their hierarchical organisation, the autonomy of the subsystems that compose them: all facilitate the way small mutations can produce major changes without being catastrophic. Subsystems are able to cooperate and mind their own business without micromanagement from above. In this way, the occasional reconfiguration of the high-level system can happen as the consequence of a single gene change, without specifying a whole different routine for each element that composes the system.

[Image: The effects of modularity in genomes: a single regulatory change in fruit flies causes a duplication of their wings.]

To some degree, the messiness of the genetic code and its evolvability are linked. A major challenge in producing useful changes that make a protein acquire a new function, for example metabolising a new nutrient, is that by changing it we risk losing the function it had before. Here, genomic duplications are crucial. The duplication of a whole segment of genome is not in itself adaptive. Even worse, it can produce imbalances that prove harmful for a cell. But it enormously increases the surface for further adaptive mutations to happen: a second mutational event can introduce new features without breaking the old ones. Even when a duplication leads to non-functional code, having it around makes a lot of “tinkering material” available. No wonder duplications are ubiquitous in genomes!
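The duplication-then-divergence logic can be sketched as a toy simulation (everything here is invented for illustration; the "functions" are just target strings): a gene adapted to an old function is duplicated, selection pins one copy to the old function, and the other copy is free to drift towards the new one.

```python
import random

random.seed(0)

ALPHABET = "abcdefghijklmnopqrstuvwxyz_"
TARGET_OLD = "metabolise_glucose"  # the function the gene already has
TARGET_NEW = "metabolise_lactose"  # the function we would like to gain

def fitness(gene: str, target: str) -> float:
    """Fraction of positions where the gene matches the target 'function'."""
    return sum(a == b for a, b in zip(gene, target)) / len(target)

def mutate(gene: str, rate: float = 0.05) -> str:
    """Randomly flip each character with a small probability."""
    return "".join(
        random.choice(ALPHABET) if random.random() < rate else c for c in gene
    )

# The genome starts with one gene, already adapted to the old function ...
genome = [TARGET_OLD]
# ... then a duplication event: not adaptive in itself, but it frees a copy
genome.append(genome[0])

for _ in range(2000):
    copy_a, copy_b = genome
    # Selection pins copy A to the old function: never accept a fitness loss
    candidate = mutate(copy_a)
    if fitness(candidate, TARGET_OLD) >= fitness(copy_a, TARGET_OLD):
        copy_a = candidate
    # Copy B is only selected towards the new function, so it can drift
    candidate = mutate(copy_b)
    if fitness(candidate, TARGET_NEW) >= fitness(copy_b, TARGET_NEW):
        copy_b = candidate
    genome = [copy_a, copy_b]

# Copy A never loses the old function; copy B climbs towards the new one
```

Without the duplication, every step towards `TARGET_NEW` would have to trade away fitness on `TARGET_OLD`; with it, both functions can end up encoded.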

And what about code? #

What does the evolution of life have to do with software? As it turns out, software engineers also talk about the evolvability of their codebases. Actually, maintainability is a more common, largely overlapping term (though it is recognised that maintaining and evolving a codebase are two separate things). Indeed, over very long timescales, people have observed the same broad trends in software as in genomes. (There are even papers directly comparing the organisation of genomes and software systems: Yan et al. (PNAS, 2010) compared the E. coli regulatory network to the Linux kernel call graph.) So, what makes code evolvable?

Very interestingly, some of the evolvability-enhancing features we saw above also appear among the good coding practices that improve maintainability. Code coupling (interdependencies that make the implementation of a single feature require a ripple of changes to be propagated throughout the codebase) is a major source of pain. Therefore, modularity is a very sane principle: scope the pertinence of each component, keep its responsibilities limited. In the same spirit, exposing stable interfaces between modules, behind which code can be changed without affecting how a module is used, is another core principle in software design; as is the layering of abstraction levels, each leaving the implementation details to the layers below. These concepts map, to some degree, onto the principles we have described for organisms: modularity, hierarchy, segregation.
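The "stable interface" idea can be shown in a minimal Python sketch (the `Storage` protocol and the class names are hypothetical, chosen just for this example): callers depend on the interface, so the implementation behind it can change without a ripple of edits.

```python
from typing import Protocol

class Storage(Protocol):
    """A stable interface: callers depend only on these two methods."""
    def save(self, key: str, value: str) -> None: ...
    def load(self, key: str) -> str: ...

class InMemoryStorage:
    """One possible implementation; swappable for a database-backed one."""
    def __init__(self) -> None:
        self._data: dict = {}

    def save(self, key: str, value: str) -> None:
        self._data[key] = value

    def load(self, key: str) -> str:
        return self._data[key]

def remember_user(storage: Storage, name: str) -> None:
    # Coupled to the interface, not the implementation: replacing
    # InMemoryStorage with another Storage requires no changes here.
    storage.save("last_user", name)
```

Swapping `InMemoryStorage` for a different backend leaves `remember_user` untouched: the software analogue of a subsystem minding its own business without micromanagement from above.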

In software, however, those ideas also stem from a necessity that is entirely absent in genomes: software has to be produced, fixed, maintained and evolved by human minds. Human tinkering is expensive; variance is not free in code evolution. (We could obviously produce random mutations to code, and this is the principle behind genetic algorithms. However, this has so far been infeasible in classic software. The closest precedent might be some viruses that rewrite their own code to evade detection.) Therefore, a clean codebase is the result of careful planning, skilled implementation, and thorough testing. Code variance comes from deliberate thinking and elbow grease, and we cannot afford to select code at the surface: how something is implemented has to be carefully integrated with our predictions of what will be implemented next.

The vibecoding era #

I think that those intuitions are being challenged by vibe coding (Collins English Dictionary’s Word of the Year for 2025). Vibecoding, or (in a sad attempt to make it sound a bit less improv theatre and a bit more LinkedIn skills) AI-assisted coding or agentic coding, means coding by just telling a coding agent the specifications that a program has to match, and fully delegating to it the whole implementation. In the strict definition, you do not even read the changes, and let the assistant push to your codebase anything that passes tests.

If you have tried it yourself or have been reading about it, you know the usual takeaways (as of February 2026): the technology is universally accepted as incredible for short-horizon tasks and demos; as maturing for saving human labour in producing production-grade code; and as either utterly useless or revolutionary for long-horizon tasks with no human in the loop. The split is real, and the camps are very divided: some claim it is just leveraging _technical debt_ (deliberately shipping suboptimal code, knowingly postponing fixes and refactoring headaches to the future) and cannot be used for production-grade code; most believe that it will completely disrupt the software industry in 2026. I think both can actually be true.

Recently at work I have been implementing, deploying, and evolving features in a large ERP codebase, vibecoded from scratch, with complex logic and interactions with several external services. There, I was vibecoding in the purest sense: I could not even read any of the codebase myself (the tech stack was Next.js/React with TypeScript and Chakra UI, and I only speak Python). As a digression, it was fascinating to observe so clearly how much work remains to be done on software when one does not write code. As a person with no training or experience in code production, I used to think software is primarily defined by its codebase, and that its environment is the machine it runs on. This is actually very wrong: software is just as much its interaction with the world it is a part of and whose processes it models. And this is where even the best coding agents inevitably fail with their planning, and why they need “evolution”: they cannot see what’s between the monitor and the chair.

When working on this codebase, we have been using all the known drills to anchor our agents to reasonable code quality. We set up our Cursor rules and our AGENTS.md (in 2026, a markdown filename constitutes a standard…), declared all our nice codebase principles, and employed state-of-the-art models. The codebase, once the project reached maturity (500k LOC, and don’t ask me, I’m a biologist; deployed for some months now), has become reasonably stable. And yet…

Yet it kept accumulating what an engineer would immediately recognise as code smells. Duplications are ubiquitous, and keep appearing. Code is basically never refactored, and carries around vestigial modules whose function is being forgotten. To stress the point: this happens despite all the explicit prompts about “keeping code tidy” in the upfront .rules.mdc configurations, and despite the agents actually recognising many of those code smells when prompted to do so. So, are the skeptics right? Are those models really not capable of shipping crisp code when operating over long temporal horizons?

Designing and evolving #

At the beginning, I thought so too. I was frustrated by the idea of the rotten code accumulating, the logic diverging, the abstractions being constantly violated. But implementing new features and fixing the old ones was actually getting easier (clearly not a scientific statement). And thinking deeper, I started wondering whether the mess was not a bug but a feature. In other words, whether what I was observing was that extensive rounds of vibecoding were themselves setting an “evolutionary pressure” that, as more and more selections compound, favours the spontaneous appearance of evolvable traits in the codebase. Surely, each of those anti-patterns makes the code hellishly hard to read (but this is vibecoding: nobody even tries). But duplications, hierarchy violations, redundancies, and reimplementations do lead to _un_coupling, which makes the code potentially more readily evolvable in the next iterations.

So, is it possible that what is emerging here is an inescapable tension between readable and evolvable? Readability is a requirement that increases coupling: we want to reuse the same functions, adhere to our tidy abstractions, preserve layer segregations… And coupling is bad for evolvability: new features become harder to implement, bug fixes that were supposed to be small have to propagate changes throughout the codebase, etc. I do not mean to say that none of those readability concerns affect coding agents; but by sheer force of will and speed of iterations, a coding agent that is properly guided by test feedback can easily deal with a much more fragmented codebase.

Traditionally, the costs of losing readability were very clear: increased cognitive load and reduced productivity; duplications that linearly scale the amount of labour required to fix bugs, and make the logic hard to follow. But do the same constraints really apply when we have free coding labour and free variance, and can afford to keep throwing spaghetti at the wall to see what sticks? Could it be better to maximise tinkerability at the expense of global analysability?

Giving up control? #

Almost-free labour at the keyboard-typing end of software production will mean software engineers work at increasingly higher levels of abstraction, pushed away from (and upward along) the chain of feedback loops that have to be closed for software to be maintained and evolved. Code bugs could be automatically addressed. Error logs from servers could directly feed agent pipelines. User requests or interaction metrics could be automatically distilled into new feature implementations. In the backend, microservices could compete with each other and be selected on speed or robustness. Once free code variance is unleashed, it could lead to an explosion of diversity in how software is produced. But this does not mean code will be better, or crisper in its logic, than what humans would produce. Quite the opposite: by mastering code evolution, we could make code readability obsolete. Are we sure that, in this massive reconfiguration, it makes sense to stick to the principles that have so far defined code quality in the eyes of the human developer?

No matter how many trillions of human-typed tokens an LLM has ingested, a coding agent is a very different object from a human mind. Jaggedness — the idea that models can be (or already are) super-human at many tasks and markedly sub-human at many others — is probably here to stay. And no matter how good they get at code, it is unlikely that they will be able to predict every possible interaction between software and its production environment. So why don’t we loosen constraints and harvest variance?

The current way of producing code is clearly not designed around an abundance of variance. Here, the tricks that evolutionary-algorithm designers have learned to leverage could become an inspiration for the whole software production chain. For example, why keep the bottleneck of releasing a single software version? We could start deploying whole pools of code versions, implementing new features on all of them, and screening for the variants with the best fitness in production, whatever measure of fitness we use. The sky (and the infrastructure) is the limit! Obviously, we would need ways of governing the process by guiding this selection. But maybe we should get rid of too many a priori constraints: abandon the idea of enforcing tidy, analysable code through abundant upfront specifications, and instead keep everything that increases fitness.
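The pool-of-variants idea reads like a textbook evolutionary algorithm. A toy sketch, with a stand-in numeric "variant" and a made-up fitness function where a real deployment would measure latency, error rates, or engagement:

```python
import random

random.seed(1)

POOL_SIZE = 20

def fitness(variant: float) -> float:
    """Toy stand-in for a production metric; the best variant sits at 0.0."""
    return -abs(variant)

def mutate(variant: float) -> float:
    """'Implement a new feature': a small random change to the variant."""
    return variant + random.gauss(0, 0.5)

# Instead of a single release, deploy a whole pool of code versions
pool = [random.uniform(-10, 10) for _ in range(POOL_SIZE)]

for _ in range(50):  # release cycles
    # Implement new features on all of them ...
    pool = pool + [mutate(v) for v in pool]
    # ... and screen for the variants with the best fitness in production
    pool = sorted(pool, key=fitness, reverse=True)[:POOL_SIZE]

best = pool[0]  # the pool is kept sorted, so the fittest variant comes first
```

Keeping the parents alongside their mutants (a "(mu + lambda)" selection scheme, in evolutionary-computation jargon) guarantees fitness never regresses between release cycles; this is one of the governing knobs such a pipeline would expose.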

Granted, deciding to lose control of something we invented, something that right now powers a significant chunk of our society’s information infrastructure, is particularly unsettling — especially when the result might look worse in the short term than what our active effort could produce (over much longer timescales). But something similar already happened when artificial intelligence as a field moved from Good Old-Fashioned AI to connectionism and modern neural networks. We moved away from the ambition of engineering every tiny aspect of deliberate thinking to the much more modest task of next-token prediction, performed at unbelievable scale and optimisation. We went from micromanaging heuristics to focusing on learning rules and hyperparameter tweaking, studying learning dynamics, and optimising the initialisations that make optima discoverable. It seems to have worked, even though in the meantime we have accumulated a significant amount of technical debt: we are still trying to understand a technology that was globally adopted more than three years ago.

So, why not give up more control in software production? Can we design proper ecosystems and guide evolutionary dynamics to steer software to do what we need it to do for us? And in the process, do we want to obsess over keeping code comprehensible and analysable by humans? Maybe not. Maybe we should stop worrying, and just learn to love vibecoding.

[Image: Dr. Strangelove riding the bomb]