After Vibe Coding: The Maintenance Reckoning Nobody Warns You About

Shipping the MVP with AI was the easy part. Running it is a different job entirely.

By The Bushido Collective · 7 min read
Tags: AI, Technical Debt, Maintenance, Product Development, Fractional CTO

It's Tuesday morning. Your third customer this week has emailed about being double-charged. You open Stripe and see two successful charges with the same idempotency key — which shouldn't be possible, but there they are. You switch to your app's logs. The logs don't show the second charge at all. You ask your AI assistant why. It suggests adding more logging. You add more logging. The next double-charge happens Thursday, and the new logs don't explain it either.

You didn't write most of this code. You prompted it. It worked in test mode. You launched three months ago, the signups are real, and now something is quietly wrong with the part of the system that moves money. The tool you used to build it can't see what it's looking at.

The Shape of the Problem

The METR study of experienced open-source developers is worth sitting with. Sixteen contributors, working in codebases they already knew well, took 19% longer with AI assistance than without. They'd predicted a 24% speedup. The gap between expected and measured productivity was 43 percentage points, and these were senior engineers who understood every system they were touching.

If that's what happens when experts use these tools on familiar code, consider what happens when a non-engineer uses them on a codebase nobody has ever read end-to-end. The output looks the same. The internal model does not.

Simon Willison has written extensively about the difference between using LLMs as code generators and using them as collaborators. The former produces artifacts; the latter builds understanding. Most first-time founders using Cursor, Replit Agent, or Copilot are doing the first and hoping it produces the second. It doesn't, by default. Understanding is a separate workstream, and if nobody budgeted for it, it didn't happen.

What Production Actually Tests

The test environment is a controlled experiment. Production is every other condition you didn't think to simulate.

Payment webhooks retry. Real cards fail in ways test cards don't — 3D Secure challenges, soft declines that look like successes to naive code. Database queries that return instantly at 100 rows lock the table at 100,000. A user opens your app on a phone with a flaky connection and the retry logic you never wrote turns one action into three. Someone in Europe signs up and the timezone assumption buried in your scheduling code fires their recurring job at the wrong hour for six weeks before anyone notices.
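To make the webhook-retry case concrete, here is a minimal sketch of one common defense: record each provider event ID in a table with a uniqueness constraint, and skip any delivery whose ID has already been recorded. The table name `processed_events` and the helpers `open_store` and `handle_once` are hypothetical names chosen for illustration, and SQLite stands in for whatever database your app actually uses.

```python
import sqlite3
from typing import Callable


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database with a table keyed on the provider's event ID.

    The table name is an assumption made for this sketch.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )
    return conn


def handle_once(
    conn: sqlite3.Connection, event_id: str, apply_effect: Callable[[], None]
) -> bool:
    """Run apply_effect at most once per event_id.

    Returns True if the effect ran, False if this event was already handled.
    """
    try:
        # One transaction: commits if the block succeeds, rolls back if it raises.
        with conn:
            # The primary key rejects a second insert of the same event ID.
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
            apply_effect()
    except sqlite3.IntegrityError:
        # Duplicate delivery: the event ID is already recorded, so do nothing.
        return False
    return True
```

If `apply_effect` raises, the insert is rolled back, so a later retry of the same event can still be processed. That guarantee only extends to effects written to the same database; a call out to an external service inside `apply_effect` is not undone by the rollback.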

None of this is exotic. It's just the actual behavior of systems under real load, which is what Charity Majors has been arguing for a decade: you cannot know how your system behaves in production until it is in production, and the only way to learn from production is to instrument it so you can see inside.

The code you generated doesn't ship with that instrumentation. Cursor didn't prompt you to set up structured logging. The AI agent that wrote your payment integration didn't ask whether you wanted a dead-letter queue for failed webhook deliveries. Those weren't oversights on the tool's part. They're judgment calls, and the judgment has to come from somewhere.
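For readers who have not met the term, a dead-letter queue is just a place where work that keeps failing gets parked, with enough detail to investigate later, instead of vanishing. The sketch below is an in-memory illustration under stated assumptions: `WebhookProcessor` and `DeadLetter` are hypothetical names, and a real system would persist the parked items in a database table or a managed queue and would wait between attempts.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DeadLetter:
    """A payload that exhausted its attempts, kept for later inspection."""
    payload: dict
    error: str
    attempts: int


@dataclass
class WebhookProcessor:
    """Try a handler a few times, then park the payload rather than drop it."""
    handler: Callable[[dict], None]
    max_attempts: int = 3
    dead_letters: list[DeadLetter] = field(default_factory=list)

    def process(self, payload: dict) -> bool:
        """Return True if the handler succeeded within max_attempts."""
        last_error = ""
        for _ in range(self.max_attempts):
            try:
                self.handler(payload)
                return True
            except Exception as exc:
                # Remember the most recent failure for the dead-letter record.
                last_error = f"{type(exc).__name__}: {exc}"
        # Every attempt failed: keep the payload and the reason it failed.
        self.dead_letters.append(
            DeadLetter(payload=payload, error=last_error, attempts=self.max_attempts)
        )
        return False
```

The point of the design is that a failed delivery leaves a record you can read on Tuesday morning: what arrived, how many times it was tried, and the last error raised.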

The Comprehension Tax

Here's the pattern we keep seeing, and it's worth giving it a name: the comprehension tax.

Every line of code in production costs something to understand. When you write code yourself, you pay that cost up front — slowly, painfully, by typing it. When you generate code, you defer the cost. The system accumulates lines faster than anyone accumulates understanding of them, and the gap between those two curves is the tax. It doesn't disappear. It comes due the first time something breaks in a way the AI can't pattern-match to a Stack Overflow answer.

The comprehension tax is why vibe-coded MVPs feel weightless on the way up and unbearable on the way down. Shipping a feature that works is a small deposit. Shipping a feature you can't debug at 2am is a large withdrawal. For the first few months, deposits dominate. Past product-market fit, withdrawals take over. Founders hit this wall and assume the problem is AI code quality. The problem is that code quality is cheaper than code comprehension, and they were optimizing the wrong one.

At Oxen.ai, where one of our founding partners is CTO, the daily question isn't "can we generate more code." It's "do we understand what the system we shipped is actually doing to our customers' data." Those are different disciplines, and vibe coding quietly ships the first one while skipping the second.

What Actually Moves the Needle

There is a version of post-MVP life that doesn't end in a rebuild. It requires doing exactly three things that nobody told you were part of the job when you started.

First, make the system legible to you before it's stressed. Martin Fowler's writing on observability and DORA's research on high-performing teams converge on the same point: the teams that ship reliably are the ones that can see what's happening in their own systems in real time. Sentry, a log aggregator, and a basic uptime monitor are the entry fee. You don't need to understand every line of your code. You do need to see when a line of it misbehaves, and which line, and what the user was doing at the time.
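As one illustration of "which line, and what the user was doing at the time", here is a minimal structured-logging sketch using only Python's standard `logging` module. `JsonFormatter` and `build_logger` are hypothetical names, and the `context` key is a convention assumed for this example rather than anything the library defines. Hosted tools such as Sentry capture similar fields for you; this only shows the shape of the data.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,  # which file logged this
            "line": record.lineno,    # which line logged this
        }
        # "context" is this sketch's convention, passed via extra={...}.
        context = getattr(record, "context", None)
        if context:
            entry["context"] = context
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry, default=str)


def build_logger(name: str = "app") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger().error("charge failed", extra={"context": {"user_id": "u_42", "action": "checkout"}})` then produces a single searchable line carrying the user, the action, and the source location, which a log aggregator can filter on.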

Second, treat the AI differently past launch than you did during the build. Pre-launch, you used it to generate. Post-launch, use it to explain. When your Stripe webhook handler misbehaves, paste the handler into a session and ask it to walk you through every branch. Don't ask it to fix anything yet. Force yourself to understand what the code is trying to do before you let it change anything. This is slower. It is also the only way the comprehension tax gets paid down.

Third — and founders resist this hardest — get someone experienced to look at the system once, before it breaks in the way that costs you a customer. Not a full-time hire. Not a course. A person who has taken a payment system, an auth flow, and a multi-tenant data model to production a few times and knows where the landmines are. Two hours with the right operator and a screen-share will surface the five risks that will actually hurt you, ranked. That's not consulting. That's triage.

The Part That Isn't Optional

The uncomfortable version of this piece is that the founders who survive past the MVP aren't the ones who ship fastest. They're the ones who notice earliest that shipping and running are different jobs, and who stop pretending the tool that did the first one will do the second.

If you're reading this with the specific kind of dread that comes from recognizing your own situation in it — the logs you don't fully understand, the bug that moved when you fixed it, the customer email you haven't replied to because you can't explain what happened — that recognition is the useful signal. It isn't a sign you built the wrong thing. It's a sign you're past the part of the journey the tools were built for, and you need a different kind of help for the part that comes next. That's the conversation to have now, not after the next incident makes the decision for you.
