
The AI Coding Revolution That Actually Works

Senior expertise plus agentic tools produces a different kind of throughput. Without the expertise, you get speed in the wrong direction.

6 min read · By The Bushido Collective
AI, Claude Code, Engineering Excellence, Developer Tools, Efficiency

It's Tuesday afternoon. Three PRs are open in front of you, all generated with agentic tooling, all passing CI. The first adds a retry loop around a Postgres client that's already retrying internally. The second introduces a second HTTP library because the model didn't notice you already had one. The third looks clean until you trace a cache invalidation path that quietly assumes single-region deployment. None of these will fail today. All of them will fail on a Thursday in six months when traffic spikes and you're trying to explain to the CEO why a feature nobody remembers shipping is the reason checkout is down.

This isn't a story about AI being bad at code. The code is fluent. The tests pass. Simon Willison has written about how capable Claude Code and similar agents have become at the actual generation — and he's right. That part of the problem is solved. The part that isn't solved is what happens upstream and downstream of the generation: whether the person steering knows what shape the solution should take, and whether they can evaluate what came back.

The Shape of the Real Productivity Gain

The productivity numbers circulating right now are genuinely confusing. GitHub's research on Copilot found 55% faster task completion in controlled conditions. A METR study of experienced open-source contributors found AI increased task time by 19%, while the developers in the study had predicted a 24% speedup. Both findings are real. The 74-point swing between them, from 55% faster to 19% slower, isn't noise. It's a signal about where the gain actually lives.

The gain lives with the senior engineer who already knows the shape of the answer. It collapses — sometimes inverts — when the person driving can't tell fluent output from correct output. The Pragmatic Engineer has documented this repeatedly: tool-level speedups don't translate to team-level throughput when the bottleneck sits elsewhere, and "elsewhere" is usually the judgment required to direct the tool.

What Driving Actually Looks Like

Here is a concrete version from one of our engagements. At Oxen.ai, an engineer who'd never shipped Elixir before built and deployed a production service (a few thousand lines, full test coverage, integrated with the existing Rust-centric stack) on a timeline of days rather than weeks. He's been writing production code for over a decade across three companies, including a previous founding-CTO role. He knew how a service should handle backpressure, how errors should propagate across a message boundary, and how the observability hooks needed to be shaped to fit the team's existing dashboards. He didn't know Elixir syntax. The agent handled the syntax. He handled every decision that mattered.

Replace the senior engineer with someone two years into the craft and the same tooling produces a service that passes review, runs in staging, and fails the first time production traffic hits a path the tests didn't cover. Same model. Same prompts. Different outcome. The variable isn't the tool.

What the senior engineer brings is a mental library of failure modes. He's seen a retry loop turn a brief database hiccup into a cascading outage. He's watched a cache-aside pattern silently corrupt data across a deploy. He's debugged the connection pool exhaustion that only shows up in week three after a small traffic bump. Anthropic's own guidance on Claude Code is explicit that the model works best with a human reviewing at the level of intent, not just syntax. That review only exists if someone in the loop has the intent in their head to begin with.
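The retry failure mode mentioned above is worth making concrete, because the cost compounds multiplicatively and rarely shows up in a passing test suite. What follows is a minimal sketch, not any real driver's behavior: `flaky_query`, `client_with_retries`, and `wrapped_with_retries` are hypothetical names, and the attempt counts of three per layer are assumptions chosen for illustration.

```python
# Illustrative only: counts how many attempts reach the database when an
# outer retry loop wraps a client that already retries internally.
# All names and attempt counts here are hypothetical.


class DatabaseDown(Exception):
    """Raised while the simulated database is unavailable."""


calls = 0  # total attempts that actually reach the simulated database


def flaky_query() -> str:
    """Simulated query against a database that stays down for the whole run."""
    global calls
    calls += 1
    raise DatabaseDown("connection refused")


def client_with_retries(inner_attempts: int = 3) -> str:
    """Stand-in for a client library that retries internally."""
    last_error = None
    for _ in range(inner_attempts):
        try:
            return flaky_query()
        except DatabaseDown as err:
            last_error = err
    raise last_error


def wrapped_with_retries(outer_attempts: int = 3) -> str:
    """The extra retry loop added on top of the client."""
    last_error = None
    for _ in range(outer_attempts):
        try:
            return client_with_retries()
        except DatabaseDown as err:
            last_error = err
    raise last_error


def attempts_for_one_request() -> int:
    """Run one logical request and report how many attempts it generated."""
    global calls
    calls = 0
    try:
        wrapped_with_retries()
    except DatabaseDown:
        pass
    return calls


print(attempts_for_one_request())  # prints 9: three outer tries x three inner tries
```

Under these assumptions, one logical request turns into nine attempts against a database that is already struggling. Real clients usually add backoff and jitter, which this sketch leaves out; the point is only that the two layers multiply rather than add.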

The Output Illusion

Here's where most teams go sideways. The metrics that look like productivity (lines merged, PRs closed, velocity points) measure exactly the thing AI tools made cheap. Call it the output illusion: confusing volume of shipped code with velocity toward a working system. DORA's research on elite performers has been consistent for years that throughput without stability is a negative indicator, not a positive one. AI tooling lets a team post throughput numbers that once would have required stability and no longer do.

We've watched this arc inside partner companies. At an earlier-stage portfolio team, a push for AI-assisted velocity moved merged PRs per week up 40% in a quarter. Incidents per week tripled in the quarter that followed. The work was real; nobody was faking anything. The team had simply replaced the slowest step, writing the code, with something faster, and hadn't noticed that the review and reasoning steps were now the constraint. The surplus code accumulated faster than the team's ability to reason about it.
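The two figures above are more telling when read as a ratio. Here is a rough sketch of that arithmetic, assuming incidents scale with the volume of merged PRs and ignoring the one-quarter lag between the two measurements. Only the 40% and 3x multipliers come from the account above; the function name is hypothetical.

```python
def incidents_per_pr_multiplier(pr_growth: float, incident_growth: float) -> float:
    """Return the change in incidents per merged PR as a multiplier.

    pr_growth: multiplier on merged PRs per week (1.4 means up 40%).
    incident_growth: multiplier on incidents per week (3.0 means 3x).
    """
    if pr_growth <= 0:
        raise ValueError("pr_growth must be positive")
    return incident_growth / pr_growth


multiplier = incidents_per_pr_multiplier(pr_growth=1.4, incident_growth=3.0)
print(f"Incidents per merged PR changed by {multiplier:.2f}x")  # prints 2.14x
```

On that reading, each merged PR carried roughly twice the incident risk it did before the push, which is a different story from "40% more output."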

The Reframe: Conducting, Not Typing

The senior engineer working well with agentic tooling is running a different operation than they used to. One workstream is drafting the API handlers. Another is generating migration scripts. A third is writing integration tests against the draft. The engineer is moving between threads, reviewing, redirecting, rejecting. The work product looks like a team. The decision-making is still one person.

This is closer to conducting than to typing. The conductor doesn't play every instrument. They hold the piece in their head and make sure each section is playing the right thing at the right moment. Remove the conductor and you don't have a faster orchestra. You have noise generated in parallel.

The implication for how a team is shaped isn't subtle. At GigSmart, ToolWatch, and Oxen.ai — the companies our founding partners built — the leverage ratio between senior judgment and shipped output was already the real number, long before agentic tools existed. What these tools change is the amount of output a single experienced engineer can direct. What they don't change is whether the engineer knows what to direct toward. That part is still scarce, still earned, and still the difference between a team that compounds and a team that accumulates.

If the PRs in your queue on Tuesday afternoon look fluent but nobody on the team can tell you why the code is correct, that's not a tooling problem waiting on a better model. It's a shape-of-the-team problem, and agentic tooling is making it visible faster than most teams can respond. If you want a second set of eyes on whether your team is set up to conduct or just to generate, that's the work we do.
