The Meter Was Running

Your CFO schedules a fifteen-minute meeting. The invite says "AI tooling spend." You pull the numbers on the way in. Eight months ago this was a $4,000-a-month experiment. Today it's a $40,000-a-month line item, and the finance model projects $600,000 annualized once agentic workflows hit the rest of engineering. The meeting isn't about whether to keep the tools. It's about whether you can prove they're worth what you're paying.

You sit down. Before you can open your laptop, the CFO says: "Walk me through the ROI."

What's your opening sentence?

What You Actually Have

You have adoption rates. Developers use the tools every day, which is a nice number, though given the spend it would be embarrassing if they didn't. You have satisfaction scores. Engineers like the tools. You have PR merge time, which is down a bit. You don't have a baseline cycle time from before deployment, because nobody set one. You don't have a clean mapping from merged PRs to revenue impact, because that mapping was always messy and AI didn't make it cleaner.

So you tell the CFO what you can: the proxies look good, the team is happy, the tools seem to be working. They nod and ask the next question. "What's our spend per engineer-hour saved?" You don't have that number. Nobody has it. An engineering manager described this exact meeting on r/EngineeringManagers last month — the token bill hit six figures and the productivity case collapsed into hand-waving when finance pressed for specifics.
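The number the CFO asked for is simple arithmetic once you have the one input nobody recorded. Here is a minimal sketch in Python. The function name, the headcount, and the hours-saved figure are hypothetical placeholders. Only the $40,000 monthly spend comes from the scenario above.

```python
def spend_per_hour_saved(monthly_spend, engineers,
                         hours_saved_per_engineer_per_week,
                         weeks_per_month=4.0):
    """Dollars of tooling spend per engineer-hour saved (illustrative only)."""
    hours_saved = engineers * hours_saved_per_engineer_per_week * weeks_per_month
    if hours_saved <= 0:
        raise ValueError("hours saved must be positive to compute a ratio")
    return monthly_spend / hours_saved


# Hypothetical inputs: 100 engineers, each saving 2 hours a week.
cost = spend_per_hour_saved(40_000, engineers=100,
                            hours_saved_per_engineer_per_week=2)
print(f"${cost:.2f} of spend per engineer-hour saved")
```

The division is trivial. The hours-saved input is the hard part, and without a baseline it's a guess. If you can't defend that input, you can't defend the output.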

The hard part isn't that you're unprepared. It's that the measurement problem is harder than the sales pitch suggested. GitHub's own Copilot research reported 55% faster task completion in controlled conditions. That's a real finding, but it was measured on isolated tasks, not end-to-end delivery. Gergely Orosz at The Pragmatic Engineer has documented how hard it is to tie even clear tool-level speedups to throughput at the team level, because the bottleneck usually isn't where the tool is pointed. Simon Willison has written repeatedly about the gap between an impressive demo and measurable organizational productivity. If the people most enthusiastic about these tools can't reliably measure the return, reconstructing a framework after the fact isn't going to give you a clean number either.

The Governance Gap

Call it the governance gap. You deployed the tools as a bet. "It'll pay for itself" felt obvious enough that nobody asked what "it" or "pay" or "itself" actually meant operationally. So no baseline cycle time got recorded. No per-team cost attribution got wired up. No mapping from delivery acceleration to revenue got built. The meter was running from day one — you just don't watch a meter when you've already decided the outcome.

The bill arrives and the framework is missing. That's the gap.

There's a quieter cost problem hiding underneath it, too: token waste. The same codebase context gets re-sent with every inference request. No shared caching, no persistent understanding, no cross-team efficiency layer. It's like re-indexing the library every time someone looks up a word. Solvable, but only if someone owns it deliberately, and nobody did, because nobody planned for the bill.
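The bookkeeping behind "don't re-send the same context" can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any vendor's API: `ContextCache`, `fake_index`, and the whitespace token count are all hypothetical stand-ins. Some model providers offer their own prompt caching with their own rules, and that is what you'd actually use.

```python
import hashlib


class ContextCache:
    """Processes each distinct codebase context once, keyed by content hash."""

    def __init__(self):
        self._store = {}
        self.tokens_sent = 0     # tokens actually processed
        self.tokens_avoided = 0  # tokens a cache hit did not re-send

    def get_or_build(self, context_text, build):
        key = hashlib.sha256(context_text.encode("utf-8")).hexdigest()
        approx_tokens = len(context_text.split())  # crude stand-in for a tokenizer
        if key in self._store:
            self.tokens_avoided += approx_tokens
            return self._store[key]
        self.tokens_sent += approx_tokens
        self._store[key] = build(context_text)
        return self._store[key]


def fake_index(text):
    # Hypothetical placeholder for the expensive step (an inference call, an index build).
    return {"words": len(text.split())}


cache = ContextCache()
repo_context = "def charge(customer, amount): return ledger.debit(customer, amount)"
for _ in range(3):  # three requests that share the same context
    cache.get_or_build(repo_context, fake_index)

print(cache.tokens_sent, cache.tokens_avoided)
```

The useful part is the two counters. Once someone owns them, token waste stops being a feeling and becomes a line on a dashboard.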

The Constraint Moved

There's a second gap underneath the first, and it's the one most governance conversations miss. The tools didn't just change the cost structure. They moved the bottleneck.

Eliyahu Goldratt's Theory of Constraints is forty years old and still the cleanest lens for this: optimizing anything other than the binding constraint produces no throughput gain, and sometimes a net loss. For most of software history, code production was the binding constraint. That's why engineering teams are structured with many engineers per PM, per designer, per QA. Writing the code was the slow, hard part. Everything around it — ratios, review processes, staffing — was shaped by that reality.

AI tools made code production cheap, which means it's no longer the constraint. Teams now generate features faster than requirements can specify them, faster than reviewers can evaluate them, faster than QA can validate them. You relieved the old constraint without adding capacity anywhere else, so the constraint moved, and the steps that now limit throughput are more overwhelmed, not less.
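Goldratt's point fits in a toy model: a pipeline's throughput is the capacity of its slowest stage. The sketch below uses Python and invented weekly capacities (features per week). The stage names and numbers are assumptions for illustration, not data from any team.

```python
STAGES = ["requirements", "coding", "review", "qa"]


def throughput(capacities):
    """A serial pipeline delivers no faster than its slowest stage."""
    return min(capacities.values())


def bottleneck(capacities):
    """Name of the stage with the lowest capacity."""
    return min(capacities, key=capacities.get)


def weekly_backlog_growth(capacities, stages):
    """Items per week piling up in front of each stage, assuming the first
    stage always has work to do."""
    flow = capacities[stages[0]]
    growth = {stages[0]: 0}
    for stage in stages[1:]:
        arriving = flow
        flow = min(arriving, capacities[stage])
        growth[stage] = arriving - flow
    return growth


# Hypothetical capacities before and after AI tooling triples coding output.
before = {"requirements": 12, "coding": 8, "review": 10, "qa": 11}
after = dict(before, coding=24)

for label, caps in (("before", before), ("after", after)):
    print(label, throughput(caps), bottleneck(caps),
          weekly_backlog_growth(caps, STAGES))
```

In this toy run coding capacity triples, delivered throughput rises only as far as review allows, and the backlog moves from in front of coding to in front of review.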

Put yourself in the reviewer's seat on a Wednesday afternoon. Three AI-assisted PRs landed in the last hour. Each touches code you haven't seen in two months. The engineer who opened each one spent forty minutes generating and eyeballing the diff. You're expected to spend twenty minutes actually understanding it. The review math was already hard before AI tripled the input volume. Now it's broken.
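The review math is worth writing down. A minimal sketch in Python, assuming the Wednesday numbers above hold as a steady rate: three PRs an hour at twenty minutes each, against one PR an hour before the volume tripled. The function name and the single-reviewer default are illustrative assumptions.

```python
def review_utilization(prs_per_hour, minutes_per_review, reviewers=1):
    """Fraction of reviewer time consumed by review alone."""
    if reviewers <= 0:
        raise ValueError("need at least one reviewer")
    return (prs_per_hour * minutes_per_review) / (60 * reviewers)


before_ai = review_utilization(prs_per_hour=1, minutes_per_review=20)
after_ai = review_utilization(prs_per_hour=3, minutes_per_review=20)
print(f"before: {before_ai:.0%} of the hour, after: {after_ai:.0%} of the hour")
```

At that rate a reviewer who does each review properly has no time left for anything else, which is why the twenty minutes is the first thing to give.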

So what happens? In practice, the review gets lighter. The approver skims. The CI gate becomes the real review. Or — the more honest version — the organization quietly stops treating human review as the meaningful gate and shifts accountability onto the individual engineer, with rollback capability and DORA metrics as the safety net. Some companies have done this on purpose: Klarna publicly restructured around AI-augmented output before reversing parts of it when the quality tradeoffs showed up. Most teams never made the decision at all. Review queues got long, people adjusted, and the new policy got ratified by nobody signing off on it.

That isn't governance. That's drift.

The Real Decision You Already Made

Here's the reframe. The AI tooling decision wasn't really a tooling decision. It was an organizational design decision, and by the time you're in front of the CFO explaining spend, most of that design has already happened by default.

The engineer-to-PM ratio was set when code was the bottleneck. The QA headcount was set when code was the bottleneck. The review process, the staffing model, the promotion criteria — all calibrated against a constraint that doesn't bind anymore. You didn't have an AI ROI problem. You had an org structure built for a different physical limit, and you deployed tools that broke the limit without touching the structure. The ROI question is unanswerable against a system designed to optimize something else.

What to Actually Do

If you're deploying now, the measurement comes first. Record baseline cycle time before you flip the switch. Wire per-team cost attribution into the rollout. Build a caching strategy for repeated codebase context so the meter isn't eating itself. Most of all, name an explicit hypothesis about which constraint you're trying to relax. If code production is the bottleneck, carry on. If it's requirements clarity, review capacity, or QA bandwidth, a faster code-writing tool will make your throughput worse — the bottleneck just gets more backed up.
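Neither the baseline nor the cost attribution needs heavy tooling. Here is a rough Python sketch, assuming you can export PR opened and merged timestamps and usage records tagged with a team. The record shapes, team names, and per-token price are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def median_cycle_hours(prs):
    """Median hours from PR opened to PR merged."""
    return median((merged - opened).total_seconds() / 3600
                  for opened, merged in prs)


def cost_by_team(usage_records, price_per_1k_tokens):
    """Sum token spend per team from tagged usage records."""
    totals = defaultdict(float)
    for record in usage_records:
        totals[record["team"]] += record["tokens"] / 1000 * price_per_1k_tokens
    return dict(totals)


# Hypothetical pre-rollout sample of (opened, merged) timestamps.
sample_prs = [
    (datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 15)),
    (datetime(2024, 1, 2, 9), datetime(2024, 1, 3, 9)),
    (datetime(2024, 1, 3, 10), datetime(2024, 1, 3, 12)),
]

# Hypothetical usage records and an invented price per 1,000 tokens.
sample_usage = [
    {"team": "payments", "tokens": 2000},
    {"team": "search", "tokens": 4000},
    {"team": "payments", "tokens": 1000},
]

print(median_cycle_hours(sample_prs))
print(cost_by_team(sample_usage, price_per_1k_tokens=0.5))
```

Run the first function on data from before the rollout and keep the result somewhere finance can see it. Merge time is only one slice of cycle time, but a recorded slice beats an unrecorded whole.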

If you're already deployed without a framework, be honest with your CFO about what you can and can't prove. Reconstruct what baselines historical data can give you. Instrument the gaps. And name the structural question explicitly: the shape of the org was built for a constraint that's moved, and the next work isn't tool tuning — it's figuring out what the org should look like now.

Back to the CFO

"Walk me through the ROI."

The useful answer isn't a number. It's: "The tools are cheaper than the hiring we'd need to produce the same code volume. But code volume isn't the constraint anymore, so that comparison flatters us. The real question the spend is forcing us to answer is whether our engineering org is shaped for where the bottleneck lives now. Here's what I'm doing about it."

If that's the meeting you're walking into and you want a second set of eyes on the framework before you sit down — that's exactly the work.