We Shipped 3x Faster With AI — and Made Our Worst Architectural Decision in 5 Years

Introduction

We shipped roughly 3x faster after adding AI tools to our engineering workflow.

We also made one of our worst architectural decisions in five years.

That second part doesn't go in the case study.

The first part does.

This is the longer version — the part about what actually broke, what AI couldn't have known, and how we're trying to ship fast without digging the hole faster.

The speed was real

AI tools changed our throughput in ways that are hard to argue with:

Boilerplate that used to take an afternoon now takes minutes
Test scaffolding, CRUD endpoints, and UI variants arrive before the standup ends
Engineers spend less time typing and more time thinking through product flows

On paper, it looked like a clean win. Velocity up. Cycle time down. Everyone happy.

Then we merged a change that looked harmless.

The decision we shouldn't have made

The PR was well-structured. Types checked. Tests passed. The diff read like something a careful senior engineer would write.

What it did was introduce a new abstraction layer in a part of the system that already had too many. Those layers were there because, three years ago, a client escalation forced us to bolt on a compatibility shim that was never meant to be permanent.

AI had no way to know that.

It saw duplicated logic and "fixed" it the way textbooks recommend: extract, generalize, reuse.

The refactor was technically correct.

It was also contextually wrong.

Within a week, edge cases started surfacing in production — not in the happy path, but in the weird paths that only exist because real businesses run on exceptions, workarounds, and things someone promised in a Slack thread in 2021.

That's when it clicked for us: acceleration without judgment just means you dig the hole faster.

What AI can't see in your codebase

Your repository is not the full story of your system. Not even close.

Models read files. They don't read history, politics, or scar tissue.

Here is the kind of context that lives in team memory and is invisible to a model that only sees the repository:

1. Business-logic fuses

Some code looks redundant because it is redundant on purpose. It exists as a fallback path: if one path fails, another still completes the transaction. A refactor that "cleans this up" can silently remove the only fallback that keeps payments working when a third-party API hiccups.
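
To make that concrete, here is a minimal Python sketch of deliberate redundancy. It is an illustration, not our production code: ProviderError, charge_with_fallback, and both provider functions are hypothetical names invented for this example.

```python
from typing import Callable


class ProviderError(Exception):
    """Raised when a (hypothetical) payment provider call fails."""


def charge_with_fallback(
    amount_cents: int,
    primary: Callable[[int], str],
    fallback: Callable[[int], str],
) -> str:
    """Try the primary provider; if it fails, complete the charge via the fallback."""
    try:
        return primary(amount_cents)
    except ProviderError:
        # This branch looks like duplicated charge logic. It is the only
        # path that still completes the transaction when the primary is down.
        return fallback(amount_cents)


def flaky_primary(amount_cents: int) -> str:
    # Stand-in for a third-party API that is currently failing.
    raise ProviderError("primary provider timed out")


def backup_provider(amount_cents: int) -> str:
    # Stand-in for the "redundant" second path.
    return f"backup-charge-{amount_cents}"
```

A refactor that collapses the two providers into one shared helper can pass every happy-path test and still delete the except branch, which is the whole point of the duplication.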

2. "Temporary" fixes that became load-bearing

Every mature codebase has a comment like // TODO: remove after migration. Some of those TODOs are older than your newest hire. They aren't technical debt in the abstract — they're structural debt with a reason. Removing them requires knowing why they were added, not just what they do.

3. Abstractions born from client pressure

Sometimes you don't extract an interface because it's elegant. You extract it because a client melted down in Q3, legal got involved, and the fastest way to calm everyone down was a configurable layer nobody actually wanted long term. The abstraction is ugly. It also kept the contract.

4. Refactors that look simple but aren't

Renaming a service, moving a module, or "just" splitting a resolver can break:

Webhook idempotency assumptions
Cache invalidation timing
Audit trails required for compliance
Billing proration logic tied to a specific timezone edge case

AI can generate the refactor. It cannot feel the blast radius.
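
Taking the webhook item from the list above as an example, the sketch below shows how a harmless-looking rename could break deduplication. It assumes a hypothetical handler whose idempotency key happens to include the service name; handle_webhook, the key format, and the in-memory set standing in for a persistent store are all invented for illustration.

```python
from typing import Dict, List, Set


def handle_webhook(
    event: Dict[str, object],
    service_name: str,
    seen_keys: Set[str],
    ledger: List[int],
) -> bool:
    """Apply an event once per idempotency key. Returns True if it was applied."""
    # Hidden assumption: the key embeds the service name, so renaming the
    # service changes the key for every event that was already processed.
    key = f"{service_name}:{event['id']}"
    if key in seen_keys:
        return False  # duplicate delivery, safely ignored
    seen_keys.add(key)
    ledger.append(int(event["amount_cents"]))
    return True


seen: Set[str] = set()
ledger: List[int] = []
event = {"id": "evt_1", "amount_cents": 500}

handle_webhook(event, "billing", seen, ledger)    # applied
handle_webhook(event, "billing", seen, ledger)    # ignored as a duplicate
handle_webhook(event, "invoicing", seen, ledger)  # after the rename: applied again
```

Nothing in the rename diff mentions idempotency, yet a redelivered event is now charged twice. That is the kind of blast radius that only shows up if someone remembers how the key was built.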

The review bottleneck got worse, not better

Before AI, our review queue was already tight. After AI, code volume outpaced review capacity.

That creates a dangerous loop:

More code gets generated
Reviewers skim because there's too much to read
Context-heavy changes get approved because they look fine
Debt compounds in places nobody flagged
The next AI-assisted change builds on top of the mistake

Speed didn't remove the need for judgment. It amplified the cost of missing it.

What we're doing differently now

We didn't roll AI back. We changed how we use it.

1. Tag the landmines

We maintain a lightweight internal doc — not a wiki graveyard, just a living list — of areas where "clean" changes are risky:

Payment and billing flows
Legacy migration shims
Client-specific overrides
Auth/session edge cases
Anything touching webhooks or idempotency keys

Before AI touches those paths, a human with context has to be in the loop. No exceptions.
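
One way a list like this could be enforced mechanically is a small check that flags changed files in tagged areas. This is a sketch under assumptions: the path patterns below are hypothetical and would need to match your own repository layout, and the function name is made up for this example.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Sequence

# Hypothetical path patterns for the risky areas listed above.
# In fnmatch-style patterns, "*" also matches "/" so nested files are covered.
SENSITIVE_PATTERNS: Sequence[str] = (
    "services/payments/*",
    "services/billing/*",
    "migrations/shims/*",
    "clients/*/overrides/*",
    "auth/*",
    "webhooks/*",
)


def files_needing_context_review(
    changed_files: Iterable[str],
    patterns: Sequence[str] = SENSITIVE_PATTERNS,
) -> List[str]:
    """Return the changed files that fall inside a tagged high-risk area."""
    return [
        path
        for path in changed_files
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


changed = [
    "services/payments/stripe/client.py",
    "docs/README.md",
    "clients/acme/overrides/tax.py",
    "ui/button.tsx",
]
flagged = files_needing_context_review(changed)
```

If flagged is non-empty, the change would be routed to a reviewer who knows the history of that area before anything merges. The check does not judge the change; it only makes sure a human with context sees it.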

2. Separate generation from integration

AI drafts. Humans integrate.

We treat model output like a junior engineer's first PR: useful, fast, and not merge-ready by default in sensitive areas. The value is in the draft, not the diff.

3. Smaller PRs, slower merges in risky zones

Fast shipping doesn't have to mean big-bang PRs. We split work so high-risk changes get reviewed with room to ask "why was it like this before?"

If nobody on the team can answer that question, we stop and investigate before merging.

4. Capture context when you touch legacy code

Every time we modify a workaround, we leave a short note: what broke, who needed it, what fails if this disappears.

Not for the model. For the next human — including future us.

That is how institutional memory survives turnover, vacations, and the next wave of tooling.
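
A note like that is easier to keep consistent if it always has the same three parts. The sketch below is one possible shape, not a standard: ContextNote and its field names are hypothetical, chosen to mirror the three questions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextNote:
    """The three things we want recorded next to a workaround."""

    what_broke: str
    who_needed_it: str
    fails_if_removed: str

    def as_comment(self) -> str:
        """Render the note as a comment block to paste above the workaround."""
        lines = [
            "CONTEXT NOTE",
            f"What broke: {self.what_broke}",
            f"Who needed it: {self.who_needed_it}",
            f"Fails if removed: {self.fails_if_removed}",
        ]
        return "\n".join(f"# {line}" for line in lines)


# Placeholder values; a real note would name the actual incident and owner.
note = ContextNote(
    what_broke="<what went wrong before this workaround existed>",
    who_needed_it="<client or team that depends on it>",
    fails_if_removed="<what stops working if this is deleted>",
)
```

Calling note.as_comment() gives a four-line comment block that can sit directly above the code it explains, where the next reader will actually see it.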

5. Measure review quality, not just velocity

Lines merged per week is a vanity metric if half of them need hotfixes.

We pay more attention to:

Revert rate
Incidents tied to recent refactors
Time-to-fix for production issues in "stable" modules

If speed goes up but those numbers go up too, we're not winning.
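
As a rough sketch of how the first two numbers could be computed, assume each merged change is recorded with two flags. MergedChange and its fields are hypothetical; in practice the data would come from your version control and incident tooling.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MergedChange:
    change_id: str
    reverted: bool         # the change was later reverted
    caused_incident: bool  # a production incident was traced back to it


def revert_rate(changes: Sequence[MergedChange]) -> float:
    """Fraction of merged changes that were later reverted."""
    if not changes:
        return 0.0
    return sum(c.reverted for c in changes) / len(changes)


def incident_rate(changes: Sequence[MergedChange]) -> float:
    """Fraction of merged changes tied to a production incident."""
    if not changes:
        return 0.0
    return sum(c.caused_incident for c in changes) / len(changes)


# Made-up sample data for illustration only.
week = [
    MergedChange("c1", reverted=False, caused_incident=False),
    MergedChange("c2", reverted=True, caused_incident=True),
    MergedChange("c3", reverted=False, caused_incident=True),
    MergedChange("c4", reverted=False, caused_incident=False),
]
```

Tracking these as rates rather than raw counts matters here: if AI doubles the number of merges, raw revert counts will rise even when quality is unchanged, while the rate shows whether each merge has become riskier.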

The engineers who matter most right now

The most valuable engineers on our team aren't the best prompters.

They're the ones who:

Know which folders are safe to let AI run in
Ask "what happens if this fails at 2 a.m.?" before approving
Remember why the weird branch exists
Push back when a refactor looks right but feels wrong

Prompting is a skill. Judgment is the multiplier.

AI makes execution cheap. Taste, context, and restraint are what keep execution pointed in the right direction.

So — how do you handle technical debt when AI ships faster than you can review?

There is no perfect playbook yet. This is what we've landed on:

Accept the speed. Fighting the tooling is a losing battle.

Protect the context. If it isn't written down, it doesn't exist for the team — or for the model.

Slow down where it hurts. Payments, auth, billing, migrations, and client-specific logic deserve friction.

Review for history, not just syntax. The question isn't only "is this code correct?" It's "do we know why the old code was wrong-shaped on purpose?"

Pay down debt deliberately. AI can help refactor after you understand the system. It is a terrible substitute for understanding.

Closing thought

Yes, AI accelerates execution.

But execution was never our bottleneck.

Understanding was.

The teams that win in this phase won't be the ones that generate the most code. They'll be the ones that know when not to trust the output — and who build just enough process to keep speed from turning into damage.

If you're navigating the same tradeoff on your team, I'd love to hear how you're handling it.
