Code review is the new bottleneck: how I'm using AI to manage it

We invested in AI code generation but left reviews behind. Custom review skills and a feedback loop between generation and review changed everything.

claude-code code-review ai-engineering

We invested heavily in AI-assisted code generation. Custom skills, project-specific rules, structured prompts that encode our architecture decisions, naming conventions, and domain patterns. It works. The team ships faster, PRs land more frequently, and the codebase grows at a pace that would have been unthinkable two years ago.

But that pace created a new problem: code review can’t keep up.

More PRs means more reviews. And AI-generated code still needs human verification, sometimes more carefully than hand-written code. The patterns can look perfectly reasonable at first glance while missing subtle project conventions. A service that returns data from a command handler when our CQRS rules say commands should never return domain objects. A test named “it should set isActive to true” instead of “it should activate the user given a valid subscription.” Things that compile and pass, but don’t match how we build software.

The same tech leads who used to review five PRs a day now face fifteen. The review queue grows, merge times slow down, and developers wait longer for feedback. Context switches pile up because reviewers jump between unrelated PRs all day. The bottleneck quietly shifted from “writing code” to “making sure code is good.”

We spent months building generation skills that encode our conventions and domain knowledge. We gave almost zero attention to the review side. And yet review is where quality is actually enforced. Generation proposes; review decides.

According to Addy Osmani’s research on code review in the AI era, teams using AI-assisted review reduce time spent on reviews by 40-60% while improving defect detection rates. The tooling exists and the gains are real. But most teams reach for generic solutions that catch generic issues, and miss what actually matters for their codebase.

If you’ve invested in AI-assisted code generation but haven’t invested equally in AI-assisted code review, you’re optimizing the wrong half of the pipeline.

Generic AI reviews catch generic issues

Tools for AI-assisted code review already exist. Claude Code recently shipped a multi-agent review feature that posts inline comments on PRs. GitHub Copilot has its own review capabilities. Third-party tools like Qodo and CodeRabbit analyze diffs and flag potential issues. And Anthropic’s own data shows impressive results: PRs receiving detailed review comments jumped from 16% to 54% after introducing AI review, and large PRs (over 1000 lines) contain issues 84% of the time.

They’re genuinely useful as a baseline. Null pointer risks, obvious security holes, unused imports, inconsistent error handling, missing edge cases in conditional logic. The kind of issues that any experienced developer would catch but that slip through when you’re reviewing your twelfth PR of the day and your attention is fading. Having an AI flag these before a human even opens the PR is a real time saver.

But here’s the thing: the issues that actually matter in a mature codebase are rarely generic.

They’re things like: this aggregate exposes a setter that bypasses the invariant check. This query handler imports from the command side of the module. This event name uses past tense in one context and present tense in another. This integration test hits the real database instead of using the in-memory repository we built for that bounded context. This service method does both validation and persistence, when our convention is to separate those concerns into distinct pipeline steps.
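
Here’s what the first of those looks like in code, as a minimal sketch. The Subscription class is invented for the example, not taken from our codebase:

```typescript
// Hypothetical aggregate, for illustration only.
export class Subscription {
  private status: 'active' | 'cancelled' = 'active';
  private cancelledAt?: Date;

  // What we want: a behavior method that enforces the invariant.
  cancel(at: Date): void {
    if (this.status === 'cancelled') {
      throw new Error('Subscription is already cancelled');
    }
    this.status = 'cancelled';
    this.cancelledAt = at;
  }

  // What slips through: a setter that bypasses the invariant check entirely.
  setStatus(status: 'active' | 'cancelled'): void {
    this.status = status;
  }
}
```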

No generic tool catches these. They can’t, because these rules don’t exist anywhere outside your team’s collective knowledge. They live in onboarding conversations, PR comment threads, architecture decision records that half the team hasn’t read, and the heads of senior developers who’ve been on the project long enough to know why things are done a certain way.

And that’s the core problem. The most valuable review comments aren’t “you forgot a null check.” They’re “this breaks the pattern we established in the ordering context” or “this aggregate is doing too much, the invariant doesn’t require all of this in the same transaction.” These comments require deep project knowledge, and that’s exactly the knowledge generic tools lack.

The gap isn’t intelligence. These AI tools are plenty smart. The gap is context. They don’t know your codebase’s rules because nobody told them.

So we told them.

Encoding your knowledge into review skills

The same way we built custom skills for code generation, we built custom skills for code review. Same system, same approach: encode what matters into structured instructions that the AI can follow consistently. I explored this idea of encoding knowledge into structured AI workflows in a previous article about building a DDD facilitator skill; the same principle applies here, just on a different problem.

The idea is simple. Instead of relying on the AI’s general knowledge of “good code,” you tell it explicitly what good code looks like in your project. Architecture rules, naming patterns, test conventions, module boundaries, the decisions your team made and keeps making every day. Everything a senior developer carries in their head when they open a PR, written down in a format the AI can apply systematically.

This goes deep. We don’t just encode high-level principles like “follow clean architecture.” We encode the specific, opinionated decisions that make our codebase ours.

For architecture boundaries:

## CQRS boundaries

- Command handlers MUST NOT return domain objects. They return void or a simple acknowledgment.
- Query handlers MUST NOT trigger side effects or emit domain events.
- A module's command side and query side MUST NOT import from each other.
- Commands are validated in a dedicated validation step before reaching the handler.
  The handler assumes valid input.
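
To see why these rules earn their place, here’s a minimal sketch of a handler that violates the first one. Every name in it (RegisterUserHandler, UserRepository, User) is hypothetical:

```typescript
// Hypothetical types, invented for this sketch.
interface RegisterUserCommand {
  email: string;
  plan: 'basic' | 'premium';
}

interface UserRepository {
  save(user: User): Promise<void>;
}

class User {
  isActive = false;
  private constructor(readonly email: string, readonly plan: string) {}

  static register(email: string, plan: string): User {
    return new User(email, plan);
  }

  activate(): void {
    this.isActive = true;
  }
}

// The handler itself: it compiles, the tests pass, and it still breaks our rules.
export class RegisterUserHandler {
  constructor(private readonly users: UserRepository) {}

  async execute(command: RegisterUserCommand): Promise<User> {
    let user = User.register(command.email, command.plan); // a generic tool: "prefer const"

    if (command.plan === 'premium') {
      user.activate();
    }

    await this.users.save(user);
    return user; // the issue that matters: a command handler returning a domain object
  }
}
```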

Compare that to what a generic review tool would flag on the same code. It might tell you the function has high cyclomatic complexity, or that a variable could be const instead of let. Valid observations, but not what you actually need a reviewer to catch. The real issue is that a command handler returns an entity, and that breaks the architectural contract the team agreed on six months ago.

We do the same for test conventions:

## Test naming and structure

- Test descriptions follow: "it should [expected behavior] given [precondition]"
- Bad: "it should set isActive to true"
- Good: "it should activate the user given a valid subscription"
- Test names describe business outcomes, not implementation details
- Each test file mirrors the source file structure: user.service.spec.ts tests user.service.ts

This one catches a surprising amount. Left to its own judgment, the AI will write test names that describe what the code does internally (“it should call the repository save method”) rather than what it achieves from a business perspective (“it should persist the new order given all items are in stock”). The difference matters. When a test fails six months from now, “should activate the user given a valid subscription” tells you what broke. “Should set isActive to true” tells you nothing about why that matters.
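
Here’s the same rule as a small runnable sketch; the User class is hypothetical, reduced to the minimum the example needs:

```typescript
import { describe, expect, it } from '@jest/globals';

// Hypothetical domain object, just enough to make the tests run.
class User {
  isActive = false;
  constructor(private readonly hasValidSubscription: boolean) {}

  activate(): void {
    if (!this.hasValidSubscription) {
      throw new Error('Cannot activate a user without a valid subscription');
    }
    this.isActive = true;
  }
}

describe('user activation', () => {
  // Implementation-focused name: tells you which flag changed, not why it matters.
  it('should set isActive to true', () => {
    const user = new User(true);
    user.activate();
    expect(user.isActive).toBe(true);
  });

  // Business-focused name: when this fails in six months, you know what broke.
  it('should activate the user given a valid subscription', () => {
    const user = new User(true);
    user.activate();
    expect(user.isActive).toBe(true);
  });
});
```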

For domain modeling, the rules encode years of hard-won DDD experience:

## Aggregate design

- Aggregates expose behavior methods, never setters
- Every public method on an aggregate must enforce at least one invariant
- If a method doesn't enforce an invariant, it probably belongs in a domain service
- Domain events are named in past tense, describing what happened: OrderPlaced, not PlaceOrder
- Value objects are immutable; any "change" returns a new instance
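
The last two rules are easiest to show in code. A minimal sketch with invented names (Money, OrderPlaced), not excerpts from our codebase:

```typescript
// Hypothetical value object: immutable, so any "change" returns a new instance.
export class Money {
  constructor(
    readonly amount: number,
    readonly currency: string,
  ) {}

  add(other: Money): Money {
    if (other.currency !== this.currency) {
      throw new Error('Cannot add amounts in different currencies');
    }
    return new Money(this.amount + other.amount, this.currency);
  }
}

// Hypothetical domain event: named in past tense, it describes what happened.
export class OrderPlaced {
  constructor(
    readonly orderId: string,
    readonly placedAt: Date,
  ) {}
}
```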

And for naming and file organization:

## Naming and file organization

- Service classes are named after the action they perform: CreateOrderService, not OrderService
- One class per file, file name matches class name in kebab-case

These aren’t suggestions. They’re the encoded version of what a senior developer on the team would flag during a real review. The difference is that the AI applies them on every PR, every time, without fatigue or context switching. A human reviewer might miss the misnamed test on their fifteenth PR of the day. The AI won’t.

The rules accumulate over time. Every convention the team agrees on, every architectural decision that gets documented, every pattern that keeps showing up in review comments: it goes into the skills. We started with maybe twenty rules. Six months in, we have over a hundred, covering everything from aggregate design to error message formatting. The more specific the rules, the more useful the reviews become, and the less time human reviewers spend repeating the same feedback.

There’s a practical benefit that goes beyond review quality: onboarding. When a new developer joins the team, the review skills are effectively a living, executable style guide. They don’t just describe how code should look; they actively enforce it. Instead of learning conventions through months of PR feedback, new developers get immediate, consistent guidance from their first commit.

The feedback loop

This is where it gets interesting. The review skills and the generation skills aren’t separate systems. They’re two halves of the same quality loop.

Here’s how it works in practice. I’m reviewing a PR where the AI generated a command handler that returns the created entity. The generation skill didn’t prevent it, and the review skill didn’t flag it either. That’s two misses on the same issue.

So I fix both. I update the generation skills to instruct the AI to never return domain objects from command handlers. And I update the review skills to flag that pattern explicitly. One PR comment turns into two improvements that prevent the same mistake from ever reaching review again.
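
In practice, both updates are small. Something like this, with illustrative wording rather than the exact lines from our skills:

In the generation skill:

## Command handlers

- NEVER return domain objects from a command handler. Return void or the new aggregate's id.

And in the review skill:

## Command handlers

- Flag any command handler whose return type is a domain object or entity.
  Suggest returning void or an identifier instead.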

It works in the other direction too. Sometimes the review skill catches something that the generation skill should have prevented in the first place. A test named “it should return true” makes it through generation, gets flagged during review, and the fix goes back into the generation rules. Over time, the review catches fewer and fewer issues, because the generation side already handles them.

This is the core mental shift: generation and review are not separate concerns. They’re the same concern, observed at two different moments. Generation is “write code that follows our rules.” Review is “verify code follows our rules.” Every gap you find in one should feed back into the other.

Think of it as a ratchet. Each cycle tightens the system:

  1. Generate code using project-specific skills
  2. Review the output, both with AI review skills and human eyes
  3. Identify gaps where the wrong thing got through
  4. Update both skill sets to close the gap
  5. Next generation is better, next review is more precise

After a few months of this, the system compounds. The AI generates code that’s closer to what you’d write yourself, and the reviews focus on genuinely subtle issues rather than repeating the same basic feedback. The human reviewer’s job shifts from “catch convention violations” to “evaluate design decisions and business logic.” Which is what code review should have been about all along.

The practical result: review comments go down, but review quality goes up. You spend less time pointing out naming issues and more time discussing whether the bounded context boundary is in the right place. Less “rename this method” and more “should this really be an aggregate, or is a domain service more appropriate here?”

That’s not a small shift. That’s the difference between review as a gatekeeping chore and review as a genuine design conversation.

The hardest part: AI is not deterministic

There’s a catch to all of this, and it would be dishonest not to address it.

You can encode every rule perfectly, write crystal-clear instructions, provide examples of what to do and what not to do. And the AI will still occasionally ignore them. A command handler that returns an entity will slip through both generation and review. A test will be named “it should update the field” despite explicit rules saying otherwise. A query handler will emit a domain event even though the CQRS boundaries are documented in three different places.

This isn’t a bug. It’s the nature of working with large language models. They’re probabilistic, not deterministic. The same prompt with the same context can produce different outputs. A rule that works 95% of the time still fails on the other 5%, and in a codebase producing dozens of PRs a week, that 5% adds up.

The response isn’t to give up on encoding rules. It’s to build redundancy into the system. The same convention gets checked at generation time and at review time. If the generation skill misses it, the review skill catches it. If both miss it, the human reviewer is still there. Three layers, each independent, each compensating for the others.

It also means the work is never finished. Every week, I find a new edge case where the AI interpreted a rule differently than intended, or applied it in a context where it didn’t make sense. The skills evolve constantly. A rule that was one sentence six months ago might be a full paragraph now, with examples and counter-examples, because the AI kept finding creative ways to technically follow the letter while missing the spirit.

This is the part that requires genuine commitment. Building the initial skill set is a one-time effort. Maintaining and refining it is ongoing. If you treat it as a “set and forget” system, the quality will drift. If you treat it as a living document that evolves with your codebase, it keeps getting better.

Everyone owns this

One thing I want to be clear about: this is not a tech lead solo mission.

Yes, someone needs to lead the effort. Someone needs to set up the initial skills, define the architecture rules, and establish the feedback loop. That’s naturally a tech lead or senior developer responsibility, the same way someone leads any architectural initiative.

But every developer on the team should be involved. When you’re reviewing a PR and you catch something the AI missed, don’t just leave a comment. Ask yourself: should this be a rule? If the answer is yes, add it. When you’re writing code and the generation skill produces something that doesn’t match the team convention, don’t just fix it locally. Update the skill so it doesn’t happen again.

This scales the team’s knowledge in a way that traditional code review never could. Instead of conventions living in the heads of two or three senior developers, they’re encoded, shared, and enforced automatically. Every developer benefits from the accumulated knowledge, and every developer contributes to it.

You still own your code. You still own your reviews. The AI doesn’t replace judgment; it handles the mechanical part so you can focus on the parts that actually require thinking. Design decisions, business logic, whether the abstraction is in the right place, whether the feature actually solves the user’s problem. That’s where human reviewers add irreplaceable value.

The goal isn’t to automate code review away. It’s to make every reviewer on the team more effective, starting from their first PR.

TL;DR

  • AI code generation shifted the bottleneck to code review. Invest there accordingly.
  • Generic review tools catch generic issues; project-specific rules catch what actually matters for your codebase.
  • Treat generation and review as one system. Failures in either should feed back into the other, creating a ratchet that tightens over time.
  • The value isn’t in the AI itself. It’s in the knowledge you encode into it. The AI is only as good as the rules you give it.
  • Everyone on the team should participate in building and maintaining review quality. You still own your code and your reviews.