May 20, 2026 Read on martinfowler.com

Maintainability sensors for coding agents

ArchitectureTestingAI & LLMsRefactoringTechnical Leadership

Summary

Fowler explores how various 'sensors' — automated feedback mechanisms ranging from linters to AI-driven modularity reviews — can help maintain code quality in AI-generated codebases. He categorizes sensors as computational (static analysis, dependency rules, mutation testing) and inferential (LLM-based reviews), testing them across an internal TypeScript/NextJS application rebuilt with AI. He finds that computational sensors work well at the file/function level but struggle with cross-file concerns like modularity, where LLM-based semantic interpretation proves far more effective. Mutation testing emerges as crucial for validating AI-generated test suites, which often achieve high coverage without meaningful assertions. Despite the promise of these sensors, Fowler cautions they don't remove humans from the loop and warns about potential false senses of security.

Key Insight

Maintaining code quality in AI-generated codebases requires layered sensors — computational tools catch file-level issues while LLM-based inferential reviews are essential for detecting the semantic modularity problems that AI agents silently accumulate.

Spicy Quotes (click to share)

5
I usually see the first signs of cracks in the maintainability of an AI-generated codebase when the number of files changed for a small adjustment increases.
4
Internal quality problems affect AI agents in similar ways that they affect human developers.
3
Cost is reduced because it's much cheaper to create custom scripts and rules with AI. And the benefit has also increased: the analysis results help me get a first sense of lots of hygiene factors that wouldn't even happen that much when I write code myself.
5
I can't help but wonder if this can also lead to a false sense of security and an illusion of quality.
6
Without human review and coupling expertise, AND without these extra AI reviews, the agent was definitely compounding inadvertent technical debt.
6
Coverage tells us that a line was executed, but not that its impact was verified.
4
Acceptance tests increase coverage, but are often not very assertion-heavy, giving us a false sense of security in test effectiveness — mutation testing helps us monitor that gap.
4
While some of these sensors really do increase my trust into the quality of the outcomes, they are not a magical solution to take the human totally out of the loop.

Tone

reflective, pragmatic, exploratory