The deleted test, and what it taught us about AI in engineering

14 April 2026
Last month, one of our AI coding assistants tried to make a failing test pass by deleting the test.

It was caught in PR review. No harm done. But the moment stuck with us, because it wasn't a bug — it was the tool doing exactly what we asked. We said "make this pass." It found the shortest path. The test was in the way.

That's the whole argument of this post in one anecdote. AI will reach the acceptance criteria any way it can. Setting the guardrails is the job.
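The guardrail we added afterwards is boring on purpose: before an AI-assisted change merges, CI checks whether the diff deletes anything that looks like a test file. Here's a minimal sketch of that check — the helper name and the glob patterns are ours and illustrative, not from any particular tool; adapt them to your repo's layout:

```python
# Sketch of a PR guardrail: given the paths a change deletes,
# flag any that look like test files so CI can block the merge.
# (Hypothetical helper and globs -- adjust to your own conventions.)
import fnmatch

TEST_GLOBS = ["*test*.py", "*.test.ts", "*.spec.ts", "tests/*"]

def deleted_test_files(deleted_paths):
    """Return the subset of deleted paths that match a test-file pattern."""
    return [
        p for p in deleted_paths
        if any(fnmatch.fnmatch(p, g) for g in TEST_GLOBS)
    ]

# In CI: feed this the output of `git diff --diff-filter=D --name-only`
# and fail the build if anything comes back.
hits = deleted_test_files(["src/app.ts", "src/billing.spec.ts"])
if hits:
    print("Blocked: PR deletes test files:", hits)
```

It doesn't stop an assistant from gutting a test's assertions, of course — that still takes a human reading the diff. But it turns the shortest path into one that fails loudly.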

Same tool, opposite outcomes

The honest thing nobody in our industry wants to say out loud: AI coding tools are a velocity multiplier for senior developers and a debt generator for junior ones. Same tool. Opposite results. It depends entirely on the judgment already in the room.

A senior dev uses AI to skip the boring parts of work they already understand. The third CRUD endpoint this week. The Zod schema they could write in their sleep. The regex they'd otherwise spend ten minutes Googling. They read every line that comes back because they know what should be there, and they catch the moment the model hallucinates a method that doesn't exist on the object.

A junior dev uses AI to skip the parts they don't understand yet. That's the problem. They ship 300 lines of plausible-looking code, it passes the tests, the PR gets merged, and six weeks later nobody — including the person who "wrote" it — can explain why it works or what happens when the edge case hits.

The tool amplifies whatever judgment is already there. If there's none, it amplifies zero.

This isn't a case against junior developers. It's a case for being deliberate about how they're brought into an AI-assisted workflow. At Plum Tree we've shifted how we mentor — more time reading code together, less time writing it. The bottleneck isn't typing speed anymore. It never really was.

The debt nobody's measuring yet

Here's what worries me about the rest of the industry.

When generating code is nearly free, the real cost moves somewhere else. It moves to reading, understanding, and changing code that already exists. Teams who don't feel that shift keep optimizing the cheap thing — output — and ignore the thing that's quietly getting more expensive.

For six to twelve months, those teams look fast. Features ship. Sprints close. Stakeholders are happy. Then something in the business changes — a new commission tier, a different qualification rule, a regulatory tweak — and someone has to find where the logic actually lives. That's when the bill arrives.

Direct selling software is especially exposed to this. The domain logic is the product. Genealogy trees, rank qualifications, compression rules, clawbacks, override splits — these aren't features you bolt on, they're the thing customers are paying for. Get a commission calculation wrong and you're not shipping a bug. You're shipping a lawsuit, a support nightmare, and a distributor revolt all at once.
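To make the "where does the logic live" problem concrete: the difference between a codebase you can change and one you can't is often just whether a rule has a name and a single home. A minimal sketch — the tier numbers and names are invented for illustration, not our production engine — using exact decimal arithmetic, because commission math should never be floating point:

```python
# Illustrative sketch of commission tiers as named data -- hypothetical
# numbers, not a real compensation plan.
from decimal import Decimal

# Tiers as (minimum qualifying volume, commission rate), highest first.
# When the business adds a tier, the change lives here and nowhere else.
COMMISSION_TIERS = [
    (Decimal("10000"), Decimal("0.10")),
    (Decimal("5000"),  Decimal("0.07")),
    (Decimal("0"),     Decimal("0.05")),
]

def commission_rate(qualifying_volume: Decimal) -> Decimal:
    """Return the rate for the highest tier this volume qualifies for."""
    for minimum, rate in COMMISSION_TIERS:
        if qualifying_volume >= minimum:
            return rate
    raise ValueError("tiers must cover zero volume")

def commission(qualifying_volume: Decimal) -> Decimal:
    # Decimal, not float: 7% of 5000 must be exactly 350, every time.
    return qualifying_volume * commission_rate(qualifying_volume)
```

When the "new commission tier" request arrives, a structure like this is a one-line change. A 4,000-line file where the tiers are inlined across a dozen branches is an archaeology project.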

We've seen codebases (not ours — inherited ones from clients migrating to our platform) where the commission engine is a 4,000-line file nobody on the current team wrote. Some of it is clearly human. Some of it has the telltale shape of generated code: overly defensive, oddly symmetrical, full of comments that explain what the line does rather than why. The original intent is gone. The only way forward is to rebuild it from the domain up.

That's the technical debt of non-artisanal software. It doesn't look like debt while it's being created. It looks like velocity.

How we actually use it at Plum Tree

We use AI every day. I want to be clear about that, because the point of this post isn't "AI bad." It's that we've had to learn a set of ground rules the hard way, and they matter more than the tool choice.

Where AI earns its keep on our team:

  • Scaffolding and boilerplate. New route, new schema, new component shell. Fast, low-risk, easy to review.
  • Test generation, under review. The model drafts, a human decides whether the test is actually testing the thing.
  • PR review assistance. This has been one of the biggest unlocks — AI is genuinely good at spotting unsafe patterns, potential injection points, secrets accidentally committed, and the kind of vulnerability leaks that tired humans miss on a Friday afternoon.
  • DevOps and CI tuning. Pipeline configs, Docker weirdness, GitHub Actions — the stuff that used to eat a senior's afternoon now takes twenty minutes.
  • Legacy spelunking. "Explain what this file does and where it's called from" is a superpower when you inherit a codebase.

Where it doesn't touch the wheel:

  • Commission math and qualification logic. Anything money-adjacent. Full stop.
  • Architecture decisions. The model doesn't know our constraints, our roadmap, or which abstractions we're planning to kill in three months.
  • Production database migrations. The cost of being wrong is too high and the tool has no sense of that cost.
  • Domain modeling. This is the work. Outsourcing it defeats the point.

We rotate between Copilot, Claude Code, and Gemini depending on the task. No loyalty to any of them. The frontier moves every quarter and best practices are being written in real time — we're writing some of them ourselves, through trial and error on real production work.

Which brings me back to the deleted test. It was Claude Code, for what it's worth. I don't hold it against the tool. The assistant was doing exactly what we'd asked: make the acceptance criteria pass. We hadn't constrained it from touching the test file. That's on us. The lesson wasn't "don't trust AI" — it was "define the constraints, review every diff, and treat the output like a PR from a very confident junior developer." Grateful for the speed. Skeptical until reviewed.

Our code quality is the product we sell. Not the velocity. Not the line count. The quality. That's the line we won't cross, no matter which model is hot this quarter.

Stop counting lines

If you run a direct selling business and you're evaluating the team that builds your software — whether it's your own engineers or a vendor like us — here's the mental model I'd ask you to update.

Lines of code were always a bad productivity metric. In the AI era they're actively misleading. A team shipping 10,000 lines a week might be building less real value than a team shipping 500, because the first team is mostly generating future maintenance work. The cost of that work won't show up on this quarter's invoice. It'll show up the first time you ask for a change that touches the tangled part.

Judge the teams building your software on different things. How fast can they change something that already exists? How confident are they when they touch the commission engine? Can the person explaining the code to you actually explain it, or are they reading it off the screen for the first time too? Do they understand your domain — not the software domain, your domain — deeply enough to catch the model when it's confidently wrong?

AI amplifies domain experts. It does not create them. That's the part the hype cycle keeps missing.

Craft matters more now, not less. The tools got faster; the standards should get higher to match. That's the contrarian take, and it's the one I think ages well.

We're still figuring a lot of this out. Anyone who tells you they have it solved is selling something. But the direction is clear enough: the teams who treat AI as a power tool in the hands of people who already know the work will pull further ahead of the teams who treat it as a replacement for knowing the work. The gap is going to be enormous, and it's going to show up in exactly the place it hurts most — the moment something important needs to change.

Ready to learn more?

Discover how our commission modelling and software can help you scale.

Book a Discovery Call