Use Coding Agents on the Bugs You Keep Avoiding


Last time we got into flags, output formats, and model choice. All useful, but it still leaves the obvious question open.
Where the hell do you actually point these things if you do real work and not just slop out toy apps all day?
I've mostly moved over to Codex CLI lately so the examples below use that, but the workflow here is the point, not some provider-specific trick. If you'd rather use Claude Code, OpenCode, or something else, fine. Same idea.
One of the best uses for coding agents, especially while you're still learning where they break, is the weird bug or side quest you've been avoiding.
The failing test you haven't dug into. The regression with three possible causes. The edge case you know exists but can't be bothered to spend your afternoon testing properly. The ugly log spam you keep noticing and keep telling yourself you'll get back to later.
You still reason better than the model. Good. You should. The point is that it will happily chew through logs, stack traces, test output, grep results, and dumb little experimental fixes without getting bored or annoyed. A lot of the time I don't want the agent writing the final answer. I want it doing the annoying investigation work around the answer.
First rule, though: don't let it loose in the checkout you're actively working in. That's how you turn a maybe-useful bug hunt into you being annoyed for no reason.
```shell
git worktree add ../bughunt-checkout -b bughunt-checkout
cd ../bughunt-checkout
codex
```

Separate branch, separate worktree, whatever. Same idea. Let it fuck around somewhere that isn't gonna trash the thing you're in the middle of.
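And when the hunt is over, tear the scratch checkout back down so it doesn't rot next to your real one. The paths and branch name here match the setup above:

```shell
# back in your main checkout, once the hunt is done:
git worktree remove ../bughunt-checkout   # deletes the scratch directory
git branch -D bughunt-checkout            # drops the throwaway branch, if nothing on it is worth keeping
```

If the agent actually produced something worth keeping, cherry-pick or merge the branch first instead of `-D`-ing it.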
Then give it the whole pile. Bug description, expected behaviour, actual behaviour, repro steps, logs, failing command, suspicious files, and what you've already ruled out. This is one of the few times where oversharing with the model actually pays off.
```
We have a bug where checkout retries spike after deploy.

Expected:
- one retry at most on transient failures

Actual:
- some requests loop until timeout

Repro:
1. npm run dev
2. hit POST /checkout with the test payload in scripts/repro.json
3. inspect logs in logs/checkout.log

Files I suspect:
- src/payments/retry.ts
- src/api/checkout.ts

What I already checked:
- DB looks healthy
- queue depth is normal
- issue started after commit abc123

Figure out the likely root cause.
You can try fixes and run tests if that helps isolate it.
Do not just patch the symptom and call it done.
Show me what you think is happening, what you tested, and what is still uncertain.
```

That last bit matters. I don't want the first thing it does to be some random patch because the model felt like being helpful. I do want it trying ideas though. If it can tweak something, run the test, inspect the output, back it out, and try again, that's often how it actually gets somewhere. Investigation first, but with permission to poke at the system a bit.
Then once it thinks it found something, make it prove it. This is where most people fuck it up.
The model does not get to decide whether it succeeded. Your environment does. The tests do. The hooks do. You do.
```shell
npm test
npm run lint
npm run typecheck
pre-commit run --all-files
```

This is also one of the funniest things about these tools. A bunch of unit tests I couldn't be bothered to write for myself suddenly become very useful when the machine is the one that needs to prove it isn't bullshitting me.
What you want back from a useful run is pretty simple. A clear guess at the root cause. The files involved. The commands it ran. Evidence that the diagnosis is real. A small diff if there is one. And any remaining uncertainty. If all it gives you is "fixed it" and a massive patch touching half the repo, bin it.
And even when it does find the big nasty bug, still review the diff like a normal person.
In Codex I like doing a local review pass once it's done.
```
/review
```

If it fixes one horrible bug it will often introduce three small stupid ones. Fine. Those are usually a lot easier for you to clean up than the original issue was. The point is not that the agent writes perfect code. The point is that it makes taking a real swing at annoying problems much cheaper.
You also don't need it to fully solve the problem every time. Sometimes the patch is garbage but the diagnosis is right. Sometimes the diagnosis is half right but it points you at the file you should've opened first. Sometimes the whole run is useless and you throw it away. Fine. That still beats carrying the bug around in your head for three more days.
And if you wanna do the same sort of thing in one shot instead of in the TUI, you can.
```shell
codex exec "Investigate why checkout retries spike after deploy. You can edit files, run tests, and try small fixes to isolate the cause. Do not stop at a patch. Prove the diagnosis."
```

These tools don't replace your judgement. They make it cheap to throw serious context at the work you keep avoiding and get something concrete back.
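When I run it headless like that, I usually keep the transcript around so I can check its claims against what it actually ran. Nothing fancy; the file name is just an example:

```shell
# same one-shot run, but save the agent's report for later review
codex exec "Investigate why checkout retries spike after deploy. Prove the diagnosis." \
  | tee bughunt-report.txt
```

Then the "receipts" below are sitting in a file instead of scrolled off your terminal.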
So use them on the ugly bugs, the annoying regressions, and the edge cases you can't be bothered to test.
Just make them come back with receipts.