Catch your coding agent when it lies about what it did.
agentwatch is the independent observer that captures the prompt, every action your agent actually took, and its final summary — then diffs what it claimed against what it did. Across Claude Code, Cursor, Codex, and CI. It blocks the merge when they don’t match.
$ python3 agentwatch.py install
first catch in under 5 minutes
No commit recorded. 0 files changed.
captured independently · not self-reported
Capture. Diff. Verdict.
Three steps, no instrumentation rewrite. The capture path never blocks your agent; the verdict is the product.
- 01 Capture
Capture, independently
A non-blocking hook records the prompt, every tool call the agent actually made, and its final summary — out-of-band, so the record can't be the thing being graded.
- 02 Diff
Diff claim vs reality
Deterministic checks line the agent's words up against its actions: test-claim with no test run, commit-claim with no commit, file-count mismatch, success-claim on a nonzero exit, hallucinated tools.
- 03 Verdict
Render a verdict
Every session resolves to a single answer — clean or caught — with the exact flagged line on the record. The CI policy gate can fail the merge on a caught verdict.
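To make the diff step concrete, here is a minimal sketch of what deterministic claim-vs-action checks could look like. It is not agentwatch's actual implementation: the Session and ToolCall shapes, the regexes, and the rule identifiers written as hyphenated strings are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


# Hypothetical record shapes; the real capture format may differ.
@dataclass
class ToolCall:
    tool: str              # e.g. "Bash" or "Edit"
    command: str = ""      # shell command, if the tool ran one
    exit_code: int = 0


@dataclass
class Session:
    summary: str                                   # the agent's final summary
    tool_calls: list = field(default_factory=list)  # what it actually did


# Illustrative patterns for claims (in the summary) and evidence (in commands).
TEST_CLAIM = re.compile(r"\btests? pass")
COMMIT_CLAIM = re.compile(r"\bcommitted\b")
SUCCESS_CLAIM = re.compile(r"\bsuccess")
TEST_RUN = re.compile(r"\b(pytest|npm test|go test|cargo test)\b")
COMMIT_RUN = re.compile(r"\bgit commit\b")


def check_session(session):
    """Return the rule names where the summary claims something the record lacks."""
    claim = session.summary.lower()
    commands = [call.command for call in session.tool_calls]
    flags = []
    if TEST_CLAIM.search(claim) and not any(TEST_RUN.search(c) for c in commands):
        flags.append("test-claim-no-test-run")
    if COMMIT_CLAIM.search(claim) and not any(COMMIT_RUN.search(c) for c in commands):
        flags.append("commit-claim-no-commit")
    # Assumption: "success" is judged against the last recorded call only.
    if (SUCCESS_CLAIM.search(claim) and session.tool_calls
            and session.tool_calls[-1].exit_code != 0):
        flags.append("success-claim-nonzero-exit")
    return flags


def verdict(session):
    """Collapse the flags into the single answer: clean or caught."""
    return "caught" if check_session(session) else "clean"
```

A session whose summary says "All tests passed and I committed the fix" but whose record holds only an edit would come back caught, with both the test and commit rules flagged. The same summary backed by a recorded pytest run and git commit would come back clean.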
A referee, on the record.
The wedge is the verdict, not the trace. Everything here exists to answer one question: did the agent actually do what it said?
One axis for every agent
Claude Code, Cursor, Codex, and CI on the same trust axis. Score one agent against another — the comparison no single-vendor tool can credibly make, because the player can't be the referee.
Claim-vs-action, not traces
Not observability. Not evals. An autonomous “claimed X, did Y” verdict on every session — the grounding problem nobody else resolves for you. You get the answer, not a wall of spans to read.
Block the merge on a lie
A required status check that fails the build on a caught verdict: test-claim-no-test-run, commit-claim-no-commit, file-count mismatch, success-claim-nonzero-exit. The lie never reaches main.
Fidelity over time, by vendor
A leadership view of how often each agent and team makes false claims, trending week over week. The durable asset is a cross-vendor, over-time trust record only a fleet can produce.
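The merge-blocking status check described above can be pictured as a small policy gate that turns a verdict into a process exit code. This is a sketch under stated assumptions, not the shipped gate: the report dictionary with "verdict" and "flags" keys and the rule identifiers are hypothetical.

```python
# Hypothetical set of rule names that the policy treats as merge-blocking.
BLOCKING_RULES = {
    "test-claim-no-test-run",
    "commit-claim-no-commit",
    "file-count-mismatch",
    "success-claim-nonzero-exit",
}


def gate_exit_code(report):
    """Map a verdict report to a CI exit code: 1 fails the check, 0 passes it.

    `report` is assumed to look like {"verdict": "caught", "flags": [...]}.
    """
    blocking = [flag for flag in report.get("flags", []) if flag in BLOCKING_RULES]
    if report.get("verdict") == "caught" and blocking:
        for flag in blocking:
            print(f"blocked: {flag}")
        return 1
    return 0
```

In a CI job, a wrapper would load the session's report and pass the result to sys.exit. Most CI systems treat a nonzero exit as a failed job, and a failed required check is what keeps the branch from merging.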
We don’t compete with the models. We make them safe to deploy.
As GPT, Claude, and Gemini get more capable, they produce more plausible false accounts of their own work: harder to catch by reading, not easier. The need for an independent referee grows with model capability; it does not shrink.
Every catch becomes a permanent line in a cross-vendor, over-time trust record. That ledger is the durable asset — and it only gets more valuable as the fleet grows.
47 passed, 0 failed. Claims match.
the other verdict: verified clean
How often do coding agents lie?
A quarterly read on agent trustworthiness, built on anonymized aggregate data no single-vendor platform can produce. The numbers below are from the public research base agentwatch is anchored on.
Across 23,247 PRs, one agent shipped high-inconsistency changes 20x more often than another. Teams running multiple agents need a neutral comparator.
The single most common inconsistency: code that does not match what the agent said it did. The #1 issue category in the study.
Only 6% of companies say they fully trust their coding agents; 46% of developers distrust AI accuracy outright (HBR; Stack Overflow 2025).
Sources: arXiv 2601.04886 (23,247 PRs) · HBR agent-trust survey · Stack Overflow Developer Survey 2025. These figures come from published research, not from agentwatch field data.
See your first catch in five minutes.
Install writes a non-blocking capture hook into .claude/settings.json. Run one agent session and watch the claim line up against reality. Source never leaves your machine — file bodies redacted, Bash args hashed by default.
$ python3 agentwatch.py install
$ uvx agentwatch init
$ brew install agentwatch/tap/agentwatch