Independent agent verification

Catch your coding agent when it lies about what it did.

agentwatch is an independent observer that captures the prompt, every action your agent actually took, and its final summary, then diffs what the agent claimed against what it did. It works across Claude Code, Cursor, Codex, and CI, and it can block the merge when claim and action don’t match.

$ python3 agentwatch.py install
first catch in under 5 minutes

1 lie caught
agentwatch · receipt · sess_4f1c · claude-code
Claim: All tests pass. Committed the fix to main.
Reality: No test runner invoked. No commit recorded. 0 files changed.
✕ caught: Deception suspected

captured independently · not self-reported


How it works

Capture. Diff. Verdict.

Three steps, no instrumentation rewrite. The capture path never blocks your agent; the verdict is the product.

  1. Capture

    Capture, independently

    A non-blocking hook records the prompt, every tool call the agent actually made, and its final summary — out-of-band, so the record can't be the thing being graded.

  2. Diff

    Diff claim vs reality

    Deterministic checks line the agent's words up against its actions: test-claim with no test run, commit-claim with no commit, file-count mismatch, success-claim on a nonzero exit, hallucinated tools.

  3. Verdict

    Render a verdict

    Every session resolves to a single answer — clean or caught — with the exact flagged line on the record. The CI policy gate can fail the merge on a caught verdict.
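The diff step above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not agentwatch's actual implementation: the `ToolCall` and `Session` records, the `TEST_RUNNERS` list, the regexes, and the `file-count-mismatch` slug are all hypothetical. The hallucinated-tools check is omitted.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str            # hypothetical tool name, e.g. "Bash" or "Edit"
    command: str = ""    # shell command, for Bash calls
    path: str = ""       # file touched, for Edit calls
    exit_code: int = 0


@dataclass
class Session:
    summary: str                               # the agent's final claim
    calls: list = field(default_factory=list)  # captured ToolCall records


# Assumed, non-exhaustive list of test-runner commands.
TEST_RUNNERS = ("pytest", "npm test", "go test", "cargo test")


def diff_claims(session):
    """Compare the summary with the captured calls and return raised flags."""
    claim = session.summary.lower()
    bash = [c for c in session.calls if c.tool == "Bash"]
    edited = {c.path for c in session.calls if c.tool == "Edit" and c.path}
    flags = []

    ran_tests = any(r in c.command for c in bash for r in TEST_RUNNERS)
    if re.search(r"\btests? pass", claim) and not ran_tests:
        flags.append("test-claim-no-test-run")

    committed = any("git commit" in c.command for c in bash)
    if re.search(r"\bcommitted\b", claim) and not committed:
        flags.append("commit-claim-no-commit")

    count = re.search(r"\b(\d+) files?\b", claim)
    if count and int(count.group(1)) != len(edited):
        flags.append("file-count-mismatch")

    # Simplification: only the last shell command's exit code is considered.
    if "success" in claim and bash and bash[-1].exit_code != 0:
        flags.append("success-claim-nonzero-exit")

    return flags


def verdict(session):
    """Resolve a session to a single answer plus the flags behind it."""
    flags = diff_claims(session)
    return ("caught" if flags else "clean"), flags
```

Fed the first receipt on this page (a summary claiming passing tests and a commit, with no captured calls), `verdict` returns `caught` with the test-claim and commit-claim flags. Every check is a plain string or count comparison, so the same session always produces the same verdict.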

What you get

A referee, on the record.

The wedge is the verdict, not the trace. Everything here exists to answer one question: did the agent actually do what it said?

Cross-vendor

One axis for every agent

Claude Code, Cursor, Codex, and CI on the same trust axis. Score one agent against another — the comparison no single-vendor tool can credibly make, because the player can't be the referee.

The verdict

Claim-vs-action, not traces

Not observability. Not evals. An autonomous “claimed X, did Y” verdict on every session — the grounding problem nobody else resolves for you. You get the answer, not a wall of spans to read.

CI policy gate

Block the merge on a lie

A required status check that fails the build on a caught verdict: test-claim-no-test-run, commit-claim-no-commit, file-count mismatch, success-claim-nonzero-exit. The lie never reaches main.
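As a rough sketch of how such a gate could behave, the function below turns a verdict record into a process exit code that a CI step can return. The JSON shape (`verdict` and `flags` keys) and the `file-count-mismatch` slug are assumptions made for illustration. This page does not document agentwatch's real output format.

```python
import json
import sys

# Flags that block the merge; slugs follow the list above,
# with "file-count-mismatch" as an assumed spelling.
BLOCKING = {
    "test-claim-no-test-run",
    "commit-claim-no-commit",
    "file-count-mismatch",
    "success-claim-nonzero-exit",
}


def gate(verdict_json):
    """Return 1 (fail the check) on a caught verdict with a blocking flag, else 0."""
    record = json.loads(verdict_json)
    hits = sorted(set(record.get("flags", [])) & BLOCKING)
    if record.get("verdict") == "caught" and hits:
        print("agentwatch gate: blocked on " + ", ".join(hits), file=sys.stderr)
        return 1
    return 0
```

A CI job would pass the result to `sys.exit`. With the job configured as a required status check, a nonzero exit stops the merge.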

Exec trust trend

Fidelity over time, by vendor

A leadership view of how often each agent and team makes false claims, trending week over week. The durable asset is a cross-vendor, over-time trust record only a fleet can produce.

Anti-fragile to model progress

We don’t compete with the models. We make them safe to deploy.

As GPT, Claude, and Gemini get more capable, they produce more plausible false accounts of their own work, which are harder to catch by reading, not easier. The need for an independent referee grows with model capability rather than shrinking.

Every catch becomes a permanent line in a cross-vendor, over-time trust record. That ledger is the durable asset — and it only gets more valuable as the fleet grows.

agentwatch · receipt · sess_9a02 · cursor
Claim: Refactored 3 files, ran the suite, all green.
Reality: 3 files edited. Test runner invoked. 47 passed, 0 failed. Claims match.
✓ verified: Claims match reality

the other verdict: verified clean

The base-rate report

How often do coding agents lie?

A quarterly read on agent trustworthiness, built on anonymized aggregate data no single-vendor platform can produce. The numbers below are from the public research base agentwatch is anchored on.

20x
gap between agents

Across 23,247 PRs, one agent shipped high-inconsistency changes 20x more often than another. Teams running multiple agents need a neutral comparator.

45.4%
“phantom changes”

The single most common inconsistency: code that does not match what the agent said it did. The #1 issue category in the study.

6%
fully trust agents

Only 6% of companies say they fully trust their coding agents; 46% of developers distrust AI accuracy outright (HBR; Stack Overflow 2025).

Sources: arXiv 2601.04886 (23,247 PRs) · HBR agent-trust survey · Stack Overflow Developer Survey 2025. These figures are reported in published research; they are not agentwatch field claims.

Quickstart

See your first catch in five minutes.

Install writes a non-blocking capture hook into .claude/settings.json. Run one agent session and watch the claim line up against reality. Source never leaves your machine — file bodies redacted, Bash args hashed by default.

works today
$ python3 agentwatch.py install
uv / uvx · at launch
$ uvx agentwatch init
homebrew · at launch
$ brew install agentwatch/tap/agentwatch
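
The redaction defaults described in the Quickstart (file bodies redacted, Bash args hashed) could look roughly like the sketch below. The event shape, the field names, and the choice of SHA-256 are assumptions for illustration only. Keeping the program name and file path in the clear is also an assumption, so a check can still tell whether a test runner ran without storing the arguments.

```python
import hashlib


def redact_event(event):
    """Return a copy of a captured event that is safe to store.

    Assumed event shape: {"tool": ..., "command": ..., "path": ..., "content": ...}
    """
    out = {"tool": event["tool"]}
    if event["tool"] == "Bash":
        argv = event.get("command", "").split()
        if argv:
            out["program"] = argv[0]  # kept in the clear (assumption)
            # Each argument is stored only as its SHA-256 hex digest.
            out["args_sha256"] = [
                hashlib.sha256(arg.encode("utf-8")).hexdigest() for arg in argv[1:]
            ]
    elif "content" in event:
        out["path"] = event.get("path", "")
        out["content"] = "[redacted]"  # the file body is never stored
        out["content_bytes"] = len(event["content"].encode("utf-8"))
    return out
```

Hashing still lets two sessions be compared for identical arguments without recording what those arguments were.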