Multi-agent code review system that shows its work - specialised AI agents analyse PRs, then deliberate to produce unified feedback.
Most AI code review tools are a single LLM call with 'review this PR' as the prompt. Results are generic and hard to trust because you can't see the reasoning. Arbiter splits review into specialised agents with focused mandates. They analyse independently, then deliberate to resolve conflicts and produce unified feedback.
The deliberation transcript is visible. You can see exactly how agents reasoned and where they agreed or disagreed. That transparency is the difference between a tool that generates suggestions and one you actually trust.
Before agents see the code, a static analysis pipeline runs: ruff for linting, mypy for type checking, bandit for security scanning, radon for complexity metrics. Results are injected into each agent's context, grounding LLM analysis in concrete findings rather than pure pattern matching. This catches obvious issues deterministically so agents can focus on higher-level concerns.
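The pre-pass might look like the sketch below: each tool runs only if it is installed, and raw output is collected per tool so it can be injected into each agent's context. The commands and flags here are illustrative assumptions, not Arbiter's actual invocations.

```python
import shutil
import subprocess

# Illustrative tool invocations -- flags are assumptions, not Arbiter's real config.
TOOLS = {
    "ruff":   ["ruff", "check", "--output-format", "json"],
    "mypy":   ["mypy", "--no-error-summary"],
    "bandit": ["bandit", "-f", "json", "-q"],
    "radon":  ["radon", "cc", "-j"],
}

def run_pre_pass(paths: list[str]) -> dict[str, str]:
    """Run each available analyser over the changed files; return raw output
    keyed by tool name, ready to inject into an agent's context."""
    findings: dict[str, str] = {}
    for tool, cmd in TOOLS.items():
        if shutil.which(tool) is None:
            continue  # tool not installed: skip rather than fail the review
        proc = subprocess.run(cmd + paths, capture_output=True, text=True)
        findings[tool] = proc.stdout
    return findings

def as_context(findings: dict[str, str]) -> str:
    """Format findings as a context block prepended to each agent prompt."""
    if not findings:
        return "No static analysis findings."
    return "\n\n".join(
        f"## {tool}\n{out or '(clean)'}" for tool, out in findings.items()
    )
```

Degrading gracefully when a tool is missing keeps the review running; the agents simply get fewer concrete findings to ground on.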
Three agents, each with a focused mandate: Security (vulnerabilities, injection risks, auth issues, secret exposure), Style (consistency, naming, readability, project conventions), and Complexity (cyclomatic complexity, function length, abstraction depth, maintainability). Each gets the diff, static analysis results, and a specialised system prompt.
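A minimal sketch of the mandates as data, assuming a simple system-prompt-plus-user-message structure. The prompt text is abbreviated for illustration; the real prompts would be longer and project-specific.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentSpec:
    name: str
    system_prompt: str

# Abbreviated mandates -- illustrative, not the production prompts.
AGENTS = [
    AgentSpec("security", "Review the diff for vulnerabilities, injection risks, "
              "auth issues, and secret exposure. Report only security findings."),
    AgentSpec("style", "Review the diff for consistency, naming, readability, "
              "and project conventions. Report only style findings."),
    AgentSpec("complexity", "Review the diff for cyclomatic complexity, function "
              "length, abstraction depth, and maintainability."),
]

def build_messages(agent: AgentSpec, diff: str, static_findings: str) -> list[dict]:
    """Assemble one agent's prompt: its mandate, the static analysis
    results, and the diff -- nothing from any other agent."""
    return [
        {"role": "system", "content": agent.system_prompt},
        {"role": "user",
         "content": f"Static analysis results:\n{static_findings}\n\nDiff:\n{diff}"},
    ]
```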
Independence matters: agents don't see each other's initial analysis. This prevents groupthink and produces genuinely different perspectives. LiteLLM provides model-agnostic LLM access, so swapping models doesn't require changing agent code.
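Independence can be enforced structurally: each agent's message list contains only its own mandate plus the shared inputs, never a peer's analysis. In the sketch below the LLM call is a pluggable parameter standing in for LiteLLM (with LiteLLM it would be roughly `litellm.completion(model=..., messages=...)`), which is also what makes model swapping a config change rather than a code change.

```python
from typing import Callable

def analyse_independently(
    agents: list[dict],
    diff: str,
    static_findings: str,
    call_llm: Callable[[str, list[dict]], str],
) -> dict[str, str]:
    """Run every agent's initial analysis in isolation: no agent's messages
    ever include another agent's output, so reviews cannot anchor on each
    other. `call_llm(model, messages)` abstracts the provider (e.g. LiteLLM)."""
    reviews: dict[str, str] = {}
    for agent in agents:
        messages = [
            {"role": "system", "content": agent["system_prompt"]},
            {"role": "user", "content": f"{static_findings}\n\n{diff}"},
        ]
        reviews[agent["name"]] = call_llm(agent["model"], messages)
    return reviews
```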
After independent analysis, agents enter a deliberation round. Each sees the others' findings and can agree, disagree, or add context. The system synthesises deliberation into a unified review with consensus ratings.
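One way to reduce a deliberation round to a consensus rating, assuming each agent takes a stance of agree, disagree, or add-context. This is a toy rule for illustration, not Arbiter's actual synthesis logic.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Position:
    agent: str
    stance: str      # "agree" | "disagree" | "context"
    reasoning: str

def consensus_rating(positions: list[Position]) -> str:
    """Toy consensus rule: any disagreement makes the finding 'contested',
    unanimous agreement makes it 'consensus', anything else is 'majority'."""
    stances = Counter(p.stance for p in positions)
    if stances["disagree"]:
        return "contested"
    if stances["agree"] == len(positions):
        return "consensus"
    return "majority"
```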
Conflicts are surfaced explicitly: if Security flags something that Style thinks is fine, both perspectives are shown with reasoning. The full deliberation transcript is stored and browsable. You see why a recommendation was made, not just what it recommends. This is what makes it different from single-prompt review tools.
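Surfacing a conflict is mostly a rendering concern: keep every position, with its reasoning, attached to the finding instead of collapsing to a single verdict. A hypothetical transcript format:

```python
def render_conflict(finding: str, positions: list[dict]) -> str:
    """Render every side of a disagreement so the reader sees why a
    recommendation was made, not just what it recommends.
    The transcript format here is hypothetical."""
    lines = [f"Finding: {finding}"]
    for p in positions:
        lines.append(f"  [{p['stance']}] {p['agent']}: {p['reasoning']}")
    return "\n".join(lines)
```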
GitHub and GitLab webhook integration. Open a PR or push new commits to it and review starts automatically. Results are posted as PR comments with a summary and per-file annotations. A React dashboard lets you explore reviews: filter by project, severity, agent, or time range.
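One detail worth getting right in any webhook receiver is signature verification. GitHub sends an HMAC-SHA256 of the payload in the `X-Hub-Signature-256` header as `sha256=<hexdigest>`; checking it before queuing a review stops forged events from triggering LLM spend. A stdlib-only sketch:

```python
import hashlib
import hmac

def verify_github_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Recompute the HMAC-SHA256 of the raw request body with the shared
    webhook secret and compare it, in constant time, to the value GitHub
    sent in X-Hub-Signature-256 ('sha256=<hexdigest>')."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)
```

GitLab uses a simpler scheme (a plain `X-Gitlab-Token` header compared against the configured secret), so its check is a direct constant-time string comparison.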
Cost controls keep things practical: token budgets per review and response caching for unchanged files between pushes to the same PR. Redis handles job queuing and caching; PostgreSQL stores reviews, deliberation transcripts, and cost tracking.
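Both controls are small pieces of code. A sketch with an illustrative default cap, and a cache key derived from the file's content hash so an unchanged file hits the cache on the next push; the key layout is an assumption, not Arbiter's actual Redis schema.

```python
import hashlib

class ReviewBudget:
    """Hard per-review token cap (the default is illustrative)."""

    def __init__(self, max_tokens: int = 50_000):
        self.max_tokens = max_tokens
        self.spent = 0

    def charge(self, tokens: int) -> None:
        """Record spend; refuse the call that would exceed the cap."""
        if self.spent + tokens > self.max_tokens:
            raise RuntimeError("token budget exhausted for this review")
        self.spent += tokens

def cache_key(pr_id: str, path: str, file_content: str, agent: str) -> str:
    """Key per (PR, agent, file) on a content hash: a push that leaves the
    file unchanged produces the same key and hits the cached response."""
    digest = hashlib.sha256(file_content.encode()).hexdigest()
    return f"review:{pr_id}:{agent}:{path}:{digest}"
```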
FastAPI webhook receiver queues reviews in Redis. A worker process runs the static analysis pre-pass (ruff, mypy, bandit, radon), then dispatches to three specialised LLM agents via LiteLLM. Agents analyse the diff independently, then enter a deliberation round where they see each other's findings. The deliberation and final review are stored in PostgreSQL. Results are posted back to GitHub/GitLab as PR comments and surfaced through a React dashboard.
Static analysis pre-pass (ruff, mypy, bandit, radon)
Specialised LLM agents for security, style, and complexity
Deliberation step shows how agents reached their conclusions
GitHub/GitLab webhook integration
React dashboard for review exploration
Cost controls with token budgets and caching
Running agents independently before deliberation was more important than I expected. Early versions had agents reviewing sequentially, and later agents just agreed with earlier ones. Independence produces genuinely different perspectives.
The static analysis pre-pass grounds LLM analysis in concrete findings. Without it, agents would sometimes hallucinate issues that a linter could have definitively confirmed or denied.
Token budgets per review turned out to be essential. Without them, a large PR could burn through a day's API budget in a single review. Caching unchanged files between pushes to the same PR cut costs significantly.
Showing the deliberation transcript is the feature that builds trust. When you can see that Security flagged an issue and Style agreed but Complexity pushed back, the final recommendation feels reasoned rather than opaque.