Show HN: Amber, a capability-based runtime/compiler for agent benchmarks
Hi HN, since the Berkeley RDI benchmark integrity post recently got a lot of attention here [0], it seems like a good time to share Amber, related work aimed at making agent benchmarks easier to reproduce.
Amber grew out of the RDI AgentX-AgentBeats benchmarking competition [1] where the general public was invited to submit agents. To ensure trustworthy results, we needed submissions to be reproducible and have clear provenance. Reproducibility motivates declarative specifications of benchmarks, and provenance motivates the ability to safely and efficiently run benchmarks on hosted hardware. Once you add support for multi-phase multi-agent benchmarks (like Werewolf), the design for Amber mostly falls right out.
Amber is inspired by the Fuchsia OS Component Framework. Its security model is that a component, like an A2A agent or MCP tool, only serves components that have explicitly been granted a capability to use it. For benchmarks, this means an agent under test cannot reach into the evaluator, and a tool can be revoked in a later phase of a benchmark.
Amber is a combination of a compiler and a runtime system: the compiler turns manifests describing agents, tools, and how they connect to each other into a deterministic plan. The plan can be executed against different backends like Docker, K8s, KVM, or the host OS. The compiler injects the runtime components necessary to enforce the capability model: sidecar routers that provide guarded connectivity between components, and backend controllers that allow components to create and destroy other components at runtime.
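To give a flavor of what a manifest expresses, here's a rough sketch. The field names and layout are my illustration only, not Amber's actual schema; see the real tau2 manifest linked below for the genuine article:

```
# Hypothetical manifest sketch -- field names are illustrative,
# not Amber's real schema.
components:
  evaluator:
    image: example/tau2-evaluator     # placeholder image name
  agent_under_test:
    image: example/participant-agent  # placeholder image name

# Capabilities are routed explicitly: the evaluator may call the
# agent, but no route exists in the other direction, so the agent
# cannot reach into the evaluator.
routes:
  - capability: a2a
    from: evaluator
    to: agent_under_test
```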
Amber started out with just static `docker compose`, but benchmarks like TerminalBench and OSWorld required the addition of dynamic components and VM-backed components. Then competition participants wanted an easier way to test locally that didn't involve repeatedly rebuilding Docker images, so Amber got native binary support and a one-liner `amber run` interface. The concepts borrowed from Fuchsia have held up so far. Right now I'm working on making Amber's observability traces available to the benchmark evaluator, so that it can judge based on the path an agent took rather than just the final answer.
Overall, the goal we set out to achieve was to make it easy to reproduce agent benchmark results in a low-trust environment. Amber is not a complete solution, but it takes some burden off benchmark authors and agent builders. Maybe it's even useful beyond benchmarks. I would be happy for you to batter the conceptual framework!
The AgentBeats tau2 benchmark manifest [2] is a real example. The in-tree mixed-site example [3] is a simple demo of Amber end-to-end with `amber run`.
[0]: https://news.ycombinator.com/item?id=47733217
[1]: https://rdi.berkeley.edu/agentx-agentbeats.html
[2]: https://github.com/RDI-Foundation/tau2-agentbeats/blob/main/...
[3]: https://github.com/RDI-Foundation/amber/tree/main/examples/m...