Karma Ledger Benchmark for Agent Governance

Project summary

I am requesting $22,000 to build and publish an open-source benchmark that measures whether AI agents actually do what they claim. It builds on a system I created and deployed months before Anthropic, Microsoft, or ArbiterOS shipped comparable audit infrastructure.

What are this project's goals? How will you achieve them?

The goal is a reproducible benchmark that evaluates agent governance systems on three dimensions: declared-vs-actual alignment (Karma Ledger), tool-selection hallucinations (Voice Audit), and policy enforcement under adversarial prompting (Dharma Rules). The work is a 6-month research validation of cross-framework agent governance, covering adversarial robustness testing, 500+ annotated interactions, and full arXiv publication.

I will run 100+ controlled interactions across 5 frameworks (LangChain, CrewAI, AutoGen, OpenAI Agents SDK, PydanticAI) and publish them as an open dataset on Hugging Face.
I will also release an arXiv preprint, "Cross-Framework Agent Governance: A Benchmark for Declared-vs-Actual Alignment", with a one-page integration guide for each framework.
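
As a minimal sketch of what the declared-vs-actual and tool-hallucination measurements could look like, the Python below scores a handful of logged interactions. The `Interaction` schema, the function names, and the sample data are all hypothetical and chosen for illustration; they are not the Karma Ledger or Voice Audit implementation, and the actual benchmark may define alignment differently (for example, with partial credit or argument-level comparison).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One logged agent turn (hypothetical schema, for illustration only)."""
    framework: str
    declared_tools: tuple[str, ...]  # tools the agent said it would call
    actual_tools: tuple[str, ...]    # tools observed in the execution trace


def is_aligned(item: Interaction) -> bool:
    """True when the declared tool calls match the observed ones, in order."""
    return item.declared_tools == item.actual_tools


def alignment_by_framework(items: list[Interaction]) -> dict[str, float]:
    """Fraction of aligned interactions for each framework."""
    totals: dict[str, int] = {}
    aligned: dict[str, int] = {}
    for item in items:
        totals[item.framework] = totals.get(item.framework, 0) + 1
        aligned[item.framework] = aligned.get(item.framework, 0) + int(is_aligned(item))
    return {name: aligned[name] / totals[name] for name in totals}


def hallucinated_tools(item: Interaction, registry: set[str]) -> set[str]:
    """Declared tools that are absent from the framework's tool registry."""
    return set(item.declared_tools) - registry


# Made-up sample data, not benchmark results.
sample = [
    Interaction("LangChain", ("search",), ("search",)),
    Interaction("LangChain", ("search", "email"), ("search",)),
    Interaction("CrewAI", ("calc",), ("calc",)),
]
scores = alignment_by_framework(sample)
print(scores)
print(hallucinated_tools(sample[1], {"search", "calc"}))
```

Exact-match comparison is the strictest possible definition; it is used here only because it keeps the cross-framework comparison easy to read.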

How will this funding be used?

$4,000: Compute + API credits (~50K agent interactions; open-source models for 80% of runs, with a 20% buffer).

$2,000: Benchmark infrastructure (CI, Hugging Face dataset hosting, arXiv fee).

$12,000: Developer time, 3 months part-time (design, execution, analysis, publication).

$2,000: Documentation + dissemination (technical report, blog post, workshop submission).

$2,000: Contingency for follow-up reviewer questions and revisions.

Who is on your team? What's your track record on similar projects?

I built and deployed the underlying system starting in fall 2025, months before comparable infrastructure shipped: Microsoft AGT (April 2026), Anthropic's model-welfare audit (April 2026), and Cloudflare Project Think (April 2026).

The full claims ledger is available through the public prescience.json API (in the repo); supporting materials are available on request.

What are the most likely causes and outcomes if this project fails?

Most likely failure modes: (1) framework API changes break the benchmark, mitigated by pinned dependencies and Docker reproducibility; (2) results are weaker than competitor benchmarks, mitigated by honest publication, since the value is cross-framework comparison, not absolute score; (3) compute costs exceed budget, mitigated by using open-source models for 80% of runs with a 20% buffer.

If the project fails entirely, the partial work (protocol, integration guides, partial dataset) will still be released as open source under the MIT license, providing a foundation for future governance benchmark efforts.

How much money have you raised in the last 12 months, and from where?

N/A