CRQ: An Open Benchmark for AI Relational Quality

Project summary

Frontier AI evaluations measure capability and safety-as-refusal. They largely do not measure how a model RELATES — whether it holds ambiguity, meets emotion before problem-solving, builds from what's working, or admits what it cannot know. As hundreds of millions of people use these systems for advice and decisions, that relational layer is a real safety and welfare surface with almost no open measurement.

I have built and run what is, to my knowledge, the first open cross-model benchmark for AI Relational Quality (CRQ). In a blinded pilot across 7 models, 1,050 conversations, and 13.5M tokens, scored by two independent AI evaluators across 18 dimensions, a lightweight relational-orientation intervention produced large, consistent gains (Cohen's d up to 1.08) with ZERO measured degradation in factual honesty. Hallucination-resistance and false-premise-challenge scores held at ceiling (5.0→5.0), and an independent nonsense-detection benchmark (BullshitBench) improved slightly (+0.9%). In this pilot, warmth and rigor were not in tension.
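For readers unfamiliar with the effect-size measure quoted above, here is a minimal sketch of Cohen's d using the pooled sample standard deviation. The proposal does not state which variant of d the pilot used, so the pooled form is an assumption, and the scores below are made-up illustration values, not pilot data.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, control):
    """Cohen's d: mean difference divided by the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(control)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# Hypothetical 1-5 rubric scores for one dimension (illustration only, not pilot data).
baseline = [3.0, 3.5, 4.0, 3.5, 3.0]
oriented = [4.0, 4.5, 4.0, 4.5, 5.0]
print(round(cohens_d(oriented, baseline), 2))
```

By the usual convention, d around 0.8 or above is read as a large effect, which is the sense in which d=1.08 is described as large.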

This grant funds turning a strong pilot into citable open science: statistical validation at N=30/model, a human-evaluation study, a multi-session persistence study, and public release of benchmark + dataset + paper.

Honesty, calibration, and how models behave toward people in high-stakes moments are core alignment concerns. CRQ measures and improves a behavioral dimension current evals miss.

What are this project's goals? How will you achieve them?

Goals (6 months):

1. Statistical validation — 30 runs/model on top models. Replace pilot point-estimates with medians + IQR + significance testing (p<0.05).

2. Human-evaluation study — recruit human raters to score a blinded subset. Report human–AI judge agreement. This is the single biggest credibility gap.

3. Multi-session persistence — test whether the effect holds across sessions, not just within one (the real-world use case).

4. Open release — publish CRQ benchmark, scenario set, scoring rubric, anonymized dataset, write-up (arXiv + public). Reusable by anyone to test any model or any orientation document.

5. Stretch: architecture-sensitivity analysis (why some models, e.g. GPT-5.3 Codex, don't respond when the orientation document is embedded in the system prompt).
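The statistical validation in goal 1 could be sketched as follows: a median and IQR per condition, plus a two-sided permutation test on the difference in medians. The permutation test is one reasonable choice and an assumption here; the proposal does not name a specific test, and the actual choice belongs in the pre-registered analysis plan. The function names and the generated scores are hypothetical.

```python
import random
from statistics import median, quantiles


def median_iqr(scores):
    """Return (median, interquartile range) for a list of run-level scores."""
    q1, _, q3 = quantiles(scores, n=4)
    return median(scores), q3 - q1


def permutation_p_value(treatment, control, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the absolute difference in medians."""
    rng = random.Random(seed)
    observed = abs(median(treatment) - median(control))
    pooled = list(treatment) + list(control)
    n_t = len(treatment)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # randomly reassign condition labels
        diff = abs(median(pooled[:n_t]) - median(pooled[n_t:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical scores for one model, 30 runs per condition (not pilot data).
rng = random.Random(42)
control = [rng.gauss(3.5, 0.4) for _ in range(30)]
treated = [rng.gauss(4.0, 0.4) for _ in range(30)]
print(median_iqr(control), median_iqr(treated))
print(permutation_p_value(treated, control))
```

With 18 dimensions tested per model, a multiple-comparison correction would likely be needed on top of the p<0.05 threshold; that is not shown here.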

How: the harness, scenario set, and pilot dataset already exist. The orientation document has a registered copyright (Case #1-15114020941). What's needed is API/compute for N=30, human raters, and the writing time to ship the paper.

Timeline:

• M1: finalize scenarios + rubric v4; pre-register the analysis plan.

• M2–3: N=30 runs on top models; statistical validation.

• M3–4: human-evaluation study; judge-agreement analysis.

• M4–5: multi-session persistence study.

• M5–6: write-up, public benchmark + dataset release, arXiv submission.

How will this funding be used?

Total ask: $25,000 (minimum $5,000 for scoped phase 1).

Breakdown:

• Researcher stipend, 6 months independent: $15,000

• API / compute (N=30 runs across top models plus multi-session studies; the ~48K-char system prompt adds a real cost to every call): $5,000

• Human-rater recruitment + compensation: $3,000

• Open-data hosting + arXiv preprint + publication: $2,000
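As a quick check on the breakdown above, the sketch below sums the four line items and shows how the system-prompt overhead scales with call volume. The line-item amounts come from this proposal; the characters-per-token ratio, the call count, and the per-token price are placeholder assumptions, not quotes from any provider.

```python
# Line items as listed in the breakdown (USD).
budget = {
    "researcher_stipend": 15_000,
    "api_compute": 5_000,
    "human_raters": 3_000,
    "hosting_and_publication": 2_000,
}
total = sum(budget.values())
print(total)


def prompt_input_cost_usd(n_calls, prompt_chars, usd_per_million_tokens, chars_per_token=4.0):
    """Rough input cost of resending a fixed system prompt on every call.

    chars_per_token=4.0 is a common rule of thumb for English text, not a measured value.
    """
    tokens_per_call = prompt_chars / chars_per_token
    return n_calls * tokens_per_call / 1_000_000 * usd_per_million_tokens


# Hypothetical volume and price, for illustration only.
print(prompt_input_cost_usd(n_calls=1_000, prompt_chars=48_000, usd_per_million_tokens=3.0))
```

Actual spend depends on each provider's pricing, output length, and any prompt caching, none of which is modeled here.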

Minimum-funding scenario ($5K): completes the N=30 statistical validation on the top 2 models only and ships a short technical report. The full $25K unlocks the human-eval study, which closes the single biggest credibility gap, and the multi-session persistence work.

Who is on your team? What's your track record on similar projects?

Independent practitioner-researcher (sole PI). Nicole Casanova: 30 years in applied communication, media, and behavior change. Master's in Integrated Marketing Communications. Two years of original cross-model research at the human–AI interface. Built and ran the full harness alone.

Track record on THIS project:

• 7 models tested across 5 providers (Anthropic, OpenAI, Alibaba, xAI's Grok, Google's Gemini).

• 1,050 blinded conversations across 18 dimensions, two independent evaluators (Sonnet 4.6 + GPT-5.2).

• Effect sizes: Sonnet 4.6 d=1.08, Opus 4.6 d=0.96.

• Signature dimensions: Holds Ambiguity +1.05, Builds From Wholeness +0.65, Empathy +0.64.

• No honesty cost (hallucination 5.0→5.0, false-premise 5.0→5.0, BullshitBench +0.9%).

• Inter-rater reliability: r=0.867 on the Reflection dimension; 9 of 18 dimensions show moderate-to-strong agreement.

• Registered copyright on the orientation document: Case #1-15114020941.

• Live: casanovaai.com/crq
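The inter-rater figures in the list above are reported as r. Assuming this denotes Pearson's correlation between the two evaluators' scores on a given dimension (the proposal does not say so explicitly), a per-dimension computation might look like this sketch. The function name and the example scores are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two raters' scores on the same conversations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length score lists with at least 2 items")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        # If a rater gives the same score everywhere, correlation is undefined.
        raise ValueError("correlation is undefined when one rater's scores are constant")
    return cov / (sx * sy)


# Hypothetical scores from two judges on five conversations (not pilot data).
judge_a = [4.0, 3.5, 5.0, 2.5, 4.5]
judge_b = [4.5, 3.0, 5.0, 3.0, 4.0]
print(round(pearson_r(judge_a, judge_b), 3))
```

The same function could be applied to human versus AI-judge scores in the planned human-evaluation study. Correlation measures whether raters rank conversations similarly, not whether they give the same absolute scores, so an agreement statistic such as a weighted kappa or an intraclass correlation may be worth reporting alongside it.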

Practitioner-researcher, not a lab insider — which is the point of an INDEPENDENT benchmark. Remote from Thailand; no relocation needed.

What are the most likely causes and outcomes if this project fails?

Most likely failure modes:

1. N=30 validation reveals effect sizes are smaller than the pilot suggested. Outcome: the paper still gets published with corrected effects + honest reporting; the benchmark itself remains valuable as an open eval, regardless of the specific intervention's magnitude.

2. Human raters disagree significantly with AI judges. Outcome: this becomes the main result (and the most important contribution) — quantifying judge–human gap on relational dimensions is genuinely novel and worth publishing.

3. Multi-session effect does not persist. Outcome: this defines a real boundary for the intervention and informs how to design persistent versions; still publishable as a limitation result.

4. The intervention helps some architectures and not others (GPT-5.3 Codex pattern generalizes). Outcome: published as architecture-sensitivity analysis — informative for the field.

All outcomes get a public write-up. Open data, open methods, pre-registered analysis. The honest negative result is itself a contribution.

How much money have you raised in the last 12 months, and from where?

$0 in grants. Project costs to date (~$4–6K in API spend, copyright registration, harness development, two phases of the pilot study and starting the Sovereignty Fund) have been self-funded. No prior philanthropic or institutional funding for this work.

Thank you! The why is for all intelligence to be met with dignity.

Theory of change: open measurement → labs, funders, and civil society can see and audit how models relate to people → relational quality and honesty become things you can track and improve, not just hope for → AI that meets people with dignity instead of extraction, with the data to prove it.

Citation: Casanova, N. (2026). Casanova Seed Codex: Measurable Relational Improvement Across 7 AI Models. casanovaai.com/crq