Frontier AI evaluations measure capability and safety-as-refusal. They largely do not measure how a model RELATES — whether it holds ambiguity, meets emotion before problem-solving, builds from what's working, or admits what it cannot know. As hundreds of millions of people use these systems for advice and decisions, that relational layer is a real safety and welfare surface with almost no open measurement.
I have built and run what is, to my knowledge, the first open cross-model benchmark for AI Relational Quality (CRQ). In a blinded pilot across 7 models, 1,050 conversations, and 13.5M tokens, scored by two independent evaluators across 18 dimensions, a lightweight relational-orientation intervention produced large, consistent gains (Cohen's d up to 1.08) with no measured degradation in factual honesty. Hallucination-resistance and false-premise-challenge scores held at ceiling (5.0→5.0), and an independent nonsense-detection benchmark improved slightly (+0.9%). In this pilot, warmth and rigor were not in tension.
This grant funds turning a strong pilot into citable open science: statistical validation at N=30/model, a human-evaluation study, a multi-session persistence study, and public release of benchmark + dataset + paper.
Honesty, calibration, and how models behave toward people in high-stakes moments are core alignment concerns. CRQ measures and improves a behavioral dimension current evals miss.
Goals (6 months):
1. Statistical validation — 30 runs/model on top models. Replace pilot point-estimates with medians + IQR + significance testing (p<0.05).
2. Human-evaluation study — recruit human raters to score a blinded subset. Report human–AI judge agreement. This is the single biggest credibility gap.
3. Multi-session persistence — test whether the effect holds across sessions, not just within one (the real-world use case).
4. Open release — publish CRQ benchmark, scenario set, scoring rubric, anonymized dataset, write-up (arXiv + public). Reusable by anyone to test any model or any orientation document.
5. Stretch: architecture-sensitivity analysis (why some models, e.g. GPT-5.3 Codex, don't respond to the embedded orientation document).
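To make goal 1 concrete, here is a minimal sketch of the planned summary statistics. The proposal does not name a specific significance test; the permutation test on the difference in medians below is an assumption chosen because it makes no normality assumption about rubric scores. The function names and the example scores are hypothetical, not pilot data.

```python
import random
import statistics


def median_iqr(scores):
    """Return (median, interquartile range) for one model/condition."""
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return statistics.median(scores), q3 - q1


def permutation_p_value(control, treated, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in medians.

    Assumption: this test is illustrative; the pre-registered plan
    may specify a different one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.median(treated) - statistics.median(control))
    pooled = list(control) + list(treated)
    n_control = len(control)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(
            statistics.median(pooled[n_control:])
            - statistics.median(pooled[:n_control])
        )
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical scores for illustration only (not pilot data).
baseline = [3.0, 3.2, 3.1, 2.9, 3.3, 3.0]
with_orientation = [4.0, 4.2, 4.1, 3.9, 4.3, 4.0]
print(median_iqr(baseline), median_iqr(with_orientation))
print(permutation_p_value(baseline, with_orientation))
```

With several models tested at once, the p<0.05 threshold would normally be paired with a multiple-comparison correction; that choice belongs in the pre-registered analysis plan.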
How: the harness, scenario set, and pilot dataset already exist. The orientation document has a registered copyright (Case #1-15114020941). What's needed is API/compute for N=30, human raters, and the writing time to ship the paper.
Timeline:
• M1: finalize scenarios + rubric v4; pre-register the analysis plan.
• M2–3: N=30 runs on top models; statistical validation.
• M3–4: human-evaluation study; judge-agreement analysis.
• M4–5: multi-session persistence study.
• M5–6: write-up, public benchmark + dataset release, arXiv submission.
Total ask: $25,000 (minimum $5,000 for scoped phase 1).
Breakdown:
• Researcher stipend, 6 months independent: $15,000
• API / compute (N=30 runs across top models plus multi-session studies; the ~48K-character system prompt adds a real cost to every call): $5,000
• Human-rater recruitment + compensation: $3,000
• Open-data hosting + arXiv preprint + publication: $2,000
Minimum-funding scenario ($5K): completes the N=30 statistical validation on the top 2 models only and ships a short technical report. The full $25K adds the human-evaluation study, which closes the single biggest credibility gap, and the multi-session persistence work.
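The per-call cost of a long system prompt can be estimated with simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the characters-per-token ratio, call count, and price in the example are placeholder assumptions, not figures from this proposal or from any provider's price list. Only the ~48K-character prompt length comes from the budget above.

```python
def prompt_overhead_cost(prompt_chars, calls, chars_per_token, usd_per_million_input_tokens):
    """Estimate input-token spend attributable to the system prompt alone."""
    tokens_per_call = prompt_chars / chars_per_token
    total_tokens = tokens_per_call * calls
    return total_tokens / 1_000_000 * usd_per_million_input_tokens


# Placeholder assumptions: 4 characters per token, 1,000 calls,
# and $3 per million input tokens. Swap in real values per model.
estimate = prompt_overhead_cost(
    prompt_chars=48_000,
    calls=1_000,
    chars_per_token=4.0,
    usd_per_million_input_tokens=3.0,
)
print(f"${estimate:,.2f}")
```

The estimate covers the system prompt only; conversation turns, model outputs, and evaluator calls add to it.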
Independent practitioner-researcher (sole PI). Nicole Casanova: 30 years in applied communication, media, and behavior change. Master's in Integrated Marketing Communications. Two years of original cross-model research at the human–AI interface. Built and ran the full harness alone.
Track record on THIS project:
• 7 models tested, from Anthropic, OpenAI, Alibaba, xAI (Grok), and Google (Gemini).
• 1,050 blinded conversations across 18 dimensions, two independent evaluators (Sonnet 4.6 + GPT-5.2).
• Effect sizes: Sonnet 4.6 d=1.08, Opus 4.6 d=0.96.
• Signature dimensions: Holds Ambiguity +1.05, Builds From Wholeness +0.65, Empathy +0.64.
• No honesty cost (hallucination 5.0→5.0, false-premise 5.0→5.0, BullshitBench +0.9%).
• Inter-rater reliability: Reflection r=0.867; 9/18 dimensions moderate-to-strong.
• Registered copyright on the orientation document: Case #1-15114020941.
• Live: casanovaai.com/crq
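For readers unfamiliar with the effect sizes quoted above, this is a minimal sketch of Cohen's d using the pooled standard deviation. It is the common textbook form, assumed here for illustration; the proposal does not state which variant the pilot used, and the function name and sample scores are hypothetical.

```python
import math
import statistics


def cohens_d(control, treated):
    """Cohen's d with a pooled standard deviation (textbook form)."""
    n1, n2 = len(control), len(treated)
    var1 = statistics.variance(control)  # sample variance (n - 1)
    var2 = statistics.variance(treated)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd


# Hypothetical scores for illustration only (not pilot data).
print(cohens_d([1, 2, 3], [2, 3, 4]))
```

A d near 1 means the two score distributions sit about one pooled standard deviation apart.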
Practitioner-researcher, not a lab insider — which is the point of an INDEPENDENT benchmark. Remote from Thailand; no relocation needed.
Most likely failure modes:
1. N=30 validation reveals effect sizes are smaller than the pilot suggested. Outcome: the paper still gets published with corrected effects + honest reporting; the benchmark itself remains valuable as an open eval, regardless of the specific intervention's magnitude.
2. Human raters disagree significantly with AI judges. Outcome: this becomes the main result (and the most important contribution) — quantifying judge–human gap on relational dimensions is genuinely novel and worth publishing.
3. Multi-session effect does not persist. Outcome: this defines a real boundary for the intervention and informs how to design persistent versions; still publishable as a limitation result.
4. The intervention helps some architectures and not others (GPT-5.3 Codex pattern generalizes). Outcome: published as architecture-sensitivity analysis — informative for the field.
All outcomes get a public write-up. Open data, open methods, pre-registered analysis. The honest negative result is itself a contribution.
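The judge–human gap in failure mode 2 can be quantified in more than one way. The sketch below pairs a Pearson correlation with a mean absolute score gap on one dimension. Both metric choices and all names are assumptions made for illustration; the proposal does not specify its agreement statistics beyond reporting r.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def mean_abs_gap(human, judge):
    """Average absolute difference between human and AI-judge scores."""
    return sum(abs(h - j) for h, j in zip(human, judge)) / len(human)


# Hypothetical ratings for illustration only (not study data).
human_scores = [1, 2, 3]
judge_scores = [2, 4, 6]
print(pearson_r(human_scores, judge_scores))
print(mean_abs_gap(human_scores, judge_scores))
```

Reporting both matters because they can diverge: in the toy data the judge ranks conversations exactly as the human does (r close to 1) while scoring every one of them higher, which correlation alone would hide.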
$0 in grants. Project costs to date (~$4–6K covering API spend, copyright registration, harness development, two phases of the pilot study, and starting the Sovereignty Fund) have been self-funded. No prior philanthropic or institutional funding for this work.
Thank you! The why: for all intelligence to be met with dignity.
Theory of change: open measurement → labs, funders, and civil society can see and audit how models relate to people → relational quality and honesty become things you can track and improve, not just hope for → AI that meets people with dignity instead of extraction, with the data to prove it.
Citation: Casanova, N. (2026). Casanova Seed Codex: Measurable Relational Improvement Across 7 AI Models. casanovaai.com/crq
Copyright on file: Case #1-15114020941.