arXiv:2511.15755 — #1 Most Referenced Paper (599 citations)

Philip & Dico
DQ Collaboration

How a single paper became the backbone of a sovereign AI ecosystem — 4,687 routing decisions, 8/8 benchmarks passed, and a system that teaches itself.

Philip Drammeh × Dico D'Angelo | arXiv:2511.15755 | March 2026
Production Decisions: 4,687
Multi-Agent DQ Lift: +12.4%
Variance Reduction: 95.4%
Benchmarks Passed: 8/8
01 — THE IMPLEMENTATION

Faithful V+S+C in Production

We implemented your formula exactly as published, then extended it with cognitive awareness, expertise routing, and self-improving correctness.

DQ = Validity(0.4) + Specificity(0.3) + Correctness(0.3)
Implemented in dq-scorer.js — 610 lines, running since Day 1
Component | Weight | What It Measures | Our Enhancement
Validity | 40% | Does model selection make logical sense? | Over-provisioning penalty (Opus for simple = 0.6); under-provisioning carries a 2x penalty
Specificity | 30% | How precise is the model match? | Cost-aware tie-breaking when DQ scores are within 0.05
Correctness | 30% | Historical accuracy for similar queries | Token-similarity matching across 4,687 decisions (learning loop)
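The scoring core fits in a few lines. The sketch below is illustrative, not the actual dq-scorer.js internals: the names, the tier list, and the exact penalty slopes (0.2 per tier of over-provisioning, doubled for under-provisioning) are our assumptions, chosen only to reproduce the worked example in the table (Opus for a simple query gives validity 0.6).

```javascript
// Published DQ weights: V(0.4) + S(0.3) + C(0.3).
const WEIGHTS = { validity: 0.4, specificity: 0.3, correctness: 0.3 };

// Tiers ordered from cheapest to most capable.
const TIERS = ["haiku", "sonnet", "opus"];

// Illustrative validity penalty: 0.2 per tier of over-provisioning,
// doubled (0.4 per tier) for under-provisioning.
function validityScore(selectedTier, idealTier) {
  const gap = TIERS.indexOf(selectedTier) - TIERS.indexOf(idealTier);
  if (gap > 0) return Math.max(0, 1 - 0.2 * gap); // e.g. Opus for a Haiku-tier query -> 0.6
  if (gap < 0) return Math.max(0, 1 + 0.4 * gap); // under-provisioning, 2x penalty
  return 1; // exact match
}

// Weighted sum of the three components, each in [0, 1].
function dqScore({ validity, specificity, correctness }) {
  return WEIGHTS.validity * validity +
         WEIGHTS.specificity * specificity +
         WEIGHTS.correctness * correctness;
}
```

The cost-aware tie-breaker from the Specificity row would then compare `dqScore` results and prefer the cheaper tier whenever two candidates land within 0.05 of each other.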

Complexity Analysis: 7 Signal Categories

Signal | Weight | Examples
Architecture | 0.25 | design, system, distributed, scalable
Multi-File | 0.20 | across all files, project-wide, refactor all
Code | 0.15 | function, class, async, import
Analysis | 0.15 | analyze, review, audit, research
Debug | 0.10 | error, fix, bug, not working
Creation | 0.10 | create, build, implement, generate
Simple | -0.15 | what is, how to, explain (reduces complexity)
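A minimal sketch of the keyword-weighted scoring the table implies, assuming each category fires at most once and the result is clamped to [0, 1]; the production analyzer also applies token-based thresholds, which are omitted here.

```javascript
// The 7 signal categories from the table above; weights as published.
const SIGNALS = [
  { name: "Architecture", weight: 0.25, keywords: ["design", "system", "distributed", "scalable"] },
  { name: "Multi-File",   weight: 0.20, keywords: ["across all files", "project-wide", "refactor all"] },
  { name: "Code",         weight: 0.15, keywords: ["function", "class", "async", "import"] },
  { name: "Analysis",     weight: 0.15, keywords: ["analyze", "review", "audit", "research"] },
  { name: "Debug",        weight: 0.10, keywords: ["error", "fix", "bug", "not working"] },
  { name: "Creation",     weight: 0.10, keywords: ["create", "build", "implement", "generate"] },
  { name: "Simple",       weight: -0.15, keywords: ["what is", "how to", "explain"] },
];

function complexity(query) {
  const q = query.toLowerCase();
  // Each category contributes its full weight at most once.
  const raw = SIGNALS.reduce(
    (sum, s) => sum + (s.keywords.some((k) => q.includes(k)) ? s.weight : 0),
    0
  );
  return Math.min(1, Math.max(0, raw)); // clamp to [0, 1]
}
```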
02 — THE BENCHMARK

100 Queries, 8/8 Passed

Controlled replication of your methodology using real queries from 11 projects across the Antigravity ecosystem.

Single-Model DQ

Avg DQ: 0.824
Variance: 0.005527
Validity: 0.964
Specificity: 0.988
Correctness: 0.600
Actionable: 100%

3-Agent SUPERMAX Consensus

Avg DQ: 0.926
Variance: 0.000255
Validity: 0.959
Specificity: 1.000
Correctness: 0.808
Actionable: 100%

DQ Lift by Complexity Tier

The harder the query, the bigger the multi-agent advantage — exactly as the paper predicts.

Tier | Single-Model DQ | Consensus DQ | Lift
Trivial | 0.864 | 0.923 | +6.9%
Simple | 0.870 | 0.913 | +4.9%
Moderate | 0.789 | 0.926 | +17.4%
Complex | 0.806 | 0.934 | +15.9%
Expert | 0.791 | 0.934 | +18.0%

The SUPERMAX Council

Agent 1: Principal Engineer (avg DQ 0.872)
  • Systems architecture focus
  • Harshest on waste & over-provisioning
  • Tightest scorer of the three
  • Catches what others miss
Agent 2: Security Architect (avg DQ 0.919)
  • Risk assessment focus
  • Highest validity (0.979)
  • Safety margins: never under-provisions
  • Compliance-first perspective
Agent 3: Product Strategist (avg DQ 0.928)
  • User value optimization
  • Most generous scorer
  • Speed & cost conscious
  • Delivery-focused perspective

Consensus: median V/S/C across agents; model selection by majority vote. The three agents were unanimous in 74% of decisions.
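The consensus step can be sketched as follows, assuming each agent returns its V/S/C scores plus a model vote; names are illustrative, not the SUPERMAX API.

```javascript
// Median of a numeric array (average of the middle pair for even lengths).
function median(xs) {
  const s = [...xs].sort((a, b) => a - b);
  const m = Math.floor(s.length / 2);
  return s.length % 2 ? s[m] : (s[m - 1] + s[m]) / 2;
}

// Consensus over agent outputs: median V/S/C, majority-vote model.
function consensus(agents) {
  // agents: [{ validity, specificity, correctness, model }, ...]
  const pick = (key) => median(agents.map((a) => a[key]));
  const votes = {};
  for (const a of agents) votes[a.model] = (votes[a.model] || 0) + 1;
  const model = Object.keys(votes).reduce((x, y) => (votes[x] >= votes[y] ? x : y));
  return {
    validity: pick("validity"),
    specificity: pick("specificity"),
    correctness: pick("correctness"),
    model,
  };
}
```

The median (rather than the mean) is what lets one agent's bias, tight or generous, be filtered out rather than averaged in.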

03 — PRODUCTION DATA

The System Teaches Itself

4,687 routing decisions over 50 days. The correctness component creates a compounding learning loop — DQ improved 51% from baseline.

DQ Score Over Time

Dec 2025: 0.575 (94 decisions, baseline)
Jan 2026: 0.656 (2,006 decisions, +14.1%)
Feb 2026: 0.686 (2,575 decisions, +4.6%)
Mar 2026: 0.870 (self-improving, +26.8%)
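The learning loop behind this curve is the correctness component: a new query is scored by the outcome rate of token-similar past decisions. A minimal sketch, assuming Jaccard similarity over word tokens and a neutral 0.5 prior; the production matcher and its similarity cutoff are richer than this.

```javascript
// Lowercased word tokens of a query, as a Set.
function tokens(text) {
  return new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
}

// Jaccard similarity between two queries' token sets.
function similarity(a, b) {
  const ta = tokens(a);
  const tb = tokens(b);
  const inter = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 0 : inter / union;
}

// Correctness = success rate of sufficiently similar past decisions.
function correctness(query, history, minSim = 0.3) {
  // history: [{ query, success }] from prior routing decisions
  const similar = history.filter((h) => similarity(query, h.query) >= minSim);
  if (similar.length === 0) return 0.5; // neutral prior with no evidence
  return similar.filter((h) => h.success).length / similar.length;
}
```

Every scored decision enriches the history, so the component (and with it overall DQ) compounds as the decision log grows.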

Model Distribution

Model | Decisions | Share | Avg DQ | Role
Opus | 2,248 | 48% | 0.687 | Complex reasoning, architecture, research
Haiku | 1,479 | 32% | 0.658 | Quick lookups, simple tasks, formatting
Sonnet | 949 | 20% | 0.659 | Code generation, analysis, balanced work

The 0.60 Threshold Discovery

Below 0.60 Complexity

Strategy: Single-Model DQ
Performance: Already excellent
Cost: 1x evaluation
Share of queries: ~60%

Above 0.60 Complexity

Strategy: SUPERMAX Consensus
Performance: +15-18% DQ lift
Cost: 3x evaluation
Variance: 95.4% reduction
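The routing gate this boundary implies is a one-liner. A hedged sketch, with the evaluator callbacks standing in for the real single-model scorer and the SUPERMAX council:

```javascript
// Complexity above this empirically discovered boundary justifies 3x eval cost.
const CONSENSUS_THRESHOLD = 0.6;

// evalSingle / evalSupermax are placeholders for the real evaluators.
function route(complexityScore, evalSingle, evalSupermax) {
  return complexityScore < CONSENSUS_THRESHOLD
    ? { strategy: "single", ...evalSingle() }      // ~60% of queries, 1x cost
    : { strategy: "supermax", ...evalSupermax() }; // 3x cost, +15-18% DQ lift
}
```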
04 — BEYOND THE PAPER

The Full Routing Intelligence Stack

DQ scoring is Layer 2 of a 7-layer routing intelligence system. Your paper was the foundation — here's what we built on top.

L7 Recovery Engine: self-healing on DQ failures (94% error coverage, 70% auto-fix)
L6 SUPERMAX Consensus: 3-agent council for complex queries (>0.60); 21 agents total
L5 HSRGS: latent-space routing with IRT, predicting P(success | query, model)
L4 Expertise Routing: domain expertise modifies the tier (high expertise = downgrade, low = upgrade)
L3 Cognitive OS: time-of-day weight adjustment (morning ~8% more conservative than evening)
L2 DQ Scoring (arXiv:2511.15755): V(0.4) + S(0.3) + C(0.3), the foundation layer
L1 Complexity Analysis: 7 signal categories with token-based thresholds (Astraea-inspired)

Ecosystem Impact: #1 Paper

Rank | Paper | References | Topic
#1 | arXiv:2511.15755 | 599 | DQ Scoring (Drammeh)
#2 | arXiv:2508.17536 | 413 | Voting vs Debate
#3 | arXiv:2505.19591 | 301 | CIR3 / Evolving Orchestration
#4 | arXiv:2505.13516 | 255 | Multi-Agent Consensus
#5 | arXiv:2506.14496 | 200 | LLM-Powered Swarms

Projects Using DQ Scoring

meta-vengine: core routing engine
OS-App: Agentic Kernel ACE
SUPERMAX: 21-agent coordination
ResearchGravity: research quality scoring
Command Center: 17-tab DQ dashboard
UCW: 174K events scored
Frontier Alpha: portfolio decisions
CareerCoach: recommendation quality
05 — KEY INSIGHTS

What Production Taught Us About DQ

Insight 1: DQ Improves Over Time
The correctness component creates a compounding learning loop. After 4,687 decisions, our single-model baseline already hits 100% actionable (vs the paper's 1.7% baseline). Intelligence accumulates.
Insight 2: Cognitive State Matters
Same query, same model, different time of day = different DQ. Morning routing is ~8% more conservative than evening. The human's state should inform the routing decision.
Insight 3: Cost-Aware DQ Is Essential
Pure quality maximization always picks the most expensive model. Adding a cost dimension (the over-provisioning penalty) makes DQ economically sustainable in production.
Insight 4: Expertise Creates a Map
Historical DQ scores per domain build an expertise profile. High expertise = downgrade the model (save cost); low expertise = upgrade it (prevent failure). Domain-aware routing.
Insight 5: The 0.60 Threshold Is Clean
Below 0.60 complexity, single-model DQ is sufficient; above 0.60, multi-agent consensus justifies the 3x evaluation cost. A clean operational boundary, discovered empirically.
Insight 6: Diverse Perspectives Filter Bias
The Engineer scores tight (0.872), Security plays safe (0.919), Product runs generous (0.928). The median consensus smooths these biases into a more robust score than any single perspective.
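The expertise map from Insight 4 can be sketched as a tier adjustment keyed on per-domain average DQ. The thresholds (0.85 and 0.65) and names below are illustrative assumptions, not the production values.

```javascript
// Tiers ordered from cheapest to most capable.
const TIERS = ["haiku", "sonnet", "opus"];

// domainAvgDQ: historical average DQ for this query's domain.
// Thresholds here (0.85 / 0.65) are illustrative, not production values.
function adjustTier(baseTier, domainAvgDQ) {
  const i = TIERS.indexOf(baseTier);
  if (domainAvgDQ >= 0.85) return TIERS[Math.max(0, i - 1)]; // high expertise: downgrade, save cost
  if (domainAvgDQ < 0.65) return TIERS[Math.min(TIERS.length - 1, i + 1)]; // low expertise: upgrade, prevent failure
  return baseTier; // middling expertise: keep the complexity-based tier
}
```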
06 — RESEARCH DIRECTIONS

What We Should Build Together

Your framework + our production data = an opportunity to define the standard for AI routing quality.

Our Contribution

  • 4,687 scored production decisions
  • SUPERMAX 21-agent system
  • 3-month longitudinal learning data
  • Cross-project DQ implementation (8+ projects)
  • Real-world validation dataset
  • 900,000+ line codebase

Research Questions

  • DQ as universal evaluation standard?
  • Self-evolving V/S/C weights (meta-learning)
  • Cross-domain DQ transfer learning
  • DQ-aware agent coordination protocols
  • Co-authored "DQ in Production" paper
  • DQ standard proposal for industry
Academic Theory + Production Data = Industry Standard
The opportunity: define how AI routing quality is measured everywhere

Let's Build the DQ Standard Together

Your paper changed how we build AI systems. 599 references, 4,687 decisions, 8 projects. Now let's make it the industry standard.

D'Angelo | GitHub: Dicoangelo
dicoangelo.metaventionsai.com
Antigravity / Metaventions AI