arXiv:2511.15755 — #1 Most Referenced Paper (599 citations)

Philip & Dico
DQ Collaboration

How a single paper became the backbone of a sovereign AI ecosystem — 4,687 routing decisions, 8/8 benchmarks passed, and a system that teaches itself.

Philip Drammeh × Dico D'Angelo | arXiv:2511.15755 | March 2026
Production Decisions: 4,687
Multi-Agent DQ Lift: +12.4%
Variance Reduction: 95.4%
Benchmarks Passed: 8/8
01 — THE IMPLEMENTATION

Faithful V+S+C in Production

We implemented your formula exactly as published, then extended it with cognitive awareness, expertise routing, and self-improving correctness.

DQ = Validity(0.4) + Specificity(0.3) + Correctness(0.3)
Implemented in dq-scorer.js — 610 lines, running since Day 1
Component | Weight | What It Measures | Our Enhancement
Validity | 40% | Does model selection make logical sense? | Over-provisioning penalty (Opus for simple = 0.6); under-provisioning carries a 2x penalty
Specificity | 30% | How precise is the model match? | Cost-aware tie-breaking when DQ scores are within 0.05
Correctness | 30% | Historical accuracy for similar queries | Token-similarity matching across 4,687 decisions (learning loop)
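The scoring core fits in a few lines. The sketch below is illustrative, not the actual dq-scorer.js internals: the names, the tier list, and the exact penalty slopes (0.2 per tier of over-provisioning, doubled for under-provisioning) are our assumptions, chosen only to reproduce the worked example in the table (Opus for a simple query gives validity 0.6).

```javascript
// Published DQ weights: V(0.4) + S(0.3) + C(0.3).
const WEIGHTS = { validity: 0.4, specificity: 0.3, correctness: 0.3 };

// Tiers ordered from cheapest to most capable.
const TIERS = ["haiku", "sonnet", "opus"];

// Illustrative validity penalty: 0.2 per tier of over-provisioning,
// doubled (0.4 per tier) for under-provisioning.
function validityScore(selectedTier, idealTier) {
  const gap = TIERS.indexOf(selectedTier) - TIERS.indexOf(idealTier);
  if (gap > 0) return Math.max(0, 1 - 0.2 * gap); // e.g. Opus for a Haiku-tier query -> 0.6
  if (gap < 0) return Math.max(0, 1 + 0.4 * gap); // under-provisioning, 2x penalty
  return 1; // exact match
}

// Weighted sum of the three components, each in [0, 1].
function dqScore({ validity, specificity, correctness }) {
  return WEIGHTS.validity * validity +
         WEIGHTS.specificity * specificity +
         WEIGHTS.correctness * correctness;
}
```

The cost-aware tie-breaker from the Specificity row would then compare `dqScore` results and prefer the cheaper tier whenever two candidates land within 0.05 of each other.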

Complexity Analysis: 7 Signal Categories

Signal | Weight | Examples
Architecture | 0.25 | design, system, distributed, scalable
Multi-File | 0.20 | across all files, project-wide, refactor all
Code | 0.15 | function, class, async, import
Analysis | 0.15 | analyze, review, audit, research
Debug | 0.10 | error, fix, bug, not working
Creation | 0.10 | create, build, implement, generate
Simple | -0.15 | what is, how to, explain (reduces complexity)
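A minimal sketch of the keyword-weighted scoring the table implies, assuming each category fires at most once and the result is clamped to [0, 1]; the production analyzer also applies token-based thresholds, which are omitted here.

```javascript
// The 7 signal categories from the table above; weights as published.
const SIGNALS = [
  { name: "Architecture", weight: 0.25, keywords: ["design", "system", "distributed", "scalable"] },
  { name: "Multi-File",   weight: 0.20, keywords: ["across all files", "project-wide", "refactor all"] },
  { name: "Code",         weight: 0.15, keywords: ["function", "class", "async", "import"] },
  { name: "Analysis",     weight: 0.15, keywords: ["analyze", "review", "audit", "research"] },
  { name: "Debug",        weight: 0.10, keywords: ["error", "fix", "bug", "not working"] },
  { name: "Creation",     weight: 0.10, keywords: ["create", "build", "implement", "generate"] },
  { name: "Simple",       weight: -0.15, keywords: ["what is", "how to", "explain"] },
];

function complexity(query) {
  const q = query.toLowerCase();
  // Each category contributes its full weight at most once.
  const raw = SIGNALS.reduce(
    (sum, s) => sum + (s.keywords.some((k) => q.includes(k)) ? s.weight : 0),
    0
  );
  return Math.min(1, Math.max(0, raw)); // clamp to [0, 1]
}
```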
02 — THE BENCHMARK

100 Queries, 8/8 Passed

Controlled replication of your methodology using real queries from 11 projects across the Antigravity ecosystem.

Single-Model DQ

Avg DQ: 0.824
Variance: 0.005527
Validity: 0.964
Specificity: 0.988
Correctness: 0.600
Actionable: 100%

3-Agent SUPERMAX Consensus

Avg DQ: 0.926
Variance: 0.000255
Validity: 0.959
Specificity: 1.000
Correctness: 0.808
Actionable: 100%

DQ Lift by Complexity Tier

The harder the query, the bigger the multi-agent advantage — exactly as the paper predicts.

Tier | Single-Model DQ | Consensus DQ | Lift
Trivial | 0.864 | 0.923 | +6.9%
Simple | 0.870 | 0.913 | +4.9%
Moderate | 0.789 | 0.926 | +17.4%
Complex | 0.806 | 0.934 | +15.9%
Expert | 0.791 | 0.934 | +18.0%

The SUPERMAX Council

Agent 1: Principal Engineer (avg DQ 0.872)
  • Systems architecture focus
  • Harshest on waste & over-provisioning
  • Tightest scorer of the three
  • Catches what others miss
Agent 2: Security Architect (avg DQ 0.919)
  • Risk assessment focus
  • Highest validity (0.979)
  • Safety margins: never under-provisions
  • Compliance-first perspective
Agent 3: Product Strategist (avg DQ 0.928)
  • User value optimization
  • Most generous scorer
  • Speed & cost conscious
  • Delivery-focused perspective

Consensus: median V/S/C across agents; model selection by majority vote. The three agents were unanimous in 74% of decisions.
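The consensus step can be sketched as follows, assuming each agent returns its V/S/C scores plus a model vote; names are illustrative, not the SUPERMAX API.

```javascript
// Median of a numeric array (average of the middle pair for even lengths).
function median(xs) {
  const s = [...xs].sort((a, b) => a - b);
  const m = Math.floor(s.length / 2);
  return s.length % 2 ? s[m] : (s[m - 1] + s[m]) / 2;
}

// Consensus over agent outputs: median V/S/C, majority-vote model.
function consensus(agents) {
  // agents: [{ validity, specificity, correctness, model }, ...]
  const pick = (key) => median(agents.map((a) => a[key]));
  const votes = {};
  for (const a of agents) votes[a.model] = (votes[a.model] || 0) + 1;
  const model = Object.keys(votes).reduce((x, y) => (votes[x] >= votes[y] ? x : y));
  return {
    validity: pick("validity"),
    specificity: pick("specificity"),
    correctness: pick("correctness"),
    model,
  };
}
```

The median (rather than the mean) is what lets one agent's bias, tight or generous, be filtered out rather than averaged in.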

03 — PRODUCTION DATA

The System Teaches Itself

4,687 routing decisions over 50 days. The correctness component creates a compounding learning loop — DQ improved 51% from baseline.

DQ Score Over Time

Dec 2025: 0.575 (94 decisions, baseline)
Jan 2026: 0.656 (2,006 decisions, +14.1%)
Feb 2026: 0.686 (2,575 decisions, +4.6%)
Mar 2026: 0.870 (self-improving, +26.8%)
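The learning loop behind this curve is the correctness component: a new query is scored by the outcome rate of token-similar past decisions. A minimal sketch, assuming Jaccard similarity over word tokens and a neutral 0.5 prior; the production matcher and its similarity cutoff are richer than this.

```javascript
// Lowercased word tokens of a query, as a Set.
function tokens(text) {
  return new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
}

// Jaccard similarity between two queries' token sets.
function similarity(a, b) {
  const ta = tokens(a);
  const tb = tokens(b);
  const inter = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 0 : inter / union;
}

// Correctness = success rate of sufficiently similar past decisions.
function correctness(query, history, minSim = 0.3) {
  // history: [{ query, success }] from prior routing decisions
  const similar = history.filter((h) => similarity(query, h.query) >= minSim);
  if (similar.length === 0) return 0.5; // neutral prior with no evidence
  return similar.filter((h) => h.success).length / similar.length;
}
```

Every scored decision enriches the history, so the component (and with it overall DQ) compounds as the decision log grows.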

Model Distribution

Model | Decisions | Share | Avg DQ | Role
Opus | 2,248 | 48% | 0.687 | Complex reasoning, architecture, research
Haiku | 1,479 | 32% | 0.658 | Quick lookups, simple tasks, formatting
Sonnet | 949 | 20% | 0.659 | Code generation, analysis, balanced work

The 0.60 Threshold Discovery

Below 0.60 Complexity

Strategy: Single-Model DQ
Performance: Already excellent
Cost: 1x evaluation
Share of queries: ~60%

Above 0.60 Complexity

Strategy: SUPERMAX Consensus
Performance: +15-18% DQ lift
Cost: 3x evaluation
Variance: 95.4% reduction
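The routing gate this boundary implies is a one-liner. A hedged sketch, with the evaluator callbacks standing in for the real single-model scorer and the SUPERMAX council:

```javascript
// Complexity above this empirically discovered boundary justifies 3x eval cost.
const CONSENSUS_THRESHOLD = 0.6;

// evalSingle / evalSupermax are placeholders for the real evaluators.
function route(complexityScore, evalSingle, evalSupermax) {
  return complexityScore < CONSENSUS_THRESHOLD
    ? { strategy: "single", ...evalSingle() }      // ~60% of queries, 1x cost
    : { strategy: "supermax", ...evalSupermax() }; // 3x cost, +15-18% DQ lift
}
```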
04 — BEYOND THE PAPER

The Full Routing Intelligence Stack

DQ scoring is Layer 2 of a 7-layer routing intelligence system. Your paper was the foundation — here's what we built on top.

L7 Recovery Engine: self-healing on DQ failures (94% error coverage, 70% auto-fix)
L6 SUPERMAX Consensus: 3-agent council for complex queries (>0.60); 21 agents total
L5 HSRGS: latent-space routing with IRT, predicting P(success | query, model)
L4 Expertise Routing: domain expertise modifies the tier (high expertise = downgrade, low = upgrade)
L3 Cognitive OS: time-of-day weight adjustment (morning ~8% more conservative than evening)
L2 DQ Scoring (arXiv:2511.15755): V(0.4) + S(0.3) + C(0.3), the foundation layer
L1 Complexity Analysis: 7 signal categories with token-based thresholds (Astraea-inspired)

Ecosystem Impact: #1 Paper

Rank | Paper | References | Topic
#1 | arXiv:2511.15755 | 599 | DQ Scoring (Drammeh)
#2 | arXiv:2508.17536 | 413 | Voting vs Debate
#3 | arXiv:2505.19591 | 301 | CIR3 / Evolving Orchestration
#4 | arXiv:2505.13516 | 255 | Multi-Agent Consensus
#5 | arXiv:2506.14496 | 200 | LLM-Powered Swarms

Projects Using DQ Scoring

meta-vengine: core routing engine
OS-App: Agentic Kernel ACE
SUPERMAX: 21-agent coordination
ResearchGravity: research quality scoring
Command Center: 17-tab DQ dashboard
UCW: 174K events scored
Frontier Alpha: portfolio decisions
CareerCoach: recommendation quality
05 — KEY INSIGHTS

What Production Taught Us About DQ

Insight 1: DQ Improves Over Time
The correctness component creates a compounding learning loop. After 4,687 decisions, our single-model baseline already hits 100% actionable (vs the paper's 1.7% baseline). Intelligence accumulates.
Insight 2: Cognitive State Matters
Same query, same model, different time of day = different DQ. Morning routing is ~8% more conservative than evening. The human's state should inform the routing decision.
Insight 3: Cost-Aware DQ Is Essential
Pure quality maximization always picks the most expensive model. Adding a cost dimension (the over-provisioning penalty) makes DQ economically sustainable in production.
Insight 4: Expertise Creates a Map
Historical DQ scores per domain build an expertise profile. High expertise = downgrade the model (save cost); low expertise = upgrade it (prevent failure). Domain-aware routing.
Insight 5: The 0.60 Threshold Is Clean
Below 0.60 complexity, single-model DQ is sufficient; above 0.60, multi-agent consensus justifies the 3x evaluation cost. A clean operational boundary, discovered empirically.
Insight 6: Diverse Perspectives Filter Bias
The Engineer scores tight (0.872), Security plays safe (0.919), Product runs generous (0.928). The median consensus smooths these biases into a more robust score than any single perspective.
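The expertise map from Insight 4 can be sketched as a tier adjustment keyed on per-domain average DQ. The thresholds (0.85 and 0.65) and names below are illustrative assumptions, not the production values.

```javascript
// Tiers ordered from cheapest to most capable.
const TIERS = ["haiku", "sonnet", "opus"];

// domainAvgDQ: historical average DQ for this query's domain.
// Thresholds here (0.85 / 0.65) are illustrative, not production values.
function adjustTier(baseTier, domainAvgDQ) {
  const i = TIERS.indexOf(baseTier);
  if (domainAvgDQ >= 0.85) return TIERS[Math.max(0, i - 1)]; // high expertise: downgrade, save cost
  if (domainAvgDQ < 0.65) return TIERS[Math.min(TIERS.length - 1, i + 1)]; // low expertise: upgrade, prevent failure
  return baseTier; // middling expertise: keep the complexity-based tier
}
```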
06 — RESEARCH DIRECTIONS

What We Should Build Together

Your framework + our production data = an opportunity to define the standard for AI routing quality.

Our Contribution

  • 4,687 scored production decisions
  • SUPERMAX 21-agent system
  • 3-month longitudinal learning data
  • Cross-project DQ implementation (8+ projects)
  • Real-world validation dataset
  • 900,000+ line codebase

Research Questions

  • DQ as universal evaluation standard?
  • Self-evolving V/S/C weights (meta-learning)
  • Cross-domain DQ transfer learning
  • DQ-aware agent coordination protocols
  • Co-authored "DQ in Production" paper
  • DQ standard proposal for industry
Academic Theory + Production Data = Industry Standard
The opportunity: define how AI routing quality is measured everywhere

Let's Build the DQ Standard Together

Your paper changed how we build AI systems. 599 references, 4,687 decisions, 8 projects. Now let's make it the industry standard.

D'Angelo | GitHub: Dicoangelo
dicoangelo.metaventionsai.com
Antigravity / Metaventions AI