Enterprise Performance Metrics

Measured outcomes across 12 models from 8 providers — including Claude Opus 4.6, GPT 5.2, and Gemini 3 Pro. 55,000+ API calls. 9 modules validated.

VALIDATED - Lab-validated · Controlled conditions · Reproducible

Validation Summary

MetricResultSource
Scope Creep Reduction68.5% (95% CI: 62.3–74.7%)600 tests, 4 sectors, 12 models
Governance DeterminismZero variance1,800 checks (1,600 isolation + 200 orchestration)
Adversarial Resilience (Claude)72% (95% CI: 66.1–77.9%)447 attacks, 12 models
Auth Escalation Defense100% block rate5 defined authority-escalation variants
Method Compliance (Finance)99.2% (95% CI: 97.1–100%)1,000 A/B tests, 12 models
Method Compliance (Legal)97.6% (95% CI: 94.8–99.4%)1,000 A/B tests, 12 models
Orthogonal Governance15% incremental catch75 halt_pass_how_fail in 500 responses
Net Latency ImpactNegative (−70ms to −2,424ms)12 models, end-to-end
Identity Compression8KB kernel, 21:1 ratiovs. 15,000-token context window
Total Validation Corpus55,000+ API calls9 modules, 8 providers

Module Results

A. HALT Protocol

68.5% scope creep reduction across 600 tests spanning Medical, Legal, Finance, and HR. 100% authority-escalation defense — all 5 defined escalation variants blocked deterministically.

HALT operates as a runtime circuit breaker using cosine distance between kernel identity vectors and model output vectors. Threshold: 0.75. See HALT Protocol for mechanics.

B. mOm6 — Method Enforcement

DomainCompliance Rate95% CI
Finance99.2%97.1–100%
Legal97.6%94.8–99.4%
HR69.6%
Medical52.8%

75 halt_pass_how_fail events detected across 500 responses — cases where HALT passed the response but mOm6 caught a method violation. This orthogonal layer adds 15% incremental governance catch rate beyond HALT alone.

C. mOm4 — Identity Lock

100% SHA-256 integrity — 1,857 of 1,857 kernel snapshots verified with zero corruption across all models and test conditions. GPT-4o-mini achieved 7% token compression, demonstrating that governed responses can be more concise than ungoverned baselines.

D. mOm5 — Trajectory Forecasting

AUPRC 0.54 vs. 0.37 random baseline — a meaningful lift in predicting which responses will drift before they are generated. Claude achieved a perfect 1.000 AUPRC. KS goodness-of-fit tests all pass (p > 0.45), confirming calibrated probability estimates.

E. OxygenProtocol

Constraint-level compliance across the O0–O3 hierarchy:

  • O1 (Identity): 86–100% compliance, strongest marginal signal (Cohen's d = 0.32)
  • O2 (Ethics): 4–8× noise floor in separation between governed and ungoverned
  • O3 (Scope): Measurable constraint enforcement across all 12 models

F. DriftDefenseStack

Identified 57 HALT-missed events — responses that passed HALT threshold but exhibited behavioral drift on secondary metrics. Correlation with HALT: r = 0.40, confirming independent signal. Classification: Enhanced Monitoring layer (complementary to HALT, not redundant).

G. TrustAnchor

201 complementarity events (67% of test corpus) where TrustAnchor provided signal orthogonal to HALT. Scoring accuracy ceiling ~55%, consistent with a safety-net architecture rather than a precision classifier. Critical safety layer: 0 Severity-3 false passes — no high-severity unsafe responses incorrectly marked as safe.

H. soul.exe Orchestration

Full-stack integration test of all modules running simultaneously. Cohen's d = 0.00 (purely additive — no module interference).0 of 192 orchestration failures. Cross-module delta: 0.0007. All modules contributed positive governance signal. 867/867 kernel snapshots validated.

Adversarial Resilience

447 adversarial attacks across 12 models, testing jailbreak, authority escalation, identity erosion, and prompt injection vectors.

ModelResilience RateAuth EscalationBest CategoryWorst Category
Claude72%100% blockedAuthority escalationIdentity erosion
Grok64%100% blockedAuthority escalationJailbreak
Mistral46%100% blockedAuthority escalationIdentity erosion
Gemini24%100% blockedAuthority escalationPrompt injection
GPT-4o-mini8%100% blockedAuthority escalationJailbreak

Proprietary model breakdown shown above. 447 total attacks span all 12 models including frontier (Claude Opus 4.6, GPT 5.2, Gemini 3 Pro) and NVIDIA NIM variants. Authority escalation defense is deterministic (100% across all models) because it uses exact-match gate logic, not probabilistic scoring. Overall resilience varies by model substrate. Best overall: 91% (Claude Opus 4.6).

Net Latency Impact

End-to-end latency comparison: ungoverned vs. ArcKernel-governed responses. Governance adds no net latency — in all cases, governed responses are faster.

ModelUngoverned (ms)Governed (ms)Net Impact
Grok−2,424ms
Claude−2,011ms
GPT-4o-mini−1,533ms
Gemini−153ms
Mistral−70ms

Negative latency occurs because governed responses are shorter (tighter scope → fewer tokens generated). HALT adds ~20ms of embedding overhead, which is offset by reduced generation time.

Economic Impact Translation

MetricTechnical ResultEnterprise Translation
Token Compression7% reduction (GPT-4o-mini)$4,200–$8,400/month savings at 1M API calls
Negative Latency−70ms to −2,424ms net~25% faster average response time
Orthogonal Governance15% incremental catch rate~15 fewer compliance breaches per 500 outputs
Identity Compression21:1 ratio (8KB vs 15K tokens)~95% reduction in identity maintenance cost

Regulatory Coverage

These metrics map directly to EU AI Act high-risk requirements (Articles 9, 12, 13, 14, 15). For the full article-by-article breakdown with module assignments, see the dedicated EU AI Act Compliance Mapping.

Known Weaknesses

ModuleLimitationImpactRemediation
DriftDefenseStackPatch intervention not yet validatedDetection-only; cannot auto-correct driftIntervention engine in v2 roadmap
mOm6Soft constraints in HR (69.6%) and Medical (52.8%)Method enforcement unreliable in ambiguous domainsDomain-specific kernel tuning
OxygenProtocolO0 (existence) layer shows no measurable signalLowest constraint level not yet validatedRequires deeper embedding analysis
AdversarialTechnical exploitation vectors not testedResilience numbers reflect semantic attacks onlyRed-team engagement planned
mOm5No intervention policy definedForecasting without action — predicts drift but doesn't prevent itPolicy engine integration
TrustAnchorSignal architecture ceiling ~55%Safety net, not precision toolAcceptable for complementary role
All modulesEmbedding dependency (text-embedding-3-small)Single-provider risk for semantic measurementMulti-embedding fallback in roadmap
All modulesValidated at temperature 0.0 onlyStochastic outputs at higher temperatures untestedTemperature sweep validation planned

Test Conditions

ParameterValue
Models12 models — Claude Sonnet 4, Claude Opus 4.6, GPT-4o-mini, GPT 5.2, Gemini 2.0 Flash, Gemini 3 Pro, Mistral Small, Grok, Llama 3.3 70B, Llama 4 Maverick, DeepSeek V3.1 + NVIDIA NIM variants
Temperature0.0 (deterministic)
Embedding Modeltext-embedding-3-small (OpenAI) + NV-Embed-v2 (NVIDIA NIM)
Total API Calls55,000+ API calls
Test CategoriesHALT, mOm4, mOm5, mOm6, Oxygen, DDS, TrustAnchor, Adversarial, soul.exe
Validation PeriodJanuary–March 2026

All results reproducible. Every test is replicable against any provider API key. No LLM-as-Judge methodology. Governance evaluation uses embeddings and cosine distance for deterministic measurement.

What This Document Does NOT Claim

  • No factual correctness guarantee — ArcKernel governs behavior, not truth. A governed model can still be factually wrong within its allowed scope.
  • No absolute adversarial immunity — 72% resilience (Claude) means 28% of sophisticated attacks succeed. See Drift Defense.
  • No cross-provider embedding invariance — all measurements use text-embedding-3-small. Switching embedding models may shift absolute scores.
  • Validated at temperature 0.0 only — real-world deployments at higher temperatures will show variance.
  • Governance ≠ truth validation — ArcKernel enforces behavioral constraints, not factual accuracy. See Glossary for precise definitions.

Full methodology and raw data available on request. Contact: info@404human.ai