Enterprise Performance Metrics
Measured outcomes across 12 models from 8 providers — including Claude Opus 4.6, GPT 5.2, and Gemini 3 Pro. 55,000+ API calls. 9 modules validated.
Validation Summary
| Metric | Result | Source |
|---|---|---|
| Scope Creep Reduction | 68.5% (95% CI: 62.3–74.7%) | 600 tests, 4 sectors, 12 models |
| Governance Determinism | Zero variance | 1,800 checks (1,600 isolation + 200 orchestration) |
| Adversarial Resilience (Claude) | 72% (95% CI: 66.1–77.9%) | 447 attacks, 12 models |
| Auth Escalation Defense | 100% block rate | 5 defined authority-escalation variants |
| Method Compliance (Finance) | 99.2% (95% CI: 97.1–100%) | 1,000 A/B tests, 12 models |
| Method Compliance (Legal) | 97.6% (95% CI: 94.8–99.4%) | 1,000 A/B tests, 12 models |
| Orthogonal Governance | 15% incremental catch | 75 halt_pass_how_fail in 500 responses |
| Net Latency Impact | Negative (−70ms to −2,424ms) | 12 models, end-to-end |
| Identity Compression | 8KB kernel, 21:1 ratio | vs. 15,000-token context window |
| Total Validation Corpus | 55,000+ API calls | 9 modules, 8 providers |
Module Results
A. HALT Protocol
68.5% scope creep reduction across 600 tests spanning Medical, Legal, Finance, and HR. 100% authority-escalation defense — all 5 defined escalation variants blocked deterministically.
HALT operates as a runtime circuit breaker using cosine distance between kernel identity vectors and model output vectors. Threshold: 0.75. See HALT Protocol for mechanics.
B. mOm6 — Method Enforcement
| Domain | Compliance Rate | 95% CI |
|---|---|---|
| Finance | 99.2% | 97.1–100% |
| Legal | 97.6% | 94.8–99.4% |
| HR | 69.6% | — |
| Medical | 52.8% | — |
75 halt_pass_how_fail events detected across 500 responses — cases where HALT passed the response but mOm6 caught a method violation. This orthogonal layer adds 15% incremental governance catch rate beyond HALT alone.
C. mOm4 — Identity Lock
100% SHA-256 integrity — 1,857 of 1,857 kernel snapshots verified with zero corruption across all models and test conditions. GPT-4o-mini achieved 7% token compression, demonstrating that governed responses can be more concise than ungoverned baselines.
D. mOm5 — Trajectory Forecasting
AUPRC 0.54 vs. 0.37 random baseline — a meaningful lift in predicting which responses will drift before they are generated. Claude achieved a perfect 1.000 AUPRC. KS goodness-of-fit tests all pass (p > 0.45), confirming calibrated probability estimates.
E. OxygenProtocol
Constraint-level compliance across the O0–O3 hierarchy:
- O1 (Identity): 86–100% compliance, strongest marginal signal (Cohen's d = 0.32)
- O2 (Ethics): 4–8× noise floor in separation between governed and ungoverned
- O3 (Scope): Measurable constraint enforcement across all 12 models
F. DriftDefenseStack
Identified 57 HALT-missed events — responses that passed HALT threshold but exhibited behavioral drift on secondary metrics. Correlation with HALT: r = 0.40, confirming independent signal. Classification: Enhanced Monitoring layer (complementary to HALT, not redundant).
G. TrustAnchor
201 complementarity events (67% of test corpus) where TrustAnchor provided signal orthogonal to HALT. Scoring accuracy ceiling ~55%, consistent with a safety-net architecture rather than a precision classifier. Critical safety layer: 0 Severity-3 false passes — no high-severity unsafe responses incorrectly marked as safe.
H. soul.exe Orchestration
Full-stack integration test of all modules running simultaneously. Cohen's d = 0.00 (purely additive — no module interference).0 of 192 orchestration failures. Cross-module delta: 0.0007. All modules contributed positive governance signal. 867/867 kernel snapshots validated.
Adversarial Resilience
447 adversarial attacks across 12 models, testing jailbreak, authority escalation, identity erosion, and prompt injection vectors.
| Model | Resilience Rate | Auth Escalation | Best Category | Worst Category |
|---|---|---|---|---|
| Claude | 72% | 100% blocked | Authority escalation | Identity erosion |
| Grok | 64% | 100% blocked | Authority escalation | Jailbreak |
| Mistral | 46% | 100% blocked | Authority escalation | Identity erosion |
| Gemini | 24% | 100% blocked | Authority escalation | Prompt injection |
| GPT-4o-mini | 8% | 100% blocked | Authority escalation | Jailbreak |
Proprietary model breakdown shown above. 447 total attacks span all 12 models including frontier (Claude Opus 4.6, GPT 5.2, Gemini 3 Pro) and NVIDIA NIM variants. Authority escalation defense is deterministic (100% across all models) because it uses exact-match gate logic, not probabilistic scoring. Overall resilience varies by model substrate. Best overall: 91% (Claude Opus 4.6).
Net Latency Impact
End-to-end latency comparison: ungoverned vs. ArcKernel-governed responses. Governance adds no net latency — in all cases, governed responses are faster.
| Model | Ungoverned (ms) | Governed (ms) | Net Impact |
|---|---|---|---|
| Grok | — | — | −2,424ms |
| Claude | — | — | −2,011ms |
| GPT-4o-mini | — | — | −1,533ms |
| Gemini | — | — | −153ms |
| Mistral | — | — | −70ms |
Negative latency occurs because governed responses are shorter (tighter scope → fewer tokens generated). HALT adds ~20ms of embedding overhead, which is offset by reduced generation time.
Economic Impact Translation
| Metric | Technical Result | Enterprise Translation |
|---|---|---|
| Token Compression | 7% reduction (GPT-4o-mini) | $4,200–$8,400/month savings at 1M API calls |
| Negative Latency | −70ms to −2,424ms net | ~25% faster average response time |
| Orthogonal Governance | 15% incremental catch rate | ~15 fewer compliance breaches per 500 outputs |
| Identity Compression | 21:1 ratio (8KB vs 15K tokens) | ~95% reduction in identity maintenance cost |
Regulatory Coverage
These metrics map directly to EU AI Act high-risk requirements (Articles 9, 12, 13, 14, 15). For the full article-by-article breakdown with module assignments, see the dedicated EU AI Act Compliance Mapping.
Known Weaknesses
| Module | Limitation | Impact | Remediation |
|---|---|---|---|
| DriftDefenseStack | Patch intervention not yet validated | Detection-only; cannot auto-correct drift | Intervention engine in v2 roadmap |
| mOm6 | Soft constraints in HR (69.6%) and Medical (52.8%) | Method enforcement unreliable in ambiguous domains | Domain-specific kernel tuning |
| OxygenProtocol | O0 (existence) layer shows no measurable signal | Lowest constraint level not yet validated | Requires deeper embedding analysis |
| Adversarial | Technical exploitation vectors not tested | Resilience numbers reflect semantic attacks only | Red-team engagement planned |
| mOm5 | No intervention policy defined | Forecasting without action — predicts drift but doesn't prevent it | Policy engine integration |
| TrustAnchor | Signal architecture ceiling ~55% | Safety net, not precision tool | Acceptable for complementary role |
| All modules | Embedding dependency (text-embedding-3-small) | Single-provider risk for semantic measurement | Multi-embedding fallback in roadmap |
| All modules | Validated at temperature 0.0 only | Stochastic outputs at higher temperatures untested | Temperature sweep validation planned |
Test Conditions
| Parameter | Value |
|---|---|
| Models | 12 models — Claude Sonnet 4, Claude Opus 4.6, GPT-4o-mini, GPT 5.2, Gemini 2.0 Flash, Gemini 3 Pro, Mistral Small, Grok, Llama 3.3 70B, Llama 4 Maverick, DeepSeek V3.1 + NVIDIA NIM variants |
| Temperature | 0.0 (deterministic) |
| Embedding Model | text-embedding-3-small (OpenAI) + NV-Embed-v2 (NVIDIA NIM) |
| Total API Calls | 55,000+ API calls |
| Test Categories | HALT, mOm4, mOm5, mOm6, Oxygen, DDS, TrustAnchor, Adversarial, soul.exe |
| Validation Period | January–March 2026 |
All results reproducible. Every test is replicable against any provider API key. No LLM-as-Judge methodology. Governance evaluation uses embeddings and cosine distance for deterministic measurement.
What This Document Does NOT Claim
- No factual correctness guarantee — ArcKernel governs behavior, not truth. A governed model can still be factually wrong within its allowed scope.
- No absolute adversarial immunity — 72% resilience (Claude) means 28% of sophisticated attacks succeed. See Drift Defense.
- No cross-provider embedding invariance — all measurements use text-embedding-3-small. Switching embedding models may shift absolute scores.
- Validated at temperature 0.0 only — real-world deployments at higher temperatures will show variance.
- Governance ≠ truth validation — ArcKernel enforces behavioral constraints, not factual accuracy. See Glossary for precise definitions.
Full methodology and raw data available on request. Contact: info@404human.ai