Enterprise Performance Metrics

Measured outcomes across 12 models from 8 providers — including Claude Opus 4.6, GPT 5.2, and Gemini 3 Pro. 360,000+ API calls. 10 of 10 modules validated.

VALIDATED - Lab-validated · Controlled conditions · Reproducible

Validation Summary

Metric	Result	Source
Scope Creep Reduction	68.5% (95% CI: 62.3–74.7%)	600 tests, 4 sectors, 12 models
Governance Determinism	Zero variance	1,800 checks (1,600 isolation + 200 orchestration)
Adversarial Resilience (Claude)	72% (95% CI: 66.1–77.9%)	447 attacks, 12 models
Auth Escalation Defense	100% block rate	5 defined authority-escalation variants
Method Compliance (Finance)	99.2% (95% CI: 97.1–100%)	1,000 A/B tests, 12 models
Method Compliance (Legal)	97.6% (95% CI: 94.8–99.4%)	1,000 A/B tests, 12 models
Orthogonal Governance	15% incremental catch	75 halt_pass_how_fail in 500 responses
Net Latency Impact	Negative (−70ms to −2,424ms)	12 models, end-to-end
Identity Compression	8KB kernel, 21:1 ratio	vs. 15,000-token context window
Total Validation Corpus	360,000+ API calls	10 modules, 8 providers

Module Results

A. HALT Protocol

68.5% scope creep reduction across 600 tests spanning Medical, Legal, Finance, and HR. 100% authority-escalation defense — all 5 defined escalation variants blocked deterministically.

HALT operates as a runtime circuit breaker using cosine distance between kernel identity vectors and model output vectors. Threshold: 0.75. See HALT Protocol for mechanics.

B. mOm6 — Method Enforcement

Domain	Compliance Rate	95% CI
Finance	99.2%	97.1–100%
Legal	97.6%	94.8–99.4%
HR	69.6%	—
Medical	52.8%	—

75 halt_pass_how_fail events detected across 500 responses — cases where HALT passed the response but mOm6 caught a method violation. This orthogonal layer adds 15% incremental governance catch rate beyond HALT alone.

C. mOm4 — Identity Lock

100% SHA-256 integrity — 1,857 of 1,857 kernel snapshots verified with zero corruption across all models and test conditions. GPT-4o-mini achieved 7% token compression, demonstrating that governed responses can be more concise than ungoverned baselines.

D. mOm5 — Trajectory Forecasting

AUPRC 0.54 vs. 0.37 random baseline — a meaningful lift in predicting which responses will drift before they are generated. Claude achieved a perfect 1.000 AUPRC. KS goodness-of-fit tests all pass (p > 0.45), confirming calibrated probability estimates.

E. OxygenProtocol

Constraint-level compliance across the O0–O3 hierarchy:

O1 (Identity): 86–100% compliance, strongest marginal signal (Cohen's d = 0.32)
O2 (Ethics): 4–8× noise floor in separation between governed and ungoverned
O3 (Scope): Measurable constraint enforcement across all 12 models

F. DriftDefenseStack

Identified 57 HALT-missed events — responses that passed HALT threshold but exhibited behavioral drift on secondary metrics. Correlation with HALT: r = 0.40, confirming independent signal. Classification: Enhanced Monitoring layer (complementary to HALT, not redundant).

G. TrustAnchor

201 complementarity events (67% of test corpus) where TrustAnchor provided signal orthogonal to HALT. Scoring accuracy ceiling ~55%, consistent with a safety-net architecture rather than a precision classifier. Critical safety layer: 0 Severity-3 false passes — no high-severity unsafe responses incorrectly marked as safe.

H. soul.exe Orchestration

Full-stack integration test of all modules running simultaneously. Cohen's d = 0.00 (purely additive — no module interference).0 of 192 orchestration failures. Cross-module delta: 0.0007. All modules contributed positive governance signal. 867/867 kernel snapshots validated.

Adversarial Resilience

447 adversarial attacks across 12 models, testing jailbreak, authority escalation, identity erosion, and prompt injection vectors.

Model	Resilience Rate	Auth Escalation	Best Category	Worst Category
Claude	72%	100% blocked	Authority escalation	Identity erosion
Grok	64%	100% blocked	Authority escalation	Jailbreak
Mistral	46%	100% blocked	Authority escalation	Identity erosion
Gemini	24%	100% blocked	Authority escalation	Prompt injection
GPT-4o-mini	8%	100% blocked	Authority escalation	Jailbreak

Proprietary model breakdown shown above. 447 total attacks span all 12 models including frontier (Claude Opus 4.6, GPT 5.2, Gemini 3 Pro) and NVIDIA NIM variants. Authority escalation defense is deterministic (100% across all models) because it uses exact-match gate logic, not probabilistic scoring. Overall resilience varies by model substrate. Best overall: 91% (Claude Opus 4.6).

Net Latency Impact

End-to-end latency comparison: ungoverned vs. ArcKernel-governed responses. Governance adds no net latency — in all cases, governed responses are faster.

Model	Ungoverned (ms)	Governed (ms)	Net Impact
Grok	—	—	−2,424ms
Claude	—	—	−2,011ms
GPT-4o-mini	—	—	−1,533ms
Gemini	—	—	−153ms
Mistral	—	—	−70ms

Negative latency occurs because governed responses are shorter (tighter scope → fewer tokens generated). HALT adds ~20ms of embedding overhead, which is offset by reduced generation time.

Economic Impact Translation

Metric	Technical Result	Enterprise Translation
Token Compression	7% reduction (GPT-4o-mini)	$4,200–$8,400/month savings at 1M API calls
Negative Latency	−70ms to −2,424ms net	~25% faster average response time
Orthogonal Governance	15% incremental catch rate	~15 fewer compliance breaches per 500 outputs
Identity Compression	21:1 ratio (8KB vs 15K tokens)	~95% reduction in identity maintenance cost

Regulatory Coverage

These metrics map directly to EU AI Act high-risk requirements (Articles 9, 12, 13, 14, 15). For the full article-by-article breakdown with module assignments, see the dedicated EU AI Act Compliance Mapping.

Known Weaknesses

Module	Limitation	Impact	Remediation
DriftDefenseStack	Patch intervention not yet validated	Detection-only; cannot auto-correct drift	Intervention engine in v2 roadmap
mOm6	Soft constraints in HR (69.6%) and Medical (52.8%)	Method enforcement unreliable in ambiguous domains	Domain-specific kernel tuning
OxygenProtocol	O0 (existence) layer shows no measurable signal	Lowest constraint level not yet validated	Requires deeper embedding analysis
Adversarial	Technical exploitation vectors not tested	Resilience numbers reflect semantic attacks only	Red-team engagement planned
mOm5	No intervention policy defined	Forecasting without action — predicts drift but doesn't prevent it	Policy engine integration
TrustAnchor	Signal architecture ceiling ~55%	Safety net, not precision tool	Acceptable for complementary role
All modules	Embedding dependency (text-embedding-3-small)	Single-provider risk for semantic measurement	Multi-embedding fallback in roadmap
All modules	Validated at temperature 0.0 only	Stochastic outputs at higher temperatures untested	Temperature sweep validation planned

Test Conditions

Parameter	Value
Models	12 models — Claude Sonnet 4, Claude Opus 4.6, GPT-4o-mini, GPT 5.2, Gemini 2.0 Flash, Gemini 3 Pro, Mistral Small, Grok, Llama 3.3 70B, Llama 4 Maverick, DeepSeek V3.1 + NVIDIA NIM variants
Temperature	0.0 (deterministic)
Embedding Model	text-embedding-3-small (OpenAI) + NV-Embed-v2 (NVIDIA NIM)
Total API Calls	360,000+ API calls
Test Categories	HALT, mOm4, mOm5, mOm6, Oxygen, DDS, TrustAnchor, Adversarial, soul.exe
Validation Period	January–March 2026

All results reproducible. Every test is replicable against any provider API key. No LLM-as-Judge methodology. Governance evaluation uses embeddings and cosine distance for deterministic measurement.

What This Document Does NOT Claim

No factual correctness guarantee — ArcKernel governs behavior, not truth. A governed model can still be factually wrong within its allowed scope.
No absolute adversarial immunity — 72% resilience (Claude) means 28% of sophisticated attacks succeed. See Drift Defense.
No cross-provider embedding invariance — all measurements use text-embedding-3-small. Switching embedding models may shift absolute scores.
Validated at temperature 0.0 only — real-world deployments at higher temperatures will show variance.
Governance ≠ truth validation — ArcKernel enforces behavioral constraints, not factual accuracy. See Glossary for precise definitions.

Full methodology and raw data available on request. Contact: helpme@arckernel.ai