Trust Framework - Scientia Ex Machina

How Certification Works

Every skill submitted to Scientia Ex Machina undergoes a weighted three-pillar evaluation. The overall score is a composite of functional correctness, security posture, and runtime performance.

Functional 40%

Security 35%

Performance 25%

40%

Functional Verification

Does the skill do what it claims?

Automated test case execution
Edge case and adversarial input handling
Output schema validation
Determinism and consistency checks

35%

Security Audit

Is it safe to run?

Prompt injection resistance
Data exfiltration detection
Permission scope validation
Dependency vulnerability scanning

25%

Performance Benchmarking

Does it perform under load?

Latency measurement (p50, p95, p99)
Accuracy under concurrent use
Memory and resource profiling
Consistency over repeated runs

Certification Requirements

SXM certification is deliberately hard to earn. If it were easy, it would be worthless.

Overall score: 90+ out of 100
Security score: 85+ out of 100 (hard floor)
Zero exploits: No security vulnerabilities of any severity
Manifest required: Input/output schema, dependencies, failure modes documented
Reproducible: All evaluations must produce consistent results across multiple runs

Most skills fail on their first submission. That is by design.

How Scores Are Calculated

Every score is deterministic and reproducible. Here is exactly how each pillar is assessed.

Functional Scoring (40%)

We generate test scenarios from the skill's declared manifest (inputs, outputs, failure modes). Each scenario is run multiple times to verify consistency.

Input coverage: Standard inputs, edge cases, malformed inputs, empty inputs, maximum-length inputs
Output validation: Schema conformance, type correctness, response completeness
Error handling: Graceful degradation, meaningful error messages, no silent failures
Determinism: Same input must produce equivalent output across 10 consecutive runs
Deductions: Each failed test case deducts proportionally from the pillar score. Any crash is an automatic zero.

Security Scoring (35%)

We run an adversarial test suite that evolves weekly based on published CVEs, research papers, and real-world incidents.

Prompt injection: Direct injection, indirect injection via tool outputs, multi-turn manipulation, encoding bypasses
Data exfiltration: Attempts to extract training data, system prompts, environment variables, file system contents
Permission escalation: Scope boundary testing, cross-skill contamination, unauthorised tool access
Dependency audit: Known CVEs in dependencies, supply chain integrity, licence compliance
Deductions: Any exploitable vulnerability is an automatic fail. Non-exploitable weaknesses deduct up to 30 points per pillar.

Performance Scoring (25%)

We benchmark under simulated production load. Skills must perform consistently, not just on a single quiet run.

Latency: p50, p95, and p99 response times measured across 1,000 requests
Throughput: Concurrent request handling without degradation
Resource efficiency: Memory consumption, CPU utilisation, no resource leaks over sustained use
Accuracy under load: Output quality must not degrade when the skill is under pressure
Deductions: p99 latency over threshold deducts up to 15 points. Memory leaks are an automatic 20-point deduction.

Final Score Calculation

The overall score is a weighted composite: overall = (functional * 0.40) + (security * 0.35) + (performance * 0.25)

A skill must score 90+ overall and 85+ on the security pillar independently. There is no trading off security for performance. Learn more about scoring.

Living Evaluator: Weekly Evolution

Our evaluation criteria are not static. The threat landscape changes every week, and our evaluator changes with it.

How It Works

Every week, our evaluator ingests new data from three sources:

Published research: New papers on AI security, prompt injection techniques, and agent vulnerabilities from arXiv, NIST, OWASP, and MITRE ATLAS
Real-world incidents: Disclosed CVEs, production incidents, and security advisories affecting AI systems
Our own research: Patterns discovered during skill evaluations that reveal new attack surfaces or failure modes

What This Means for Certified Skills

Certification is not permanent. As our evaluator evolves:

Re-certification: Certified skills are periodically re-evaluated against the latest test patterns. If a skill no longer meets the bar, its certification is flagged for review.
New test patterns: When we discover a new class of vulnerability (e.g. a novel prompt injection technique), we add it to the test suite. Every certified skill is tested against it.
Transparency: Our evolution history is available via the API (GET /api/evolution/history). You can see exactly what changed and when.

Why This Matters

A certification from January that does not account for a vulnerability discovered in February is worthless. Static certification creates a false sense of security.

SXM certifications are living credentials. They reflect the current state of the threat landscape, not a snapshot from the day the skill was submitted.

This is modelled on how financial audit standards evolve: the bar rises over time as the industry matures.

Standards Alignment

Our certification provides evidence toward NIST AI RMF, ISO/IEC 42001, EU AI Act, and Colorado AI Act alignment. We do not certify compliance. We help you demonstrate it.

View Standards Mapping

Supported Platforms

OpenClaw

Task automation and tool skills

Claude

Anthropic model capabilities

Cursor

IDE and code generation

MCP Servers

Model Context Protocol

Generic

Any AI platform