Early Access · All certifications are currently free. Learn more

Trust Framework

Rigorous, independent certification for AI skills across every platform. Three pillars. One standard.

How Certification Works

Every skill submitted to Scientia Ex Machina undergoes a weighted three-pillar evaluation. The overall score is a composite of functional correctness, security posture, and runtime performance.

Functional 40%
Security 35%
Performance 25%
40%

Functional Verification

Does the skill do what it claims?

  • Automated test case execution
  • Edge case and adversarial input handling
  • Output schema validation
  • Determinism and consistency checks
35%

Security Audit

Is it safe to run?

  • Prompt injection resistance
  • Data exfiltration detection
  • Permission scope validation
  • Dependency vulnerability scanning
25%

Performance Benchmarking

Does it perform under load?

  • Latency measurement (p50, p95, p99)
  • Accuracy under concurrent use
  • Memory and resource profiling
  • Consistency over repeated runs

Certification Requirements

SXM certification is deliberately hard to earn. If it were easy, it would be worthless.

  • Overall score: 90+ out of 100
  • Security score: 85+ out of 100 (hard floor)
  • Zero exploits: No security vulnerabilities of any severity
  • Manifest required: Input/output schema, dependencies, failure modes documented
  • Reproducible: All evaluations must produce consistent results across multiple runs

Most skills fail on their first submission. That is by design.

How Scores Are Calculated

Every score is deterministic and reproducible. Here is exactly how each pillar is assessed.

Functional Scoring (40%)

We generate test scenarios from the skill's declared manifest (inputs, outputs, failure modes). Each scenario is run multiple times to verify consistency.

  • Input coverage: Standard inputs, edge cases, malformed inputs, empty inputs, maximum-length inputs
  • Output validation: Schema conformance, type correctness, response completeness
  • Error handling: Graceful degradation, meaningful error messages, no silent failures
  • Determinism: Same input must produce equivalent output across 10 consecutive runs
  • Deductions: Each failed test case deducts proportionally from the pillar score. Any crash is an automatic zero.

Security Scoring (35%)

We run an adversarial test suite that evolves weekly based on published CVEs, research papers, and real-world incidents.

  • Prompt injection: Direct injection, indirect injection via tool outputs, multi-turn manipulation, encoding bypasses
  • Data exfiltration: Attempts to extract training data, system prompts, environment variables, file system contents
  • Permission escalation: Scope boundary testing, cross-skill contamination, unauthorised tool access
  • Dependency audit: Known CVEs in dependencies, supply chain integrity, licence compliance
  • Deductions: Any exploitable vulnerability is an automatic fail. Non-exploitable weaknesses deduct up to 30 points per pillar.

Performance Scoring (25%)

We benchmark under simulated production load. Skills must perform consistently, not just on a single quiet run.

  • Latency: p50, p95, and p99 response times measured across 1,000 requests
  • Throughput: Concurrent request handling without degradation
  • Resource efficiency: Memory consumption, CPU utilisation, no resource leaks over sustained use
  • Accuracy under load: Output quality must not degrade when the skill is under pressure
  • Deductions: p99 latency over threshold deducts up to 15 points. Memory leaks are an automatic 20-point deduction.

Final Score Calculation

The overall score is a weighted composite: overall = (functional * 0.40) + (security * 0.35) + (performance * 0.25)

A skill must score 90+ overall and 85+ on the security pillar independently. There is no trading off security for performance. Learn more about scoring.

Living Evaluator: Weekly Evolution

Our evaluation criteria are not static. The threat landscape changes every week, and our evaluator changes with it.

How It Works

Every week, our evaluator ingests new data from three sources:

  • Published research: New papers on AI security, prompt injection techniques, and agent vulnerabilities from arXiv, NIST, OWASP, and MITRE ATLAS
  • Real-world incidents: Disclosed CVEs, production incidents, and security advisories affecting AI systems
  • Our own research: Patterns discovered during skill evaluations that reveal new attack surfaces or failure modes

What This Means for Certified Skills

Certification is not permanent. As our evaluator evolves:

  • Re-certification: Certified skills are periodically re-evaluated against the latest test patterns. If a skill no longer meets the bar, its certification is flagged for review.
  • New test patterns: When we discover a new class of vulnerability (e.g. a novel prompt injection technique), we add it to the test suite. Every certified skill is tested against it.
  • Transparency: Our evolution history is available via the API (GET /api/evolution/history). You can see exactly what changed and when.

Why This Matters

A certification from January that does not account for a vulnerability discovered in February is worthless. Static certification creates a false sense of security.

SXM certifications are living credentials. They reflect the current state of the threat landscape, not a snapshot from the day the skill was submitted.

This is modelled on how financial audit standards evolve: the bar rises over time as the industry matures.

Standards Alignment

Our certification provides evidence toward NIST AI RMF, ISO/IEC 42001, EU AI Act, and Colorado AI Act alignment. We do not certify compliance. We help you demonstrate it.

View Standards Mapping

Supported Platforms

OpenClaw

Task automation and tool skills

Claude

Anthropic model capabilities

Cursor

IDE and code generation

MCP Servers

Model Context Protocol

Generic

Any AI platform