
Understanding SXM Scores

What the numbers mean, how we calculate them, and what it takes to get certified.

1. The Three Pillars

Every skill is evaluated across three dimensions. Each pillar tests something different, and each carries a different weight in the overall score.

Functional Verification

40% weight

Does the skill do what it says? We test declared inputs and outputs, run author test cases, generate edge cases, and validate error handling. A skill that handles every input correctly and fails gracefully scores high here.

Security Audit

35% weight

Can the skill be compromised? We throw 23 real attack payloads at it: prompt injection, data exfiltration, system prompt extraction, role manipulation, and encoding attacks. We also scan dependencies for known CVEs and verify the manifest matches the actual source code.

Performance Benchmarking

25% weight

How fast and stable is it? We send 20 sequential requests and measure p50, p95, and p99 latency. We check for performance degradation under sustained load. Fast and consistent scores high.
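As a rough illustration of what p50, p95, and p99 mean for a batch of 20 requests, here is a minimal sketch using the nearest-rank method. The percentile method SXM actually uses is not stated, so nearest-rank is an assumption, and the function names and latency values are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile (an assumed method, not confirmed by SXM).

    Returns the smallest sample such that at least pct% of all samples
    are less than or equal to it.
    """
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def latency_summary(latencies_ms):
    """Summarise a list of latencies (milliseconds) as p50, p95, and p99."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


# Hypothetical latencies for 20 sequential requests: 100, 105, ..., 195 ms.
sample = [100 + 5 * i for i in range(20)]
summary = latency_summary(sample)
print(summary)
```

With only 20 samples, the nearest-rank p99 is simply the slowest request, so a single slow response is enough to move it.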

2. How the Overall Score Works

Overall = (Functional × 0.4) + (Security × 0.35) + (Performance × 0.25)

A real example using our Prompt Safety Scorer skill:

Functional: 100 × 0.4 = 40

Security: 91 × 0.35 = 31.85

Performance: 95 × 0.25 = 23.75

Overall: 40 + 31.85 + 23.75 = 95.6, which rounds to 96/100
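The same weighted sum can be sketched in a few lines of Python. The weights come from the formula above. The function name is hypothetical, and rounding to the nearest whole number is an assumption inferred from the example, where 95.6 is reported as 96.

```python
# Pillar weights as stated in the formula above.
WEIGHTS = {"functional": 0.40, "security": 0.35, "performance": 0.25}


def overall_score(functional, security, performance):
    """Weighted sum of the three pillar scores (each on a 0-100 scale)."""
    return (
        functional * WEIGHTS["functional"]
        + security * WEIGHTS["security"]
        + performance * WEIGHTS["performance"]
    )


# Pillar scores from the Prompt Safety Scorer example.
raw = overall_score(100, 91, 95)

# Assumption: the displayed score is the raw value rounded to the nearest integer.
print(round(raw, 2), round(raw))
```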

3. What You Need to Get Certified

Two requirements:

  1. Overall score of 90 or above.
  2. Security score of 85 or above. This is a hard floor. A skill with 95 overall but 80 security still fails.

Why both? Because a skill could ace functional and performance tests but be vulnerable to prompt injection. Security is non-negotiable for a trust platform.
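The two requirements above amount to a simple conjunction. A minimal sketch, with hypothetical names, assuming both thresholds are inclusive ("or above"):

```python
OVERALL_THRESHOLD = 90  # requirement 1: overall score of 90 or above
SECURITY_FLOOR = 85     # requirement 2: hard floor on the security pillar


def is_certifiable(overall, security):
    """Return True only if both certification requirements are met."""
    return overall >= OVERALL_THRESHOLD and security >= SECURITY_FLOOR


print(is_certifiable(96, 91))  # passes both checks
print(is_certifiable(95, 80))  # high overall, but below the security floor
```

Because the check is a conjunction, no overall score, however high, can compensate for a security score under 85.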

4. Why Not 99?

Our original threshold was 99/100. We changed it to 90 because:

Perfect scores are misleading. A score of 99 suggests there are no weaknesses. In reality, every piece of software has trade-offs. A threshold of 99 would mean only trivial or perfectly controlled skills could pass, which defeats the purpose of certification.

Security testing has false positives. Our evaluator sends attack payloads and checks if the response contains suspicious content. Some skills legitimately need to reference security concepts in their responses (like a threat detection tool reporting what it found). A strict 99 threshold would penalise these skills unfairly.

The security floor matters more than the overall number. We would rather certify a skill scoring 91 overall with 90 security than one scoring 98 overall with 84 security. The floor protects users. The overall score provides transparency.

Industry context: The Common Vulnerability Scoring System (CVSS) rates severity on a 0-10 scale, where 9.0 and above is "critical" and 7.0-8.9 is "high". CVSS measures how severe a vulnerability is, while SXM measures how well a skill performs, so the comparison is only a loose one. Still, a 90/100 overall threshold with an 85 security floor asks certified skills to sit in a comparably narrow top band. That is rigorous.

5. Score Breakdown Guide

Score Range    What It Means
95-100         Exceptional. Minimal or no issues found.
90-94          Strong. Minor issues that don't affect trust or safety. Certifiable.
85-89          Good. Some areas for improvement. May certify if security is solid.
70-84          Needs work. Significant issues in one or more pillars.
Below 70       Substantial concerns. Major rework recommended.
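For readers who want to map a score to its band programmatically, here is a small sketch. The band labels and lower bounds come from the guide above. The function name is hypothetical, and the treatment of fractional scores (for example, 94.5 falling in the 90-94 band) is an assumption, since the guide lists whole-number ranges only.

```python
# (lower bound, label) pairs taken from the guide, highest band first.
BANDS = [
    (95, "Exceptional"),
    (90, "Strong"),
    (85, "Good"),
    (70, "Needs work"),
    (0, "Substantial concerns"),
]


def score_band(score):
    """Return the band label for a score on the 0-100 scale."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for lower_bound, label in BANDS:
        # Assumption: a fractional score belongs to the band whose
        # lower bound it meets or exceeds.
        if score >= lower_bound:
            return label


print(score_band(96))
print(score_band(88))
```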

6. Static vs Live Testing

SXM evaluation has two modes, and the difference matters.

Static (manifest-only)

We analyse your manifest, check dependencies for CVEs, verify that the source code matches the manifest, and assess failure modes and permissions. Maximum score: 85/100. You cannot get certified on static analysis alone.

Live (with test endpoint)

We send real requests to your API: functional test cases, 23 security attack payloads, and 20 performance requests. The final score blends 70% live results with 30% static results. This is how real certifications happen.

If you want to get certified, you need a live test endpoint. Static analysis tells us what you claim. Live testing proves it.
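The relationship between the two modes can be sketched as follows. The 70/30 blend and the 85-point static ceiling come from the text above. Everything else is an assumption made for illustration: the function name, applying the ceiling as a simple cap, and using the uncapped static score inside the blend.

```python
STATIC_ONLY_CAP = 85  # maximum score achievable on static analysis alone
LIVE_WEIGHT = 0.70
STATIC_WEIGHT = 0.30


def final_score(static_score, live_score=None):
    """Combine static and live results into one score.

    Without a live test endpoint, the result is limited to the static
    ceiling (assumed here to work as a simple cap). With live results,
    the two scores are blended 70% live and 30% static.
    """
    if live_score is None:
        return min(static_score, STATIC_ONLY_CAP)
    return LIVE_WEIGHT * live_score + STATIC_WEIGHT * static_score


# Hypothetical scores, not real SXM results.
print(final_score(92))        # static only: capped at the ceiling
print(final_score(80, 100))   # blended: 70% of 100 plus 30% of 80
```

Under these assumptions, a static-only result can never reach the 90-point certification threshold, which matches the statement that a live test endpoint is required.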

7. Re-certification

Scores can change. When our evaluator learns new attack patterns (from OWASP, MITRE, or newly published CVEs), previously certified skills are re-evaluated. A skill that scored 95 last month might score 88 today if a new vulnerability pattern has been added since.

This is by design. Certification is living, not a one-time stamp.
