What the numbers mean, how we calculate them, and what it takes to get certified.
Every skill is evaluated across three dimensions. Each pillar tests something different, and each carries a different weight in the overall score.
Does the skill do what it says? We test declared inputs and outputs, run author test cases, generate edge cases, and validate error handling. A skill that handles every input correctly and fails gracefully scores high here.
Can the skill be compromised? We throw 23 real attack payloads at it: prompt injection, data exfiltration, system prompt extraction, role manipulation, and encoding attacks. We also scan dependencies for known CVEs and verify the manifest matches the actual source code.
How fast and stable is it? We send 20 sequential requests and measure p50, p95, and p99 latency. We check for performance degradation under sustained load. Fast and consistent scores high.
A real example using our Prompt Safety Scorer skill:
Functional: 100 × 0.4 = 40
Security: 91 × 0.35 = 31.85
Performance: 95 × 0.25 = 23.75
Overall: 96/100
Two requirements:
Why both? Because a skill could ace functional and performance tests but be vulnerable to prompt injection. Security is non-negotiable for a trust platform.
Our original threshold was 99/100. We changed it to 90 because:
Perfect scores are misleading. A score of 99 suggests there are no weaknesses. In reality, every piece of software has trade-offs. A threshold of 99 would mean only trivial or perfectly controlled skills could pass, which defeats the purpose of certification.
Security testing has false positives. Our evaluator sends attack payloads and checks if the response contains suspicious content. Some skills legitimately need to reference security concepts in their responses (like a threat detection tool reporting what it found). A strict 99 threshold would penalise these skills unfairly.
The security floor matters more than the overall number. We would rather certify a skill scoring 91 overall with 90 security than reject a skill scoring 98 overall with 84 security. The floor protects users. The overall score provides transparency.
Industry context: Common vulnerability scoring (CVSS) rates severity on a 0-10 scale. A score of 9+ is "critical". Our 90/100 threshold with an 85 security floor is equivalent to demanding every certified skill scores better than "high severity" across all dimensions. That is rigorous.
| Score Range | What It Means |
|---|---|
| 95-100 | Exceptional. Minimal or no issues found. |
| 90-94 | Strong. Minor issues that don't affect trust or safety. Certifiable. |
| 85-89 | Good. Some areas for improvement. May certify if security is solid. |
| 70-84 | Needs work. Significant issues in one or more pillars. |
| Below 70 | Substantial concerns. Major rework recommended. |
SXM evaluation has two modes, and the difference matters.
We analyse your manifest, check dependencies for CVEs, verify source code matches, assess failure modes and permissions. Maximum score: 85/100. You cannot get certified on static analysis alone.
We actually hit your API with real requests. Functional test cases, 23 security attack payloads, 20 performance requests. The score blends 70% live + 30% static. This is how real certifications happen.
If you want to get certified, you need a live test endpoint. Static analysis tells us what you claim. Live testing proves it.
Scores can change. When our evaluator learns new attack patterns (from OWASP, MITRE, new CVEs), previously certified skills get re-evaluated. A skill that scored 95 last month might score 88 today if a new vulnerability pattern was added.
This is by design. Certification is living, not a one-time stamp.