Delta Vector 16D
Every security observation in PDX is encoded as a 16-dimensional vector, with each dimension scored between 0.0 and 1.0. This is what makes PDX datasets fundamentally different from flat label systems.
The 16 dimensions
| # | Dimension | What it measures | High value means |
|---|---|---|---|
| 1 | severity | Raw gravity of the observation | Critical finding |
| 2 | confidence | Certainty of the verdict | Multiple tiers agreed |
| 3 | exploitability | Ease of real-world exploitation | Script-kiddie exploitable |
| 4 | auth_relevance | Impact on authentication/authorization | Auth bypass possible |
| 5 | data_exposure | Level of sensitive data exposed | PII, credentials, keys |
| 6 | injection_surface | Available injection surface area | Multiple injection points |
| 7 | config_weakness | Configuration weakness detected | Default/weak config |
| 8 | crypto_weakness | Cryptographic weakness | Broken or weak crypto |
| 9 | logic_flaw | Application logic vulnerability | Business logic bypass |
| 10 | timing_anomaly | Exploitable timing difference | Timing side-channel |
| 11 | version_risk | Risk from known-vulnerable version | Unpatched CVE |
| 12 | chain_potential | Chainability with other deltas | Useful in exploit chain |
| 13 | persistence | Post-exploitation persistence capability | Can maintain access |
| 14 | noise_level | False positive probability | Likely false positive |
| 15 | novelty | How new/unusual the technique is | Never seen before |
| 16 | context_dependency | How much context affects exploitability | Stack-dependent |
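As a minimal sketch of how such a vector could be represented in code, the snippet below keeps the 16 dimension names in table order and checks that every score lies in the 0.0 to 1.0 range. The names `DELTA_LABELS` and `make_delta`, and the choice to default unspecified dimensions to 0.0, are assumptions made for illustration and are not part of any documented PDX API.

```python
# Dimension names in the order of the table above.
DELTA_LABELS = (
    "severity", "confidence", "exploitability", "auth_relevance",
    "data_exposure", "injection_surface", "config_weakness", "crypto_weakness",
    "logic_flaw", "timing_anomaly", "version_risk", "chain_potential",
    "persistence", "noise_level", "novelty", "context_dependency",
)


def make_delta(**scores: float) -> list[float]:
    """Return a 16-element vector ordered like DELTA_LABELS.

    Dimensions that are not passed default to 0.0. That default is an
    assumption of this sketch, not documented PDX behavior.
    """
    unknown = set(scores) - set(DELTA_LABELS)
    if unknown:
        raise KeyError(f"unknown dimension(s): {sorted(unknown)}")
    vector = []
    for name in DELTA_LABELS:
        value = float(scores.get(name, 0.0))
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")
        vector.append(value)
    return vector


# Example: a partially specified delta.
example = make_delta(severity=0.8, confidence=0.7, noise_level=0.15)
print(dict(zip(DELTA_LABELS, example)))
```

Keeping the order fixed in a single tuple means the same index always refers to the same dimension, which matters once vectors are stored as plain lists.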
Why 16 dimensions
A traditional vulnerability scanner outputs: "XSS, severity: high." That's a single label.
A PDX delta for the same finding encodes, among its 16 scores: severity 0.8, confidence 0.7, exploitability 0.9, chain_potential 0.85 (because there's also a missing HttpOnly cookie), noise_level 0.15, novelty 0.3. The model doesn't just learn "this is XSS"; it learns how severe, how certain, how exploitable, and how chainable the observation is.
The same vector serves both training streams:
- Defensive: a high `noise_level` means "be careful, this might be a false positive"
- Offensive: a high `chain_potential` means "this vulnerability alone is medium, but combined with others it becomes critical"
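The two readings can be illustrated with the XSS example above. This is a sketch only: the thresholds `NOISE_THRESHOLD` and `CHAIN_THRESHOLD` and the two helper functions are hypothetical, chosen to show how one vector can be interpreted from both sides. In a trained model these relationships would be learned rather than hard-coded.

```python
# Scores from the XSS example in the text (other dimensions omitted).
xss_delta = {
    "severity": 0.8,
    "confidence": 0.7,
    "exploitability": 0.9,
    "chain_potential": 0.85,
    "noise_level": 0.15,
    "novelty": 0.3,
}

# Hypothetical cut-offs, used only for this illustration.
NOISE_THRESHOLD = 0.5
CHAIN_THRESHOLD = 0.7


def defensive_reading(delta: dict[str, float]) -> str:
    """Defensive stream: a high noise_level suggests a false positive."""
    if delta.get("noise_level", 0.0) >= NOISE_THRESHOLD:
        return "verify first: possible false positive"
    return "likely a true positive"


def offensive_reading(delta: dict[str, float]) -> str:
    """Offensive stream: a high chain_potential suggests combining findings."""
    if delta.get("chain_potential", 0.0) >= CHAIN_THRESHOLD:
        return "look for findings to chain with"
    return "evaluate as a standalone finding"


print(defensive_reading(xss_delta))
print(offensive_reading(xss_delta))
```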
Fingerprint Vector (FP_LABELS)
In addition to the delta vector, PDX also captures a 16-dimension fingerprint of the target environment:
| # | Dimension | What it measures |
|---|---|---|
| 1 | stack_complexity | How complex the technology stack is |
| 2 | exposure_surface | External attack surface size |
| 3 | auth_sophistication | Quality of auth implementation |
| 4 | waf_strength | WAF/filtering effectiveness |
| 5 | patch_recency | How recently patched |
| 6 | api_surface | API endpoint count and exposure |
| 7 | crypto_maturity | Quality of crypto implementation |
| 8 | error_verbosity | How much info errors leak |
| 9 | session_strength | Session management quality |
| 10 | input_validation | Input validation thoroughness |
| 11 | infrastructure_age | How old the infrastructure is |
| 12 | monitoring_presence | Whether monitoring is detected |
| 13 | cdn_proxy_layers | CDN/proxy layers present |
| 14 | custom_code_ratio | Custom vs framework code |
| 15 | documentation_leak | Internal docs exposed |
| 16 | historical_vuln_density | Past vulnerability density |
Pairing each delta with the fingerprint of its target (Delta × Fingerprint) places every observation in a 32-dimensional space that captures both "what was found" and "in what context". This lets models learn that the same vulnerability has different implications depending on the target environment.
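One simple way to build the joint 32-dimensional representation is to concatenate the two 16-element vectors. The sketch below assumes that approach. The function name `combine` and the sample scores are hypothetical, and the actual PDX pipeline may combine the vectors differently.

```python
DELTA_DIMS = 16
FP_DIMS = 16


def combine(delta: list[float], fingerprint: list[float]) -> list[float]:
    """Concatenate a delta and a fingerprint into one 32-element vector."""
    if len(delta) != DELTA_DIMS:
        raise ValueError(f"delta must have {DELTA_DIMS} values, got {len(delta)}")
    if len(fingerprint) != FP_DIMS:
        raise ValueError(
            f"fingerprint must have {FP_DIMS} values, got {len(fingerprint)}"
        )
    return [*delta, *fingerprint]


# The same finding observed against two different (made-up) environments.
delta = [0.8] + [0.0] * 15
hardened_target = [0.9] * 16   # e.g. strong WAF, recent patches
legacy_target = [0.2] * 16     # e.g. weak filtering, old infrastructure

point_a = combine(delta, hardened_target)
point_b = combine(delta, legacy_target)

# Same "what was found", different "in what context".
print(len(point_a), point_a[:16] == point_b[:16], point_a == point_b)
```

The first 16 values are identical for both points while the last 16 differ, which is what allows a model to treat the same finding differently per environment.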
Standardization goal
The long-term goal is to establish the .pdx delta vector as an open standard for cybersecurity training data. A .pdx file is:
- Readable by the multi-model router
- Exportable to JSONL for fine-tuning
- Compatible with the Burp Suite bridge
- Usable with any LoRA fine-tuning framework (Unsloth, PEFT, axolotl)
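As a rough illustration of the JSONL export, the sketch below serializes observations as one JSON object per line using only the Python standard library. The on-disk layout of a .pdx file is not described here, so the record shape (`delta` and `fingerprint` keys) and the function name `records_to_jsonl` are assumptions, not the actual PDX schema or exporter.

```python
import json


def records_to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)


# Hypothetical in-memory records; real PDX records may be structured differently.
records = [
    {
        "delta": {"severity": 0.8, "confidence": 0.7, "noise_level": 0.15},
        "fingerprint": {"waf_strength": 0.4, "patch_recency": 0.6},
    },
    {
        "delta": {"severity": 0.3, "confidence": 0.9, "noise_level": 0.05},
        "fingerprint": {"waf_strength": 0.9, "patch_recency": 0.8},
    },
]

jsonl_text = records_to_jsonl(records)
print(jsonl_text, end="")
```

Each line parses independently with `json.loads`, which is the property that line-oriented dataset loaders rely on.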