Delta Vector 16D

Every security observation in PDX is encoded as a 16-dimensional vector, with each dimension scored from 0.0 to 1.0. This graded, multi-axis encoding is what separates PDX datasets from flat label systems, which attach a single category or severity to each finding.

The 16 dimensions

| # | Dimension | What it measures | High value means |
|---|-----------|------------------|------------------|
| 1 | `severity` | Raw gravity of the observation | Critical finding |
| 2 | `confidence` | Certainty of the verdict | Multiple tiers agreed |
| 3 | `exploitability` | Ease of real-world exploitation | Script-kiddie exploitable |
| 4 | `auth_relevance` | Impact on authentication/authorization | Auth bypass possible |
| 5 | `data_exposure` | Level of sensitive data exposed | PII, credentials, keys |
| 6 | `injection_surface` | Available injection surface area | Multiple injection points |
| 7 | `config_weakness` | Configuration weakness detected | Default/weak config |
| 8 | `crypto_weakness` | Cryptographic weakness | Broken or weak crypto |
| 9 | `logic_flaw` | Application logic vulnerability | Business logic bypass |
| 10 | `timing_anomaly` | Exploitable timing difference | Timing side-channel |
| 11 | `version_risk` | Risk from known-vulnerable version | Unpatched CVE |
| 12 | `chain_potential` | Chainability with other deltas | Useful in exploit chain |
| 13 | `persistence` | Post-exploitation persistence capability | Can maintain access |
| 14 | `noise_level` | False positive probability | Likely false positive |
| 15 | `novelty` | How new/unusual the technique is | Never seen before |
| 16 | `context_dependency` | How much context affects exploitability | Stack-dependent |
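As a rough illustration of the layout in the table above, a delta can be held as a 16-element list in table order. This is only a sketch: the constant name `DELTA_LABELS`, the helper `make_delta`, and the choice to fill unscored dimensions with 0.0 are assumptions made for the example, not part of any documented PDX API.

```python
from typing import Dict, List

# Dimension names in table order. The constant name is an assumption.
DELTA_LABELS: List[str] = [
    "severity", "confidence", "exploitability", "auth_relevance",
    "data_exposure", "injection_surface", "config_weakness", "crypto_weakness",
    "logic_flaw", "timing_anomaly", "version_risk", "chain_potential",
    "persistence", "noise_level", "novelty", "context_dependency",
]


def make_delta(scores: Dict[str, float], default: float = 0.0) -> List[float]:
    """Build a 16-element vector in table order from named scores.

    Dimensions that are not supplied fall back to `default`. That fallback
    is an assumption; the source does not say how unscored dimensions
    are represented.
    """
    unknown = set(scores) - set(DELTA_LABELS)
    if unknown:
        raise KeyError(f"unknown dimension(s): {sorted(unknown)}")
    vector = [float(scores.get(name, default)) for name in DELTA_LABELS]
    # Every score must sit in the closed interval [0.0, 1.0].
    for name, value in zip(DELTA_LABELS, vector):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}={value} is outside [0.0, 1.0]")
    return vector


example = make_delta({"severity": 0.8, "noise_level": 0.15})
```

Keeping the names in one ordered list means the position of each score always matches its row number in the table (index = row number minus one).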

Why 16 dimensions

A traditional vulnerability scanner outputs: "XSS, severity: high." That's a single label.

A PDX delta for the same finding encodes, among its 16 scores: `severity` 0.8, `confidence` 0.7, `exploitability` 0.9, `chain_potential` 0.85 (because a cookie is also missing the HttpOnly flag), `noise_level` 0.15, and `novelty` 0.3. The model doesn't just learn "this is XSS"; it learns the full semantics of the observation.

The same vector serves both training streams:

- Defensive: a high `noise_level` means "be careful, this might be a false positive"
- Offensive: a high `chain_potential` means "this vulnerability alone is medium, but combined with others it becomes critical"
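The two readings above can be sketched as a pair of checks over the same scores. The function names and both thresholds (0.5 and 0.7) are hypothetical values picked for illustration; the source does not define any cut-offs. The scores are the XSS example from this section.

```python
from typing import Dict

# Scores quoted for the XSS finding in this section.
xss_delta: Dict[str, float] = {
    "severity": 0.8,
    "confidence": 0.7,
    "exploitability": 0.9,
    "chain_potential": 0.85,
    "noise_level": 0.15,
    "novelty": 0.3,
}


def needs_false_positive_review(delta: Dict[str, float],
                                noise_threshold: float = 0.5) -> bool:
    """Defensive reading: flag findings that are likely false positives.

    The 0.5 threshold is an illustrative assumption.
    """
    return delta.get("noise_level", 0.0) >= noise_threshold


def is_chain_candidate(delta: Dict[str, float],
                       chain_threshold: float = 0.7) -> bool:
    """Offensive reading: flag findings that are useful in an exploit chain.

    The 0.7 threshold is an illustrative assumption.
    """
    return delta.get("chain_potential", 0.0) >= chain_threshold


review = needs_false_positive_review(xss_delta)  # False: noise_level is 0.15
chain = is_chain_candidate(xss_delta)            # True: chain_potential is 0.85
```

Both checks read the same dictionary, which is the point of the shared vector: neither stream needs its own labeling pass.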

Fingerprint Vector (FP_LABELS)

In addition to the delta vector, PDX captures a 16-dimensional fingerprint of the target environment:

| # | Dimension | What it measures |
|---|-----------|------------------|
| 1 | `stack_complexity` | How complex the technology stack is |
| 2 | `exposure_surface` | External attack surface size |
| 3 | `auth_sophistication` | Quality of auth implementation |
| 4 | `waf_strength` | WAF/filtering effectiveness |
| 5 | `patch_recency` | How recently patched |
| 6 | `api_surface` | API endpoint count and exposure |
| 7 | `crypto_maturity` | Quality of crypto implementation |
| 8 | `error_verbosity` | How much info errors leak |
| 9 | `session_strength` | Session management quality |
| 10 | `input_validation` | Input validation thoroughness |
| 11 | `infrastructure_age` | How old the infrastructure is |
| 12 | `monitoring_presence` | Whether monitoring is detected |
| 13 | `cdn_proxy_layers` | CDN/proxy layers present |
| 14 | `custom_code_ratio` | Custom vs framework code |
| 15 | `documentation_leak` | Internal docs exposed |
| 16 | `historical_vuln_density` | Past vulnerability density |

Pairing each delta with the fingerprint of its target (the Cartesian product Delta × Fingerprint) yields a 32-dimensional space that captures both "what was found" and "in what context". This lets models learn that the same vulnerability has different implications depending on the target environment.
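One way to read the 32-dimensional space, assuming it comes from placing the 16 delta scores and the 16 fingerprint scores side by side, is plain concatenation. The helper name `combine` and all numeric values below are made up for illustration and are not real PDX data.

```python
from typing import List, Sequence

VECTOR_SIZE = 16  # the delta and the fingerprint each have 16 dimensions


def combine(delta: Sequence[float], fingerprint: Sequence[float]) -> List[float]:
    """Concatenate a delta and a fingerprint into one 32-element vector."""
    if len(delta) != VECTOR_SIZE or len(fingerprint) != VECTOR_SIZE:
        raise ValueError("both vectors must have exactly 16 dimensions")
    return list(delta) + list(fingerprint)


# The same finding (identical delta) observed on two different targets.
delta = [0.8, 0.7, 0.9] + [0.0] * 13
target_a = [0.9] * VECTOR_SIZE  # illustrative fingerprint values
target_b = [0.1] * VECTOR_SIZE  # illustrative fingerprint values

sample_a = combine(delta, target_a)
sample_b = combine(delta, target_b)
```

The first 16 entries of `sample_a` and `sample_b` are identical and the last 16 differ, so a model sees the same finding as two distinct training points.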

Standardization goal

The long-term goal is to establish the .pdx delta vector as an open standard for cybersecurity training data. A .pdx file is:

- Readable by the multi-model router
- Exportable to JSONL for fine-tuning
- Compatible with the Burp Suite bridge
- Usable with any LoRA fine-tuning framework (Unsloth, PEFT, axolotl)
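The JSONL export mentioned in the list above might look roughly like the following. The layout of a .pdx file is not described here, so the record fields (`label`, `delta`) and the helper `to_jsonl` are hypothetical; only the JSONL convention itself (one JSON object per line) is standard.

```python
import json
from typing import Dict, Iterable


def to_jsonl(records: Iterable[Dict[str, object]]) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)


# Hypothetical in-memory records. The first uses scores quoted earlier for
# the XSS example; the second is invented to show a noisier finding.
records = [
    {"label": "XSS", "delta": {"severity": 0.8, "noise_level": 0.15}},
    {"label": "XSS", "delta": {"severity": 0.8, "noise_level": 0.6}},
]

jsonl_text = to_jsonl(records)
```

Because each line is a complete JSON object, the output can be streamed or split line by line, which is the form most fine-tuning tooling reads.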