What is PDX
PDX is a security analysis framework that transforms raw security events into structured training datasets for LLM fine-tuning. It is the pipeline that turns observations into knowledge.
PDX is not limited to HYDRA data. It has two input sources:
- HYDRA sessions (passive) — SSH honeypot events via JSONL files
- Burp Suite (active) — HTTP deltas captured during web pentests via a Java extension
Both sources produce data in the same .pdx format and flow through the same DataRouter, generators, and quality pipeline.
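As a rough illustration of how two differently shaped sources can converge on one record type before routing, here is a minimal Python sketch. The `Observation` class, its fields, and both adapter functions are hypothetical names chosen for this example; the actual binary .pdx layout and the DataRouter API are not described on this page.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Iterable, Iterator


@dataclass
class Observation:
    """Hypothetical common record; not the real .pdx binary layout."""
    source: str                      # "hydra" (passive) or "burp" (active)
    kind: str                        # e.g. "ssh_command", "http_delta"
    payload: dict[str, Any] = field(default_factory=dict)


def from_hydra_jsonl(lines: Iterable[str]) -> Iterator[Observation]:
    """Parse JSONL input: one JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        event = json.loads(line)
        # The "event" key is an assumed field name for this sketch.
        yield Observation("hydra", event.get("event", "unknown"), event)


def from_burp_delta(delta: dict[str, Any]) -> Observation:
    """Wrap one HTTP delta as it might arrive from the Burp bridge."""
    return Observation("burp", "http_delta", delta)


hydra_lines = ['{"event": "ssh_command", "input": "uname -a"}', ""]
observations = list(from_hydra_jsonl(hydra_lines))
observations.append(
    from_burp_delta({"url": "/login", "status_before": 200, "status_after": 500})
)
```

Once both sources yield the same record type, everything downstream can stay source-agnostic, which is the property the shared pipeline relies on.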
Core components
| Component | Purpose |
|---|---|
| .pdx format | Binary format for security observations |
| Delta Vector 16D | 16-dimensional scoring of each observation |
| DataRouter | Dual-use classification (defensive + offensive) |
| Burp bridge | Java extension + Python proxy for web pentesting |
| Training generators | 7 generators: SFT, DPO, RAFT, ReAct, CoT, Chain, JS |
| Quality pipeline | Deduplication, filtering, curriculum ordering |
| Multi-model router | 4-tier analysis: 7B → 32B → Anthropic API → fallback |
| 8 collectors | NVD, ExploitDB, OWASP, MITRE ATT&CK, Nuclei, CWE, RFC, man pages |
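The three stages listed for the quality pipeline (deduplication, filtering, curriculum ordering) can be sketched as below. This is only an illustration under stated assumptions: the sample fields (`prompt`, `completion`, `difficulty`), the minimum-length filter, and the easy-to-hard ordering are hypothetical choices, not PDX's actual criteria.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def dedupe(samples: list[dict]) -> list[dict]:
    """Keep the first occurrence of each normalized (prompt, completion) pair."""
    seen = set()
    unique = []
    for sample in samples:
        key = (normalize(sample["prompt"]), normalize(sample["completion"]))
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


def keep(sample: dict, min_chars: int = 10) -> bool:
    """Assumed filter: drop samples whose completion is too short to be useful."""
    return len(sample["completion"].strip()) >= min_chars


def curriculum_order(samples: list[dict]) -> list[dict]:
    """Order easy to hard using an assumed per-sample difficulty score."""
    return sorted(samples, key=lambda s: s["difficulty"])


def run_quality_pipeline(samples: list[dict]) -> list[dict]:
    return curriculum_order([s for s in dedupe(samples) if keep(s)])


raw = [
    {"prompt": "Classify: uname -a", "completion": "Reconnaissance of the host OS.", "difficulty": 1},
    {"prompt": "classify:  uname -a", "completion": "Reconnaissance of the host OS.", "difficulty": 1},
    {"prompt": "Classify: ls", "completion": "ok", "difficulty": 1},
    {"prompt": "Explain the chain", "completion": "Download, chmod, then execute a dropper.", "difficulty": 3},
    {"prompt": "Classify: wget", "completion": "Payload download attempt.", "difficulty": 2},
]
curated = run_quality_pipeline(raw)
```

Here the second sample is removed as a near-duplicate, the third is filtered out for its short completion, and the remaining three are ordered by difficulty.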
The multi-model router
PDX doesn't rely on a single LLM. It uses a 4-tier cascade to analyze each security delta:
Tier 1: Copilot local (7B) — fast first-pass on every delta
↓ uncertain?
Tier 2: Teacher local (32B) — detailed second-pass
↓ still uncertain?
Tier 3: Anthropic API — when complexity requires it
↓ unavailable?
Tier 4: WebChat fallback — marked REQUIRES HUMAN VALIDATION
Each tier produces a verdict: VULNERABLE, NOT_VULN, INFORMATIONAL, UNCERTAIN, or FALSE_POS. When tiers disagree, the conflict is flagged for review. Nothing is discarded.
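The escalation logic can be sketched in Python as follows. The verdict names come from the text above; everything else is an assumption made for illustration: the `TierOutput.confidence` score, the 0.8 threshold, the `TierUnavailable` exception, and the rule that the fallback runs only when the last tier could not be reached.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    VULNERABLE = "VULNERABLE"
    NOT_VULN = "NOT_VULN"
    INFORMATIONAL = "INFORMATIONAL"
    UNCERTAIN = "UNCERTAIN"
    FALSE_POS = "FALSE_POS"


class TierUnavailable(Exception):
    """Hypothetical signal that a tier could not be reached."""


@dataclass
class TierOutput:
    verdict: Verdict
    confidence: float  # assumed 0..1 score; not named in the source


@dataclass
class CascadeResult:
    history: list = field(default_factory=list)  # every (tier, verdict) is kept
    final: Verdict = Verdict.UNCERTAIN
    conflict: bool = False
    requires_human_validation: bool = False


def run_cascade(delta, tiers, fallback, threshold=0.8):
    """Try tiers in order; escalate while the answer is uncertain."""
    result = CascadeResult()
    last_unavailable = False
    for name, analyze in tiers:
        try:
            out = analyze(delta)
        except TierUnavailable:
            last_unavailable = True
            continue
        last_unavailable = False
        result.history.append((name, out.verdict))
        result.final = out.verdict
        if out.verdict is not Verdict.UNCERTAIN and out.confidence >= threshold:
            break  # confident answer: stop escalating
    if last_unavailable:
        out = fallback(delta)
        result.history.append(("fallback", out.verdict))
        result.final = out.verdict
        result.requires_human_validation = True
    # Disagreement between definite verdicts is flagged, never dropped.
    definite = {v for _, v in result.history if v is not Verdict.UNCERTAIN}
    result.conflict = len(definite) > 1
    return result


def tier1(delta):
    return TierOutput(Verdict.INFORMATIONAL, 0.4)


def tier2(delta):
    return TierOutput(Verdict.UNCERTAIN, 0.3)


def tier3(delta):
    raise TierUnavailable("API unreachable")


def webchat(delta):
    return TierOutput(Verdict.VULNERABLE, 0.5)


result = run_cascade(
    {"id": 1},
    [("copilot-7b", tier1), ("teacher-32b", tier2), ("api", tier3)],
    webchat,
)
```

In this run the first two tiers are not confident, the API tier is unreachable, and the fallback verdict is recorded with `requires_human_validation` set. Because the low-confidence first-pass verdict differs from the fallback verdict, `conflict` is also set.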
The 8 data collectors
PDX does more than process HYDRA and Burp data: it enriches every observation with context from 8 external sources:
| Collector | Source | What it adds |
|---|---|---|
| `nvd_collector` | NVD/NIST | CVE details, CVSS scores |
| `exploitdb_collector` | ExploitDB | Known exploits for detected versions |
| `owasp_collector` | OWASP | Web vulnerability classifications |
| `attackmitre_collector` | MITRE ATT&CK | Tactic/technique mapping |
| `nuclei_collector` | Nuclei templates | Detection signatures |
| `cwe_collector` | CWE database | Weakness classifications |
| `rfc_collector` | IETF RFCs | Protocol specifications |
| `manpage_collector` | Linux man pages | Command documentation |
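One way to picture the enrichment step is a shared collector interface whose results are attached to each observation. The sketch below assumes such an interface; the `Collector` protocol, the `collect` method, the `context` key, and the in-memory stand-ins are all hypothetical, and the stubs do not call NVD or read real man pages.

```python
from typing import Iterable, Protocol


class Collector(Protocol):
    """Assumed common interface; the real collector API is not shown here."""
    name: str

    def collect(self, observation: dict) -> dict: ...


class StubNvdCollector:
    """Stand-in for nvd_collector using a local table, not the NVD service."""
    name = "nvd_collector"

    def __init__(self, table: dict):
        self._table = table

    def collect(self, observation: dict) -> dict:
        return {"cves": self._table.get(observation.get("version"), [])}


class StubManpageCollector:
    """Stand-in for manpage_collector using a local summary table."""
    name = "manpage_collector"

    def __init__(self, summaries: dict):
        self._summaries = summaries

    def collect(self, observation: dict) -> dict:
        words = observation.get("command", "").split()
        program = words[0] if words else ""
        return {"summary": self._summaries.get(program)}


def enrich(observation: dict, collectors: Iterable[Collector]) -> dict:
    """Return a copy of the observation with per-collector context attached."""
    enriched = dict(observation)
    enriched["context"] = {}
    for collector in collectors:
        try:
            enriched["context"][collector.name] = collector.collect(observation)
        except Exception as exc:  # one failing source should not lose the rest
            enriched["context"][collector.name] = {"error": str(exc)}
    return enriched


observation = {"command": "uname -a", "version": "ExampleSSH 1.0"}
collectors = [
    # "CVE-XXXX-YYYY" is a placeholder, not a real identifier.
    StubNvdCollector({"ExampleSSH 1.0": ["CVE-XXXX-YYYY"]}),
    StubManpageCollector({"uname": "print system information"}),
]
enriched = enrich(observation, collectors)
```

Keying the context by collector name keeps each source's contribution separate, so a downstream generator can pick only the context it needs.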
Output
PDX produces three types of training data:
- Defensive — SFT pairs for pattern detection + DPO pairs for lure effectiveness measurement
- Offensive — SFT pairs for attack chain reconstruction + RAFT for complete kill chains
- Combined — ReAct traces analyzing the same sequence from both perspectives
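To make the difference between SFT and DPO output concrete, the sketch below builds one record of each kind and serializes them as JSONL. The field names (`prompt`, `completion`, `chosen`, `rejected`) follow a convention commonly used by fine-tuning tools; the `type` and `perspective` fields, and the example text, are assumptions and may not match PDX's real schema.

```python
import json


def sft_record(prompt: str, completion: str, perspective: str) -> dict:
    """One supervised pair: a prompt and the single target answer."""
    return {
        "type": "sft",
        "perspective": perspective,  # assumed tag: "defensive" or "offensive"
        "prompt": prompt,
        "completion": completion,
    }


def dpo_record(prompt: str, chosen: str, rejected: str, perspective: str) -> dict:
    """One preference pair: the same prompt with a better and a worse answer."""
    if chosen == rejected:
        raise ValueError("chosen and rejected must differ")
    return {
        "type": "dpo",
        "perspective": perspective,
        "prompt": prompt,
        "chosen": chosen,
        "rejected": rejected,
    }


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


records = [
    sft_record(
        "Session runs: wget, chmod +x, ./payload. What pattern is this?",
        "Download-and-execute dropper sequence.",
        "defensive",
    ),
    dpo_record(
        "Which lure kept the attacker engaged longer?",
        "The fake credentials file, which triggered follow-up commands.",
        "The empty home directory, which ended the session.",
        "defensive",
    ),
]
jsonl = to_jsonl(records)
```

An SFT record teaches a single target answer, while a DPO record teaches a preference between two answers to the same prompt, which suits comparative signals such as lure effectiveness.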
Fine-tuning runs locally on GPU via Unsloth, using LoRA adapters on Qwen or Llama models.