What is PDX

PDX is a security analysis framework that transforms raw security events into structured training datasets for LLM fine-tuning. It is the pipeline that turns observations into knowledge.

PDX is not limited to HYDRA data. It has two input sources:

HYDRA sessions (passive) — SSH honeypot events via JSONL files
Burp Suite (active) — HTTP deltas captured during web pentests via a Java extension

Both sources produce data in the same .pdx format and flow through the same DataRouter, generators, and quality pipeline.
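The convergence of the two inputs can be sketched as follows. This is a minimal illustration, not PDX's actual code: the `Observation` record, its field names, and both adapter functions are hypothetical, and the record is an in-memory stand-in rather than the binary .pdx layout, which is not described here. The sample events are invented for the example.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Observation:
    """Hypothetical common record; stands in for a .pdx entry."""
    source: str  # "hydra" or "burp"
    mode: str    # "passive" or "active"
    payload: Dict[str, Any] = field(default_factory=dict)


def from_hydra_jsonl(text: str) -> List[Observation]:
    """Parse JSONL: one JSON object per non-empty line."""
    return [
        Observation(source="hydra", mode="passive", payload=json.loads(line))
        for line in text.splitlines()
        if line.strip()
    ]


def from_burp_delta(delta: Dict[str, Any]) -> Observation:
    """Wrap one HTTP delta (assumed to arrive as a dict) in the same record."""
    return Observation(source="burp", mode="active", payload=dict(delta))


# Invented sample input, for illustration only.
hydra_text = (
    '{"event": "login_attempt", "user": "root"}\n'
    '{"event": "command", "input": "uname -a"}\n'
)
observations = from_hydra_jsonl(hydra_text)
observations.append(
    from_burp_delta({"method": "GET", "status_before": 200, "status_after": 500})
)
```

Once both sources share one record type, everything downstream can ignore where an observation came from, which is the property the paragraph above describes.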

Core components

Component	Purpose
`.pdx` format	Binary format for security observations
Delta Vector 16D	16-dimensional scoring of each observation
DataRouter	Dual-use classification (defensive + offensive)
Burp bridge	Java extension + Python proxy for web pentesting
Training generators	7 generators: SFT, DPO, RAFT, ReAct, CoT, Chain, JS
Quality pipeline	Deduplication, filtering, curriculum ordering
Multi-model router	4-tier analysis: 7B → 32B → Anthropic API → fallback
8 collectors	NVD, ExploitDB, OWASP, MITRE ATT&CK, Nuclei, CWE, RFC, man pages
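The three stages named in the quality pipeline row (deduplication, filtering, curriculum ordering) can be sketched as below. The sample schema (`prompt`, `response`, `difficulty`), the length-based filter, and the easiest-first ordering are all assumptions made for illustration; the criteria PDX actually applies are not described in this section.

```python
import hashlib


def deduplicate(samples):
    """Drop exact duplicates, keyed on a hash of prompt and response."""
    seen = set()
    unique = []
    for sample in samples:
        raw = sample["prompt"] + "\x00" + sample["response"]
        key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


def filter_short(samples, min_chars=20):
    """Assumed filter: discard samples whose response is too short to teach anything."""
    return [s for s in samples if len(s["response"]) >= min_chars]


def curriculum_order(samples):
    """Assumed curriculum: easiest samples first (stable sort keeps ties in input order)."""
    return sorted(samples, key=lambda s: s["difficulty"])


def quality_pipeline(samples):
    return curriculum_order(filter_short(deduplicate(samples)))


# Invented samples, for illustration only.
samples = [
    {"prompt": "Classify: wget piped to sh",
     "response": "Likely a dropper download followed by execution.", "difficulty": 2},
    {"prompt": "Classify: uname -a",
     "response": "Host reconnaissance after login.", "difficulty": 1},
    {"prompt": "Classify: uname -a",
     "response": "Host reconnaissance after login.", "difficulty": 1},  # exact duplicate
    {"prompt": "Classify: ls", "response": "Benign.", "difficulty": 1},  # too short
]
cleaned = quality_pipeline(samples)
```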

The multi-model router

PDX doesn't rely on a single LLM. It uses a 4-tier cascade to analyze each security delta:

Tier 1: Copilot local (7B)     — fast first-pass on every delta
   ↓ uncertain?
Tier 2: Teacher local (32B)    — detailed second-pass
   ↓ still uncertain?
Tier 3: Anthropic API          — when complexity requires it
   ↓ unavailable?
Tier 4: WebChat fallback       — marked REQUIRES HUMAN VALIDATION

Each tier produces a verdict: VULNERABLE, NOT_VULN, INFORMATIONAL, UNCERTAIN, or FALSE_POS. When tiers disagree, the conflict is flagged for review. Nothing is discarded.
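The escalation logic can be sketched as follows. The verdict names come from the text above; everything else is an assumption: tiers are modeled as plain callables, a tier signals "unavailable" by returning `None`, and the tier labels are placeholders. The source does not say when several tiers give decisive verdicts on the same delta, so `has_conflict` is a standalone helper that works on any recorded list of verdicts.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Verdict(Enum):
    VULNERABLE = "VULNERABLE"
    NOT_VULN = "NOT_VULN"
    INFORMATIONAL = "INFORMATIONAL"
    UNCERTAIN = "UNCERTAIN"
    FALSE_POS = "FALSE_POS"


# Assumption: a tier returns a Verdict, or None when it is unavailable.
TierFn = Callable[[str], Optional[Verdict]]


@dataclass
class CascadeResult:
    final: Verdict
    history: List[Tuple[str, Verdict]]  # every verdict is kept
    requires_human_validation: bool


def run_cascade(delta: str, tiers: List[Tuple[str, TierFn]],
                fallback_name: str = "webchat") -> CascadeResult:
    """Try tiers in order; stop at the first verdict that is not UNCERTAIN."""
    history: List[Tuple[str, Verdict]] = []
    for name, tier in tiers:
        verdict = tier(delta)
        if verdict is None:  # tier unavailable, move to the next one
            continue
        history.append((name, verdict))
        if verdict is not Verdict.UNCERTAIN:
            break
    final = history[-1][1] if history else Verdict.UNCERTAIN
    answered_by = history[-1][0] if history else None
    return CascadeResult(final, history, answered_by == fallback_name)


def has_conflict(history: List[Tuple[str, Verdict]]) -> bool:
    """True when two tiers gave different decisive (non-UNCERTAIN) verdicts."""
    decisive = {v for _, v in history if v is not Verdict.UNCERTAIN}
    return len(decisive) > 1


# Stub tiers standing in for the real models (hypothetical behaviour).
tiers = [
    ("copilot-7b", lambda d: Verdict.UNCERTAIN),
    ("teacher-32b", lambda d: Verdict.UNCERTAIN),
    ("anthropic-api", lambda d: None),  # simulated outage
    ("webchat", lambda d: Verdict.VULNERABLE),
]
result = run_cascade("example delta", tiers)
```

In this run both local tiers are uncertain and the API tier is unavailable, so the fallback answers and the result is marked as requiring human validation.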

The 8 data collectors

Beyond processing HYDRA and Burp data, PDX enriches every observation with context from 8 external sources:

Collector	Source	What it adds
`nvd_collector`	NVD/NIST	CVE details, CVSS scores
`exploitdb_collector`	ExploitDB	Known exploits for detected versions
`owasp_collector`	OWASP	Web vulnerability classifications
`attackmitre_collector`	MITRE ATT&CK	Tactic/technique mapping
`nuclei_collector`	Nuclei templates	Detection signatures
`cwe_collector`	CWE database	Weakness classifications
`rfc_collector`	IETF RFCs	Protocol specifications
`manpage_collector`	Linux man pages	Command documentation
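One way to picture the enrichment step is a registry of collectors that each contribute a block of context to an observation. The collector names below are taken from the table, but their call signature, the `enrich` function, and the `context` key are assumptions, and the stub bodies are offline placeholders that query no real service.

```python
from typing import Any, Callable, Dict

# Assumption: a collector maps an observation to a (possibly empty) context dict.
Collector = Callable[[Dict[str, Any]], Dict[str, Any]]


def stub_nvd(observation: Dict[str, Any]) -> Dict[str, Any]:
    """Placeholder: only echoes a CVE id if the observation already names one."""
    cve = observation.get("cve")
    return {"cve_id": cve} if cve else {}


def stub_manpage(observation: Dict[str, Any]) -> Dict[str, Any]:
    """Placeholder: extracts the command name that a man page lookup would use."""
    words = observation.get("command", "").split()
    return {"command": words[0]} if words else {}


def enrich(observation: Dict[str, Any],
           collectors: Dict[str, Collector]) -> Dict[str, Any]:
    """Return a copy of the observation with per-collector context attached."""
    context: Dict[str, Any] = {}
    for name, collect in collectors.items():
        try:
            result = collect(observation)
        except Exception as exc:  # one failing source should not block the others
            context[name] = {"error": str(exc)}
            continue
        if result:
            context[name] = result
    enriched = dict(observation)
    enriched["context"] = context
    return enriched


collectors = {"nvd_collector": stub_nvd, "manpage_collector": stub_manpage}
observation = {"service": "ssh", "command": "uname -a"}
enriched = enrich(observation, collectors)
```

Keeping each collector's output under its own key makes it possible to trace every piece of added context back to its source.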

Output

PDX produces three types of training data:

Defensive — SFT pairs for pattern detection + DPO pairs for lure effectiveness measurement
Offensive — SFT pairs for attack chain reconstruction + RAFT for complete kill chains
Combined — ReAct traces analyzing the same sequence from both perspectives
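To make the difference between SFT and DPO records concrete, here is a small sketch of how such records might be shaped and serialized. The `prompt`/`chosen`/`rejected` layout is a common convention for preference pairs, but PDX's actual schema is not shown in this section, so every field name, the `perspective` tag, and the example texts are assumptions.

```python
import json


def sft_record(prompt, response, perspective):
    """Supervised pair: one prompt, one target response."""
    return {"type": "sft", "perspective": perspective,
            "prompt": prompt, "response": response}


def dpo_record(prompt, chosen, rejected, perspective):
    """Preference pair: one prompt, a preferred and a dispreferred response."""
    if chosen == rejected:
        raise ValueError("chosen and rejected responses must differ")
    return {"type": "dpo", "perspective": perspective,
            "prompt": prompt, "chosen": chosen, "rejected": rejected}


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in records)


# Invented examples, for illustration only.
records = [
    sft_record("Session: login as root, then 'uname -a'. What pattern is this?",
               "Post-login host reconnaissance.", "defensive"),
    dpo_record("Which lure kept the attacker engaged longer?",
               "The fake credentials file.", "The empty home directory.", "defensive"),
]
jsonl_text = to_jsonl(records)
```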

Fine-tuning runs locally via Unsloth, training LoRA adapters on Qwen or Llama base models on a local GPU.