What is PDX

PDX is a security analysis framework that transforms raw security events into structured training datasets for LLM fine-tuning. It is the pipeline that turns observations into knowledge.

PDX is not limited to HYDRA data. It has two input sources:

  • HYDRA sessions (passive) — SSH honeypot events via JSONL files
  • Burp Suite (active) — HTTP deltas captured during web pentests via a Java extension

Both sources produce data in the same .pdx format, and that data flows through the same DataRouter, generators, and quality pipeline.
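To make the passive input path concrete, here is a minimal sketch of reading HYDRA-style JSONL events into one common record type. The `Observation` class, the `read_hydra_jsonl` function, and the `event` field name are all hypothetical, chosen only for illustration. The real .pdx format is binary and is not reproduced here.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Iterable, Iterator


@dataclass
class Observation:
    """Hypothetical common record shared by both input sources."""
    source: str                      # "hydra" or "burp"
    kind: str                        # event type, taken from an assumed "event" field
    payload: dict[str, Any] = field(default_factory=dict)


def read_hydra_jsonl(lines: Iterable[str]) -> Iterator[Observation]:
    """Yield one Observation per valid JSON object, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # a real pipeline would likely log this instead of dropping it silently
        if not isinstance(event, dict):
            continue
        yield Observation(source="hydra", kind=str(event.get("event", "unknown")), payload=event)
```

Passing an open file object works as well as a list of strings, since both are iterables of lines.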

Core components

| Component | Purpose |
| --- | --- |
| .pdx format | Binary format for security observations |
| Delta Vector 16D | 16-dimensional scoring of each observation |
| DataRouter | Dual-use classification (defensive + offensive) |
| Burp bridge | Java extension + Python proxy for web pentesting |
| Training generators | 7 generators: SFT, DPO, RAFT, ReAct, CoT, Chain, JS |
| Quality pipeline | Deduplication, filtering, curriculum ordering |
| Multi-model router | 4-tier analysis: 7B → 32B → Anthropic API → fallback |
| 8 collectors | NVD, ExploitDB, OWASP, MITRE ATT&CK, Nuclei, CWE, RFC, man pages |
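The three quality-pipeline stages named above (deduplication, filtering, curriculum ordering) can be sketched as plain functions over a list of samples. This is an illustration under assumptions, not PDX's implementation: the `prompt`, `completion`, and `difficulty` fields, the `min_chars` threshold, and all function names are hypothetical.

```python
import hashlib
import json


def dedup(samples: list[dict]) -> list[dict]:
    """Drop exact duplicates, keeping the first occurrence of each sample."""
    seen: set[str] = set()
    unique = []
    for sample in samples:
        # Canonical JSON (sorted keys) so key order does not affect the hash.
        key = hashlib.sha256(json.dumps(sample, sort_keys=True).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


def quality_filter(samples: list[dict], min_chars: int = 20) -> list[dict]:
    """Keep samples whose completion is at least min_chars long (assumed criterion)."""
    return [s for s in samples if len(s.get("completion", "")) >= min_chars]


def curriculum_order(samples: list[dict]) -> list[dict]:
    """Sort easiest first by an assumed numeric 'difficulty' field; the sort is stable."""
    return sorted(samples, key=lambda s: s.get("difficulty", 0.0))


def run_quality_pipeline(samples: list[dict]) -> list[dict]:
    """Apply the three stages in the order the table lists them."""
    return curriculum_order(quality_filter(dedup(samples)))
```

Running deduplication first keeps the later stages from processing samples that would be dropped anyway.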

The multi-model router

PDX doesn't rely on a single LLM. It uses a 4-tier cascade to analyze each security delta:

Tier 1: Copilot local (7B)     — fast first-pass on every delta
   ↓ uncertain?
Tier 2: Teacher local (32B)    — detailed second-pass
   ↓ still uncertain?
Tier 3: Anthropic API          — when complexity requires it
   ↓ unavailable?
Tier 4: WebChat fallback       — marked REQUIRES HUMAN VALIDATION

Each tier produces a verdict: VULNERABLE, NOT_VULN, INFORMATIONAL, UNCERTAIN, or FALSE_POS. When tiers disagree, the conflict is flagged for review. Nothing is discarded.
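The escalation logic can be sketched as follows. The verdict names come from the list above. Everything else is an assumption made for illustration: the `run_cascade` function, the `CascadeResult` type, and the convention that a tier returns `None` when it is unavailable. Conflict flagging is not modeled here; the sketch only keeps the full per-tier history so that nothing is discarded.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    VULNERABLE = "VULNERABLE"
    NOT_VULN = "NOT_VULN"
    INFORMATIONAL = "INFORMATIONAL"
    UNCERTAIN = "UNCERTAIN"
    FALSE_POS = "FALSE_POS"


# Hypothetical tier signature: takes a delta, returns a Verdict or None if unavailable.
Tier = Callable[[dict], Optional[Verdict]]


@dataclass
class CascadeResult:
    history: list[tuple[str, Optional[Verdict]]]  # every tier that was consulted
    final: Optional[Verdict]
    requires_human_validation: bool


def run_cascade(
    delta: dict,
    local_tiers: list[tuple[str, Tier]],
    api_tier: tuple[str, Tier],
    fallback_tier: tuple[str, Tier],
) -> CascadeResult:
    history: list[tuple[str, Optional[Verdict]]] = []

    # Tiers 1 and 2: escalate only while the verdict is UNCERTAIN.
    for name, tier in local_tiers:
        verdict = tier(delta)
        history.append((name, verdict))
        if verdict is not Verdict.UNCERTAIN:
            return CascadeResult(history, verdict, False)

    # Tier 3: remote API; None models "unavailable" (an assumption of this sketch).
    api_name, api = api_tier
    verdict = api(delta)
    history.append((api_name, verdict))
    if verdict is not None:
        return CascadeResult(history, verdict, False)

    # Tier 4: fallback, always marked as requiring human validation.
    fb_name, fallback = fallback_tier
    verdict = fallback(delta)
    history.append((fb_name, verdict))
    return CascadeResult(history, verdict, True)
```

In this sketch each tier is just a callable, so the local models and the API client can be swapped for stubs when testing the routing on its own.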

The 8 data collectors

PDX does more than process HYDRA and Burp data. It enriches every observation with context from 8 external sources:

| Collector | Source | What it adds |
| --- | --- | --- |
| nvd_collector | NVD/NIST | CVE details, CVSS scores |
| exploitdb_collector | ExploitDB | Known exploits for detected versions |
| owasp_collector | OWASP | Web vulnerability classifications |
| attackmitre_collector | MITRE ATT&CK | Tactic/technique mapping |
| nuclei_collector | Nuclei templates | Detection signatures |
| cwe_collector | CWE database | Weakness classifications |
| rfc_collector | IETF RFCs | Protocol specifications |
| manpage_collector | Linux man pages | Command documentation |
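One way to picture the enrichment step is a registry that maps collector names to callables, each returning extra context for an observation. This is a sketch under assumptions: the `enrich` function, the `Collector` signature, and the `context` key are hypothetical, and the stub collectors in the usage example stand in for real lookups.

```python
from typing import Callable

# Hypothetical collector signature: observation in, context dict out (may be empty).
Collector = Callable[[dict], dict]


def enrich(observation: dict, collectors: dict[str, Collector]) -> dict:
    """Return a copy of the observation with a 'context' entry per collector that answered."""
    context: dict[str, dict] = {}
    for name, collect in collectors.items():
        try:
            result = collect(observation)
        except Exception as exc:
            # Record the failure instead of aborting, so one broken source
            # does not block the remaining collectors.
            context[name] = {"error": str(exc)}
            continue
        if result:
            context[name] = result
    enriched = dict(observation)
    enriched["context"] = context
    return enriched
```

A usage example with stub collectors (the returned values are placeholders, not real lookup results):

```python
def failing_collector(obs: dict) -> dict:
    raise TimeoutError("source unreachable")

stubs = {
    "manpage_collector": lambda obs: {"command": obs["input"].split()[0]},
    "rfc_collector": lambda obs: {},          # nothing relevant found
    "nvd_collector": failing_collector,       # simulated outage
}
enriched = enrich({"input": "uname -a"}, stubs)
```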

Output

PDX produces three types of training data:

  • Defensive — SFT pairs for pattern detection + DPO pairs for lure effectiveness measurement
  • Offensive — SFT pairs for attack chain reconstruction + RAFT for complete kill chains
  • Combined — ReAct traces analyzing the same sequence from both perspectives
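For readers unfamiliar with these dataset types, the sketch below shows what SFT and DPO records commonly look like when serialized as JSONL. The field names (`prompt`, `completion`, `chosen`, `rejected`) follow a widespread convention in preference-tuning tooling and are an assumption here; the actual PDX output schema is not shown on this page, and the helper names are hypothetical.

```python
import json


def sft_record(prompt: str, completion: str) -> dict:
    """A supervised fine-tuning pair: one input, one target answer."""
    return {"prompt": prompt, "completion": completion}


def dpo_record(prompt: str, chosen: str, rejected: str) -> dict:
    """A preference pair: the same prompt with a preferred and a dispreferred answer."""
    if chosen == rejected:
        raise ValueError("chosen and rejected must differ to carry a preference signal")
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
```

Keeping the two record types as separate files is a common choice, since SFT and DPO trainers typically expect different columns.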

Fine-tuning runs locally via Unsloth with LoRA adapters on Qwen or Llama models, directly on GPU.