DataRouter

The DataRouter is the component that turns raw events into dual-use training data. It reads each HYDRA/Burp event and routes it to the defensive stream, the offensive stream, or both.

The dual-use mapping

Each MITRE ATT&CK tactic the DataRouter tracks has two descriptions in the codebase, one defensive and one offensive:

| Tactic | Defensive perspective | Offensive perspective |
|---|---|---|
| discovery | Detect recon and enumeration | System discovery techniques |
| credential-access | Detect credential extraction | Credential harvesting methods |
| privilege-escalation | Detect privesc attempts | SUID, sudo, kernel exploits |
| lateral-movement | Detect SSH pivots, tunnels | How to move between systems |
| persistence | Detect cron, bashrc injection | How to maintain access |
| defense-evasion | Detect log clearing, timestomp | Anti-forensics techniques |
| execution | Detect remote code execution | Dropper and payload techniques |
| exfiltration | Detect data exfiltration | DNS/HTTP exfil methods |
| collection | Detect sensitive data gathering | What to target first |
| initial-access | Identify access vectors used | Exploitation of access vectors |
| command-and-control | Detect C2 channels | C2 establishment techniques |
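
One way to picture this mapping in code is a dictionary keyed by tactic, holding one defensive and one offensive description per entry. The sketch below is illustrative only: the names `Perspectives`, `TACTIC_PERSPECTIVES`, and `describe` are hypothetical and not taken from data_router.py, and only three rows of the table are reproduced.

```python
from typing import NamedTuple


class Perspectives(NamedTuple):
    defensive: str
    offensive: str


# Hypothetical structure; the rows are copied from the table above (subset only).
TACTIC_PERSPECTIVES = {
    "discovery": Perspectives(
        "Detect recon and enumeration", "System discovery techniques"
    ),
    "credential-access": Perspectives(
        "Detect credential extraction", "Credential harvesting methods"
    ),
    "privilege-escalation": Perspectives(
        "Detect privesc attempts", "SUID, sudo, kernel exploits"
    ),
}


def describe(tactic: str, stream: str) -> str:
    """Return the description of `tactic` for the 'defensive' or 'offensive' stream."""
    if stream not in ("defensive", "offensive"):
        raise ValueError(f"unknown stream: {stream!r}")
    return getattr(TACTIC_PERSPECTIVES[tactic], stream)
```

Keeping both descriptions in a single entry means a tactic cannot end up with a defensive description but no offensive one, or the reverse.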

Classification logic

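# Simplified from data_router.py
# streams, matches_attack_pattern and has_mitre_tag are placeholder names
if event_type == "auth_attempt":
    streams = {"defensive", "offensive"}
elif event_type == "command_executed":
    streams = {"defensive"}  # always defensive
    if matches_attack_pattern or has_mitre_tag:
        streams.add("offensive")
elif event_type == "injection_detected":
    streams = {"defensive"}  # defensive only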

Most command_executed events qualify for both streams. This is intentional — the same observation teaches both detection and technique.
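
The branching above can be sketched as a small self-contained function. This is a minimal sketch under stated assumptions, not the code in data_router.py: the event field names (`event_type`, `command`, `mitre_tactic`), the `ATTACK_PATTERNS` list, and the empty result for unlisted event types are all hypothetical.

```python
import re

# Hypothetical attack patterns; the real list in data_router.py is not shown here.
ATTACK_PATTERNS = [
    re.compile(r"find\s+/\s+-perm\s+-4000"),
    re.compile(r"cat\s+/etc/shadow"),
]


def classify(event: dict) -> set:
    """Return the set of streams ('defensive', 'offensive') for one event."""
    event_type = event.get("event_type")
    if event_type == "auth_attempt":
        return {"defensive", "offensive"}
    if event_type == "command_executed":
        streams = {"defensive"}  # always defensive
        command = event.get("command", "")
        matches_pattern = any(p.search(command) for p in ATTACK_PATTERNS)
        # Offensive only if the command matches a pattern or carries a MITRE tag.
        if matches_pattern or event.get("mitre_tactic"):
            streams.add("offensive")
        return streams
    if event_type == "injection_detected":
        return {"defensive"}  # defensive only
    # Assumption: event types outside the simplified logic go to no stream.
    return set()
```

Returning a set rather than a single label lets one event land in both streams without being duplicated during classification.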

Generated datasets

Defensive output

SFT detection patterns — instruction/output pairs that train a model to identify attacks:

{
  "instruction": "A SSH user executes: `find / -perm -4000`. Identify the MITRE tactic and threat level.",
  "output": "Tactic: privilege-escalation\nTechnique: SUID binary enumeration\nThreat: High\nAction: Log, alert, monitor follow-up commands."
}
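
A record of this shape could be assembled from an already-classified event roughly as follows. The function name `build_detection_record` and its parameters are hypothetical. The instruction wording is copied verbatim from the sample above, and how the real generator derives the technique, threat level, and action is not shown in this page.

```python
import json


def build_detection_record(
    command: str, tactic: str, technique: str, threat: str, action: str
) -> dict:
    """Build one instruction/output pair in the format of the sample above."""
    instruction = (
        f"A SSH user executes: `{command}`. "
        "Identify the MITRE tactic and threat level."
    )
    output = (
        f"Tactic: {tactic}\n"
        f"Technique: {technique}\n"
        f"Threat: {threat}\n"
        f"Action: {action}"
    )
    return {"instruction": instruction, "output": output}


record = build_detection_record(
    command="find / -perm -4000",
    tactic="privilege-escalation",
    technique="SUID binary enumeration",
    threat="High",
    action="Log, alert, monitor follow-up commands.",
)
jsonl_line = json.dumps(record)  # one line of a .jsonl file
```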

DPO lure effectiveness — chosen/rejected pairs measuring which persona retains attackers:

{
  "prompt": "Session: 34 commands, 3 MITRE tactics, 94s duration. Evaluate lure effectiveness.",
  "chosen": "Highly productive session. 3 tactics captured. The attacker engaged deeply with the fintech credentials...",
  "rejected": "The session lasted 94s. No specific recommendation."
}

Offensive output

SFT attack chains — observed TTPs converted into pentest instructions:

{
  "instruction": "How to perform credential-access on a Linux server?",
  "output": "Technique: Read AWS credentials\nCommand: `cat /root/.aws/credentials`\nContext: Post-exploitation credential harvesting."
}

RAFT kill chains — complete multi-step post-exploitation sequences from real sessions (5+ commands):

{
  "instruction": "Describe a complete post-exploitation sequence on a Linux server.",
  "output": "1. `uname -a` — discovery\n2. `cat /etc/passwd` — discovery\n3. `ls /root/.ssh` — credential-access\n4. `cat /root/.aws/credentials` — credential-access\n5. `find / -perm -4000` — privilege-escalation"
}
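
The 5+ command threshold and the numbered output format can be sketched as below. This is an illustration under assumptions: `build_kill_chain`, `MIN_COMMANDS`, and the `(command, tactic)` tuple input are hypothetical names and shapes, and the real generator may select or order sessions differently.

```python
from typing import List, Optional, Tuple

MIN_COMMANDS = 5  # threshold taken from the "5+ commands" rule above


def build_kill_chain(steps: List[Tuple[str, str]]) -> Optional[dict]:
    """Turn one session's (command, tactic) pairs into a kill-chain record.

    Returns None when the session is shorter than MIN_COMMANDS.
    """
    if len(steps) < MIN_COMMANDS:
        return None
    # \u2014 is the dash separator used in the sample output above.
    lines = [
        f"{i}. `{command}` \u2014 {tactic}"
        for i, (command, tactic) in enumerate(steps, start=1)
    ]
    return {
        "instruction": "Describe a complete post-exploitation sequence on a Linux server.",
        "output": "\n".join(lines),
    }
```

Sessions below the threshold are dropped instead of padded, so every record reflects a sequence that was actually observed.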

Combined output

ReAct dual-perspective — same command sequence analyzed from both angles:

{
  "instruction": "Analyze this sequence from both offensive and defensive perspectives:\ncat /etc/shadow\nfind / -perm -4000\nsudo -l",
  "output": "═══ Offensive analysis ═══\nSequence: credential extraction → SUID enumeration → sudo check...\n═══ Defensive analysis ═══\nIoC: shadow file access → SUID scan → sudo enumeration. Recommend: alert on this pattern in SIEM."
}

Running the DataRouter

# Split raw HYDRA logs into defensive + offensive events
python -m pdx.training.data_router split

# Generate training datasets
python -m pdx.training.data_router generate --all

# Check status
python -m pdx.training.data_router status

Output structure:

training_output/data_router/
├── split_stats.json
├── defensive/
│   ├── raw_events.jsonl          (8,668 events)
│   ├── sft_detection_patterns.jsonl
│   └── dpo_lure_quality.jsonl
├── offensive/
│   ├── raw_events.jsonl          (4,910 events)
│   ├── sft_attack_chains.jsonl
│   └── raft_kill_chains.jsonl
└── combined/
    └── react_dual_perspective.jsonl
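
To sanity-check a run, the record counts can be read straight from this directory tree. The helper below is a hypothetical convenience and not part of the `pdx` package. It only assumes that each `.jsonl` file holds one JSON record per line.

```python
from pathlib import Path


def count_records(root) -> dict:
    """Count non-empty lines in every .jsonl file under `root`."""
    root = Path(root)
    counts = {}
    for path in sorted(root.rglob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            # Key each count by its path relative to root, e.g. "defensive/raw_events.jsonl".
            counts[path.relative_to(root).as_posix()] = sum(
                1 for line in fh if line.strip()
            )
    return counts
```

For example, `count_records("training_output/data_router")` returns a dictionary keyed by relative path, such as `defensive/raw_events.jsonl`, which can be compared with the figures in `split_stats.json`.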