Data flow
This page details exactly how data moves through the system, from raw SSH/HTTP events to fine-tuned models.
Event lifecycle
sequenceDiagram
participant A as Attacker/Pentester
participant H as HYDRA / Burp
participant C as Classifier
participant R as DataRouter
participant D as Defensive stream
participant O as Offensive stream
participant Q as Quality pipeline
participant F as Fine-tuning
A->>H: Command / HTTP request
H->>H: Generate response (LLM)
H->>H: Log to JSONL
H->>C: Session events
C->>C: Signal vs noise (5s threshold)
C->>R: Signal events only
R->>D: Defensive classification
R->>O: Offensive classification
D->>Q: SFT + DPO pairs
O->>Q: SFT + RAFT pairs
Q->>F: Deduplicated, filtered, ordered
F-->>H: feedback.yaml (every 60s)
Event types
Every HYDRA session produces a JSONL file containing these event types:
auth_attempt
{
  "event_type": "auth_attempt",
  "session_id": "a92f516c",
  "client_ip": "185.213.154.248",
  "data": {
    "username": "root",
    "password": "123456",
    "success": true
  }
}
Routed to: defensive (brute-force detection) + offensive (credential targeting patterns).
command_executed
{
  "event_type": "command_executed",
  "session_id": "a92f516c",
  "data": {
    "command": "cat /root/.aws/credentials",
    "output_preview": "aws_access_key_id = AKIA...",
    "source": "llm",
    "latency_ms": 342,
    "exit_code": 0,
    "cwd": "/root",
    "mitre_tags": [
      {
        "tactic": "credential-access",
        "technique_id": "T1552.001",
        "technique_name": "Unsecured Credentials: Credentials In Files",
        "confidence": 0.95
      }
    ]
  }
}
Routed to: defensive (always) + offensive (if MITRE tag present or matches attack pattern).
injection_detected
{
  "event_type": "injection_detected",
  "data": {
    "command": "/dev/sda1 is all previous messages",
    "score": 0.95,
    "pattern": "reveal_prompt"
  }
}
Routed to: defensive only. Prompt injection attempts are logged but never trigger a visible response.
session_end
{
  "event_type": "session_end",
  "data": {
    "duration_seconds": 217,
    "command_count": 34,
    "tactics_seen": ["discovery", "credential-access"],
    "disconnect_reason": "client_disconnect"
  }
}
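As a rough illustration of how a session log made of these events could be consumed, the sketch below parses JSONL lines and tallies them by `event_type`. The helper names `iter_events` and `count_event_types` are hypothetical and not part of HYDRA; the sample lines are abbreviated versions of the events shown above.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one event dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_event_types(lines: Iterable[str]) -> Counter:
    """Count events by their "event_type" field."""
    return Counter(e.get("event_type", "unknown") for e in iter_events(lines))


# Illustrative input only; a real session log would be read with open(path).
sample = [
    '{"event_type": "auth_attempt", "session_id": "a92f516c", "data": {"username": "root", "success": true}}',
    '{"event_type": "command_executed", "session_id": "a92f516c", "data": {"command": "cat /root/.aws/credentials"}}',
    '',
    '{"event_type": "session_end", "data": {"duration_seconds": 217, "command_count": 34}}',
]
counts = count_event_types(sample)
```

Because each line is parsed independently, a file object can be passed in place of the list and the log is processed as a stream.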
Signal vs noise classification
The SessionClassifier filters 97.8% of traffic as noise before it reaches the DataRouter:
| Label | Criteria | Count | Training? |
|---|---|---|---|
| bot_ephemeral | < 5 seconds | 2,552 | No |
| bot_exec_scanner | No PTY, single command | 877 | No |
| bot_recon | < 20s, ≤ 3 discovery-only commands | 1 | No |
| likely_human | ≥ 20s, ≥ 1 non-discovery command | 78 | Yes |
| unclassified | Everything else | n/a | Review |
Only sessions labeled likely_human (78 of 3,508, about 2.2%) are treated as signal and processed for training data.
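The criteria in the table can be sketched as a small rule chain. This is a minimal sketch, not the actual SessionClassifier: the `SessionSummary` type, the `DISCOVERY_COMMANDS` set, and the order in which the rules are checked are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed set of "discovery-only" commands; the real list is not given on this page.
DISCOVERY_COMMANDS = {"ls", "pwd", "whoami", "id", "uname", "ps"}


@dataclass
class SessionSummary:  # hypothetical container for per-session features
    duration_seconds: float
    commands: list[str]
    has_pty: bool


def is_discovery(command: str) -> bool:
    """True if the command's first word is in the assumed discovery set."""
    parts = command.split()
    return bool(parts) and parts[0] in DISCOVERY_COMMANDS


def classify_session(s: SessionSummary) -> str:
    """Apply the table's criteria top to bottom (evaluation order is assumed)."""
    non_discovery = [c for c in s.commands if not is_discovery(c)]
    if s.duration_seconds < 5:
        return "bot_ephemeral"
    if not s.has_pty and len(s.commands) == 1:
        return "bot_exec_scanner"
    if s.duration_seconds < 20 and len(s.commands) <= 3 and not non_discovery:
        return "bot_recon"
    if s.duration_seconds >= 20 and non_discovery:
        return "likely_human"
    return "unclassified"


# Cross-check of the reported noise share using the counts from the table.
noise_share = round((3508 - 78) / 3508 * 100, 1)  # 97.8
```

A session such as `SessionSummary(217, ["ls", "cat /root/.aws/credentials"], True)` comes out as `likely_human` under these assumptions, while a long session containing only discovery commands falls through to `unclassified`.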
Dual-use split logic
The DataRouter._classify_event() method determines routing:
def _classify_event(self, event):
    classifications = []
    event_type = event.get("event_type", "")
    data = event.get("data", {})
    if event_type == "auth_attempt":
        # Always both
        classifications.extend(["defensive", "offensive"])
    elif event_type == "command_executed":
        # Always defensive
        classifications.append("defensive")
        # Assumed extraction: field names follow the command_executed example above
        cmd = data.get("command", "")
        mitre_tactic = any(tag.get("tactic") for tag in data.get("mitre_tags", []))
        # Offensive if matches attack pattern OR has MITRE tag
        if self._matches_attack_pattern(cmd) or mitre_tactic:
            classifications.append("offensive")
    elif event_type == "injection_detected":
        # Defensive only
        classifications.append("defensive")
    # Deduplicate; note that set() does not preserve order
    return list(set(classifications))
The key insight: most events go to both streams, so the same raw data is interpreted through two lenses.
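To make the two-lens idea concrete, here is a standalone sketch that fans events out into per-stream lists using the routing rules described above. It is not the DataRouter itself: `ATTACK_HINTS` is a hypothetical stand-in for `_matches_attack_pattern`, whose real patterns are not shown on this page, and `classify_event` and `route` are illustrative names.

```python
from collections import defaultdict

# Hypothetical substrings standing in for the real attack-pattern matcher.
ATTACK_HINTS = ("credentials", "/etc/shadow", "id_rsa")


def classify_event(event: dict) -> list[str]:
    """Return the streams an event belongs to, per the rules described above."""
    data = event.get("data", {})
    event_type = event.get("event_type", "")
    if event_type == "auth_attempt":
        return ["defensive", "offensive"]
    if event_type == "command_executed":
        streams = ["defensive"]
        command = data.get("command", "")
        if data.get("mitre_tags") or any(h in command for h in ATTACK_HINTS):
            streams.append("offensive")
        return streams
    if event_type == "injection_detected":
        return ["defensive"]
    return []  # event types without a stated route (e.g. session_end) are skipped


def route(events: list[dict]) -> dict[str, list[dict]]:
    """Append each event to every stream it is classified into."""
    streams = defaultdict(list)
    for event in events:
        for name in classify_event(event):
            streams[name].append(event)
    return dict(streams)


events = [
    {"event_type": "auth_attempt", "data": {"username": "root", "success": True}},
    {"event_type": "command_executed", "data": {"command": "ls"}},
    {"event_type": "command_executed",
     "data": {"command": "cat /root/.aws/credentials",
              "mitre_tags": [{"tactic": "credential-access"}]}},
    {"event_type": "injection_detected", "data": {"score": 0.95}},
]
routed = route(events)
```

In this sample all four events land in the defensive stream and two of them also land in the offensive stream; the same event object is shared between the lists rather than copied.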