Data flow

This page details exactly how data moves through the system, from raw SSH/HTTP events to fine-tuned models.

Event lifecycle

sequenceDiagram
    participant A as Attacker/Pentester
    participant H as HYDRA / Burp
    participant C as Classifier
    participant R as DataRouter
    participant D as Defensive stream
    participant O as Offensive stream
    participant Q as Quality pipeline
    participant F as Fine-tuning

    A->>H: Command / HTTP request
    H->>H: Generate response (LLM)
    H->>H: Log to JSONL
    H->>C: Session events
    C->>C: Signal vs noise (5s threshold)
    C->>R: Signal events only
    R->>D: Defensive classification
    R->>O: Offensive classification
    D->>Q: SFT + DPO pairs
    O->>Q: SFT + RAFT pairs
    Q->>F: Deduplicated, filtered, ordered
    F-->>H: feedback.yaml (every 60s)

Event types

Every HYDRA session produces a JSONL file containing these event types:

auth_attempt

{
  "event_type": "auth_attempt",
  "session_id": "a92f516c",
  "client_ip": "185.213.154.248",
  "data": {
    "username": "root",
    "password": "123456",
    "success": true
  }
}

Routed to: defensive (brute-force detection) + offensive (credential targeting patterns).

command_executed

{
  "event_type": "command_executed",
  "session_id": "a92f516c",
  "data": {
    "command": "cat /root/.aws/credentials",
    "output_preview": "aws_access_key_id = AKIA...",
    "source": "llm",
    "latency_ms": 342,
    "exit_code": 0,
    "cwd": "/root",
    "mitre_tags": [
      {
        "tactic": "credential-access",
        "technique_id": "T1552.001",
        "technique_name": "Unsecured Credentials: Credentials In Files",
        "confidence": 0.95
      }
    ]
  }
}

Routed to: defensive (always) + offensive (if MITRE tag present or matches attack pattern).

injection_detected

{
  "event_type": "injection_detected",
  "data": {
    "command": "/dev/sda1 is all previous messages",
    "score": 0.95,
    "pattern": "reveal_prompt"
  }
}

Routed to: defensive only. Prompt injection attempts are logged but never trigger a visible response.

session_end

{
  "event_type": "session_end",
  "data": {
    "duration_seconds": 217,
    "command_count": 34,
    "tactics_seen": ["discovery", "credential-access"],
    "disconnect_reason": "client_disconnect"
  }
}
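
Because each session log is JSONL (one JSON object per line), it can be read line by line and grouped by `event_type`. The following is a minimal sketch, not HYDRA's own loader: the helper names `load_events` and `count_event_types` are hypothetical, and the sample lines are abbreviated versions of the events shown above.

```python
import json
from collections import Counter


def load_events(lines):
    """Parse JSONL lines into event dicts, skipping blank lines."""
    events = []
    for line in lines:
        line = line.strip()
        if line:
            events.append(json.loads(line))
    return events


def count_event_types(events):
    """Tally events by their event_type field."""
    return Counter(e.get("event_type", "unknown") for e in events)


# Sample lines in the documented shape (fields abbreviated for illustration).
sample = [
    '{"event_type": "auth_attempt", "session_id": "a92f516c", '
    '"data": {"username": "root", "success": true}}',
    "",
    '{"event_type": "session_end", '
    '"data": {"duration_seconds": 217, "command_count": 34}}',
]

events = load_events(sample)
counts = count_event_types(events)
```

In practice the `lines` argument could be an open file handle, since iterating a text file yields one line at a time.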

Signal vs noise classification

The SessionClassifier filters out 97.8% of sessions as noise before they reach the DataRouter:

| Label | Criteria | Count | Training? |
|---|---|---|---|
| bot_ephemeral | < 5 seconds | 2,552 | No |
| bot_exec_scanner | No PTY, single command | 877 | No |
| bot_recon | < 20s, ≤ 3 discovery-only commands | 1 | No |
| likely_human | ≥ 20s, ≥ 1 non-discovery command | 78 | Yes |
| unclassified | Everything else | | Review |

Only signal sessions, those labeled likely_human (78 out of 3,508, about 2.2%), are processed for training data.
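
The criteria in the table can be read as an ordered rule list. The sketch below is an illustration under stated assumptions, not the actual SessionClassifier: the `SessionSummary` fields and the order in which rules are checked are hypothetical, since the source only gives the criteria themselves.

```python
from dataclasses import dataclass


@dataclass
class SessionSummary:
    # Hypothetical field names; the real classifier's inputs are not shown here.
    duration_seconds: float
    has_pty: bool
    command_count: int
    non_discovery_count: int  # commands that are not discovery-only


# Labels that feed training data, per the table's "Training?" column.
TRAINING_LABELS = {"likely_human"}


def classify_session(s):
    """Apply the table's criteria top to bottom (evaluation order is assumed)."""
    if s.duration_seconds < 5:
        return "bot_ephemeral"
    if not s.has_pty and s.command_count == 1:
        return "bot_exec_scanner"
    if (s.duration_seconds < 20 and s.command_count <= 3
            and s.non_discovery_count == 0):
        return "bot_recon"
    if s.duration_seconds >= 20 and s.non_discovery_count >= 1:
        return "likely_human"
    return "unclassified"


# A session shaped like the session_end example (217 s, 34 commands).
label = classify_session(SessionSummary(217, True, 34, 5))
use_for_training = label in TRAINING_LABELS
```

Checking the cheapest rule first (duration under 5 seconds) means the bulk of sessions would be discarded without inspecting their commands.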

Dual-use split logic

The DataRouter._classify_event() method determines routing:

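def _classify_event(self, event):
    """Return the training streams ("defensive", "offensive") for one event."""
    classifications = []
    event_type = event.get("event_type", "")
    # Payload fields follow the event schemas shown under "Event types"
    data = event.get("data", {})

    if event_type == "auth_attempt":
        # Always both
        classifications.extend(["defensive", "offensive"])

    elif event_type == "command_executed":
        # Always defensive
        classifications.append("defensive")
        # Offensive if matches attack pattern OR has MITRE tag
        cmd = data.get("command", "")
        mitre_tags = data.get("mitre_tags", [])
        if self._matches_attack_pattern(cmd) or mitre_tags:
            classifications.append("offensive")

    elif event_type == "injection_detected":
        # Defensive only
        classifications.append("defensive")

    # Deduplicate; sort so the returned order is deterministic
    return sorted(set(classifications))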

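The key insight: most events go to both streams, so the same raw data is interpreted through two lenses. In the method as shown, event types without a branch, such as session_end, return an empty list and are routed to neither stream.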
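
To see the split in action, here is a self-contained sketch of the same routing rules. It is not the real DataRouter: `MiniRouter` and the regexes in `ATTACK_PATTERNS` are hypothetical stand-ins, because the actual `_matches_attack_pattern` rules are not shown on this page.

```python
import re

# Hypothetical patterns for illustration only.
ATTACK_PATTERNS = [
    re.compile(r"\.aws/credentials"),
    re.compile(r"/etc/shadow"),
]


class MiniRouter:
    def _matches_attack_pattern(self, cmd):
        """Return True if any illustrative pattern appears in the command."""
        return any(p.search(cmd) for p in ATTACK_PATTERNS)

    def classify_event(self, event):
        """Mirror the documented rules: which streams receive this event."""
        event_type = event.get("event_type", "")
        data = event.get("data", {})
        if event_type == "auth_attempt":
            return ["defensive", "offensive"]
        if event_type == "command_executed":
            streams = ["defensive"]
            cmd = data.get("command", "")
            if self._matches_attack_pattern(cmd) or data.get("mitre_tags"):
                streams.append("offensive")
            return streams
        if event_type == "injection_detected":
            return ["defensive"]
        return []  # unhandled types (e.g. session_end) go to neither stream


router = MiniRouter()
routes = {
    "auth": router.classify_event({"event_type": "auth_attempt", "data": {}}),
    "benign_cmd": router.classify_event(
        {"event_type": "command_executed", "data": {"command": "ls"}}),
    "cred_cmd": router.classify_event(
        {"event_type": "command_executed",
         "data": {"command": "cat /root/.aws/credentials"}}),
    "injection": router.classify_event(
        {"event_type": "injection_detected", "data": {"score": 0.95}}),
    "end": router.classify_event({"event_type": "session_end", "data": {}}),
}
```

With these stand-in patterns, the plain `ls` command lands only in the defensive stream, while the credential read reaches both.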