# HYDRA × PDX
Dual-use cybersecurity pipeline — LLM-powered honeypot meets security training data generator.
## The problem
In cybersecurity, a honeypot is a fake server deliberately exposed on the internet to attract attackers. You let them in, watch what they do, and learn from their techniques.
The problem is that today's honeypots are trivially detectable. An experienced attacker runs `uname -r` and sees the wrong kernel, or checks `/proc/1/cgroup` and spots Docker traces. Tools like Cowrie, the most popular SSH honeypot, get fingerprinted in under 30 seconds.
Result: attackers disconnect instantly. Your logs are noise, not intelligence.
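To make the two checks above concrete, here is a minimal sketch of what such a fingerprinting test might look like from the attacker's side. The marker list and both function names are illustrative assumptions, not taken from any real tool; actual fingerprinting scripts combine many more signals.

```python
# Hypothetical marker list: substrings that commonly appear in
# /proc/1/cgroup when PID 1 runs inside a container.
CONTAINER_MARKERS = ("docker", "kubepods", "containerd", "lxc")


def looks_containerized(cgroup_text: str) -> bool:
    """Return True if the given /proc/1/cgroup content mentions a container runtime."""
    lowered = cgroup_text.lower()
    return any(marker in lowered for marker in CONTAINER_MARKERS)


def kernel_matches_claim(uname_r: str, claimed_prefix: str) -> bool:
    """Return True if `uname -r` output starts with the kernel series the host claims to run."""
    return uname_r.strip().startswith(claimed_prefix)


# Example inputs (made up for illustration).
print(looks_containerized("12:pids:/docker/abc123\n"))
print(kernel_matches_claim("5.15.0-91-generic\n", "5.15"))
```

A honeypot passes these checks only if every answer it gives stays consistent with the machine it pretends to be.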
## The hypothesis
What if the terminal could intelligently answer any command an attacker types — in real time, with memory, and without leaving any trace that it's fake?
And what if the captured data could automatically produce both offensive and defensive training datasets — from the same raw events?
That's what HYDRA × PDX does.
## How it works
```mermaid
graph TB
    A[Attacker via SSH] --> B[HYDRA Honeypot]
    P[Pentester via Burp] --> C[Burp Extension]
    B --> D[DataRouter]
    C --> D
    D --> E[Defensive stream]
    D --> F[Offensive stream]
    D --> G[Combined ReAct]
    E --> H["Fine-tuning<br/>Unsloth / LoRA"]
    F --> H
    G --> H
    H -->|feedback.yaml| B
```
The system has two data sources:
| Source | Type | What it captures |
|---|---|---|
| HYDRA | Passive | Attackers connect to a public SSH honeypot. Every command is answered by an LLM in real time. 65+ built-in commands, 3 personas, anti-fingerprinting. |
| Burp Suite | Active | During web pentests, HTTP deltas flow through a Java extension into the same pipeline. |
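One way to picture the HYDRA row is a dispatcher that serves known commands from a built-in table and hands everything else to the LLM along with the session history. The sketch below is an assumption about the shape of such a dispatcher, not HYDRA's actual 9-step pipeline: `BUILTINS`, `answer`, and the `llm` callable are all hypothetical names, and the built-in-first ordering is a guess.

```python
from typing import Callable, Dict, List

# Hypothetical built-in table; HYDRA ships 65+ commands, two are shown here.
BUILTINS: Dict[str, Callable[[List[str]], str]] = {
    "whoami": lambda args: "root",
    "pwd": lambda args: "/root",
}


def answer(
    command_line: str,
    history: List[str],
    llm: Callable[[str, List[str]], str],
) -> str:
    """Answer one command: built-in handler if known, otherwise the LLM fallback."""
    parts = command_line.split()
    if not parts:
        return ""
    # Record the command so later answers can stay consistent with earlier ones.
    history.append(command_line)
    handler = BUILTINS.get(parts[0])
    if handler is not None:
        return handler(parts[1:])
    return llm(command_line, history)


def fake_llm(command_line: str, history: List[str]) -> str:
    """Stand-in for a real model call, used only for this example."""
    return f"simulated output for: {command_line}"


session: List[str] = []
print(answer("whoami", session, fake_llm))
print(answer("cat /etc/passwd", session, fake_llm))
```

Passing the history to the fallback is what lets the model keep its story straight across a session, for example by remembering a file the attacker created earlier.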
Both sources produce events in the same `.pdx` format and converge on a single DataRouter, which routes each event to the defensive stream, the offensive stream, or both at once.
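The routing step can be sketched as a function that maps one event to a set of streams. Everything here is hypothetical: the `Event` fields, the label vocabulary, and the rule that labels decide the stream are assumptions made for illustration, not the real `.pdx` schema or DataRouter logic.

```python
from dataclasses import dataclass
from enum import Enum


class Stream(Enum):
    DEFENSIVE = "defensive"
    OFFENSIVE = "offensive"


# Hypothetical label sets; the real classification criteria are not shown here.
DEFENSIVE_LABELS = frozenset({"indicator", "detection"})
OFFENSIVE_LABELS = frozenset({"technique", "exploit"})


@dataclass(frozen=True)
class Event:
    source: str        # assumed values: "hydra" or "burp"
    labels: frozenset  # assumed free-form labels attached at capture time


def route(event: Event) -> set:
    """Return every stream the event belongs to; an event may land in both."""
    streams = set()
    if event.labels & DEFENSIVE_LABELS:
        streams.add(Stream.DEFENSIVE)
    if event.labels & OFFENSIVE_LABELS:
        streams.add(Stream.OFFENSIVE)
    return streams


ssh_event = Event(source="hydra", labels=frozenset({"indicator", "technique"}))
print(sorted(stream.value for stream in route(ssh_event)))
```

Returning a set rather than a single value is what allows one raw event to feed both datasets without being duplicated upstream.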
## Key numbers
| Metric | Value |
|---|---|
| SSH sessions captured | 3,508 |
| Signal sessions (human) | 78 (2.2%) |
| Defensive events generated | 8,668 |
| Offensive events generated | 4,910 |
| MITRE ATT&CK tactics covered | 5/5 |
| Longest session | 36.3 minutes |
| Personas | 3 (fintech, crypto, corp AD) |
| Built-in commands | 65+ |
| Training generators | 7 formats |
| Data collectors | 8 sources |
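The signal rate in the table follows directly from the two session counts. The snippet below only re-derives that figure from the numbers above; the variable names are ours.

```python
total_sessions = 3_508   # SSH sessions captured (from the table)
signal_sessions = 78     # sessions classified as human (from the table)

signal_share = signal_sessions / total_sessions
noise_sessions = total_sessions - signal_sessions

print(f"signal share: {signal_share:.1%}")
print(f"noise sessions: {noise_sessions}")
```

In other words, roughly one session in 45 comes from a human operator; the rest is automated traffic.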
## What's in the docs
### Architecture
How the full system fits together — capture, routing, output, and feedback loop.
### HYDRA
The LLM-powered honeypot: 9-step command pipeline, personas, virtual filesystem, anti-fingerprinting, PromptGuard, feedback loop.
### PDX
The pipeline: `.pdx` format, Delta Vector 16D, DataRouter, Burp bridge, 7 training generators, quality pipeline.
### Observations
What we found in 3,508 sessions: Kinsing botnets, Solana targeting, credential propagation, prompt injection via SSH.
## Links
- GitHub repository
- arXiv paper (coming soon)