Troubleshooting¶
Common issues and their solutions when running HYDRA × PDX.
HYDRA¶
SSH server won't start¶
Symptom: OSError: [Errno 98] Address already in use
The port is already occupied. Check what's using it:
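If no shell tooling is at hand, the check can also be done from Python. The following is a minimal sketch, not part of HYDRA: `port_in_use` is a hypothetical helper, and port 2222 is assumed only because it is the port mentioned in this section. It reports whether something accepts connections on the port, not which process owns it.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on success and an errno value otherwise,
        # so a listening server shows up as 0.
        return probe.connect_ex((host, port)) == 0


# Example: port_in_use(2222) is True while another server holds the port.
```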
If it's a previous HYDRA instance that didn't shut down cleanly:
If the real SSH daemon is on port 2222, change HYDRA's port in .env:
Groq API errors¶
Symptom: GroqError: 429 Rate limit exceeded
The free Groq tier has rate limits. Solutions:
- Reduce concurrent sessions: HYDRA makes one API call per command, so heavy bot traffic adds up quickly.
- Increase cache TTL: in .env, set LLM_CACHE_TTL=600 (10 minutes). Repeated commands hit the cache instead of the API.
- Increase cache size: LLM_CACHE_SIZE=500 stores more unique responses.
- Upgrade Groq tier: the paid tier has significantly higher limits.
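To see why a longer TTL and a larger cache reduce API calls, here is a minimal sketch of a size-bounded cache with expiry. It is an illustration only: `TTLCache` and its parameters are hypothetical names, and HYDRA's real cache may be implemented differently. The defaults mirror the LLM_CACHE_TTL and LLM_CACHE_SIZE values above.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Small LRU cache whose entries expire after `ttl` seconds."""

    def __init__(self, max_size=500, ttl=600.0, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = OrderedDict()  # command -> (stored_at, response)

    def get(self, command):
        entry = self._entries.get(command)
        if entry is None:
            return None
        stored_at, response = entry
        if self.clock() - stored_at > self.ttl:
            del self._entries[command]  # expired: the caller must hit the API again
            return None
        self._entries.move_to_end(command)  # mark as recently used
        return response

    def put(self, command, response):
        self._entries[command] = (self.clock(), response)
        self._entries.move_to_end(command)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict the least recently used entry
```

With this shape, a repeated command within the TTL is answered from memory, and only commands that fell out of the cache (by age or by size) cost an API call.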
Symptom: GroqError: 503 Service unavailable
The Groq API is temporarily down. HYDRA will retry automatically. If it persists, check status.groq.com.
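HYDRA's own retry policy is not described here. As a generic illustration of automatic retries, a wrapper with exponential backoff could look like the sketch below; `call_with_retry`, the attempt count, and the delays are all assumptions, not HYDRA settings.

```python
import time


def call_with_retry(fn, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay, then 2x, 4x, ... and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            # Real code should catch only the client's transient error types.
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```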
Sessions disconnect immediately¶
Symptom: Attackers connect but disconnect within 1 second.
Most likely these are bot_ephemeral scanners — they probe the port but don't authenticate. This is normal and expected (72.7% of all traffic). Check the logs:
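One way to confirm this is to count records per label. The sketch below assumes the logs are *.jsonl files in one directory and that each record carries a `label` field; both the field name and the `count_labels` helper are hypothetical, so adjust them to the actual log schema.

```python
import json
from collections import Counter
from pathlib import Path


def count_labels(logs_dir, label_key="label"):
    """Count JSONL records per label across all *.jsonl files in logs_dir."""
    counts = Counter()
    for path in sorted(Path(logs_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # ignore truncated or malformed lines
                if isinstance(record, dict):
                    counts[record.get(label_key, "unlabeled")] += 1
    return counts


# Example: count_labels("logs").most_common() lists labels by frequency.
```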
If all sessions disconnect immediately (including your test connections), check:
- SSH host keys exist in config/ (RSA + Ed25519)
- The .env file has a valid GROQ_API_KEY
- Python dependencies are all installed
PromptGuard false positives¶
Symptom: Legitimate commands are scored > 0.5 by PromptGuard.
PromptGuard uses regex patterns. Some legitimate admin commands can trigger low-score matches. This is by design — PromptGuard never blocks, only logs. Scores below 0.8 are informational.
If a specific pattern causes issues, you can adjust thresholds in the code or add exceptions to the pattern list.
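The scoring model described above can be pictured with a small sketch. Everything in it is illustrative: the patterns, weights, exception list, and function names are invented for the example and are not PromptGuard's actual rules. Only the 0.8 cut-off and the log-only behavior come from this section.

```python
import re

# Hypothetical (pattern, score) pairs; the real pattern list lives in the code.
PATTERNS = [
    (re.compile(r"ignore (all )?(previous|prior) instructions", re.I), 0.9),
    (re.compile(r"\bsystem prompt\b", re.I), 0.6),
]
# Hypothetical exceptions: commands matching these are never scored.
EXCEPTIONS = [re.compile(r"^grep\b")]
ALERT_THRESHOLD = 0.8  # scores below this are informational


def score(command):
    """Return the highest score among matching patterns, or 0.0."""
    if any(rx.search(command) for rx in EXCEPTIONS):
        return 0.0
    return max((w for rx, w in PATTERNS if rx.search(command)), default=0.0)


def classify(command):
    """Label a command for logging; nothing is ever blocked."""
    s = score(command)
    return ("alert" if s >= ALERT_THRESHOLD else "info"), s
```

In this sketch, raising `ALERT_THRESHOLD` or adding an entry to `EXCEPTIONS` are the two knobs that correspond to adjusting thresholds or adding exceptions.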
VFS inconsistency¶
Symptom: An attacker creates a file but ls doesn't show it.
Check if the command was handled by the LLM rather than the built-in handler. LLM responses don't automatically mutate the VFS — only built-in commands (mkdir, touch, rm, echo >) trigger state mutations.
If an attacker runs a command that should create files (e.g., wget downloading a file), the VFS side-effect registry handles known patterns. Unrecognized side effects may not be reflected.
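A side-effect registry of this kind can be sketched as a mapping from command name to a function that derives the files the command would create. This is a simplified illustration with hypothetical names (`SIDE_EFFECTS`, `apply_side_effects`) and a VFS reduced to a set of paths; HYDRA's real registry and VFS are not shown in this document.

```python
import shlex
from posixpath import basename, join
from urllib.parse import urlparse


def wget_effect(args, cwd):
    """Paths a plain `wget URL...` would create (flags such as -O are not interpreted)."""
    created = []
    for arg in args:
        if arg.startswith("-"):
            continue
        name = basename(urlparse(arg).path) or "index.html"
        created.append(join(cwd, name))
    return created


# Known patterns only; commands missing from this table leave the VFS untouched.
SIDE_EFFECTS = {"wget": wget_effect}


def apply_side_effects(command, cwd, vfs):
    """Add files implied by `command` to the vfs set; return True if a handler ran."""
    parts = shlex.split(command)
    if not parts:
        return False
    handler = SIDE_EFFECTS.get(parts[0])
    if handler is None:
        return False
    vfs.update(handler(parts[1:], cwd))
    return True
```

A command without a registered handler returns `False` here, which is the situation in which a later ls would not show the expected file.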
PDX¶
DataRouter produces empty output¶
Symptom: split_stats.json shows 0 events.
- No signal sessions: if all sessions are classified as noise, the DataRouter has nothing to process. Check split_stats.json for session counts by label.
- Wrong logs directory: verify the path, e.g. python -m pdx.training.data_router split --logs-dir /path/to/logs
- Corrupted JSONL: some log files may be truncated (e.g., from a crash). The DataRouter skips malformed lines but logs warnings.
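To locate truncated lines before running the DataRouter, a standalone check along these lines can help. It is a minimal sketch that assumes the logs are *.jsonl files in a single directory; `find_malformed_lines` is a hypothetical helper, not part of PDX.

```python
import json
from pathlib import Path


def find_malformed_lines(logs_dir):
    """Return (file name, line number) for every line that is not valid JSON."""
    problems = []
    for path in sorted(Path(logs_dir).glob("*.jsonl")):
        # errors="replace" keeps the scan going if a crash left invalid bytes behind.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if not line.strip():
                    continue  # blank lines are harmless
                try:
                    json.loads(line)
                except json.JSONDecodeError:
                    problems.append((path.name, lineno))
    return problems
```

An empty result together with 0 events points back to the first two causes in the list rather than to corrupted files.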
Fine-tuning runs out of VRAM¶
Symptom: CUDA out of memory
Solutions by VRAM available:
| VRAM | Model | LoRA rank | Batch size |
|---|---|---|---|
| 8 GB | Qwen 2.5 7B | 8 | 1–2 |
| 12 GB | Qwen 2.5 7B | 16 | 2–4 |
| 16 GB | Qwen 2.5 14B | 16 | 2 |
| 24 GB | Llama 3.1 8B | 32 | 4–8 |
Also check if Ollama is running — it reserves VRAM even when idle:
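One way to check from Python is to ask the local Ollama server which models it currently has loaded. The sketch below assumes Ollama listens on its default address, http://localhost:11434, and that its /api/ps endpoint reports a `size_vram` value per loaded model; the helper names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def vram_by_model(ps_payload):
    """Map model name -> VRAM bytes from a parsed /api/ps response body."""
    return {
        model.get("name", "?"): model.get("size_vram", 0)
        for model in ps_payload.get("models", [])
    }


def query_ollama(base_url="http://localhost:11434", timeout=2.0):
    """Return loaded models and their VRAM use, or None if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/ps", timeout=timeout) as resp:
            return vram_by_model(json.load(resp))
    except (urllib.error.URLError, OSError):
        return None
```

A non-empty result means a model is still resident; stopping or unloading it frees that memory for fine-tuning.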
Quality pipeline removes too many entries¶
Symptom: dedup_removed count is very high.
This usually means the training data has many near-duplicate sessions (same bot, same commands). This is expected — bots are repetitive by nature. The dedup threshold can be adjusted:
Warning
Setting the threshold too high (> 0.95) risks leaving near-duplicates in the training set, which degrades model performance.
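The effect of the threshold can be illustrated with a small near-duplicate filter. This sketch uses Jaccard similarity over each session's set of commands, which is an assumption made for the example; the quality pipeline's actual similarity measure and parameter names are not shown in this document.

```python
def jaccard(a, b):
    """Jaccard similarity between two command lists, treated as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def dedup(sessions, threshold=0.9):
    """Drop sessions whose similarity to an already kept session is >= threshold."""
    kept = []
    removed = 0
    for session in sessions:
        if any(jaccard(session, other) >= threshold for other in kept):
            removed += 1
        else:
            kept.append(session)
    return kept, removed
```

A lower threshold removes more sessions, and a higher one keeps more, which is why values close to 1.0 let near-duplicates through.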
Deployment¶
Cloudflare Pages build fails¶
Symptom: Build fails with pip: command not found
Make sure the environment variable PYTHON_VERSION is set to 3.12 in the Cloudflare Pages build settings. Without it, the build environment may not have Python available.
Subdomain not resolving¶
If you're using a custom subdomain (e.g., docs.yourdomain.com):
- In Cloudflare dashboard → DNS → verify the CNAME record exists pointing to your Pages project
- In your Pages project → Custom domains → verify the domain is listed and has a green checkmark
- SSL certificates can take up to 24 hours to provision
Getting help¶
If your issue isn't listed here:
- Check the logs: HYDRA logs everything to logs/ in structured JSONL
- Run tests: python -m pytest tests/ validates core functionality
- Open an issue on GitHub with:
    - The error message (full traceback)
    - Your Python version (python --version)
    - Your OS and hardware (GPU model if fine-tuning)
    - The command you ran
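The environment details for an issue can be collected in one step. This is a convenience sketch with hypothetical helper names; it assumes an NVIDIA GPU is queried through nvidia-smi and reports an empty GPU list when that tool is missing or fails.

```python
import platform
import shutil
import subprocess


def gpu_names():
    """Return GPU model names reported by nvidia-smi, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10, check=True,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return []
    return [line.strip() for line in out.splitlines() if line.strip()]


def environment_report():
    """Collect the Python version, OS, and GPU models for a bug report."""
    return {
        "python": platform.python_version(),
        "os": platform.platform(),
        "gpus": gpu_names(),
    }


# Example: print(environment_report()) and paste the result into the issue.
```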