Troubleshooting

Common issues and their solutions when running HYDRA × PDX.

HYDRA

SSH server won't start

Symptom: OSError: [Errno 98] Address already in use

The port is already occupied. Check what's using it:

sudo ss -tlnp | grep 2222

If it's a previous HYDRA instance that didn't shut down cleanly:

# Find the PID
sudo lsof -i :2222
# Ask the process to exit cleanly first
sudo kill <PID>
# Force-kill only if it is still running
sudo kill -9 <PID>

If the real SSH daemon is on port 2222, change HYDRA's port in .env:

SSH_PORT=2223
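Before restarting HYDRA on a new port, it can help to confirm that the port is actually free. The following is a minimal sketch using only the standard library; `port_is_free` is a hypothetical helper name and is not part of HYDRA.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Errno 98 (Address already in use) and permission errors land here.
            return False
        return True


if __name__ == "__main__":
    for candidate in (2222, 2223):
        state = "free" if port_is_free(candidate) else "in use"
        print(f"port {candidate}: {state}")
```

A port reported as free can still be taken by another process before HYDRA starts, so treat the result as a quick check rather than a guarantee.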

Groq API errors

Symptom: GroqError: 429 Rate limit exceeded

The free Groq tier has rate limits. Solutions:

Reduce concurrent sessions — HYDRA opens one API call per command. Under heavy bot traffic, this adds up quickly.
Increase cache TTL — In .env, set LLM_CACHE_TTL=600 (10 minutes). Repeated commands hit the cache instead of the API.
Increase cache size — LLM_CACHE_SIZE=500 stores more unique responses.
Upgrade Groq tier — The paid tier has significantly higher limits.
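To see why the two cache settings reduce API calls, here is a small sketch of a cache that combines a time-to-live with a size limit. It illustrates the general technique only; HYDRA's actual cache implementation is not shown in this guide, and the `TTLCache` class and its parameters are hypothetical stand-ins for what `LLM_CACHE_TTL` and `LLM_CACHE_SIZE` control.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Small LRU cache whose entries expire after `ttl` seconds (illustrative only)."""

    def __init__(self, max_size=500, ttl=600.0, clock=time.monotonic):
        self.max_size = max_size      # plays the role of LLM_CACHE_SIZE
        self.ttl = ttl                # plays the role of LLM_CACHE_TTL
        self._clock = clock
        self._entries = OrderedDict()  # command -> (stored_at, response)

    def get(self, command):
        item = self._entries.get(command)
        if item is None:
            return None               # miss: the caller would hit the API
        stored_at, response = item
        if self._clock() - stored_at > self.ttl:
            del self._entries[command]  # expired: treat as a miss
            return None
        self._entries.move_to_end(command)  # mark as recently used
        return response

    def put(self, command, response):
        self._entries[command] = (self._clock(), response)
        self._entries.move_to_end(command)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict the least recently used entry
```

A longer TTL keeps repeated bot commands answered from memory for longer, and a larger size keeps rarely repeated commands from pushing common ones out.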

Symptom: GroqError: 503 Service unavailable

The Groq API is temporarily down. HYDRA will retry automatically. If it persists, check status.groq.com.
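This guide does not describe how HYDRA's automatic retry works. If you call the API from a script of your own, a common approach for both 429 and 503 responses is exponential backoff with jitter. The sketch below is generic: `call_with_backoff` and its parameters are hypothetical names, not part of HYDRA or the Groq SDK.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries so parallel sessions do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

In real use, pass the specific rate-limit and service-unavailable exception types as `retry_on` rather than retrying on every exception.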

Sessions disconnect immediately

Symptom: Attackers connect but disconnect within 1 second.

Most likely these are bot_ephemeral scanners — they probe the port but don't authenticate. This is normal and expected (72.7% of all traffic). Check the logs:

# Count session types
grep -oh '"label": *"[^"]*"' logs/*.jsonl | sort | uniq -c | sort -rn
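The same tally can be computed in Python, which is more robust to formatting differences in the JSON. This sketch assumes each log record is a JSON object with a top-level "label" key; that schema is an assumption, and `count_labels` is a hypothetical helper name.

```python
import json
from collections import Counter
from pathlib import Path


def count_labels(logs_dir="logs"):
    """Count the "label" field across every *.jsonl file in logs_dir."""
    counts = Counter()
    for path in sorted(Path(logs_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip blank, truncated, or malformed lines
                # Assumed schema: the label is a top-level key of each record.
                if isinstance(record, dict) and "label" in record:
                    counts[record["label"]] += 1
    return counts
```

Calling `count_labels("logs").most_common()` returns the labels ordered from most to least frequent, so a log dominated by bot_ephemeral sessions is easy to spot.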

If all sessions disconnect immediately (including your test connections), check:

SSH host keys exist in config/ (RSA + Ed25519)
The .env file has a valid GROQ_API_KEY
Python dependencies are all installed

PromptGuard false positives

Symptom: Legitimate commands are scored > 0.5 by PromptGuard.

PromptGuard scores commands with regex patterns, so some legitimate admin commands produce low-confidence matches. This is by design: PromptGuard never blocks, it only logs, and scores below 0.8 are informational.

If a specific pattern causes issues, you can adjust thresholds in the code or add exceptions to the pattern list.
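As an illustration of how pattern-based scoring and an exception list can interact, here is a minimal sketch. Everything in it is hypothetical: the patterns, the scores, the `EXCEPTIONS` list, and the `score` function are invented for the example and do not reflect PromptGuard's real pattern list or API. Only the 0.8 threshold comes from this guide.

```python
import re

# Hypothetical (pattern, score) pairs; the real pattern list will differ.
PATTERNS = [
    (re.compile(r"ignore (all )?previous instructions", re.I), 0.9),
    (re.compile(r"\bsystem prompt\b", re.I), 0.6),
]

# Hypothetical exceptions: commands matching these are never scored.
EXCEPTIONS = [re.compile(r"^grep\b")]

ALERT_THRESHOLD = 0.8  # scores below this are informational only


def score(command: str) -> float:
    """Return the highest score among matching patterns, or 0.0 if none match."""
    if any(exc.search(command) for exc in EXCEPTIONS):
        return 0.0
    return max((s for pattern, s in PATTERNS if pattern.search(command)), default=0.0)


def is_alert(command: str) -> bool:
    """True only when the score reaches the alert threshold."""
    return score(command) >= ALERT_THRESHOLD
```

In this sketch a command that merely mentions a flagged phrase scores above 0.5 but stays below the threshold, which matches the symptom described above.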

VFS inconsistency

Symptom: An attacker creates a file but ls doesn't show it.

Check if the command was handled by the LLM rather than the built-in handler. LLM responses don't automatically mutate the VFS — only built-in commands (mkdir, touch, rm, echo >) trigger state mutations.

If an attacker runs a command that should create files (e.g., wget downloading a file), the VFS side-effect registry handles known patterns. Unrecognized side effects may not be reflected.
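To make the idea of a side-effect registry concrete, here is a minimal sketch of one possible design. It is not HYDRA's implementation: `SIDE_EFFECTS`, `wget_effect`, and `apply_side_effects` are hypothetical names, and the VFS is modelled as a plain set of paths.

```python
import posixpath
import re
from urllib.parse import urlparse


def wget_effect(match, cwd):
    """wget <url> leaves a file named after the last URL path segment."""
    name = posixpath.basename(urlparse(match.group("url")).path) or "index.html"
    return [posixpath.join(cwd, name)]


# Hypothetical registry: command pattern -> function returning the paths it creates.
SIDE_EFFECTS = [
    (re.compile(r"^wget\s+(?P<url>\S+)$"), wget_effect),
]


def apply_side_effects(command, cwd, vfs_files):
    """Add files implied by a recognized command to vfs_files; return True if one matched."""
    for pattern, effect in SIDE_EFFECTS:
        match = pattern.match(command.strip())
        if match:
            vfs_files.update(effect(match, cwd))
            return True
    return False  # unrecognized command: the VFS stays unchanged
```

A command with no registered pattern falls through and changes nothing, which is the situation where a later ls does not show the expected file.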

PDX

DataRouter produces empty output

Symptom: split_stats.json shows 0 events.

No signal sessions — If all sessions are classified as noise, the DataRouter has nothing to process. Check split_stats.json for session counts by label.
Wrong logs directory — Verify the path: python -m pdx.training.data_router split --logs-dir /path/to/logs
Corrupted JSONL — Some log files may be truncated (e.g., from a crash). The DataRouter skips malformed lines but logs warnings.
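To locate truncated or corrupted lines before running the DataRouter, a standalone check like the following can help. It is a sketch that uses only the standard library and does not call any PDX code; `find_malformed_lines` is a hypothetical helper name.

```python
import json
from pathlib import Path


def find_malformed_lines(logs_dir):
    """Return (file name, line number) for every non-empty line that is not valid JSON."""
    bad = []
    for path in sorted(Path(logs_dir).glob("*.jsonl")):
        # errors="replace" keeps the scan going if a crash left invalid bytes behind.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if not line.strip():
                    continue  # blank lines are harmless
                try:
                    json.loads(line)
                except json.JSONDecodeError:
                    bad.append((path.name, number))
    return bad
```

A malformed line at the very end of a file usually points to a write that was interrupted by a crash.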

Fine-tuning runs out of VRAM

Symptom: CUDA out of memory

Solutions by VRAM available:

VRAM	Model	LoRA rank	Batch size
8 GB	Qwen 2.5 7B	8	1–2
12 GB	Qwen 2.5 7B	16	2–4
16 GB	Qwen 2.5 14B	16	2
24 GB	Llama 3.1 8B	32	4–8

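Also check whether Ollama is running. It can keep a loaded model in VRAM for a while after the last request, even when idle:

# Check Ollama VRAM usage
nvidia-smi
# Stop Ollama temporarily (restart later with: sudo systemctl start ollama)
sudo systemctl stop ollama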
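To check free VRAM from a script before starting a fine-tuning run, the nvidia-smi query interface can be parsed as sketched below. The helper names `parse_memory` and `free_vram_mib` are hypothetical, and the sketch assumes nvidia-smi is on the PATH.

```python
import subprocess

# Ask nvidia-smi for used and total memory per GPU, in MiB, as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(output):
    """Parse nvidia-smi CSV output into a list of (used_mib, total_mib), one per GPU."""
    rows = []
    for line in output.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def free_vram_mib():
    """Return free VRAM in MiB for each GPU by running nvidia-smi."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [total - used for used, total in parse_memory(result.stdout)]
```

Comparing the value from `free_vram_mib()` before and after stopping Ollama shows how much memory it was holding.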

Quality pipeline removes too many entries

Symptom: dedup_removed count is very high.

This usually means the training data has many near-duplicate sessions (same bot, same commands). This is expected — bots are repetitive by nature. The dedup threshold can be adjusted:

qp = QualityPipeline(dedup_threshold=0.95)  # More permissive (default: 0.85)

Warning

Setting the threshold too high (> 0.95) risks leaving near-duplicates in the training set, which degrades model performance.
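The effect of the threshold is easier to see in a small example. The sketch below is not the QualityPipeline algorithm, whose similarity metric is not described in this guide; it uses `difflib.SequenceMatcher` from the standard library as a stand-in, and `dedup` is a hypothetical function name.

```python
from difflib import SequenceMatcher


def dedup(sessions, threshold=0.85):
    """Keep a session only if its similarity to every kept session is below threshold."""
    kept = []
    for session in sessions:
        if all(SequenceMatcher(None, session, other).ratio() < threshold for other in kept):
            kept.append(session)
    return kept


if __name__ == "__main__":
    base = [f"cmd{i}" for i in range(10)]   # a 10-command session
    variant = base[:9] + ["other"]          # same session with one command changed
    for threshold in (0.85, 0.95):
        print(threshold, len(dedup([base, variant], threshold)))
```

Two sessions that share nine of ten commands have a similarity of 0.9 under this metric, so the second one is removed at a threshold of 0.85 but kept at 0.95.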

Deployment

Cloudflare Pages build fails

Symptom: Build fails with pip: command not found

Make sure the environment variable PYTHON_VERSION is set to 3.12 in the Cloudflare Pages build settings. Without it, the build environment may not have Python available.

Subdomain not resolving

If you're using a custom subdomain (e.g., docs.yourdomain.com):

In Cloudflare dashboard → DNS → verify the CNAME record exists pointing to your Pages project
In your Pages project → Custom domains → verify the domain is listed and has a green checkmark
SSL certificates can take up to 24 hours to provision
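After checking the dashboard, you can confirm from your own machine whether the subdomain resolves at all. This is a minimal sketch using the standard library; `resolves` is a hypothetical helper name, and docs.yourdomain.com is the placeholder from the example above.

```python
import socket


def resolves(hostname):
    """Return the sorted IP addresses hostname resolves to, or [] if it does not resolve."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    print(resolves("docs.yourdomain.com"))
```

An empty list means the name does not resolve from your resolver yet, which points to the DNS record rather than the SSL certificate. Local DNS caches can lag behind a record you have just created.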

Getting help

If your issue isn't listed here:

Check the logs — HYDRA logs everything to logs/ in structured JSONL
Run tests — python -m pytest tests/ validates core functionality
Open an issue on GitHub with:
- The error message (full traceback)
- Your Python version (python --version)
- Your OS and hardware (GPU model if fine-tuning)
- The command you ran
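Most of the environment details in that list can be gathered with a short script. The sketch below uses only the standard library and covers Python version, OS, and machine type; the GPU model still has to be added by hand (for example from nvidia-smi). `environment_report` is a hypothetical helper name.

```python
import platform
import sys


def environment_report(command="<the command you ran>"):
    """Collect basic environment details into a block you can paste into an issue."""
    return "\n".join([
        f"Python:  {sys.version.split()[0]}",
        f"OS:      {platform.platform()}",
        f"Machine: {platform.machine()}",
        f"Command: {command}",
    ])


if __name__ == "__main__":
    print(environment_report(" ".join(sys.argv[1:]) or "<the command you ran>"))
```

Paste the output together with the full traceback when opening the issue.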