
Troubleshooting

Common issues and their solutions when running HYDRA × PDX.

HYDRA

SSH server won't start

Symptom: OSError: [Errno 98] Address already in use

The port is already occupied. Check what's using it:

sudo ss -tlnp | grep 2222

If it's a previous HYDRA instance that didn't shut down cleanly:

# Find the PID
sudo lsof -i :2222
# Stop it (sends SIGTERM; escalate to kill -9 only if it doesn't exit)
sudo kill <PID>

If the real SSH daemon is on port 2222, change HYDRA's port in .env:

SSH_PORT=2223
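Before editing .env, you can confirm which port is actually free. The following is a minimal sketch using only the standard library; the helper names `port_is_free` and `first_free_port` are illustrative and not part of HYDRA, and the bind address HYDRA uses is assumed to be 0.0.0.0.

```python
import socket
from typing import Optional


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Errno 98 (Address already in use) and permission errors land here.
            return False
        return True


def first_free_port(start: int = 2222, attempts: int = 10) -> Optional[int]:
    """Scan upward from `start` and return the first bindable port, if any."""
    for port in range(start, start + attempts):
        if port_is_free(port):
            return port
    return None


if __name__ == "__main__":
    print("Suggested SSH_PORT:", first_free_port())
```

The check is only a snapshot: another process can still grab the port between the check and HYDRA's startup.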

Groq API errors

Symptom: GroqError: 429 Rate limit exceeded

The free Groq tier has rate limits. Solutions:

  1. Reduce concurrent sessions — HYDRA opens one API call per command. Under heavy bot traffic, this adds up quickly.
  2. Increase cache TTL — In .env, set LLM_CACHE_TTL=600 (10 minutes). Repeated commands hit the cache instead of the API.
  3. Increase cache size — LLM_CACHE_SIZE=500 stores more unique responses.
  4. Upgrade Groq tier — The paid tier has significantly higher limits.
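To see how the two cache settings interact, here is a small sketch of a cache bounded by both entry age and entry count. It is not HYDRA's implementation: the class name `TTLCache` and the least-recently-used eviction policy are assumptions made for illustration.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Hypothetical response cache bounded by age (ttl) and count (max_size)."""

    def __init__(self, max_size=500, ttl=600.0, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl
        self._clock = clock
        self._entries = OrderedDict()  # command -> (expires_at, response)

    def get(self, command):
        item = self._entries.get(command)
        if item is None:
            return None
        expires_at, response = item
        if self._clock() >= expires_at:
            # Entry is older than the TTL: drop it so the caller hits the API.
            del self._entries[command]
            return None
        self._entries.move_to_end(command)  # mark as recently used
        return response

    def put(self, command, response):
        self._entries[command] = (self._clock() + self.ttl, response)
        self._entries.move_to_end(command)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used
```

A longer TTL helps when the same commands recur over time; a larger size helps when many distinct commands arrive within one TTL window.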

Symptom: GroqError: 503 Service unavailable

The Groq API is temporarily down. HYDRA will retry automatically. If it persists, check status.groq.com.
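If you call the API from your own scripts and want similar behaviour, retrying with exponential backoff is a common approach. The sketch below is generic and does not reproduce HYDRA's retry logic; `TransientAPIError` is a hypothetical stand-in for whatever exception your client raises on 429 or 503 responses.

```python
import random
import time


class TransientAPIError(Exception):
    """Hypothetical stand-in for a 429 or 503 response from the API client."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(), retrying on TransientAPIError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientAPIError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from concurrent sessions.
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter exists so the function can be exercised without real waiting.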

Sessions disconnect immediately

Symptom: Attackers connect but disconnect within 1 second.

Most likely these are bot_ephemeral scanners — they probe the port but don't authenticate. This is normal and expected (72.7% of all traffic). Check the logs:

# Count session types
grep -oh '"label": *"[^"]*"' logs/*.jsonl | sort | uniq -c | sort -rn
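The same tally can be done in Python, which parses each line as JSON instead of matching text. This is a sketch under one assumption: each log line is a JSON object with a top-level `label` field. Adjust the key lookup if your logs nest it differently.

```python
import json
from collections import Counter
from pathlib import Path


def count_labels(logs_dir="logs"):
    """Count occurrences of each top-level "label" value across *.jsonl files."""
    counts = Counter()
    for path in sorted(Path(logs_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                try:
                    event = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip truncated or malformed lines
                if isinstance(event, dict) and "label" in event:
                    counts[str(event["label"])] += 1
    return counts


if __name__ == "__main__":
    for label, total in count_labels().most_common():
        print(f"{total:8d} {label}")
```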

If all sessions disconnect immediately (including your test connections), check:

  • SSH host keys exist in config/ (RSA + Ed25519)
  • The .env file has a valid GROQ_API_KEY
  • Python dependencies are all installed

PromptGuard false positives

Symptom: Legitimate commands are scored > 0.5 by PromptGuard.

PromptGuard uses regex patterns. Some legitimate admin commands can trigger low-score matches. This is by design — PromptGuard never blocks, only logs. Scores below 0.8 are informational.

If a specific pattern causes issues, you can adjust thresholds in the code or add exceptions to the pattern list.
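As a rough picture of what adjusting thresholds or adding exceptions could look like, here is a sketch of a regex scorer that logs but never blocks. Everything in it is hypothetical: the patterns, the weights, the `ALLOWLIST`, and the function names are illustrative and are not taken from PromptGuard's source. Only the 0.8 cutoff comes from the text above.

```python
import re

# Hypothetical (pattern, score) pairs; the real pattern list lives in the code.
PATTERNS = [
    (re.compile(r"ignore (all )?previous instructions", re.I), 0.9),
    (re.compile(r"\bsystem prompt\b", re.I), 0.6),
]

# Hypothetical exceptions: commands matching these are never scored.
ALLOWLIST = [re.compile(r"^grep\b")]

ALERT_THRESHOLD = 0.8  # scores below this are informational


def score_command(command):
    """Return the highest score among matching patterns, or 0.0 if none match."""
    if any(pattern.search(command) for pattern in ALLOWLIST):
        return 0.0
    return max((score for pattern, score in PATTERNS if pattern.search(command)),
               default=0.0)


def classify(command):
    """Build a log record; the command is never blocked, only annotated."""
    score = score_command(command)
    level = "alert" if score >= ALERT_THRESHOLD else "info"
    return {"command": command, "score": score, "level": level}
```

In this shape, a false positive is fixed either by lowering a pattern's score or by adding a narrow allowlist entry for the legitimate command.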

VFS inconsistency

Symptom: An attacker creates a file but ls doesn't show it.

Check if the command was handled by the LLM rather than the built-in handler. LLM responses don't automatically mutate the VFS — only built-in commands (mkdir, touch, rm, echo >) trigger state mutations.

If an attacker runs a command that should create files (e.g., wget downloading a file), the VFS side-effect registry handles known patterns. Unrecognized side effects may not be reflected.
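The split between built-in commands, LLM-handled commands, and the side-effect registry can be pictured with a toy model. This is not HYDRA's VFS: the `run` function, the `SIDE_EFFECTS` mapping, and the flat set of filenames are simplifications invented for illustration.

```python
import posixpath
import shlex
from urllib.parse import urlparse


def wget_side_effect(args):
    """Filenames wget would create for its URL arguments (flags are ignored)."""
    urls = [arg for arg in args if "://" in arg]
    return [posixpath.basename(urlparse(url).path) or "index.html" for url in urls]


# Hypothetical registry: command name -> function returning files it creates.
SIDE_EFFECTS = {"wget": wget_side_effect}


def run(command, files, llm=lambda cmd: ""):
    """Dispatch one command against a toy VFS (a set of filenames)."""
    argv = shlex.split(command)
    if not argv:
        return ""
    name, args = argv[0], argv[1:]
    if name == "touch":  # built-in handler: mutates state directly
        files.update(args)
        return ""
    if name == "ls":  # built-in handler: reads state
        return "\n".join(sorted(files))
    output = llm(command)  # LLM output alone never changes `files`
    effect = SIDE_EFFECTS.get(name)
    if effect is not None:
        files.update(effect(args))  # known side effect is applied
    return output
```

In this model, `curl -O` would produce plausible output but leave no file behind, because no registry entry covers it. That is the same symptom described above.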

PDX

DataRouter produces empty output

Symptom: split_stats.json shows 0 events.

  1. No signal sessions — If all sessions are classified as noise, the DataRouter has nothing to process. Check split_stats.json for session counts by label.
  2. Wrong logs directory — Verify the path: python -m pdx.training.data_router split --logs-dir /path/to/logs
  3. Corrupted JSONL — Some log files may be truncated (e.g., from a crash). The DataRouter skips malformed lines but logs warnings.
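To locate corrupted lines yourself rather than reading through warnings, a short standalone script can report them per file. This sketch is independent of the DataRouter; `find_malformed_lines` is an illustrative name.

```python
import json
from pathlib import Path


def find_malformed_lines(logs_dir):
    """Map each *.jsonl filename to the 1-based line numbers that fail to parse."""
    bad = {}
    for path in sorted(Path(logs_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if not line.strip():
                    continue  # blank lines are not reported
                try:
                    json.loads(line)
                except json.JSONDecodeError:
                    bad.setdefault(path.name, []).append(number)
    return bad


if __name__ == "__main__":
    for name, numbers in find_malformed_lines("logs").items():
        print(f"{name}: {len(numbers)} malformed line(s), first at line {numbers[0]}")
```

A single bad line at the very end of a file usually points to a crash mid-write; bad lines scattered through a file suggest a different problem.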

Fine-tuning runs out of VRAM

Symptom: CUDA out of memory

Solutions by VRAM available:

VRAM    Model          LoRA rank   Batch size
8 GB    Qwen 2.5 7B    8           1–2
12 GB   Qwen 2.5 7B    16          2–4
16 GB   Qwen 2.5 14B   16          2
24 GB   Llama 3.3 8B   32          4–8

Also check if Ollama is running — it reserves VRAM even when idle:

# Check Ollama VRAM usage
nvidia-smi
# Stop Ollama temporarily
sudo systemctl stop ollama

Quality pipeline removes too many entries

Symptom: dedup_removed count is very high.

This usually means the training data has many near-duplicate sessions (same bot, same commands). This is expected — bots are repetitive by nature. The dedup threshold can be adjusted:

qp = QualityPipeline(dedup_threshold=0.95)  # More permissive (default: 0.85)

Warning

Setting the threshold too high (> 0.95) risks leaving near-duplicates in the training set, which degrades model performance.
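To build intuition for how the threshold changes the removed count, here is a toy near-duplicate filter. It uses `difflib.SequenceMatcher` from the standard library as the similarity measure, which is an assumption: the measure QualityPipeline actually uses is not documented here, so treat the numbers as illustrative only.

```python
from difflib import SequenceMatcher


def dedup(sessions, threshold=0.85):
    """Keep a session only if it is less than `threshold` similar to all kept ones.

    Each session is a list of command strings. Returns (kept, removed_count).
    """
    kept = []
    removed = 0
    for session in sessions:
        is_duplicate = any(
            SequenceMatcher(None, session, other).ratio() >= threshold
            for other in kept
        )
        if is_duplicate:
            removed += 1
        else:
            kept.append(session)
    return kept, removed


if __name__ == "__main__":
    base = ["uname -a", "id", "w", "ls", "pwd", "whoami", "ps", "df", "free", "exit"]
    variant = base[:-1] + ["logout"]  # differs from base in one command of ten
    for threshold in (0.85, 0.95):
        _, removed = dedup([base, variant], threshold=threshold)
        print(f"threshold={threshold}: removed {removed}")
```

Two sessions that share nine of ten commands score 0.9 under this measure, so they collapse into one entry at 0.85 and both survive at 0.95.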

Deployment

Cloudflare Pages build fails

Symptom: Build fails with pip: command not found

Make sure the environment variable PYTHON_VERSION is set to 3.12 in the Cloudflare Pages build settings. Without it, the build environment may not have Python available.

Subdomain not resolving

If you're using a custom subdomain (e.g., docs.yourdomain.com):

  1. In Cloudflare dashboard → DNS → verify the CNAME record exists pointing to your Pages project
  2. In your Pages project → Custom domains → verify the domain is listed and has a green checkmark
  3. SSL certificates can take up to 24 hours to provision

Getting help

If your issue isn't listed here:

  1. Check the logs — HYDRA logs everything to logs/ in structured JSONL
  2. Run tests — python -m pytest tests/ validates core functionality
  3. Open an issue on GitHub with:
    • The error message (full traceback)
    • Your Python version (python --version)
    • Your OS and hardware (GPU model if fine-tuning)
    • The command you ran
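To collect most of these details in one step, a small script can print them for pasting into the issue. This is a convenience sketch, not a tool shipped with the project; the GPU lookup assumes an NVIDIA card with `nvidia-smi` on the PATH and is skipped otherwise.

```python
import platform
import shutil
import subprocess
import sys


def gpu_names():
    """Return GPU model names reported by nvidia-smi, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=False, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def environment_report(command=None):
    """Build a plain-text summary of the Python version, OS, and GPU."""
    lines = [
        f"Python: {platform.python_version()} ({sys.executable})",
        f"OS: {platform.platform()}",
        f"Machine: {platform.machine()}",
        f"GPU: {', '.join(gpu_names()) or 'not detected'}",
    ]
    if command:
        lines.append(f"Command: {command}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(environment_report(" ".join(sys.argv[1:]) or None))
```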