# Quality pipeline
Before training data reaches the fine-tuning stage, it passes through a three-step quality pipeline: deduplication, filtering, and curriculum ordering.
## Why quality matters
Raw honeypot data is noisy. The same bot script hits thousands of honeypots with identical commands. Without deduplication, a fine-tuned model would memorize `uname -a` → discovery but fail on novel commands. Without curriculum ordering, training would be unstable — jumping between trivial and complex examples.
## Step 1 — Deduplication
Method: trigram-based hashing.
For each training entry, the pipeline extracts the output text, generates all trigrams (3-word sequences) from the first 80 words, and hashes each trigram with MD5. If two entries' trigram-hash sets overlap beyond the similarity threshold, the newer entry is removed as a duplicate.
This catches near-duplicates — entries that are semantically identical but differ in minor wording (e.g., different IP addresses in otherwise identical sessions).
Threshold: configurable, default 0.85 similarity.
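The trigram scheme can be sketched as follows. This is an illustrative reading, not the pdx implementation: it assumes the 0.85 threshold is applied as Jaccard similarity over the two trigram-hash sets, and the function names are invented.

```python
import hashlib

def trigram_hashes(text: str, max_words: int = 80) -> set[str]:
    """MD5-hash every 3-word sequence in the first 80 words of the output."""
    words = text.split()[:max_words]
    return {
        hashlib.md5(" ".join(words[i:i + 3]).encode()).hexdigest()
        for i in range(len(words) - 2)
    }

def is_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two entries as near-duplicates when their trigram-hash
    sets overlap beyond the threshold (Jaccard similarity)."""
    ha, hb = trigram_hashes(a), trigram_hashes(b)
    if not ha or not hb:
        return False
    return len(ha & hb) / len(ha | hb) >= threshold
```

Two sessions that differ only in an embedded IP address share almost all of their trigrams, so they clear the 0.85 bar and one is dropped.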
## Step 2 — Quality filtering
Entries are filtered based on:
| Criterion | Min | Max | Rationale |
|---|---|---|---|
| Token count | 50 | 2,000 | Too short = no signal. Too long = noise. |
| Quality score | 0.3 | — | Computed from verdict weight and validation status |
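Both bounds reduce to a single predicate. A minimal sketch, assuming entries carry `token_count` and `quality_score` fields (the field names are illustrative, not the pdx schema):

```python
def passes_filters(entry: dict, min_tokens: int = 50,
                   max_tokens: int = 2000, min_quality: float = 0.3) -> bool:
    """Keep an entry only if it satisfies both table criteria."""
    n = entry["token_count"]
    return min_tokens <= n <= max_tokens and entry["quality_score"] >= min_quality
```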
### Verdict weighting
The quality score incorporates human validation when available:
| Validation status | Weight factor |
|---|---|
| Not validated | 0.3× base weight |
| Human agreed | 1.5× base weight (capped at 1.0) |
| Human disagreed | 2.0× base weight (important edge case) |
| Human uncertain | 0.1× base weight |
Human-disagreed entries get the highest weight because they represent the cases where the model got it wrong — exactly what we want the model to learn from.
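The table reads as a lookup-and-cap. A sketch under two assumptions: the status keys are invented, and the 1.0 cap is applied to every status, although the source states it only for the human-agreed case:

```python
# Assumed factor names; not the pdx implementation.
VALIDATION_FACTORS = {
    "not_validated": 0.3,
    "human_agreed": 1.5,
    "human_disagreed": 2.0,   # important edge case: model got it wrong
    "human_uncertain": 0.1,
}

def quality_score(base_weight: float, status: str) -> float:
    """Scale the base verdict weight by validation status, capped at 1.0."""
    return min(base_weight * VALIDATION_FACTORS[status], 1.0)
```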
## Step 3 — Curriculum ordering
Training entries are sorted by difficulty:
- Simple patterns first — single-command observations, clear verdicts
- Multi-step sequences — sessions with 3–5 commands
- Complex kill chains — 5+ command sequences covering multiple MITRE tactics
- Edge cases last — false positives, uncertain verdicts, prompt injection attempts
This follows the curriculum learning principle: models train more stably when they see easy examples before hard ones.
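The four tiers amount to a sort key. A sketch, where the bucket boundaries and field names (`num_commands`, `is_edge_case`) are assumptions for illustration:

```python
def difficulty(entry: dict) -> int:
    """Map a training entry to a curriculum tier (lower trains earlier)."""
    if entry.get("is_edge_case"):   # false positives, uncertain verdicts, injections
        return 3
    n = entry["num_commands"]
    if n >= 5:                      # complex kill chains, multiple MITRE tactics
        return 2
    if n >= 3:                      # multi-step sequences
        return 1
    return 0                        # simple single-command observations

def curriculum_order(entries: list[dict]) -> list[dict]:
    """Sort entries easy-to-hard for stable curriculum training."""
    return sorted(entries, key=difficulty)
```

`sorted` is stable, so entries within the same tier keep their original relative order.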
## Decay engine
Older observations lose weight over time using a half-life decay function:
Base half-life: 90 days. This ensures training data stays fresh — a vulnerability observed 6 months ago matters less than one observed yesterday.
Negative decay modifiers increase weight over time — used for rare, high-value observations (like the GLaDOS prompt injection) that become more important as the dataset grows.
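In other words, weight = base × 0.5^(age / half-life). A sketch of that function; modelling a negative modifier as a sign flip on the exponent (so weight doubles rather than halves each period) is an assumption about how negative modifiers increase weight over time:

```python
def decayed_weight(base: float, age_days: float,
                   half_life: float = 90.0, modifier: float = 1.0) -> float:
    """Half-life decay: the weight halves every `half_life` days.
    A negative modifier flips the exponent so weight grows with age
    (assumed handling of negative decay modifiers)."""
    return base * 0.5 ** (modifier * age_days / half_life)
```

With the defaults, a 90-day-old observation carries half its original weight and a 180-day-old one a quarter; with `modifier=-1.0` the same ages double and quadruple it instead.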
## Running the quality pipeline
```python
from pdx.training.quality.pipeline import QualityPipeline

qp = QualityPipeline(
    min_quality=0.3,
    dedup_threshold=0.85,
    min_tokens=50,
    max_tokens=2000,
)

clean_entries = qp.run(
    entries=raw_entries,
    dedup=True,
    quality_filter=True,
    curriculum=True,
)

print(qp.stats)
# {"dedup_removed": 142, "quality_filtered": 38, "total_output": 820}
```