
Quality pipeline

Before training data reaches the fine-tuning stage, it passes through a three-step quality pipeline: deduplication, filtering, and curriculum ordering.

Why quality matters

Raw honeypot data is noisy. The same bot script hits thousands of honeypots with identical commands. Without deduplication, a fine-tuned model would memorize uname -a → discovery but fail on novel commands. Without curriculum ordering, training would be unstable — jumping between trivial and complex examples.

Step 1 — Deduplication

Method: trigram-based hashing.

For each training entry, the pipeline extracts the output text, generates all trigrams (3-word sequences) from the first 80 words, and hashes each with MD5. Two entries whose trigram-hash sets overlap above the similarity threshold are treated as duplicates, and the later entry is removed.

This catches near-duplicates — entries that are semantically identical but differ in minor wording (e.g., different IP addresses in otherwise identical sessions).

Threshold: configurable, default 0.85 similarity.
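The trigram step can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual implementation; the function names and the use of Jaccard overlap between trigram-hash sets are assumptions based on the description above.

```python
import hashlib

def trigram_hashes(text: str, max_words: int = 80) -> set[str]:
    """MD5-hash every 3-word sequence from the first max_words words."""
    words = text.split()[:max_words]
    return {
        hashlib.md5(" ".join(words[i:i + 3]).encode()).hexdigest()
        for i in range(len(words) - 2)
    }

def is_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Compare two entries by the Jaccard overlap of their trigram-hash sets."""
    ha, hb = trigram_hashes(a), trigram_hashes(b)
    if not ha or not hb:
        return False
    return len(ha & hb) / len(ha | hb) >= threshold
```

Because only a handful of trigrams change when, say, an IP address differs mid-session, two such entries still share most of their trigram hashes and exceed the threshold.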

Step 2 — Quality filtering

Entries are filtered based on:

| Criterion | Min | Max | Rationale |
| --- | --- | --- | --- |
| Token count | 50 | 2,000 | Too short = no signal. Too long = noise. |
| Quality score | 0.3 | — | Computed from verdict weight and validation status |
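As a sketch, the filter reduces to two checks per entry. The entry shape and `passes_filters` name are hypothetical, and whitespace splitting stands in for whatever tokenizer the pipeline actually uses.

```python
def passes_filters(entry: dict,
                   min_tokens: int = 50,
                   max_tokens: int = 2000,
                   min_quality: float = 0.3) -> bool:
    """Apply the token-count and quality-score criteria from the table."""
    # Whitespace tokens as a stand-in for the real tokenizer.
    n_tokens = len(entry["output"].split())
    if not (min_tokens <= n_tokens <= max_tokens):
        return False
    return entry["quality_score"] >= min_quality
```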

Verdict weighting

The quality score incorporates human validation when available:

| Validation status | Weight factor |
| --- | --- |
| Not validated | 0.3× base weight |
| Human agreed | 1.5× base weight (capped at 1.0) |
| Human disagreed | 2.0× base weight (important edge case) |
| Human uncertain | 0.1× base weight |

Human-disagreed entries get the highest weight because they represent the cases where the model got it wrong — exactly what we want the model to learn from.
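The weighting can be expressed as a small lookup. The status keys and function name are assumptions; the table only states the cap for the human-agreed row, so applying it to every status is an assumption here as well.

```python
# Weight factors per validation status, from the table above.
VALIDATION_FACTORS = {
    "not_validated": 0.3,
    "human_agreed": 1.5,
    "human_disagreed": 2.0,   # model got it wrong: highest-value signal
    "human_uncertain": 0.1,
}

def quality_score(base_weight: float, status: str) -> float:
    """Scale the base weight by the validation factor, capping at 1.0."""
    return min(1.0, base_weight * VALIDATION_FACTORS[status])
```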

Step 3 — Curriculum ordering

Training entries are sorted by difficulty:

  1. Simple patterns first — single-command observations, clear verdicts
  2. Multi-step sequences — sessions with 3–5 commands
  3. Complex kill chains — 5+ command sequences covering multiple MITRE tactics
  4. Edge cases last — false positives, uncertain verdicts, prompt injection attempts

This follows the curriculum learning principle: models train more stably when they see easy examples before hard ones.
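The ordering above amounts to sorting by a difficulty key. This is a hypothetical sketch: the entry fields and tier boundaries are assumptions (the "3–5" and "5+" tiers overlap at exactly 5 commands; 5 is treated as complex here).

```python
def difficulty(entry: dict) -> int:
    """0 = simple, 1 = multi-step, 2 = complex kill chain, 3 = edge case."""
    if entry.get("edge_case"):
        return 3                # false positives, uncertain verdicts, injections last
    n = entry["num_commands"]
    if n >= 5:
        return 2                # complex kill chains (5+ commands, assumed boundary)
    if n >= 3:
        return 1                # multi-step sequences
    return 0                    # simple single-command observations

entries = [
    {"num_commands": 7},
    {"num_commands": 1},
    {"num_commands": 4},
    {"num_commands": 2, "edge_case": True},
]
ordered = sorted(entries, key=difficulty)  # easy examples first
```

Python's sort is stable, so entries within the same tier keep their original relative order.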

Decay engine

Older observations lose weight over time using a half-life decay function:

weight = 0.5 ^ (elapsed_time / half_life)

Base half-life: 90 days. This ensures training data stays fresh — a vulnerability observed 6 months ago matters less than one observed yesterday.

Negative decay modifiers increase weight over time — used for rare, high-value observations (like the GLaDOS prompt injection) that become more important as the dataset grows.
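The decay function above can be written directly. The `modifier` parameter is an assumption about how negative decay modifiers are applied: flipping the sign of the exponent makes the weight grow rather than shrink over time.

```python
def decayed_weight(base_weight: float,
                   elapsed_days: float,
                   half_life_days: float = 90.0,
                   modifier: float = 1.0) -> float:
    """Half-life decay: weight halves every half_life_days.

    A negative modifier inverts the exponent, so the weight grows over
    time instead -- for rare, high-value observations (an assumed mechanism).
    """
    return base_weight * 0.5 ** (modifier * elapsed_days / half_life_days)
```

With the 90-day default, an observation from six months ago (~180 days) retains only a quarter of its original weight.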

Running the quality pipeline

```python
from pdx.training.quality.pipeline import QualityPipeline

qp = QualityPipeline(
    min_quality=0.3,
    dedup_threshold=0.85,
    min_tokens=50,
    max_tokens=2000
)

clean_entries = qp.run(
    entries=raw_entries,
    dedup=True,
    quality_filter=True,
    curriculum=True
)

print(qp.stats)
# {"dedup_removed": 142, "quality_filtered": 38, "total_output": 820}
```