Parametric Scenario Generation with Train/Validation Splits and Ratchet Loop
Problem
The sweep system evaluates ~31 behavioral constants by running scenarios and measuring retrieval quality (nDCG@5). Currently, Tiers 1-3 (21 constants) produce a flat nDCG@5 of 0.770 regardless of value -- the existing 3 field-layer scenarios (factual_recall, multi_step_reasoning, temporal_scheduling) don't stress those constants. Each scenario has hardcoded record counts (8-13), fixed importance distributions, and a single access pattern. 16 of 19 Tier 1-3 constants show zero sensitivity.
Issue #279 Phase 2 proposed hand-crafting 5 new stress-test scenarios, but manually designing scenarios per constant risks overfitting to those specific test cases -- the same problem with more steps.
Current behavior: Most constants show no measurable sensitivity across the 3 hand-crafted scenarios. Sweep results are uninformative for tuning.
Desired outcome: A ScenarioFactory that generates diverse scenarios from parameterized seeds, with train/validation splits to guard against overfitting and a ratchet loop to automate keep/discard decisions. At least 3 Tier 1-3 constants show measurable sensitivity (nDCG variance > 0.05) across generated scenarios.
Prior Art
- PR #250 / Issue #234: Built the initial sweep infrastructure -- SweepRunner, ResultsAggregator, Scenario base class, ParameterGrid, override injection, metric computation. This is the foundation we extend.
- PR #253: Centralized constants into fields/constants.py with the Defaults class. Made override injection via apply_overrides() possible.
- PR #292 / Issue #279: Expanded sweep coverage for phases 1/3/4 -- added pairwise interaction sweeps, Tier 4 recipe-layer experiments, degenerate value detection, results persistence. Phases 2 and 5 are replaced by this issue.
- PR #276 / Issue #273: Added Tier 4 SubconsciousMemory recipe-layer scenarios (SupportAgentScenario, CodingAssistantScenario, ResearchAgentScenario) with RecipeScenario base class, fixture-driven multi-turn simulation, and extraction/token metrics.
- karpathy/autoresearch: Inspiration for the immutable-harness + diverse-data + ratchet pattern. Avoids overfitting by using a large fixed diverse dataset with an immutable evaluation harness.
Spike Results
spike-1: What axes produce meaningful nDCG variance for Tier 1-3 constants?
- Assumption: "Varying record count and importance distribution will expose sensitivity in decay and confidence constants"
- Method: Code analysis of the 3 existing scenarios and how they interact with DecayingSortedField, ConfidenceField, ObservationProtocol, and composite_score.
- Finding: The existing scenarios all use similar importance distributions (linear spread from ~0.05 to ~0.9) and small record counts (8-13). The composite_score query returns results ranked by a weighted sum of sorted-set scores. With so few records, the ranking is trivially correct regardless of decay_rate or confidence_signal values. To create differentiation, scenarios need: (a) larger record counts (30-100) so ranking actually matters, (b) clustered importance distributions where many records have similar scores (forcing decay/confidence to break ties), (c) varied access patterns (some records accessed recently, others stale) so temporal decay creates real score differences, and (d) varied outcome frequencies so confidence signals have something to amplify/suppress.
- Confidence: High -- the math is straightforward. With 13 records spread across the 0.05-0.9 range, composite_score ranking is determined by importance alone. Constants only matter when scores are close enough that the constant's contribution changes the ordering.
- Impact on plan: Confirms the 7 parameterizable axes listed in the issue. Record count and importance distribution shape are the highest-leverage axes.
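The tie-breaking argument in the finding can be checked with a toy calculation. The sketch below does not use the real composite_score; it assumes a hypothetical score of importance plus a small weighted recency term, and compares the top-5 ordering with and without that term for a wide spread and for a tight cluster.

```python
def top_k(importances, recency, weight, k=5):
    """Rank record indices by a toy score: importance + weight * recency."""
    scores = [imp + weight * rec for imp, rec in zip(importances, recency)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# 13 records spread linearly across 0.05-0.9, like the hand-crafted scenarios.
spread = [0.05 + i * (0.85 / 12) for i in range(13)]
# 13 records clustered tightly just above 0.5 (hypothetical values).
clustered = [0.5 + i * 0.001 for i in range(13)]
# Alternate fresh (1.0) and stale (0.0) records.
recency = [1.0 if i % 2 == 0 else 0.0 for i in range(13)]

# Does turning the recency term on (weight 0.05) change the top-5 order?
spread_changed = top_k(spread, recency, 0.0) != top_k(spread, recency, 0.05)
clustered_changed = top_k(clustered, recency, 0.0) != top_k(clustered, recency, 0.05)
print(spread_changed, clustered_changed)  # False True
```

In the spread case the gap between neighbouring importances (about 0.07) is larger than the recency contribution, so the ordering cannot move; in the clustered case the same contribution reorders the whole top-5.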
spike-2: Can generated scenarios reuse the existing Scenario base class without modification?
- Assumption: "ScenarioFactory output must conform to the existing Scenario interface"
- Method: Code-read of Scenario base class, SweepRunner.run_single_sweep(), and FactualRecallScenario.
- Finding: SweepRunner expects scenario_classes -- a list of classes, each with a name attribute and an __init__(overrides) constructor. It instantiates via sc_class(overrides=overrides) and calls .execute(). A generated scenario class can satisfy this interface by subclassing Scenario and implementing setup()/run(). The factory produces class objects (not instances), each with a unique name. No changes to SweepRunner or Scenario needed.
- Confidence: High -- direct code inspection.
- Impact on plan: ScenarioFactory returns Type[Scenario] objects. The existing runner handles them like any other scenario class.
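One way to produce class objects rather than instances is Python's three-argument type() call. The sketch below uses a stand-in Scenario base class that models only what the spike describes (a name attribute, an overrides constructor, setup()/run(), and execute()); the real base class, its return type, and the helper name make_scenario_class are not taken from the codebase and are assumptions for illustration.

```python
from typing import Type

class Scenario:
    """Stand-in for the real base class (assumed shape, not the actual code)."""
    name = "base"

    def __init__(self, overrides=None):
        self.overrides = overrides or {}

    def setup(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def execute(self):
        self.setup()
        return self.run()

def make_scenario_class(seed_id: int, record_count: int) -> Type[Scenario]:
    """Build a Scenario subclass (a class, not an instance) with a unique name."""
    def setup(self):
        # Placeholder for record creation driven by the seed.
        self.records = list(range(record_count))

    def run(self):
        # Placeholder result; the real run() would return a ScenarioResult.
        return {"scenario": self.name, "n_records": len(self.records)}

    return type(
        f"GeneratedScenario{seed_id}",
        (Scenario,),
        {"name": f"generated_{seed_id}", "setup": setup, "run": run},
    )

cls = make_scenario_class(seed_id=7, record_count=40)
result = cls(overrides={"decay_rate": 0.3}).execute()
```

Because the runner only instantiates the class with overrides and calls execute(), a class built this way is indistinguishable from a hand-written subclass from the runner's point of view.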
spike-3: Performance feasibility of 50+ scenarios in <30 seconds
- Assumption: "Each scenario takes ~500ms based on existing harness timings"
- Method: Inference from existing sweep results and code structure.
- Finding: Each scenario involves: Redis writes (N records), an ObservationProtocol.on_context_used() call, a composite_score query, and Redis cleanup. With N=10, this takes ~200-400ms. With N=50-100 (larger generated scenarios), expect ~400-800ms per scenario execution. Running 50 scenarios sequentially: 50 * 600ms = 30s. This is tight but feasible. Optimization: scenarios with smaller record counts (5-20) run faster; mixing sizes keeps total time down. Parallelization is not needed for the initial implementation but could be added later.
- Confidence: Medium -- depends on actual Redis performance. The 30s target is achievable with a mix of small and medium scenarios.
- Impact on plan: Default scenario generation should produce a mix of sizes (5-100 records). Add a max_records parameter to the factory for tuning the speed/coverage tradeoff.
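The budget arithmetic can be made explicit with a small estimator. This is a rough sketch only: it assumes a two-level cost model using the midpoints of the ranges quoted above (300ms for scenarios of up to 20 records, 600ms otherwise), which is a simplification and not a measured profile.

```python
def estimated_seconds(record_counts, small_ms=300, large_ms=600, threshold=20):
    """Sum an assumed per-scenario cost: small_ms up to `threshold` records, else large_ms."""
    total_ms = sum(small_ms if n <= threshold else large_ms for n in record_counts)
    return total_ms / 1000

all_large = [100] * 50          # 50 large scenarios
mixed = [10] * 20 + [50] * 30   # 20 small + 30 medium scenarios

print(estimated_seconds(all_large))  # 30.0
print(estimated_seconds(mixed))      # 24.0
```

Under these assumptions an all-large seed bank sits exactly at the 30s limit, while a mixed bank leaves some headroom, which is the motivation for mixing sizes by default.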
Solution
Three connected components that extend the existing harness without modifying the immutable evaluation infrastructure (metrics, apply_overrides, Scenario base class):
Component 1: ScenarioFactory
A new module tests/benchmarks/scenarios/factory.py that generates Scenario subclasses from parameterized seed configurations.
Parameterizable axes (each is a seed parameter):
| Axis | Type | Range | Effect |
|---|---|---|---|
| record_count | int | 5-100 | Number of Memory records created in setup |
| importance_shape | enum | uniform, clustered, bimodal, exponential, flat | Distribution shape for importance values |
| access_pattern | enum | all_recent, half_stale, mostly_stale, interleaved | Which records get recent access (affects decay) |
| outcome_frequency | float | 0.0-1.0 | Fraction of records that receive "acted" outcomes |
| noise_ratio | float | 0.0-0.5 | Fraction of records that are pure noise (importance < 0.15) |
| link_density | float | 0.0-1.0 | Fraction of record pairs with co-occurrence links |
| age_spread_days | int | 1-365 | Range of record ages (affects temporal decay separation) |
Seed-to-scenario mapping:
    from dataclasses import dataclass

    @dataclass
    class ScenarioSeed:
        """Deterministic configuration for generating a Scenario class."""
        seed_id: int
        record_count: int
        importance_shape: str  # "uniform", "clustered", "bimodal", "exponential", "flat"
        access_pattern: str  # "all_recent", "half_stale", "mostly_stale", "interleaved"
        outcome_frequency: float
        noise_ratio: float
        link_density: float
        age_spread_days: int
ScenarioFactory.create(seed: ScenarioSeed) -> Type[Scenario] builds a concrete Scenario subclass whose setup() creates records according to the seed parameters and whose run() queries via composite_score and returns ScenarioResult with proper ground truth.
Determinism: Given the same seed, the factory produces the same scenario with the same records. Uses random.Random(seed_id) for all stochastic decisions within a scenario.
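A minimal sketch of how a per-scenario random.Random instance keeps generation reproducible, using the five importance shapes as the example. The distribution parameters (cluster centre 0.5, bimodal modes at 0.2 and 0.8, and so on) and the reading of "flat" as a constant value are assumptions made for illustration; the plan does not specify them.

```python
import random

def importance_values(shape: str, n: int, rng: random.Random) -> list:
    """Draw n importance values in [0, 1] for one of the five shapes."""
    if shape == "uniform":
        values = [rng.uniform(0.0, 1.0) for _ in range(n)]
    elif shape == "clustered":
        values = [rng.gauss(0.5, 0.03) for _ in range(n)]  # tight cluster
    elif shape == "bimodal":
        values = [rng.gauss(rng.choice([0.2, 0.8]), 0.05) for _ in range(n)]
    elif shape == "exponential":
        values = [rng.expovariate(5.0) for _ in range(n)]  # mostly low, few high
    elif shape == "flat":
        values = [0.5] * n  # assumed meaning: every record identical
    else:
        raise ValueError(f"unknown importance shape: {shape}")
    # Clamp so every value stays a valid importance.
    return [min(1.0, max(0.0, v)) for v in values]

# The same seed_id yields the same values on every run.
first = importance_values("clustered", 30, random.Random(42))
second = importance_values("clustered", 30, random.Random(42))
assert first == second
```

Passing the generator in explicitly, instead of calling the module-level random functions, keeps one scenario's draws from being affected by any other code that uses the global generator.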
Default seed bank: ScenarioFactory.default_seeds(n=50) -> List[ScenarioSeed] produces a deterministic list of 50 diverse seeds by iterating over axis combinations. The seeds are designed to cover the cross-product of axes with emphasis on high-leverage combinations (large record count + clustered importance + mixed staleness).
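One possible shape for a deterministic seed bank is to walk a grid of axis combinations with a fixed stride. The sketch below is an assumption about how default_seeds could work, not the planned implementation: the grid values, the stride of 7, and the modular formulas for the remaining axes are all illustrative choices.

```python
from dataclasses import dataclass
from itertools import product
from typing import List

@dataclass(frozen=True)
class ScenarioSeed:
    seed_id: int
    record_count: int
    importance_shape: str
    access_pattern: str
    outcome_frequency: float
    noise_ratio: float
    link_density: float
    age_spread_days: int

def default_seeds(n: int = 50) -> List[ScenarioSeed]:
    """Deterministically pick n seeds from a grid of axis combinations."""
    counts = [100, 50, 20, 5]
    shapes = ["clustered", "bimodal", "uniform", "exponential", "flat"]
    patterns = ["half_stale", "mostly_stale", "interleaved", "all_recent"]
    combos = list(product(counts, shapes, patterns))  # 4 * 5 * 4 = 80 combinations
    seeds = []
    for seed_id in range(n):
        # A stride coprime with 80 spreads the picks across the whole grid.
        count, shape, pattern = combos[(seed_id * 7) % len(combos)]
        seeds.append(ScenarioSeed(
            seed_id=seed_id,
            record_count=count,
            importance_shape=shape,
            access_pattern=pattern,
            outcome_frequency=(seed_id % 5) / 4,   # 0.0 .. 1.0
            noise_ratio=(seed_id % 3) * 0.25,      # 0.0, 0.25, 0.5
            link_density=(seed_id % 4) / 3,        # 0.0 .. 1.0
            age_spread_days=[1, 30, 90, 365][seed_id % 4],
        ))
    return seeds

seeds = default_seeds(50)
```

No random number generator is involved, so two calls always return the same list, and every one of the 50 seeds lands on a different (record_count, importance_shape, access_pattern) combination.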
Component 2: Train/Validation Split
A new module tests/benchmarks/split.py that partitions generated scenarios into disjoint train and validation sets.
Split strategy:
- ScenarioFactory.default_seeds(n=50) produces 50 seeds with IDs 0-49.
- Seeds 0-34 (70%) form the train set -- used for sweeping and finding optimal values.
- Seeds 35-49 (30%) form the validation set -- used to verify that optimal values generalize.
- The split is deterministic and fixed. No random shuffling -- seed IDs determine the partition.
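The partition itself needs only the seed IDs. A minimal sketch, assuming a hypothetical helper named split_seed_ids and integer arithmetic for the cutoff so that the boundary does not depend on floating-point rounding:

```python
def split_seed_ids(n: int = 50, train_percent: int = 70):
    """Split seed IDs 0..n-1 into disjoint train and validation lists, in order."""
    cutoff = n * train_percent // 100  # 50 * 70 // 100 == 35
    ids = list(range(n))
    return ids[:cutoff], ids[cutoff:]

train_ids, validation_ids = split_seed_ids()
print(train_ids[0], train_ids[-1])            # 0 34
print(validation_ids[0], validation_ids[-1])  # 35 49
```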
Overfitting guard: After sweeping on the train set, the ratchet loop re-evaluates each constant's proposed optimal value on the validation set. If a constant's best value on train does not improve validation nDCG@5 over the current default by more than 0.01, the proposed change is rejected as spurious. This directly addresses the overfitting risk of hand-crafted scenarios.
Implementation:
    # SweepRunner and ResultsAggregator come from the existing sweep harness.
    class SplitRunner:
        """Run sweeps on train scenarios, validate on held-out scenarios."""

        def __init__(self, train_scenarios, validation_scenarios):
            self.train_runner = SweepRunner(train_scenarios)
            self.validation_runner = SweepRunner(validation_scenarios)

        def sweep_and_validate(self, constant_name, values, current_default):
            """Sweep on train, validate best value on held-out set."""
            train_results = self.train_runner.run_single_sweep(constant_name, values)
            # Find best on train
            train_agg = ResultsAggregator()
            train_agg.add_sweep(constant_name, train_results)
            best_value, train_score = train_agg.get_optimal_value(constant_name)
            # Validate on held-out set: run both the train-best value and the
            # current default so they can be compared on the same scenarios
            val_results = self.validation_runner.run_single_sweep(
                constant_name, [best_value, current_default]
            )
            # Compare best vs default on validation
            ...
Component 3: Ratchet Loop
A new module tests/benchmarks/ratchet.py that automates the keep/discard decision for each constant's proposed optimal value.
Ratchet algorithm:
1. For each constant in Tiers 1-3:
   a. Sweep on train scenarios, find optimal value.
   b. Evaluate optimal value AND current default on validation scenarios.
   c. Compute delta = validation_ndcg(optimal) - validation_ndcg(default).
   d. If delta <= 0.01 (improvement threshold): REJECT as "no significant improvement on validation".
   e. If the optimal value is within 10% of a detected cliff: REJECT with "too close to cliff" warning, unless the cliff safety margin below yields a safe value.
   f. Otherwise: ACCEPT the proposed change.
2. Produce a summary report: proposed changes, accepted/rejected, safety margins, cliff warnings.
3. The ratchet NEVER writes to constants.py directly. It outputs a human-readable diff proposal.
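The per-constant decision in step 1 can be sketched as a pure function. This is an illustration under stated assumptions, not the planned RatchetLoop: the names Decision and decide are hypothetical, the cliff check is given precedence over acceptance, and "within 10% of a cliff" is interpreted as 10% of the swept range width, which the plan does not specify.

```python
from dataclasses import dataclass
from typing import Optional

IMPROVEMENT_THRESHOLD = 0.01  # minimum validation nDCG@5 gain to accept
CLIFF_FRACTION = 0.10         # assumed: fraction of the swept range width

@dataclass
class Decision:
    accepted: bool
    reason: str
    delta: float

def decide(optimal: float, val_ndcg_optimal: float, val_ndcg_default: float,
           cliff: Optional[float] = None, sweep_width: float = 1.0) -> Decision:
    """Accept or reject one proposed value from its validation scores."""
    delta = val_ndcg_optimal - val_ndcg_default
    if delta <= IMPROVEMENT_THRESHOLD:
        return Decision(False, "no significant improvement on validation", delta)
    if cliff is not None and abs(optimal - cliff) < CLIFF_FRACTION * sweep_width:
        return Decision(False, "too close to cliff", delta)
    return Decision(True, "validated on held-out scenarios", delta)

print(decide(0.3, 0.82, 0.77).accepted)                       # True
print(decide(0.7, 0.76, 0.77).reason)                         # no significant improvement on validation
print(decide(0.85, 0.81, 0.77, cliff=0.80).reason)            # too close to cliff
```

The function only returns a decision object; consistent with step 3, nothing here writes to constants.py.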
Cliff safety margin:
Using ResultsAggregator.detect_cliff_effects(), the ratchet checks whether the proposed value is within 10% of any cliff boundary. If so, it shifts the proposal away from the cliff by the safety margin, or rejects if no safe value exists.
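A minimal sketch of the shift-or-reject step, assuming the cliff location has already been obtained (for example from detect_cliff_effects(), whose return format is not shown here) and that the margin is passed in as an absolute distance. The function name shift_from_cliff and the idea of choosing among the already-swept values are assumptions for illustration.

```python
from typing import Optional, Sequence

def shift_from_cliff(proposed: float, cliff: float, margin: float,
                     candidates: Sequence[float]) -> Optional[float]:
    """Return a value at least `margin` away from the cliff, or None if none exists.

    Only candidates on the same side of the cliff as the proposal are considered,
    so the shift always moves away from the cliff rather than across it.
    """
    if abs(proposed - cliff) >= margin:
        return proposed  # already safe
    side = 1 if proposed >= cliff else -1
    safe = [v for v in candidates if (v - cliff) * side >= margin]
    if not safe:
        return None  # caller rejects the proposal
    return min(safe, key=lambda v: abs(v - proposed))  # smallest shift

swept = [0.3, 0.55, 0.65, 0.75]
print(shift_from_cliff(0.55, cliff=0.5, margin=0.1, candidates=swept))  # 0.65
print(shift_from_cliff(0.9, cliff=0.5, margin=0.1, candidates=swept))   # 0.9
print(shift_from_cliff(0.55, cliff=0.5, margin=0.1, candidates=[0.45, 0.55]))  # None
```

Choosing the nearest safe candidate keeps the proposal as close as possible to the train-optimal value; a shifted value would still need to pass the validation check before being reported as accepted.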
Output format:
    RATCHET SUMMARY
    ===============
    Proposed changes (validated on held-out scenarios):
      decay_rate: 0.5 -> 0.3 (train +0.08, validation +0.05) ACCEPT
      ACTED_CONFIDENCE_SIGNAL: 0.9 -> 0.7 (train +0.03, validation -0.01) REJECT (no validation improvement)
      decay_factor: 0.95 -> 0.85 (train +0.06, validation +0.04, cliff at 0.80) ACCEPT (margin: 0.05)
    Rejected (no sensitivity):
      TD_ALPHA, TD_GAMMA, WILSON_CI_THRESHOLD, ...
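A line of the "Proposed changes" section can be produced with a signed format specifier. The helper name format_proposal is hypothetical, and the sketch covers only the plain case without cliff annotations.

```python
def format_proposal(name, old, new, train_delta, val_delta, verdict):
    """Format one proposed change; '+.2f' always prints the sign of the delta."""
    return (f"{name}: {old} -> {new} "
            f"(train {train_delta:+.2f}, validation {val_delta:+.2f}) {verdict}")

print(format_proposal("decay_rate", 0.5, 0.3, 0.08, 0.05, "ACCEPT"))
# decay_rate: 0.5 -> 0.3 (train +0.08, validation +0.05) ACCEPT
```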
Data Flow
    ScenarioFactory.default_seeds(50)
             |
             v
    [ScenarioSeed x 50] --split--> Train seeds (0-34) | Validation seeds (35-49)
             |                             |                       |
             v                             v                       v
    ScenarioFactory.create()       SweepRunner(train)   SweepRunner(validation)
             |                             |                       |
             v                             v                       v
    Type[Scenario] x 50            SweepPoint results   SweepPoint results
                                           |                       |
                                           v                       v
                                   ResultsAggregator -----> RatchetLoop
                                                                 |
                                                                 v
                                                      Proposed defaults diff
                                                      (human-readable report)
Tasks
Phase 1: ScenarioFactory (core generation)
- [ ] Create tests/benchmarks/scenarios/factory.py with ScenarioSeed dataclass
- [ ] Implement ScenarioFactory.create(seed) -> Type[Scenario] -- builds a Scenario subclass that creates record_count records with the specified importance distribution, access pattern, outcome frequency, noise ratio, and age spread
- [ ] Implement importance distribution generators: uniform, clustered, bimodal, exponential, flat
- [ ] Implement access pattern simulators: all_recent, half_stale, mostly_stale, interleaved (manipulate DecayingSortedField scores via time offsets)
- [ ] Implement ScenarioFactory.default_seeds(n=50) -- deterministic diverse seed bank
- [ ] Add co-occurrence link generation based on link_density parameter
- [ ] Verify generated scenarios pass the Scenario interface (execute returns valid ScenarioResult)
Phase 2: Train/Validation Split
- [ ] Create tests/benchmarks/split.py with SplitRunner class
- [ ] Implement deterministic 70/30 split by seed ID
- [ ] sweep_and_validate() method: sweep on train, evaluate best + default on validation
- [ ] Return structured result: train_score, validation_score, delta, accept/reject recommendation
Phase 3: Ratchet Loop
- [ ] Create tests/benchmarks/ratchet.py with RatchetLoop class
- [ ] Implement ratchet algorithm: iterate Tiers 1-3, sweep + validate, accept/reject with cliff safety margins
- [ ] Implement cliff safety margin check (10% buffer from detected cliffs)
- [ ] Human-readable summary report output
- [ ] Integration with run_sweeps.py: add --ratchet flag that runs the full ratchet pipeline
Phase 4: Integration and CLI
- [ ] Add --parametric flag to run_sweeps.py that uses generated scenarios instead of (or alongside) hand-crafted ones
- [ ] Add --ratchet flag to run_sweeps.py that runs the full ratchet pipeline after sweeping
- [ ] Ensure backward compatibility: existing --tier 1|2|3|4|all flags continue to use hand-crafted scenarios by default
- [ ] Persist ratchet results alongside sweep results in results/ directory
Phase 5: Validation and Tuning
- [ ] Run full sweep with 50 generated scenarios and verify at least 3 Tier 1-3 constants show nDCG variance > 0.05
- [ ] Verify train/validation split catches at least 1 spurious "optimal" value
- [ ] Verify full sweep (all tiers + generated scenarios + ratchet) completes in under 60 seconds
- [ ] If sensitivity target not met, tune seed parameters (increase record counts, add more extreme distributions)
No-Gos
- Do not modify tests/benchmarks/metrics/ -- the metrics system is the immutable evaluation harness
- Do not modify tests/benchmarks/overrides.py -- override injection is stable and sufficient
- Do not modify the Scenario or RecipeScenario base classes -- generated scenarios must conform to the existing interface
- Do not auto-write to constants.py -- the ratchet proposes changes, humans approve them
- Do not introduce parallelism in Phase 1 -- sequential execution is simpler to debug and the 30s target is achievable without it
- Do not replace hand-crafted scenarios -- they remain the default; generated scenarios are opt-in via --parametric
- Do not sweep InteractionWeight constants -- these are team-preference values, not tunable by retrieval metrics
Update System
No update system changes required. This feature is entirely within the test/benchmark harness and does not affect the deployed library, CLI tools, or any update scripts. No new dependencies are introduced.
Agent Integration
No agent integration required. The benchmark harness is a developer-facing test tool, not an agent-accessible capability. No MCP server changes, no bridge changes, no .mcp.json modifications needed.
Failure Path Test Strategy
Failure modes for the parametric generation system:
- Degenerate scenarios: A generated scenario might produce 0 relevant records (e.g., all noise) or trivial rankings (1 record). The factory must validate seeds and skip degenerate configurations. Test: generate 100 seeds, assert all produce at least 3 relevant records.
- Non-determinism: If a scenario produces different results on repeated runs with the same seed, the sweep results become unreliable. Test: run the same generated scenario 3 times, assert identical ScenarioResult.retrieved_ids.
- Redis key collisions: Generated scenarios create many records. If prefixes collide, results are corrupted. Test: run 10 generated scenarios back to back, with each scenario's teardown overlapping the next scenario's setup (no parallel execution), and verify no key cross-contamination.
- Ratchet false accepts: The ratchet might accept a change that happens to pass validation by chance. Mitigation: the 0.01 improvement threshold and cliff safety margin reduce this risk. Test: create a synthetic scenario where the optimal value on train doesn't generalize, verify the ratchet rejects it.
- Timeout: 50 scenarios with large record counts might exceed the 30s budget. Test: time the full factory run and assert < 30s. If it fails, reduce default record counts.
Test Impact
- [ ] tests/benchmarks/test_sweep.py::TestSweepRunner -- UPDATE: add tests for generated scenario classes passed to SweepRunner (verify they work identically to hand-crafted ones)
- [ ] tests/benchmarks/test_harness.py::TestFactualRecallScenario -- no change needed, hand-crafted scenarios are untouched
- [ ] tests/benchmarks/test_sweep.py::TestSweepIntegration::test_end_to_end_sweep -- UPDATE: add a parametric variant that uses generated scenarios
- [ ] tests/benchmarks/test_tier4.py -- no change needed, Tier 4 recipe scenarios are independent
New test files to create:
- tests/benchmarks/test_factory.py -- unit tests for ScenarioFactory, ScenarioSeed, importance distributions, access patterns, determinism, degenerate detection
- tests/benchmarks/test_split.py -- unit tests for SplitRunner, train/validation partitioning, overfitting guard
- tests/benchmarks/test_ratchet.py -- unit tests for RatchetLoop, cliff safety margin, accept/reject logic, summary report format
Rabbit Holes
- Optimizing seed selection with meta-learning: It's tempting to use optimization algorithms to find "the best" set of seeds that maximizes constant sensitivity. This is a meta-overfitting trap -- the point is diverse coverage, not optimized coverage. Stick with deterministic combinatorial seed generation.
- Adding semantic/embedding-based scenarios to the factory: The factory generates field-level scenarios for Tiers 1-3. Recipe-layer scenarios (Tier 4) already have good coverage via the existing RecipeScenario fixtures. Don't try to parametrically generate recipe-layer scenarios -- the multi-turn LLM fixture format is too complex for parametric generation.
- Parallelizing scenario execution: Tempting for performance, but Redis operations within scenarios are not thread-safe (shared key prefixes, scan operations). Sequential execution is correct and fast enough. Only consider parallelism if the 60s full-sweep target is missed after tuning record counts.
Documentation
- [ ] Create docs/features/parametric-sweep.md describing the ScenarioFactory, train/validation split, and ratchet loop
- [ ] Update tests/benchmarks/README.md (if it exists) or add inline module docstrings explaining --parametric and --ratchet CLI flags
- [ ] Add entry to docs/plans/ index if one exists