How to Add a Challenge to the Eval Environment¶

Step-by-step guide for adding a new challenge to the Gymnasium eval environment (SutroYaro/SparseParity-v0). This covers the registry, harness, and answer key. An agent or contributor should be able to complete this in one session.

For adding challenges to the research workspace (harness, docs, search space), see adding-a-challenge.md.

Overview¶

The eval environment uses a registry (sparse_parity.eval.registry) so you never need to edit env.py to add a new challenge or method.

registry.py          -- register_challenge(), register_method()
default_registry.py  -- ships 3 challenges, 16 methods
env.py               -- reads the registry at runtime
backends.py          -- looks up harness functions from registry

1. Implement the harness measure function¶

Write a measure_your_challenge(method, n_bits, k_sparse, seed, **kwargs) function that returns a dict with at least:

{
    "accuracy": float,   # 0.0 to 1.0
    "ard": float or None,
    "dmc": float or None,
    "time_s": float,
    "total_floats": int or None,
    "found_secret": list or None,
}

You can put this function anywhere importable from PYTHONPATH=src. The simplest option is adding it to src/harness.py (following the existing measure_sparse_sum as a template), but you can also put it in a separate module.

Do not modify harness.py in experiment PRs (LAB.md rule #9). If you are adding infrastructure (not running an experiment), a separate PR that modifies harness.py is fine.

2. Register the challenge¶

In your module, or in default_registry.py if this ships with the repo:

from sparse_parity.eval.registry import register_challenge

def _my_harness(**kwargs):
    import my_module
    return my_module.measure_my_challenge(**kwargs)

register_challenge(
    "my-challenge",
    harness_fn=_my_harness,
    description="One-line description of the task",
    default_config={"n_bits": 20, "k_sparse": 3, "seed": 42},
)

The harness_fn is called by the LocalBackend with keyword arguments: method, n_bits, k_sparse, seed, plus any extras.

3. Register methods for the challenge¶

from sparse_parity.eval.registry import register_method

register_method(
    "my_method",
    category="algebraic",
    applicable_challenges=["my-challenge"],
    description="What this method does",
)

If a method works on all challenges, set applicable_challenges=None.

Important: Method registration order determines the action-space index. The 16 default methods are indices 0-15. New methods get indices 16, 17, etc. This means the action space grows automatically.

4. Add answer key entries¶

Add experiments to src/sparse_parity/eval/answer_key.json so the OracleAgent and DiscoveryGrader know the ground truth:

{
    "exp_id": "my-exp1",
    "method": "my_method",
    "challenge": "my-challenge",
    "accuracy": 1.0,
    "ard": 500.0,
    "dmc": 1200.0,
    "category": "algebraic",
    "result": "SOLVED"
}

5. Run baselines¶

PYTHONPATH=src python3 -c "
from sparse_parity.eval.registry import register_challenge, register_method
# ... your registrations ...

import gymnasium as gym
import sparse_parity.eval

env = gym.make('SutroYaro/SparseParity-v0',
    challenge='my-challenge', metric='dmc', budget=5)
obs, info = env.reset()
print(info)
obs, r, _, _, info = env.step(0)
print(info)
"

6. External registration (no repo changes needed)¶

If you are developing outside the repo, you can register challenges and methods at runtime before creating the environment:

import sparse_parity.eval  # loads defaults
from sparse_parity.eval import registry

registry.register_challenge("my-challenge", harness_fn=my_fn)
registry.register_method("my-method", category="custom")

env = gym.make("SutroYaro/SparseParity-v0",
    challenge="my-challenge", metric="dmc", budget=10)

Checklist¶

Harness measure function implemented and tested standalone
Challenge registered via register_challenge()
At least 1 method registered via register_method()
Answer key entries added (if shipping with the repo)
Baselines recorded in DISCOVERIES.md
Environment creates and runs without errors
Existing tests (run_eval.py) still pass