Research Eval Environment

A Gymnasium environment for testing whether a coding agent can figure out how to solve a learning problem efficiently. The agent picks methods, runs real experiments, and gets scored on what it discovered.

Quick start

```sh
git clone https://github.com/cybertronai/SutroYaro
cd SutroYaro
pip install gymnasium numpy
PYTHONPATH=src python3 -c "
import gymnasium as gym
import sparse_parity.eval.env

env = gym.make('SutroYaro/SparseParity-v0', metric='dmc', budget=16)
obs, info = env.reset()
obs, reward, done, trunc, info = env.step(5)  # try GF(2)
print(f'Method: {info[\"method\"]}, DMC: {info[\"dmc\"]}, Reward: {reward:.2f}')
env.render()
"
```

Or point your coding agent at the repo and ask it to run the eval:

```sh
claude --dangerously-skip-permissions
# then: "run the eval environment and tell me what you find"
```

What this tests

The agent picks methods to solve sparse parity (and two related problems). Each step calls real experiment code and returns energy metrics. The agent has a fixed budget of experiments and needs to find the cheapest method.

The problem is small (it solves in under a second), but the method space is large (16 options across 4 categories) and many methods fail. An agent that tries SGD first, discovers that GF(2) solves the problem 240x faster, and then notices that KM has better reuse distance but worse total data movement has made three real research discoveries.

We know the answers. 36 experiments have been run. The grading rubric checks whether the agent rediscovered what we already know.

How this compares to existing benchmarks

Most agent benchmarks test code generation. PrimeIntellect's 105 community environments (scicode, gpu_puzzles, llm_training_puzzles) ask agents to write code and check if it runs. ScienceAgentBench (Allen AI, 102 tasks) asks agents to reproduce paper results. HuggingFace has model leaderboards but no agent research environments.

This environment tests something different: experiment selection. The agent does not write code. It picks which method to run, observes the result, and decides what to try next. The grading rubric checks whether the agent made specific discoveries (found the algebraic solver, noticed the metric disagreement, observed that local learning fails), not whether it produced correct code.

We can grade this because we have ground truth. Most research environments cannot score an agent's trajectory because the optimal policy is unknown. Here, 36 experiments establish what works, what fails, and why.

The real question: few tokens, small wall clock

The Sutro Group constraint is that experiments should run within 1980s compute budgets (under 1 second, ideally under 10ms, matching a Spark 7). But there's a second constraint that applies to the agent itself: how much does the research process cost?

Two costs matter:

Experiment cost. How much data does the method move? That's what DMC measures. GF(2) at DMC 8,607 is 9 million times cheaper than Fourier at 78 billion. The environment tracks this per step.

Agent cost. How many experiments did the agent need to find the best method? How many tokens of LLM interaction did that take? An agent that finds GF(2) in 3 steps used fewer resources than one that tried all 16. The efficiency grading category (5 pts for finding the best method in steps 1-3) measures this, but it doesn't capture the token cost of the agent's reasoning between steps.

The environment currently measures experiment cost (DMC, ARD, wall time per method). It does not measure agent cost (tokens used, agent wall clock between decisions). A future version could track:

  • Total tokens consumed by the agent across the episode
  • Wall clock time between env.step() calls (the agent's "thinking time")
  • Cost per discovery point (tokens spent per grading point earned)

This would let us compare not just which agent finds the best method, but which agent finds it most cheaply. An oracle that knows the answer uses zero reasoning tokens. A random agent uses zero reasoning tokens but wastes experiment budget. The interesting region is between them: an agent that reasons about the results, forms hypotheses, and converges on GF(2) in 3-5 steps using a few thousand tokens.
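As a sketch of what that future agent-cost tracking could look like, here is a minimal wrapper that records thinking time between `env.step()` calls and accumulates caller-reported token counts. All names here (`AgentCostTracker`, `StubEnv`, the `tokens_used` argument) are illustrative, not part of the environment:

```python
import time

class AgentCostTracker:
    """Illustrative wrapper (not part of the environment): records the
    agent's 'thinking time' between env calls and any reasoning-token
    counts the caller reports alongside each action."""

    def __init__(self, env):
        self.env = env
        self.think_times = []  # seconds of agent deliberation before each step
        self.tokens = 0        # total reasoning tokens reported by the agent
        self._last_return = None

    def reset(self, **kwargs):
        out = self.env.reset(**kwargs)
        self._last_return = time.monotonic()
        return out

    def step(self, action, tokens_used=0):
        now = time.monotonic()
        if self._last_return is not None:
            self.think_times.append(now - self._last_return)
        self.tokens += tokens_used
        out = self.env.step(action)
        self._last_return = time.monotonic()
        return out

class StubEnv:
    """Stand-in so the sketch runs without the real environment."""
    def reset(self, **kwargs):
        return {}, {}
    def step(self, action):
        return {}, 1.0, False, False, {"method": "gf2"}

tracker = AgentCostTracker(StubEnv())
tracker.reset()
tracker.step(5, tokens_used=1200)  # agent deliberated, then ran GF(2)
tracker.step(6, tokens_used=800)
print(tracker.tokens)            # 2000
print(len(tracker.think_times))  # 2 (reset->step and step->step gaps)
```

Dividing `tracker.tokens` by the discovery points earned would give the cost-per-discovery metric described above.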

The demo script (`src/sparse_parity/eval/demo.py`) shows the speed check: 4 methods under 10ms, 10 under 1 second, sorted by DMC. Run it with `PYTHONPATH=src python3 src/sparse_parity/eval/demo.py`.

Environments

| Registration | What it does |
|---|---|
| `SutroYaro/SparseParity-v0` | One challenge at a time (parity, sum, or AND) |
| `SutroYaro/MultiChallenge-v0` | Cycles through all three per episode |

Action space: 16 methods

| Index | Method | Category | Source | Solves parity? |
|---|---|---|---|---|
| 0 | SGD | Neural net | harness | Yes (0.12s) |
| 1 | Per-layer | Neural net | live fallback | Slow (often times out) |
| 2 | Sign SGD | Neural net | live fallback | Slow (often times out) |
| 3 | Curriculum | Neural net | live fallback | Yes |
| 4 | Forward-Forward | Neural net | cached | No (58.5% max) |
| 5 | GF(2) | Algebraic | harness | Yes (509µs) |
| 6 | KM Influence | Algebraic | harness | Yes |
| 7 | SMT | Algebraic | harness | Yes |
| 8 | Fourier | Algebraic | harness | Yes |
| 9 | LASSO | Info-theoretic | live fallback | Yes |
| 10 | MDL | Info-theoretic | live fallback | Yes |
| 11 | Mutual Info | Info-theoretic | live fallback | Yes |
| 12 | Random Proj | Info-theoretic | live fallback | Yes |
| 13 | RL | Alternative | cached | Yes (cached) |
| 14 | Genetic Prog | Alternative | live fallback | Usually no |
| 15 | Evolutionary | Alternative | live fallback | Yes |

"Source" means how the method runs. Harness methods go through the locked evaluation harness. Live fallback methods run their own implementation. Cached methods return documented results because they're too slow for live eval (forward_forward needs 30+ seconds, RL Q-learning needs 50K episodes).

Methods that fail are part of the environment. An agent that tries forward_forward, observes 58.5% accuracy, and moves on has learned something about the problem structure.

Baselines

| Agent | Mean Reward | Discovery Score | What it does |
|---|---|---|---|
| Random | 16.61 | 49.4/72 (68.6%) | Picks random methods each step |
| Greedy | 16.91 | 57.0/72 (79.2%) | Tries each method in order, repeats the best |
| Oracle | 7.59 | 57.4/72 (79.7%) | Picks the best method first (from answer key) |

Oracle gets the lowest reward but the highest discovery score. The reward function gives points for improvement (e.g., going from SGD to GF(2)), so finding the best method first leaves no room to improve. The discovery grader does not care about order.
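The Greedy baseline's logic can be sketched in a few lines. This is an assumed reconstruction of the behavior described above, not the repo's `baselines.py`: spend the first phase trying every method once, then repeat the best scorer for the remaining budget.

```python
def greedy_episode(step_fn, n_methods=16, budget=20):
    """Try each method once, then repeat the best-scoring one.
    step_fn(action) -> reward stands in for env.step()."""
    rewards = {}
    actions = []
    for a in range(min(n_methods, budget)):
        rewards[a] = step_fn(a)
        actions.append(a)
    best = max(rewards, key=rewards.get)
    while len(actions) < budget:
        actions.append(best)
        step_fn(best)
    return actions, best

# Toy reward table standing in for real experiment results.
toy_rewards = {0: 1.0, 5: 3.0}  # pretend GF(2) (index 5) scores best
actions, best = greedy_episode(lambda a: toy_rewards.get(a, 0.0))
print(best)           # 5
print(actions[16:])   # [5, 5, 5, 5]
```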

Discovery grading (12 categories, 72 points)

The grader checks what the agent figured out, not just what number it got.

| Category | Pts | What the agent needs to do |
|---|---|---|
| Discovered algebraic solver | 10 | Try GF2, KM, or SMT. Solve with one of them. (3 pts partial credit for trying without solving.) |
| Discovered KM influence | 7 | Solve with KM. This is the O(n) method. |
| Identified local learning failure | 5 | Try forward_forward. Observe accuracy < 95%. |
| Found metric disagreement | 5 | Solve with both KM and GF2. KM wins on ARD, GF2 wins on DMC. Both in the log means the agent has the data to notice. |
| Found curriculum speedup | 5 | Solve with curriculum. |
| Identified parity invisibility | 5 | Observe 2+ failures and also find working methods. The contrast reveals the problem structure. |
| Exploration breadth | 5 | 1 pt per method that solves the problem. Max 5. |
| Efficiency | 5 | Find the best method early. 5 pts in steps 1-3, decreasing to 0 at step 16. |
| Optimized beyond baseline | 3 | Find any method with DMC below SGD (1,278,460). |
| Cross-challenge analysis | 3 | MultiChallengeEnv only. Solve across 2+ challenges. |
| Cache model insight | 3 | Measure DMC across 3+ methods. The spread reveals cache/energy behavior. |
| Correct failure classification | 2 each | Per failed method: 1 pt for observing, 2 pts for moving on to try alternatives. Max 16. |
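For instance, the efficiency category could be scored like this (an assumed linear falloff between steps 3 and 16; the repo's `grader.py` is the source of truth):

```python
def efficiency_points(first_best_step):
    """5 points if the best method first appears in steps 1-3,
    decreasing (here: linearly, an assumption) to 0 at step 16."""
    if first_best_step <= 3:
        return 5.0
    if first_best_step >= 16:
        return 0.0
    return 5.0 * (16 - first_best_step) / (16 - 3)

print(efficiency_points(2))   # 5.0
print(efficiency_points(16))  # 0.0
```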

Adding methods and challenges

The environment uses a registry. You can add methods or challenges without editing env.py.

```python
from sparse_parity.eval.registry import register_method, register_challenge

register_method("my_method", category="algebraic",
    applicable_challenges=["sparse-parity"])

register_challenge("my-challenge", harness_fn=my_fn,
    description="What it tests")
```

New methods get the next action index (16, 17, ...). See docs/research/adding-an-eval-challenge.md for the full walkthrough.
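The index assignment is easy to picture. This toy stands in for the real registry (which lives in `registry.py`) just to show where index 16 comes from:

```python
# Toy illustration, not the repo's registry implementation.
methods = [f"builtin_{i}" for i in range(16)]  # the 16 shipped methods

def register_method_toy(name):
    methods.append(name)
    return len(methods) - 1  # the new method's action index

idx = register_method_toy("my_method")
print(idx)  # 16
```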

Compute backends

The environment runs experiments locally by default. Two other backends exist as prototypes.

| Backend | How to use | Status |
|---|---|---|
| Local | `gym.make(...)` | Working. All 16 methods run. |
| Modal | `gym.make(..., backend="modal")` | Prototype. Returns an error if Modal is not configured. For GPU methods at larger scale. |
| Remote | `gym.make(..., backend="http://...")` | Prototype. HTTP POST to a hosted endpoint. For leaderboards. |

Platform adapters

Four adapters exist for running the environment through different systems.

Anthropic tool-use. LLM agents interact via tool calls (run_experiment, check_status, read_experiment_log) instead of discrete indices. Includes a system prompt and grading.

```python
from sparse_parity.eval.adapters.anthropic_tools import AnthropicToolAdapter

adapter = AnthropicToolAdapter(challenge="sparse-parity", metric="dmc", budget=20)
tools = adapter.get_tools()
result = adapter.handle_tool_call("run_experiment", {"method": "gf2"})
grade = adapter.grade()
```

PrimeIntellect verifiers. Wraps the environment as a vf.SingleTurnEnv for their Environments Hub. Standalone test works without verifiers installed.

```sh
PYTHONPATH=src python3 src/sparse_parity/eval/adapters/primeintellect.py
```

HuggingFace Spaces. Gradio app with leaderboard table, interactive method selection, grading breakdown, and answer key viewer.

```sh
pip install gradio
PYTHONPATH=src python3 src/sparse_parity/eval/adapters/huggingface.py
```

UK AISI Inspect. Prototype task definition for the Inspect evaluation framework.

Running the full evaluation

```sh
PYTHONPATH=src python3 src/sparse_parity/eval/run_eval.py
```

Runs 3 agents x 5 episodes in ~20 seconds. Outputs to results/eval/baselines.json and results/eval/multi_challenge.json.

Answer key

src/sparse_parity/eval/answer_key.json has 36 experiments, 12 negative results, and the grading rubric. The experiments come from DISCOVERIES.md. The negative results explain why specific methods fail (Hebbian can't detect k-th order interactions, Forward-Forward's greedy layer-wise learning can't coordinate multi-layer feature extraction, etc.).

Files

| File | What it is |
|---|---|
| `eval/env.py` | SparseParity-v0 and MultiChallenge-v0 |
| `eval/baselines.py` | Random, Greedy, Oracle agents |
| `eval/grader.py` | 12 categories, 72 points |
| `eval/answer_key.json` | 36 experiments, 12 negative results |
| `eval/registry.py` | Add challenges/methods at runtime |
| `eval/default_registry.py` | Ships 3 challenges, 16 methods |
| `eval/backends.py` | Local + 11 fallback method implementations, Modal and Remote prototypes |
| `eval/run_eval.py` | Evaluation script |
| `eval/README.md` | Full interface spec (observation space, action space, reward function) |
| `eval/adapters/anthropic_tools.py` | Claude tool-use adapter |
| `eval/adapters/primeintellect.py` | PrimeIntellect verifiers adapter |
| `eval/adapters/huggingface.py` | Gradio leaderboard app |
| `eval/adapters/inspect_task.py` | UK AISI Inspect prototype |
| `AGENT_EVAL.md` | Guide for coding agents |