Steering Arena — Pro-Human Activation-Steering Competition

HOW TO PLAY

THE GOAL. A fixed direction d in the model's activation space points "toward pro-human." Submit a short token sequence; the server measures how far it shifts the model's last-token internal state along d, averaged over a fixed set of neutral probe prompts.
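A minimal sketch of the kind of score described above. The rules do not give the exact formula, so this assumes the shift is the change in cosine similarity with d between a probe prompt's last-token state with and without the sequence prepended, averaged over prompts. The vectors are toy values, and `cosine` and `steering_score` are hypothetical names, not the server's API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def steering_score(base_states, steered_states, d):
    """Mean change in cosine similarity with d across probe prompts.

    base_states[i]    : last-token state for probe prompt i alone
    steered_states[i] : last-token state with the sequence prepended
    Assumption: the real server may aggregate differently.
    """
    shifts = [cosine(s, d) - cosine(b, d)
              for b, s in zip(base_states, steered_states)]
    return sum(shifts) / len(shifts)

# Toy 3-dimensional example with two probe prompts.
d = [1.0, 0.0, 0.0]
base = [[0.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
steered = [[1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]  # only the first prompt moves
score = steering_score(base, steered, d)
```

In this toy case the sequence moves one prompt's state partway toward d and leaves the other untouched, so the averaged score is positive but modest.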

OPEN BY DESIGN. The model and d are public, which makes this an optimization game. The model runs on NDIF (via NNsight), so the server is the scoring oracle: it scores every entry canonically, and that value is final. See the reproducibility page for the data and method.

TWO BOARDS. Every entry gets one signed score. The ▲ pro-human board ranks the most positive; the ▼ anti-human board the most negative (the geometric opposite). A strong push lands you near the top of one — and the bottom of the other.
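As an illustration of how one signed score can feed both boards, here is a small sketch. The `(player, score)` tuple layout and the `build_boards` name are assumptions made for the example, not the site's data model.

```python
def build_boards(entries):
    """Return (pro_board, anti_board) from (player, signed_score) tuples.

    The pro board sorts by score descending, the anti board ascending,
    so with no ties each board is the other one reversed.
    """
    pro = sorted(entries, key=lambda e: e[1], reverse=True)
    anti = sorted(entries, key=lambda e: e[1])
    return pro, anti

# Hypothetical entries with made-up scores.
entries = [("ada", 0.31), ("bob", -0.12), ("cyd", 0.05)]
pro_board, anti_board = build_boards(entries)
```

Here "ada" tops the pro-human board and sits last on the anti-human board, while "bob" is the reverse.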

TOKEN BUDGET. Sequences are capped at the season's budget. Cosine scoring means inflating activation magnitude does nothing — only direction counts.
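The scale invariance of cosine similarity can be checked directly. This sketch uses toy 2-dimensional vectors rather than real model activations.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

d = [0.6, 0.8]
state = [1.0, 2.0]
inflated = [10.0 * x for x in state]  # same direction, 10x the magnitude
rotated = [2.0, 1.0]                  # same magnitude, different direction

# Scaling the state leaves the cosine unchanged; rotating it does not.
same = math.isclose(cosine(state, d), cosine(inflated, d))
changed = not math.isclose(cosine(state, d), cosine(rotated, d))
```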

SPEC = SPECIFICITY. Alongside the score, each entry shows how much of its activation movement is specifically along the direction vs. just movement anywhere. High SPEC = targeted steering; low SPEC = the score likely rides a token artifact. (Informational this season — ranking is still by score.)
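The rules do not state how SPEC is computed. One plausible reading, sketched here purely as an assumption, is the fraction of the activation change's length that lies along d: the absolute projection of the change onto the unit direction, divided by the norm of the change. The `specificity` function is a hypothetical name.

```python
import math

def specificity(base, steered, d):
    """Share of the activation change that lies along d, in [0, 1].

    Assumed definition: |delta . d_hat| / ||delta||, where
    delta = steered - base and d_hat is d scaled to unit length.
    """
    delta = [s - b for b, s in zip(base, steered)]
    d_norm = math.sqrt(sum(x * x for x in d))
    delta_norm = math.sqrt(sum(x * x for x in delta))
    if delta_norm == 0.0:
        return 0.0  # no movement at all, so nothing is specific
    along = abs(sum(a * b for a, b in zip(delta, d))) / d_norm
    return along / delta_norm

d = [1.0, 0.0, 0.0]
base = [0.0, 0.0, 0.0]
targeted = specificity(base, [2.0, 0.0, 0.0], d)  # all movement along d
diffuse = specificity(base, [1.0, 2.0, 2.0], d)   # mostly off-direction
```

Under this reading, a change purely along d gives 1.0, while a change that mostly moves in other dimensions gives a value near 0.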

ANYTHING GOES (ON-THEME). Opaque, repetitive, or surprising sequences are allowed — that's the research question.

ONE PER SEQUENCE. Each unique sequence is recorded once per season; resubmitting an existing sequence shows its current standing instead of creating a new entry.
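A rough sketch of the once-per-season rule, assuming uniqueness is decided by exact string match (the rules do not say whether matching happens on strings or on token ids). `SeasonBoard` and `score_fn` are hypothetical names.

```python
class SeasonBoard:
    """Keeps one score per unique sequence for a single season."""

    def __init__(self):
        self._scores = {}  # sequence string -> stored score

    def submit(self, sequence, score_fn):
        """Score a new sequence once; return (score, is_new).

        A repeated sequence returns its stored score without
        calling score_fn again.
        """
        if sequence in self._scores:
            return self._scores[sequence], False
        score = score_fn(sequence)
        self._scores[sequence] = score
        return score, True

calls = []

def fake_score(seq):
    # Stand-in scorer that records how often it is invoked.
    calls.append(seq)
    return 0.25

board = SeasonBoard()
first = board.submit("kind kind kind", fake_score)
second = board.submit("kind kind kind", fake_score)
```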

RESEARCH, NOT PROFIT. Non-commercial. Scores are only comparable within a season.

NEW CHALLENGER

Submit a short string that, prepended to neutral prompts, shifts the model's last-token state along the direction: ▲ toward pro-human or ▼ against it. Each submission is scored once and ranked on both boards. Weird, opaque sequences are fair game.

HIGH SCORES

Scores are shifts along a learned activation direction, not measurements of moral content or behavioral safety. See Limitations.

#	PLAYER	SEQUENCE	SCORE	SPEC

NO SCORES YET
BE THE FIRST

CONTACT THE OPERATOR

Steering Arena is a non-commercial research project by Soham Padia — ML Engineer & Researcher, MS in Artificial Intelligence at Northeastern University (Boston, MA). Questions, bug reports, weird high-scoring sequences, or research collaboration — reach out.