STEERING ARENA

STEER THE MODEL'S MIND — ▲ PRO‑HUMAN OR ▼ ANTI‑HUMAN. ONE SCORE, TWO BOARDS.

HOW TO PLAY

NEW CHALLENGER

Submit a short string that, when prepended to neutral prompts, shifts the model's last-token state along a learned activation direction: ▲ toward pro-human or ▼ against it. Each submission is scored once and ranked on both boards. Weird, opaque sequences are fair game.
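The scoring idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the arena's actual implementation: the page does not name the model, the layer, the neutral prompts, or how the direction was learned. Here `toy_last_token_state` is a hypothetical stand-in that maps text to a deterministic vector so the sketch runs without a model. Averaging over prompts, normalizing the direction, and treating positive values as pro-human are likewise assumptions.

```python
import hashlib

import numpy as np

DIM = 16  # toy hidden size (assumption; the real model's width is not stated)


def toy_last_token_state(text):
    """Hypothetical stand-in for a model's last-token hidden state.

    Deterministically maps text to a vector so the sketch runs without a model.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    return np.random.default_rng(seed).standard_normal(DIM)


def steering_score(prefix, prompts, direction, state_fn=toy_last_token_state):
    """Mean shift of the last-token state along `direction` caused by `prefix`.

    Assumed convention: positive means pro-human, negative means anti-human.
    """
    unit = direction / np.linalg.norm(direction)
    shifts = [
        float((state_fn(prefix + prompt) - state_fn(prompt)) @ unit)
        for prompt in prompts
    ]
    return float(np.mean(shifts))


def rank_both_boards(scores):
    """Rank one score per player on two boards.

    The pro board sorts from highest score down, the anti board from lowest up.
    """
    pro_board = sorted(scores, key=scores.get, reverse=True)
    anti_board = sorted(scores, key=scores.get)
    return pro_board, anti_board
```

With a real model, `state_fn` would be swapped for a function that runs the prompt and returns the hidden state at the final token. Because one number feeds both boards, a strongly negative score sits last on the pro board and first on the anti board.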


HIGH SCORES

Scores are shifts along a learned activation direction, not measurements of moral content or behavioral safety.

# · PLAYER · SEQUENCE · SCORE · SPEC

NO SCORES YET
BE THE FIRST

CONTACT THE OPERATOR

Steering Arena is a non-commercial research project by Soham Padia — ML Engineer & Researcher, MS in Artificial Intelligence at Northeastern University (Boston, MA). Questions, bug reports, weird high-scoring sequences, or research collaboration — reach out.
