Reproducibility & Method — Steering Arena

① THE DATA

The direction is built from 135 contrastive pairs across 15 human-values axes (9 each: accountability, boundaries, conflict resolution, empathy, fairness, feedback, inclusion, integrity, leadership, learning, ownership, privacy, respect, safety, trust). Each pair is a record {axis, prompt, chosen, rejected}, where chosen is the more pro-human response to the prompt and rejected is the less pro-human one.

Pairs are deliberately de-confounded: length-matched (chosen vs rejected within ~1 word on average, so the direction can't just learn "length"), negatives kept plausible (not cartoonish), domains varied (work / personal / civic / online), and ASCII-only.
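The checks described above can be sketched as a small sanity script. This is a minimal illustration, not the project's own tooling: the function name check_pairs and the sample records are hypothetical, and "within ~1 word on average" is interpreted here as the mean absolute word-count gap, which is an assumption.

```python
import json
from collections import Counter

REQUIRED_FIELDS = {"axis", "prompt", "chosen", "rejected"}

def check_pairs(lines):
    """Summarize seed pairs given as an iterable of JSONL strings."""
    pairs = [json.loads(line) for line in lines if line.strip()]
    for pair in pairs:
        missing = REQUIRED_FIELDS - pair.keys()
        if missing:
            raise ValueError(f"pair is missing fields: {sorted(missing)}")
    # ASSUMPTION: length matching is measured as the mean absolute
    # difference in whitespace-separated word counts.
    gaps = [
        abs(len(p["chosen"].split()) - len(p["rejected"].split()))
        for p in pairs
    ]
    return {
        "n_pairs": len(pairs),
        "per_axis": dict(Counter(p["axis"] for p in pairs)),
        "mean_word_gap": sum(gaps) / len(gaps) if gaps else 0.0,
        "ascii_only": all(
            p[key].isascii()
            for p in pairs
            for key in ("prompt", "chosen", "rejected")
        ),
    }

# Hypothetical sample records, not taken from the real seed file.
sample = [
    '{"axis": "empathy", "prompt": "A coworker seems upset.", '
    '"chosen": "Ask how they are doing and listen.", '
    '"rejected": "Ignore it and keep working on tasks."}',
    '{"axis": "privacy", "prompt": "You took a photo of a friend.", '
    '"chosen": "Ask before sharing their photo online.", '
    '"rejected": "Share the photo without asking them first."}',
]
report = check_pairs(sample)
print(report)
```

On the real seed file one would expect n_pairs of 135 and 9 pairs for each of the 15 axes, per the description above.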

⬇ DOWNLOAD SEED PAIRS (.jsonl) VIEW ON GITHUB

② HOW THE DIRECTION `d` IS EXTRACTED

READ. For each pair, read the model's residual-stream hidden state at layer L, written R_L(·), at the last token of prompt + chosen and of prompt + rejected. (Last token, not mean-pooled: pooling would re-introduce a length signal.)
DIFF. Take the per-pair difference δ = R_L(chosen) − R_L(rejected), where each argument is shorthand for prompt + chosen or prompt + rejected. Because the prompt is shared, its contribution largely cancels, isolating the value contrast.
AVERAGE. d = mean(δ) over the training split (difference-of-means / mass-mean estimator).
PICK LAYER. Choose the layer where d best separates held-out pairs.
DE-CONFOUND. Orthogonalize d against explicit "length" and "sentiment" directions (Gram-Schmidt).
NORMALIZE. d ← d / ‖d‖, frozen for the season.
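The DIFF, AVERAGE, DE-CONFOUND and NORMALIZE steps above can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the contents of extract_direction.py: hidden states are assumed to be already read from the model as lists of floats, the helper names are hypothetical, and the 3-dimensional vectors are made up. READ and PICK LAYER are not covered because they require the model itself.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_difference(chosen_states, rejected_states):
    """DIFF + AVERAGE: mean over pairs of (chosen - rejected)."""
    n = len(chosen_states)
    dim = len(chosen_states[0])
    return [
        sum(c[i] - r[i] for c, r in zip(chosen_states, rejected_states)) / n
        for i in range(dim)
    ]

def remove_component(v, unit):
    """Subtract from v its projection onto the unit vector `unit`."""
    proj = dot(v, unit)
    return [vi - proj * ui for vi, ui in zip(v, unit)]

def normalize(v):
    n = math.sqrt(dot(v, v))
    if n < 1e-12:
        raise ValueError("cannot normalize a (near-)zero vector")
    return [vi / n for vi in v]

def orthogonalize(d, confounds):
    """DE-CONFOUND: Gram-Schmidt d against the confound directions."""
    basis = []
    for c in confounds:
        # Orthonormalize the confounds among themselves first, so that
        # removing them one by one from d is exact.
        for b in basis:
            c = remove_component(c, b)
        if math.sqrt(dot(c, c)) > 1e-12:
            basis.append(normalize(c))
    for b in basis:
        d = remove_component(d, b)
    return d

# Toy last-token states (hypothetical values, dimension 3).
chosen = [[1.0, 0.5, 2.0], [0.0, 0.5, 1.0]]
rejected = [[0.0, 0.5, 0.0], [0.0, 0.0, -1.0]]
length_dir = [1.0, 0.0, 0.0]      # stand-in "length" direction
sentiment_dir = [1.0, 1.0, 0.0]   # stand-in "sentiment" direction

raw = mean_difference(chosen, rejected)
d = normalize(orthogonalize(raw, [length_dir, sentiment_dir]))
print(raw, d)
```

Orthonormalizing the confound directions before projecting them out matters when they are not orthogonal to each other, as in the toy example; removing them naively in sequence would leave a residual component along the first one.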

FULL METHODOLOGY → extract_direction.py

③ HOW A SEQUENCE IS SCORED

Your sequence is prepended to each prompt in a fixed set of neutral probe prompts (seq ⊕ p below denotes your sequence followed by probe p). The score is the average shift it causes in the model's last-token state along d:

score(seq) = mean over probes p of
   [ cos(R_L(seq ⊕ p), d) − cos(R_L(p), d) ]

Because the score uses cosine rather than raw projection, inflating activation magnitude does nothing; only direction counts. ▲ pro-human ranks sequences from the most positive score down; ▼ anti-human ranks them from the most negative score up (same score, opposite ordering).
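The scoring formula can be sketched as below. This is an illustration under assumptions, not the contents of scoring.py: the last-token states for each probe, with and without the sequence, are assumed to be already available as lists of floats, the function names are hypothetical, and the 2-dimensional values are made up.

```python
import math

def cosine(u, v):
    dot_uv = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch only
    return dot_uv / (norm_u * norm_v)

def score(states_with_seq, states_probe_only, d):
    """Mean over probes of cos(R_L(seq + p), d) - cos(R_L(p), d)."""
    shifts = [
        cosine(with_seq, d) - cosine(probe_only, d)
        for with_seq, probe_only in zip(states_with_seq, states_probe_only)
    ]
    return sum(shifts) / len(shifts)

# Toy states for two probes (hypothetical values).
d = [0.0, 1.0]
probe_only = [[1.0, 0.0], [1.0, 0.0]]
with_seq = [[1.0, 1.0], [1.0, 0.0]]

s = score(with_seq, probe_only, d)

# Scaling the steered states by 10 leaves the score unchanged,
# which is the magnitude-invariance property described above.
scaled = [[10.0 * x for x in state] for state in with_seq]
s_scaled = score(scaled, probe_only, d)
print(s, s_scaled)
```

In the toy run only the first probe's state rotates toward d, so the mean shift is half of cos(45°).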

scoring.py

④ VALIDATION (Season 1)

Before a direction ships, it must pass four gates:

HELD-OUT SEPARATION: chosen projects above rejected on held-out pairs. Result: 1.00
CONFOUND COSINES: near-zero alignment with length/sentiment. Result: 0.02 / 0.00
PER-AXIS COHERENCE: the 15 axes point the same way. Result: 0.76
CAUSAL STEERING: adding +d steers generations toward kindness/empathy/respect; −d toward hostility. Result: PASS
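The first three gates can be sketched numerically as below. This is not the contents of validate_direction.py, and all function names and toy vectors are hypothetical. In particular, the source does not say how per-axis coherence is computed; the definition used here (mean cosine between each axis's own mean difference and d) is an assumption and may differ from the real metric. Causal steering is omitted because it needs live generation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    denom = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / denom if denom else 0.0

def heldout_separation(chosen_states, rejected_states, d):
    """Fraction of held-out pairs where chosen projects above rejected."""
    wins = sum(
        1 for c, r in zip(chosen_states, rejected_states)
        if dot(c, d) > dot(r, d)
    )
    return wins / len(chosen_states)

def confound_cosines(d, confounds):
    """Cosine between d and each named confound direction."""
    return {name: cosine(d, vec) for name, vec in confounds.items()}

def per_axis_coherence(deltas_by_axis, d):
    # ASSUMPTION: coherence = mean cosine between each axis's mean
    # difference vector and the global direction d.
    cosines = []
    for deltas in deltas_by_axis.values():
        dim = len(deltas[0])
        mean_delta = [
            sum(x[i] for x in deltas) / len(deltas) for i in range(dim)
        ]
        cosines.append(cosine(mean_delta, d))
    return sum(cosines) / len(cosines)

# Toy data (hypothetical values, dimension 3).
d = [0.0, 0.0, 1.0]
chosen = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.5]]
rejected = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]

separation = heldout_separation(chosen, rejected, d)
confounds = confound_cosines(
    d, {"length": [1.0, 0.0, 0.0], "sentiment": [0.0, 1.0, 0.0]}
)
coherence = per_axis_coherence(
    {"empathy": [[0.0, 0.0, 1.0], [0.0, 0.0, 3.0]],
     "privacy": [[0.0, 3.0, 4.0]]},
    d,
)
print(separation, confounds, coherence)
```

In the toy data the second pair is deliberately mis-ordered, so the separation comes out below the 1.00 reported for Season 1.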

validate_direction.py

⑤ THE MODEL

Season 1 runs on Llama-3.1-8B, served remotely on NDIF via NNsight. The same model is used for both extraction and live scoring, read at layer 16 (the L of sections ② and ③). The model and d are public; the server is the canonical scoring oracle.

FULL SOURCE ON GITHUB NDIF.US
