Limitations & Interpretation

Activation-direction methods are inherently underdetermined. A Steering Arena score is an operational proxy (a measurement of movement along one learned direction in one model), not a definitive readout of "human values." These limitations are part of the science, and the board should be read with them in mind.

① NOT A GROUND-TRUTH MORALITY AXIS

The "pro-human" direction is learned from a finite set of 135 contrastive examples across 15 chosen value axes. It should be read as a representation induced by this dataset and this model — not as an objective or universal human-values detector.
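As a rough illustration of how such a direction can be induced from contrastive examples, here is a minimal sketch assuming the mean-difference estimator (one of the two estimators named in the confound audit). The function name and the toy 3-dimensional "activations" are made up for illustration; real activations come from the model and have thousands of dimensions.

```python
import math

def mean_diff_direction(chosen, rejected):
    """Unit vector from the mean 'rejected' activation to the mean 'chosen' one."""
    dim = len(chosen[0])
    mean_c = [sum(v[i] for v in chosen) / len(chosen) for i in range(dim)]
    mean_r = [sum(v[i] for v in rejected) / len(rejected) for i in range(dim)]
    diff = [c - r for c, r in zip(mean_c, mean_r)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("chosen and rejected means coincide; no direction")
    return [x / norm for x in diff]

# Toy activations (hypothetical): the two classes differ only in dimension 0.
chosen = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
rejected = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]
d = mean_diff_direction(chosen, rejected)
```

The result depends entirely on which examples go in: whatever systematically separates the two sets becomes the direction, which is why it should be read as dataset-induced.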

① ½ CONFOUND AUDIT — WHAT WE RULED OUT (AND DIDN'T)

A direction built from contrastive pairs separates whatever systematically differs between "chosen" and "rejected" — which need not be "pro-human." We red-teamed this directly (scripts/confound_audit.py), testing the most likely confounds against independent neutral data on the live model:

Sentiment / valence: cos(d, valence) = 0.000 on independent positive-vs-negative text — ruled out.
Length / verbosity: cos(d, length) = 0.002 — ruled out.
"Take action vs. stay passive": the strongest suspected confound. cos(d, approach) = 0.12, and a probe trained only on action-vs-inaction recovers d at just 0.12 — so d is not merely an "assertiveness/approach" axis.
Value-flip controls (the sharpest test): on 6 pairs where the kind and cruel option are both active and assertive — so only human-impact differs — d ranks the kind option higher 6 / 6 times, at ~74% of the strength of its training gap. It tracks kindness, not just action.
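The two kinds of check above can be sketched as follows. This is not the code in scripts/confound_audit.py; it is a hedged illustration, with hypothetical helper names and toy vectors, of (a) the cosine between d and a suspected confound direction and (b) a value-flip control that counts how often the kind option projects higher on d than the cruel one.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def project(h, d):
    """Scalar projection of activation h onto direction d."""
    return sum(a * b for a, b in zip(h, d)) / math.sqrt(sum(b * b for b in d))

def value_flip_wins(pairs, d):
    """Count (kind, cruel) pairs where the kind option projects higher on d."""
    return sum(1 for kind, cruel in pairs if project(kind, d) > project(cruel, d))

# Toy data (hypothetical): d and the confound direction are orthogonal here.
d = [1.0, 0.0, 0.0]
valence = [0.0, 1.0, 0.0]
pairs = [
    ([2.0, 5.0, 1.0], [0.5, 5.0, 1.0]),    # same valence, different d-component
    ([1.0, -3.0, 0.0], [-1.0, -3.0, 0.0]),
]
confound_cos = cosine(d, valence)
wins = value_flip_wins(pairs, d)
```

A cosine near zero says d does not align with the confound direction; the win count says d still separates the options when the confound is held fixed.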

What remains open: the direction is estimator-sensitive (the mean-diff and LDA/whitened estimators agree only at ~0.40); it is induced from a workplace-skewed dataset; and, as §③ and §④ below explain, the scored game quantity is still gameable by token artifacts, and representational shift is not the same as behavior. Net: the evidence supports "d tracks a kind-vs-cruel / pro-human contrast," but not "d is the unique, universal human-values axis."
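To see why two reasonable estimators can disagree, the sketch below compares a plain mean difference with a whitened one on toy data. As a loud simplification, it whitens with a diagonal (per-dimension) pooled variance rather than the full covariance an LDA estimator would use; all names and numbers are illustrative assumptions, not the project's code or data.

```python
import math

def class_stats(vectors):
    """Per-dimension mean and population variance of a list of vectors."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    var = [sum((v[i] - mean[i]) ** 2 for v in vectors) / n for i in range(dim)]
    return mean, var

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def estimators(chosen, rejected):
    """Return (mean-diff direction, diagonally whitened direction), unnormalized."""
    mean_c, var_c = class_stats(chosen)
    mean_r, var_r = class_stats(rejected)
    mean_diff = [c - r for c, r in zip(mean_c, mean_r)]
    pooled = [(a + b) / 2 for a, b in zip(var_c, var_r)]
    # Whitening down-weights dimensions with high within-class variance.
    whitened = [m / s for m, s in zip(mean_diff, pooled)]
    return mean_diff, whitened

# Toy data (hypothetical): dimension 1 is much noisier than dimension 0.
chosen = [[0.5, 4.0], [1.5, 0.0]]
rejected = [[-0.5, 2.0], [0.5, -2.0]]
mean_diff, whitened = estimators(chosen, rejected)
agreement = cosine(mean_diff, whitened)
```

Here the two directions agree at a cosine of roughly 0.55: the mean difference leans on the noisy dimension, while the whitened estimate discounts it. Neither is "the" direction, which is the sense in which d is estimator-sensitive.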

② SCORES ARE MODEL / LAYER / DIRECTION-SPECIFIC

Leaderboard scores are only comparable within the same season: same model, same layer, same direction version, same probe set, same scoring procedure. A string that scores highly here (Llama-3.1-8B, layer 16) may not generalize to another model, layer, or direction.
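One way to make this rule mechanical is to attach a season key to every score and refuse comparisons across keys. The sketch below is an assumption about how that could look, not the arena's actual data model: the field names and version strings are hypothetical, while the model name and layer come from the text above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Season:
    """Everything that must match before two scores can be compared."""
    model: str
    layer: int
    direction_version: str   # hypothetical label
    probe_set: str           # hypothetical label
    scoring: str             # hypothetical label

def comparable(a: Season, b: Season) -> bool:
    """Scores are comparable only if every season field is identical."""
    return a == b

current = Season("Llama-3.1-8B", 16, "v1", "probes-v1", "projection-v1")
other_layer = Season("Llama-3.1-8B", 24, "v1", "probes-v1", "projection-v1")
```

With this key, comparable(current, other_layer) is false even though only the layer differs.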

③ HIGH SCORES MAY EXPLOIT ARTIFACTS

Short strings can score highly because of tokenization quirks, sentiment, formatting, memorized associations, or distributional artifacts rather than meaningful value-related representation. (The current board is itself evidence: opaque token soup such as .Mock, -chan, and emoji outscores plain "be kind and honest.")

④ DIRECTIONAL ACTIVATION ≠ BEHAVIORAL ALIGNMENT

A prompt can move hidden states along the learned direction without necessarily causing safer, more helpful, or more aligned behavior during generation. The arena measures representational shift, not full downstream behavior. (We run a causal steering check as a sanity test, but it is limited.)

A measured asymmetry. In our layer sweep, steering toward pro-human (+d at layer 24) reliably and coherently shifts generations toward kindness/helpfulness — that direction is behaviorally validated. Steering against it (−d) is weak and often resisted: the model rarely produces genuinely callous text and sometimes becomes more considerate. So the ▲ pro-human board is causally grounded, while the ▼ anti-human board is a representational ranking (projection onto −d) whose behavioral cruelty is not demonstrated — interpret it as "least pro-human," not "verified anti-human." That the model is easy to steer toward kindness but hard to steer toward cruelty is itself a finding.
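The representational side of this asymmetry is easy to state in code. The sketch below assumes steering means adding a multiple of the unit direction to a hidden state and that the anti-human ranking is the projection onto −d, as described above; the function names, the strength alpha, and the toy vectors are illustrative assumptions.

```python
import math

def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def project(h, d_unit):
    """Projection of hidden state h onto a unit direction."""
    return sum(a * b for a, b in zip(h, d_unit))

def steer(h, d_unit, alpha):
    """Add alpha * d to the hidden state (alpha < 0 steers against d)."""
    return [a + alpha * b for a, b in zip(h, d_unit)]

d = unit([3.0, 4.0])          # toy direction, equals [0.6, 0.8]
h = [1.0, 2.0]                # toy hidden state
base = project(h, d)
toward = project(steer(h, d, 2.0), d)     # steering along +d
against = project(steer(h, d, -2.0), d)   # steering along -d
anti_score = project(h, [-x for x in d])  # projection onto -d
```

In representation space the two directions are perfectly symmetric: the projection moves by exactly alpha either way, and anti_score is just the negated pro-human score. The asymmetry reported above appears only in generated text, which this arithmetic cannot show.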

⑤ SINGLE-LAYER SCORING IS INCOMPLETE

The score focuses on one layer and one vector. Value-related representations may be distributed across many layers, attention heads, MLPs, or nonlinear manifolds that a single linear direction cannot capture.

⑥ THE BENCHMARK IS INTENTIONALLY ADVERSARIAL

The arena rewards discovering strings that maximize an internal score, so the board may reveal both meaningful triggers and exploit-like inputs. That is part of the experiment — but it means top entries should not be overinterpreted.

STILL — WHY IT'S USEFUL

Despite these limitations, Steering Arena is a useful exploratory tool: it tests whether human-discoverable strings can systematically manipulate a learned internal direction, and it provides a public interface for studying the gap between semantic meaning, token-level artifacts, and model representations.

REPRODUCIBILITY & METHOD → SOURCE