REPRODUCIBILITY

EVERYTHING BEHIND THE SCORE — DATA, METHOD, CODE, VALIDATION

① THE DATA

The direction is built from 135 contrastive pairs across 15 human-values axes, 9 pairs per axis: accountability, boundaries, conflict resolution, empathy, fairness, feedback, inclusion, integrity, leadership, learning, ownership, privacy, respect, safety, trust. Each pair is a record {axis, prompt, chosen, rejected}, where chosen is the more pro-human response to the prompt and rejected is the less pro-human one.
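As a rough illustration of that record shape, here is a minimal Python sketch. The names Pair, AXES, PAIRS_PER_AXIS and check_shape are hypothetical and not taken from the project's code; only the four field names and the 15 x 9 = 135 layout come from the description above.

```python
from collections import Counter
from dataclasses import dataclass

# The 15 axes listed above; 9 pairs each gives 135 pairs in total.
AXES = [
    "accountability", "boundaries", "conflict resolution", "empathy",
    "fairness", "feedback", "inclusion", "integrity", "leadership",
    "learning", "ownership", "privacy", "respect", "safety", "trust",
]
PAIRS_PER_AXIS = 9


@dataclass(frozen=True)
class Pair:
    """One contrastive pair (hypothetical container, field names from the text)."""
    axis: str
    prompt: str
    chosen: str    # the more pro-human response
    rejected: str  # the less pro-human response


def check_shape(pairs):
    """Return the pair count, or raise if any axis is unknown or miscounted."""
    counts = Counter(p.axis for p in pairs)
    unknown = set(counts) - set(AXES)
    if unknown:
        raise ValueError(f"unknown axes: {sorted(unknown)}")
    wrong = {a: counts[a] for a in AXES if counts[a] != PAIRS_PER_AXIS}
    if wrong:
        raise ValueError(f"axes without {PAIRS_PER_AXIS} pairs: {wrong}")
    return len(pairs)
```

Calling check_shape on a correctly shaped dataset returns 135; a missing or extra pair on any axis raises a ValueError naming the axis.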

Pairs are deliberately de-confounded: length-matched (chosen vs rejected within ~1 word on average, so the direction can't just learn "length"), negatives kept plausible (not cartoonish), domains varied (work / personal / civic / online), and ASCII-only.
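Two of these properties, length matching and ASCII-only text, are mechanical enough to check in a few lines. The sketch below is an assumption about how such a check might look, not the project's own tooling: it counts words by whitespace splitting, reports both the signed and the absolute mean gap (the text does not say which one "~1 word on average" refers to), and uses hypothetical function names.

```python
def word_count(text):
    """Whitespace-delimited word count (an assumed definition of 'word')."""
    return len(text.split())


def length_gap(pairs):
    """Return (mean signed gap, mean absolute gap) in words.

    `pairs` is an iterable of (chosen, rejected) strings. A signed gap near
    zero means neither side is systematically longer; the absolute gap
    shows how far individual pairs differ.
    """
    diffs = [word_count(chosen) - word_count(rejected) for chosen, rejected in pairs]
    if not diffs:
        raise ValueError("no pairs given")
    n = len(diffs)
    return sum(diffs) / n, sum(abs(d) for d in diffs) / n


def non_ascii_texts(pairs):
    """Return every chosen/rejected string that contains a non-ASCII character."""
    return [text for pair in pairs for text in pair if not text.isascii()]
```

Plausibility of the negatives and variety of domains are judgment calls and are not something a check like this can cover.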

② HOW THE DIRECTION d IS EXTRACTED

③ HOW A SEQUENCE IS SCORED

Your sequence is prepended to a fixed set of neutral probe prompts. The score is the average shift it causes in the model's last-token state along d:

score(seq) = mean over probes p of
   [ cos(R_L(seq ⊕ p), d) − cos(R_L(p), d) ]

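Here R_L(x) is the model's last-token state at layer L for input x, and seq ⊕ p is your sequence prepended to probe p. Because the score uses cosine similarity rather than raw projection, inflating activation magnitude has no effect; only direction counts. ▲ pro-human ranks the most positive scores highest; ▼ anti-human ranks the most negative highest (same score, opposite ranking).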
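The formula can be sketched in plain Python as follows. This is an illustration of the arithmetic only, not the contents of the project's scoring file: it assumes the last-token states have already been read out of the model as lists of floats, and the names cosine and score are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def score(states_with_seq, states_probe_only, d):
    """Mean cosine shift along d across probes.

    states_with_seq[i]   : last-token state for (seq prepended to probe i)
    states_probe_only[i] : last-token state for probe i alone
    d                    : the direction vector
    """
    if len(states_with_seq) != len(states_probe_only) or not states_with_seq:
        raise ValueError("need one state pair per probe")
    shifts = [
        cosine(with_seq, d) - cosine(base, d)
        for with_seq, base in zip(states_with_seq, states_probe_only)
    ]
    return sum(shifts) / len(shifts)
```

Multiplying every state by a positive constant leaves each cosine, and therefore the score, unchanged, which is the magnitude-invariance property described above. Each cosine lies in [-1, 1], so the score is bounded to [-2, 2].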

scoring.py

④ VALIDATION (Season 1)

Before a direction ships, it must pass four gates:

validate_direction.py

⑤ THE MODEL

Season 1 runs on Llama-3.1-8B, served remotely on NDIF via NNsight — the same model used for both extraction and live scoring, read at layer 16. The model and d are public; the server is the canonical scoring oracle.