MATS 9.0 project with Patrick Butlin investigating whether LLM preferences are driven by evaluative representations.
For AI agents: This README describes the codebase's key modules and entry points. Before writing new extraction, embedding, or probe training code, check the relevant module below — the functionality likely already exists.
Whether LLMs are moral patients may depend on whether they have evaluative representations playing the right functional roles — internal representations that encode valuation and causally influence choice. Across major theories of welfare (hedonism, desire satisfaction, etc.), such representations are central to moral patiency (Long et al., 2024).
Preferences are behavioral patterns — choosing A over B. Evaluative representations are the hypothesized internal mechanism: representations that encode "how good/bad is this?" and causally drive those choices. The question is whether preferences are driven by evaluative representations, or by something else (e.g., surface-level heuristics, training artifacts).
We look for evaluative representations as linear directions in activation space. The methodology follows from the definition:
- Probing — If they encode value, probes should predict preferences
- Steering — If they causally drive choices, steering should shift them
- Generalization — If they're genuine evaluative representations, they should generalize across contexts
We ground this in revealed preferences (pairwise choices where the model picks which task to complete), which have cleaner signal than stated ratings where models collapse to default values.
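As a minimal illustration of extracting utilities from such pairwise choices, here is a self-contained Bradley-Terry fit in NumPy (illustrative only; the project's fitting code in `src/fitting/` uses Thurstonian and TrueSkill models):

```python
import numpy as np

def fit_bradley_terry(n_items, wins, lr=0.1, n_steps=2000):
    """Fit Bradley-Terry utilities from pairwise wins by gradient ascent.

    wins: list of (winner, loser) index pairs.
    Model: P(i beats j) = sigmoid(u[i] - u[j]).
    """
    u = np.zeros(n_items)
    for _ in range(n_steps):
        grad = np.zeros(n_items)
        for w, l in wins:
            p_win = 1.0 / (1.0 + np.exp(u[l] - u[w]))  # P(w beats l)
            grad[w] += 1.0 - p_win  # d log-likelihood / d u[w]
            grad[l] -= 1.0 - p_win
        u += lr * grad
        u -= u.mean()  # utilities identifiable only up to an additive constant
    return u

# Item 0 beat item 1 three times; item 1 beat item 2 twice.
utilities = fit_bradley_terry(3, [(0, 1), (0, 1), (0, 1), (1, 2), (1, 2)])
```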
src/
├── probes/ # Activation extraction, probe training, and evaluation (see below)
├── steering/ # Activation steering experiments
├── measurement/ # Preference measurement (pairwise choices, stated ratings)
├── fitting/ # Utility function extraction (Thurstonian, TrueSkill)
├── models/ # LLM clients (OpenRouter)
├── task_data/ # Task datasets (WildChat, Alpaca, MATH, BailBench)
├── experiments/ # Core experiment scripts
└── analysis/ # Post-hoc analysis (probes, steering, correlations, etc.)
Config-driven pipeline that loads a HuggingFace model, runs it on tasks, and saves per-layer activations as .npz files. Supports batching, periodic checkpointing (save_every), and --resume to skip already-extracted tasks. Works with any model — use it for content encoder embeddings too, not just the target model.
Use --from-completions to extract activations from an existing completions JSON (skips re-generation).
```bash
python -m src.probes.extraction.run configs/extraction/<config>.yaml [--resume] [--from-completions path.json]
```

Config fields: `model`, `backend` (`huggingface`/`transformer_lens`), `layers_to_extract` (fractional, e.g. `0.5` = middle layer), `selectors` (e.g. `prompt_last`), `batch_size`, `n_tasks`, `task_origins`. See `configs/extraction/` for examples.
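A minimal illustrative config (all values are hypothetical; see `configs/extraction/` for real examples):

```yaml
# illustrative values only
model: gemma-3-27b
backend: huggingface
layers_to_extract: [0.5, 0.75]   # fractional depths; 0.5 = middle layer
selectors: [prompt_last]
batch_size: 8
n_tasks: 500
task_origins: [wildchat, alpaca]
```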
run_dir_probes.py is the main entry point for training probes from a measurement run. Loads Thurstonian scores and/or pairwise comparisons, loads activations, and trains Ridge and/or Bradley-Terry probes.
When eval_run_dir is set, uses heldout evaluation (preferred): trains on run_dir, sweeps alpha on half the eval set, evaluates on the other half. Otherwise falls back to CV alpha selection. Supports: score demeaning against confounds (topic, dataset), content-orthogonal projection, held-one-out (HOO) evaluation by group.
```bash
python -m src.probes.experiments.run_dir_probes --config configs/probes/<config>.yaml
```

Config fields: `run_dir`, `activations_path`, `layers`, `modes` (`ridge`/`bradley_terry`), `eval_run_dir` (optional, for heldout eval), `eval_split_seed`, `demean_confounds`, `content_embedding_path`, `hoo_grouping`. See `configs/probes/` for examples.
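A minimal illustrative probe config (all paths and values are hypothetical; see `configs/probes/` for real examples):

```yaml
# illustrative values only
run_dir: runs/pairwise_gemma3
activations_path: activations/gemma3_prompt_last.npz
layers: [31]
modes: [ridge, bradley_terry]
eval_run_dir: runs/pairwise_gemma3_heldout   # optional: enables heldout eval
eval_split_seed: 0
demean_confounds: [topic, dataset]
content_embedding_path: embeddings/minilm.npz
hoo_grouping: topic
```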
- `linear_probe.py` — Ridge regression with CV alpha sweep (`alpha_sweep` for raw sweep results, `train_and_evaluate` to sweep + fit a final model, `train_at_alpha` for a fixed alpha). Returns the probe, eval metrics, and sweep results.
- `activations.py` — Loads `.npz` activation files with optional task-ID filtering and layer selection (`load_activations`).
- `evaluate.py` — Probe evaluation: `evaluate_probe_on_data` (given activations + scores), `evaluate_probe_on_template` (cross-template transfer), `compute_probe_similarity` (cosine similarity between probe weight vectors).
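The CV alpha sweep follows a standard pattern; here is a self-contained NumPy sketch with closed-form ridge (hypothetical helper names, not the module's actual API):

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge: w = (X^T X + alpha I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def alpha_sweep(X, y, alphas, n_folds=5, seed=0):
    """Pick the alpha with the best mean held-out R^2 across folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    best_alpha, best_r2 = None, -np.inf
    for alpha in alphas:
        r2s = []
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            w = ridge_fit(X[train], y[train], alpha)
            ss_res = np.sum((y[test] - X[test] @ w) ** 2)
            ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
            r2s.append(1 - ss_res / ss_tot)
        if np.mean(r2s) > best_r2:
            best_alpha, best_r2 = alpha, np.mean(r2s)
    return best_alpha

# Synthetic data: linear signal plus small noise favors a small alpha.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)
alpha = alpha_sweep(X, y, [0.01, 1.0, 100.0])
```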
Fits Ridge regression from content embeddings to activations (or scores), subtracts predictions. The residuals contain only variance the content encoder cannot explain. Key functions: project_out_content (activations), project_out_content_from_scores (scalar scores). Use alpha_sweep from core/linear_probe.py to CV-select the content Ridge alpha.
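A few-line NumPy sketch of the projection (illustrative; the real functions also handle scalar score targets and CV alpha selection):

```python
import numpy as np

def project_out_content(acts, content_emb, alpha=1.0):
    """Remove the content-predictable part of activations.

    Fits ridge from content embeddings to activations and returns the
    residuals, i.e. variance the content encoder cannot explain.
    """
    X, Y = content_emb, acts
    d = X.shape[1]
    W = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)
    return Y - X @ W

# Activations that are mostly a linear function of content embeddings.
rng = np.random.default_rng(0)
content = rng.normal(size=(100, 8))
acts = content @ rng.normal(size=(8, 16)) + 0.1 * rng.normal(size=(100, 16))
residuals = project_out_content(acts, content, alpha=1.0)
```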
Embeds task prompts using a sentence transformer (default: all-MiniLM-L6-v2). Used as input to content-orthogonal projection. Functions: embed_tasks (from completions JSON), save_content_embeddings, load_content_embeddings (from .npz).
Removes group-level mean differences (topic, dataset) from preference scores via OLS on one-hot indicators. Distinct from content projection: this removes categorical confounds, content projection removes continuous content-predictable variance. Key function: demean_scores(scores, topics_json, confounds=["topic", "dataset"]).
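A self-contained sketch of the idea (hypothetical signature; the real `demean_scores` takes a topics JSON path):

```python
import numpy as np

def demean_scores(scores, *confounds):
    """Remove group-level means via OLS on one-hot indicators.

    Each confound is a list of group labels, one per score. OLS residuals
    are orthogonal to every indicator column, so every group's residual
    mean is exactly zero.
    """
    cols = []
    for labels in confounds:
        for g in sorted(set(labels)):
            cols.append([1.0 if lab == g else 0.0 for lab in labels])
    X = np.array(cols).T
    beta, *_ = np.linalg.lstsq(X, scores, rcond=None)  # min-norm OLS
    return scores - X @ beta

scores = np.array([3.0, 3.2, 1.0, 1.1, 2.0, 2.1])
topics = ["math", "math", "chat", "chat", "code", "code"]
datasets = ["wildchat", "alpaca"] * 3
residual = demean_scores(scores, topics, datasets)
```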
Composable primitives for steering experiments. No rigid runner — each experiment composes the pieces it needs.
SteeredHFClient wraps a HuggingFaceModel with a steering direction and coefficient, duck-typed as OpenAICompatibleClient. Handles the coef==0 bypass, pre-computes the scaled tensor on GPU.
For coefficient sweeps, use with_coefficient() to create new clients sharing the same loaded model:
```python
from src.steering.client import create_steered_client

client = create_steered_client("gemma-3-27b", probe_dir, "ridge_L31", coefficient=0)
for coef in [-3000, -1000, 0, 1000, 3000]:
    steered = client.with_coefficient(coef)
    response = steered.generate(messages)
```

For position-selective or differential steering, use `generate_with_hook()` with a custom hook:
```python
import torch

from src.models.base import position_selective_steering, differential_steering

direction = client.direction  # access the loaded probe direction
tensor = torch.tensor(direction * coef, dtype=torch.bfloat16, device="cuda")
hook = position_selective_steering(tensor, start=10, end=50)
response = client.generate_with_hook(messages, hook)
```

Available hook factories:

- `all_tokens_steering(tensor)` — steer all positions
- `autoregressive_steering(tensor)` — steer the last token only
- `position_selective_steering(tensor, start, end)` — steer tokens `[start, end)` during prompt processing only
- `differential_steering(tensor, pos_start, pos_end, neg_start, neg_end)` — `+direction` on one span, `-direction` on another
- `noop_steering()` — no-op for control conditions
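What a position-selective hook does to hidden states can be illustrated with a NumPy sketch (the real hooks operate on torch tensors inside the model's forward pass):

```python
import numpy as np

def position_selective_steering(vector, start, end):
    """Return a hook that adds `vector` to hidden states in [start, end) only."""
    def hook(hidden):  # hidden: (seq_len, d_model)
        out = hidden.copy()
        out[start:end] += vector
        return out
    return hook

hidden = np.zeros((6, 4))
hook = position_selective_steering(np.ones(4), start=2, end=4)
steered = hook(hidden)  # rows 2 and 3 shifted, all others untouched
```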
Utilities for finding token indices of text spans, used with position-selective hooks:
- `find_text_span(tokenizer, full_text, target_text, search_after=0)` — general-purpose span finder using offset mapping
- `find_pairwise_task_spans(tokenizer, prompt, task_a, task_b, a_marker, b_marker)` — convenience wrapper for pairwise comparison prompts
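The offset-mapping idea behind `find_text_span` can be sketched without a real tokenizer (whitespace "tokens" stand in; `find_span_from_offsets` is a hypothetical helper, not the project's function):

```python
def find_span_from_offsets(offsets, full_text, target_text, search_after=0):
    """Map a character span of `target_text` to token indices [first, last+1).

    `offsets` is a list of (char_start, char_end) pairs, one per token, as
    returned by HF tokenizers with return_offsets_mapping=True.
    """
    char_start = full_text.index(target_text, search_after)
    char_end = char_start + len(target_text)
    token_idxs = [i for i, (s, e) in enumerate(offsets)
                  if s < char_end and e > char_start]  # overlapping tokens
    return token_idxs[0], token_idxs[-1] + 1

text = "compare task A with task B"
# Build whitespace "tokens" and their character offsets.
offsets, pos = [], 0
for tok in text.split():
    start = text.index(tok, pos)
    offsets.append((start, start + len(tok)))
    pos = start + len(tok)

span = find_span_from_offsets(offsets, text, "task A")  # tokens "task" and "A"
```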
`suggest_coefficient_range(activations_path, manifest_dir, probe_id, multipliers)` — returns coefficients as multiples of the mean activation norm at the probe layer, removing the guesswork from coefficient selection.
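The heuristic can be sketched as follows (hypothetical `suggest_coefficients` helper; assumes a unit-norm probe direction, so `coef * direction` is comparable in magnitude to typical hidden states):

```python
import numpy as np

def suggest_coefficients(activations, multipliers=(0.5, 1.0, 2.0)):
    """Return steering coefficients as multiples of the mean activation norm."""
    mean_norm = np.linalg.norm(activations, axis=1).mean()
    return [m * mean_norm for m in multipliers]

acts = np.full((10, 4), 3.0)  # every row has norm 6.0
coefs = suggest_coefficients(acts)
```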
Functions for analyzing steering experiment results: `aggregate_by_coefficient`, `compute_statistics`, `plot_dose_response`, `analyze_steering_experiment`.
When building LLM judges (coherence, valence, refusal, etc.), follow the existing pattern in `src/measurement/elicitation/refusal_judge.py`: use `instructor` with Pydantic response models for structured output. Do not write regex-based response parsers — `instructor.from_openai()` validates each response against the Pydantic schema (retrying on validation failure), so you always receive a well-formed structured object.