Mean Ablation for Circuit Discovery

A practice problem on Onlearn in the category Representation Learning, Advanced Theory & Miscellaneous DS.

Difficulty: medium.

Topics: Mean Ablation for Circuit Discovery, Mean Ablation, Logit Lens, Direct Logit Attribution, Activation Steering, Path Patching, Mechanistic Interpretability, Causal Inference, Information Theory, Neural Network Architecture, Statistical Analysis, Circuit Discovery, Intervention Methods, Activation Patching, Feature Attribution, Model Editing.

Implement a mean ablation function for neural network interpretability. Mean ablation is a technique for isolating neural circuits: a node's activation is replaced with its mean value, computed over a reference distribution, so the node no longer carries input-specific information but still sits at a typical value. Given the node activations, a binary mask marking which nodes to ablate, and the precomputed mean activations, return the ablated activations. Masked nodes are replaced with their means; unmasked nodes keep their original values.
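One possible reading of the task can be sketched in plain Python, as below. This is a minimal sketch, not a reference solution. The function names `mean_ablate` and `node_means`, the flat list-of-floats layout, and the convention that a truthy mask entry means "ablate" are all assumptions made for illustration; the problem statement does not fix input shapes or types.

```python
from typing import List, Sequence


def node_means(reference_batch: Sequence[Sequence[float]]) -> List[float]:
    """Per-node mean over a reference batch (rows = examples, columns = nodes).

    Hypothetical helper: the task says the means are precomputed, so this
    only shows where they could come from.
    """
    n_examples = len(reference_batch)
    if n_examples == 0:
        raise ValueError("reference_batch must contain at least one example")
    # zip(*rows) walks the batch column by column, i.e. node by node.
    return [sum(column) / n_examples for column in zip(*reference_batch)]


def mean_ablate(
    activations: Sequence[float],
    ablate_mask: Sequence[int],
    mean_activations: Sequence[float],
) -> List[float]:
    """Replace masked nodes with their mean; leave unmasked nodes unchanged.

    Assumption: a truthy mask entry (1 / True) marks a node to ablate.
    """
    if not (len(activations) == len(ablate_mask) == len(mean_activations)):
        raise ValueError("activations, mask, and means must have equal length")
    return [
        mean if flag else act
        for act, flag, mean in zip(activations, ablate_mask, mean_activations)
    ]


# Usage: ablate nodes 0 and 2, keep node 1.
reference = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
means = node_means(reference)           # [2.0, 3.0, 4.0]
ablated = mean_ablate([0.5, -1.0, 2.0], [1, 0, 1], means)
print(ablated)                          # [2.0, -1.0, 4.0]
```

If the inputs are NumPy arrays, the same selection can be written as `np.where(mask.astype(bool), mean_activations, activations)`, which also broadcasts a per-node mask and mean vector across a batch dimension.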