LEACE: Perfect linear concept erasure in closed form

Concept erasure aims to remove specified features from a neural representation. It can be used to improve fairness (e.g. preventing a model from using gender or race) and interpretability (e.g. removing a concept to observe changes in model behavior). In this paper, we introduce LEAst-squares Concept Erasure (LEACE), a fast closed-form method which provably prevents all linear classifiers from detecting a concept while inflicting the least possible damage to the representation. We apply LEACE to large language models with a novel procedure called “concept scrubbing,” which erases information about the target concept from every hidden layer in the network. We demonstrate the usefulness of our method on two tasks: measuring the extent to which language models rely on part-of-speech information, and reducing gender bias in BERT embeddings.
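To make the closed-form claim concrete, the following is a minimal NumPy sketch, not the authors' reference implementation. It assumes the eraser takes the form r(x) = x - W⁺ P W (x - E[x]), where W = (Σ_XX^{1/2})⁺ is a whitening matrix and P is the orthogonal projector onto the column space of W Σ_XZ. This formula is not stated in the abstract above and should be checked against the full paper. The name `fit_leace`, the `tol` parameter, and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_leace(X, Z, tol=1e-10):
    """Fit a linear eraser on features X (n, d) and concept labels Z (n,) or (n, k).

    Assumed closed form: r(x) = x - W^+ P W (x - mean(X)).
    """
    Z = np.asarray(Z, dtype=float).reshape(len(Z), -1)
    n = X.shape[0]
    mu = X.mean(axis=0)
    Xc = X - mu
    Zc = Z - Z.mean(axis=0)

    sigma_xx = Xc.T @ Xc / n   # feature covariance
    sigma_xz = Xc.T @ Zc / n   # feature/concept cross-covariance

    # Whitening matrix W = (sigma_xx^{1/2})^+ and its pseudoinverse,
    # built from the eigendecomposition of the symmetric covariance.
    evals, evecs = np.linalg.eigh(sigma_xx)
    keep = evals > tol * evals.max()
    sqrt_e = np.sqrt(evals[keep])
    V = evecs[:, keep]
    W = (V / sqrt_e) @ V.T
    W_pinv = (V * sqrt_e) @ V.T

    # Orthogonal projector onto the column space of W @ sigma_xz.
    U, s, _ = np.linalg.svd(W @ sigma_xz, full_matrices=False)
    U = U[:, s > tol * s.max()]
    P = U @ U.T

    proj = W_pinv @ P @ W

    def erase(x):
        # x holds one example per row, so the matrix is applied transposed.
        return x - (x - mu) @ proj.T

    return erase

# Synthetic check: a binary concept shifts the mean of the features.
rng = np.random.default_rng(0)
Z = rng.integers(0, 2, size=(500, 1)).astype(float)
X = rng.normal(size=(500, 8)) + 2.0 * Z * rng.normal(size=(1, 8))

erase = fit_leace(X, Z)
X_erased = erase(X)

# After erasure the sample cross-covariance with the concept should vanish.
cross_cov = (X_erased - X_erased.mean(axis=0)).T @ (Z - Z.mean(axis=0)) / len(X)
print(np.abs(cross_cov).max())
```

Zero cross-covariance between the erased features and the concept labels is the condition this sketch targets. The sketch fits and evaluates on the same synthetic sample, so it says nothing about held-out data or about the layer-by-layer "concept scrubbing" procedure.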
