Stella Biderman 02/04/2021

The Hard Problem of Aligning AI to Human Values

Connor Leahy and Stella Biderman. "The Hard Problem of Aligning AI to Human Values." The State of AI Ethics Report 4, p. 180-183. 2021.

We discuss how common framings of AI ethics conversations underestimate the difficulty of the task at hand: if a model becomes dangerous through mere exposure to unethical content, it is unacceptably dangerous and broken at its core. While gating such models (as OpenAI does with GPT-3) behind an API with rudimentary automatic filters plus less rudimentary human moderation is a useful temporary patch, it does not address the underlying problem. These models are fundamentally not doing what we as humans want them to do, which is to act in useful, aligned ways rather than merely regurgitate an accurate distribution of the text they were trained on. We need AI that is, like humans, capable of reading all kinds of content, understanding it, and then deciding to act in an ethical manner. Indeed, learning more about unethical ideologies should enhance one's ability to act ethically and fight such toxic beliefs.