Alignment MineTest

The Alignment-MineTest project uses the open-source Minetest voxel engine, which is similar to the popular game Minecraft, as a research platform for studying a range of problems in AI alignment.

One specific focus of the project is corrigibility. While we don’t have a complete definition of corrigibility, corrigible agents are those that exhibit desirable behaviors, such as letting their operators update or shut them down. Ensuring that AI systems are corrigible is important because it allows humans to intervene and prevent negative consequences if an AI system begins to behave in a way that is not aligned with human values.

The project is currently in its early stages and is focused on adding gym-like functionality to Minetest. We’re also in the process of training some basic RL agents that we can use for downstream experiments. Once both of these are done, we can start looking into interpretability methods for agents and world models, along with misgeneralization experiments. Long term, we hope to demonstrate corrigibility failures in standard RL systems and develop methods to prevent them.
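To make "gym-like functionality" concrete, the sketch below shows the kind of interface such a wrapper would expose to an RL agent: the standard Gymnasium reset/step loop over image observations and discrete actions. The class name, action set, and placeholder frames are hypothetical and for illustration only; they are not the project's actual API.

```python
import gymnasium as gym
import numpy as np


class MinetestEnv(gym.Env):
    """Illustrative sketch of a Gym-style wrapper around a Minetest client.

    The class name and the internals are placeholders; only the interface
    (observation_space, action_space, reset, step) reflects the standard
    Gymnasium API an RL agent would train against.
    """

    def __init__(self, resolution=(64, 64)):
        super().__init__()
        # Observations: raw RGB frames rendered by the game client.
        self.observation_space = gym.spaces.Box(
            low=0, high=255, shape=(*resolution, 3), dtype=np.uint8
        )
        # Actions: a small discrete set (e.g. move, turn, jump, dig, place).
        self.action_space = gym.spaces.Discrete(8)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        # A real wrapper would start/reset the Minetest world here.
        obs = np.zeros(self.observation_space.shape, dtype=np.uint8)
        return obs, {}

    def step(self, action):
        # A real wrapper would send `action` to the Minetest client and
        # read back the next frame plus a reward signal.
        obs = np.zeros(self.observation_space.shape, dtype=np.uint8)
        reward, terminated, truncated = 0.0, False, False
        return obs, reward, terminated, truncated, {}
```

With an interface like this, off-the-shelf RL libraries can interact with the environment through the usual `obs, info = env.reset()` and `obs, reward, terminated, truncated, info = env.step(action)` loop, which is what makes downstream agent-training and interpretability experiments straightforward to set up.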

To learn more, check out the Exploratory Phase outline.
