Attributing Mode Collapse in the Fine-Tuning of Large Language Models

Large language models (LLMs) are typically trained in two stages: pre-training on a large, diverse dataset to acquire general-purpose language modeling capabilities, followed by a fine-tuning stage (often called “instruction tuning” or “alignment”) on smaller, curated datasets to adapt the model to a specific task or downstream application, such as chat or general instruction following. It is a well-known anecdotal observation that instruction-tuned models have lower output diversity; the infamous example is that ChatGPT seems unable to generate more than a handful of distinct jokes. Low output diversity limits a model's usefulness in any application that depends on varied generations. In this manuscript, we quantify how each step of a typical RLHF or instruction-tuning pipeline changes a model's diversity, using a series of models trained in a controlled fine-tuning setup, and we compare these models to several open-weight models. We distinguish between two categories of diversity in LLMs: token-level prediction diversity and model output generation diversity. We find that the supervised fine-tuning and reward-based fine-tuning steps affect these two diversity types differently. Our results contribute to a better understanding of the effects of instruction tuning on the diversity of language models.
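To make the two notions of diversity concrete, the sketch below is a minimal illustration (not the metrics used in the paper): it measures token-level prediction diversity as the entropy of the model's next-token distribution, and output generation diversity as a distinct n-gram ratio across several sampled completions. The checkpoint name, prompt, and metric choices are assumptions for demonstration only.

```python
# Illustrative sketch: two ways "diversity" can be measured for a causal LM.
#  - token-level prediction diversity: entropy of the next-token distribution
#  - output generation diversity: distinct trigram ratio across sampled completions
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint; any causal LM would do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Tell me a joke about computers."
inputs = tok(prompt, return_tensors="pt")

# Token-level prediction diversity: entropy (in nats) of the next-token distribution.
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits over the vocabulary at the last position
probs = torch.softmax(logits, dim=-1)
token_entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
print(f"next-token entropy (nats): {token_entropy:.3f}")

# Output generation diversity: distinct trigram ratio over several sampled completions.
samples = model.generate(
    **inputs, do_sample=True, temperature=1.0, max_new_tokens=40,
    num_return_sequences=8, pad_token_id=tok.eos_token_id,
)
prompt_len = inputs["input_ids"].shape[1]
texts = [tok.decode(s[prompt_len:], skip_special_tokens=True) for s in samples]

def distinct_n(completions, n=3):
    """Fraction of n-grams that are unique across all completions (higher = more diverse)."""
    ngrams, total = set(), 0
    for text in completions:
        words = text.split()
        for i in range(len(words) - n + 1):
            ngrams.add(tuple(words[i:i + n]))
            total += 1
    return len(ngrams) / max(total, 1)

print(f"distinct-3 ratio across {len(texts)} samples: {distinct_n(texts):.3f}")
```

Comparing these two numbers before and after supervised fine-tuning or reward-based fine-tuning is one simple way to see how each stage shifts prediction diversity and generation diversity independently.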
