Stella Biderman 25/04/2023

Recasting Self-Attention with Holographic Reduced Representations

Self-attention has become a fundamental new approach to set and sequence modeling, particularly within transformer-style architectures. Given a sequence of $T$ items, standard self-attention requires $\mathcal{O}(T^2)$ memory and compute, which has led to many recent works that approximate self-attention with reduced computational or memory complexity. We recast self-attention using the neuro-symbolic approach of Holographic Reduced Representations (HRR). In doing so, we follow the same high-level strategy as standard self-attention: a set of queries is matched against a set of keys, and a weighted response of the values for each key is returned. Implemented as the "Hrrformer", this approach yields several benefits: faster compute ($\mathcal{O}(T \log T)$ time complexity), lower memory use per layer ($\mathcal{O}(T)$ space complexity), convergence in $10\times$ fewer epochs, near state-of-the-art accuracy, and the ability to learn with just a single layer. Combined, these benefits make our Hrrformer up to $370\times$ faster to train on the Long Range Arena benchmark.
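To make the HRR idea concrete, here is a minimal NumPy sketch of the two primitives HRR is built on: binding key/value pairs by circular convolution and retrieving a value by unbinding with a key. It is not the Hrrformer itself, and the abstract does not describe how the model combines these operations. The function names, the dimension `d`, the number of pairs `T`, and the use of Plate's approximate inverse are all assumptions chosen for illustration.

```python
import numpy as np

def bind(a, b):
    """Bind two vectors by circular convolution, computed via the FFT."""
    d = a.shape[-1]
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=d)

def approx_inverse(a):
    """Approximate HRR inverse (involution): a_inv[i] = a[-i mod d]."""
    return np.roll(a[..., ::-1], 1, axis=-1)

def unbind(memory, key):
    """Retrieve a noisy copy of the value that was bound to `key`."""
    return bind(memory, approx_inverse(key))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d, T = 1024, 4  # illustrative sizes, not taken from the paper

# Random keys and values with variance 1/d, the usual HRR initialisation.
keys = rng.normal(0.0, 1.0 / np.sqrt(d), size=(T, d))
values = rng.normal(0.0, 1.0 / np.sqrt(d), size=(T, d))

# Superpose all bound pairs into one d-dimensional vector.
memory = bind(keys, values).sum(axis=0)

# Query with the first key: the result resembles values[0], not values[1].
retrieved = unbind(memory, keys[0])
sim_match = cosine(retrieved, values[0])
sim_other = cosine(retrieved, values[1])
print(f"match: {sim_match:.2f}, other: {sim_other:.2f}")
```

Note that `memory` stays a single `d`-dimensional vector however many pairs are superposed, so no $T \times T$ score matrix is ever formed in this sketch. Retrieval is noisy, and the noise grows with the number of superposed pairs.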

Stella Biderman 05/04/2023

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

How do large language models (LLMs) develop and evolve over the course of training? How do these patterns change as models scale? To answer these questions, we introduce Pythia, a suite of 16 LLMs, all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. We provide public access to 154 checkpoints for each of the 16 models, alongside tools to download and reconstruct their exact training dataloaders for further study. We intend Pythia to facilitate research in many areas, and we present several case studies, including novel results in memorization, term frequency effects on few-shot arithmetic performance, and reducing gender bias. We demonstrate that this highly controlled setup can be used to yield novel insights into LLMs and their training dynamics.
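As a small worked example of where the figure of 154 checkpoints can come from, the sketch below enumerates one schedule that adds up to it: an initial checkpoint, ten log-spaced early checkpoints, and evenly spaced checkpoints every 1000 steps up to step 143000. This schedule and the `step<N>` naming are assumptions based on the project's public release notes rather than on the abstract above, so treat them as illustrative.

```python
def pythia_checkpoint_steps():
    """Enumerate an assumed checkpoint schedule totalling 154 steps."""
    initial = [0]                                     # untrained model
    log_spaced = [2 ** i for i in range(10)]          # 1, 2, 4, ..., 512
    evenly_spaced = list(range(1000, 143001, 1000))   # 1000, 2000, ..., 143000
    return initial + log_spaced + evenly_spaced

steps = pythia_checkpoint_steps()

# Assumed naming convention for per-checkpoint revisions.
revisions = [f"step{s}" for s in steps]

print(len(steps), revisions[:3], revisions[-1])
```

Under these assumptions the count is 1 + 10 + 143 = 154, matching the number quoted in the abstract.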

Stella Biderman 23/11/2022

HyperTuning: Toward Adapting Large Language Models without Back-propagation

Jason Phang, Yi Mao, Pengcheng He, Weizhu Chen. "HyperTuning: Toward Adapting Large Language Models without Back-propagation." arXiv preprint arXiv:2211.12485, 2022

Fine-tuning large language models for different tasks can be costly and inefficient, and even methods that reduce the number of tuned parameters still require full gradient-based optimization. We propose HyperTuning, a novel approach to model adaptation that uses a hypermodel to generate task-specific parameters for a fixed downstream model. We demonstrate a simple setup for hypertuning with HyperT5, a T5-based hypermodel that produces soft prefixes or LoRA parameters for a frozen T5 model from few-shot examples. We train HyperT5 in two stages: first, hyperpretraining with a modified conditional language modeling objective that trains a hypermodel to generate parameters; second, multi-task fine-tuning (MTF) on a large number of diverse language tasks. We evaluate HyperT5 on the P3, MetaICL, and Super-NaturalInstructions datasets, and show that it can effectively generate parameters for unseen tasks. Moreover, we show that using hypermodel-generated parameters as initializations for further parameter-efficient fine-tuning improves performance. HyperTuning can thus be a flexible and efficient way to leverage large language models for diverse downstream applications.
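The core idea, a hypermodel that emits adapter parameters for a frozen model in a single forward pass, can be sketched in a few lines of NumPy. This is a toy illustration under strong assumptions, not HyperT5: the hypermodel here is a single random linear map `H` rather than a trained T5 network, the "frozen model" is one weight matrix, and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank, d_task = 16, 2, 8  # illustrative sizes only

# Frozen downstream weight: never updated during adaptation.
W_frozen = rng.normal(size=(d_model, d_model))

# Hypothetical hypermodel: a linear map from a pooled few-shot embedding
# to the flattened LoRA factors A (d_model x rank) and B (rank x d_model).
H = rng.normal(scale=0.01, size=(d_task, 2 * d_model * rank))

def generate_lora(few_shot_embeddings):
    """Produce LoRA factors from few-shot examples with one forward pass."""
    task_vec = few_shot_embeddings.mean(axis=0)  # pool the examples
    flat = task_vec @ H
    A = flat[: d_model * rank].reshape(d_model, rank)
    B = flat[d_model * rank:].reshape(rank, d_model)
    return A, B

def adapted_forward(x, A, B):
    """Frozen layer plus the generated low-rank update: x W + (x A) B."""
    return x @ W_frozen + (x @ A) @ B

few_shot = rng.normal(size=(5, d_task))  # stand-in for encoded examples
A, B = generate_lora(few_shot)

x = rng.normal(size=(3, d_model))
y = adapted_forward(x, A, B)
print(y.shape)
```

No gradient step is taken when adapting to a task in this sketch. The generated `A` and `B` could also serve as a starting point for further parameter-efficient fine-tuning, as the abstract describes.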