
May is always an eventful time of year for ML researchers, with final ICML paper decisions and ICLR taking place in early May, and the NeurIPS submission deadline closing out the month. As ever, arXiv submissions continue to grow!

This month we take a look at three papers exploring new techniques that challenge the mainstream large-scale pretraining setup: transformers trained with a next-token prediction objective and optimized with Adam/AdamW.

The first paper, xLSTM, is a long-awaited deep dive into Sepp Hochreiter’s new, improved RNN architecture, nearly 30 years after the original LSTM was published. Drawing inspiration from linear attention, the authors demonstrate scaling comparable to transformers up to 1.3B parameters.

We then take a look at Schedule-Free optimizers from a team at FAIR. The authors propose a new class of optimizers that require no finicky learning rate scheduling. By replacing gradient momentum terms in standard optimizers with parameter averages, the authors show faster convergence than scheduled optimizers on a wide battery of small-scale deep learning tasks.

A further paper from FAIR extends the standard pretraining setup for large language models from next-token to multi-token prediction. This appears to improve performance particularly for larger models, and offers a natural draft model for speculative sampling to accelerate inference.

Here’s our summary of this month’s chosen papers:

xLSTM: Extended Long Short-Term Memory

Authors: Maximilian Beck, Korbinian Pöppel, et al. (NXAI, Johannes Kepler University Linz)


The key idea

Recurrent neural networks based on Long Short-Term Memory units were the backbone of NLP models before the advent of the now-ubiquitous transformer. This work seeks to close the gap between LSTMs and transformers in the crucial model-scaling regime of LLMs. The authors do this by extending the LSTM in two ways to create sLSTM and mLSTM layers, then incorporating these layers into a deep residual architecture called xLSTM.

Scaling trends for two variants of xLSTM (xLSTM[7:1] and xLSTM[1:0]) vs Llama, Mamba and RWKV-4, for models of 125M to 1.3B parameters. The xLSTM lines are similar and are roughly parallel to Mamba, which is parallel to Llama, then RWKV (in descending order of performance).

Their method

We’ll focus on the mLSTM variant, as the sLSTM variant is omitted from many of the best-performing models in their results. I think the best way to understand the architecture is to stare at a wall of maths for a while:

mLSTM definition. Define input, forget and output gates as linear functions of the input plus a bias, followed by an activation (exponential for the input gate, either sigmoid or exponential for the forget gate, and sigmoid for the output gate). Define query, key and value as linear functions of the input. The cell state is updated as the forget gate times the previous state plus the input gate times the value-key outer product. There is also a normaliser that is updated similarly but with the key alone. The output is the output gate times the cell-query product, divided by the maximum of the magnitude of the normaliser-query inner product and 1.
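
Written out (per our reading of the paper, and omitting its key-scaling and numerical-stabilisation details), the recurrence is roughly:

$$
\begin{aligned}
\mathbf{q}_t &= \mathbf{W}_q \mathbf{x}_t + \mathbf{b}_q, \qquad
\mathbf{k}_t = \mathbf{W}_k \mathbf{x}_t + \mathbf{b}_k, \qquad
\mathbf{v}_t = \mathbf{W}_v \mathbf{x}_t + \mathbf{b}_v, \\
i_t &= \exp(\mathbf{w}_i^\top \mathbf{x}_t + b_i), \qquad
f_t = \sigma(\mathbf{w}_f^\top \mathbf{x}_t + b_f) \text{ or } \exp(\mathbf{w}_f^\top \mathbf{x}_t + b_f), \qquad
\mathbf{o}_t = \sigma(\mathbf{W}_o \mathbf{x}_t + \mathbf{b}_o), \\
\mathbf{C}_t &= f_t \, \mathbf{C}_{t-1} + i_t \, \mathbf{v}_t \mathbf{k}_t^\top, \qquad
\mathbf{n}_t = f_t \, \mathbf{n}_{t-1} + i_t \, \mathbf{k}_t, \\
\mathbf{h}_t &= \mathbf{o}_t \odot \frac{\mathbf{C}_t \mathbf{q}_t}{\max\!\left(\left|\mathbf{n}_t^\top \mathbf{q}_t\right|,\, 1\right)}.
\end{aligned}
$$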

To give an intuition for this, there’s:

  • Inputs $\mathbf{x}$ and parameters $\mathbf{W_{q,k,v,o}}$, $\mathbf{b_{q,k,v,o}}$, $\mathbf{w_{i,f}}$, $\mathbf{b_{i,f}}$.
  • Six linear + activation ops, depending only on the inputs: $\textbf{q}, \textbf{k}, \textbf{v}, i, f, \textbf{o}$. The $f$ (forget) and $\textbf{o}$ (output) gates have sigmoid activation, giving outputs in the range $[0, 1]$, but $i$ (input) has an exponential activation. $\textbf{q}, \textbf{k}, \textbf{v}$ are linear.
  • A “cell” $\textbf{C}$: a decayed and weighted sum of $\textbf{v} \textbf{k}^\top$ (which I’ll call KV mapping) over time. At each step, the state is decayed according to the forget gate $f$ and the KV mapping is weighted according to the input gate $i$. The cell maps queries to values by matching them against keys.
  • A normaliser $\textbf{n}$: similar, but sums just $\textbf{k}$ instead of the KV mapping.
  • An output $\textbf{h}$: the product of the cell and the query, divided by the magnitude of the inner product of $\textbf{q}$ and the normaliser (lower-bounded at 1), and multiplied by the output gate $\textbf{o}$.

Like softmax dot product self-attention, this involves a normalised sum of exponentials; a key difference is that the input to exp depends only on the “source” (key, value), not on the “target” (query). It bears some similarities to linear attention, Mamba and RWKV, permitting a parallel scan over the inputs since time dependency is linear. It retains the RNN’s advantage of summarising the context in a fixed-size representation, $\textbf{C}$, for efficient autoregressive inference.
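
To make this concrete, here is a minimal NumPy sketch of a single mLSTM step for one head. This is our own illustration rather than the authors' code: the parameter names are assumptions, and the paper's key scaling and numerical-stabilisation tricks are omitted.

```python
import numpy as np

def mlstm_step(x, C, n, params):
    """One mLSTM recurrence step for a single head (illustrative sketch).

    x: input vector (d,); C: cell state (d, d); n: normaliser (d,).
    params: dict with matrices W_q, W_k, W_v, W_o (d, d), biases b_q, b_k, b_v, b_o (d,),
            gate weights w_i, w_f (d,) and scalar biases b_i, b_f.
    """
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    q = params["W_q"] @ x + params["b_q"]
    k = params["W_k"] @ x + params["b_k"]
    v = params["W_v"] @ x + params["b_v"]

    i = np.exp(params["w_i"] @ x + params["b_i"])     # input gate (exponential)
    f = sigmoid(params["w_f"] @ x + params["b_f"])    # forget gate (sigmoid variant)
    o = sigmoid(params["W_o"] @ x + params["b_o"])    # output gate

    C = f * C + i * np.outer(v, k)                    # decayed sum of KV outer products
    n = f * n + i * k                                 # decayed sum of keys

    h = o * (C @ q) / max(abs(n @ q), 1.0)            # query the cell, normalise, gate
    return h, C, n
```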

In the xLSTM architecture, this is used in a custom residual block that performs positionwise up projection before the multi-headed mLSTM.

Results

Downstream results for LLMs of up to 1.3B parameters, trained on 300B SlimPajama tokens:

Results for xLSTM with pure mLSTM and 7:1 mLSTM:sLSTM ratios, against baselines of RWKV, Llama and Mamba on multiple downstream tasks and for models up to 1.3B. The pure mLSTM version of xLSTM performs best in most cases, across SlimPajama validation perplexity, LAMBADA, HellaSwag, PIQA, ARC and WinoGrande.

(I haven’t been able to confirm if these are zero-shot or few-shot results.) Here, xLSTM[1:0] uses only the mLSTM layer described above, while xLSTM[7:1] includes 7 mLSTM layers per 1 sLSTM layer. These results appear to demonstrate the sufficiency of mLSTM for LLMs. The paper also includes a helpful set of ablations and synthetic tasks.

Takeaways

It’s refreshing to see non-transformer LLMs trained at scale, and that the xLSTM architecture appears competitive with transformers. More research could help us understand the benefits of these alternatives, and whether the scaling properties are robust.

Full paper: xLSTM: Extended Long Short-Term Memory

The Road Less Scheduled

Authors: Aaron Defazio, Xingyu (Alice) Yang, et al. (FAIR at Meta)


The key idea

Deep learning practitioners often use two key hacks to make optimisation of deep neural networks work in practice:

  1. Learning rate schedules
  2. Weight averaging for evaluation.

Here the authors propose a principled approach that adapts commonly used optimisers by replacing the first-order gradient moment (momentum) estimate with an averaged parameter state, avoiding the need for either of these hacks at no extra overhead.

Schedule-free optimizers combine Polyak (divergent) and Primal (slow) averaging to improve on scheduled optimizers

Their method

We’ll present scheduled and schedule-free AdamW side-by-side, identify key differences, and explain how they are motivated.

Algorithm comparison

Given:

  • initial parameter state $x_1$,
  • learning rate $\gamma$,
  • weight decay $\lambda$,
  • warmup steps $T_{warmup}$,
  • AdamW hyperparameters ($\beta_1$, $\beta_2$, $\epsilon$)

We compute:

| | Scheduled AdamW | Schedule-Free AdamW |
|---|---|---|
| Init | $z_0 = 0$, $v_0 = 0$ | $z_1 = x_1$, $v_0 = 0$ |
| | $\texttt{for t = 1 to T do}$ | $\texttt{for t = 1 to T do}$ |
| 1 | $g_t = \nabla f(x_t)$ | $y_t = (1 - \beta_1)z_t + \beta_1 x_t$ |
| 2 | $z_t = \beta_1 z_{t-1} + (1 - \beta_1)g_t$ | $g_t = \nabla f(y_t)$ |
| 3 | $v_t = \beta_2 v_{t-1} + (1 - \beta_2)g_t^2$ | $v_t = \beta_2 v_{t-1} + (1 - \beta_2)g_t^2$ |
| 4 | $\hat{z}_t = z_t/(1 - \beta_1^t)$, $\hat{v}_t = v_t/(1 - \beta_2^t)$ | $\hat{v}_t = v_t/(1 - \beta_2^t)$ |
| 5 | $\gamma_t = \gamma\,\textrm{min}(1, t/T_{warmup})$ | $\gamma_t = \gamma\,\textrm{min}(1, t/T_{warmup})$ |
| 6 | | $z_{t+1} = z_t - \gamma_t g_t/(\sqrt{\hat{v}_t} + \epsilon) - \gamma_t \lambda y_t$ |
| 7 | $\alpha_t = \textrm{schedule}(t)$ | $c_{t+1} = \gamma_t^2 / \sum^t_{i=1}{\gamma_i^2}$ |
| 8 | $x_{t+1} = (1 - \alpha_t \gamma_t \lambda)x_t - \gamma_t\alpha_t \hat{z}_t/(\sqrt{\hat{v}_t} + \epsilon)$ | $x_{t+1} = (1 - c_{t+1})x_t + c_{t+1}z_{t+1}$ |

Let’s go through line by line:

  • Initialisation: Standard scheduled AdamW initialises the gradient moment variables $z$ and $v$ at $0$. In schedule-free AdamW, $v$ still stores the second gradient moment, but $z$ now represents a raw, un-averaged parameter state and is initialised to match the averaged parameter state $x_1$.
  • Optimizer state updates (Lines 1-4): Standard scheduled AdamW computes the gradient at the current parameter state $x_t$ (Line 1), updates the moments as exponential moving averages with decay rates $\beta_1$ and $\beta_2$ (Lines 2-3), and corrects the moment estimation bias (Line 4). Schedule-free AdamW first computes an interpolation $y_t$ between the raw state $z_t$ and the averaged state $x_t$ (Line 1), computes the gradient at this interpolated point (Line 2), updates the second moment (Line 3), and corrects its estimation bias (Line 4).
  • Parameter state updates (Lines 5-8): Scheduled AdamW determines the learning rate from the warmup and decay schedules (Lines 5-7), before applying the standard update rule using the moments $z_t$, $v_t$, with weight decay applied to $x_t$ (Line 8). Schedule-free AdamW likewise applies warmup to the learning rate (Line 5), then updates the non-averaged parameter state $z_t$ using the gradient $g_t$ and second moment $v_t$, with weight decay applied to the interpolated weights $y_t$ (Line 6). We then update the weighted average of parameters $x_t$, with weights chosen to discount parameter states visited during warmup (Lines 7-8).
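
To make the right-hand column concrete, here is a minimal sketch of schedule-free AdamW on a single parameter vector. This is our own simplification rather than the authors' implementation (prefer their released code for real use); `grad_fn` is a stand-in for the stochastic gradient computation.

```python
import numpy as np

def schedule_free_adamw(grad_fn, x1, steps, lr=1e-3, beta1=0.9, beta2=0.999,
                        eps=1e-8, weight_decay=0.0, warmup_steps=100):
    """Schedule-free AdamW on one parameter vector (illustrative sketch).

    grad_fn(y) returns the (stochastic) gradient evaluated at y.
    Returns the averaged iterate x, which is also the state you would evaluate.
    """
    x = x1.copy()          # averaged parameter state (used for evaluation)
    z = x1.copy()          # raw, un-averaged iterate
    v = np.zeros_like(x1)  # second-moment estimate
    lr_sq_sum = 0.0

    for t in range(1, steps + 1):
        y = (1 - beta1) * z + beta1 * x            # interpolate raw and averaged iterates
        g = grad_fn(y)                             # gradient at the interpolated point

        v = beta2 * v + (1 - beta2) * g**2
        v_hat = v / (1 - beta2**t)                 # bias-corrected second moment

        lr_t = lr * min(1.0, t / warmup_steps)     # linear warmup, no decay schedule
        z = z - lr_t * g / (np.sqrt(v_hat) + eps) - lr_t * weight_decay * y

        lr_sq_sum += lr_t**2
        c = lr_t**2 / lr_sq_sum                    # averaging weight that discounts warmup
        x = (1 - c) * x + c * z

    return x
```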

What motivates these changes?

Previous work by the same group illustrated a connection between learning rate schedules and Polyak-Ruppert parameter averaging, a theoretically optimal technique for ensuring convergence in stochastic optimisation. Polyak-Ruppert averaging is simple to compute (effectively just Lines 6-8 of the schedule-free algorithm), but appears to perform worse than cosine decay schedules in practice.

The authors propose combining Polyak-Ruppert averaging with Primal averaging. In Primal averaging, we evaluate gradients at a slow-moving average of the parameters, rather than at the fast-moving current iterate (as is standard practice). On its own, however, Primal averaging also appears to perform worse in practice, as the parameters change too slowly.
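
In plain SGD notation (our paraphrase), the two averaging schemes differ only in where the gradient is evaluated; in both cases $x_t$ is the average used for evaluation:

$$
\begin{aligned}
\text{Polyak-Ruppert:}\quad & z_{t+1} = z_t - \gamma \nabla f(z_t), \qquad x_{t+1} = (1 - c_{t+1})\,x_t + c_{t+1}\,z_{t+1}, \\
\text{Primal averaging:}\quad & z_{t+1} = z_t - \gamma \nabla f(x_t), \qquad x_{t+1} = (1 - c_{t+1})\,x_t + c_{t+1}\,z_{t+1},
\end{aligned}
$$

where, for example, $c_{t+1} = 1/(t+1)$ gives a uniform average of the $z$ iterates.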

The combined solution is to effectively get the Primal-average parameters to move a bit faster, by interpolating them with a Polyak-Ruppert average. This interpolated parameter is the $y_t$ term computed on Line 1. Given that $\beta_1=1$ corresponds to pure Primal averaging and $\beta_1=0$ to pure Polyak-Ruppert averaging, the authors’ recommended $\beta_1=0.9$ is still pretty close to Primal averaging.

Two other changes appear to be less theoretically motivated: using $y_t$ for weight decay (rather than $x_t$ or $z_t$), and Polyak-Ruppert averaging coefficients $c_t$ that discount parameter states visited during learning rate warmup. Warmup-free optimisers are a step too far, it seems…

Results

The authors test schedule-free optimisers on a battery of small models of different types (Transformers, RNNs, CNNs, GNNs, recommenders), across different datasets and objective functions. In each case they show convergence comparable to carefully tuned learning rate schedules, with faster training dynamics in many cases.

Experiments across a wide range of architectures, datasets, and objective functions show the general applicability of schedule-free optimizers to a range of small-scale problems.

Takeaways

As hacks go, learning rate schedules are an enduring one. Given the drastic effect they can have on model performance, you omit them from a training pipeline at your peril. However, they have never seemed particularly well motivated beyond their empirical effect. This looks like a step in the right direction for hack-free optimisation in deep learning.

Full paper: The Road Less Scheduled

Better & Faster Large Language Models via Multi-token Prediction

Authors: Fabian Gloeckle, et al. (FAIR at Meta)


The key idea

Large language models are usually trained using the next-token prediction loss. The authors propose training the model to predict multiple tokens at a time instead, while still generating a single token at a time at inference as usual. By training models up to 13B parameters in size, they show that this can lead to models with better performance, particularly at coding tasks.

Overview of multi-token prediction.

Multi-token prediction: Each output head predicts a token (4-token prediction shown), while only the first head is employed during inference. The training scheme improves performance on the MBPP coding task as models get larger.

Their method

In order to enable multi-token prediction, the authors propose a simple modification to the standard transformer architecture. The final output embedding is fed into $n$ parallel output heads, each a single standard transformer layer. This effectively means that the final transformer layer is replaced by $n$ parallel transformer layers. The outputs of each head are then passed through a shared unembedding projection, generating a probability distribution over the whole vocabulary for each head. During training, each head is then trained to predict one of the next $n$ tokens for each training example. In order to minimise maximum memory usage during training, the forward/backward passes on each head are performed sequentially (Figure 2).
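
As a concrete sketch, the head arrangement might look as follows in PyTorch. This is our own illustration under simplifying assumptions (no causal masking, no shared-trunk definition, and no sequential forward/backward memory trick), not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHeads(nn.Module):
    """n parallel output heads on top of a shared trunk (illustrative sketch).

    Each head is a single transformer layer; all heads share one unembedding.
    NB: a real implementation needs causal masking inside these layers.
    """
    def __init__(self, d_model, n_attn_heads, vocab_size, n_predict=4):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_attn_heads, batch_first=True)
             for _ in range(n_predict)])
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)  # shared unembedding
        self.n_predict = n_predict

    def forward(self, trunk_out):
        # trunk_out: (batch, seq, d_model) from the shared transformer trunk
        return [self.unembed(head(trunk_out)) for head in self.heads]

def multi_token_loss(logits_per_head, tokens):
    """Sum of cross-entropy losses: head k predicts the token k+1 steps ahead."""
    total = 0.0
    for k, logits in enumerate(logits_per_head):
        pred = logits[:, : tokens.size(1) - (k + 1)]   # positions with a valid target
        target = tokens[:, k + 1 :]                    # tokens shifted k+1 steps ahead
        total = total + F.cross_entropy(
            pred.reshape(-1, pred.size(-1)), target.reshape(-1))
    return total
```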

Forward/backward pass for multi-token prediction.

During inference, all but the output of the first head are discarded, and tokens are generated one-by-one as with the standard transformer architecture. However, multi-token prediction can be used to speed up inference via self-speculative decoding, i.e. by using the $n$ generated tokens as an initial sequence draft and validating the sequence with just the next-token head in parallel.
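
A rough greedy sketch of that self-speculative loop, with hypothetical helpers `draft_tokens` and `next_token_logits` standing in for the model's multi-head draft pass and next-token-only scoring pass:

```python
import torch

@torch.no_grad()
def self_speculative_step(model, context, n_draft):
    """Draft n_draft tokens with the extra heads, then verify them with the
    next-token head in a single parallel pass (greedy variant, illustrative)."""
    # 1. Draft: one forward pass proposes the next n_draft tokens.
    draft = model.draft_tokens(context, n_draft)        # hypothetical helper, shape (n_draft,)

    # 2. Verify: score context + draft with the next-token head only.
    candidate = torch.cat([context, draft])
    logits = model.next_token_logits(candidate)         # hypothetical helper, (len, vocab)

    accepted = []
    for i, token in enumerate(draft):
        # Next-token prediction at the position just before draft[i]
        predicted = logits[len(context) + i - 1].argmax()
        if predicted != token:
            accepted.append(predicted)                   # replace and stop at first mismatch
            break
        accepted.append(token)
    # A full implementation would also append the verifier's extra token
    # when every drafted token is accepted.
    return torch.stack(accepted)
```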

Results

  • Improvements were only observed at scale: gains were strongest for the largest models.
  • A 3x inference speedup was observed when using speculative decoding with the 7B 4-token prediction model.
  • The optimal $n$ was empirically found to be 4 for token-based models and 8 for byte-based models.
  • Unlike on coding tasks, performance on natural language tasks does degrade compared to the next-token baseline.

Takeaways

The results of the paper are promising, as they show multi-token prediction can indeed lead to improved performance at scale, particularly at coding tasks, while at the same time providing a more suitable drafting model for speculative-sampling inference. The results hint at the possible benefits of teaching the model to “plan ahead” compared to the standard next-token prediction, and may lead to exciting alternatives to the widely-adopted token-by-token generation.

Full paper: Better & Faster Large Language Models via Multi-token Prediction

