Luke Prince

February Papers: Longer RoPEs & Better Quantisation

Improving LLM inference is a key research topic at the moment, and something we're particularly interested in at Graphcore because of its hardware implications. February saw several developments in this area, focussing on both the efficiency and the capabilities of LLM inference.

Microsoft contributed two of this month's papers, with the first showing a method for extrapolating rotary position embeddings (RoPE) to much longer sequences (sketched below), and the second an approach to storing weights in 6 bits. Researchers from Cornell University went further, pushing the limits of quantisation to as few as 3 bits for inference. Apple also introduced their new speculative streaming method, which gains efficiency by having the model predict multiple future tokens at once, improving on the popular speculative decoding technique.
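To make the RoPE idea concrete, here is a minimal NumPy sketch of rotary embeddings with a single uniform position-interpolation factor. The function names and the simple `scale` parameter are our illustration only; Microsoft's method finds more sophisticated rescalings, which this sketch does not reproduce.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    # One frequency per pair of channels, geometrically spaced from
    # 1 down to 1/base, as in standard rotary position embeddings.
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    # Dividing positions by `scale` stretches the usable context:
    # position 8192 with scale=4 gets the angles position 2048 had
    # during training, keeping rotations in-distribution.
    return np.outer(positions / scale, freqs)  # (len(positions), dim // 2)

def apply_rope(x, angles):
    # Rotate consecutive channel pairs of `x` by the given angles.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# A 64-dim query vector at position 8192, with the context stretched 4x.
q = np.random.randn(1, 64)
q_rot = apply_rope(q, rope_angles(np.array([8192]), dim=64, scale=4.0))
```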

January Papers: Great Teachers & Beyond Chinchilla

For the research community, 2023 was dominated by large transformers and the associated challenges with training, tuning and deploying them. This trend has continued into 2024, with January seeing some particularly useful developments in the area of efficient training.

Google DeepMind's work on active learning and MosaicML's work on updated scaling laws stood out to us as particularly noteworthy. The latter paper updates the influential Chinchilla scaling laws to account for the additional cost of inference, a key practical consideration that has influenced models like Llama & Mistral.
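To see why accounting for inference pulls the optimum toward smaller models, here is a back-of-the-envelope Python sketch. The loss fit and constants are the published Chinchilla estimates, and the cost model uses the standard approximations of ~6 FLOPs per parameter per training token and ~2 per inference token; the brute-force search is our illustration, not the paper's procedure.

```python
import numpy as np

# Chinchilla loss fit (Hoffmann et al. 2022): L(N, D) = E + A/N^a + B/D^b,
# with N parameters and D training tokens.
E, A, B, a, b = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**a + B / D**b

def best_model(flops_budget, inference_tokens):
    """Brute force over model sizes: reserve FLOPs for inference (~2 per
    parameter per token), spend the rest on training (~6 per parameter
    per token), and keep the size that minimises predicted loss."""
    best = (np.inf, None, None)
    for N in np.logspace(8, 11.5, 400):
        train_flops = flops_budget - 2 * N * inference_tokens
        if train_flops <= 0:
            continue  # model too big to serve within the budget
        D = train_flops / (6 * N)
        best = min(best, (loss(N, D), N, D))
    return best

# With no inference demand we recover the Chinchilla-style optimum;
# with a large expected serving load, the optimal model shrinks.
for demand in (0.0, 1e12):
    L, N, D = best_model(1e23, demand)
    print(f"inference tokens={demand:.0e}: N={N:.2e}, D={D:.2e}, loss={L:.3f}")
```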

While scaling laws assume a fixed architecture, there are also benefits to be gained by tweaking model design. Nvidia demonstrate this in their paper on diffusion model training dynamics, where they make various stability-inducing changes (we did something similar in our unit scaling paper; see the sketch below). Finally, we note a remarkable application of LLMs to geometry solving, a domain that had previously appeared too data-constrained and reasoning-dependent for current AI.
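As one illustration of the kind of stability-inducing change involved, the PyTorch sketch below renormalises a linear layer's weight rows on every forward pass, so activation magnitudes cannot drift as the raw weights grow during training. The class name and exact formulation are ours, for illustration; neither Nvidia's paper nor unit scaling is limited to this particular trick.

```python
import torch
import torch.nn.functional as F
from torch import nn

class NormalisedLinear(nn.Linear):
    """Linear layer whose weight rows are renormalised each forward pass.

    Illustrative sketch only: fixing each output channel's weight vector
    to unit norm means that, for roughly unit-variance inputs, outputs
    also stay near unit variance however the raw weights evolve.
    """
    def forward(self, x):
        w = F.normalize(self.weight, dim=1)  # unit-norm weight rows
        return F.linear(x, w, self.bias)

# Outputs remain near unit variance for unit-variance inputs.
layer = NormalisedLinear(512, 256, bias=False)
x = torch.randn(4096, 512)
print(layer(x).std())  # ~1.0
```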