Luka Ribar

February Papers: Longer RoPEs & Better Quantisation

Improving LLM inference is a key research topic at the moment, and something we're particularly interested in at Graphcore because of its hardware implications. February saw several developments in this area, focussing on both the efficiency and capabilities of LLM inference.

Microsoft contributed two of this month's papers: the first shows a method for extrapolating RoPE-based models to much longer sequences, and the second an approach to storing weights in 6 bits. Researchers from Cornell University have gone further, pushing quantisation down to as few as 3 bits per weight for inference. Apple also introduced their new speculative streaming method, which gains efficiency by asking the model to predict multiple future tokens at once, improving over the popular speculative decoding technique.
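To make the appeal of such low-bit weight formats concrete, here is a minimal sketch of plain per-channel round-to-nearest quantisation in PyTorch. This is a baseline for illustration only, not the method from either paper, and the function names are our own:

```python
import torch

def quantise_weights(w: torch.Tensor, bits: int = 3) -> tuple[torch.Tensor, torch.Tensor]:
    """Per-output-channel symmetric round-to-nearest quantisation (illustrative baseline)."""
    qmax = 2 ** (bits - 1) - 1                                    # e.g. 3 for signed 3-bit ints
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / qmax  # one scale per row
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantise(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(16, 64)
q, scale = quantise_weights(w, bits=3)
print((w - dequantise(q, scale)).abs().mean())  # average quantisation error
```

At 3 bits, simple round-to-nearest like this tends to hurt model quality noticeably, which is why the more sophisticated schemes in these papers are needed to make such aggressive compression viable.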

December Papers: FP8 Training & Simpler Transformers

The last month saw impressive developments in the space of efficient transformers and applied ML, from materials discovery to chip design.

Researchers at Microsoft showed that FP8 can be used in parts of the LLM training process that had until now been kept in higher precision, while work from ETH Zurich suggested a simplified way of designing transformer-like models.
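As a rough illustration of the kind of per-tensor scaled FP8 casting such recipes rely on, here is a minimal sketch assuming a recent PyTorch build with float8 dtypes; it is not the paper's implementation:

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable magnitude in the E4M3 FP8 format

def fp8_round_trip(x: torch.Tensor) -> torch.Tensor:
    """Cast a tensor to FP8 (E4M3) with a per-tensor scale, then back to FP32."""
    # Scale so the largest magnitude lands near the FP8 maximum, preserving dynamic range.
    scale = FP8_E4M3_MAX / x.abs().max().clamp(min=1e-12)
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)    # lossy 8-bit representation
    return x_fp8.to(torch.float32) / scale          # de-scale for inspection

grads = torch.randn(1024, 1024)
print((grads - fp8_round_trip(grads)).abs().mean())  # average rounding error
```

The per-tensor scale is the key ingredient: FP8's narrow dynamic range means tensors such as gradients must be rescaled before casting to avoid overflow or underflow.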

In terms of applications, DeepMind presented impressive results showing that GNNs can be used to discover new inorganic crystals — a key building block of many modern technologies. Nvidia have also trained a model to assist their engineers with chip design. This is a neat feedback loop: their chip designs have facilitated better LLMs, and now their LLMs could facilitate better chip designs. How useful this will be in practice remains to be seen.