Articles

July 17, 2024
in Articles
10 min read

Sparser llamas run faster — speed up LLM inference with SparQ Attention

When ChatGPT launched in 2022, it became evident how powerful the Transformer architecture is for natural language processing tasks when trained on large corpora of text. The performance of these Large Language Models (LLMs) has been attributed to the in-context learning capabilities that emerge with large-scale training.

April 24, 2024
in Articles
25 min read

A transformer walk-through, with Gemma

Transformer-based LLMs seem mysterious, but they don't need to. In this post, we'll walk through a modern transformer LLM, Google's Gemma, providing bare-bones PyTorch code and some intuition for why each step is there. If you're a programmer and casual ML enthusiast, this is written for you.

October 18, 2023
in Articles
4 min read

Almost-scaled dot-product attention

TL;DR: Scaled dot-product attention isn't properly scaled, and that's a good thing!

Notebook: almost-scaled dot-product attention