Posts by Tag

LLMs

A transformer walk-through, with Gemma

36 minute read

Transformer-based LLMs seem mysterious, but they don’t need to. In this post, we’ll walk through a modern transformer LLM, Google’s Gemma, providing bare-bon...

efficient-inference

training-dynamics

Scale-preserving nonlinearities for u-μP

5 minute read

My colleagues and I always get excited when, every once in a while, deep learning research throws up a fun little maths problem. Our recent work on u-μP does...

efficient-training

quantisation

transformers

A transformer walk-through, with Gemma

36 minute read

Transformer-based LLMs seem mysterious, but they don’t need to. In this post, we’ll walk through a modern transformer LLM, Google’s Gemma, providing bare-bon...

fine-tuning

mixture-of-experts

sparsity

mup

Scale-preserving nonlinearities for u-μP

5 minute read

My colleagues and I always get excited when, every once in a while, deep learning research throws up a fun little maths problem. Our recent work on u-μP does...

unit-scaling

Scale-preserving nonlinearities for u-μP

5 minute read

My colleagues and I always get excited when, every once in a while, deep learning research throws up a fun little maths problem. Our recent work on u-μP does...

GNNs

computer-vision

scaling-laws

inference

long-context

position-embeddings

speculative-decoding

retrieval-augmented-generation

sparse-attention

not-transformers

chip-design

fp8

number-formats

materials

DFT

active-learning

multi-modality

diffusion

automated-theorem-proving

synthetic-data

distributed-training

local-updates

optimization

learning-rate-schedules

RNNs

state-space-models

hallucinations
