3.1.2. unit_scaling.transformer_residual_scaling_rule
- unit_scaling.transformer_residual_scaling_rule(residual_mult: float = 1.0, residual_attn_ratio: float = 1.0) → Callable[[int, int], float] [source]
Compute the residual tau ratios for the default transformer rule.
For a transformer stack that starts with an embedding layer, then alternates between attention and MLP layers, this rule ensures:
Every attention layer contributes the same scale.
Every MLP layer contributes the same scale.
The ratio of the average (variance) contribution of all attention and all MLP layers to the embedding layer is residual_mult.
The ratio of Attn to MLP contributions is residual_attn_ratio.
If both hyperparameters are set to 1.0, the total contributions of the embedding, attention and MLP layers are all equal.
This scheme is described in Appendix G of the u-μP paper.
- Parameters:
  - residual_mult (float) – scale of the combined attention and MLP contributions relative to the embedding contribution.
  - residual_attn_ratio (float) – scale of attention contributions relative to MLP contributions.
- Returns:
a function for calculating tau at a given depth.
- Return type:
fn(index, layers) -> tau
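The constraints above pin down one tau value per depth. A minimal sketch of such a rule is below; it is not the library's implementation, and it assumes attention layers sit at even indices and MLP layers at odd indices, that contributions are tracked as variances (with the embedding contributing variance 1), that the two hyperparameters scale standard deviations (hence the squaring), and that tau is the branch-to-skip standard-deviation ratio at each depth:

```python
import math
from typing import Callable


def residual_scaling_rule_sketch(
    residual_mult: float = 1.0, residual_attn_ratio: float = 1.0
) -> Callable[[int, int], float]:
    """Illustrative sketch only (not unit_scaling's actual code)."""

    def tau(index: int, layers: int) -> float:
        n_attn = (layers + 1) // 2  # assumed: attention at indices 0, 2, ...
        n_mlp = layers // 2  # assumed: MLP at indices 1, 3, ...
        # Total attention (A) and MLP (M) variances relative to the
        # embedding's variance of 1.0: their average is residual_mult**2
        # and their ratio is residual_attn_ratio**2 (assumption: the
        # hyperparameters act on standard deviations).
        M = 2 * residual_mult**2 / (1 + residual_attn_ratio**2)
        A = residual_attn_ratio**2 * M
        v = [
            (A / n_attn if n_attn else 0.0)
            if i % 2 == 0
            else (M / n_mlp if n_mlp else 0.0)
            for i in range(layers)
        ]
        # tau = branch std / skip std, where the skip's variance is the
        # embedding's plus all earlier layers' contributions.
        running = 1.0 + sum(v[:index])
        return math.sqrt(v[index] / running)

    return tau
```

With the defaults, every layer in a 4-layer stack contributes variance 0.5, so tau shrinks with depth (sqrt(0.5/1.0) at index 0 down to sqrt(0.5/2.5) at index 3), and the embedding, attention and MLP totals all come out equal, matching the property stated above.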