3.1.19. unit_scaling.TransformerDecoder
- class unit_scaling.TransformerDecoder(hidden_size: int, vocab_size: int, layers: int, heads: int, dropout_p: float = 0.0, residual_scaling: Callable[[int, int], float] = <function transformer_residual_scaling_rule.<locals>._tau>)
A unit-scaled implementation of a decoder-type transformer.
Note: this class is currently intended only to demonstrate scaling and lacks key functionality (for example masking, positional embeddings, and support for inference).
- Parameters:
hidden_size (int) – the hidden dimension size of the input.
vocab_size (int) – the number of tokens in the vocabulary.
layers (int) – the number of transformer layers.
heads (int) – the number of attention heads.
dropout_p (float, optional) – the dropout probability applied to the embedding, the residual branches, and the post-softmax attention weights.
residual_scaling (Callable[[int, int], float], optional) – scheme for controlling residual weights in the transformer trunk; see unit_scaling.core.functional.transformer_residual_scaling_rule() (the default).
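A minimal usage sketch follows. The configuration values are hypothetical, and the forward interface shown (a batch of token IDs in, a scalar training loss out) is an assumption rather than a documented contract; consult the library source for the exact signature.

```python
import torch
import unit_scaling as uu

# Illustrative hyperparameters only, not a recommended configuration.
model = uu.TransformerDecoder(
    hidden_size=256,
    vocab_size=1024,
    layers=4,
    heads=8,
    dropout_p=0.1,
)

# Assumed interface: token IDs in, scalar training loss out (labels derived
# internally). The actual forward signature may differ.
input_ids = torch.randint(0, 1024, (2, 64))  # (batch, sequence)
loss = model(input_ids)
loss.backward()
```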
- append(module: Module) → Sequential
Append a given module to the end.
- Parameters:
module (nn.Module) – module to append
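Continuing the sketch above, append() behaves as on torch.nn.Sequential: the new module is placed after the existing children and runs last in forward(). The nn.Identity() used here is purely a hypothetical placeholder.

```python
import torch.nn as nn

# The appended module is added after the existing layers; Identity is a
# no-op placeholder used only for illustration.
model.append(nn.Identity())
```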