3.4.3. unit_scaling.formats.FPFormat

class unit_scaling.formats.FPFormat(exponent_bits: int, mantissa_bits: int, rounding: str = 'stochastic', srbits: int = 0)

Generic representation of a floating-point number format.

property bits: int: The number of bits used by the format.

property max_absolute_value: float: The maximum absolute value representable by the format.

property min_absolute_normal: float: The minimum absolute normal value representable by the format.

property min_absolute_subnormal: float: The minimum absolute subnormal value representable by the format.
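
As a rough illustration of how these four properties relate to `exponent_bits` and `mantissa_bits`, the sketch below computes them under IEEE 754 conventions: one sign bit, with the all-ones exponent code reserved for infinities and NaNs. The `ieee_style_ranges` helper is hypothetical and is not part of `unit_scaling`; the library's own values may differ if it treats the top exponent code differently.

```python
def ieee_style_ranges(exponent_bits: int, mantissa_bits: int) -> dict:
    """Range summary for an IEEE 754-style format (an assumption, not the library's code)."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = bias       # largest exponent of a finite value
    min_exponent = 1 - bias   # smallest exponent of a normal value
    return {
        # one sign bit plus the exponent and mantissa fields
        "bits": 1 + exponent_bits + mantissa_bits,
        # all mantissa bits set at the largest exponent
        "max_absolute_value": 2.0 ** max_exponent * (2 - 2.0 ** -mantissa_bits),
        # implicit leading one, zero mantissa, smallest exponent
        "min_absolute_normal": 2.0 ** min_exponent,
        # only the lowest mantissa bit set, no implicit leading one
        "min_absolute_subnormal": 2.0 ** (min_exponent - mantissa_bits),
    }


fp16 = ieee_style_ranges(exponent_bits=5, mantissa_bits=10)
print(fp16["bits"], fp16["max_absolute_value"])  # 16 65504.0
```

With 5 exponent bits and 10 mantissa bits this reproduces the familiar half-precision limits, which makes it a convenient sanity check when comparing against the values an `FPFormat` instance reports.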

quantise(x: Tensor) → Tensor: Non-differentiably quantise the given tensor in this format.
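
To make the idea of quantising to a format concrete, here is a minimal scalar sketch in pure Python. It is not the library's implementation: it assumes IEEE 754-style exponent handling, round-to-nearest with ties to even, and saturation at the maximum value, whereas `FPFormat` operates on tensors and defaults to `rounding='stochastic'`. The function name `quantise_nearest` is hypothetical.

```python
import math


def quantise_nearest(x: float, exponent_bits: int, mantissa_bits: int) -> float:
    """Round x to the nearest value of an IEEE-style format, saturating at its maximum."""
    if x == 0.0 or math.isnan(x):
        return x
    bias = 2 ** (exponent_bits - 1) - 1
    min_exponent = 1 - bias
    max_value = 2.0 ** bias * (2 - 2.0 ** -mantissa_bits)
    magnitude = min(abs(x), max_value)          # saturate (also handles infinities)
    _, e = math.frexp(magnitude)                # magnitude = m * 2**e, 0.5 <= m < 1
    exponent = max(e - 1, min_exponent)         # subnormals share the minimum exponent
    step = 2.0 ** (exponent - mantissa_bits)    # spacing of representable values here
    rounded = round(magnitude / step) * step    # Python's round() breaks ties to even
    return math.copysign(min(rounded, max_value), x)


# 0.1 is not representable with 10 mantissa bits, so it moves to a nearby value.
print(quantise_nearest(0.1, exponent_bits=5, mantissa_bits=10))
```

Because every step is a power of two, the division and multiplication are exact, so the only rounding in the sketch is the explicit call to `round()`.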

quantise_bwd(x: Tensor) → Tensor: Quantise the given tensor in the backward pass only.

quantise_fwd(x: Tensor) → Tensor: Quantise the given tensor in the forward pass only.
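
The difference between the forward-only and backward-only variants can be sketched without an autograd framework. The example below assumes that "forward only" means the value is quantised while the gradient passes through unchanged, and that "backward only" means the value passes through unchanged while the incoming gradient is quantised. This reading is an assumption, and all names here (`quantise_fwd_sketch`, `quantise_bwd_sketch`, `to_quarters`) are hypothetical stand-ins rather than library functions.

```python
from typing import Callable, Tuple

Quantiser = Callable[[float], float]
# Each op returns (forward output, function mapping upstream gradient to input gradient).
OpResult = Tuple[float, Callable[[float], float]]


def quantise_fwd_sketch(x: float, quantise: Quantiser) -> OpResult:
    """Quantise the value; pass the gradient through unchanged (assumed behaviour)."""
    return quantise(x), lambda grad: grad


def quantise_bwd_sketch(x: float, quantise: Quantiser) -> OpResult:
    """Pass the value through unchanged; quantise the gradient (assumed behaviour)."""
    return x, lambda grad: quantise(grad)


def to_quarters(v: float) -> float:
    """Stand-in quantiser for illustration only: round to multiples of 0.25."""
    return round(v * 4) / 4


y, backward = quantise_fwd_sketch(0.3, to_quarters)
print(y, backward(0.3))  # 0.25 0.3

y, backward = quantise_bwd_sketch(0.3, to_quarters)
print(y, backward(0.3))  # 0.3 0.25
```

Separating the two directions in this way lets activations and gradients be simulated in different formats, for example by applying one format's forward-only quantiser and another format's backward-only quantiser to the same tensor.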