3.4.3. unit_scaling.formats.FPFormat

class unit_scaling.formats.FPFormat(exponent_bits: int, mantissa_bits: int, rounding: str = 'stochastic', srbits: int = 0)

Generic representation of a floating-point number format.

property bits: int

The number of bits used by the format.

property max_absolute_value: float

The maximum absolute value representable by the format.

property min_absolute_normal: float

The minimum absolute normal value representable by the format.

property min_absolute_subnormal: float

The minimum absolute subnormal value representable by the format.
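The properties above can be illustrated with a minimal sketch. It assumes IEEE 754-style conventions (one sign bit, an exponent bias of 2^(exponent_bits - 1) - 1, and the top exponent code reserved for infinities and NaNs). `IEEELikeFormat` is a hypothetical stand-in written for this example, not the library's `FPFormat`, which may use a different exponent convention and therefore report different values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IEEELikeFormat:
    """Hypothetical stand-in for illustration; not the library's FPFormat."""

    exponent_bits: int
    mantissa_bits: int

    @property
    def bits(self) -> int:
        # Assumed layout: 1 sign bit + exponent field + mantissa field.
        return 1 + self.exponent_bits + self.mantissa_bits

    @property
    def bias(self) -> int:
        # IEEE 754-style exponent bias (an assumption of this sketch).
        return 2 ** (self.exponent_bits - 1) - 1

    @property
    def max_absolute_value(self) -> float:
        # Largest finite value: all-ones mantissa at the largest exponent.
        return 2.0 ** self.bias * (2.0 - 2.0 ** -self.mantissa_bits)

    @property
    def min_absolute_normal(self) -> float:
        # Smallest normal value: exponent field 1, mantissa field 0.
        return 2.0 ** (1 - self.bias)

    @property
    def min_absolute_subnormal(self) -> float:
        # Smallest subnormal value: one unit in the last mantissa place.
        return self.min_absolute_normal * 2.0 ** -self.mantissa_bits


# IEEE 754 binary16 has 5 exponent bits and 10 mantissa bits.
half = IEEELikeFormat(exponent_bits=5, mantissa_bits=10)
print(half.bits)                    # 16
print(half.max_absolute_value)      # 65504.0
print(half.min_absolute_normal)     # 6.103515625e-05
print(half.min_absolute_subnormal)  # 5.960464477539063e-08
```

The binary16 figures are a convenient check because they are widely published. For the values a given `FPFormat` actually reports, query its properties directly.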

quantise(x: Tensor) → Tensor

Quantise the given tensor in this format. The operation is not differentiable.
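As a rough illustration of what quantising to a small floating-point format involves, the sketch below rounds a single Python float to the nearest representable value. It is not the library's implementation: it works on scalars rather than tensors, covers only round-to-nearest (ties to even) rather than the default `'stochastic'` mode, and assumes IEEE 754-style exponent conventions with saturation at the maximum value. `quantise_nearest` is a hypothetical helper.

```python
import math


def quantise_nearest(x: float, exponent_bits: int, mantissa_bits: int) -> float:
    """Round x to the nearest value of an IEEE-like format (hypothetical helper)."""
    if x == 0.0 or math.isnan(x):
        return x
    bias = 2 ** (exponent_bits - 1) - 1
    max_value = 2.0 ** bias * (2.0 - 2.0 ** -mantissa_bits)
    # Saturate out-of-range inputs (including infinities) at the largest value.
    x = max(-max_value, min(max_value, x))
    # frexp gives abs(x) = m * 2**e with 0.5 <= m < 1, so floor(log2) is e - 1.
    _, e = math.frexp(abs(x))
    # Clamp the exponent so tiny inputs land on the fixed subnormal grid.
    exponent = max(e - 1, 1 - bias)
    step = 2.0 ** (exponent - mantissa_bits)  # spacing between neighbours
    # Python's built-in round() resolves ties to the even integer.
    return round(x / step) * step


# E5M2: values in [1, 2) are spaced 0.25 apart.
print(quantise_nearest(1.2, 5, 2))    # 1.25
print(quantise_nearest(1.125, 5, 2))  # 1.0 (tie, rounds to even)
print(quantise_nearest(1.375, 5, 2))  # 1.5 (tie, rounds to even)
print(quantise_nearest(1e6, 5, 2))    # 57344.0 (saturated)
print(quantise_nearest(1e-6, 5, 2))   # 0.0 (below half the smallest subnormal)
```

Stochastic rounding would replace the `round()` call with a random choice between the two neighbouring grid points, weighted by their distance from the input.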

quantise_bwd(x: Tensor) → Tensor

Quantise the given tensor in the backward pass only.

quantise_fwd(x: Tensor) → Tensor

Quantise the given tensor in the forward pass only.
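The difference between the forward-only and backward-only variants can be sketched without an autograd framework. The example below assumes the natural reading of the descriptions above: forward-only quantisation quantises the value and passes the gradient through unchanged, while backward-only quantisation leaves the value unchanged and quantises the incoming gradient. The functions `forward_only`, `backward_only` and `to_quarters` are hypothetical and operate on scalars, not tensors.

```python
from typing import Callable, Tuple

Quantiser = Callable[[float], float]
# Each op returns (output, backward_fn); backward_fn maps the incoming
# gradient to the gradient with respect to the op's input.
OpResult = Tuple[float, Callable[[float], float]]


def forward_only(x: float, q: Quantiser) -> OpResult:
    # Forward: quantised value. Backward: gradient passes through unchanged.
    return q(x), lambda grad: grad


def backward_only(x: float, q: Quantiser) -> OpResult:
    # Forward: identity. Backward: the incoming gradient is quantised.
    return x, lambda grad: q(grad)


def to_quarters(v: float) -> float:
    """Toy quantiser (hypothetical): round to the nearest multiple of 0.25."""
    return round(v * 4) / 4


y, backward = forward_only(1.1, to_quarters)
print(y, backward(0.3))  # 1.0 0.3

y, backward = backward_only(1.1, to_quarters)
print(y, backward(0.3))  # 1.1 0.25
```

Applying both variants to the same tensor would simulate a format that is used for activations in the forward pass and for gradients in the backward pass.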