Quantization
Quantize a tensor to FP4.
Sample Code
With Four Over Six
import torch
from fouroversix import quantize_to_fp4  # import path assumed
x = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")
x_quantized = quantize_to_fp4(x)
Without Four Over Six
import torch
from fouroversix import AdaptiveBlockScalingRule, quantize_to_fp4  # import paths assumed
x = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")
x_quantized = quantize_to_fp4(x, scale_rule=AdaptiveBlockScalingRule.always_6)
With Stochastic Rounding
import torch
from fouroversix import RoundStyle, quantize_to_fp4  # import paths assumed
x = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")
x_quantized = quantize_to_fp4(x, round_style=RoundStyle.stochastic)
With the Random Hadamard Transform
import torch
from fouroversix import quantize_to_fp4  # import path assumed
from fouroversix.quantize import get_rht_matrix
x = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")
had = get_rht_matrix()
x_quantized = quantize_to_fp4(x, had=had)
Backends
We provide three different implementations of FP4 quantization:
- CUDA: A fast implementation written in CUDA which currently does not support the operations required for training (2D block scaling, stochastic rounding, random Hadamard transform). Requires a Blackwell GPU.
- Triton: A slightly slower implementation written in Triton which supports all operations needed for training. Requires a Blackwell GPU.
- PyTorch: A slow implementation written in PyTorch which supports all operations and can be run on any GPU.
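To pin a specific implementation instead of relying on automatic selection, pass the backend argument explicitly. A minimal sketch, assuming the import path and that QuantizeBackend exposes one member per backend (the exact member names are not documented here):

import torch
from fouroversix import QuantizeBackend, quantize_to_fp4  # import path assumed
x = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")
# QuantizeBackend.triton is a hypothetical member name; substitute the real one.
x_quantized = quantize_to_fp4(x, backend=QuantizeBackend.triton)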
If quantize_to_fp4 is called with backend=None, a backend will be selected automatically based on the following rules (a minimal sketch of these rules follows the list):
- If there is no GPU available, or if the available GPU is not a Blackwell GPU, select PyTorch.
- If any quantization options are set other than scale_rule, select Triton.
  - However, if the available GPU is SM120 (i.e. RTX 5090, RTX 6000) and round_style is set to RoundStyle.stochastic, select PyTorch, as stochastic rounding does not have hardware support on SM120 GPUs.
- Otherwise, select CUDA.
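For reference, this is roughly the decision tree, written as a minimal sketch. The function and parameter names below are hypothetical; only the branching mirrors the rules above (the real logic lives in src/fouroversix/frontend.py):

import torch

def pick_backend(non_scale_rule_options_set: bool, stochastic_rounding: bool) -> str:
    # No CUDA device at all: only the PyTorch backend can run.
    if not torch.cuda.is_available():
        return "pytorch"
    major, minor = torch.cuda.get_device_capability()
    if major < 10:  # pre-Blackwell GPU
        return "pytorch"
    if non_scale_rule_options_set:  # 2D block scaling, stochastic rounding, RHT, ...
        # SM120 (RTX 5090, RTX 6000) has no hardware stochastic rounding.
        if (major, minor) == (12, 0) and stochastic_rounding:
            return "pytorch"
        return "triton"
    return "cuda"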
Parameters
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | Tensor | The input tensor to quantize. | required |
| backend | QuantizeBackend | The backend to use for quantization: CUDA, Triton, or PyTorch. If None, a backend is selected automatically (see Backends above). | None |
| scale_rule | AdaptiveBlockScalingRule | The scaling rule to use during quantization. See [Adaptive Block Scaling](/adaptive_block_scaling) for more details. | mse |
| block_scale_2d | bool | If True, scale factors will be computed across 16x16 chunks of the input rather than 1x16 chunks. This is useful to apply to the weight matrix during training, so that W and W.T will be equivalent after quantization. | False |
| had | Tensor | A high-precision Hadamard matrix to apply to the input prior to quantization. | None |
| fp4_format | FP4Format | The FP4 format to quantize to. | nvfp4 |
| round_style | RoundStyle | The rounding style to apply during quantization, either RoundStyle.nearest or RoundStyle.stochastic. | nearest |
| transpose | bool | If True, the output will be a quantized version of the transposed input. This may be helpful for certain operations during training. | False |
Returns:
| Type | Description |
|---|---|
| FP4Tensor | A quantized FP4Tensor, which contains the packed E2M1 values, the FP8 scale factors, and the tensor-wide FP32 scale factor. |
Source code in src/fouroversix/frontend.py
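As a worked example of combining the training-oriented options, the following sketch quantizes a weight matrix and its transpose with 2D block scaling, stochastic rounding, and a random Hadamard transform; both calls return FP4Tensor objects (import paths assumed):

import torch
from fouroversix import RoundStyle, quantize_to_fp4  # import paths assumed
from fouroversix.quantize import get_rht_matrix

w = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")
had = get_rht_matrix()
# 16x16 block scales keep W and W.T consistent after quantization.
w_q = quantize_to_fp4(w, block_scale_2d=True, round_style=RoundStyle.stochastic, had=had)
# transpose=True yields a quantized version of W.T.
wt_q = quantize_to_fp4(w, block_scale_2d=True, round_style=RoundStyle.stochastic, had=had, transpose=True)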