Can you explain exactly how per-token quantization is performed on `o_proj` and `down_proj`?
https://github.com/AniZpZ/AutoSmoothQuant/blob/main/autosmoothquant/layers/nn/linear.py#L310
int8_weight, weight_scale = quantize_per_tensor_absmax(module.weight)
if act_quant == "per-token":
    alpha = weight_scale
When `act_quant` is `"per-token"`, `weight_scale` still comes from `quantize_per_tensor_absmax`, which is a bit confusing to me.
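To make the question concrete, here is a minimal sketch of one possible reading, which I have not confirmed against the repository: "per-token" describes only the activation scales (one per input row), while the weight keeps a single per-tensor scale, and the output is dequantized with the product of the two. The helpers below are hypothetical pure-Python stand-ins written for illustration. Only the name `quantize_per_tensor_absmax` comes from the linked file, and its body here is my assumption, not the library's code.

```python
# Illustrative sketch only: these helpers are hypothetical stand-ins,
# not AutoSmoothQuant's actual implementation.

def quantize_per_tensor_absmax(w, n_bits=8):
    """Quantize a weight matrix (list of rows) with ONE scale for the whole tensor."""
    q_max = 2 ** (n_bits - 1) - 1  # 127 for int8
    absmax = max(abs(v) for row in w for v in row)
    scale = max(absmax, 1e-8) / q_max
    q = [[max(-q_max, min(q_max, round(v / scale))) for v in row] for row in w]
    return q, scale


def quantize_per_token_absmax(x, n_bits=8):
    """Quantize activations with ONE scale per row, i.e. per token."""
    q_max = 2 ** (n_bits - 1) - 1
    q, scales = [], []
    for row in x:
        scale = max(max(abs(v) for v in row), 1e-8) / q_max
        scales.append(scale)
        q.append([max(-q_max, min(q_max, round(v / scale))) for v in row])
    return q, scales


def int8_linear(x, w):
    """Compute y = x @ w.T from int8 operands, then dequantize each row."""
    w_q, weight_scale = quantize_per_tensor_absmax(w)
    x_q, x_scales = quantize_per_token_absmax(x)
    alpha = weight_scale  # the per-tensor weight factor, as in the snippet above
    out = []
    for x_row, x_scale in zip(x_q, x_scales):
        # Integer accumulation over the input features.
        acc = [sum(a * b for a, b in zip(x_row, w_row)) for w_row in w_q]
        # Assumed dequantization: per-token activation scale times per-tensor weight scale.
        out.append([alpha * x_scale * v for v in acc])
    return out


x = [[0.5, -1.0, 0.25], [8.0, 2.0, -4.0]]  # 2 tokens, 3 input features
w = [[0.1, 0.2, -0.3], [0.4, -0.5, 0.6]]   # 2 output features, 3 input features
y = int8_linear(x, w)
# Full-precision reference for comparison.
y_ref = [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]
```

If this reading is right, `alpha = weight_scale` would only be half of the dequantization factor, and the per-token activation scale would have to be computed and applied somewhere else at runtime. Is that what happens for `o_proj` and `down_proj`?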