o ZŽh;ã@s¦ddlmZmZmZddlmZmZmZmZeƒr3ddl Z ddl mZddlZddl mZddl mZeƒr.grid)r) Ú is_contiguousÚshapeÚtorchZ empty_likeÚ float8_e4m3fnÚ new_emptyÚsizerr)rr!rrr*rr)rÚ act_quant0s2r1ÚBLOCK_SIZE_MÚBLOCK_SIZE_NÚBLOCK_SIZE_KÚGROUP_SIZE_Mc6Csætjdd}t ||¡}t ||¡}||}||}||}t|||ƒ}|||}|||} ||t d|¡|}!| |t d|¡|}"t d|¡}#||!dd…df| |#ddd…f|}$||#dd…df||"ddd…f| }%||!|}&|"|}'||'|}(tj||ftjd})tdt ||¡ƒD]h}*tj|$|#ddd…f||*|kdd}+tj|%|#dd…df||*|kdd},|*|}-|-| }.t |&|.|¡}/t |(|.|¡}0|)t |+|,¡|/dd…df|0ddd…f7})|$||7}$|%||7}%qž|j jtjkr|) tj¡}1n|j jtjkr%|) tj¡}1n|) tj¡}1||t d|¡}2| |t d|¡}3|||2dd…df||3ddd…f}4|2dd…df|k|3ddd…f|k@}5tj|4|1|5ddS)z¾Triton-accelerated function used to perform linear operations (dot product) on input tensors `A` and `B` with block-wise quantization, and store the result in output tensor `C`. rr Nr$g)ÚmaskÚother)r6)rrr&ÚminrÚzerosrÚrangerÚdotrrZbfloat16rZfloat16r)6ÚAÚBÚCÚAsÚBsÚMÚNÚKZgroup_nZgroup_kZ stride_amZ stride_akZ stride_bkZ stride_bnZ stride_cmZ stride_cnZstride_As_mZstride_As_kZstride_Bs_kZstride_Bs_nr2r3r4r5rZ num_pid_mZ num_pid_nZnum_pid_in_groupZgroup_idZfirst_pid_mZgroup_size_mZpid_mZpid_nZoffs_amZoffs_bnZoffs_kZa_ptrsZb_ptrsZAs_ptrsZoffs_bsnZBs_ptrsZaccumulatorÚkÚaÚbZk_startZoffs_ksZa_sZb_sÚcZoffs_cmZoffs_cnZc_ptrsZc_maskrrrÚ_w8a8_block_fp8_matmul>sL%,,((0,(rHr<r=r?r@Úoutput_dtypecsÖt|ƒdksJ‚|d|d}}|jd|jdksJ‚|jdd…|jdd…kr/| ¡s1J‚t |jd|¡|jdksAJ‚| ¡|jd‰|jdkrX| ¡rX|jdksZJ‚|j\‰}t ˆ|¡|jdkslJ‚t ||¡|jdksyJ‚|jdd…ˆf} |j| |d} d}ˆ|kršt ˆ¡}t |dƒ}|}||dks¤J‚|} ‡‡fd d „}t |||| ||ˆˆ|||| d¡| d¡| d¡| d¡| d¡| d¡| d¡| d¡| d¡| d¡|| |dd | S)a‰This function performs matrix multiplication with block-wise quantization. It takes two input tensors `A` and `B` with scales `As` and `Bs`. The output is returned in the specified `output_dtype`. Args: A: The input tensor, e.g., activation. B: The input tensor, e.g., weight. As: The per-token-group quantization scale for `A`. Bs: The per-block quantization scale for `B`. block_size: The block size for per-block quantization. It should be 2-dim, e.g., [128, 128]. output_dytpe: The dtype of the returned tensor. Returns: torch.Tensor: The result of matmul. rrér#Nr$r écs"t ˆ|d¡t ˆ|d¡fS)Nr2r3)r%r&)ZMETA©rArBrrr*Âs"z*w8a8_block_fp8_matmul_triton..gridéþÿÿÿé)r2r3r4r5)Úlenr,r+r%r&r'Úndimr/Znext_power_of_2rrHZstride)r<r=r?r@r!rIZblock_nZblock_krCZC_shaper>r2r4r3r*rrLrÚw8a8_block_fp8_matmul_triton“s^( èrQÚinput_qÚweight_qÚinput_scaleÚweight_scalec Cs€|jdkr|jn d|jd|jdf\}}}|jd} | d|¡} | |jdd¡}| |d}||d} tj||| ftj|jd}t|ƒD]k}||d}||d}t| ƒD]X}||d}||d}| dd…||…f}|||…||…f}|dd…||d…f}|||f}tj|| ¡tj dtj|jd||d|}|dd…||…f|7<qZqH| ||| ¡}| |¡S)aÁ Performs blocked matrix multiplication with FP8 quantized matrices. Args: input_q: Quantized input tensor with 1x128 block quantization weight_q: Quantized weight tensor with 128x128 block quantization input_scale: Scaling factors for input blocks weight_scale: Scaling factors for weight blocks block_size: Tuple of (M, N) for weight block dimensions output_dtype: Desired output dtype érJrr#©rÚdeviceN)Zscale_aZscale_bZ out_dtype)rPr,Úviewr-r9rrXr:Z _scaled_mmÚtZtensorr)rRrSrTrUr!rIZ batch_sizeZseq_lenZ hidden_dimÚout_featuresZinput_reshapedZinput_scale_reshapedZnum_weight_blocks_mZnum_weight_blocks_nÚoutputÚiZm_startZm_endÚjZn_startZn_endZinput_blockZweight_blockZcurr_input_scaleZcurr_weight_scaleZblock_resultrrrÚw8a8_block_fp8_matmul_compileäs>, ûùÿé r_csbeZdZejZ ddedededee eeff‡fdd „ Z d ejdejfdd „Z‡Z S)Ú FP8LinearFNÚdynamicÚin_featuresr[Úbiasr!c sØtƒ ||¡||_||_tj tj||tj |d¡|_ |j ¡dkrJ||dd|d}||dd|d} t tj|| tj|d¡|_ n| dd¡||_||_|rdt t |j¡¡|_dS| dd¡dS)NrWrJrÚweight_scale_invrc)ÚsuperÚ__init__rbr[r-ÚnnÚ ParameterÚemptyr`rÚweightÚelement_sizerrdZregister_parameterr!Úactivation_schemerc) Úselfrbr[rcrr!rXrlZscale_out_featuresZscale_in_features©Ú __class__rrrf)s ÿzFP8Linear.__init__Úinputr"c CsÊ|j ¡dkrt ||j|j¡Stƒrtj ¡j nd}t t|tjƒ}| |j¡ t ||jdƒ\}}t||j||j|j|jd}Wdƒn1sKwY| ¡|jdur^||j}|j|jdS)NrJÚcuda)rIr$)rjrkÚFZlinearrcrr-ZacceleratorZcurrent_acceleratorÚtypeÚgetattrrqrXr1r!rQrdrZsynchronizer)rmrpZdevice_typeZtorch_accelerator_moduleZqinputÚscaler\rrrÚforwardKs&úþ zFP8Linear.forward)FNNNra)Ú__name__Ú __module__Ú__qualname__r-r.rÚintÚboolrrrfÚTensorrvÚ __classcell__rrrnrr`&s"øþýüú"r`Fc sþ|durg}| ¡D]p\}}| |¡t|tjƒr_||pgvr_d |¡‰t‡fdd„|p-gDƒƒs_tƒ#t|j |j |jdu|jj |jj|j|jd|j|<d}Wdƒn1sZwYtt| ¡ƒƒdkrut||||||d\}}| d ¡q ||fS) z%Replace Linear layers with FP8Linear.NÚ.c3s|]}|ˆvVqdS)Nr)Ú.0Úkey©Zcurrent_key_name_strrrÚ us€z+_replace_with_fp8_linear..)rbr[rcrXrrlr!Tr)Úhas_been_replacedr#)Znamed_childrenÚappendÚ isinstancergÚLinearÚjoinÚanyrr`rbr[rcrjrXrrlZweight_block_sizeZ_modulesrOÚlistÚchildrenÚ_replace_with_fp8_linearÚpop) ÚmodelÚtp_planÚmodules_to_not_convertZcurrent_key_nameÚquantization_configrƒÚnameÚmoduleÚ_rrrr‹ds< ù ö ú r‹cCs\|durdgn|}|jdur| |j¡tt|ƒƒ}t||j||d\}}|s,t d¡|S)z:Helper function to replace model layers with FP8 versions.NZlm_head)rŽrrzYou are loading your model using fp8 but no linear modules were found in your model. Please double check your model architecture.)rÚextendr‰Úsetr‹Z_tp_planÚloggerÚwarning)rrrrƒrrrÚreplace_with_fp8_linear’s üÿr˜)r )NNNNF)NN)'ÚtypingrrrÚutilsrrrr r-Ztorch.nnrgr%Ztriton.languageÚlanguagerr rrZ acceleraterZ get_loggerrwr–ZjitZ constexprrr|rzr1rHrrrQÚcompiler_r†r`r‹r˜rrrrÚsˆ &æåäãZúÿþýüûú ùQúÿþýüûúùA@ ú0ý