@@ -1413,65 +1413,61 @@ <h2>Graph Optimization<a class="headerlink" href="#graph-optimization" title="Li
</dd></dl>

<dl class="py function">
- <dt class="sig sig-object py" id="ipex.quantization.get_weight_only_quant_qconfig_mapping">
- <span class="sig-prename descclassname"><span class="pre">ipex.quantization.</span></span><span class="sig-name descname"><span class="pre">get_weight_only_quant_qconfig_mapping</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="o"><span class="pre">*</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">weight_dtype</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">WoqWeightDtype.INT8</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">lowp_mode</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">WoqLowpMode.NONE</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">act_quant_mode</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">WoqActQuantMode.PER_BATCH_IC_BLOCK_SYM</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">group_size</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">-1</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">weight_qscheme</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">WoqWeightQScheme.UNDEFINED</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#ipex.quantization.get_weight_only_quant_qconfig_mapping" title="Link to this definition"></a></dt>
- <dd><p>Configuration for weight-only quantization (WOQ) for LLM.
- :param weight_dtype: Data type for weight, WoqWeightDtype.INT8/INT4/NF4, etc.
- :param lowp_mode: specify the lowest precision data type for computation. Data types</p>
- <blockquote>
- <div><p>that has even lower precision won’t be used.
- Not necessarily related to activation or weight dtype.
- - NONE(0): Use the activation data type for computation.
- - FP16(1): Use float16 (a.k.a. half) as the lowest precision for computation.
- - BF16(2): Use bfloat16 as the lowest precision for computation.
- - INT8(3): Use INT8 as the lowest precision for computation.</p>
- <blockquote>
- <div><p>Activation is quantized to int8 at runtime in this case.</p>
- </div></blockquote>
- </div></blockquote>
+ <dt class="sig sig-object py" id="intel_extension_for_pytorch.quantization.get_weight_only_quant_qconfig_mapping">
+ <span class="sig-prename descclassname"><span class="pre">intel_extension_for_pytorch.quantization.</span></span><span class="sig-name descname"><span class="pre">get_weight_only_quant_qconfig_mapping</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="o"><span class="pre">*</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">weight_dtype</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">WoqWeightDtype.INT8</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">lowp_mode</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">WoqLowpMode.NONE</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">act_quant_mode</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">WoqActQuantMode.PER_BATCH_IC_BLOCK_SYM</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">group_size</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">-1</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">weight_qscheme</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">int</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">WoqWeightQScheme.UNDEFINED</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#intel_extension_for_pytorch.quantization.get_weight_only_quant_qconfig_mapping" title="Link to this definition"></a></dt>
+ <dd><p>Configuration for weight-only quantization (WOQ) for LLM.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><ul class="simple">
- <li><p><strong>act_quant_mode</strong> – Quantization granularity of activation. It only works for lowp_mode=INT8.
+ <li><p><strong>weight_dtype</strong> – Data type for weight, WoqWeightDtype.INT8/INT4/NF4, etc.</p></li>
+ <li><p><strong>lowp_mode</strong> – <p>Specify the lowest precision data type for computation. Data types
+ with even lower precision won’t be used.
+ Not necessarily related to activation or weight dtype.</p>
+ <ul>
+ <li><p>NONE(0): Use the activation data type for computation.</p></li>
+ <li><p>FP16(1): Use float16 (a.k.a. half) as the lowest precision for computation.</p></li>
+ <li><p>BF16(2): Use bfloat16 as the lowest precision for computation.</p></li>
+ <li><p>INT8(3): Use INT8 as the lowest precision for computation.
+ Activation is quantized to int8 at runtime in this case.</p></li>
+ </ul>
+ </p></li>
+ <li><p><strong>act_quant_mode</strong> – <p>Quantization granularity of activation. It only works for lowp_mode=INT8.
It has no effect in other cases. The tensor is divided into groups, and
each group is quantized with its own quantization parameters.
- Suppose the activation has shape batch_size by input_channel (IC).
- - PER_TENSOR(0): Use the same quantization parameters for the entire tensor.
- - PER_IC_BLOCK(1): Tensor is divided along IC with group size = IC_BLOCK.
- - PER_BATCH(2): Tensor is divided along batch_size with group size = 1.
- - PER_BATCH_IC_BLOCK(3): Tenosr is divided into blocks of 1 x IC_BLOCK.
- Note that IC_BLOCK is determined by group_size automatically.</p></li>
+ Suppose the activation has shape batch_size by input_channel (IC).</p>
+ <ul>
+ <li><p>PER_TENSOR(0): Use the same quantization parameters for the entire tensor.</p></li>
+ <li><p>PER_IC_BLOCK(1): Tensor is divided along IC with group size = IC_BLOCK.</p></li>
+ <li><p>PER_BATCH(2): Tensor is divided along batch_size with group size = 1.</p></li>
+ <li><p>PER_BATCH_IC_BLOCK(3): Tensor is divided into blocks of 1 x IC_BLOCK.</p></li>
+ </ul>
+ <p>Note that IC_BLOCK is determined by group_size automatically.</p>
+ </p></li>
<li><p><strong>group_size</strong> – <p>Control quantization granularity along input channel (IC) dimension of weight.
- Must be a positive power of 2 (i.e., 2^k, k > 0) or -1.
- If group_size = -1:</p>
- <blockquote>
- <div><dl class="simple">
- <dt>If act_quant_mode = PER_TENSOR ro PER_BATCH:</dt><dd><p>No grouping along IC for both activation and weight</p>
- </dd>
- <dt>If act_quant_mode = PER_IC_BLOCK or PER_BATCH_IC_BLOCK:</dt><dd><p>No grouping along IC for weight. For activation,
- IC_BLOCK is determined automatically by IC.</p>
- </dd>
- </dl>
- </div></blockquote>
- <dl class="simple">
- <dt>If group_size > 0:</dt><dd><p>act_quant_mode can be any. If act_quant_mode is PER_IC_BLOCK(_SYM)
- or PER_BATCH_IC_BLOCK(_SYM), weight is grouped along IC by group_size.
- The IC_BLOCK for activation is determined by group_size automatically.
- Each group has its own quantization parameters.</p>
- </dd>
- </dl>
+ Must be a positive power of 2 (i.e., 2^k, k > 0) or -1. The rule is</p>
+ <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">If</span> <span class="n">group_size</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
+     <span class="n">If</span> <span class="n">act_quant_mode</span> <span class="o">=</span> <span class="n">PER_TENSOR</span> <span class="ow">or</span> <span class="n">PER_BATCH</span><span class="p">:</span>
+         <span class="n">No</span> <span class="n">grouping</span> <span class="n">along</span> <span class="n">IC</span> <span class="k">for</span> <span class="n">both</span> <span class="n">activation</span> <span class="ow">and</span> <span class="n">weight</span>
+     <span class="n">If</span> <span class="n">act_quant_mode</span> <span class="o">=</span> <span class="n">PER_IC_BLOCK</span> <span class="ow">or</span> <span class="n">PER_BATCH_IC_BLOCK</span><span class="p">:</span>
+         <span class="n">No</span> <span class="n">grouping</span> <span class="n">along</span> <span class="n">IC</span> <span class="k">for</span> <span class="n">weight</span><span class="o">.</span> <span class="n">For</span> <span class="n">activation</span><span class="p">,</span>
+         <span class="n">IC_BLOCK</span> <span class="ow">is</span> <span class="n">determined</span> <span class="n">automatically</span> <span class="n">by</span> <span class="n">IC</span><span class="o">.</span>
+ <span class="n">If</span> <span class="n">group_size</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
+     <span class="n">act_quant_mode</span> <span class="n">can</span> <span class="n">be</span> <span class="nb">any</span><span class="o">.</span> <span class="n">If</span> <span class="n">act_quant_mode</span> <span class="ow">is</span> <span class="n">PER_IC_BLOCK</span><span class="p">(</span><span class="n">_SYM</span><span class="p">)</span>
+     <span class="ow">or</span> <span class="n">PER_BATCH_IC_BLOCK</span><span class="p">(</span><span class="n">_SYM</span><span class="p">),</span> <span class="n">weight</span> <span class="ow">is</span> <span class="n">grouped</span> <span class="n">along</span> <span class="n">IC</span> <span class="n">by</span> <span class="n">group_size</span><span class="o">.</span>
+     <span class="n">The</span> <span class="n">IC_BLOCK</span> <span class="k">for</span> <span class="n">activation</span> <span class="ow">is</span> <span class="n">determined</span> <span class="n">by</span> <span class="n">group_size</span> <span class="n">automatically</span><span class="o">.</span>
+     <span class="n">Each</span> <span class="n">group</span> <span class="n">has</span> <span class="n">its</span> <span class="n">own</span> <span class="n">quantization</span> <span class="n">parameters</span><span class="o">.</span>
+ </pre></div>
+ </div>
</p></li>
<li><p><strong>weight_qscheme</strong> – <p>Specify how to quantize weight, asymmetrically or symmetrically. Generally,
asymmetric quantization has better accuracy than symmetric quantization at
the cost of performance. Symmetric quantization is faster but may have worse
accuracy. Default is undefined and determined by weight dtype: asymmetric in
most cases and symmetric if</p>
- <blockquote>
- <div><ol class="arabic simple">
+ <ol class="arabic simple">
<li><p>weight_dtype is NF4, or</p></li>
<li><p>weight_dtype is INT8 and lowp_mode is INT8.</p></li>
</ol>
- </div></blockquote>
<p>One must use WoqWeightQScheme.SYMMETRIC in the above two cases.</p>
</p></li>
</ul>
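<p>Not part of the diff itself, but for orientation: a minimal usage sketch of the API this hunk documents. It assumes the Woq* enums and the prepare/convert helpers are importable from intel_extension_for_pytorch.quantization, as the signature defaults above suggest; treat it as an illustration under those assumptions, not the library's canonical recipe.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>
# Hedged sketch: weight-only INT4 quantization of a toy linear model.
# Assumption (not confirmed by the diff): WoqWeightDtype, WoqLowpMode,
# prepare, and convert are all exposed by intel_extension_for_pytorch.quantization.
import torch
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import (
    WoqWeightDtype,
    WoqLowpMode,
    prepare,
    convert,
)

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).eval()
example_inputs = torch.randn(1, 4096)

# INT4 weights, bfloat16 as the lowest compute precision, and one set of
# quantization parameters per 128 input channels. Per the rule above,
# group_size must be -1 or a positive power of 2.
qconfig_mapping = ipex.quantization.get_weight_only_quant_qconfig_mapping(
    weight_dtype=WoqWeightDtype.INT4,
    lowp_mode=WoqLowpMode.BF16,
    group_size=128,
)

prepared = prepare(model, qconfig_mapping, example_inputs=example_inputs, inplace=False)
quantized = convert(prepared)

with torch.no_grad():
    out = quantized(example_inputs)
</pre></div></div>
<p>With IC = 4096 and group_size = 128 as above, each weight row is split into 4096 / 128 = 32 groups, each carrying its own scale (and zero point under asymmetric quantization), which is the granularity trade-off the group_size rule describes.</p>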
@@ -1781,4 +1777,4 @@ <h2>Graph Optimization<a class="headerlink" href="#graph-optimization" title="Li
</script>

</body>
- </html>
+ </html>