Quantized dot products for CUDA mul mat vec (#2067)

author: Johannes Gäßler <johannesg@5d6.de> 2023-07-05 14:19:42 +0200
committer: GitHub <noreply@github.com> 2023-07-05 14:19:42 +0200
commit: 924dd22fd3ba93e097f8d19ba5cda919ca2fe2fb (patch)
tree: ca169c258f2d00f7e31c8b743a9f1206280b4d6b /README.md
parent: 051c70dcd55709c9cbbfa849af035951fe720433 (diff)
1 files changed, 2 insertions, 1 deletions
diff --git a/README.md b/README.md
index 6c2bb39..32f17c2 100644
--- a/README.md
+++ b/README.md
@@ -345,8 +345,9 @@ Building the program with BLAS support may lead to some performance improvements
 
   | Option                  | Legal values           | Default | Description |
   |-------------------------|------------------------|---------|-------------|
+  | LLAMA_CUDA_FORCE_DMMV   | Boolean                |   false | Force the use of dequantization + matrix vector multiplication kernels instead of using kernels that do matrix vector multiplication on quantized data. By default the decision is made based on compute capability (MMVQ for 7.0/Turing/RTX 2000 or higher). Does not affect k-quants. |
   | LLAMA_CUDA_DMMV_X       | Positive integer >= 32 |      32 | Number of values in x direction processed by the CUDA dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants. |
-  | LLAMA_CUDA_DMMV_Y       | Positive integer       |       1 | Block size in y direction for the CUDA dequantization + mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. Does not affect k-quants. |
+  | LLAMA_CUDA_MMV_Y       | Positive integer       |       1 | Block size in y direction for the CUDA mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. Does not affect k-quants. |
   | LLAMA_CUDA_DMMV_F16     | Boolean                |   false | If enabled, use half-precision floating point arithmetic for the CUDA dequantization + mul mat vec kernels. Can improve performance on relatively recent GPUs. |
   | LLAMA_CUDA_KQUANTS_ITER | 1 or 2                 |       2 | Number of values processed per iteration and per CUDA thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs. |
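
Note on the two kernel families named in the table: the difference between "dequantization + matrix vector multiplication" and "matrix vector multiplication on quantized data" can be illustrated with a small plain-Python sketch. This is not the CUDA code from this commit. The function names are hypothetical, and the symmetric 8-bit block scheme is an assumption chosen for brevity; the actual ggml quantization formats differ in layout and detail.

```python
def quantize_q8(values):
    """Symmetric 8-bit quantization of one block (illustrative scheme only).

    Returns (scale, quants) so that values[i] is approximately scale * quants[i].
    """
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    return scale, [round(v / scale) for v in values]


def dot_dequantize_first(scale_x, qx, y):
    """DMMV-style path: turn each quantized weight back into a float,
    then do a floating-point multiply-add against the float vector y."""
    return sum((scale_x * q) * v for q, v in zip(qx, y))


def dot_quantized(scale_x, qx, scale_y, qy):
    """MMVQ-style path: integer multiply-adds on the quantized values,
    with the two scales applied once per block at the end."""
    int_sum = sum(a * b for a, b in zip(qx, qy))
    return scale_x * scale_y * int_sum


x = [0.5, -1.0, 0.25, 2.0]   # one block of "weights"
y = [1.0, 2.0, -4.0, 0.5]    # one block of the input vector

sx, qx = quantize_q8(x)
sy, qy = quantize_q8(y)

exact = sum(a * b for a, b in zip(x, y))
via_dmmv = dot_dequantize_first(sx, qx, y)
via_mmvq = dot_quantized(sx, qx, sy, qy)
print(exact, via_dmmv, via_mmvq)
```

Both paths approximate the exact dot product. The second also quantizes the input vector, so it trades a little extra rounding error for an inner loop that uses only integer arithmetic.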