Convert vector to f16 for dequantize mul mat vec (#1913)

* Convert vector to f16 for dmmv
* compile option
* Added compilation option description to README
* Changed cmake CUDA_ARCHITECTURES from "OFF" to "native"
author: Johannes Gäßler <johannesg@5d6.de> 2023-06-19 10:23:56 +0200
committer: GitHub <noreply@github.com> 2023-06-19 10:23:56 +0200
commit: 16b9cd193965769089881bb8ec012fccca7b37b6 (patch)
tree: 2ee329793e782f253966fd81f89ea05f5a1a2495 /README.md
parent: b24c3049d96557c24782e4d32feaae65f47277af (diff)
1 files changed, 8 insertions, 1 deletions
diff --git a/README.md b/README.md
index e5b3f59..2d05de3 100644
--- a/README.md
+++ b/README.md
@@ -337,7 +337,14 @@ Building the program with BLAS support may lead to some performance improvements
     cmake --build . --config Release
     ```
 
-  The environment variable [`CUDA_VISIBLE_DEVICES`](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) can be used to specify which GPU(s) will be used.
+  The environment variable [`CUDA_VISIBLE_DEVICES`](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) can be used to specify which GPU(s) will be used. The following compilation options are also available to tweak performance:
+
+  | Option                  | Legal values           | Default | Description |
+  |-------------------------|------------------------|---------|-------------|
+  | LLAMA_CUDA_DMMV_X       | Positive integer >= 32 |      32 | Number of values in x direction processed by the CUDA dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants. |
+  | LLAMA_CUDA_DMMV_Y       | Positive integer       |       1 | Block size in y direction for the CUDA dequantization + mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. Does not affect k-quants. |
+  | LLAMA_CUDA_DMMV_F16     | Boolean                |   false | If enabled, use half-precision floating point arithmetic for the CUDA dequantization + mul mat vec kernels. Can improve performance on relatively recent GPUs. |
+  | LLAMA_CUDA_KQUANTS_ITER | 1 or 2                 |       2 | Number of values processed per iteration and per CUDA thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs. |
 
 - #### CLBlast
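
The legal values in the option table added by this patch can be checked mechanically before starting a build. The following is a minimal sketch, not part of the commit: `cuda_tuning_flags` is a hypothetical helper, and passing the options as CMake `-D` definitions alongside `-DLLAMA_CUBLAS=ON` is an assumption based on the commit message and the surrounding README build instructions.

```python
def cuda_tuning_flags(dmmv_x=32, dmmv_y=1, dmmv_f16=False, kquants_iter=2):
    """Validate the four tuning options against the table's legal values
    and return them as CMake -D definitions (hypothetical helper)."""
    # LLAMA_CUDA_DMMV_X: positive integer >= 32 (power of 2 heavily recommended)
    if isinstance(dmmv_x, bool) or not isinstance(dmmv_x, int) or dmmv_x < 32:
        raise ValueError("LLAMA_CUDA_DMMV_X must be an integer >= 32")
    # LLAMA_CUDA_DMMV_Y: positive integer (power of 2 recommended)
    if isinstance(dmmv_y, bool) or not isinstance(dmmv_y, int) or dmmv_y < 1:
        raise ValueError("LLAMA_CUDA_DMMV_Y must be a positive integer")
    # LLAMA_CUDA_KQUANTS_ITER: only 1 or 2 are legal
    if kquants_iter not in (1, 2):
        raise ValueError("LLAMA_CUDA_KQUANTS_ITER must be 1 or 2")
    return [
        f"-DLLAMA_CUDA_DMMV_X={dmmv_x}",
        f"-DLLAMA_CUDA_DMMV_Y={dmmv_y}",
        # Assumption: the boolean option is spelled ON/OFF on the CMake command line.
        f"-DLLAMA_CUDA_DMMV_F16={'ON' if dmmv_f16 else 'OFF'}",
        f"-DLLAMA_CUDA_KQUANTS_ITER={kquants_iter}",
    ]


def is_power_of_two(n):
    """True if n is a positive power of two, as the table recommends for DMMV_X/Y."""
    return n > 0 and (n & (n - 1)) == 0


if __name__ == "__main__":
    # Build a configure command for a faster GPU: wider x, taller y blocks.
    flags = cuda_tuning_flags(dmmv_x=64, dmmv_y=2)
    print(" ".join(["cmake", "..", "-DLLAMA_CUBLAS=ON", *flags]))
```

Calling `cuda_tuning_flags()` with no arguments reproduces the defaults listed in the table, so the helper can also serve as a quick reference for what an unmodified build would use.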