CUDA: mmq CLI option, fixed mmq build issues (#2453)

author: Johannes Gäßler <johannesg@5d6.de> 2023-07-31 15:44:35 +0200
committer: GitHub <noreply@github.com> 2023-07-31 15:44:35 +0200
commit: 0728c5a8b9569183ffca0399caac099afef87595 (patch)
tree: 58915b38ddcc28bda0171925548d6b4d6fea2707 /README.md
parent: 1215ed7d5ccf854a55eccb52661427bb985e7f2c (diff)
1 files changed, 3 insertions, 1 deletions
diff --git a/README.md b/README.md
index 42fc42b..b231d24 100644
--- a/README.md
+++ b/README.md
@@ -400,9 +400,11 @@ Building the program with BLAS support may lead to some performance improvements
 
   The environment variable [`CUDA_VISIBLE_DEVICES`](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) can be used to specify which GPU(s) will be used. The following compilation options are also available to tweak performance:
 
+<!---
+  | LLAMA_CUDA_CUBLAS       | Boolean                |   false | Use cuBLAS instead of custom CUDA kernels for prompt processing. Faster for all quantization formats except for q4_0 and q8_0, especially for k-quants. Increases VRAM usage (700 MiB for 7b, 970 MiB for 13b, 1430 MiB for 33b). |
+--->
   | Option                  | Legal values           | Default | Description |
   |-------------------------|------------------------|---------|-------------|
-  | LLAMA_CUDA_CUBLAS       | Boolean                |   false | Use cuBLAS instead of custom CUDA kernels for prompt processing. Faster for all quantization formats except for q4_0 and q8_0, especially for k-quants. Increases VRAM usage (700 MiB for 7b, 970 MiB for 13b, 1430 MiB for 33b). |
   | LLAMA_CUDA_MMQ_Y        | Positive integer >= 32 |      64 | Tile size in y direction when using the custom CUDA kernels for prompt processing. Higher values can be faster depending on the amount of shared memory available. Power of 2 heavily recommended. |
   | LLAMA_CUDA_FORCE_DMMV   | Boolean                |   false | Force the use of dequantization + matrix vector multiplication kernels instead of using kernels that do matrix vector multiplication on quantized data. By default the decision is made based on compute capability (MMVQ for 6.1/Pascal/GTX 1000 or higher). Does not affect k-quants. |
   | LLAMA_CUDA_DMMV_X       | Positive integer >= 32 |      32 | Number of values in x direction processed by the CUDA dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants. |