llama.cpp.git - llama.cpp


author	Kawrakow <48489457+ikawrakow@users.noreply.github.com>	2023-06-16 20:08:44 +0300
committer	GitHub <noreply@github.com>	2023-06-16 20:08:44 +0300
commit	3d0112261042b356621e93db3fa4c6798a5d098f (patch)
tree	3634baa70ed23142f86c5a44701bbf4b0971c2fd /ggml-cuda.h
parent	602c748863e15270d80d74aa2c3bf86ab8139e07 (diff)

CUDA : faster k-quant dot kernels (#1862)

* cuda : faster k-quant dot kernels

* Improve Q2_K dot kernel on older GPUs

  We now have a K_QUANTS_PER_ITERATION macro, which should be set to 1 on older GPUs and to 2 on newer GPUs. With this, we preserve the performance of the original PR on RTX-4080 and are faster than master on GTX-1660.

* Improve Q6_K dot kernel on older GPUs

  Using the same K_QUANTS_PER_ITERATION macro as in the last commit, we preserve performance on RTX-4080 and speed up Q6_K on a GTX-1660.

* Add LLAMA_CUDA_KQUANTS_ITER to CMakeLists.txt and Makefile

  Allowed values are 1 or 2. 2 gives the best performance on modern GPUs and is set as the default. On older GPUs 1 may work better.

* PR comments

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
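
The effect of a compile-time K_QUANTS_PER_ITERATION setting can be pictured with a small host-side sketch. This is not the kernel code from this commit: the function name `dot_k_per_iter` and the plain `float` inputs are hypothetical, and the way the build option LLAMA_CUDA_KQUANTS_ITER is forwarded to the macro is assumed here rather than taken from the diff. Only the macro name, the allowed values (1 or 2) and the default (2) come from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Default of 2 as stated in the commit message; a build system could
// override it, e.g. with -DK_QUANTS_PER_ITERATION=1 (assumed mechanism).
#ifndef K_QUANTS_PER_ITERATION
#define K_QUANTS_PER_ITERATION 2
#endif

// The commit message allows only 1 or 2, so reject anything else at compile time.
static_assert(K_QUANTS_PER_ITERATION == 1 || K_QUANTS_PER_ITERATION == 2,
              "K_QUANTS_PER_ITERATION must be 1 or 2");

// Hypothetical CPU analogue: each outer iteration handles
// K_QUANTS_PER_ITERATION element pairs, mimicking a kernel that does
// more work per loop step when the macro is set to 2.
float dot_k_per_iter(const std::vector<float>& x, const std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    float sum = 0.0f;
    std::size_t i = 0;
    for (; i + K_QUANTS_PER_ITERATION <= n; i += K_QUANTS_PER_ITERATION) {
        for (std::size_t k = 0; k < K_QUANTS_PER_ITERATION; ++k) {
            sum += x[i + k] * y[i + k];
        }
    }
    // Tail: leftover element when n is not a multiple of the step.
    for (; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

The result is the same for either setting; only the amount of work per loop step changes, which is the kind of trade-off the commit message describes between older and newer GPUs.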

Diffstat (limited to 'ggml-cuda.h')

0 files changed, 0 insertions, 0 deletions

