llama.cpp.git - llama.cpp


author	Kawrakow <48489457+ikawrakow@users.noreply.github.com>	2023-06-19 18:14:09 +0300
committer	GitHub <noreply@github.com>	2023-06-19 18:14:09 +0300
commit	ca7c3f4da5d144d4cd1dd44903552e6ba49b8ec8 (patch)
tree	8a0fab78e1cb85d11e4c2c61f4be3e124a72ae5f /README.md
parent	b97ca431db35ec96a339a721acb1219c1dd78bed (diff)

cuda : faster k-quants on older GPUs (#1930)

* k_quants: hopefully much faster Q4_K on older GPUs

  On the GTX-1660 that I have available to represent "old GPUs", token prediction drops from 65.5 ms/tok to 41.5 ms/tok!

* k_quants: hopefully much faster Q3_K on older GPUs

  On the GTX-1660 that I have available to represent "old GPUs", token prediction drops from 60.3 ms/tok to 41.0 ms/tok!

* k_quants: faster Q2_K on older GPUs

  It looks like I didn't need to change anything compared to what we already had, so this is just adding clarifying comments. But I now measure 36.3 ms/tok on the GTX-1660, instead of the 47.2 ms/tok that I had written in the faster k-quants PR.

* k_quants: faster Q5_K on older GPUs

  68.5 ms/tok -> 62.0 ms/tok on GTX-1660. For some reason the same access pattern that leads to such resounding success for Q2_K to Q4_K did not work at all for Q5_K.

  It is also more difficult to measure because for Q5_K_S we only have 32 layers on the GTX-1660, so output, tok embeddings and kv cache are done on the CPU.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
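
As a reading aid for the timings quoted above, here is a minimal Python sketch that turns the ms/tok figures into speedup factors and tokens per second. It is not part of the commit; the names `timings_ms_per_token`, `speedup`, and `tokens_per_second` are illustrative, and the only data used are the GTX-1660 numbers from the message. Note that the Q2_K "before" value is the figure quoted from the earlier k-quants PR, not a fresh measurement from this change.

```python
# ms/token on a GTX-1660 as quoted in the commit message: (before, after).
timings_ms_per_token = {
    "Q2_K": (47.2, 36.3),  # "before" is the number from the earlier k-quants PR
    "Q3_K": (60.3, 41.0),
    "Q4_K": (65.5, 41.5),
    "Q5_K": (68.5, 62.0),
}


def speedup(before_ms: float, after_ms: float) -> float:
    """Ratio of old to new time per token (values above 1 mean faster)."""
    return before_ms / after_ms


def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to throughput."""
    return 1000.0 / ms_per_token


# Map each quantization type to (speedup factor, tokens/s after the change).
summary = {
    name: (speedup(before, after), tokens_per_second(after))
    for name, (before, after) in timings_ms_per_token.items()
}

for name, (factor, tps) in summary.items():
    print(f"{name}: {factor:.2f}x faster, {tps:.1f} tok/s")
```

Read this way, Q4_K and Q3_K gain roughly 1.5x, while Q5_K gains only about 1.1x, which matches the remark that the access pattern did not help Q5_K.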

Diffstat (limited to 'README.md')

0 files changed, 0 insertions, 0 deletions

