author | Kawrakow <48489457+ikawrakow@users.noreply.github.com> | 2023-06-19 18:14:09 +0300 |
---|---|---|
committer | GitHub <noreply@github.com> | 2023-06-19 18:14:09 +0300 |
commit | ca7c3f4da5d144d4cd1dd44903552e6ba49b8ec8 (patch) | |
tree | 8a0fab78e1cb85d11e4c2c61f4be3e124a72ae5f /.devops | |
parent | b97ca431db35ec96a339a721acb1219c1dd78bed (diff) |
cuda : faster k-quants on older GPUs (#1930)
* k_quants: hopefully much faster Q4_K on older GPUs
On the GTX-1660 that I have available to represent
"old GPUs", token prediction drops from 65.5 ms/tok
to 41.5 ms/tok!
* k_quants: hopefully much faster Q3_K on older GPUs
On the GTX-1660 that I have available to represent
"old GPUs", token prediction drops from 60.3 ms/tok
to 41.0 ms/tok!
* k_quants: faster Q2_K on older GPUs
It looks like I didn't need to change anything
compared to what we already had, so this is just
adding clarifying comments. But I now measure
36.3 ms/tok on the GTX-1660, instead of the
47.2 ms/tok that I have written in the faster
k-quants PR.
* k_quants: faster Q5_K on older GPUs
68.5 ms/tok -> 62.0 ms/tok on GTX-1660.
For some reason the same access pattern that leads
to such resounding success for Q2_K to Q4_K did not
work at all for Q5_K.
It is also more difficult to measure because for Q5_K_S
we only have 32 layers on the GTX-1660, so the output layer,
token embeddings and KV cache are handled on the CPU.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to '.devops')
0 files changed, 0 insertions, 0 deletions
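The per-token timings quoted in the commit message can be turned into speedup factors and tokens-per-second figures with a little arithmetic. The following is a minimal sketch, not part of the commit: the names `timings_ms`, `speedup` and `tokens_per_second` are made up for illustration, and the numbers are copied from the message above. Note that the Q2_K "before" value is the one reported in the earlier faster k-quants PR, so that row may not be a like-for-like comparison.

```python
# Per-token latencies (ms/tok) on the GTX-1660, copied from the commit message.
# Each entry is (before, after). The Q2_K "before" value comes from the earlier
# k-quants PR rather than a fresh measurement, so treat that row with caution.
timings_ms = {
    "Q4_K": (65.5, 41.5),
    "Q3_K": (60.3, 41.0),
    "Q2_K": (47.2, 36.3),
    "Q5_K": (68.5, 62.0),
}


def speedup(before_ms, after_ms):
    """Return how many times faster the new timing is than the old one."""
    return before_ms / after_ms


def tokens_per_second(ms_per_token):
    """Convert a latency in ms/tok into a throughput in tok/s."""
    return 1000.0 / ms_per_token


for name, (before, after) in timings_ms.items():
    print(
        f"{name}: {speedup(before, after):.2f}x faster, "
        f"{tokens_per_second(before):.1f} -> {tokens_per_second(after):.1f} tok/s"
    )
```

Read this way, Q4_K shows the largest gain (about 1.58x) and Q5_K the smallest (about 1.10x), which matches the message's remark that the new access pattern did not help Q5_K.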