metal : Q6_K implementation (#1752) - llama.cpp.git

author	Kawrakow <48489457+ikawrakow@users.noreply.github.com>	2023-06-08 19:46:22 +0300
committer	GitHub <noreply@github.com>	2023-06-08 19:46:22 +0300
commit	0f291e1f65c1d68201e71ce99c89562a36686b6d (patch)
tree	5325c9bd1f8954db8862d7021331c8b60840b631 /.dockerignore
parent	8fc8179919a11738910db07a800f2b176f8adf09 (diff)

metal : Q6_K implementation (#1752)

* Metal implementation for Q4_K

  Very slow for now: 42 ms / token, Q4_0 runs in 28 ms / token on my 30-core M2 Max GPU.

* Optimizing Q4_K on metal

  The first token always takes longer, I guess because the metal kernel is being jit-compiled. So, using n = 128 to measure time.

  At this point Q4_K takes 29.5 ms / token compared to 27.2 ms / token for Q4_0. Quite a bit better than the initial attempt, but still not good enough.

* Optimizing q4_K metal dot some more

  For n = 256 it is now 28.1 ms / token compared to 27 ms / token for q4_0.

* Fix after merge with master

* Metal implementation for Q6_K

  Similar to the CUDA implementation. No idea if this is the optimum for Metal, but the few alternative variants I tried all had a lower performance.

  We get 36.5 ms / token on M2 Max with 30 GPU cores. This corresponds to ~200 GB/second throughput.

* clang-tidy : add config back

* Much better Q6_K implementation for metal

  28.3 ms / token for 7B. Subtracting ~9 ms that is spent in other compute graph operations, we are left with ~19 ms for the matrix multiplications. The model is ~5.5 GB, so we are getting 1000 / 19 * 5.5 = 290 GB/s!

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
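
The throughput figures in the message above can be reproduced with a few lines of arithmetic. This is only a sketch, not code from the commit: the function name `effective_bandwidth_gb_s` is hypothetical, and it assumes that every model weight is read exactly once per generated token, so that bandwidth is model size divided by matrix-multiplication time. Applying the same ~9 ms of non-matmul overhead to the earlier 36.5 ms / token figure is also an assumption; the message does not say how the ~200 GB/s number was derived, although the result happens to match.

```python
def effective_bandwidth_gb_s(model_size_gb: float, matmul_ms: float) -> float:
    """Estimate memory throughput in GB/s (hypothetical helper, not from the commit).

    Assumes the whole model (model_size_gb) is streamed from memory once
    per token, and that matmul_ms is the time per token spent in the
    matrix multiplications only.
    """
    if matmul_ms <= 0:
        raise ValueError("matmul_ms must be positive")
    return model_size_gb / (matmul_ms / 1000.0)


MODEL_SIZE_GB = 5.5   # "~5.5 GB" for the 7B Q6_K model, from the message
OVERHEAD_MS = 9.0     # "~9 ms" spent in other compute graph operations

# Final implementation: the message rounds 28.3 - 9 down to ~19 ms.
final = effective_bandwidth_gb_s(MODEL_SIZE_GB, 19.0)

# First Q6_K attempt: 36.5 ms / token, with the same overhead assumed.
first = effective_bandwidth_gb_s(MODEL_SIZE_GB, 36.5 - OVERHEAD_MS)

print(f"final: {final:.1f} GB/s, first attempt: {first:.1f} GB/s")
```

With the rounded 19 ms the estimate is about 289.5 GB/s, which the message reports as 290 GB/s. Using the unrounded 28.3 - 9 = 19.3 ms would give a slightly lower figure, so the result is best read as an approximation.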

Diffstat (limited to '.dockerignore')

0 files changed, 0 insertions, 0 deletions

