llama.cpp.git - llama.cpp


author	Kawrakow <48489457+ikawrakow@users.noreply.github.com>	2023-04-21 17:18:26 +0200
committer	GitHub <noreply@github.com>	2023-04-21 18:18:26 +0300
commit	1bfc153e2f35ddd9d64b084e8d1a5e6fa57ad1c9 (patch)
tree	1ef11d7122696efe948430a2630a8e192a28c85b /build.zig
parent	3d59769c3bb7e72c915646ddb1e239b1face19f5 (diff)

ggml : a faster version for Q4_1 x Q8_0 dot products (#1083)

* A faster version for Q4_1 x Q8_0 dot products

The idea behind this is that Q8_0 quantized values get used many times in the matrix multiplications where they are involved. In the current implementations, when we evaluate the dot products, we need to compute the sum of the quants in the Q8_0 vector, so the same operation is repeated many times. Here we pre-compute the sum during Q8_0 quantization, store it in the now modified block_q8_0 struct, and then reuse this result in the subsequent dot products.

In a synthetic benchmark (just compute a bunch of dot products), this change speeds up the Q4_1 * Q8_0 dot product by 80%, making the performance identical to Q4_0 * Q8_0.

In practical application, I see a ~15% gain in speed for token prediction on M2, and a ~5% gain on Ryzen 7950X. The speed gain in the prompt evaluation is much bigger (around 50%).

I have only made the change for the scalar version, ARM_NEON, and AVX2, so we still need an AVX implementation.

* Cleaning up

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
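
The pre-computed sum described in the commit message can be illustrated with a small scalar sketch. It is not the code from this commit: the struct names (blk_q4_1, blk_q8_0), the field name s, the block size QK = 32, and the one-quant-per-byte layout are all assumptions made for readability. The underlying identity is that a Q4_1 value is d4*q + m and a Q8_0 value is d8*p, so the dot product over a block is d4*d8*sum(q*p) + m*d8*sum(p). The last factor, d8*sum(p), depends only on the Q8_0 block, so it can be computed once at quantization time and reused by every dot product that touches that block.

```c
#include <assert.h>
#include <stdint.h>

#define QK 32  /* values per block (assumed for this sketch) */

/* Hypothetical, simplified Q4_1 block: one 4-bit quant per byte for clarity. */
typedef struct {
    float   d;       /* scale */
    float   m;       /* minimum (offset) */
    uint8_t qs[QK];  /* quants in [0, 15] */
} blk_q4_1;

/* Hypothetical Q8_0 block with the extra pre-computed field s. */
typedef struct {
    float  d;        /* scale */
    float  s;        /* d * sum(qs), filled in during quantization */
    int8_t qs[QK];   /* quants in [-127, 127] */
} blk_q8_0;

/* Round to nearest without depending on libm. */
static int round_nearest(float v) {
    return (int)(v + (v >= 0.0f ? 0.5f : -0.5f));
}

/* Quantize QK floats to Q8_0 and store d * sum(qs) alongside the quants. */
static void quantize_q8_0(const float *x, blk_q8_0 *out) {
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        const float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }
    const float d  = amax / 127.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;

    int sum = 0;
    for (int i = 0; i < QK; i++) {
        const int q = round_nearest(x[i] * id);
        assert(q >= -127 && q <= 127);
        out->qs[i] = (int8_t)q;
        sum += q;
    }
    out->d = d;
    out->s = d * (float)sum;  /* reused by every later dot product */
}

/* Dot product that reuses the pre-computed b->s instead of re-summing b->qs. */
static float dot_q4_1_q8_0(const blk_q4_1 *a, const blk_q8_0 *b) {
    int sumi = 0;
    for (int i = 0; i < QK; i++) {
        sumi += (int)a->qs[i] * (int)b->qs[i];
    }
    return a->d * b->d * (float)sumi + a->m * b->s;
}

/* Reference: dequantize both sides and multiply term by term. */
static float dot_q4_1_q8_0_naive(const blk_q4_1 *a, const blk_q8_0 *b) {
    float acc = 0.0f;
    for (int i = 0; i < QK; i++) {
        const float xa = a->d * (float)a->qs[i] + a->m;
        const float xb = b->d * (float)b->qs[i];
        acc += xa * xb;
    }
    return acc;
}
```

Calling quantize_q8_0 once per block and then dot_q4_1_q8_0 for each Q4_1 block it is paired with should give the same result as dot_q4_1_q8_0_naive up to floating-point rounding, while keeping only the integer multiply-accumulate in the inner loop.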

Diffstat (limited to 'build.zig')

0 files changed, 0 insertions, 0 deletions

