llama.cpp.git - llama.cpp

author	Shouzheng Liu <lshzh.hi@gmail.com>	2023-07-20 06:32:22 -0400
committer	GitHub <noreply@github.com>	2023-07-20 13:32:22 +0300
commit	417a85a0010519224cf154eb85d383ffeafeeead (patch)
tree	eb9b9668426c7318e2ab1389f04118e126752a8e /ggml-opencl.h
parent	294f424554c1599784ac9962462fc39ace92d8a5 (diff)

metal: minor q4 optimization and reduce code size (#2248)

* metal: use uint16_t instead of uint8_t.

  Apple GPUs don't like uint8_t. For every operation on a uint8_t, the GPU needs to copy the value into an empty 16-bit register before it can issue other instructions. For the matrix-vector multiplication kernel alone, we observed a memory read speed of 340-350 GB/s on M1 Max after this commit, which is very close to the reported hardware limit.

* metal: update rms_norm kernel

  This commit doubles the speed of rms_norm operations by using 512 threads per threadgroup, combined with SIMD primitives to minimize the need for threadgroup barriers.

* metal: use template to reduce size

  Revert modifications on block_q4_0 and block_q4_1.
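The uint16_t change can be illustrated with a small CPU-side sketch. This is not the Metal kernel from the commit. It is a plain C++ illustration, under stated assumptions, of how two packed bytes can be handled as a single 16-bit word so that four 4-bit values are extracted with masks on one 16-bit register. The function names `unpack_bytewise` and `unpack_wordwise` are hypothetical, and the layout (low nibbles fill the first half of the block, high nibbles the second half) is assumed here for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumed for illustration: 32 four-bit values packed into 16 bytes.
constexpr std::size_t kPackedBytes = 16;
constexpr std::size_t kValues = 2 * kPackedBytes;

using Packed = std::array<std::uint8_t, kPackedBytes>;
using Unpacked = std::array<std::uint8_t, kValues>;

// Reference version: one 8-bit load and two nibble extractions per byte.
Unpacked unpack_bytewise(const Packed& qs) {
    Unpacked out{};
    for (std::size_t i = 0; i < kPackedBytes; ++i) {
        out[i] = static_cast<std::uint8_t>(qs[i] & 0x0F);                // low nibble
        out[i + kPackedBytes] = static_cast<std::uint8_t>(qs[i] >> 4);  // high nibble
    }
    return out;
}

// 16-bit version: combine two adjacent bytes into one word, then mask.
// The word is built explicitly so the result does not depend on host endianness.
Unpacked unpack_wordwise(const Packed& qs) {
    Unpacked out{};
    for (std::size_t i = 0; i < kPackedBytes / 2; ++i) {
        const std::uint16_t w = static_cast<std::uint16_t>(
            qs[2 * i] | (qs[2 * i + 1] << 8));
        out[2 * i]      = static_cast<std::uint8_t>(w & 0x000F);          // byte 0, low
        out[2 * i + 1]  = static_cast<std::uint8_t>((w & 0x0F00) >> 8);   // byte 1, low
        out[2 * i + 16] = static_cast<std::uint8_t>((w & 0x00F0) >> 4);   // byte 0, high
        out[2 * i + 17] = static_cast<std::uint8_t>((w & 0xF000) >> 12);  // byte 1, high
    }
    return out;
}
```

Both functions produce the same 32 values. The second needs half as many loads and never operates on an 8-bit quantity after the word is formed, which is the property the commit message describes as beneficial on Apple GPUs.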

Diffstat (limited to 'ggml-opencl.h')

0 files changed, 0 insertions, 0 deletions

