Age | Commit message (Collapse) | Author |
|
|
|
|
|
|
|
|
|
|
|
* mmq implementation for non k-quants
* q6_K
* q2_K
* q3_k
* q4_K
* vdr
* q5_K
* faster q8_1 loading
* loop unrolling
* add __restrict__
* q2_K sc_high
* GGML_CUDA_MMQ_Y
* Updated Makefile
* Update Makefile
* DMMV_F16 -> F16
* Updated README, CMakeLists
* Fix CMakeLists.txt
* Fix CMakeLists.txt
* Fix multi GPU out-of-bounds
|
|
|
|
* Fix Q4_K and Q5_K for QK_K = 64
* Very slightly better Q5_K bit fiddling
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* make rms_norm_eps a parameter
* add rms_norm_eps to command line
* fix baby llama, test-grad0
* use scientific notation for eps param in the help
ggml-ci
|
|
* ggml : sync (unary ops, tests)
ggml-ci
* tests : remove unnecessary funcs
|
|
* Faster Q5_K on CUDA
* Small Q5_K improvement on older GPUs
* Spped up Q4_K on CUDA
GTX1660: 29.5 ms/t -> 25.6 ms/t
RTX4080: 8.40 ms/t -> 8.25 ms/t
* Spped up Q4_K on CUDA
GTX1660: 36.7 ms/t -> 35.6 ms/t
RTX4080: 9.8 ms/t -> 9.5 ms/t
* Address PR comments
* Add some comments to satisfy PR reviewer
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* ggml: move op parameters from tensors to ggml_tensor::op_params
* alibi: use memcpy for float params
* remove `src[1] = NULL` in ops
|
|
* CUDA: GQA implementation
* llama : support for GQA and LLaMAv2 70B
ggml-ci
* py : fix hparams parsing (if-else blocks)
ggml-ci
* py : oh boy ..
ggml-ci
* help : fix gqa value for 70B
ggml-ci
---------
Co-authored-by: JohannesGaessler <johannesg@5d6.de>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
* Custom RoPE + bettter memory management for CUDA
* Adjusted look ahead in ggml_cuda_pool_malloc to 5%
This is sufficient it seems.
We end up using about 200 MB less VRAM that way when running
the 13B model with context 8192.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
|
|
(#2220)
|
|
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
* Add ggml changes
* Update train-text-from-scratch for change
* mpi : adapt to new ggml_tensor->src
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
|
|
|
|
* Fix crash of test-tokenizer-0 under Debug build
* Change per comment
|
|
|
|
- Not used
|
|
|
|
|
|
* k_quants: WIP super-blocks with 64 weights
* k_quants: WIP super-blocks with 64 weights
Q6_K scalar and AVX2 works
* k_quants: WIP super-blocks with 64 weights
Q4_K scalar and AVX2 works
* k_quants: WIP super-blocks with 64 weights
Q2_K scalar and AVX2 works. Q2_K is way too slow (it is actually slower
than the scalar implementation)
* k_quants: WIP super-blocks with 64 weights
Q3_K scalar and AVX2 works.
* k_quants: WIP super-blocks with 64 weights
Q5_K scalar and AVX2 works, and with that all
k_quants are done on AVX2 and scalar
* k_quants: WIP super-blocks with 64 weights
Q6_K working on CUDA. Cannot make it run quite as gast as
with super-blocks with 256 weigths: 8% slower on 4080,
20% slower on the 1660 (but there we fit 1 less layer on the
GPU because pf the larger model size), so some fraction of
these 20% is due to that,
* k_quants: WIP super-blocks with 64 weights
Q4_K working on CUDA. ~10% slower on GTX-1660,
16% slower on 4080.
* k_quants: WIP super-blocks with 64 weights
Q2_K working on CUDA. ~3% slower on GTX-1660,
10% slower on 4080.
* k_quants: WIP super-blocks with 64 weights
Q3_K working on CUDA.
* k_quants: WIP super-blocks with 64 weights
Q5_K working on CUDA, and with this CUDA is done.
* k_quants: WIP super-blocks with 64 weights
Q6_K working on ARM_NEON
* k_quants: WIP super-blocks with 64 weights
Q4_K working on ARM_NEON, but quite a bit slower than 256 weights
* k_quants: WIP super-blocks with 64 weights
Q2_K working on ARM_NEON, but quite a bit slower than 256 weights
* k_quants: WIP super-blocks with 64 weights
Q3_K working on ARM_NEON, but quite a bit slower than 256 weights.
* k_quants: WIP super-blocks with 64 weights
Q5_K working on ARM_NEON, but quite a bit slower than 256 weights.
With that, we have full support for ARM_NEON, although
performance is not quite there.
* k_quants: WIP super-blocks with 64 weights
Slightly more efficient Q3_K and Q5_K
* k_quants: WIP super-blocks with 64 weights
Another small improvement for Q3_K and Q5_K on ARM_NEON
* k_quants: WIP super-blocks with 64 weights
Yet another speedup for Q5_K on ARM_NEON.
We are now within 10% of the QK_K = 256 version.
* k_quants: WIP super-blocks with 64 weights
* We are able to pass preprocessor macros to the Metal
compiler
* Q6_K works and is actually slightly more efficient than
the QK_K = 256 version (25.2 ms vs 25.8 ms)
* k_quants: WIP super-blocks with 64 weights
Q4_K works on Metal and is actually slightly faster
than QK_K = 256 (21.95 ms vs 24.0 ms).
* k_quants: WIP super-blocks with 64 weights
Q2_K works on Metal and is very slightly faster
than QK_K = 256 (23.8 ms vs 24.2 ms).
* k_quants: WIP super-blocks with 64 weights
Q3_K works on Metal and is slightly faster
than QK_K = 256 (26.6 ms vs 28.3 ms).
* k_quants: WIP super-blocks with 64 weights
Q5_K works on Metal and is slightly faster
than QK_K = 256 (23.7 ms vs 26.3 ms).
* k_quants: call them _K, not _k, also on Metal
* k_quants: correctly define QK_K in llama.cpp
* Fixed bug in q4_K quantization added with the 64-block addition
* Simplify via lambda
* k_quants: swicth Q3_K to 4-bit scales when QK_K = 64
Otherwise there isn't much benefit from this
quantization type. There is some very slight loss
in accuracy, but we reduce size by ~7%.
E.g., for OpenLLaMA-3B, Q3_K_S perplexity is
8.6131 with 8-bit scales and 8.6352 with 4-bit,
while file size decreases from 1.53G to 1.44G.
* k_quants: switch Q4_K to 4-bit scales when QK_K = 64
Here the loss in accuracy is greater than for Q3_K,
but the Q4_K points still move further to the left on
the perplexity vs size curve.
* k_quants: forgot to add the Metal changes in last commit
* k_quants: change Q5_K to be type 0 when QK_K = 64
Still needs AVX2 implementation
* k_quants: AVX2 implementation for new 64-weight Q5_K
* k_quants: 10% faster ARM_NEON Q5_K dot product
* k_quants: fixed issue caused by merging with master
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Fix assert via initializing extra structure always.
CUDA error 1 at C:\GPT\llama.cpp\ggml-cuda.cu:2536: invalid argument
|
|
* #1869 Fix null reference errors when training from scratch with CUDA build
Calling ggml_compute_forward when node->src0 was null was causing train-text-from-scratch.exe to terminate unexpectedly.
* ggml : do not dereference src0 if NULL
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* k_quants: hopefully much faster Q4_K on older GPUs
On the GTX-1660 that I have available to represent
"old GPUs", token prediction drops from 65.5 ms/tok
to 41.5 ms/tok!
* k_quants: hopefully much faster Q3_K on older GPUs
On the GTX-1660 that I have available to represent
"old GPUs", token prediction drops from 60.3 ms/tok
to 41.0 ms/tok!
* k_quants: faster Q2_K on older GPUs
It looks like I didn't need to change anything
compared to what we already had, so this is just
adding clarifying comments. But I now measure
36.3 ms/tok on the GTX-1660, instead fo the
47.2 ms/tok that I have written in the faster
k-quants PR.
* k_quants: faster Q5_K on older GPUs
68.5 ms/tok -> 62.0 ms/tok on GTX-1660.
For some reason the same access pattern that leads
to such resounding success for Q2_K to Q4_K did not
work at all for Q5_K.
It is also more difficult to measure because for Q5_K_S
we only have 32 layers on the GTX-1660, so output, tok embeddings
and kv cache are done on the CPU.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Convert vector to f16 for dmmv
* compile option
* Added compilation option description to README
* Changed cmake CUDA_ARCHITECTURES from "OFF" to "native"
|
|
|
|
|
|
* cuda : faster k-quant dot kernels
* Imrove Q2_K dot kernel on older GPUs
We now have a K_QUANTS_PER_ITERATION macro, which should be
set to 1 on older and to 2 on newer GPUs.
With this, we preserve the performance of the original
PR on RTX-4080, and are faster compared to master on
GTX-1660.
* Imrove Q6_K dot kernel on older GPUs
Using the same K_QUANTS_PER_ITERATION macro as last commit,
we preserve performance on RTX-4080 and speed up
Q6_K on a GTX-1660.
* Add LLAMA_CUDA_KQUANTS_ITER to CMakeLists.txt and Makefile
Allowed values are 1 or 2. 2 gives the best performance on
modern GPUs and is set as default. On older GPUs 1 may work
better.
* PR comments
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
|
|
* Fixed CUDA RoPE
* ggml_cuda_mul_mat_vec_p021
* ggml_cuda_scale
* ggml_cuda_diag_mask_inf
* ggml_is_permuted
* ggml_cuda_cpy
* flatten rows for ggml_cuda_op
* Added a --low-vram option
* Fixed Windows performance
* Fixed LLAMA_CUDA_DMMV_Y > 1 for WizardLM
|
|
* Rebase to latest
* Show progress
* Add assert to make sure we only allocate temp buffer for non-CPU backend tensor
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
|
|
* In the function , add the cuda error bypass.
* remove excessive codes and prints
---------
Co-authored-by: liang <liangmanlai@126.com>
|
|
Fix gibberish output on Windows when using CUDA
|