llama.cpp.git - llama.cpp

Age	Commit message	Author
2023-08-02	CUDA: faster non k-quant mul_mat_q kernels (#2483)	Johannes Gäßler

2023-08-02	CUDA: Fix models with output size != 32000 (#2480)	Johannes Gäßler

2023-07-31	CUDA: mmq CLI option, fixed mmq build issues (#2453)	Johannes Gäßler

2023-07-31	CUDA: Implemented row flattening for non-glm RoPE (#2468)	Johannes Gäßler

2023-07-31	CUDA: fewer memory bank conflicts for mul_mat_q (#2458)	Johannes Gäßler

2023-07-29	CUDA: Quantized matrix matrix multiplication (#2160)	Johannes Gäßler
	* mmq implementation for non k-quants
	* q6_K
	* q2_K
	* q3_k
	* q4_K
	* vdr
	* q5_K
	* faster q8_1 loading
	* loop unrolling
	* add __restrict__
	* q2_K sc_high
	* GGML_CUDA_MMQ_Y
	* Updated Makefile
	* Update Makefile
	* DMMV_F16 -> F16
	* Updated README, CMakeLists
	* Fix CMakeLists.txt
	* Fix CMakeLists.txt
	* Fix multi GPU out-of-bounds
2023-07-29	CUDA: faster multi GPU synchronization (#2448)	Johannes Gäßler

2023-07-25	Fix Q4_K and Q5_K for QK_K = 64 on CUDA (#2359)	Kawrakow
	* Fix Q4_K and Q5_K for QK_K = 64
	* Very slightly better Q5_K bit fiddling
	---------
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-24	make rms_norm_eps a parameter (#2374)	slaren
	* make rms_norm_eps a parameter
	* add rms_norm_eps to command line
	* fix baby llama, test-grad0
	* use scientific notation for eps param in the help
	  ggml-ci
2023-07-24	ggml : sync (unary ops refactor, static-correctness) (#2370)	Georgi Gerganov
	* ggml : sync (unary ops, tests)
	  ggml-ci
	* tests : remove unnecessary funcs
2023-07-24	Some more Q4_K and Q5_K speedup on CUDA (#2346)	Kawrakow
	* Faster Q5_K on CUDA
	* Small Q5_K improvement on older GPUs
	* Speed up Q4_K on CUDA
	  GTX1660: 29.5 ms/t -> 25.6 ms/t
	  RTX4080: 8.40 ms/t -> 8.25 ms/t
	* Speed up Q4_K on CUDA
	  GTX1660: 36.7 ms/t -> 35.6 ms/t
	  RTX4080: 9.8 ms/t -> 9.5 ms/t
	* Address PR comments
	* Add some comments to satisfy PR reviewer
	---------
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-23	ggml: move op parameters from tensors to ggml_tensor::op_params (#2333)	slaren
	* ggml: move op parameters from tensors to ggml_tensor::op_params
	* alibi: use memcpy for float params
	* remove `src[1] = NULL` in ops
2023-07-23	llama : grouped-query attention + LLaMAv2 70B support (#2276)	Georgi Gerganov
	* CUDA: GQA implementation
	* llama : support for GQA and LLaMAv2 70B
	  ggml-ci
	* py : fix hparams parsing (if-else blocks)
	  ggml-ci
	* py : oh boy ..
	  ggml-ci
	* help : fix gqa value for 70B
	  ggml-ci
	---------
	Co-authored-by: JohannesGaessler <johannesg@5d6.de>
2023-07-23	Speed up Q4_K (#2322)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-22	CUDA: Fixed 7b q3_K_S with mul_mat_vec_q (#2313)	Johannes Gäßler

2023-07-21	Custom RoPE + better memory management for CUDA (#2295)	Kawrakow
	* Custom RoPE + better memory management for CUDA
	* Adjusted look ahead in ggml_cuda_pool_malloc to 5%
	  This seems to be sufficient. We end up using about 200 MB less VRAM that way when running the 13B model with context 8192.
	---------
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-07-21	llama : make tensor_split ptr instead of array (#2272)	Georgi Gerganov

2023-07-17	Support dup & cont ops on CUDA (#2242)	Jiahao Li

2023-07-14	cuda : allocate all temporary ggml_tensor_extra_gpu from a fixed-size buffer (#2220)	Bach Le

2023-07-14	cuda : support broadcast add & mul (#2192)	Jiahao Li
	Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-14	CUDA: mul_mat_vec_q kernels for k-quants (#2203)	Johannes Gäßler

2023-07-14	ggml : sync (ggml_conv_2d, fix mul_mat bug, CUDA GLM rope)	Georgi Gerganov

2023-07-13	Fix compile error on Windows CUDA (#2207)	Howard Su

2023-07-12	cuda : add gelu support	Georgi Gerganov

2023-07-12	Fixed __dp4a compute capability: 6.0 -> 6.1 (#2189)	Johannes Gäßler

2023-07-12	ggml : revert CUDA broadcast changes from #2183 (#2191)	Georgi Gerganov

2023-07-11	ggml : sync (abort callback, mul / add broadcast, fix alibi) (#2183)	Georgi Gerganov

2023-07-11	ggml : remove src0 and src1 from ggml_tensor and rename opt to src (#2178)	Spencer Sutton
	* Add ggml changes
	* Update train-text-from-scratch for change
	* mpi : adapt to new ggml_tensor->src
	---------
	Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-08	Fixed OpenLLaMA 3b CUDA mul_mat_vec_q (#2144)	Johannes Gäßler

2023-07-08	CUDA: add __restrict__ to mul mat vec kernels (#2140)	Johannes Gäßler

2023-07-05	Quantized dot products for CUDA mul mat vec (#2067)	Johannes Gäßler

2023-07-03	Fix crash of test-tokenizer-0 under Debug build (#2064)	Howard Su
	* Fix crash of test-tokenizer-0 under Debug build
	* Change per comment
2023-07-01	Better CUDA synchronization logic (#2057)	Johannes Gäßler

2023-06-28	cuda : remove nchannels_x argument from mul_mat_vec_nc_f16_f32 (#2028)	Salvador E. Tropea
	- Not used
2023-06-28	cuda : fix missing const qualifier in casts (#2027)	Salvador E. Tropea

2023-06-28	CUDA GPU acceleration for LoRAs + f16 models (#1970)	Johannes Gäßler

2023-06-26	k-quants : support for super-block size of 64 (#2001)	Kawrakow
	* k_quants: WIP super-blocks with 64 weights
	* k_quants: WIP super-blocks with 64 weights
	  Q6_K scalar and AVX2 works
	* k_quants: WIP super-blocks with 64 weights
	  Q4_K scalar and AVX2 works
	* k_quants: WIP super-blocks with 64 weights
	  Q2_K scalar and AVX2 works. Q2_K is way too slow (it is actually slower than the scalar implementation)
	* k_quants: WIP super-blocks with 64 weights
	  Q3_K scalar and AVX2 works.
	* k_quants: WIP super-blocks with 64 weights
	  Q5_K scalar and AVX2 works, and with that all k_quants are done on AVX2 and scalar
	* k_quants: WIP super-blocks with 64 weights
	  Q6_K working on CUDA. Cannot make it run quite as fast as with super-blocks with 256 weights: 8% slower on 4080, 20% slower on the 1660 (but there we fit 1 less layer on the GPU because of the larger model size), so some fraction of these 20% is due to that.
	* k_quants: WIP super-blocks with 64 weights
	  Q4_K working on CUDA. ~10% slower on GTX-1660, 16% slower on 4080.
	* k_quants: WIP super-blocks with 64 weights
	  Q2_K working on CUDA. ~3% slower on GTX-1660, 10% slower on 4080.
	* k_quants: WIP super-blocks with 64 weights
	  Q3_K working on CUDA.
	* k_quants: WIP super-blocks with 64 weights
	  Q5_K working on CUDA, and with this CUDA is done.
	* k_quants: WIP super-blocks with 64 weights
	  Q6_K working on ARM_NEON
	* k_quants: WIP super-blocks with 64 weights
	  Q4_K working on ARM_NEON, but quite a bit slower than 256 weights
	* k_quants: WIP super-blocks with 64 weights
	  Q2_K working on ARM_NEON, but quite a bit slower than 256 weights
	* k_quants: WIP super-blocks with 64 weights
	  Q3_K working on ARM_NEON, but quite a bit slower than 256 weights.
	* k_quants: WIP super-blocks with 64 weights
	  Q5_K working on ARM_NEON, but quite a bit slower than 256 weights. With that, we have full support for ARM_NEON, although performance is not quite there.
	* k_quants: WIP super-blocks with 64 weights
	  Slightly more efficient Q3_K and Q5_K
	* k_quants: WIP super-blocks with 64 weights
	  Another small improvement for Q3_K and Q5_K on ARM_NEON
	* k_quants: WIP super-blocks with 64 weights
	  Yet another speedup for Q5_K on ARM_NEON. We are now within 10% of the QK_K = 256 version.
	* k_quants: WIP super-blocks with 64 weights
	  * We are able to pass preprocessor macros to the Metal compiler
	  * Q6_K works and is actually slightly more efficient than the QK_K = 256 version (25.2 ms vs 25.8 ms)
	* k_quants: WIP super-blocks with 64 weights
	  Q4_K works on Metal and is actually slightly faster than QK_K = 256 (21.95 ms vs 24.0 ms).
	* k_quants: WIP super-blocks with 64 weights
	  Q2_K works on Metal and is very slightly faster than QK_K = 256 (23.8 ms vs 24.2 ms).
	* k_quants: WIP super-blocks with 64 weights
	  Q3_K works on Metal and is slightly faster than QK_K = 256 (26.6 ms vs 28.3 ms).
	* k_quants: WIP super-blocks with 64 weights
	  Q5_K works on Metal and is slightly faster than QK_K = 256 (23.7 ms vs 26.3 ms).
	* k_quants: call them _K, not _k, also on Metal
	* k_quants: correctly define QK_K in llama.cpp
	* Fixed bug in q4_K quantization added with the 64-block addition
	* Simplify via lambda
	* k_quants: switch Q3_K to 4-bit scales when QK_K = 64
	  Otherwise there isn't much benefit from this quantization type. There is some very slight loss in accuracy, but we reduce size by ~7%. E.g., for OpenLLaMA-3B, Q3_K_S perplexity is 8.6131 with 8-bit scales and 8.6352 with 4-bit, while file size decreases from 1.53G to 1.44G.
	* k_quants: switch Q4_K to 4-bit scales when QK_K = 64
	  Here the loss in accuracy is greater than for Q3_K, but the Q4_K points still move further to the left on the perplexity vs size curve.
	* k_quants: forgot to add the Metal changes in last commit
	* k_quants: change Q5_K to be type 0 when QK_K = 64
	  Still needs AVX2 implementation
	* k_quants: AVX2 implementation for new 64-weight Q5_K
	* k_quants: 10% faster ARM_NEON Q5_K dot product
	* k_quants: fixed issue caused by merging with master
	---------
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-26	Fix assert when free invalid cuda pointer (#2005)	Howard Su
	Fix assert via initializing extra structure always. CUDA error 1 at C:\GPT\llama.cpp\ggml-cuda.cu:2536: invalid argument
2023-06-24	#1869 Fix null reference errors when training from scratch with CUDA (#1907)	Robyn
	* #1869 Fix null reference errors when training from scratch with CUDA build
	  Calling ggml_compute_forward when node->src0 was null was causing train-text-from-scratch.exe to terminate unexpectedly.
	* ggml : do not dereference src0 if NULL
	---------
	Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-19	cuda : faster k-quants on older GPUs (#1930)	Kawrakow
	* k_quants: hopefully much faster Q4_K on older GPUs
	  On the GTX-1660 that I have available to represent "old GPUs", token prediction drops from 65.5 ms/tok to 41.5 ms/tok!
	* k_quants: hopefully much faster Q3_K on older GPUs
	  On the GTX-1660 that I have available to represent "old GPUs", token prediction drops from 60.3 ms/tok to 41.0 ms/tok!
	* k_quants: faster Q2_K on older GPUs
	  It looks like I didn't need to change anything compared to what we already had, so this is just adding clarifying comments. But I now measure 36.3 ms/tok on the GTX-1660, instead of the 47.2 ms/tok that I have written in the faster k-quants PR.
	* k_quants: faster Q5_K on older GPUs
	  68.5 ms/tok -> 62.0 ms/tok on GTX-1660. For some reason the same access pattern that leads to such resounding success for Q2_K to Q4_K did not work at all for Q5_K. It is also more difficult to measure because for Q5_K_S we only have 32 layers on the GTX-1660, so output, tok embeddings and kv cache are done on the CPU.
	---------
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-19	Convert vector to f16 for dequantize mul mat vec (#1913)	Johannes Gäßler
	* Convert vector to f16 for dmmv
	* compile option
	* Added compilation option description to README
	* Changed cmake CUDA_ARCHITECTURES from "OFF" to "native"
2023-06-17	Only one CUDA stream per device for async compute (#1898)	Johannes Gäßler

2023-06-17	ggml : fix warnings under MSVC (#1908)	Howard Su

2023-06-16	CUDA : faster k-quant dot kernels (#1862)	Kawrakow
	* cuda : faster k-quant dot kernels
	* Improve Q2_K dot kernel on older GPUs
	  We now have a K_QUANTS_PER_ITERATION macro, which should be set to 1 on older and to 2 on newer GPUs. With this, we preserve the performance of the original PR on RTX-4080, and are faster compared to master on GTX-1660.
	* Improve Q6_K dot kernel on older GPUs
	  Using the same K_QUANTS_PER_ITERATION macro as last commit, we preserve performance on RTX-4080 and speed up Q6_K on a GTX-1660.
	* Add LLAMA_CUDA_KQUANTS_ITER to CMakeLists.txt and Makefile
	  Allowed values are 1 or 2. 2 gives the best performance on modern GPUs and is set as default. On older GPUs 1 may work better.
	* PR comments
	---------
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-15	Fixed CUDA runtime version check (#1879)	Johannes Gäßler

2023-06-15	Fix the validation of main device (#1872)	Howard Su

2023-06-14	CUDA full GPU acceleration, KV cache in VRAM (#1827)	Johannes Gäßler
	* Fixed CUDA RoPE
	* ggml_cuda_mul_mat_vec_p021
	* ggml_cuda_scale
	* ggml_cuda_diag_mask_inf
	* ggml_is_permuted
	* ggml_cuda_cpy
	* flatten rows for ggml_cuda_op
	* Added a --low-vram option
	* Fixed Windows performance
	* Fixed LLAMA_CUDA_DMMV_Y > 1 for WizardLM
2023-06-12	Leverage mmap for offloading tensors to GPU (#1597)	Howard Su
	* Rebase to latest
	* Show progress
	* Add assert to make sure we only allocate temp buffer for non-CPU backend tensor
	---------
	Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-06-11	Fixed WSL cuda's OOM error (#1594)	Kyle Liang
	* In the function , add the cuda error bypass.
	* remove excessive codes and prints
	---------
	Co-authored-by: liang <liangmanlai@126.com>
2023-06-09	Windows nvcc workaround (#1753)	Johannes Gäßler
	Fix gibberish output on Windows when using CUDA