llama.cpp.git - llama.cpp

Age	Commit message	Author
2023-06-26	tests : fix quantize perf (#1990)	katsu560
	* fix test quantize perf * avoid the global state
2023-06-26	k-quants : add AVX support to dot functions (#1916)	katsu560
	* k_quants : add AVX support * k_quants : apply review comments
2023-06-26	readme : add link to new k-quants for visibility	Georgi Gerganov

2023-06-26	k-quants : support for super-block size of 64 (#2001)	Kawrakow
	* k_quants: WIP super-blocks with 64 weights * k_quants: WIP super-blocks with 64 weights Q6_K scalar and AVX2 works * k_quants: WIP super-blocks with 64 weights Q4_K scalar and AVX2 works * k_quants: WIP super-blocks with 64 weights Q2_K scalar and AVX2 works. Q2_K is way too slow (it is actually slower than the scalar implementation) * k_quants: WIP super-blocks with 64 weights Q3_K scalar and AVX2 works. * k_quants: WIP super-blocks with 64 weights Q5_K scalar and AVX2 works, and with that all k_quants are done on AVX2 and scalar * k_quants: WIP super-blocks with 64 weights Q6_K working on CUDA. Cannot make it run quite as fast as with super-blocks with 256 weights: 8% slower on 4080, 20% slower on the 1660 (but there we fit 1 less layer on the GPU because of the larger model size), so some fraction of these 20% is due to that. * k_quants: WIP super-blocks with 64 weights Q4_K working on CUDA. ~10% slower on GTX-1660, 16% slower on 4080. * k_quants: WIP super-blocks with 64 weights Q2_K working on CUDA. ~3% slower on GTX-1660, 10% slower on 4080. * k_quants: WIP super-blocks with 64 weights Q3_K working on CUDA. * k_quants: WIP super-blocks with 64 weights Q5_K working on CUDA, and with this CUDA is done. * k_quants: WIP super-blocks with 64 weights Q6_K working on ARM_NEON * k_quants: WIP super-blocks with 64 weights Q4_K working on ARM_NEON, but quite a bit slower than 256 weights * k_quants: WIP super-blocks with 64 weights Q2_K working on ARM_NEON, but quite a bit slower than 256 weights * k_quants: WIP super-blocks with 64 weights Q3_K working on ARM_NEON, but quite a bit slower than 256 weights. * k_quants: WIP super-blocks with 64 weights Q5_K working on ARM_NEON, but quite a bit slower than 256 weights. With that, we have full support for ARM_NEON, although performance is not quite there. 
* k_quants: WIP super-blocks with 64 weights Slightly more efficient Q3_K and Q5_K * k_quants: WIP super-blocks with 64 weights Another small improvement for Q3_K and Q5_K on ARM_NEON * k_quants: WIP super-blocks with 64 weights Yet another speedup for Q5_K on ARM_NEON. We are now within 10% of the QK_K = 256 version. * k_quants: WIP super-blocks with 64 weights * We are able to pass preprocessor macros to the Metal compiler * Q6_K works and is actually slightly more efficient than the QK_K = 256 version (25.2 ms vs 25.8 ms) * k_quants: WIP super-blocks with 64 weights Q4_K works on Metal and is actually slightly faster than QK_K = 256 (21.95 ms vs 24.0 ms). * k_quants: WIP super-blocks with 64 weights Q2_K works on Metal and is very slightly faster than QK_K = 256 (23.8 ms vs 24.2 ms). * k_quants: WIP super-blocks with 64 weights Q3_K works on Metal and is slightly faster than QK_K = 256 (26.6 ms vs 28.3 ms). * k_quants: WIP super-blocks with 64 weights Q5_K works on Metal and is slightly faster than QK_K = 256 (23.7 ms vs 26.3 ms). * k_quants: call them _K, not _k, also on Metal * k_quants: correctly define QK_K in llama.cpp * Fixed bug in q4_K quantization added with the 64-block addition * Simplify via lambda * k_quants: switch Q3_K to 4-bit scales when QK_K = 64 Otherwise there isn't much benefit from this quantization type. There is some very slight loss in accuracy, but we reduce size by ~7%. E.g., for OpenLLaMA-3B, Q3_K_S perplexity is 8.6131 with 8-bit scales and 8.6352 with 4-bit, while file size decreases from 1.53G to 1.44G. * k_quants: switch Q4_K to 4-bit scales when QK_K = 64 Here the loss in accuracy is greater than for Q3_K, but the Q4_K points still move further to the left on the perplexity vs size curve. 
* k_quants: forgot to add the Metal changes in last commit * k_quants: change Q5_K to be type 0 when QK_K = 64 Still needs AVX2 implementation * k_quants: AVX2 implementation for new 64-weight Q5_K * k_quants: 10% faster ARM_NEON Q5_K dot product * k_quants: fixed issue caused by merging with master --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-26	Fix assert when free invalid cuda pointer (#2005)	Howard Su
	Fix assert via initializing extra structure always. CUDA error 1 at C:\GPT\llama.cpp\ggml-cuda.cu:2536: invalid argument
2023-06-25	readme : add new roadmap + manifesto	Georgi Gerganov

2023-06-25	ggml : sync latest ggml (custom operators)	Georgi Gerganov

2023-06-25	fix server sampling: top k sampler first (#1977)	anon998
	Co-authored-by: anon <anon@example.org>
2023-06-25	readme : add Azure CI discussion link	Georgi Gerganov

2023-06-25	zig : upgrade build system support (#1981)	sjinzh
	* upgrade zig build system support * zig : add new line at the end of the file --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-24	#1869 Fix null reference errors when training from scratch with CUDA (#1907)	Robyn
	* #1869 Fix null reference errors when training from scratch with CUDA build Calling ggml_compute_forward when node->src0 was null was causing train-text-from-scratch.exe to terminate unexpectedly. * ggml : do not dereference src0 if NULL --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-24	tests : sync test-grad0 from ggml	Georgi Gerganov

2023-06-24	flake : fix ggml-metal.metal path and run nixfmt (#1974)	Rowan Hart

2023-06-24	convert : fix invalid params in write_vocab_only (#1975)	AN Long

2023-06-24	ggml : improve ggml_graph_dump_dot, add ggml_format_name (#1978)	slaren
	* Improve ggml_graph_dump_dot, add ggml_format_name * add more automatic names to view ops * fix name of copies
2023-06-24	readme : fix whitespaces	Georgi Gerganov

2023-06-24	readme : fixed termux instructions (#1973)	Alberto

2023-06-24	llama : fix top-p sampling to match the canonical definition (#1953)	Alex Renda
	* Fix top-p sampling to match the standard definition (smallest set that has probability mass at least p, not largest set with probability mass less than p) * top-p: correct gt to gte * add test for correct top-p behavior
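	The distinction drawn in the entry above (smallest set with probability mass at least p, using "greater than or equal" rather than "greater than") can be illustrated with a minimal sketch. The function name top_p_keep_count is hypothetical and is not part of the llama.cpp API; the sketch assumes the input is an already normalized probability distribution and only reports how many of the most probable tokens would be kept.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper for illustration only (not the llama.cpp API).
// Returns how many of the highest-probability tokens top-p sampling keeps,
// i.e. the size of the smallest prefix whose cumulative mass is at least p.
// Assumes `probs` is a normalized distribution.
std::size_t top_p_keep_count(std::vector<float> probs, float p) {
    // Visit tokens from most to least probable.
    std::sort(probs.begin(), probs.end(), std::greater<float>());

    float cumulative = 0.0f;
    for (std::size_t i = 0; i < probs.size(); ++i) {
        cumulative += probs[i];
        // ">=" rather than ">": a prefix whose mass equals p exactly suffices.
        if (cumulative >= p) {
            return i + 1;
        }
    }
    // p exceeds the total mass (or the input is empty): keep everything.
    return probs.size();
}
```

	With probabilities {0.5, 0.25, 0.25} and p = 0.6, this keeps two tokens, whereas the "largest set with mass less than p" reading would keep only one.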
2023-06-24	llama : make model stateless and context stateful (llama_state) (#1797)	Didzis Gosko
	* llama : make model stateless and context stateful * llama : minor cleanup * llama : update internal API declaration * Apply suggestions from code review fix style Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Missing model memory release * Fix style * Add deprecated warning for public API function llama_init_from_file * Update public API use cases: move away from deprecated llama_init_from_file * Deprecate public API function llama_apply_lora_from_file --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-23	Add OpenLLaMA instructions to the README (#1954)	eiery
	* add openllama to readme
2023-06-22	rework convert.py to read hyper-parameters from config.json (#1958)	Erik Scholz
	* Read hyper-parameters from the HuggingFace transformers config.json if they exist; otherwise fall back to guessing, as before. This allows converting open_llama 3B and other non-standard model designs.
2023-06-21	cmake: revert CUDA arch default to 52, 61 if f16 (#1959)	Johannes Gäßler

2023-06-21	Fix typo in README.md (#1961)	Rahul Vivek Nair

2023-06-20	readme : add link to p1	Georgi Gerganov

2023-06-20	Fix typo (#1949)	Xiake Sun

2023-06-20	llama : fix params struct alignment (#1936)	Ettore Di Giacinto
	* Workaround struct misalignment during value-copy Signed-off-by: mudler <mudler@localai.io> * Move booleans at the bottom of the structure Signed-off-by: mudler <mudler@localai.io> * Add comment Signed-off-by: mudler <mudler@localai.io> --------- Signed-off-by: mudler <mudler@localai.io>
2023-06-20	[Fix] Reenable server embedding endpoint (#1937)	Henri Vasserman
	* Add back embedding feature * Update README
2023-06-19	ggml : fix bug in LBFGS optimizer (found by ggml tests)	Georgi Gerganov

2023-06-19	llama : use aligned memory during ggml_init call from loading saved sessions (#1934)	l3utterfly
	* fixed issue: memory is not guaranteed to be aligned properly during ggml_init call from loading saved sessions * - removed commented out old code from fix - updated another instance of same issue below original
2023-06-19	cmake : fix trailing whitespaces	Georgi Gerganov

2023-06-19	llama : only use Q6_K for output weights if tensor size is multiple of 256 (#1932)	Kawrakow
	* Only use Q6_K for output weights if tensor size is multiple of 256 * Fixed copy/paste mistake --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-19	cuda : faster k-quants on older GPUs (#1930)	Kawrakow
	* k_quants: hopefully much faster Q4_K on older GPUs On the GTX-1660 that I have available to represent "old GPUs", token prediction drops from 65.5 ms/tok to 41.5 ms/tok! * k_quants: hopefully much faster Q3_K on older GPUs On the GTX-1660 that I have available to represent "old GPUs", token prediction drops from 60.3 ms/tok to 41.0 ms/tok! * k_quants: faster Q2_K on older GPUs It looks like I didn't need to change anything compared to what we already had, so this is just adding clarifying comments. But I now measure 36.3 ms/tok on the GTX-1660, instead of the 47.2 ms/tok that I have written in the faster k-quants PR. * k_quants: faster Q5_K on older GPUs 68.5 ms/tok -> 62.0 ms/tok on GTX-1660. For some reason the same access pattern that leads to such resounding success for Q2_K to Q4_K did not work at all for Q5_K. It is also more difficult to measure because for Q5_K_S we only have 32 layers on the GTX-1660, so output, tok embeddings and kv cache are done on the CPU. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-19	ggml : sync latest ggml repo (#1924)	Georgi Gerganov
	* ggml : sync latest ggml repo * ggml : remove unused comments * ggml : asserts
2023-06-19	cmake : fix build shared ggml when CUDA is enabled (#1929)	Howard Su
	Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-19	Convert vector to f16 for dequantize mul mat vec (#1913)	Johannes Gäßler
	* Convert vector to f16 for dmmv * compile option * Added compilation option description to README * Changed cmake CUDA_ARCHITECTURES from "OFF" to "native"
2023-06-18	Added tokens per second to info prints (#1928)	Johannes Gäßler

2023-06-18	Fixed incorrectly applying RMS norm twice (#1925)	Johannes Gäßler

2023-06-18	ggml : fix bug in ggml_compute_forward_add_q_f32 (#1918)	l3utterfly

2023-06-18	readme : update Android build instructions (#1922)	Mike
	Add steps for using termux on android devices to prevent common errors.
2023-06-18	llama : prevent usage of k-quants when tensor size is not a multiple of 256 (#1921)	Kawrakow
	* Fix examples/metal * k-quants: prevent usage when tensor size is not divisible by 256 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-18	examples : fix examples/metal (#1920)	Kawrakow
	Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-18	metal : handle buffers larger than device's maxBufferLength (#1826)	Georgi Gerganov
	* metal : handle buffers larger than device's maxBufferLength * metal : print more verbose device info + handle errors * metal : fix prints for overlapping views * metal : minimize view overlap to try to utilize device memory better
2023-06-18	cmake : add CUDA_ARCHITECTURES to new target ggml_static (#1917)	Howard Su

2023-06-17	make : do not print help for simple example	Georgi Gerganov

2023-06-17	minor : warning fixes	Georgi Gerganov

2023-06-17	Only one CUDA stream per device for async compute (#1898)	Johannes Gäßler

2023-06-17	llama : fix kv_cache `n` init (close #1903)	Georgi Gerganov

2023-06-17	make : update for latest Arch (#1701)	DaniAndTheWeb
	With the upcoming change to the openblas package in Arch, the Makefile workaround is no longer needed.
2023-06-17	ggml : fix warnings under MSVC (#1908)	Howard Su

2023-06-17	metal : add norm, cpy f16->f16, alibi kernels (#1823)	Aaron Miller