llama.cpp.git - llama.cpp

Age	Commit message	Author
2023-07-01	train : fix compile warning	Georgi Gerganov

2023-07-01	ggml : disable GGML_TASK_INIT and GGML_TASK_FINALIZE by default (#1995)	Qingyou Meng
	Will not be scheduled unless explicitly enabled.
2023-06-29	Use unsigned for random seed (#2006)	Howard Su
	* Use unsigned for random seed. Keep -1 as the value to use a time based seed. Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
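The seed convention described above can be sketched as follows. This is a minimal illustration, not the project's code: the names `kTimeBasedSeed`, `resolve_seed`, and `make_rng` are hypothetical, and the sketch assumes that `-1` stored in an unsigned 32-bit value (that is, `0xFFFFFFFF`) acts as the "use a time-based seed" sentinel.

```cpp
#include <cstdint>
#include <ctime>
#include <random>

// Hypothetical sentinel: -1 converted to an unsigned 32-bit value (0xFFFFFFFF).
constexpr std::uint32_t kTimeBasedSeed = static_cast<std::uint32_t>(-1);

// Returns the seed to use: the requested value as-is, or the current time
// when the caller passed the sentinel.
inline std::uint32_t resolve_seed(std::uint32_t requested) {
    if (requested == kTimeBasedSeed) {
        return static_cast<std::uint32_t>(std::time(nullptr));
    }
    return requested;
}

// Builds a generator from the resolved seed (illustrative helper).
inline std::mt19937 make_rng(std::uint32_t requested) {
    return std::mt19937(resolve_seed(requested));
}
```

With an unsigned seed, every value except the sentinel is a valid explicit seed, including `0`.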
2023-06-29	Porting the improved K-Quant CUDA kernels to OpenCL (#1966)	LostRuins
	* Added broken new q4k quant * xx + ib0 * Fix q2_k fast kernel * Use preprocessor for QK_K * Add q6_k fast matmul kernel * ported q3k speedup successfully * ported q2k and q5k speedups * remove old dot kernels and template * fixed global const struct types * fixing address spaces * fixed string too long CI issue --------- Co-authored-by: 0cc4m <picard12@live.de>
2023-06-28	llama : replacing auto &kv with const auto &kv (#2041)	m3ndax
	* Replacing auto &kv with const auto &kv * Create codacy.yml * Delete codacy.yml
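As a rough illustration of the change named above: binding a loop variable as `const auto &kv` avoids copying each element and makes accidental mutation a compile error. The function and container below are hypothetical and are not taken from the llama.cpp sources.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Sums the lengths of all values. The loop only reads each entry, so the
// element is bound by const reference: no copy, no accidental writes.
std::size_t total_value_length(const std::map<std::string, std::string> &m) {
    std::size_t total = 0;
    for (const auto &kv : m) {
        total += kv.second.size();
    }
    return total;
}
```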
2023-06-28	cuda : remove nchannels_x argument from mul_mat_vec_nc_f16_f32 (#2028)	Salvador E. Tropea
	- Not used
2023-06-28	cuda : fix missing const qualifier in casts (#2027)	Salvador E. Tropea

2023-06-28	llama : remove shards weight file support (#2000)	Howard Su
	* Remove multiple shards * Remove multiple file loaders * Remove llama_load_tensor_shard class * Simplify load logic * Remove dead code guess_n_parts function * Remove vocab_only from constructor of llama_model_loader * Remove alignment_prevents_mmap which is no longer needed. * Remove useless check
2023-06-28	CUDA GPU acceleration for LoRAs + f16 models (#1970)	Johannes Gäßler

2023-06-28	llama : support input embeddings directly (#1910)	ningshanwutuobang
	* add interface for float input * fixed inpL shape and type * add examples of input floats * add test example for embd input * fixed sampling * add free for context * fixed add end condition for generating * add examples for llava.py * add README for llava.py * add README for llava.py * add example of PandaGPT * refactor the interface and fixed the styles * add cmake build for embd-input * add cmake build for embd-input * Add MiniGPT-4 example * change the order of the args of llama_eval_internal * fix ci error
2023-06-27	fix pthreads setaffinity usage on android (#2020)	Erik Scholz

2023-06-27	baby-llama : fix build after ggml_rope change (#2016)	Howard Su

2023-06-27	llama : fix rope usage after ChatGLM change	Georgi Gerganov

2023-06-27	ggml : add support for ChatGLM RoPE	Georgi Gerganov

2023-06-26	readme : add Scala 3 bindings repo (#2010)	Roman Parykin

2023-06-26	ggml : increase max tensor name + clean up compiler warnings in train-text (#1988)	David Yang
	* Clean up compiler warnings in train-text Some brackets to disambiguate order of operations * Increase GGML_MAX_NAME Avoiding strncpy danger in train-text-from-scratch and reducing potential future name length issues
2023-06-26	readme : LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux (#2007)	Gustavo Rocha Dias
	* docs - Alternative way to build on Android, with CLBlast. * doc - LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux. * doc - fix typo
2023-06-26	ggml : avoid conv 2d kernel round up	Georgi Gerganov

2023-06-26	ggml : add NUMA support (#1556)	zrm
	* detect NUMA systems and pin work threads to nodes (linux) * disable mmap prefetch/readahead for NUMA systems * avoid sending finalize op to thread pool if it does nothing * silence robot * fix args * make --numa a param * recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement * lower synchronization overhead * statically allocate * move numa state to g_state * add description for --numa * ggml : minor style changes * ggml : minor style + try fix sanitizer build * llama : allow to initialize backend with NUMA support * llama : avoid ggml include in llama-util.h * ggml : style / formatting * ggml : fix handling of ops with n_threads > n_tasks > 1 * server : utilize numa parameter --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-26	k-quants : fix indentation	Georgi Gerganov

2023-06-26	tests : fix quantize perf (#1990)	katsu560
	* fix test quantize perf * avoid the global state
2023-06-26	k-quants : add AVX support to dot functions (#1916)	katsu560
	* k_quants : add AVX support * k_quants : apply review comments
2023-06-26	readme : add link to new k-quants for visibility	Georgi Gerganov

2023-06-26	k-quants : support for super-block size of 64 (#2001)	Kawrakow
	* k_quants: WIP super-blocks with 64 weights * k_quants: WIP super-blocks with 64 weights Q6_K scalar and AVX2 works * k_quants: WIP super-blocks with 64 weights Q4_K scalar and AVX2 works * k_quants: WIP super-blocks with 64 weights Q2_K scalar and AVX2 works. Q2_K is way too slow (it is actually slower than the scalar implementation) * k_quants: WIP super-blocks with 64 weights Q3_K scalar and AVX2 works. * k_quants: WIP super-blocks with 64 weights Q5_K scalar and AVX2 works, and with that all k_quants are done on AVX2 and scalar * k_quants: WIP super-blocks with 64 weights Q6_K working on CUDA. Cannot make it run quite as fast as with super-blocks with 256 weights: 8% slower on 4080, 20% slower on the 1660 (but there we fit 1 less layer on the GPU because of the larger model size), so some fraction of these 20% is due to that. * k_quants: WIP super-blocks with 64 weights Q4_K working on CUDA. ~10% slower on GTX-1660, 16% slower on 4080. * k_quants: WIP super-blocks with 64 weights Q2_K working on CUDA. ~3% slower on GTX-1660, 10% slower on 4080. * k_quants: WIP super-blocks with 64 weights Q3_K working on CUDA. * k_quants: WIP super-blocks with 64 weights Q5_K working on CUDA, and with this CUDA is done. * k_quants: WIP super-blocks with 64 weights Q6_K working on ARM_NEON * k_quants: WIP super-blocks with 64 weights Q4_K working on ARM_NEON, but quite a bit slower than 256 weights * k_quants: WIP super-blocks with 64 weights Q2_K working on ARM_NEON, but quite a bit slower than 256 weights * k_quants: WIP super-blocks with 64 weights Q3_K working on ARM_NEON, but quite a bit slower than 256 weights. * k_quants: WIP super-blocks with 64 weights Q5_K working on ARM_NEON, but quite a bit slower than 256 weights. With that, we have full support for ARM_NEON, although performance is not quite there.
	* k_quants: WIP super-blocks with 64 weights Slightly more efficient Q3_K and Q5_K * k_quants: WIP super-blocks with 64 weights Another small improvement for Q3_K and Q5_K on ARM_NEON * k_quants: WIP super-blocks with 64 weights Yet another speedup for Q5_K on ARM_NEON. We are now within 10% of the QK_K = 256 version. * k_quants: WIP super-blocks with 64 weights * We are able to pass preprocessor macros to the Metal compiler * Q6_K works and is actually slightly more efficient than the QK_K = 256 version (25.2 ms vs 25.8 ms) * k_quants: WIP super-blocks with 64 weights Q4_K works on Metal and is actually slightly faster than QK_K = 256 (21.95 ms vs 24.0 ms). * k_quants: WIP super-blocks with 64 weights Q2_K works on Metal and is very slightly faster than QK_K = 256 (23.8 ms vs 24.2 ms). * k_quants: WIP super-blocks with 64 weights Q3_K works on Metal and is slightly faster than QK_K = 256 (26.6 ms vs 28.3 ms). * k_quants: WIP super-blocks with 64 weights Q5_K works on Metal and is slightly faster than QK_K = 256 (23.7 ms vs 26.3 ms). * k_quants: call them _K, not _k, also on Metal * k_quants: correctly define QK_K in llama.cpp * Fixed bug in q4_K quantization added with the 64-block addition * Simplify via lambda * k_quants: switch Q3_K to 4-bit scales when QK_K = 64 Otherwise there isn't much benefit from this quantization type. There is some very slight loss in accuracy, but we reduce size by ~7%. E.g., for OpenLLaMA-3B, Q3_K_S perplexity is 8.6131 with 8-bit scales and 8.6352 with 4-bit, while file size decreases from 1.53G to 1.44G. * k_quants: switch Q4_K to 4-bit scales when QK_K = 64 Here the loss in accuracy is greater than for Q3_K, but the Q4_K points still move further to the left on the perplexity vs size curve.
	* k_quants: forgot to add the Metal changes in last commit * k_quants: change Q5_K to be type 0 when QK_K = 64 Still needs AVX2 implementation * k_quants: AVX2 implementation for new 64-weight Q5_K * k_quants: 10% faster ARM_NEON Q5_K dot product * k_quants: fixed issue caused by merging with master --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-26	Fix assert when free invalid cuda pointer (#2005)	Howard Su
	Fix assert via initializing extra structure always. CUDA error 1 at C:\GPT\llama.cpp\ggml-cuda.cu:2536: invalid argument
2023-06-25	readme : add new roadmap + manifesto	Georgi Gerganov

2023-06-25	ggml : sync latest ggml (custom operators)	Georgi Gerganov

2023-06-25	fix server sampling: top k sampler first (#1977)	anon998
	Co-authored-by: anon <anon@example.org>
2023-06-25	readme : add Azure CI discussion link	Georgi Gerganov

2023-06-25	zig : upgrade build system support (#1981)	sjinzh
	* upgrade zig build system support * zig : add new line at the end of the file --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-24	#1869 Fix null reference errors when training from scratch with CUDA (#1907)	Robyn
	* #1869 Fix null reference errors when training from scratch with CUDA build Calling ggml_compute_forward when node->src0 was null was causing train-text-from-scratch.exe to terminate unexpectedly. * ggml : do not dereference src0 if NULL --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-24	tests : sync test-grad0 from ggml	Georgi Gerganov

2023-06-24	flake : fix ggml-metal.metal path and run nixfmt (#1974)	Rowan Hart

2023-06-24	convert : fix invalid params in write_vocab_only (#1975)	AN Long

2023-06-24	ggml : improve ggml_graph_dump_dot, add ggml_format_name (#1978)	slaren
	* Improve ggml_graph_dump_dot, add ggml_format_name * add more automatic names to view ops * fix name of copies
2023-06-24	readme : fix whitespaces	Georgi Gerganov

2023-06-24	readme : fixed termux instructions (#1973)	Alberto

2023-06-24	llama : fix top-p sampling to match the canonical definition (#1953)	Alex Renda
	* Fix top-p sampling to match the standard definition (smallest set that has probability mass at least p, not largest set with probability mass less than p) * top-p: correct gt to gte * add test for correct top-p behavior
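The definition quoted above (keep the smallest set of tokens whose probability mass is at least p) can be sketched as follows. This is an illustrative stand-alone version, not the project's sampler: `TokenProb` and `top_p_filter` are hypothetical names, and the input probabilities are assumed to be already normalized.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical candidate type: a token id and its normalized probability.
struct TokenProb {
    int   id;
    float p;
};

// Keeps the smallest set of highest-probability tokens whose cumulative
// mass is at least top_p. Note the >= comparison: the token that makes the
// running sum reach top_p is included, not dropped.
std::vector<TokenProb> top_p_filter(std::vector<TokenProb> cand, float top_p) {
    std::sort(cand.begin(), cand.end(),
              [](const TokenProb &a, const TokenProb &b) { return a.p > b.p; });

    float       cumulative = 0.0f;
    std::size_t keep       = cand.size();  // keep everything if top_p is never reached
    for (std::size_t i = 0; i < cand.size(); ++i) {
        cumulative += cand[i].p;
        if (cumulative >= top_p) {
            keep = i + 1;
            break;
        }
    }
    cand.erase(cand.begin() + static_cast<std::ptrdiff_t>(keep), cand.end());
    return cand;
}
```

For probabilities 0.5, 0.25, 0.25 and p = 0.6, this keeps two tokens (mass 0.75), whereas the "largest set with mass less than p" reading would keep only one (mass 0.5).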
2023-06-24	llama : make model stateless and context stateful (llama_state) (#1797)	Didzis Gosko
	* llama : make model stateless and context stateful * llama : minor cleanup * llama : update internal API declaration * Apply suggestions from code review fix style Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Missing model memory release * Fix style * Add deprecated warning for public API function llama_init_from_file * Update public API use cases: move away from deprecated llama_init_from_file * Deprecate public API function llama_apply_lora_from_file --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-23	Add OpenLLaMA instructions to the README (#1954)	eiery
	* add openllama to readme
2023-06-22	rework convert.py to read hyper-parameters from config.json (#1958)	Erik Scholz
	* Read hyper-parameters from the HuggingFace transformers config.json if it exists; otherwise fall back to guessing, as before. This allows converting open_llama 3B and other non-standard model designs.
2023-06-21	cmake: revert CUDA arch default to 52, 61 if f16 (#1959)	Johannes Gäßler

2023-06-21	Fix typo in README.md (#1961)	Rahul Vivek Nair

2023-06-20	readme : add link to p1	Georgi Gerganov

2023-06-20	Fix typo (#1949)	Xiake Sun

2023-06-20	llama : fix params struct alignment (#1936)	Ettore Di Giacinto
	* Workaround struct misalignment during value-copy Signed-off-by: mudler <mudler@localai.io> * Move booleans at the bottom of the structure Signed-off-by: mudler <mudler@localai.io> * Add comment Signed-off-by: mudler <mudler@localai.io> --------- Signed-off-by: mudler <mudler@localai.io>
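One way to picture the "move booleans at the bottom of the structure" step: grouping the `bool` members after the wider fields removes interior padding, so the layout is simpler to mirror when the struct is copied by value across a language boundary. The sketch below is an assumption about the general technique only. The struct and field names are hypothetical, and it does not reproduce the real parameter struct or the exact cause of the original bug.

```cpp
#include <cstddef>

// Hypothetical layout with booleans interleaved between wider fields.
// Compilers typically insert padding after each bool so that the following
// int/float member is properly aligned.
struct ParamsInterleaved {
    bool  use_mmap;
    int   n_ctx;
    bool  use_mlock;
    float scale;
};

// Same fields with the booleans grouped at the bottom. The wide members are
// contiguous from offset 0 and the bools sit next to each other at the end.
struct ParamsBoolsLast {
    int   n_ctx;
    float scale;
    bool  use_mmap;
    bool  use_mlock;
};

// Grouping the bools never makes the struct larger than the interleaved form.
static_assert(sizeof(ParamsBoolsLast) <= sizeof(ParamsInterleaved),
              "bools-last layout should not be larger");
```

Exact sizes and offsets are ABI-dependent, which is why the check above only compares the two layouts instead of asserting fixed numbers.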
2023-06-20	[Fix] Reenable server embedding endpoint (#1937)	Henri Vasserman
	* Add back embedding feature * Update README
2023-06-19	ggml : fix bug in LBFGS optimizer (found by ggml tests)	Georgi Gerganov

2023-06-19	llama : use aligned memory during ggml_init call from loading saved sessions (#1934)	l3utterfly
	* fixed issue: memory is not guaranteed to be aligned properly during ggml_init call from loading saved sessions * removed commented out old code from fix * updated another instance of same issue below original
2023-06-19	cmake : fix trailing whitespaces	Georgi Gerganov