aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-07-01embd-input : fix returning ptr to temporaryGeorgi Gerganov
2023-07-01train : fix compile warningGeorgi Gerganov
2023-07-01ggml : disable GGML_TASK_INIT and GGML_TASK_FINALIZE by default (#1995)Qingyou Meng
Will not be scheduled unless explicitly enabled.
2023-06-29Use unsigned for random seed (#2006)Howard Su
* Use unsigned for random seed. Keep -1 as the value to use a time based seed. Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-29Porting the improved K-Quant CUDA kernels to OpenCL (#1966)LostRuins
* Added broken new q4k quant * xx + ib0 * Fix q2_k fast kernel * Use preprocessor for QK_K * Add q6_k fast matmul kernel * ported q3k speedup successfully * ported q2k and q5k speedups * remove old dot kernels and template * fixed global const struct types * fixing address spaces * fixed string too long CI issue --------- Co-authored-by: 0cc4m <picard12@live.de>
2023-06-28llama : replacing auto &kv with const auto &kv (#2041)m3ndax
* Replacing auto &kv with const auto &kv * Create codacy.yml * Delete codacy.yml
2023-06-28cuda : remove nchannels_x argument from mul_mat_vec_nc_f16_f32 (#2028)Salvador E. Tropea
- Not used
2023-06-28cuda : fix missing const qualifier in casts (#2027)Salvador E. Tropea
2023-06-28llama : remove shards weight file support (#2000)Howard Su
* Remove multiple shards * Remove multiple file loaders * Remove llama_load_tensor_shard class * Simplify load logic * Remove dead code guess_n_parts function * Remove vocab_only from constructor of llama_model_loader * Remove alignment_prevents_mmap which is not more needed. * Remove useless check
2023-06-28CUDA GPU acceleration for LoRAs + f16 models (#1970)Johannes Gäßler
2023-06-28llama : support input embeddings directly (#1910)ningshanwutuobang
* add interface for float input * fixed inpL shape and type * add examples of input floats * add test example for embd input * fixed sampling * add free for context * fixed add end condition for generating * add examples for llava.py * add READMD for llava.py * add READMD for llava.py * add example of PandaGPT * refactor the interface and fixed the styles * add cmake build for embd-input * add cmake build for embd-input * Add MiniGPT-4 example * change the order of the args of llama_eval_internal * fix ci error
2023-06-27fix pthreads setaffinity usage on android (#2020)Erik Scholz
2023-06-27baby-llama : fix build after ggml_rope change (#2016)Howard Su
2023-06-27llama : fix rope usage after ChatGLM changeGeorgi Gerganov
2023-06-27ggml : add support for ChatGLM RoPEGeorgi Gerganov
2023-06-26readme : add Scala 3 bindings repo (#2010)Roman Parykin
2023-06-26ggml : increase max tensor name + clean up compiler warnings in train-text ↵David Yang
(#1988) * Clean up compiler warnings in train-text Some brackets to disambiguate order of operations * Increase GGML_MAX_NAME Avoiding strncpy danger in train-text-from-scratch and reducing potential future name length issues
2023-06-26readme : LD_LIBRARY_PATH complement for some Android devices when building ↵Gustavo Rocha Dias
with CLBlast inside Termux (#2007) * docs - Alternative way to build at Android, with CLBlast. * doc - LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux. * doc- fix typo
2023-06-26ggml : avoid conv 2d kernel round upGeorgi Gerganov
2023-06-26ggml : add NUMA support (#1556)zrm
* detect NUMA systems and pin work threads to nodes (linux) * disable mmap prefetch/readahead for NUMA systems * avoid sending finalize op to thread pool if it does nothing * silence robot * fix args * make --numa a param * recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement * lower synchronization overhead * statically allocate * move numa state to g_state * add description for --numa * ggml : minor style changes * ggml : minor style + try fix sanitizer build * llama : allow to initialize backend with NUMA support * llama : avoid ggml include in llama-util.h * ggml : style / formatting * ggml : fix handling of ops with n_threads > n_tasks > 1 * server : utilize numa parameter --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-26k-quants : fix indentationGeorgi Gerganov
2023-06-26tests : fix quantize perf (#1990)katsu560
* fix test quantize perf * avoid the global state
2023-06-26k-quants : add AVX support to dot functions (#1916)katsu560
* k_quants : add AVX support * k_quants : apply review comments
2023-06-26readme : add link to new k-quants for visibilityGeorgi Gerganov
2023-06-26k-quants : support for super-block size of 64 (#2001)Kawrakow
* k_quants: WIP super-blocks with 64 weights * k_quants: WIP super-blocks with 64 weights Q6_K scalar and AVX2 works * k_quants: WIP super-blocks with 64 weights Q4_K scalar and AVX2 works * k_quants: WIP super-blocks with 64 weights Q2_K scalar and AVX2 works. Q2_K is way too slow (it is actually slower than the scalar implementation) * k_quants: WIP super-blocks with 64 weights Q3_K scalar and AVX2 works. * k_quants: WIP super-blocks with 64 weights Q5_K scalar and AVX2 works, and with that all k_quants are done on AVX2 and scalar * k_quants: WIP super-blocks with 64 weights Q6_K working on CUDA. Cannot make it run quite as gast as with super-blocks with 256 weigths: 8% slower on 4080, 20% slower on the 1660 (but there we fit 1 less layer on the GPU because pf the larger model size), so some fraction of these 20% is due to that, * k_quants: WIP super-blocks with 64 weights Q4_K working on CUDA. ~10% slower on GTX-1660, 16% slower on 4080. * k_quants: WIP super-blocks with 64 weights Q2_K working on CUDA. ~3% slower on GTX-1660, 10% slower on 4080. * k_quants: WIP super-blocks with 64 weights Q3_K working on CUDA. * k_quants: WIP super-blocks with 64 weights Q5_K working on CUDA, and with this CUDA is done. * k_quants: WIP super-blocks with 64 weights Q6_K working on ARM_NEON * k_quants: WIP super-blocks with 64 weights Q4_K working on ARM_NEON, but quite a bit slower than 256 weights * k_quants: WIP super-blocks with 64 weights Q2_K working on ARM_NEON, but quite a bit slower than 256 weights * k_quants: WIP super-blocks with 64 weights Q3_K working on ARM_NEON, but quite a bit slower than 256 weights. * k_quants: WIP super-blocks with 64 weights Q5_K working on ARM_NEON, but quite a bit slower than 256 weights. With that, we have full support for ARM_NEON, although performance is not quite there. * k_quants: WIP super-blocks with 64 weights Slightly more efficient Q3_K and Q5_K * k_quants: WIP super-blocks with 64 weights Another small improvement for Q3_K and Q5_K on ARM_NEON * k_quants: WIP super-blocks with 64 weights Yet another speedup for Q5_K on ARM_NEON. We are now within 10% of the QK_K = 256 version. * k_quants: WIP super-blocks with 64 weights * We are able to pass preprocessor macros to the Metal compiler * Q6_K works and is actually slightly more efficient than the QK_K = 256 version (25.2 ms vs 25.8 ms) * k_quants: WIP super-blocks with 64 weights Q4_K works on Metal and is actually slightly faster than QK_K = 256 (21.95 ms vs 24.0 ms). * k_quants: WIP super-blocks with 64 weights Q2_K works on Metal and is very slightly faster than QK_K = 256 (23.8 ms vs 24.2 ms). * k_quants: WIP super-blocks with 64 weights Q3_K works on Metal and is slightly faster than QK_K = 256 (26.6 ms vs 28.3 ms). * k_quants: WIP super-blocks with 64 weights Q5_K works on Metal and is slightly faster than QK_K = 256 (23.7 ms vs 26.3 ms). * k_quants: call them _K, not _k, also on Metal * k_quants: correctly define QK_K in llama.cpp * Fixed bug in q4_K quantization added with the 64-block addition * Simplify via lambda * k_quants: swicth Q3_K to 4-bit scales when QK_K = 64 Otherwise there isn't much benefit from this quantization type. There is some very slight loss in accuracy, but we reduce size by ~7%. E.g., for OpenLLaMA-3B, Q3_K_S perplexity is 8.6131 with 8-bit scales and 8.6352 with 4-bit, while file size decreases from 1.53G to 1.44G. * k_quants: switch Q4_K to 4-bit scales when QK_K = 64 Here the loss in accuracy is greater than for Q3_K, but the Q4_K points still move further to the left on the perplexity vs size curve. * k_quants: forgot to add the Metal changes in last commit * k_quants: change Q5_K to be type 0 when QK_K = 64 Still needs AVX2 implementation * k_quants: AVX2 implementation for new 64-weight Q5_K * k_quants: 10% faster ARM_NEON Q5_K dot product * k_quants: fixed issue caused by merging with master --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-06-26Fix assert when free invalid cuda pointer (#2005)Howard Su
Fix assert via initializing extra structure always. CUDA error 1 at C:\GPT\llama.cpp\ggml-cuda.cu:2536: invalid argument
2023-06-25readme : add new roadmap + manifestoGeorgi Gerganov
2023-06-25ggml : sync latest ggml (custom operators)Georgi Gerganov
2023-06-25fix server sampling: top k sampler first (#1977)anon998
Co-authored-by: anon <anon@example.org>
2023-06-25readme : add Azure CI discussion linkGeorgi Gerganov
2023-06-25zig : upgrade build system support (#1981)sjinzh
* upgrade zig build system support * zig : add new line at the end of the file --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-24#1869 Fix null reference errors when training from scratch with CUDA (#1907)Robyn
* #1869 Fix null reference errors when training from scratch with CUDA build Calling ggml_compute_forward when node->src0 was null was causing train-text-from-scratch.exe to terminate unexpectedly. * ggml : do not dereference src0 if NULL --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-24tests : sync test-grad0 from ggmlGeorgi Gerganov
2023-06-24flake : fix ggml-metal.metal path and run nixfmt (#1974)Rowan Hart
2023-06-24convert : fix invalid params in write_vocab_only (#1975)AN Long
2023-06-24ggml : improve ggml_graph_dump_dot, add ggml_format_name (#1978)slaren
* Improve ggml_graph_dump_dot, add ggml_format_name * add more automatic names to view ops * fix name of copies
2023-06-24readme : fix whitespacesGeorgi Gerganov
2023-06-24readme : fixed termux instructions (#1973)Alberto
2023-06-24llama : fix top-p sampling to match the canonical definition (#1953)Alex Renda
* Fix top-p sampling to match the standard definition (smallest set that has probability mass at least p, not largest set with probability mass less than p) * top-p: correct gt to gte * add test for correct top-p behavior
2023-06-24llama : make model stateless and context stateful (llama_state) (#1797)Didzis Gosko
* llama : make model stateless and context stateful * llama : minor cleanup * llama : update internal API declaration * Apply suggestions from code review fix style Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Missing model memory release * Fix style * Add deprecated warning for public API function llama_init_from_file * Update public API use cases: move away from deprecated llama_init_from_file * Deprecate public API function llama_apply_lora_from_file --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-06-23Add OpenLLaMA instructions to the README (#1954)eiery
* add openllama to readme
2023-06-22rework convert.py to read hyper-parameters from config.json (#1958)Erik Scholz
* Read hyper-parameters from HuggingFace-transformer config.json, if they exist, and fall back to guessing, like before otherwise. This allows converting open_llama 3B and other non-standard model designs.
2023-06-21cmake: revert CUDA arch default to 52, 61 if f16 (#1959)Johannes Gäßler
2023-06-21Fix typo in README.md (#1961)Rahul Vivek Nair
2023-06-20readme : add link to p1Georgi Gerganov
2023-06-20Fix typo (#1949)Xiake Sun
2023-06-20llama : fix params struct slignment (#1936)Ettore Di Giacinto
* Workaround struct misalignment during value-copy Signed-off-by: mudler <mudler@localai.io> * Move booleans at the bottom of the structure Signed-off-by: mudler <mudler@localai.io> * Add comment Signed-off-by: mudler <mudler@localai.io> --------- Signed-off-by: mudler <mudler@localai.io>
2023-06-20[Fix] Reenable server embedding endpoint (#1937)Henri Vasserman
* Add back embedding feature * Update README
2023-06-19ggml : fix bug in LBFGS optimizer (found by ggml tests)Georgi Gerganov
2023-06-19llama : use aligned memory during ggml_init call from loading saved sessions ↵l3utterfly
(#1934) * fixed issue: memory is not guaranteed to be aligned properly during ggml_init call from loading saved sessions * - removed commented out old code from fix - updated another instance of same issue below original