Age | Commit message (Collapse) | Author |
|
|
|
Will not be scheduled unless explicitly enabled.
|
|
* Use unsigned for random seed. Keep -1 as the value to use a time based seed.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* Added broken new q4k quant
* xx + ib0
* Fix q2_k fast kernel
* Use preprocessor for QK_K
* Add q6_k fast matmul kernel
* ported q3k speedup successfully
* ported q2k and q5k speedups
* remove old dot kernels and template
* fixed global const struct types
* fixing address spaces
* fixed string too long CI issue
---------
Co-authored-by: 0cc4m <picard12@live.de>
|
|
* Replacing auto &kv with const auto &kv
* Create codacy.yml
* Delete codacy.yml
|
|
- Not used
|
|
|
|
* Remove multiple shards
* Remove multiple file loaders
* Remove llama_load_tensor_shard class
* Simplify load logic
* Remove dead code guess_n_parts function
* Remove vocab_only from constructor of llama_model_loader
* Remove alignment_prevents_mmap which is not more needed.
* Remove useless check
|
|
|
|
* add interface for float input
* fixed inpL shape and type
* add examples of input floats
* add test example for embd input
* fixed sampling
* add free for context
* fixed add end condition for generating
* add examples for llava.py
* add READMD for llava.py
* add READMD for llava.py
* add example of PandaGPT
* refactor the interface and fixed the styles
* add cmake build for embd-input
* add cmake build for embd-input
* Add MiniGPT-4 example
* change the order of the args of llama_eval_internal
* fix ci error
|
|
|
|
|
|
|
|
|
|
|
|
(#1988)
* Clean up compiler warnings in train-text
Some brackets to disambiguate order of operations
* Increase GGML_MAX_NAME
Avoiding strncpy danger in train-text-from-scratch and reducing potential future name length issues
|
|
with CLBlast inside Termux (#2007)
* docs - Alternative way to build at Android, with CLBlast.
* doc - LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux.
* doc- fix typo
|
|
|
|
* detect NUMA systems and pin work threads to nodes (linux)
* disable mmap prefetch/readahead for NUMA systems
* avoid sending finalize op to thread pool if it does nothing
* silence robot
* fix args
* make --numa a param
* recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement
* lower synchronization overhead
* statically allocate
* move numa state to g_state
* add description for --numa
* ggml : minor style changes
* ggml : minor style + try fix sanitizer build
* llama : allow to initialize backend with NUMA support
* llama : avoid ggml include in llama-util.h
* ggml : style / formatting
* ggml : fix handling of ops with n_threads > n_tasks > 1
* server : utilize numa parameter
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
* fix test quantize perf
* avoid the global state
|
|
* k_quants : add AVX support
* k_quants : apply review comments
|
|
|
|
* k_quants: WIP super-blocks with 64 weights
* k_quants: WIP super-blocks with 64 weights
Q6_K scalar and AVX2 works
* k_quants: WIP super-blocks with 64 weights
Q4_K scalar and AVX2 works
* k_quants: WIP super-blocks with 64 weights
Q2_K scalar and AVX2 works. Q2_K is way too slow (it is actually slower
than the scalar implementation)
* k_quants: WIP super-blocks with 64 weights
Q3_K scalar and AVX2 works.
* k_quants: WIP super-blocks with 64 weights
Q5_K scalar and AVX2 works, and with that all
k_quants are done on AVX2 and scalar
* k_quants: WIP super-blocks with 64 weights
Q6_K working on CUDA. Cannot make it run quite as gast as
with super-blocks with 256 weigths: 8% slower on 4080,
20% slower on the 1660 (but there we fit 1 less layer on the
GPU because pf the larger model size), so some fraction of
these 20% is due to that,
* k_quants: WIP super-blocks with 64 weights
Q4_K working on CUDA. ~10% slower on GTX-1660,
16% slower on 4080.
* k_quants: WIP super-blocks with 64 weights
Q2_K working on CUDA. ~3% slower on GTX-1660,
10% slower on 4080.
* k_quants: WIP super-blocks with 64 weights
Q3_K working on CUDA.
* k_quants: WIP super-blocks with 64 weights
Q5_K working on CUDA, and with this CUDA is done.
* k_quants: WIP super-blocks with 64 weights
Q6_K working on ARM_NEON
* k_quants: WIP super-blocks with 64 weights
Q4_K working on ARM_NEON, but quite a bit slower than 256 weights
* k_quants: WIP super-blocks with 64 weights
Q2_K working on ARM_NEON, but quite a bit slower than 256 weights
* k_quants: WIP super-blocks with 64 weights
Q3_K working on ARM_NEON, but quite a bit slower than 256 weights.
* k_quants: WIP super-blocks with 64 weights
Q5_K working on ARM_NEON, but quite a bit slower than 256 weights.
With that, we have full support for ARM_NEON, although
performance is not quite there.
* k_quants: WIP super-blocks with 64 weights
Slightly more efficient Q3_K and Q5_K
* k_quants: WIP super-blocks with 64 weights
Another small improvement for Q3_K and Q5_K on ARM_NEON
* k_quants: WIP super-blocks with 64 weights
Yet another speedup for Q5_K on ARM_NEON.
We are now within 10% of the QK_K = 256 version.
* k_quants: WIP super-blocks with 64 weights
* We are able to pass preprocessor macros to the Metal
compiler
* Q6_K works and is actually slightly more efficient than
the QK_K = 256 version (25.2 ms vs 25.8 ms)
* k_quants: WIP super-blocks with 64 weights
Q4_K works on Metal and is actually slightly faster
than QK_K = 256 (21.95 ms vs 24.0 ms).
* k_quants: WIP super-blocks with 64 weights
Q2_K works on Metal and is very slightly faster
than QK_K = 256 (23.8 ms vs 24.2 ms).
* k_quants: WIP super-blocks with 64 weights
Q3_K works on Metal and is slightly faster
than QK_K = 256 (26.6 ms vs 28.3 ms).
* k_quants: WIP super-blocks with 64 weights
Q5_K works on Metal and is slightly faster
than QK_K = 256 (23.7 ms vs 26.3 ms).
* k_quants: call them _K, not _k, also on Metal
* k_quants: correctly define QK_K in llama.cpp
* Fixed bug in q4_K quantization added with the 64-block addition
* Simplify via lambda
* k_quants: swicth Q3_K to 4-bit scales when QK_K = 64
Otherwise there isn't much benefit from this
quantization type. There is some very slight loss
in accuracy, but we reduce size by ~7%.
E.g., for OpenLLaMA-3B, Q3_K_S perplexity is
8.6131 with 8-bit scales and 8.6352 with 4-bit,
while file size decreases from 1.53G to 1.44G.
* k_quants: switch Q4_K to 4-bit scales when QK_K = 64
Here the loss in accuracy is greater than for Q3_K,
but the Q4_K points still move further to the left on
the perplexity vs size curve.
* k_quants: forgot to add the Metal changes in last commit
* k_quants: change Q5_K to be type 0 when QK_K = 64
Still needs AVX2 implementation
* k_quants: AVX2 implementation for new 64-weight Q5_K
* k_quants: 10% faster ARM_NEON Q5_K dot product
* k_quants: fixed issue caused by merging with master
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Fix assert via initializing extra structure always.
CUDA error 1 at C:\GPT\llama.cpp\ggml-cuda.cu:2536: invalid argument
|
|
|
|
|
|
Co-authored-by: anon <anon@example.org>
|
|
|
|
* upgrade zig build system support
* zig : add new line at the end of the file
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* #1869 Fix null reference errors when training from scratch with CUDA build
Calling ggml_compute_forward when node->src0 was null was causing train-text-from-scratch.exe to terminate unexpectedly.
* ggml : do not dereference src0 if NULL
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
|
|
|
|
* Improve ggml_graph_dump_dot, add ggml_format_name
* add more automatic names to view ops
* fix name of copies
|
|
|
|
|
|
* Fix top-p sampling to match the standard definition (smallest set that has probability mass at least p, not largest set with probability mass less than p)
* top-p: correct gt to gte
* add test for correct top-p behavior
|
|
* llama : make model stateless and context stateful
* llama : minor cleanup
* llama : update internal API declaration
* Apply suggestions from code review
fix style
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Missing model memory release
* Fix style
* Add deprecated warning for public API function llama_init_from_file
* Update public API use cases: move away from deprecated llama_init_from_file
* Deprecate public API function llama_apply_lora_from_file
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* add openllama to readme
|
|
* Read hyper-parameters from HuggingFace-transformer config.json, if they exist, and fall back to guessing, like before otherwise.
This allows converting open_llama 3B and other non-standard model designs.
|
|
|
|
|
|
|
|
|
|
* Workaround struct misalignment during value-copy
Signed-off-by: mudler <mudler@localai.io>
* Move booleans at the bottom of the structure
Signed-off-by: mudler <mudler@localai.io>
* Add comment
Signed-off-by: mudler <mudler@localai.io>
---------
Signed-off-by: mudler <mudler@localai.io>
|
|
* Add back embedding feature
* Update README
|
|
|
|
(#1934)
* fixed issue: memory is not guaranteed to be aligned properly during ggml_init call from loading saved sessions
* - removed commented out old code from fix
- updated another instance of same issue below original
|
|
|