Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Fixes https://github.com/ggerganov/llama.cpp/issues/2166 by moving commands after the CFLAGS are changed.
* 3-5% faster Q4_0 on Metal
* 7-25% faster Q4_1 on Metal
* Oops, forgot to delete the original Q4_1 kernel
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Has perf regression when mlock is used.
This reverts commit 2347463201a9f4159ae95b737e1544dd300569c8.
This prevents accidentally expanding arguments that contain spaces.
Prefetch data to improve GPU utilization. ~48% faster for 33B model.
* ggml : broadcast mul_mat + conv batch support
* ggml : apply mul_mat broadcast fix by @jploski
* FP16 is supported from CM=6.0 (compute capability 6.0)
* Building PTX code for both 6.0 and 6.1
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Add ggml changes
* Update train-text-from-scratch for change
* mpi : adapt to new ggml_tensor->src (see the sketch after this entry)
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
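A hedged illustration of what the new ggml_tensor->src looks like to downstream code (train-text-from-scratch, ggml-mpi): the separate src0/src1 pointers become a single src[] array. GGML_MAX_SRC is assumed here to be the bound on that array; the exact constant may differ.

```c
#include <stddef.h>
#include "ggml.h"

/* Hedged sketch of the ggml_tensor->src change referenced above: consumers
 * iterate over the sources uniformly instead of touching src0/src1. */
int count_srcs(const struct ggml_tensor * t) {
    int n = 0;
    for (int i = 0; i < GGML_MAX_SRC; i++) {
        if (t->src[i] != NULL) {   /* previously: t->src0, t->src1 */
            n++;
        }
    }
    return n;
}
```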
* Initial implementation
* Remove debug print
* Restore signature of llama_init_from_gpt_params
* Free guidance context
* Make freeing of guidance_ctx conditional
* Make Classifier-Free Guidance a sampling function
* Correct typo. CFG already means context-free grammar.
* Record sampling time in llama_sample_classifier_free_guidance
* Shift all values by the max value before applying log-softmax (see the sketch after this list)
* Fix styling based on review
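The max-value shift mentioned above is the standard trick for a numerically stable log-softmax; a minimal C sketch, illustrative only and not the code from this PR:

```c
#include <math.h>
#include <float.h>

/* Subtracting the maximum logit before exponentiating keeps expf() from
 * overflowing, without changing the resulting log-probabilities. */
void log_softmax(float * logits, int n) {
    float max_l = -FLT_MAX;
    for (int i = 0; i < n; i++) {
        if (logits[i] > max_l) {
            max_l = logits[i];
        }
    }
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += expf(logits[i] - max_l);
    }
    const float log_sum = logf(sum);
    for (int i = 0; i < n; i++) {
        logits[i] = (logits[i] - max_l) - log_sum;
    }
}
```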
* Support using mmap when applying LoRA (see the sketch after this entry)
* Fix Linux
* Update comment to reflect that LoRA can now be applied with mmap
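A conceptual POSIX sketch of the mmap idea (not the llama.cpp code): map the LoRA file read-only instead of reading it into a heap buffer. Error reporting is simplified here.

```c
#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only and return a pointer to its contents, or NULL. */
const void * map_file_ro(const char * path, size_t * size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return NULL;
    }
    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return NULL;
    }
    void * data = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (data == MAP_FAILED) {
        return NULL;
    }
    *size = (size_t) st.st_size;
    return data;
}
```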
* This allows LLaMA models that were previously incompatible with K-quants to function mostly as normal. This happens when a model has a vocab size != 32000, e.g. 32001, which means it is not divisible by 256 or 64. Since the problematic dimensions only apply to `tok_embeddings.weight` and `output.weight` (dimensions 4096 x n_vocab), we can simply quantize these layers to Q8_0, while the majority of the hidden layers are still K-quanted since they have compatible dimensions (see the sketch after this entry).
* Fix indentation
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* As an alternative, to avoid failing on Metal due to lack of Q8_0 support, instead quantize tok_embeddings.weight to Q4_0 and retain output.weight as F16. This results in a net gain of about 55 MB for a 7B model compared to the previous approach, but should minimize the adverse impact on model quality.
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
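A hedged sketch of the divisibility condition described above; QK_K (the K-quant super-block size) and the exact shape of the check are assumptions, and the real llama.cpp code may differ.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define QK_K 256  /* assumed K-quant super-block size */

/* A tensor is treated as K-quant compatible only if its dimensions are
 * multiples of the super-block size: 32000 = 125 * 256, but 32001 is not
 * divisible by 256, so tok_embeddings.weight / output.weight (4096 x
 * n_vocab) trigger the fallback described above. */
static bool k_quant_compatible(int64_t ne0, int64_t ne1) {
    return ne0 % QK_K == 0 && ne1 % QK_K == 0;
}

int main(void) {
    printf("4096 x 32000 -> %d\n", k_quant_compatible(4096, 32000)); /* 1: K-quants OK */
    printf("4096 x 32001 -> %d\n", k_quant_compatible(4096, 32001)); /* 0: fall back   */
    return 0;
}
```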
* MPI support, first cut (see the conceptual sketch after this entry)
* fix warnings, update README
* fixes
* wrap includes
* PR comments
* Update CMakeLists.txt
* Add GH workflow, fix test
* Add info to README
* mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099)
* mpi : add names for layer inputs + prep ggml_mpi_graph_compute()
* mpi : move all MPI logic into ggml-mpi
Not tested yet
* mpi : various fixes - communication now works but results are wrong
* mpi : fix output tensor after MPI compute (still not working)
* mpi : fix inference
* mpi : minor
* Add OpenMPI to GH action
* [mpi] continue-on-error: true
* mpi : fix after master merge
* [mpi] Link MPI C++ libraries to fix OpenMPI
* tests : fix new llama_backend API
* [mpi] use MPI_INT32_T
* mpi : factor out recv / send in functions and reuse
* mpi : extend API to allow usage with outer backends (e.g. Metal)
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
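The MPI support distributes the model pipeline-style: each rank evaluates a slice of the layers and hands the hidden state to the next rank. A conceptual C sketch using plain MPI (this is not the ggml-mpi API; the buffer size and tag are illustrative):

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char ** argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_embd = 4096;               /* illustrative hidden size */
    float * h = calloc(n_embd, sizeof(float));

    if (rank > 0) {                        /* receive activations from the previous rank */
        MPI_Recv(h, n_embd, MPI_FLOAT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* ... evaluate this rank's slice of the transformer layers on h ... */

    if (rank < size - 1) {                 /* forward activations to the next rank */
        MPI_Send(h, n_embd, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD);
    }

    free(h);
    MPI_Finalize();
    return 0;
}
```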
The file path matters when running models inside Termux on Android devices: llama.cpp performance improves when loading a .bin from the $HOME directory.
* Fix the issue where building with Intel MKL still asks for "cblas.h"
* Use angle brackets to indicate the system library (see the example below)
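To illustrate the angle-bracket change, a small hedged example that builds against a system CBLAS/MKL installation; the dot-product call is only there to show that the header resolves from the system include paths rather than the project directory.

```c
#include <cblas.h>   /* angle brackets: search the system / MKL include paths */
#include <stdio.h>

int main(void) {
    const double x[3] = {1.0, 2.0, 3.0};
    const double y[3] = {4.0, 5.0, 6.0};
    printf("dot = %f\n", cblas_ddot(3, x, 1, y, 1));
    return 0;
}
```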
* Update README.md to add more docs indexes
* Update README.md to add more docs indexes
Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ggml_graph_compute: deprecate using ggml_context, try to resolve issue #287 (see the sketch after this entry)
* rewrite: no longer consider backward compatibility; plan and make_plan
* minor: rename ctx as plan; const
* remove ggml_graph_compute from tests/test-grad0.c, but the current change breaks the backward pass
* add static ggml_graph_compute_sugar()
* minor: update comments
* reusable buffers
* ggml : more consistent naming + metal fixes
* ggml : fix docs
* tests : disable grad / opt + minor naming changes
* ggml : add ggml_graph_compute_with_ctx()
- backwards compatible API
- deduplicates a lot of copy-paste
* ci : enable test-grad0
* examples : factor out plan allocation into a helper function
* llama : factor out plan stuff into a helper function
* ci : fix env
* llama : fix duplicate symbols + refactor example benchmark
* ggml : remove obsolete assert + refactor n_tasks section
* ggml : fix indentation in switch
* llama : avoid unnecessary bool
* ggml : remove comments from source file and match order in header
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
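A hedged sketch of the plan-based flow this entry describes; the names (ggml_graph_plan, ggml_cplan.work_size/work_data, ggml_graph_compute, ggml_graph_compute_with_ctx) follow the items above, but the exact signatures may differ from the final ggml API.

```c
#include <stdint.h>
#include <stdlib.h>
#include "ggml.h"

/* The caller now owns the work buffer instead of allocating it from a
 * ggml_context. */
void compute_graph(struct ggml_cgraph * gf, int n_threads) {
    struct ggml_cplan plan = ggml_graph_plan(gf, n_threads);

    uint8_t * work = NULL;
    if (plan.work_size > 0) {
        work = malloc(plan.work_size);
        plan.work_data = work;
    }

    ggml_graph_compute(gf, &plan);

    free(work);
}

/* The backwards-compatible helper keeps the old one-call usage for callers
 * that still have a context to allocate the work buffer from:
 *     ggml_graph_compute_with_ctx(ctx, gf, n_threads);                    */
```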
Fixes #1473
1. guess n_layers;
2. relax warnings on context size;
3. add a note that models derived from it are also supported.
Co-authored-by: Judd <foldl@boxvest.com>
The original file name, `ggml-alpaca-7b-q4.bin`, implied the first-generation GGML format. After the breaking changes (mentioned in https://github.com/ggerganov/llama.cpp/issues/382), `llama.cpp` now requires GGML v3, and those model files are named `*ggmlv3*.bin`. We should change the example to a model file that actually works, so that this is more likely to run out-of-the-box for more people, and fewer people waste time downloading the old Alpaca model.
* Use JavaScript generators as a much cleaner API
Also add ways to access the completion as a promise and via EventSource
* export llama_timings as struct and expose them in server
* update readme, update baked includes
* llama : uniform variable names + struct init
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update server instructions for web front end
* Update server README
* Remove duplicate OAI instructions
* Fix duplicate text
---------
Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
* Generalize quantize_fns for simpler FP16 handling (see the sketch after this entry)
* Remove call to ggml_cuda_mul_mat_get_wsize
* ci : disable FMA for mac os actions
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
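A hedged sketch of the kind of per-type callback table this generalization implies, with FP16 handled through the same to_float / from_float entry points as the quantized formats; the type and field names here are illustrative, not the exact ggml ones.

```c
#include <stddef.h>

/* One conversion table per tensor type, so F16 no longer needs to be
 * special-cased next to the quantized formats. */
typedef void (*to_float_t)  (const void * x, float * y, int k);
typedef void (*from_float_t)(const float * x, void * y, int k);

typedef struct {
    to_float_t   to_float;     /* e.g. a dequantize-row or fp16-to-fp32 routine */
    from_float_t from_float;   /* e.g. a quantize-row or fp32-to-fp16 routine   */
    size_t       type_size;    /* bytes per block                               */
    int          block_size;   /* elements per block (1 for F16/F32)            */
} quantize_fns_t;
```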
Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>