Age | Commit message (Collapse) | Author |
|
1. guess n_layers;
2. relax warnings on context size;
3. add a note that its derivations are also supported.
Co-authored-by: Judd <foldl@boxvest.com>
|
|
* Use unsigned for random seed. Keep -1 as the value to use a time based seed.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* detect NUMA systems and pin work threads to nodes (linux)
* disable mmap prefetch/readahead for NUMA systems
* avoid sending finalize op to thread pool if it does nothing
* silence robot
* fix args
* make --numa a param
* recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement
* lower synchronization overhead
* statically allocate
* move numa state to g_state
* add description for --numa
* ggml : minor style changes
* ggml : minor style + try fix sanitizer build
* llama : allow to initialize backend with NUMA support
* llama : avoid ggml include in llama-util.h
* ggml : style / formatting
* ggml : fix handling of ops with n_threads > n_tasks > 1
* server : utilize numa parameter
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* llama : make model stateless and context stateful
* llama : minor cleanup
* llama : update internal API declaration
* Apply suggestions from code review
fix style
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Missing model memory release
* Fix style
* Add deprecated warning for public API function llama_init_from_file
* Update public API use cases: move away from deprecated llama_init_from_file
* Deprecate public API function llama_apply_lora_from_file
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
MinGW libstdc++ may define `NOMINMAX` unconditionally. This fixes the case when it is already defined.
|
|
|
|
* Fixed CUDA RoPE
* ggml_cuda_mul_mat_vec_p021
* ggml_cuda_scale
* ggml_cuda_diag_mask_inf
* ggml_is_permuted
* ggml_cuda_cpy
* flatten rows for ggml_cuda_op
* Added a --low-vram option
* Fixed Windows performance
* Fixed LLAMA_CUDA_DMMV_Y > 1 for WizardLM
|
|
|
|
* Fix issue where interactive mode in the main example crashes when input exceeds ctx size
* Ensure the context size is at least 8 tokens in the main example.
Closes #1768
|
|
The prompt cache constitutes a nice speed up when using the same prompt
prefix across multiple evaluations, but when using it, it will also be
updated, which is not always desirable. One use case is to have a large
prompt containing some context and usage rules, and a second part
containing variable data of the problem being studied. In this case it's
desirable to be able to save the first part once, and to always reuse it
as-is without updating it with the second part.
The new argument --prompt-cache-ro enables this read-only mode on the
prompt cache. The prompt's contents that match the cache are loaded
from the cache but the rest is not modified. This allowed to reduce a
total analysis time from 112s to 49.7s here, without having to backup
and restore a copy of the prompt, which takes significant time at 500
MB.
Signed-off-by: Willy Tarreau <w@1wt.eu>
|
|
* CUDA multi GPU + scratch
ggml_cuda_compute_forward
Tensor parallelism
ggml_cuda_add
ggml_cuda_rms_norm
ggml_cuda_silu
CUDA scratch buffer
--main-gpu CLI option
|
|
* mtl : export the LLaMA computation graph
* ci : disable temporary
* mtl : adapt the MNIST example as starter
* mtl : no need for mtl-export tool, add cli arg for main instead
* mtl : export just a small part of the graph for now to make it easier
* mtl : move MSL code into separate file for easy editing
* mtl : initial get_rows_q4_0 kernel
* mtl : confirmed get_rows_q4_0 is working correctly
* mtl : add rms_norm kernel + confirm working
* mtl : add mul kernel + confirm working
* mtl : initial mul_mat Q4 kernel (wrong results)
* mtl : mul_mat fixes (still wrong)
* mtl : another mul_mat Q4 (still does not work)
* mtl : working mul_mat q4
* ggml : fix handling of "view" ops in ggml_graph_import()
* mtl : add rope kernel
* mtl : add reshape and transpose handling
* ggml : store offset as opt arg for ggml_view_xd() operators
* mtl : add cpy kernel + handle view ops
* mtl : confirm f16 x f32 attention mul mat
* mtl : add scale kernel
* mtl : add diag_mask_inf kernel
* mtl : fix soft_max kernel
* ggml : update ggml_nbytes() to handle non-contiguous tensors
* mtl : verify V tensor contents
* mtl : add f32 -> f32 cpy kernel
* mtl : add silu kernel
* mtl : add non-broadcast mul kernel
* mtl : full GPU inference of the computation graph
* mtl : optimize rms_norm and soft_max kernels
* mtl : add f16 mat x f32 vec multiplication kernel
* mtl : fix bug in f16 x f32 mul mat + speed-up computation
* mtl : faster mul_mat_q4_0_f32 kernel
* mtl : fix kernel signature + roll inner loop
* mtl : more threads for rms_norm + better timing
* mtl : remove printfs from inner loop
* mtl : simplify implementation
* mtl : add save/load vocab to ggml file
* mtl : plug Metal inference into llama.cpp (very quick-n-dirty)
* mtl : make it work with main example
Lots of hacks but at least now it generates text
* mtl : preparing for merge
* mtl : clean-up ggml mtl interface + suport scratch / inplace
* mtl : remove temp / debug code
* metal : final refactoring and simplification
* Revert "ci : disable temporary"
This reverts commit 98c267fc77fe811082f672538fc91bcfc9072d63.
* metal : add comments
* metal : clean-up stuff, fix typos
* readme : add Metal instructions
* readme : add example for main
|
|
* Fix prompt cache saving and chat-persistent rollover (fixes #1670)
* clang-tidy
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
---------
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
|
|
* Work around for recalculating logits in cached prompts
|
|
1. Add a `LLAMA_SUPPORTS_GPU_OFFLOAD` define to `llama.h` (defined when compiled with CLBlast or cuBLAS)
2. Update the argument handling in the common example code to only show the `-ngl`, `--n-gpu-layers` option when GPU offload is possible.
3. Add an entry for the `-ngl`, `--n-gpu-layers` option to the `main` and `server` examples documentation
4. Update `main` and `server` examples documentation to use the new style dash separator argument format
5. Update the `server` example to use dash separators for its arguments and adds `-ngl` to `--help` (only shown when compiled with appropriate support). It will still support `--memory_f32` and `--ctx_size` for compatibility.
6. Add a warning discouraging use of `--memory-f32` for the `main` and `server` examples `--help` text as well as documentation. Rationale: https://github.com/ggerganov/llama.cpp/discussions/1593#discussioncomment-6004356
|
|
Improvements to loading the session with `--prompt-cache` in the `main` example.
1. Fix an issue where the `--seed` parameter was ignored when loading a cached prompt.
2. When loading a cached prompt, you previously had to specify the saved prompt (or a prefix of it) again. This pull changes that behavior to default to the prompt that was cached if a prompt wasn't specified by the user.
|
|
|
|
mode (#1032)
* Make reverse prompt option act as a stop token in non-interactive scenarios
* Making requested review changes
* Update gpt_params_parse and fix a merge error
* Revert "Update gpt_params_parse and fix a merge error"
This reverts commit 2bb2ff1748513591ad45b175a75ed1d8089d84c8.
* Update gpt_params_parse and fix a merge error take 2
|
|
* Fix for w64devkit and mingw
|
|
|
|
|
|
* main : add option to save full output to session
* split behavior into --session and --prompt-cache
* restore original implementation with new names
* PR comments
* move the check for incompatible parameters to gpt_params_parse
* Fix whitespace
Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>
---------
Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>
|
|
(#1040)
* Interface improvements
* Multiline input
* Track character width
* Works with all characters and control codes + Windows console fixes
|
|
* llama : require first token to be BOS
* scripts : add ppl-run-all.sh
* perplexity : add BOS for each chunk
* readme : update perplexity values after BOS fix
* perplexity : add clarifying comments
|
|
|
|
* adding --in-suffix option
* print input suffix before generation
|
|
|
|
|
|
* fix reverse prompt and multi line
* Code Formatting
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
Signed-off-by: deadprogram <ron@hybridgroup.com>
|
|
Signed-off-by: deadprogram <ron@hybridgroup.com>
|
|
|
|
Signed-off-by: deadprogram <ron@hybridgroup.com>
|
|
* Add git-based build information for better issue tracking
* macOS fix
* "build (hash)" and "CMAKE_SOURCE_DIR" changes
* Redo "CMAKE_CURRENT_SOURCE_DIR" and clearer build messages
* Fix conditional dependency on missing target
* Broke out build-info.cmake, added find_package fallback, and added build into to all examples, added dependencies to Makefile
* 4 space indenting for cmake, attempt to clean up my mess in Makefile
* Short hash, less fancy Makefile, and don't modify build-info.h if it wouldn't change it
|
|
|
|
|
|
* Sample interface, new samplers.
New samplers:
- locally typical sampling
- tail free sampling
- frequency and presence penalty
- mirostat
Ignore EOS fix: -inf should be used.
* mirostat
* Added --logit-bias and --no-penalize-nl, removed std::span
* Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and *k)
Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and *k)
* Save and load example adjust
* Tests
* Windows build fix
* Windows test fix
|
|
|
|
|
|
|
|
|
|
|
|
* print timings on ctrl+c exit
* remove redundant free memory call.
* add global pointer to ctx.
|
|
* examples : evaluate tokens in batches after swapping context
* Update examples/main/main.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
|
|
This reverts commit f4d277ae17247ee51129ef1a9ff74d377cc90b1b.
|
|
* Add support for configs, add configurable prefixes / suffixes, deprecate instruct mode, add stop prompt
* Add multiline mode, update text input.
* bugfix
* update implementation
* typos
* Change --multiline implementation to be toggled by EOF.
* bugfix
* default multiline mode
* add more configs
* update formating
* update formatting
* apply suggestions
|