Age | Commit message | Author
2023-07-05 | Expose generation timings from server & update completions.js (#2116) | Tobias Lütke
    * use javascript generators as much cleaner API
      Also add ways to access completion as promise and EventSource
    * export llama_timings as struct and expose them in server
    * update readme, update baked includes
    * llama : uniform variable names + struct init
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-05 | Update Server Instructions (#2113) | Jesse Jojo Johnson
    * Update server instructions for web front end
    * Update server README
    * Remove duplicate OAI instructions
    * Fix duplicate text
    Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
2023-07-05 | ggml : fix bug introduced in #1237 | Georgi Gerganov
2023-07-05 | tests : fix test-grad0 | Georgi Gerganov
2023-07-05 | ggml : generalize `quantize_fns` for simpler FP16 handling (#1237) | Stephan Walter
    * Generalize quantize_fns for simpler FP16 handling
    * Remove call to ggml_cuda_mul_mat_get_wsize
    * ci : disable FMA for mac os actions
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-05 | Update server instructions for web front end (#2103) | Jesse Jojo Johnson
    Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
2023-07-05 | Quantized dot products for CUDA mul mat vec (#2067) | Johannes Gäßler
2023-07-05 | llama: Don't double count the sampling time (#2107) | Howard Su
2023-07-05 | Fixed OpenCL offloading prints (#2082) | Johannes Gäßler
2023-07-05 | embd-input: Fix input embedding example unsigned int seed (#2105) | Nigel Bosch
2023-07-04 | readme : add link web chat PR | Georgi Gerganov
2023-07-04 | ggml : sync latest (new ops, macros, refactoring) (#2106) | Georgi Gerganov
    - add ggml_argmax()
    - add ggml_tanh()
    - add ggml_elu()
    - refactor ggml_conv_1d() and variants
    - refactor ggml_conv_2d() and variants
    - add helper macros to reduce code duplication in ggml.c
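For readers unfamiliar with the three ops named in the commit above, the sketch below shows what argmax, tanh and ELU compute on a plain row of floats. It is a standalone illustration, not ggml's implementation or API: the names `argmax_row`, `tanh_row` and `elu_row` are hypothetical, and ELU with alpha = 1 is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: index of the largest element (first one wins on ties).
// The row must be non-empty.
std::size_t argmax_row(const std::vector<float> & x) {
    assert(!x.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < x.size(); ++i) {
        if (x[i] > x[best]) {
            best = i;
        }
    }
    return best;
}

// Hypothetical helper: element-wise hyperbolic tangent.
std::vector<float> tanh_row(const std::vector<float> & x) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = std::tanh(x[i]);
    }
    return y;
}

// Hypothetical helper: ELU with alpha = 1 (assumed),
// i.e. x for x > 0 and exp(x) - 1 otherwise.
std::vector<float> elu_row(const std::vector<float> & x) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] > 0.0f ? x[i] : std::expm1(x[i]);
    }
    return y;
}
```

In ggml itself these ops are graph nodes operating on tensors; the loops here only mirror the per-element math.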
2023-07-04 | Add an API example using server.cpp similar to OAI. (#2009) | jwj7140
    * add api_like_OAI.py
    * add evaluated token count to server
    * add /v1/ endpoints binding
2023-07-04 | Simple webchat for server (#1998) | Tobias Lütke
    * expose simple web interface on root domain
    * embed index and add --path for choosing static dir
    * allow server to multithread
      because web browsers send a lot of garbage requests we want the server to multithread when serving 404s for favicons etc. To avoid blowing up llama we just take a mutex when it's invoked.
    * let's try this with the xxd tool instead and see if msvc is happier with that
    * enable server in Makefiles
    * add /completion.js file to make it easy to use the server from js
    * slightly nicer css
    * rework state management into session, expose historyTemplate to settings
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
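The "take a mutex when it's invoked" idea from the commit above can be sketched as follows. This is a minimal illustration under assumptions, not the server's actual code: `FakeModel`, `Server`, `handle_not_found`, `handle_completion` and `run_concurrent` are all hypothetical names, and the model is replaced by a counter.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the single shared model context.
struct FakeModel {
    int calls = 0;
    std::string complete(const std::string & prompt) {
        ++calls; // unsynchronized on its own, so callers must serialize access
        return prompt + "...";
    }
};

struct Server {
    FakeModel  model;
    std::mutex model_mutex;

    // Cheap requests (for example a 404 for /favicon.ico) never touch the
    // model, so they can run on any thread without taking the lock.
    int handle_not_found() const { return 404; }

    // Only the model call is serialized.
    std::string handle_completion(const std::string & prompt) {
        std::lock_guard<std::mutex> lock(model_mutex);
        return model.complete(prompt);
    }
};

// Fire completions from several threads and return how many reached the model.
int run_concurrent(Server & srv, int n_threads, int per_thread) {
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&srv, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                srv.handle_completion("hi");
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
    return srv.model.calls;
}
```

The design choice is to keep the HTTP layer multithreaded while treating the model as a single-user resource; throughput for completions stays sequential, but trivial requests no longer queue behind a long generation.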
2023-07-04 | Allow old Make to build server. (#2098) | Henri Vasserman
    Also make server build by default. Tested with Make 3.82
2023-07-04 | Update Makefile: clean simple (#2097) | ZhouYuChen
2023-07-04 | CI: make the brew update temporarily optional. (#2092) | Erik Scholz
    until they decide to fix the brew installation in the macos runners. see the open issues. eg https://github.com/actions/runner-images/pull/7710
2023-07-04 | [ggml] fix index for ne03 value in ggml_cl_mul_f32 (#2088) | Govlzkoy
2023-07-04 | fix server crashes (#2076) | Henri Vasserman
2023-07-03 | Fix crash of test-tokenizer-0 under Debug build (#2064) | Howard Su
    * Fix crash of test-tokenizer-0 under Debug build
    * Change per comment
2023-07-03 | [llama] No need to check file version when loading vocab score (#2079) | Howard Su
2023-07-03 | server: add option to output probabilities for completion (#1962) | WangHaoranRobin
    * server: add option to output probabilities for completion
    * server: fix issue when handling probability output for incomplete tokens for multibyte character generation
    * server: fix llama_sample_top_k order
    * examples/common.h: put all bool variables in gpt_params together
2023-07-02 | ggml : fix build with OpenBLAS (close #2066) | Georgi Gerganov
2023-07-01 | Better CUDA synchronization logic (#2057) | Johannes Gäßler
2023-07-01 | Test-based VRAM scratch size + context adjustment (#2056) | Johannes Gäßler
2023-07-01 | cmake : don't force -mcpu=native on aarch64 (#2063) | Daniel Drake
    It's currently not possible to cross-compile llama.cpp for aarch64 because CMakeLists.txt forces -mcpu=native for that target. -mcpu=native doesn't make sense if your build host is not the target architecture, and clang rejects it for that reason, aborting the build. This can be easily reproduced using the current Android NDK to build for aarch64 on an x86_64 host.
    If there is not a specific CPU-tuning target for aarch64 then -mcpu should be omitted completely. I think that makes sense, there is not enough variance in the aarch64 instruction set to warrant a fixed -mcpu optimization at this point. And if someone is building natively and wishes to enable any possible optimizations for the host device, then there is already the LLAMA_NATIVE option available.
    Fixes #495.
2023-07-01 | metal : release buffers when freeing metal context (#2062) | Aaron Miller
2023-07-01 | convert : add support of baichuan-7b (#2055) | Judd
    Co-authored-by: Judd <foldl@boxvest.com>
2023-07-01 | llama : fix return value of llama_load_session_file_internal (#2022) | Georgi Gerganov
2023-07-01 | llama : catch llama_load_session_file_internal exceptions (#2022) | Rand Xie
    * convert checks in llama_load_session_file to throw and handle them
    * make llama_load_session_file_internal static
    * address feedbacks to avoid using exceptions
2023-07-01 | embd-input : fix returning ptr to temporary | Georgi Gerganov
2023-07-01 | train : fix compile warning | Georgi Gerganov
2023-07-01 | ggml : disable GGML_TASK_INIT and GGML_TASK_FINALIZE by default (#1995) | Qingyou Meng
    Will not be scheduled unless explicitly enabled.
2023-06-29 | Use unsigned for random seed (#2006) | Howard Su
    * Use unsigned for random seed. Keep -1 as the value to use a time based seed.
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
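One way an unsigned seed can still honor "-1 means time based" is sketched below. It relies on the fact that -1 converted to a 32-bit unsigned integer is 0xFFFFFFFF. The names `SEED_FROM_TIME`, `resolve_seed_at` and `resolve_seed` are hypothetical and not taken from the commit; the actual change may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>

// Hypothetical sentinel: -1 wraps to the largest 32-bit unsigned value.
constexpr std::uint32_t SEED_FROM_TIME = static_cast<std::uint32_t>(-1);

// Testable core: the clock value is passed in explicitly.
std::uint32_t resolve_seed_at(std::uint32_t requested, std::uint32_t now) {
    return requested == SEED_FROM_TIME ? now : requested;
}

// Convenience wrapper that reads the wall clock.
std::uint32_t resolve_seed(std::uint32_t requested) {
    return resolve_seed_at(requested, static_cast<std::uint32_t>(std::time(nullptr)));
}
```

A consequence of this scheme is that 0xFFFFFFFF can no longer be requested as a literal seed, which is usually an acceptable trade for keeping the familiar `-1` on the command line.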
2023-06-29 | Porting the improved K-Quant CUDA kernels to OpenCL (#1966) | LostRuins
    * Added broken new q4k quant
    * xx + ib0
    * Fix q2_k fast kernel
    * Use preprocessor for QK_K
    * Add q6_k fast matmul kernel
    * ported q3k speedup successfully
    * ported q2k and q5k speedups
    * remove old dot kernels and template
    * fixed global const struct types
    * fixing address spaces
    * fixed string too long CI issue
    Co-authored-by: 0cc4m <picard12@live.de>
2023-06-28 | llama : replacing auto &kv with const auto &kv (#2041) | m3ndax
    * Replacing auto &kv with const auto &kv
    * Create codacy.yml
    * Delete codacy.yml
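The effect of switching a range-for binding from `auto &kv` to `const auto &kv` can be illustrated with a small standalone function. The map contents and the name `sum_values` are hypothetical; the commit's actual loops are not shown here.

```cpp
#include <cassert>
#include <map>
#include <string>

// Reads every entry of a mutable map without being able to modify it.
int sum_values(std::map<std::string, int> & counts) {
    int total = 0;
    // With `auto & kv` the loop body could write to kv.second by accident.
    // `const auto & kv` binds each element as
    // const std::pair<const std::string, int> &, so the entry is read-only
    // here and no copy of the pair is made.
    for (const auto & kv : counts) {
        total += kv.second;
        // kv.second = 0; // would not compile
    }
    return total;
}
```

The generated code is normally identical; the gain is that the read-only intent is checked by the compiler.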
2023-06-28 | cuda : remove nchannels_x argument from mul_mat_vec_nc_f16_f32 (#2028) | Salvador E. Tropea
    - Not used
2023-06-28 | cuda : fix missing const qualifier in casts (#2027) | Salvador E. Tropea
2023-06-28 | llama : remove shards weight file support (#2000) | Howard Su
    * Remove multiple shards
    * Remove multiple file loaders
    * Remove llama_load_tensor_shard class
    * Simplify load logic
    * Remove dead code guess_n_parts function
    * Remove vocab_only from constructor of llama_model_loader
    * Remove alignment_prevents_mmap which is no longer needed.
    * Remove useless check
2023-06-28 | CUDA GPU acceleration for LoRAs + f16 models (#1970) | Johannes Gäßler
2023-06-28 | llama : support input embeddings directly (#1910) | ningshanwutuobang
    * add interface for float input
    * fixed inpL shape and type
    * add examples of input floats
    * add test example for embd input
    * fixed sampling
    * add free for context
    * fixed add end condition for generating
    * add examples for llava.py
    * add README for llava.py
    * add README for llava.py
    * add example of PandaGPT
    * refactor the interface and fixed the styles
    * add cmake build for embd-input
    * add cmake build for embd-input
    * Add MiniGPT-4 example
    * change the order of the args of llama_eval_internal
    * fix ci error
2023-06-27 | fix pthreads setaffinity usage on android (#2020) | Erik Scholz
2023-06-27 | baby-llama : fix build after ggml_rope change (#2016) | Howard Su
2023-06-27 | llama : fix rope usage after ChatGLM change | Georgi Gerganov
2023-06-27 | ggml : add support for ChatGLM RoPE | Georgi Gerganov
2023-06-26 | readme : add Scala 3 bindings repo (#2010) | Roman Parykin
2023-06-26 | ggml : increase max tensor name + clean up compiler warnings in train-text (#1988) | David Yang
    * Clean up compiler warnings in train-text
      Some brackets to disambiguate order of operations
    * Increase GGML_MAX_NAME
      Avoiding strncpy danger in train-text-from-scratch and reducing potential future name length issues
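The "strncpy danger" mentioned above is that `strncpy` leaves the destination without a terminating NUL when the source is at least as long as the buffer. The sketch below is a generic illustration of that hazard and one common remedy, not the code from the commit (which, per its message, increased GGML_MAX_NAME). `MAX_NAME` and `set_name_safe` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical fixed name capacity, including the terminating NUL.
constexpr std::size_t MAX_NAME = 8;

// std::strncpy(dst, src, MAX_NAME) would copy 8 bytes of a long name and
// write no terminator, so a later strlen/printf on dst would read past the
// buffer. std::snprintf always terminates and truncates to MAX_NAME - 1
// characters instead.
void set_name_safe(char (&dst)[MAX_NAME], const char * src) {
    std::snprintf(dst, sizeof(dst), "%s", src);
}
```

Truncation is silent in this version; a caller that must detect it can compare the return value of `std::snprintf` against the buffer size.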
2023-06-26 | readme : LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux (#2007) | Gustavo Rocha Dias
    * docs - Alternative way to build at Android, with CLBlast.
    * doc - LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux.
    * doc- fix typo
2023-06-26 | ggml : avoid conv 2d kernel round up | Georgi Gerganov
2023-06-26 | ggml : add NUMA support (#1556) | zrm
    * detect NUMA systems and pin work threads to nodes (linux)
    * disable mmap prefetch/readahead for NUMA systems
    * avoid sending finalize op to thread pool if it does nothing
    * silence robot
    * fix args
    * make --numa a param
    * recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement
    * lower synchronization overhead
    * statically allocate
    * move numa state to g_state
    * add description for --numa
    * ggml : minor style changes
    * ggml : minor style + try fix sanitizer build
    * llama : allow to initialize backend with NUMA support
    * llama : avoid ggml include in llama-util.h
    * ggml : style / formatting
    * ggml : fix handling of ops with n_threads > n_tasks > 1
    * server : utilize numa parameter
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>