llama.cpp.git - llama.cpp

Age	Commit message	Author
2023-07-11	Merge branch 'ggerganov:master' into master	SIGSEGV

2023-07-11	docker : add '--server' option (#2174)	Jinwoo Jeong

2023-07-11	Merge branch 'ggerganov:master' into master	SIGSEGV

2023-07-11	readme : fix zig build instructions (#2171)	Chad Brewbaker

2023-07-11	Support using mmap when applying LoRA (#2095)	Howard Su
	* Support using mmap when applying LoRA
	* Fix Linux
	* Update comment to reflect the support lora with mmap
2023-07-11	Possible solution to allow K-quants on models with n_vocab!=32000 (#2148)	LostRuins
	* This allows LLAMA models that were previously incompatible with K quants to function mostly as normal. This happens when a model has a vocab != 32000, e.g. 32001, which means it's not divisible by 256 or 64. Since the problematic dimensions only apply for `tok_embeddings.weight` and `output.weight` (dimensions 4096 x n_vocab), we can simply quantize these layers to Q8_0 whereas the majority of the hidden layers are still K-quanted since they have compatible dimensions.
	* Fix indentation
	  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
	* As an alternative, to avoid failing on Metal due to lack of Q8_0 support, instead quantize tok_embeddings.weight to Q4_0 and retain output.weight as F16. This results in a net gain of about 55mb for a 7B model compared to previous approach, but should minimize adverse impact to model quality.
	---------
	Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
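
The fallback described at the end of the message above can be sketched as follows. This is a minimal Python illustration, not the project's C++ code: the helper name `choose_tensor_type`, the string type labels, and the single block-size constant of 256 are assumptions made for the example (the message also mentions 64).

```python
QK_K = 256  # assumed K-quant block size; the commit message also mentions 64


def choose_tensor_type(name, shape, requested):
    """Pick a quantization type for one tensor (hypothetical helper)."""
    if not requested.endswith("_K"):
        return requested  # not a K-quant request: nothing to check
    if all(dim % QK_K == 0 for dim in shape):
        return requested  # every dimension is a whole number of blocks
    # Incompatible dimensions: apply the fallback named in the commit message.
    if name == "tok_embeddings.weight":
        return "Q4_0"
    if name == "output.weight":
        return "F16"
    raise ValueError(f"{name}: shape {shape} is not divisible by {QK_K}")
```

With n_vocab = 32001, the two vocabulary-sized tensors take the fallback types, while a 4096 x 4096 hidden-layer tensor keeps the requested K-quant type.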
2023-07-11	Merge branch 'ggerganov:master' into master	SIGSEGV

2023-07-10	mpi : add support for distributed inference via MPI (#2099)	Evan Miller
	* MPI support, first cut
	* fix warnings, update README
	* fixes
	* wrap includes
	* PR comments
	* Update CMakeLists.txt
	* Add GH workflow, fix test
	* Add info to README
	* mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099)
	* mpi : add names for layer inputs + prep ggml_mpi_graph_compute()
	* mpi : move all MPI logic into ggml-mpi
	  Not tested yet
	* mpi : various fixes - communication now works but results are wrong
	* mpi : fix output tensor after MPI compute (still not working)
	* mpi : fix inference
	* mpi : minor
	* Add OpenMPI to GH action
	* [mpi] continue-on-error: true
	* mpi : fix after master merge
	* [mpi] Link MPI C++ libraries to fix OpenMPI
	* tests : fix new llama_backend API
	* [mpi] use MPI_INT32_T
	* mpi : factor out recv / send in functions and reuse
	* mpi : extend API to allow usage with outer backends (e.g. Metal)
	---------
	Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-10	update flake.lock	aditya

2023-07-10	add pip	aditya

2023-07-09	llama : remove "first token must be BOS" restriction (#2153)	oobabooga

2023-07-09	main : escape prompt prefix/suffix (#2151)	Nigel Bosch

2023-07-09	readme : update Termux instructions (#2147)	JackJollimore
	The file path matters when running models inside Termux on Android devices: llama.cpp performs better when the .bin is loaded from the $HOME directory.
2023-07-09	ggml : fix building with Intel MKL but ask for "cblas.h" issue (#2104) (#2115)	clyang
	* Fix building with Intel MKL but ask for "cblas.h" issue
	* Use angle brackets to indicate the system library
2023-07-09	readme : add more docs indexes (#2127)	rankaiyx
	* Update README.md to add more docs indexes
	* Update README.md to add more docs indexes
2023-07-08	Fixed OpenLLaMA 3b CUDA mul_mat_vec_q (#2144)	Johannes Gäßler

2023-07-08	CUDA: add __restrict__ to mul mat vec kernels (#2140)	Johannes Gäßler

2023-07-07	docker : add support for CUDA in docker (#1461)	dylan
	Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
	Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-07	ci : switch threads to 1 (#2138)	Georgi Gerganov

2023-07-07	ggml : change ggml_graph_compute() API to not require context (#1999)	Qingyou Meng
	* ggml_graph_compute: deprecate using ggml_context, try resolve issue #287
	* rewrite: no longer consider backward compatibility; plan and make_plan
	* minor: rename ctx as plan; const
	* remove ggml_graph_compute from tests/test-grad0.c, but current change breaks backward
	* add static ggml_graph_compute_sugar()
	* minor: update comments
	* reusable buffers
	* ggml : more consistent naming + metal fixes
	* ggml : fix docs
	* tests : disable grad / opt + minor naming changes
	* ggml : add ggml_graph_compute_with_ctx()
	  - backwards compatible API
	  - deduplicates a lot of copy-paste
	* ci : enable test-grad0
	* examples : factor out plan allocation into a helper function
	* llama : factor out plan stuff into a helper function
	* ci : fix env
	* llama : fix duplicate symbols + refactor example benchmark
	* ggml : remove obsolete assert + refactor n_tasks section
	* ggml : fix indentation in switch
	* llama : avoid unnecessary bool
	* ggml : remove comments from source file and match order in header
	---------
	Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
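
The "plan, then compute with a reusable buffer" idea named in the message above can be illustrated with a small sketch. Everything here is hypothetical and written in Python for brevity: `Plan`, `make_plan`, `compute`, and the graph representation are stand-ins chosen for the example and are not ggml's C API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical graph: a list of (node name, scratch bytes that node needs).
Graph = List[Tuple[str, int]]


@dataclass
class Plan:
    n_threads: int
    work_size: int                         # scratch bytes the caller must supply
    work_data: Optional[bytearray] = None  # caller-owned, reusable across runs


def make_plan(graph: Graph, n_threads: int) -> Plan:
    """Inspect the graph up front and report how much scratch memory it needs."""
    work_size = max((scratch for _, scratch in graph), default=0)
    return Plan(n_threads=n_threads, work_size=work_size)


def compute(graph: Graph, plan: Plan) -> List[str]:
    """Run the graph using only the buffer attached to the plan (no context object)."""
    if plan.work_size > 0 and (
        plan.work_data is None or len(plan.work_data) < plan.work_size
    ):
        raise ValueError("plan.work_data must hold at least plan.work_size bytes")
    return [name for name, _ in graph]  # stand-in for executing each node
```

In this sketch the caller allocates `work_data` once, sized from `plan.work_size`, and can reuse the same buffer for later runs of the same graph.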
2023-07-07	ggml : remove sched_yield() call in ggml_graph_compute_thread() (#2134)	Georgi Gerganov

2023-07-07	convert.py: add mapping for safetensors bf16 (#1598)	Aarni Koskela
	Fixes #1473
2023-07-07	Fix opencl by wrap #if-else-endif with \n (#2086)	Howard Su

2023-07-06	ggml : fix restrict usage	Georgi Gerganov

2023-07-06	convert : update for baichuan (#2081)	Judd
	1. guess n_layers;
	2. relax warnings on context size;
	3. add a note that its derivations are also supported.
	Co-authored-by: Judd <foldl@boxvest.com>
2023-07-06	alpaca.sh : update model file name (#2074)	tslmy
	The original file name, `ggml-alpaca-7b-q4.bin`, implied the first-generation GGML. After the breaking changes (mentioned in https://github.com/ggerganov/llama.cpp/issues/382), `llama.cpp` requires GGML V3 now. Those model files are named `ggmlv3.bin`. We should change the example to an actually working model file, so that this thing is more likely to run out-of-the-box for more people, and fewer people would waste time downloading the old Alpaca model.
2023-07-05	Expose generation timings from server & update completions.js (#2116)	Tobias Lütke
	* use javascript generators as much cleaner API
	  Also add ways to access completion as promise and EventSource
	* export llama_timings as struct and expose them in server
	* update readme, update baked includes
	* llama : uniform variable names + struct init
	---------
	Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-05	Update Server Instructions (#2113)	Jesse Jojo Johnson
	* Update server instructions for web front end
	* Update server README
	* Remove duplicate OAI instructions
	* Fix duplicate text
	---------
	Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
2023-07-05	ggml : fix bug introduced in #1237	Georgi Gerganov

2023-07-05	tests : fix test-grad0	Georgi Gerganov

2023-07-05	ggml : generalize `quantize_fns` for simpler FP16 handling (#1237)	Stephan Walter
	* Generalize quantize_fns for simpler FP16 handling
	* Remove call to ggml_cuda_mul_mat_get_wsize
	* ci : disable FMA for mac os actions
	---------
	Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-05	Update server instructions for web front end (#2103)	Jesse Jojo Johnson
	Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
2023-07-05	Quantized dot products for CUDA mul mat vec (#2067)	Johannes Gäßler

2023-07-05	llama: Don't double count the sampling time (#2107)	Howard Su

2023-07-05	Fixed OpenCL offloading prints (#2082)	Johannes Gäßler

2023-07-05	embd-input: Fix input embedding example unsigned int seed (#2105)	Nigel Bosch

2023-07-04	readme : add link web chat PR	Georgi Gerganov

2023-07-04	ggml : sync latest (new ops, macros, refactoring) (#2106)	Georgi Gerganov
	- add ggml_argmax()
	- add ggml_tanh()
	- add ggml_elu()
	- refactor ggml_conv_1d() and variants
	- refactor ggml_conv_2d() and variants
	- add helper macros to reduce code duplication in ggml.c
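
For readers unfamiliar with the three new ops, the underlying math can be written out in a few lines. This is a plain-Python sketch of the standard definitions, not ggml code; in particular, ELU is shown with alpha = 1, which is an assumption of this example, and the functions operate on simple lists rather than tensors.

```python
import math


def elu(xs):
    """Element-wise ELU with alpha = 1 (assumed): x if x > 0, else exp(x) - 1."""
    return [x if x > 0 else math.exp(x) - 1.0 for x in xs]


def tanh(xs):
    """Element-wise hyperbolic tangent."""
    return [math.tanh(x) for x in xs]


def argmax(xs):
    """Index of the largest element (the first one on ties)."""
    return max(range(len(xs)), key=lambda i: xs[i])
```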
2023-07-04	Add an API example using server.cpp similar to OAI. (#2009)	jwj7140
	* add api_like_OAI.py
	* add evaluated token count to server
	* add /v1/ endpoints binding
2023-07-04	Simple webchat for server (#1998)	Tobias Lütke
	* expose simple web interface on root domain
	* embed index and add --path for choosing static dir
	* allow server to multithread
	  because web browsers send a lot of garbage requests we want the server to multithread when serving 404s for favicon's etc. To avoid blowing up llama we just take a mutex when it's invoked.
	* let's try this with the xxd tool instead and see if msvc is happier with that
	* enable server in Makefiles
	* add /completion.js file to make it easy to use the server from js
	* slightly nicer css
	* rework state management into session, expose historyTemplate to settings
	---------
	Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-07-04	Allow old Make to build server. (#2098)	Henri Vasserman
	Also make server build by default. Tested with Make 3.82
2023-07-04	Update Makefile: clean simple (#2097)	ZhouYuChen

2023-07-04	CI: make the brew update temporarily optional. (#2092)	Erik Scholz
	until they decide to fix the brew installation in the macos runners. see the open issues. eg https://github.com/actions/runner-images/pull/7710
2023-07-04	[ggml] fix index for ne03 value in ggml_cl_mul_f32 (#2088)	Govlzkoy

2023-07-04	fix server crashes (#2076)	Henri Vasserman

2023-07-03	Fix crash of test-tokenizer-0 under Debug build (#2064)	Howard Su
	* Fix crash of test-tokenizer-0 under Debug build
	* Change per comment
2023-07-03	[llama] No need to check file version when loading vocab score (#2079)	Howard Su

2023-07-03	server: add option to output probabilities for completion (#1962)	WangHaoranRobin
	* server: add option to output probabilities for completion
	* server: fix issue when handling probability output for incomplete tokens for multibyte character generation
	* server: fix llama_sample_top_k order
	* examples/common.h: put all bool variables in gpt_params together
2023-07-02	ggml : fix build with OpenBLAS (close #2066)	Georgi Gerganov

2023-07-01	Better CUDA synchronization logic (#2057)	Johannes Gäßler