llama.cpp.git - llama.cpp

Age	Commit message (Collapse)	Author
2023-05-13	ggml : multi-thread mul and diag_mask ops (#1428)	Georgi Gerganov

2023-05-13	ggml : GPU-accelerated token generation (#1412)	Johannes Gäßler
	* CUDA kernel for q4_0 dequant. + mat. vec. mult. * Added q4_1 via template * Added missing __syncthreads(); * --gpu_layers -> --gpu-layers * Shorter dequantize_mul_mat_vec line * q5_0 dequantize_mul_mat kernel * More readable dequantize_mul_mat_vec logic * dequantize_mul_mat_vec kernels for q5_1, q8_0, f16 * llama : offload "output" tensor to GPU too + coding style fixes --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13	ggml : implement backward pass for llama + small training-llama-from-scratch ↵	xaedes
	example (#1360) * implement 8 of 14 missing backward pass operations used by llama - GGML_OP_ADD_AT - GGML_OP_CPY - GGML_OP_MUL_MAT (src0.grad) - GGML_OP_PERMUTE - GGML_OP_RESHAPE - GGML_OP_SCALE - GGML_OP_TRANSPOSE - GGML_OP_VIEW implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW. this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset). the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0. still missing backward passes for llama: - GGML_OP_DIAG_MASK_INF - GGML_OP_GET_ROWS - GGML_OP_RMS_NORM - GGML_OP_ROPE - GGML_OP_SILU - GGML_OP_SOFT_MAX * implement 5 of 6 missing backward pass operations used by llama - GGML_OP_DIAG_MASK_INF - GGML_OP_GET_ROWS - GGML_OP_RMS_NORM - GGML_OP_SILU - GGML_OP_SOFT_MAX add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1. GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know... GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF. Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants. staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and functions with "_inplace" are added which are inplace. in llama we need to call the inplace variants so that it is implemented as before. for llama backward pass we need to use the non-inplace variants. still not completely implemented backward passes for llama: - GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK - GGML_OP_GET_ROWS: only necessary for tokenizer * norm & rms_norm can not be threaded: after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees. * remove already resolved TODO * implement backward pass of ggml_rope and ggml_rope_back * implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back * add test-grad0.c * use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console * test both gradients of mul_mat * disable graph dot export as it floods console * bug fixes for silu_back * successfully test silu backward * bug fix for scale backward pass use sum instead of mean for gradient of scalar scale parameter * successfully test scale backward * improve performance of sum backward pass use add1(x,y) instead of add(x,repeat(y,x)) * improve performance of sqr backward pass use scale(x,y) instead of mul(x,repeat(y,x)) * successfully test rope backward * bug fix for cpy backward pass * successfully test cpy backward * bug fix for reshape backward pass * successfully test reshape backward * add test-opt.c this uses ggml_opt to train a,b for minimal e=sum(sqr(c - ab)) for random initial a,b,c correctly implement softmax backward pass using new operation ggml_diag ggml_diag constructs diagonal matrices with entries. ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d] * successfully test soft_max backward * align shape annotations * add shape annotations for llama * de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type. with this we can duplicate tensor of any typ as long as they are contiguous. * fix ggml_compute_forward_dup_same_cont for when nelements < nthreads when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy * bug fix for add_at forward required for view backward pass src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function. * successfully test view backward * minor code format improvement * fix ggml_forward_add functions to work correctly with transposed tensors uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions. this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32. * fix ggml_forward_add1 functions to work correctly with transposed tensors uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions. this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32. * test-grad0.c : add print_elements to help with debugging * successfully test permute backward * some minor test-grad0 fixes * fix sub, mul and div functions to work correctly with transposed tensors uses the same logic as in add * implement ggml_cont backward pass * successfully test transpose backward and permute for all permutations also test sub, mul and div up to max n_dims * test-grad0.c add TODO for view_2d and view_3d add_at (required for view backward pass) is a bit tricky for n_dims > 1. * fix comments * successfully test diag_mask_inf and diag_mask_zero backward * test-grad0 : fix test for div nargs and ndims was swapped, corrupting the stack * fix diag_mask to work with non-inplace input * move dup call into the actual add_at functions * fix get rows backward pass * successfully test get_rows backward * fix view backward pass add nb parameters to add_at like in view. together with offset they define how to view dst and src0 during the add_at operation. * successfully test backward pass of view_1d, view_2d and view_3d * fix backward pass for rms_norm I would have used formulas from other frameworks, but they differed so I could not decide which is correct. Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification. * successfully test backward pass of rms_norm some tests may fail when gradients are large. could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds. when looking at the values the "failed" tests look actually ok. for example: rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324 it is due to the test logic in check_gradients that they fail. * add todos for llama backward pass - implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required) - repeat is not yet tested and looks like it only works for single element src0 inputs. * add operation ggml_sum_rows ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d] * add missing GGML_OP_SUM_ROWS * fix backward pass for repeat requires ggml_sum_rows * successfully test backward pass of repeat * update quantization types in switch-case of add_at and add1 * add baby-llama example training a very small llama model from scratch to output a sinusoidal wave. had to increase maximum number of optimization parameters to train from scratch. * fix softmax in baby-llama example * switching from training with adam to lbfgs produces much better results in the baby-llama example * train with two examples, creating new tensors each time.. * fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed. so we need to keep the original gradients and make dups for opt * train on multiple examples, generate & print tokens with trained model afterwards ctx0 for evaluation and optimization is renewed for each sample * add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d * fix soft_max backward pass for input->ne[1] != 1 * add ggml_log operation necessary for cross entropy loss * add test for ggml_log gradients * implement backward pass for ggml_sum_rows, necessary for cross entropy loss * implement ggml_repeat support for rank > 2 tensors * add test for ggml_sum_rows gradients * fix training get_example_targets predict the next token, not the current token! * add square_error_loss and cross_entropy_loss functions * optimize loss over multiple samples this increases computation graph, need parallel batched forward for more efficiency. * fix backward pass for add_at and change arguments to have same order as in view * add ggml_set(ctx, a, b) to set b in view of a and return modified a necessary to set values into kv_self cache and properly propagate the gradients * fix kv_self gradients for training use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients * replace inplace operations for training with copying operations to allow gradient propagation * add GGML_ASSERT to catch ggml_rope and back value errors * add trainable lora-only model with all big matrices C split into A,B with AB=C this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices. training this instead of the normal model resulted in much worse results though... vastly improve training results instead of logit targets 0 and 1 use -1 and +1. * shorten code using a variable * change name of GGML_OP_ADD_AT to GGML_OP_ACC * smaller default values for baby llama model parameters * update static assert of GGML_OP_COUNT * remove shape annotations in llama_eval_internal * revert disabling of threading for rms_norm and norm * rename print functions in baby-llama example * fix call to ggml_set_name * add missing include for strcmp, etc * remove trailing whitespace * reduce number of test-grad0 iterations avoid exceeding timeout of automated tests * remove busy loop that was used as sleep for slower sinus wave generation * disable slow tests grad0 and opt to avoid exceeding timeouts * c++ in baby-llama example use c++ includes instead of c includes use std::min, std::max instead of MIN, MAX macros * c++ in baby-llama example use c++ includes instead of c includes use std::min, std::max instead of MIN, MAX macros * ggml : fix compiler warnings + cosmetic changes * ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back * swap arguments to vDSP_vdiv call documentation for vDSP_vdiv states: "Note that B comes before A!" * swap arguments to vDSP_vdiv call documentation for vDSP_vdiv states: "Note that B comes before A!" * ggml : swap vDSP_vsub args as per documentation * add parallel batched forward function for baby-llama training * cleanup code for batched training * remove trailing whitespace * minor : fix compiler warnings + indentation style * ggml : fix null ptr deref in backward pass * ggml : remove Q4_2 remnants * ggml : fix clang-tidy warnings * baby-llama : couple of clang-tidy warnings --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13	ggml : sync alibi fix from ggml repo	Georgi Gerganov

2023-05-13	Adding SSE instructions to ggml_vec_dot_q4_0_q8_0 (#1413)	3ooabkhxtn

2023-05-12	ggml : remove bit shuffling (#1405)	Georgi Gerganov
	* ggml : remove Q4_0 bit shufling (ARM NEON) * ggml : remove Q4_1 bit shuffling (ARM NEON + reference) * ggml : nibbles_from_floats() + bytes_from_nibbles() (ARM NEON) * ggml : remove Q4_2 bit shuffling (WIP, BROKEN) * ggml : remove Q5_0 bit shuffling (ARM NEON) * ggml : 2x faster scalar implementations * ggml : remove Q5_1 bit shuffling (ARM NEON + scalar) * ggml : simplify scalar dot * ggml : remove WASM SIMD bit shuffling + remove vzip for ARM 32-bit * ggml : fix Q4_1 quantization * ggml : update cuBLAS + normalize variable names * ggml : remove Q4_2 mode * ggml : minor formatting * ggml : fix Q5_0 quantization * scripts : add script for measuring the time per token * AVX implementations (#1370) * ggml : uniform 5th bit extraction * llama : produce error upon loading old model files * llama : fix model magic/version write * ggml : speed-up Q5_0 + Q5_1 at 4 threads * ggml : preserve old Q4 and Q5 formats * ggml : simplify Q8_1 - no need for low / high sums anymore * ggml : fix Q8_0 and Q8_1 rounding * Revert "AVX implementations (#1370)" This reverts commit 948d124837f9d287d8490f41338e0e4cceb0814f. * ggml : fix AVX2 implementation * sha : update hashes for 7B and 13B * readme : update timings + remove warning banner * llama : update v2 PR number to 1405 * ggml : fix WASM comments * ggml : back to original bit order * readme : add note that Q4 and Q5 have been changed * llama : fix return for unknown version --------- Co-authored-by: Stephan Walter <stephan@walter.name>
2023-05-09	use pause asm insn in busyloop to run the CPU (13600K) 10 °C cooler (#1314)	Sami Farin
	* use pause asm insn in busyloop to run the CPU (13600K) 10 °C cooler Tested with a 13B model. * use _mm_pause() in busyloop * use _mm_pause() in busyloop on x86_64 to reduce power consumption
2023-05-06	ggml : Allow usage of CLBlast alongside Accelerate.framework (#1336)	swittk
	Minor edit in ggml.c which originally would prevent OpenCL from loading completely if GGML_USE_ACCELERATE was defined. Minor speedup in prompt eval time.
2023-05-04	ggml : change immintrin.h to intrin.h for compatibility (#1307)	Ron Jailall
	* change immintrin.h to intrin.h for compatibility Building on windows11 arm throws an error on this line. Seems like using intrin.h covers x86 and and arm * conditional def of intrin.h * fix typo in ggml.c
2023-05-03	ggml : vectorize Q8_0 quantization	Georgi Gerganov
	https://github.com/ggerganov/ggml/pull/127#issuecomment-1533648531
2023-05-02	ggml : fix 32-bit ARM	Georgi Gerganov

2023-05-02	ggml : fix ppc64le build error and make cmake detect Power processors (#1284)	Marvin Gießing
	* Fix ppc64le build issue * Added support to detect ppc64* processors
2023-05-02	ggml: add names to tensors (#1268)	slaren
	* ggml: add names to tensors * minor improvements to dot file formatting
2023-05-01	cuBLAS: refactor and optimize f16 mat mul performance (#1259)	slaren
	* cuBLAS: refactor, convert fp16 to fp32 on device * cuBLAS: use multiple streams, choose smartly between mul_mat_q and mul_mat_f16 * fix build * cuBLAS: update block_q5_1
2023-05-01	ggml : fix ggml_used_mem() (#1264)	Kerfuffle

2023-04-30	ggml : fix UB (int << 31)	Georgi Gerganov

2023-04-30	ggml : add Q5 WASM SIMD + GGML_FTYPE	Georgi Gerganov

2023-04-30	ggml : fix labels for GGML_OP_ALIBI	Georgi Gerganov

2023-04-29	ggml : fix 32-bit ARM NEON	Georgi Gerganov

2023-04-29	ggml : use vzip instead of vuzp for consistency	Georgi Gerganov

2023-04-29	ggml : fix visibility and unused warnings	Georgi Gerganov

2023-04-29	ggml : fix #if for f32_f32 mul_mat (CLBlast) (#1229)	Georgi Gerganov

2023-04-29	ggml : adjust mul_mat_f16 work memory (#1226)	Georgi Gerganov
	* llama : minor - remove explicity int64_t cast * ggml : reduce memory buffer for F16 mul_mat when not using cuBLAS * ggml : add asserts to guard for incorrect wsize
2023-04-29	cuBLAS: use host pinned memory and dequantize while copying (#1207)	slaren
	* cuBLAS: dequantize simultaneously while copying memory * cuBLAS: use host pinned memory * cuBLAS: improve ggml_compute_forward_mul_mat_f16_f32 with pinned memory * cuBLAS: also pin kv cache * fix rebase
2023-04-29	cuBLAS: non-contiguous tensor support (#1215)	Henri Vasserman
	* Cuda: non-contiguous tensor support * remove extra stuff * rename * fix error * more fixes, now OpenBLAS and CLBlast build too * now then?
2023-04-28	Remove Q4_3 which is no better than Q5 (#1218)	Stephan Walter

2023-04-28	ggml : sync ggml (ggml_alibi)	Georgi Gerganov

2023-04-28	ggml : add helper debug printf in soft_max	Georgi Gerganov

2023-04-28	ggml : add CLBlast support (#1164)	0cc4m
	* Allow use of OpenCL GPU-based BLAS using ClBlast instead of OpenBLAS for context processing * Improve ClBlast implementation, avoid recreating buffers, remove redundant transfers * Finish merge of ClBlast support * Move CLBlast implementation to separate file Add buffer reuse code (adapted from slaren's cuda implementation) * Add q4_2 and q4_3 CLBlast support, improve code * Double CLBlast speed by disabling OpenBLAS thread workaround Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com> Co-authored-by: slaren <2141330+slaren@users.noreply.github.com> * Fix device selection env variable names * Fix cast in opencl kernels * Add CLBlast to CMakeLists.txt * Replace buffer pool with static buffers a, b, qb, c Fix compile warnings * Fix typos, use GGML_TYPE defines, improve code * Improve btype dequant kernel selection code, add error if type is unsupported * Improve code quality * Move internal stuff out of header * Use internal enums instead of CLBlast enums * Remove leftover C++ includes and defines * Make event use easier to read Co-authored-by: Henri Vasserman <henv@hot.ee> * Use c compiler for opencl files * Simplify code, fix include * First check error, then release event * Make globals static, fix indentation * Rename dequant kernels file to conform with other file names * Fix import cl file name --------- Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com> Co-authored-by: slaren <2141330+slaren@users.noreply.github.com> Co-authored-by: Henri Vasserman <henv@hot.ee> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-04-28	add avx2 for dot_q8_0_q8_0, 2x faster than scalar (#1211)	Yann Follet

2023-04-26	ggml : slightly faster AVX2 implementation for Q5 (#1197)	Stephan Walter

2023-04-26	ggml : add Q5_0 and Q5_1 quantization (#1187)	Georgi Gerganov
	* ggml : add Q5_0 quantization (cuBLAS only) * ggml : fix Q5_0 qh -> uint32_t * ggml : fix q5_0 histogram stats * ggml : q5_0 scalar dot product * ggml : q5_0 ARM NEON dot * ggml : q5_0 more efficient ARM NEON using uint64_t masks * ggml : rename Q5_0 -> Q5_1 * ggml : adding Q5_0 mode * quantize : add Q5_0 and Q5_1 to map * ggml : AVX2 optimizations for Q5_0, Q5_1 (#1195) --------- Co-authored-by: Stephan Walter <stephan@walter.name>
2023-04-25	ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) ↵	Georgi Gerganov
	(#1179) * ggml : add Q8_0 quantization format (rename the old one to Q8_1) * tests : fix test-quantize-fns * ggml : finalize Q8_0 implementation * ggml : use q4_0_q8_0 and q4_2_q8_0 * ggml : fix Q8_0 dot product bug (ARM) * ggml : Q8_0 unroll x2 * ggml : fix bug - using wrong block type * ggml : extend quantize_fns_t with "vec_dot_type" * ggml : fix Q8_0 to use 255 values out of 256 * ggml : fix assert using wrong QK4_2 instead of QK4_3
2023-04-25	ggml : use full range for Q4_0 and Q4_2 quantization (#729)	unbounded
	* Use full range for q4_0 quantization By keeping the sign of the highest magnitude, we can make sure the highest value maps to -8, which is currently unused. This is a bit of a freebie since it is fully backwards compatible with the current format. * Update quantize_row_q4_0 for AVX/AVX2 * Update quantize_row_q4_0 for WASM Untested * Update quantize_row_q4_0 for Arm NEON * Update quantize_row_q4_0 for PowerPC Untested * Use full range for q4_2 quantization
2023-04-24	ggml : fix bug in ggml_compute_forward_sum_f32 (#1162)	xaedes
	The sum over all rows is now computed instead of just the last row
2023-04-24	Fix build for gcc 8 and test in CI (#1154)	Stephan Walter

2023-04-23	ggml : do not print perf ops that have not been used at all	Georgi Gerganov

2023-04-23	ggml : better PERF prints + support "LLAMA_PERF=1 make"	Georgi Gerganov

2023-04-23	Improve AVX2 for vec_dot_q4_3_q8_0 (#1138)	Stephan Walter

2023-04-23	A better `packNibbles` and `mul_sum_i8_pairs_float` implementation using ↵	Yishuo Wang
	AVX512 (#1119)
2023-04-22	ggml : fix Q4_3 cuBLAS	Georgi Gerganov

2023-04-22	Fix CI: ARM NEON, quantization unit tests, editorconfig (#1122)	Stephan Walter

2023-04-22	ggml : fix AVX build + update to new Q8_0 format	Georgi Gerganov

2023-04-22	ggml : alternative Q4_3 implementation using modified Q8_0 (#1109)	Georgi Gerganov
	* ggml : prefer vzip to vuzp This way we always use the same type of instruction across all quantizations * ggml : alternative Q4_3 implementation using modified Q8_0 * ggml : fix Q4_3 scalar imlpementation * ggml : slight improvement of Q4_3 - no need for loop unrolling * ggml : fix AVX paths for Q8_0 quantization
2023-04-22	ggml : AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring (#1099)	Stephan Walter
	* AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring * finish AVX vectorization of quantize_row_q8_0 * Rename hsum_int_8 to hsum_i32_8
2023-04-21	Improve cuBLAS performance by using a memory pool (#1094)	slaren
	* Improve cuBLAS performance by using a memory pool * Move cuda specific definitions to ggml-cuda.h/cu * Add CXX flags to nvcc * Change memory pool synchronization mechanism to a spin lock General code cleanup
2023-04-21	ggml : a faster version for Q4_1 x Q8_0 dot products (#1083)	Kawrakow
	* A faster version for Q4_1 x Q8_0 dot products The idea nehind being that Q8_0 quantized values get used many times in the matrix multiplications where they are involved. In the current implementations, when we are evaluating the dot products, we need to compute the sum of the quants in the Q8_0 vector, so the same operation is repeated many times. Here we pre-compute the sum during Q8_0 quantization, store it in the now modified block_q8_0 struct, and then reuse this result in the subsequent dot products. In a synthetic benchmark (just compute a bunch of dot products), this change speeds up the Q4_1 * Q8_0 dot product by 80%, making the performance identical to Q4_0 * Q8_0. In practical application, I see a ~15% gain in speed for token prediction on M2, and ~5% gain on Ryzen 7950X. The speed gain in the prompt evaluation is much bigger (around 50%). I have only done the change for the scalar version, ARM_NEON, and AVX2, so we still need an AVX implementation. * Cleaning up --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-04-20	ggml : sync ggml (add GPT-NeoX RoPE implementation)	Georgi Gerganov

2023-04-20	ggml : fix bug in ggml_compute_forward_dup_f32()	Georgi Gerganov

2023-04-20	ggml : do not break cuBLAS build (Q4_3 is not yet implemented)	Georgi Gerganov