author: Johannes Gäßler <johannesg@5d6.de> 2023-06-14 19:47:19 +0200
committer: GitHub <noreply@github.com> 2023-06-14 19:47:19 +0200
commit: 254a7a7a5ff4c874ff8488f1f5cbdd7e9c89d682
tree: 65f35a2d189f3cf6f1f625b2acb343c2dd77790d /examples/server
parent: 92549202659fc23ba9fec5e688227d0da9b06b40
CUDA full GPU acceleration, KV cache in VRAM (#1827)

* Fixed CUDA RoPE
* ggml_cuda_mul_mat_vec_p021
* ggml_cuda_scale
* ggml_cuda_diag_mask_inf
* ggml_is_permuted
* ggml_cuda_cpy
* flatten rows for ggml_cuda_op
* Added a --low-vram option
* Fixed Windows performance
* Fixed LLAMA_CUDA_DMMV_Y > 1 for WizardLM

Diffstat (limited to 'examples/server')
-rw-r--r-- examples/server/README.md | 1
-rw-r--r-- examples/server/server.cpp | 9
2 files changed, 10 insertions, 0 deletions
diff --git a/examples/server/README.md b/examples/server/README.md
index b011302..7dabac9 100644
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -289,6 +289,7 @@ Test();
- `-ngl N, --n-gpu-layers N`: When compiled with appropriate support (currently CLBlast or cuBLAS), this option allows offloading some layers to the GPU for computation. Generally results in increased performance.
- `-mg i, --main-gpu i`: When using multiple GPUs this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. By default GPU 0 is used. Requires cuBLAS.
- `-ts SPLIT, --tensor-split SPLIT`: When using multiple GPUs this option controls how large tensors should be split across all GPUs. `SPLIT` is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default the data is split in proportion to VRAM but this may not be optimal for performance. Requires cuBLAS.
+- `-lv, --low-vram`: Do not allocate a VRAM scratch buffer for holding temporary results. Reduces VRAM usage at the cost of performance, particularly prompt processing speed. Requires cuBLAS.
- `--embedding`: Enable the embedding mode. **Completion function doesn't work in this mode**.
- `--host`: Set the hostname or ip address to listen. Default `127.0.0.1`;
- `--port`: Set the port to listen. Default: `8080`.
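
The `--tensor-split` description in the hunk above amounts to normalising the list so that the proportions sum to 1. The following is a minimal sketch of that arithmetic only. `normalize_split` is a hypothetical helper introduced here for illustration; it is not part of llama.cpp, and the actual parsing in `server.cpp` may differ.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of llama.cpp): turns a --tensor-split style
// string such as "3,2" into per-GPU fractions that sum to 1.
std::vector<double> normalize_split(const std::string &spec)
{
    std::vector<double> parts;
    std::stringstream ss(spec);
    std::string item;
    double total = 0.0;

    // Split on commas and accumulate the raw values.
    while (std::getline(ss, item, ','))
    {
        const double value = std::stod(item); // throws on non-numeric input
        if (value < 0.0)
        {
            throw std::invalid_argument("split values must be non-negative");
        }
        parts.push_back(value);
        total += value;
    }
    if (total <= 0.0)
    {
        throw std::invalid_argument("at least one split value must be positive");
    }

    // Divide each value by the sum to get the share assigned to each GPU.
    for (double &p : parts)
    {
        p /= total;
    }
    return parts;
}
```

Under these assumptions, `normalize_split("3,2")` yields roughly 0.6 for GPU 0 and 0.4 for GPU 1, matching the 60%/40% example in the README text.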
diff --git a/examples/server/server.cpp b/examples/server/server.cpp
index 31d8087..8727500 100644
--- a/examples/server/server.cpp
+++ b/examples/server/server.cpp
@@ -405,6 +405,7 @@ void server_print_usage(int /*argc*/, char **argv, const gpt_params &params)
fprintf(stderr, " how to split tensors across multiple GPUs, comma-separated list of proportions, e.g. 3,1\n");
fprintf(stderr, " how to split tensors across multiple GPUs, comma-separated list of proportions, e.g. 3,1\n");
fprintf(stderr, " -mg i, --main-gpu i the GPU to use for scratch and small tensors\n" );
+ fprintf(stderr, " -lv, --low-vram don't allocate VRAM scratch buffer\n" );
#endif
fprintf(stderr, " -m FNAME, --model FNAME\n");
fprintf(stderr, " model path (default: %s)\n", params.model.c_str());
@@ -539,6 +540,14 @@ bool server_params_parse(int argc, char **argv, server_params &sparams, gpt_para
fprintf(stderr, "WARNING: llama.cpp was compiled without cuBLAS. It is not possible to set a tensor split.\n");
#endif // GGML_USE_CUBLAS
}
+ else if (arg == "--low-vram" || arg == "-lv")
+ {
+#ifdef GGML_USE_CUBLAS
+ params.low_vram = true;
+#else
+ fprintf(stderr, "warning: llama.cpp was compiled without cuBLAS. It is not possible to set lower vram usage.\n");
+#endif // GGML_USE_CUBLAS
+ }
else if (arg == "--main-gpu" || arg == "-mg")
{
if (++i >= argc)