Multi GPU support, CUDA refactor, CUDA scratch buffer (#1703)

* CUDA multi GPU + scratch
ggml_cuda_compute_forward
Tensor parallelism
ggml_cuda_add
ggml_cuda_rms_norm
ggml_cuda_silu
CUDA scratch buffer
--main-gpu CLI option
author: Johannes Gäßler <johannesg@5d6.de> 2023-06-06 21:33:23 +0200
committer: GitHub <noreply@github.com> 2023-06-06 21:33:23 +0200
commit: 17366df842e358768c0df7024484fffecfc7865b (patch)
tree: f042c8142311d45f8712db10debf89111b2c7e57 /examples/server/README.md
parent: 44f906e8537fcec965e312d621c80556d6aa9bec (diff)
1 files changed, 2 insertions, 0 deletions
diff --git a/examples/server/README.md b/examples/server/README.md
index bba513c..b011302 100644
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -287,6 +287,8 @@ Test();
 -   `-m FNAME, --model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.bin`).
 -   `-c N, --ctx-size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.
 -   `-ngl N, --n-gpu-layers N`: When compiled with appropriate support (currently CLBlast or cuBLAS), this option allows offloading some layers to the GPU for computation. Generally results in increased performance.
+-   `-mg i, --main-gpu i`: When using multiple GPUs this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. By default GPU 0 is used. Requires cuBLAS.
+-   `-ts SPLIT, --tensor-split SPLIT`: When using multiple GPUs this option controls how large tensors should be split across all GPUs. `SPLIT` is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default the data is split in proportion to VRAM but this may not be optimal for performance. Requires cuBLAS.
 -   `--embedding`: Enable the embedding mode. **Completion function doesn't work in this mode**.
 -   `--host`: Set the hostname or ip address to listen. Default `127.0.0.1`;
 -   `--port`: Set the port to listen. Default: `8080`.
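
The proportional split described for `--tensor-split` can be illustrated with a small sketch. This is not the code from the commit; it only reproduces the arithmetic in the README text, and `parse_tensor_split` is a hypothetical helper name chosen for the example.

```python
def parse_tensor_split(split: str) -> list[float]:
    """Turn a SPLIT string such as "3,2" into per-GPU fractions that sum to 1.

    Hypothetical helper: it mirrors the README description only, not the
    actual parsing done by the server binary.
    """
    values = [float(part) for part in split.split(",")]
    if any(v < 0 for v in values):
        raise ValueError("SPLIT values must be non-negative")
    total = sum(values)
    if total == 0:
        raise ValueError("at least one SPLIT value must be positive")
    # Each GPU receives its value divided by the sum of all values.
    return [v / total for v in values]


fractions = parse_tensor_split("3,2")
for gpu, fraction in enumerate(fractions):
    print(f"GPU {gpu}: {fraction:.0%} of the data")
```

With the README's example value "3,2" this prints 60% for GPU 0 and 40% for GPU 1. Only the ratio between the values matters in this sketch, so "6,4" would give the same fractions.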