path: root/README.md
Diffstat (limited to 'README.md')
-rw-r--r-- README.md 13
1 files changed, 7 insertions, 6 deletions
diff --git a/README.md b/README.md
index 77332ca..28842e9 100644
--- a/README.md
+++ b/README.md
@@ -267,11 +267,11 @@ Any value larger than 0 will offload the computation to the GPU. For example:
Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). BLAS doesn't affect the normal generation performance. There are currently three different implementations of it:
-- **Accelerate Framework**:
+- #### Accelerate Framework:
This is only available on Mac PCs and it's enabled by default. You can just build using the normal instructions.
-- **OpenBLAS**:
+- #### OpenBLAS:
This provides BLAS acceleration using only the CPU. Make sure to have OpenBLAS installed on your machine.
@@ -305,11 +305,11 @@ Building the program with BLAS support may lead to some performance improvements
cmake --build . --config Release
```
-- **BLIS**
+- #### BLIS
Check [BLIS.md](BLIS.md) for more information.
-- **Intel MKL**
+- #### Intel MKL
By default, `LLAMA_BLAS_VENDOR` is set to `Generic`, so if you already sourced intel environment script and assign `-DLLAMA_BLAS=ON` in cmake, the mkl version of Blas will automatically been selected. You may also specify it by:
@@ -320,7 +320,7 @@ Building the program with BLAS support may lead to some performance improvements
cmake --build . --config Release
```
-- **cuBLAS**
+- #### cuBLAS
This provides BLAS acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads).
- Using `make`:
@@ -339,7 +339,7 @@ Building the program with BLAS support may lead to some performance improvements
The environment variable [`CUDA_VISIBLE_DEVICES`](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) can be used to specify which GPU(s) will be used.
-- **CLBlast**
+- #### CLBlast
OpenCL acceleration is provided by the matrix multiplication kernels from the [CLBlast](https://github.com/CNugteren/CLBlast) project and custom kernels for ggml that can generate tokens on the GPU.
@@ -684,3 +684,4 @@ docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light -m /mode
### Docs
- [GGML tips & tricks](https://github.com/ggerganov/llama.cpp/wiki/GGML-Tips-&-Tricks)
+- [Performance troubleshooting](./docs/token_generation_performance_tips.md)