author     aditya <bluenerd@protonmail.com>   2023-08-10 12:32:35 +0530
committer  aditya <bluenerd@protonmail.com>   2023-08-10 12:32:35 +0530
commit     a9ff78b3f48dc9f81943c41531c4959ce7e2ae9d (patch)
tree       49ee8c3c9148038f04112802265d928ef1aba428 /README.md
parent     2516af4cd61f509c995b4f78fdf123cba33f3509 (diff)
parent     916a9acdd0a411426690400ebe2bb7ce840a6bba (diff)
resolve merge conflict
Diffstat (limited to 'README.md')
-rw-r--r--  README.md  51
1 file changed, 46 insertions, 5 deletions
diff --git a/README.md b/README.md
index 476cc43..6900b11 100644
--- a/README.md
+++ b/README.md
@@ -77,9 +77,10 @@ as the main playground for developing new features for the [ggml](https://github
**Supported models:**
- [X] LLaMA 🦙
+- [x] LLaMA 2 🦙🦙
- [X] [Alpaca](https://github.com/ggerganov/llama.cpp#instruction-mode-with-alpaca)
- [X] [GPT4All](https://github.com/ggerganov/llama.cpp#using-gpt4all)
-- [X] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca)
+- [X] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) and [Chinese LLaMA-2 / Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2)
- [X] [Vigogne (French)](https://github.com/bofenghuang/vigogne)
- [X] [Vicuna](https://github.com/ggerganov/llama.cpp/discussions/643#discussioncomment-5533894)
- [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
@@ -87,6 +88,7 @@ as the main playground for developing new features for the [ggml](https://github
- [X] [Pygmalion 7B / Metharme 7B](#using-pygmalion-7b--metharme-7b)
- [X] [WizardLM](https://github.com/nlpxucan/WizardLM)
- [X] [Baichuan-7B](https://huggingface.co/baichuan-inc/baichuan-7B) and its derivations (such as [baichuan-7b-sft](https://huggingface.co/hiyouga/baichuan-7b-sft))
+- [X] [Aquila-7B](https://huggingface.co/BAAI/Aquila-7B) / [AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B)
**Bindings:**
@@ -242,6 +244,23 @@ In order to build llama.cpp you have three different options.
zig build -Doptimize=ReleaseFast
```
+- Using `gmake` (FreeBSD):
+
+ 1. Install and activate [DRM in FreeBSD](https://wiki.freebsd.org/Graphics)
+ 2. Add your user to the **video** group
+ 3. Install compilation dependencies.
+
+ ```bash
+ sudo pkg install gmake automake autoconf pkgconf llvm15 clinfo clover \
+ opencl clblast openblas
+
+ gmake CC=/usr/local/bin/clang15 CXX=/usr/local/bin/clang++15 -j4
+ ```
+
+ **Notes:** With these packages you can build llama.cpp with OpenBLAS and
+ CLBlast support, enabling OpenCL GPU acceleration on FreeBSD. Please read
+ the instructions below in this document to enable these options (a build
+ sketch follows this item).
+
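For a FreeBSD build that actually uses these libraries, here is a minimal sketch, assuming the `LLAMA_OPENBLAS=1` and `LLAMA_CLBLAST=1` Makefile flags described in the BLAS instructions below:

```bash
# sketch: FreeBSD build with OpenBLAS and CLBlast enabled
# (assumes the LLAMA_OPENBLAS / LLAMA_CLBLAST flags from the BLAS section below)
gmake CC=/usr/local/bin/clang15 CXX=/usr/local/bin/clang++15 \
      LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1 -j4
```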
### Metal Build
Using Metal allows the computation to be executed on the GPU for Apple devices:
@@ -382,12 +401,15 @@ Building the program with BLAS support may lead to some performance improvements
The environment variable [`CUDA_VISIBLE_DEVICES`](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) can be used to specify which GPU(s) will be used. The following compilation options are also available to tweak performance:
+<!---
+ | LLAMA_CUDA_CUBLAS | Boolean | false | Use cuBLAS instead of custom CUDA kernels for prompt processing. Faster for all quantization formats except for q4_0 and q8_0, especially for k-quants. Increases VRAM usage (700 MiB for 7b, 970 MiB for 13b, 1430 MiB for 33b). |
+--->
| Option | Legal values | Default | Description |
|-------------------------|------------------------|---------|-------------|
- | LLAMA_CUDA_FORCE_DMMV | Boolean | false | Force the use of dequantization + matrix vector multiplication kernels instead of using kernels that do matrix vector multiplication on quantized data. By default the decision is made based on compute capability (MMVQ for 7.0/Turing/RTX 2000 or higher). Does not affect k-quants. |
+ | LLAMA_CUDA_FORCE_DMMV | Boolean | false | Force the use of dequantization + matrix vector multiplication kernels instead of using kernels that do matrix vector multiplication on quantized data. By default the decision is made based on compute capability (MMVQ for 6.1/Pascal/GTX 1000 or higher). Does not affect k-quants. |
| LLAMA_CUDA_DMMV_X | Positive integer >= 32 | 32 | Number of values in x direction processed by the CUDA dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants. |
- | LLAMA_CUDA_MMV_Y | Positive integer | 1 | Block size in y direction for the CUDA mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. Does not affect k-quants. |
- | LLAMA_CUDA_DMMV_F16 | Boolean | false | If enabled, use half-precision floating point arithmetic for the CUDA dequantization + mul mat vec kernels. Can improve performance on relatively recent GPUs. |
+ | LLAMA_CUDA_MMV_Y | Positive integer | 1 | Block size in y direction for the CUDA mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. Does not affect k-quants. |
+ | LLAMA_CUDA_F16 | Boolean | false | If enabled, use half-precision floating point arithmetic for the CUDA dequantization + mul mat vec kernels and for the q4_1 and q5_1 matrix matrix multiplication kernels. Can improve performance on relatively recent GPUs. |
| LLAMA_CUDA_KQUANTS_ITER | 1 or 2 | 2 | Number of values processed per iteration and per CUDA thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs. |
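As a rough sketch of how these options fit together, assuming the `LLAMA_CUBLAS=1` Makefile flag from the cuBLAS build instructions, a tuned build followed by a run pinned to a single GPU via `CUDA_VISIBLE_DEVICES` might look like this:

```bash
# sketch: cuBLAS build with kernel options from the table above
make LLAMA_CUBLAS=1 LLAMA_CUDA_DMMV_X=64 LLAMA_CUDA_MMV_Y=2 LLAMA_CUDA_F16=1

# restrict inference to the first GPU using CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0 ./main -m ./models/7B/ggml-model-q4_0.bin -p "Hello" -n 64
```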
- #### CLBlast
@@ -470,6 +492,9 @@ Building the program with BLAS support may lead to some performance improvements
# obtain the original LLaMA model weights and place them in ./models
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model
+ # [Optional] for models using BPE tokenizers
+ ls ./models
+ 65B 30B 13B 7B vocab.json
# install Python dependencies
python3 -m pip install -r requirements.txt
@@ -477,6 +502,9 @@ python3 -m pip install -r requirements.txt
# convert the 7B model to ggml FP16 format
python3 convert.py models/7B/
+ # [Optional] for models using BPE tokenizers
+ python convert.py models/7B/ --vocabtype bpe
+
# quantize the model to 4-bits (using q4_0 method)
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
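Once quantization finishes, a minimal sketch of running the resulting 4-bit model (the prompt and token count are illustrative):

```bash
# sketch: run inference with the quantized model
# -n sets the number of tokens to generate; the prompt is illustrative
./main -m ./models/7B/ggml-model-q4_0.bin -n 128 -p "Once upon a time"
```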
@@ -633,6 +661,19 @@ python3 convert.py pygmalion-7b/ --outtype q4_1
- The LLaMA models are officially distributed by Facebook and will **never** be provided through this repository.
- Refer to [Facebook's LLaMA repository](https://github.com/facebookresearch/llama/pull/73/files) if you need to request access to the model data.
+### Obtaining and using the Facebook LLaMA 2 model
+
+- Refer to [Facebook's LLaMA download page](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) if you want to access the model data.
+- Alternatively, if you want to save time and space, you can download already converted and quantized models from [TheBloke](https://huggingface.co/TheBloke), including:
+ - [LLaMA 2 7B base](https://huggingface.co/TheBloke/Llama-2-7B-GGML)
+ - [LLaMA 2 13B base](https://huggingface.co/TheBloke/Llama-2-13B-GGML)
+ - [LLaMA 2 70B base](https://huggingface.co/TheBloke/Llama-2-70B-GGML)
+ - [LLaMA 2 7B chat](https://huggingface.co/TheBloke/Llama-2-7B-chat-GGML)
+ - [LLaMA 2 13B chat](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML)
+ - [LLaMA 2 70B chat](https://huggingface.co/TheBloke/Llama-2-70B-chat-GGML)
+- Specify `-eps 1e-5` for best generation quality
+- Specify `-gqa 8` for 70B models to work
+
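Combining the two flags above, a sketch of running one of the converted 70B chat models (the exact file name depends on which quantization you downloaded and is illustrative):

```bash
# sketch: run a LLaMA 2 70B GGML model
# -gqa 8 is required for 70B models; -eps 1e-5 follows the quality recommendation above
./main -m ./models/llama-2-70b-chat.ggmlv3.q4_0.bin -gqa 8 -eps 1e-5 -p "Hello"
```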
### Verifying the model files
Please verify the [sha256 checksums](SHA256SUMS) of all downloaded model files to confirm that you have the correct model data files before creating an issue relating to your model files.
@@ -640,7 +681,7 @@ Please verify the [sha256 checksums](SHA256SUMS) of all downloaded model files t
```bash
# run the verification script
-python3 .\scripts\verify-checksum-models.py
+./scripts/verify-checksum-models.py
```
- On Linux or macOS it is also possible to run the following commands to verify that you have all the latest files in your self-installed `./models` subdirectory: