Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 44
1 file changed, 38 insertions, 6 deletions
--- a/README.md
+++ b/README.md
@@ -17,12 +17,11 @@ The main goal is to run the model using 4-bit quantization on a MacBook.
 
 This was hacked in an evening - I have no idea if it works correctly.
 
-So far, I've tested just the 7B model.
-Here is a typical run:
+Here is a typical run using LLaMA-7B:
 
 ```java
-make -j && ./main -m ../LLaMA-4bit/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512
-I llama.cpp build info:
+make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512
+I llama.cpp build info:
 I UNAME_S:  Darwin
 I UNAME_P:  arm
 I UNAME_M:  arm64
@@ -34,7 +33,7 @@ I CXX: Apple clang version 14.0.0 (clang-1400.0.29.202)
 make: Nothing to be done for `default'.
 
 main: seed = 1678486056
-llama_model_load: loading model from '../LLaMA-4bit/7B/ggml-model-q4_0.bin' - please wait ...
+llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
 llama_model_load: n_vocab = 32000
 llama_model_load: n_ctx   = 512
 llama_model_load: n_embd  = 4096
@@ -110,6 +109,8 @@ https://user-images.githubusercontent.com/1991296/224442907-7693d4be-acaa-4e01-8
 
 ## Usage
 
+Here are the steps for the LLaMA-7B model:
+
 ```bash
 # build this repo
 git clone https://github.com/ggerganov/llama.cpp
@@ -133,9 +134,40 @@ python3 convert-pth-to-ggml.py models/7B/ 1
 ./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128
 ```
 
+For the bigger models, there are a few extra quantization steps. For example, for LLaMA-13B, converting to FP16 format
+will create 2 ggml files instead of one:
+
+```bash
+ggml-model-f16.bin
+ggml-model-f16.bin.1
+```
+
+You need to quantize each of them separately like this:
+
+```bash
+./quantize ./models/13B/ggml-model-f16.bin ./models/13B/ggml-model-q4_0.bin 2
+./quantize ./models/13B/ggml-model-f16.bin.1 ./models/13B/ggml-model-q4_0.bin.1 2
+```
+
+Everything else is the same. Simply run:
+
+```bash
+./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128
+```
+
+The number of files generated for each model is as follows:
+
+```
+7B -> 1 file
+13B -> 2 files
+33B -> 4 files
+65B -> 8 files
+```
+
+When running the larger models, make sure you have enough disk space to store all the intermediate files.
+
 ## Limitations
 
-- Currently, only LLaMA-7B is supported since I haven't figured out how to merge the tensors of the bigger models. However, in theory, you should be able to run 65B on a 64GB MacBook
 - Not sure if my tokenizer is correct. There are a few places where we might have a mistake:
   - https://github.com/ggerganov/llama.cpp/blob/26c084662903ddaca19bef982831bfb0856e8257/convert-pth-to-ggml.py#L79-L87
   - https://github.com/ggerganov/llama.cpp/blob/26c084662903ddaca19bef982831bfb0856e8257/utils.h#L65-L69
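Note: the per-shard quantization the diff adds for the bigger models can also be expressed as a single loop instead of one `./quantize` invocation per file. The sketch below is not part of the commit; it only assumes the shard naming shown above (ggml-model-f16.bin, ggml-model-f16.bin.1, ...) and reuses the trailing `2` argument exactly as it appears in the README commands.

```bash
# Sketch: quantize every FP16 shard of a larger model (e.g. LLaMA-13B) in one loop.
# Assumes the shard naming shown in the diff: ggml-model-f16.bin, ggml-model-f16.bin.1, ...
for f in ./models/13B/ggml-model-f16.bin*; do
  ./quantize "$f" "${f/f16/q4_0}" 2   # trailing "2" matches the q4_0 commands above
done
```

With that naming, ggml-model-f16.bin becomes ggml-model-q4_0.bin and ggml-model-f16.bin.1 becomes ggml-model-q4_0.bin.1, which are the filenames the 13B `./main` invocation in the README expects.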