Diffstat (limited to 'examples/server/README.md')
-rw-r--r--  examples/server/README.md | 318
1 file changed, 91 insertions, 227 deletions
diff --git a/examples/server/README.md b/examples/server/README.md
index 3b11165..474a28b 100644
--- a/examples/server/README.md
+++ b/examples/server/README.md
@@ -1,37 +1,74 @@
# llama.cpp/example/server
-This example allow you to have a llama.cpp http server to interact from a web page or consume the API.
+This example demonstrates a simple HTTP API server to interact with llama.cpp.
-## Table of Contents
+Command line options:
-1. [Quick Start](#quick-start)
-2. [Node JS Test](#node-js-test)
-3. [API Endpoints](#api-endpoints)
-4. [More examples](#more-examples)
-5. [Common Options](#common-options)
-6. [Performance Tuning and Memory Options](#performance-tuning-and-memory-options)
+- `--threads N`, `-t N`: Set the number of threads to use during computation.
+- `-m FNAME`, `--model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.bin`).
+- `-a ALIAS`, `--alias ALIAS`: Set an alias for the model. The alias will be returned in API responses.
+- `-c N`, `--ctx-size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.
+- `-ngl N`, `--n-gpu-layers N`: When compiled with appropriate support (currently CLBlast or cuBLAS), this option allows offloading some layers to the GPU for computation. Generally results in increased performance.
+- `-mg i`, `--main-gpu i`: When using multiple GPUs, this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. By default GPU 0 is used. Requires cuBLAS.
+- `-ts SPLIT`, `--tensor-split SPLIT`: When using multiple GPUs, this option controls how large tensors should be split across all GPUs. `SPLIT` is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default the data is split in proportion to VRAM, but this may not be optimal for performance. Requires cuBLAS.
+- `-lv`, `--low-vram`: Do not allocate a VRAM scratch buffer for holding temporary results. Reduces VRAM usage at the cost of performance, particularly prompt processing speed. Requires cuBLAS.
+- `-b N`, `--batch-size N`: Set the batch size for prompt processing. Default: `512`.
+- `--memory-f32`: Use 32-bit floats instead of 16-bit floats for memory key+value. Not recommended.
+- `--mlock`: Lock the model in memory, preventing it from being swapped out when memory-mapped.
+- `--no-mmap`: Do not memory-map the model. By default, models are mapped into memory, which allows the system to load only the necessary parts of the model as needed.
+- `--lora FNAME`: Apply a LoRA (Low-Rank Adaptation) adapter to the model (implies `--no-mmap`). This allows you to adapt the pretrained model to specific tasks or domains.
+- `--lora-base FNAME`: Optional model to use as a base for the layers modified by the LoRA adapter. This flag is used in conjunction with the `--lora` flag, and specifies the base model for the adaptation.
+- `-to N`, `--timeout N`: Server read/write timeout in seconds. Default: `600`.
+- `--host`: Set the hostname or IP address to listen on. Default: `127.0.0.1`.
+- `--port`: Set the port to listen on. Default: `8080`.
+
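+For example, a server run with a larger context, more threads, and some layers offloaded to the GPU could be started like this (the model path and values are illustrative only; `-ngl` applies only when built with GPU support):
+
+```bash
+# example values; adjust the model path, thread count and GPU layers for your system
+./server -m models/7B/ggml-model.bin -c 2048 -t 8 -ngl 32 --host 127.0.0.1 --port 8080
+```
+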
+## Build
+
+Build llama.cpp with the server from the repository root using either `make` or CMake:
+
+- Using `make`:
+
+ ```bash
+ LLAMA_BUILD_SERVER=1 make
+ ```
+
+- Using `CMake`:
+
+ ```bash
+ mkdir build-server
+ cd build-server
+ cmake -DLLAMA_BUILD_SERVER=ON ..
+ cmake --build . --config Release
+ ```
## Quick Start
To get started right away, run the following command, making sure to use the correct path for the model you have:
-#### Unix-based systems (Linux, macOS, etc.):
-Make sure to build with the server option on
-```bash
-LLAMA_BUILD_SERVER=1 make
-```
+### Unix-based systems (Linux, macOS, etc.):
```bash
-./server -m models/7B/ggml-model.bin --ctx_size 2048
+./server -m models/7B/ggml-model.bin -c 2048
```
-#### Windows:
+### Windows:
```powershell
-server.exe -m models\7B\ggml-model.bin --ctx_size 2048
+server.exe -m models\7B\ggml-model.bin -c 2048
```
-That will start a server that by default listens on `127.0.0.1:8080`. You can consume the endpoints with Postman or NodeJS with axios library.
+The above command will start a server that by default listens on `127.0.0.1:8080`.
+You can consume the endpoints with Postman, or with NodeJS using the axios library.
+
+## Testing with CURL
+
+Using [curl](https://curl.se/). On Windows, `curl.exe` should be available in the base OS.
+
+```sh
+curl --request POST \
+ --url http://localhost:8080/completion \
+ --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
+```
## Node JS Test
@@ -54,7 +91,6 @@ const prompt = `Building a website can be done in 10 simple steps:`;
async function Test() {
let result = await axios.post("http://127.0.0.1:8080/completion", {
prompt,
- batch_size: 128,
n_predict: 512,
});
@@ -73,247 +109,75 @@ node .
## API Endpoints
-You can interact with this API Endpoints. This implementations just support chat style interaction.
+- **POST** `/completion`: Given a prompt, it returns the predicted completion. An example request that combines several of the options below is shown after the endpoint list.
-- **POST** `hostname:port/completion`: Setting up the Llama Context to begin the completions tasks.
+ *Options:*
-*Options:*
+ `temperature`: Adjust the randomness of the generated text (default: 0.8).
-`batch_size`: Set the batch size for prompt processing (default: 512).
+ `top_k`: Limit the next token selection to the K most probable tokens (default: 40).
-`temperature`: Adjust the randomness of the generated text (default: 0.8).
+ `top_p`: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.9).
-`top_k`: Limit the next token selection to the K most probable tokens (default: 40).
+ `n_predict`: Set the number of tokens to predict when generating text. **Note:** May exceed the set limit slightly if the last token is a partial multibyte character. (default: 128, -1 = infinity).
-`top_p`: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.9).
+ `n_keep`: Specify the number of tokens from the initial prompt to retain when the model resets its internal context.
+ By default, this value is set to 0 (meaning no tokens are kept). Use `-1` to retain all tokens from the initial prompt.
-`n_predict`: Set the number of tokens to predict when generating text (default: 128, -1 = infinity).
+ `stream`: Allows receiving each predicted token in real time instead of waiting for the completion to finish. To enable this, set to `true`.
-`threads`: Set the number of threads to use during computation.
+ `prompt`: Provide a prompt. Internally, the prompt is compared with the previous request; any part that has already been evaluated is reused, and only the remaining part is evaluated.
-`n_keep`: Specify the number of tokens from the initial prompt to retain when the model resets its internal context. By default, this value is set to 0 (meaning no tokens are kept). Use `-1` to retain all tokens from the initial prompt.
+ `stop`: Specify a JSON array of stopping strings.
+ These words will not be included in the completion, so make sure to add them to the prompt for the next iteration (default: []).
-`as_loop`: It allows receiving each predicted token in real-time instead of waiting for the completion to finish. To enable this, set to `true`.
+ `tfs_z`: Enable tail free sampling with parameter z (default: 1.0, 1.0 = disabled).
-`interactive`: It allows interacting with the completion, and the completion stops as soon as it encounters a `stop word`. To enable this, set to `true`.
+ `typical_p`: Enable locally typical sampling with parameter p (default: 1.0, 1.0 = disabled).
-`prompt`: Provide a prompt. Internally, the prompt is compared, and it detects if a part has already been evaluated, and the remaining part will be evaluate.
+ `repeat_penalty`: Control the repetition of token sequences in the generated text (default: 1.1).
-`stop`: Specify the words or characters that indicate a stop. These words will not be included in the completion, so make sure to add them to the prompt for the next iteration.
+ `repeat_last_n`: Last n tokens to consider for penalizing repetition (default: 64, 0 = disabled, -1 = ctx-size).
-`exclude`: Specify the words or characters you do not want to appear in the completion. These words will not be included in the completion, so make sure to add them to the prompt for the next iteration.
+ `penalize_nl`: Penalize newline tokens when applying the repeat penalty (default: true).
-- **POST** `hostname:port/embedding`: Generate embedding of a given text
+ `presence_penalty`: Repeat alpha presence penalty (default: 0.0, 0.0 = disabled).
-*Options:*
+ `frequency_penalty`: Repeat alpha frequency penalty (default: 0.0, 0.0 = disabled).
-`content`: Set the text to get generate the embedding.
+ `mirostat`: Enable Mirostat sampling, controlling perplexity during text generation (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0).
-`threads`: Set the number of threads to use during computation.
+ `mirostat_tau`: Set the Mirostat target entropy, parameter tau (default: 5.0).
-To use this endpoint, you need to start the server with the `--embedding` option added.
+ `mirostat_eta`: Set the Mirostat learning rate, parameter eta (default: 0.1).
-- **POST** `hostname:port/tokenize`: Tokenize a given text
+ `seed`: Set the random number generator (RNG) seed (default: -1, < 0 = random seed).
-*Options:*
+ `ignore_eos`: Ignore end of stream token and continue generating (default: false).
-`content`: Set the text to tokenize.
+ `logit_bias`: Modify the likelihood of a token appearing in the generated text completion. For example, use `"logit_bias": [[15043,1.0]]` to increase the likelihood of the token 'Hello', or `"logit_bias": [[15043,-1.0]]` to decrease its likelihood. Setting the value to false, e.g. `"logit_bias": [[15043,false]]`, ensures that the token `Hello` is never produced (default: []).
-- **GET** `hostname:port/next-token`: Receive the next token predicted, execute this request in a loop. Make sure set `as_loop` as `true` in the completion request.
+- **POST** `/tokenize`: Tokenize a given text.
-*Options:*
+ *Options:*
-`stop`: Set `hostname:port/next-token?stop=true` to stop the token generation.
+ `content`: Set the text to tokenize.
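+
+For illustration, here is a `/completion` request that combines several of the options above, followed by a `/tokenize` request (the prompt and option values are arbitrary examples):
+
+```sh
+# option values below are arbitrary examples
+curl --request POST \
+    --url http://localhost:8080/completion \
+    --data '{
+        "prompt": "Building a website can be done in 10 simple steps:",
+        "n_predict": 64,
+        "temperature": 0.7,
+        "top_k": 40,
+        "top_p": 0.9,
+        "n_keep": -1,
+        "stop": ["\n###"]
+    }'
+
+# tokenize a piece of text
+curl --request POST \
+    --url http://localhost:8080/tokenize \
+    --data '{"content": "Building a website can be done in 10 simple steps:"}'
+```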
## More examples
### Interactive mode
-This mode allows interacting in a chat-like manner. It is recommended for models designed as assistants such as `Vicuna`, `WizardLM`, `Koala`, among others. Make sure to add the correct stop word for the corresponding model.
-
-The prompt should be generated by you, according to the model's guidelines. You should keep adding the model's completions to the context as well.
-
-This example works well for `Vicuna - version 1`.
-
-```javascript
-const axios = require("axios");
-
-let prompt = `A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
-### Human: Hello, Assistant.
-### Assistant: Hello. How may I help you today?
-### Human: Please tell me the largest city in Europe.
-### Assistant: Sure. The largest city in Europe is Moscow, the capital of Russia.`;
-
-async function ChatCompletion(answer) {
- // the user's next question to the prompt
- prompt += `\n### Human: ${answer}\n`
-
- result = await axios.post("http://127.0.0.1:8080/completion", {
- prompt,
- batch_size: 128,
- temperature: 0.2,
- top_k: 40,
- top_p: 0.9,
- n_keep: -1,
- n_predict: 2048,
- stop: ["\n### Human:"], // when detect this, stop completion
- exclude: ["### Assistant:"], // no show in the completion
- threads: 8,
- as_loop: true, // use this to request the completion token by token
- interactive: true, // enable the detection of a stop word
- });
-
- // create a loop to receive every token predicted
- // note: this operation is blocking, avoid use this in a ui thread
-
- let message = "";
- while (true) {
- // you can stop the inference adding '?stop=true' like this http://127.0.0.1:8080/next-token?stop=true
- result = await axios.get("http://127.0.0.1:8080/next-token");
- process.stdout.write(result.data.content);
- message += result.data.content;
-
- // to avoid an infinite loop
- if (result.data.stop) {
- console.log("Completed");
- // make sure to add the completion to the prompt.
- prompt += `### Assistant: ${message}`;
- break;
- }
- }
-}
-
-// This function should be called every time a question to the model is needed.
-async function Test() {
- // the server can't inference in paralell
- await ChatCompletion("Write a long story about a time magician in a fantasy world");
- await ChatCompletion("Summary the story");
-}
-
-Test();
-```
-
-### Alpaca example
-
-**Temporaly note:** no tested, if you have the model, please test it and report me some issue
-
-```javascript
-const axios = require("axios");
-
-let prompt = `Below is an instruction that describes a task. Write a response that appropriately completes the request.
-`;
-
-async function DoInstruction(instruction) {
- prompt += `\n\n### Instruction:\n\n${instruction}\n\n### Response:\n\n`;
- result = await axios.post("http://127.0.0.1:8080/completion", {
- prompt,
- batch_size: 128,
- temperature: 0.2,
- top_k: 40,
- top_p: 0.9,
- n_keep: -1,
- n_predict: 2048,
- stop: ["### Instruction:\n\n"], // when detect this, stop completion
- exclude: [], // no show in the completion
- threads: 8,
- as_loop: true, // use this to request the completion token by token
- interactive: true, // enable the detection of a stop word
- });
-
- // create a loop to receive every token predicted
- // note: this operation is blocking, avoid use this in a ui thread
-
- let message = "";
- while (true) {
- result = await axios.get("http://127.0.0.1:8080/next-token");
- process.stdout.write(result.data.content);
- message += result.data.content;
-
- // to avoid an infinite loop
- if (result.data.stop) {
- console.log("Completed");
- // make sure to add the completion and the user's next question to the prompt.
- prompt += message;
- break;
- }
- }
-}
+Check the sample in [chat.mjs](chat.mjs).
+Run with NodeJS version 16 or later:
-// This function should be called every time a instruction to the model is needed.
-DoInstruction("Destroy the world"); // as joke
+```sh
+node chat.mjs
```
-### Embeddings
-
-First, run the server with `--embedding` option:
-
-```bash
-server -m models/7B/ggml-model.bin --ctx_size 2048 --embedding
-```
-
-Run this code in NodeJS:
-
-```javascript
-const axios = require('axios');
-
-async function Test() {
- let result = await axios.post("http://127.0.0.1:8080/embedding", {
- content: `Hello`,
- threads: 5
- });
- // print the embedding array
- console.log(result.data.embedding);
-}
-
-Test();
-```
-
-### Tokenize
-
-Run this code in NodeJS:
-
-```javascript
-const axios = require('axios');
-
-async function Test() {
- let result = await axios.post("http://127.0.0.1:8080/tokenize", {
- content: `Hello`
- });
- // print the embedding array
- console.log(result.data.tokens);
-}
+Another sample is provided in [chat.sh](chat.sh).
+Requires [bash](https://www.gnu.org/software/bash/), [curl](https://curl.se) and [jq](https://jqlang.github.io/jq/).
+Run with bash:
-Test();
+```sh
+bash chat.sh
```
-
-## Common Options
-
-- `-m FNAME, --model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.bin`).
-- `-c N, --ctx-size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.
-- `-ngl N, --n-gpu-layers N`: When compiled with appropriate support (currently CLBlast or cuBLAS), this option allows offloading some layers to the GPU for computation. Generally results in increased performance.
-- `-mg i, --main-gpu i`: When using multiple GPUs this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. By default GPU 0 is used. Requires cuBLAS.
-- `-ts SPLIT, --tensor-split SPLIT`: When using multiple GPUs this option controls how large tensors should be split across all GPUs. `SPLIT` is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default the data is split in proportion to VRAM but this may not be optimal for performance. Requires cuBLAS.
-- `-lv, --low-vram`: Do not allocate a VRAM scratch buffer for holding temporary results. Reduces VRAM usage at the cost of performance, particularly prompt processing speed. Requires cuBLAS.
-- `--embedding`: Enable the embedding mode. **Completion function doesn't work in this mode**.
-- `--host`: Set the hostname or ip address to listen. Default `127.0.0.1`;
-- `--port`: Set the port to listen. Default: `8080`.
-
-### RNG Seed
-
-- `-s SEED, --seed SEED`: Set the random number generator (RNG) seed (default: -1, < 0 = random seed).
-
-The RNG seed is used to initialize the random number generator that influences the text generation process. By setting a specific seed value, you can obtain consistent and reproducible results across multiple runs with the same input and settings. This can be helpful for testing, debugging, or comparing the effects of different options on the generated text to see when they diverge. If the seed is set to a value less than 0, a random seed will be used, which will result in different outputs on each run.
-
-## Performance Tuning and Memory Options
-
-### No Memory Mapping
-
-- `--no-mmap`: Do not memory-map the model. By default, models are mapped into memory, which allows the system to load only the necessary parts of the model as needed. However, if the model is larger than your total amount of RAM or if your system is low on available memory, using mmap might increase the risk of pageouts, negatively impacting performance.
-
-### Memory Float 32
-
-- `--memory-f32`: Use 32-bit floats instead of 16-bit floats for memory key+value. This doubles the context memory requirement but does not appear to increase generation quality in a measurable way. Not recommended.
-
-## Limitations:
-
-- The actual implementation of llama.cpp need a `llama-state` for handle multiple contexts and clients, but this could require more powerful hardware.