diff options

| author | thement <40525767+thement@users.noreply.github.com> | 2023-03-17 21:05:58 +0100 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2023-03-17 21:05:58 +0100 |
| commit | c9f670a17755311aa28c411f5c7f3c8c05434770 (patch) | |
| tree | a942b84194bc4436df9d38eb3b06175e0e849166 /main.cpp | |
| parent | 4f546091102a418ffdc6230f872ac56e5cedb835 (diff) | |
Implement non-greedy tokenizer that tries to maximize token lengths (#242)
* Implement non-greedy tokenizer that tries to maximize token lengths
* Insert single space in front of the prompt
- this is to match original llama tokenizer behavior
---------
Co-authored-by: Jakub Horak <jakub.horak@ibawizard.net>
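
The commit title describes tokenization that prefers the longest vocabulary entry at each position. A minimal sketch of that longest-match idea, under stated assumptions: `tokenize_longest` and its string-set vocabulary are hypothetical illustrations, not the actual `llama_tokenize` implementation from this commit, which looks tokens up in `gpt_vocab` and returns token ids.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Sketch of longest-match tokenization: at each position, try candidate
// substrings from the longest allowed length down to 1, and emit the first
// one found in the vocabulary (falling back to a single character).
std::vector<std::string> tokenize_longest(const std::string & text,
                                          const std::unordered_set<std::string> & vocab,
                                          size_t max_token_len) {
    std::vector<std::string> out;
    size_t i = 0;
    while (i < text.size()) {
        size_t best = 1; // fall back to a single character
        const size_t limit = std::min(max_token_len, text.size() - i);
        for (size_t len = limit; len >= 1; --len) {
            if (vocab.count(text.substr(i, len))) {
                best = len;
                break;
            }
        }
        out.push_back(text.substr(i, best));
        i += best;
    }
    return out;
}
```

With a toy vocabulary containing `" hel"` and `"lo"`, the prompt `" hello"` splits into two long tokens rather than six single characters, which is why prepending a space to the prompt (next point in the message) changes which tokens the first word maps to.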
Diffstat (limited to 'main.cpp')
-rw-r--r-- | main.cpp | 2 |
1 file changed, 2 insertions, 0 deletions
```diff
@@ -845,6 +845,8 @@ int main(int argc, char ** argv) {
 
     std::vector<float> logits;
 
+    // Add a space in front of the first character to match OG llama tokenizer behavior
+    params.prompt.insert(0, 1, ' ');
     // tokenize the prompt
     std::vector<gpt_vocab::id> embd_inp = ::llama_tokenize(vocab, params.prompt, true);
 
```
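
The added line uses the `std::string::insert(pos, count, ch)` overload, which inserts `count` copies of `ch` at index `pos`. A standalone illustration of the same call shape; `with_leading_space` is a hypothetical wrapper for demonstration, not part of main.cpp:

```cpp
#include <string>

// std::string::insert(pos, count, ch): here, one ' ' at index 0, the same
// call shape as the patched params.prompt.insert(0, 1, ' '). The original
// llama tokenizer always sees a leading space before the first word, so
// prepending one keeps the first token consistent with that behavior.
std::string with_leading_space(std::string s) {
    s.insert(0, 1, ' ');
    return s;
}
```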