author    Ronsor <ronsor@ronsor.pw>  2023-03-15 12:37:50 -0700
committer GitHub <noreply@github.com>  2023-03-15 21:37:50 +0200
commit    956dfda8ad8cea7961e22e0384bbc315bf79aed2 (patch)
tree      57210ba963ca22ecab007fe2841f02100ad423a8 /convert-pth-to-ggml.py
parent    113e685d18ac4edb20f647fd34b000941556f6a6 (diff)
Use `tokenizer.vocab_size()` instead of hardcoding 32000 in convert-pth-to-ggml.py (#142)
Special tokens or other new tokens can be added to the tokenizer, so it is best not to assume the vocabulary is exactly 32000 tokens.
Diffstat (limited to 'convert-pth-to-ggml.py')
-rw-r--r--  convert-pth-to-ggml.py  2
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/convert-pth-to-ggml.py b/convert-pth-to-ggml.py
index d255750..5c36e9c 100644
--- a/convert-pth-to-ggml.py
+++ b/convert-pth-to-ggml.py
@@ -99,7 +99,7 @@ for p in range(n_parts):
     fout.write(struct.pack("i", ftype))
 
     # Is this correct??
-    for i in range(32000):
+    for i in range(tokenizer.vocab_size()):
         if tokenizer.is_unknown(i):
             # "<unk>" token (translated as ??)
             text = " \u2047 ".encode("utf-8")
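For context, the vocabulary size is a property of the SentencePiece model itself, so it can be read from the tokenizer rather than hardcoded. Below is a minimal, standalone sketch of the token-export loop after this change; the "tokenizer.model" and "vocab.bin" paths are placeholders for illustration, not part of the patch, and the real script writes into the GGML model file it is converting.

# Standalone sketch of the vocabulary-export loop using the tokenizer's own
# size instead of a hardcoded 32000. Paths below are hypothetical examples.
import struct

from sentencepiece import SentencePieceProcessor

tokenizer = SentencePieceProcessor("tokenizer.model")  # placeholder path

with open("vocab.bin", "wb") as fout:  # placeholder output file
    for i in range(tokenizer.vocab_size()):  # instead of range(32000)
        if tokenizer.is_unknown(i):
            # "<unk>" token (translated as ??)
            text = " \u2047 ".encode("utf-8")
        elif tokenizer.is_control(i):
            # control tokens (e.g. <s>, </s>) carry no text
            text = b""
        else:
            # normal piece: SentencePiece marks word boundaries with U+2581
            text = tokenizer.id_to_piece(i).replace("\u2581", " ").encode("utf-8")
        # length-prefixed token text, matching the script's struct.pack("i", ...) layout
        fout.write(struct.pack("i", len(text)))
        fout.write(text)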