From eb34620aeceaf9d9df7fcb19acc17ad41b9f60f8 Mon Sep 17 00:00:00 2001
From: Georgi Gerganov
Date: Tue, 21 Mar 2023 17:29:41 +0200
Subject: Add tokenizer test + revert to C++11 (#355)

* Add test-tokenizer-0 to do a few tokenizations - feel free to expand
* Added option to convert-pth-to-ggml.py script to dump just the vocabulary
* Added ./models/ggml-vocab.bin containing just LLaMA vocab data (used for tests)
* Added utility to load vocabulary file from previous point (temporary implementation)
* Avoid using std::string_view and drop back to C++11 (hope I didn't break something)
* Rename gpt_vocab -> llama_vocab
* All CMake binaries go into ./bin/ now
---
 models/ggml-vocab.bin | Bin 0 -> 432578 bytes
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 models/ggml-vocab.bin

(limited to 'models')

diff --git a/models/ggml-vocab.bin b/models/ggml-vocab.bin
new file mode 100644
index 0000000..aba94bd
Binary files /dev/null and b/models/ggml-vocab.bin differ
--
cgit v1.2.3