The token <|endoftext|> is a special token that OpenAI GPT models use as a document separator. Once you start looking for it, it turns up in quite a few places:
- It has been used since GPT-2 and remains present in the OpenAI API for their latest models. Their tokenizer package, tiktoken, includes logic to process text with these special tokens.
- The markup <| and |> is widely used in the code bases of LangChain and text-generation-webui. It usually serves as a lightweight templating syntax to mark particular text sections for replacement.
- Improper handling of this special token has led to interesting results in chatbot interfaces.
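To make the distinction between special tokens and ordinary text concrete, here is a minimal sketch in plain Python. It is not how tiktoken works internally. The names `SPECIAL_TOKENS`, `split_on_special`, and `split_documents` are hypothetical helpers made up for this illustration, and the token list is simply the set of strings from the font samples further down.

```python
import re

# Hypothetical list of special tokens, copied from the font samples in this post.
SPECIAL_TOKENS = (
    "<|endoftext|>",
    "<|fim_prefix|>",
    "<|fim_middle|>",
    "<|fim_suffix|>",
    "<|endofprompt|>",
)


def split_on_special(text, special_tokens=SPECIAL_TOKENS):
    """Split text into (is_special, piece) pairs, preserving order."""
    if not special_tokens:
        return [(False, text)] if text else []
    # Escape each token so that "|" is matched literally, not as regex alternation.
    alternatives = "|".join(re.escape(token) for token in special_tokens)
    # The capturing group makes re.split keep the separators in its output.
    pattern = re.compile("(" + alternatives + ")")
    pieces = []
    for piece in pattern.split(text):
        if piece:  # drop the empty strings produced at the boundaries
            pieces.append((piece in special_tokens, piece))
    return pieces


def split_documents(corpus):
    """Treat <|endoftext|> as a document separator and return the documents."""
    parts = split_on_special(corpus, ("<|endoftext|>",))
    return [piece for is_special, piece in parts if not is_special]


if __name__ == "__main__":
    print(split_on_special("Hello<|endoftext|>World"))
    print(split_documents("first doc<|endoftext|>second doc<|endoftext|>"))
```

A chat interface could use a check like this on user input to decide whether a marker should be rejected or kept as literal text, rather than silently passing it through as a control token.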
The choice of <| and |> might seem arbitrary. My theory is that the people who introduced this convention were using a coding font with programming ligatures. As the figures below show, the markup makes more visual sense with ligatures enabled, because it takes far less effort to tell the delimiters apart from the text around them. With ligatures on, my brain finally starts to understand what the code is trying to say.
Now, here are five popular typefaces with original ligature designs that I often use as my main coding font, displayed in regular weight (400). You can find many more alternatives in the ToxicFrog/Ligaturizer repository, where regular monospaced fonts are patched with Fira Code ligatures.
Cascadia Code
ENDOFTEXT = "<|endoftext|>"
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"
ENDOFPROMPT = "<|endofprompt|>"
## :: := => <- -> == != <= >= ++ || &&
Fira Code
ENDOFTEXT = "<|endoftext|>"
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"
ENDOFPROMPT = "<|endofprompt|>"
## :: := => <- -> == != <= >= ++ || &&
Hasklig
ENDOFTEXT = "<|endoftext|>"
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"
ENDOFPROMPT = "<|endofprompt|>"
## :: := => <- -> == != <= >= ++ || &&
JetBrains Mono
ENDOFTEXT = "<|endoftext|>"
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"
ENDOFPROMPT = "<|endofprompt|>"
## :: := => <- -> == != <= >= ++ || &&
Monaspace Argon
ENDOFTEXT = "<|endoftext|>"
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"
ENDOFPROMPT = "<|endofprompt|>"
## :: := => <- -> == != <= >= ++ || &&
Of course, this post should end with a proper…
<|endoftext|>