data:image/s3,"s3://crabby-images/d756b/d756bf89e8acb6181e25982fa12cfa3d6e2cf7e5" alt="Parts of a typewriter. Photo by Florian Klauer."
The token <|endoftext|> is a special token used as a document separator for OpenAI GPT models. It has become quite prevalent if you look closely:
- It has been used since GPT-2 and remains present in the OpenAI API for their latest models. Their tokenizer package, tiktoken, includes logic to process text with these special tokens.
- The markup <| and |> is widely used in the code bases of LangChain and text-generation-webui. It usually serves as a lightweight templating syntax to mark particular text sections for replacement.
- Improper handling of this special token has led to interesting results in chatbot interfaces.
The use of <| and |> might seem random. My theory is that people who introduced this convention use a coding font with programming ligatures. As shown in the figures below, it makes more visual sense with the ligatures enabled, as the cognitive load is greatly reduced when distinguishing them. I find my brain finally starting to understand what the code is trying to say.
data:image/s3,"s3://crabby-images/00d1d/00d1d73154da291dd57249099fe6f34e769f6b2a" alt="Code snippet from text-generation-webui, ligatures off."
Figure 1: Code snippet from text-generation-webui, ligatures off.
data:image/s3,"s3://crabby-images/bad70/bad70e51c8a1edff999dc19029014641ed42538f" alt="Code snippet from text-generation-webui, ligatures on."
Figure 2: Code snippet from text-generation-webui, ligatures on.
Now, here are five popular typefaces with original ligature designs that I often use as my main coding font, displayed in regular weight (400). You can find many more alternatives in the ToxicFrog/Ligaturizer repository, where regular monospaced fonts are patched with Fira Code ligatures.
Cascadia Code
ENDOFTEXT = "<|endoftext|>"
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"
ENDOFPROMPT = "<|endofprompt|>"
## :: := => <- -> == != <= >= ++ || &&
Fira Code
ENDOFTEXT = "<|endoftext|>"
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"
ENDOFPROMPT = "<|endofprompt|>"
## :: := => <- -> == != <= >= ++ || &&
Hasklig
ENDOFTEXT = "<|endoftext|>"
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"
ENDOFPROMPT = "<|endofprompt|>"
## :: := => <- -> == != <= >= ++ || &&
JetBrains Mono
ENDOFTEXT = "<|endoftext|>"
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"
ENDOFPROMPT = "<|endofprompt|>"
## :: := => <- -> == != <= >= ++ || &&
Monaspace Argon
ENDOFTEXT = "<|endoftext|>"
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"
ENDOFPROMPT = "<|endofprompt|>"
## :: := => <- -> == != <= >= ++ || &&
Of course, this post should end with a proper…
<|endoftext|>