You should (maybe) enable font ligatures when building with GPT models

Parts of a typewriter. Photo by Florian Klauer.

<|endoftext|> is a special token that OpenAI GPT models use as a document separator. Once you start looking for it, it turns up in many places:

  • It has been used since GPT-2 and remains present in the OpenAI API for their latest models. Their tokenizer package, tiktoken, includes logic to process text with these special tokens.
  • The markup <| and |> is widely used in the code bases of LangChain and text-generation-webui. It usually serves as a lightweight templating syntax to mark particular text sections for replacement.
  • Improper handling of this special token has led to interesting results in chatbot interfaces.
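
To make the templating use concrete, here is a minimal sketch in plain Python. It is not code from LangChain, text-generation-webui, or tiktoken. The helper names (find_markers, fill_template) and the set of characters allowed inside a marker are assumptions made for illustration only.

```python
import re

# Matches markers such as <|endoftext|> or <|user-message|>.
# Assumption: marker names use only letters, digits, "_" and "-".
MARKER = re.compile(r"<\|([A-Za-z0-9_-]+)\|>")


def find_markers(text: str) -> list[str]:
    """Return the names of all <|name|> markers in text, in order of appearance."""
    return MARKER.findall(text)


def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace each <|name|> marker that has a value; leave unknown markers untouched."""
    return MARKER.sub(lambda m: values.get(m.group(1), m.group(0)), template)


if __name__ == "__main__":
    prompt = "User: <|user-message|>\nBot:<|endoftext|>"
    print(find_markers(prompt))
    print(fill_template(prompt, {"user-message": "Hello!"}))
```

Leaving unknown markers in place is a deliberate choice in this sketch: a real special token such as <|endoftext|> passes through unchanged instead of being silently dropped. The same find_markers check could also be run on untrusted user input to spot text that imitates a special token before it reaches a model.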

The choice of <| and |> might seem random. My theory is that the people who introduced this convention were using a coding font with programming ligatures. As the figures below show, the delimiters make more visual sense with ligatures enabled: each pair is drawn as one joined shape, so it takes far less effort to tell the markers apart from the surrounding text. With ligatures on, I find my brain finally starting to understand what the code is trying to say.

Figure 1: Code snippet from text-generation-webui, ligatures off.

Figure 2: Code snippet from text-generation-webui, ligatures on.

Now, here are five popular typefaces with original ligature designs that I often use as my main coding font, displayed in regular weight (400). You can find many more alternatives in the ToxicFrog/Ligaturizer repository, where regular monospaced fonts are patched with Fira Code ligatures.

Cascadia Code

ENDOFTEXT = "<|endoftext|>"
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"
ENDOFPROMPT = "<|endofprompt|>"
## :: := => <- -> == != <= >= ++ || &&

Fira Code

ENDOFTEXT = "<|endoftext|>"
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"
ENDOFPROMPT = "<|endofprompt|>"
## :: := => <- -> == != <= >= ++ || &&

Hasklig

ENDOFTEXT = "<|endoftext|>"
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"
ENDOFPROMPT = "<|endofprompt|>"
## :: := => <- -> == != <= >= ++ || &&

JetBrains Mono

ENDOFTEXT = "<|endoftext|>"
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"
ENDOFPROMPT = "<|endofprompt|>"
## :: := => <- -> == != <= >= ++ || &&

Monaspace Argon

ENDOFTEXT = "<|endoftext|>"
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"
ENDOFPROMPT = "<|endofprompt|>"
## :: := => <- -> == != <= >= ++ || &&

Of course, this post should end with a proper…

<|endoftext|>