Tokenization
Tokenization is the process of splitting text into smaller units, called tokens, that an LLM can process. Subword algorithms such as Byte Pair Encoding (BPE) or WordPiece perform this splitting, producing a vocabulary of units that covers both frequent words and rare words (the latter are broken into smaller known pieces).
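A minimal sketch of subword tokenization, assuming the Hugging Face `transformers` library is installed and using the GPT-2 tokenizer (which is BPE-based) purely as an illustration:

```python
from transformers import AutoTokenizer

# Load a pretrained BPE tokenizer (GPT-2's vocabulary, used here only as an example).
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization splits text into subword units."

# Split the text into subword pieces and map them to integer token IDs.
pieces = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(pieces)  # subword strings; rare words are split into multiple pieces
print(ids)     # the integer IDs the model actually consumes
```

Running a sketch like this shows why subword tokenization works: common words tend to map to a single token, while uncommon words are decomposed into several shorter pieces from the same fixed vocabulary.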