Overview

Unicode, Tokenization

October 12, 2021

Unicode

e.g., U+AC00

  • ‘U+’: prefix denoting Unicode
  • ‘AC00’: hexadecimal code point
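The notation maps directly to Python: `U+AC00` is the hexadecimal integer `0xAC00`, and `chr` turns it into the character it names. A quick illustration:

```python
# U+AC00 is HANGUL SYLLABLE GA ('가'); the hex digits are the code point.
code_point = 0xAC00
ch = chr(code_point)
print(ch)                        # 가
print(f"U+{ord(ch):04X}")        # U+AC00 — reconstruct the standard notation
```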

Python

  • ord: character to Unicode code point

  • chr: Unicode code point to character

  • Precomposed Hangul (the Hangul Syllables block): 11,172 syllables, each a single code point

    • len('가') returns 1
  • Conjoining Jamo (decomposed Hangul): one code point per jamo

    • len returns the number of jamo, e.g. 2 for the decomposed form of '가'
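The two representations can be converted with `unicodedata.normalize` (NFC composes, NFD decomposes), which makes the `len` difference easy to see:

```python
import unicodedata

s = "가"                                      # precomposed: one code point (U+AC00)
print(len(s))                                 # 1

decomposed = unicodedata.normalize("NFD", s)  # conjoining jamo: ᄀ (U+1100) + ᅡ (U+1161)
print(len(decomposed))                        # 2

# NFC recomposes the jamo back into the precomposed syllable.
print(unicodedata.normalize("NFC", decomposed) == s)  # True
```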

Tokenization

There is growing consensus that hand-written tokenization rules have their limits. The recent trend is toward data-driven approaches that learn the vocabulary from a corpus.

  • Subword
    • Frequently occurring character combinations are treated as single units.
    • Infrequent combinations are split into subwords.
  • BPE (Byte-Pair Encoding)
    • Iteratively replaces the most frequent adjacent symbol pair (character- or byte-level) with a new merged symbol, until a target vocabulary size is reached.
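A single BPE merge step can be sketched as follows. This is a minimal illustration, not a production tokenizer; the toy corpus and the helper `bpe_merge` are made up for the example (the corpus follows the classic "low/lower/newest/widest" illustration):

```python
from collections import Counter

def bpe_merge(corpus):
    """One BPE step: count adjacent symbol pairs across the corpus,
    then merge the most frequent pair into a single new symbol."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return corpus, None
    best = max(pairs, key=pairs.get)          # most frequent adjacent pair
    merged = {
        word.replace(" ".join(best), "".join(best)): freq
        for word, freq in corpus.items()
    }
    return merged, best

# Toy corpus: words pre-split into characters, with occurrence counts.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
corpus, pair = bpe_merge(corpus)
print(pair)    # ('e', 's') — the most frequent pair (count 9)
print(corpus)  # 'e s' is now the single symbol 'es' everywhere
```

Repeating this step grows the vocabulary one merged symbol at a time, so frequent character combinations end up as single tokens while rare words stay split into subwords.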