Model

Token classification is the task of assigning a category to each token of a given sentence. A classifier is attached to the output of each token.
NER
Named Entity Recognition. The process of recognizing specific meaningful words, phrases, or entities — such as person names and organization names — from documents using context.
The same word can be recognized as different entities, so understanding context is important.
https://github.com/kakaobrain/pororo
A library for NLP and speech tasks developed by Kakao Brain. It covers most tasks that can be run on Korean text, including NER.
POS Tagging
Part-of-speech tagging.
- Splitting text into morphemes and labeling each one with its part of speech.
Korean data
- kor_ner
  - An NER dataset published by Korea Maritime and Ocean University.
  - NER datasets typically include POS tagging information, and kor_ner does too.
  - Labeled with BIO tags (reference: wikidocs).
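
To make the BIO scheme concrete, here is a minimal sketch that groups a BIO tag sequence back into entity spans. The function name `bio_to_spans` and the labels `PER` and `ORG` are illustrative assumptions, not necessarily the tag set used by kor_ner. A stray `I-` tag with no preceding `B-` is treated as the start of a new span, which is one lenient choice among several.

```python
def bio_to_spans(tags):
    """Group BIO tags into (start, end, label) spans; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an "I-" tag with the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Stray "I-" without a "B-": open a new span (lenient assumption).
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


# Hypothetical tag sequence for a six-token sentence.
example_tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG"]
example_spans = bio_to_spans(example_tags)
print(example_spans)
```

`B-` marks the first token of an entity, `I-` marks a token inside the same entity, and `O` marks a token outside any entity.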

Training

As mentioned above, a classifier is attached to each token for training.
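
A rough sketch of what "a classifier attached to each token" means, in pure Python: the same linear layer and softmax are applied to every token vector, and the training loss is the cross-entropy averaged over tokens. The weights, the 2-dimensional token vectors, and the label set are made-up toy values. In practice the vectors would come from an encoder and the weights would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def token_probs(token_vectors, weights, bias):
    """Apply one shared linear classifier to each token vector independently."""
    all_probs = []
    for vec in token_vectors:
        scores = [sum(w * x for w, x in zip(row, vec)) + b
                  for row, b in zip(weights, bias)]
        all_probs.append(softmax(scores))
    return all_probs


def mean_cross_entropy(all_probs, gold_ids):
    """Average negative log-likelihood of the gold label over all tokens."""
    losses = [-math.log(p[g]) for p, g in zip(all_probs, gold_ids)]
    return sum(losses) / len(losses)


# Toy values (assumptions for illustration only).
labels = ["O", "B-PER", "I-PER"]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one row per label
bias = [0.0, 0.0, 0.0]
token_vectors = [[2.0, 0.0], [0.0, 3.0], [-1.0, -1.0]]

probs = token_probs(token_vectors, weights, bias)
predicted = [labels[max(range(len(p)), key=p.__getitem__)] for p in probs]
loss = mean_cross_entropy(probs, [0, 1, 2])
print(predicted, loss)
```

Because every token gets its own prediction and its own loss term, the gold data must supply exactly one label per token.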

Character-level tokenization is recommended for token classification. With morpheme-level or word-level tokenization, entity boundaries may not line up with token boundaries, so the definition of an entity becomes ambiguous. For example, if a Korean name such as "Yi Sun-sin" (이순신) is split at morpheme boundaries, the resulting fragments are very hard to classify as a person name no matter how much training is done.
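
As an illustration of character-level labeling, the sketch below splits a sentence into characters and assigns BIO tags from character-offset entity spans. The function name `char_bio_tags`, the example sentence, and the `PER` label are assumptions for illustration. Spaces are kept as tokens and tagged `O`, which is one possible convention.

```python
def char_bio_tags(sentence, entities):
    """Tokenize by character and tag each character with BIO labels.

    entities: list of (start, end, label) character offsets, end exclusive.
    """
    tokens = list(sentence)
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters
    return tokens, tags


# The name 이순신 occupies characters 0..2, so the span is (0, 3).
tokens, tags = char_bio_tags("이순신은 장군이다", [(0, 3, "PER")])
for token, tag in zip(tokens, tags):
    print(repr(token), tag)
```

Since every entity boundary falls between two characters, the span can always be expressed exactly, which is not guaranteed with morpheme or word tokens.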