Stopwords
ref: https://bkshin.tistory.com/entry/NLP-3-%EB%B6%88%EC%9A%A9%EC%96%B4Stop-word-%EC%A0%9C%EA%B1%B0
Stopwords are words that carry little analytical significance. Articles like a, an, the and pronouns like I, my fall into this category.
- spaCy exposes a boolean `is_stop` attribute on the tokens of an NLP object.
- NLTK provides a stopword dictionary.
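As a minimal illustration of stopword filtering (the tiny stopword set below is hand-rolled for the example; in practice you would use NLTK's or spaCy's list):

```python
# Illustrative stopword set only -- real pipelines use NLTK's or spaCy's list.
STOPWORDS = {"a", "an", "the", "i", "my", "is", "of"}

def remove_stopwords(text):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]

print(remove_stopwords("The cat sat on a mat"))
```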
```python
import nltk

nltk.download('stopwords')
print('Number of English stopwords:', len(nltk.corpus.stopwords.words('english')))
```

Lemmatization
ref: https://wikidocs.net/21707
Words consist of stems and affixes.
- Stem: the part that carries the word's core meaning
- Affix: the part that adds supplementary meaning
Lemmatization is the process of stripping affixes to recover a word's base form (lemma).
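Real lemmatizers (e.g. NLTK's WordNetLemmatizer or spaCy's `token.lemma_`) consult a vocabulary and morphological rules. As a toy sketch of the stem/affix split only (the suffix list below is an assumption for illustration, not a real algorithm):

```python
# Toy affix stripping to illustrate stems vs. affixes. Real lemmatizers
# look words up in a vocabulary instead of blindly cutting suffixes.
SUFFIXES = ["ing", "ed", "es", "s"]  # checked in this order, longest first

def strip_affix(word):
    """Return (stem, affix), removing the first matching suffix."""
    for suffix in SUFFIXES:
        # Require a minimum stem length so short words are left intact.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)], suffix
    return word, ""

print(strip_affix("walked"))
print(strip_affix("cats"))
```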
Punctuation
Removing punctuation is one of the most common text normalization steps.
- Remove with a regex
  - `text = re.sub(r"[^a-zA-Z0-9]", " ", text)`
  - Replaces everything except letters and digits with spaces.
  - Replacing with spaces (rather than deleting) preserves word boundaries and sentence structure as much as possible.
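The regex approach above as a runnable snippet:

```python
import re

def normalize(text):
    """Replace every character that is not a letter or digit with a space."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text)

print(normalize("Hello, world!"))
```

Note that runs of punctuation and spaces become runs of spaces; a follow-up `" ".join(text.split())` collapses them if needed.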
- spaCy's token attribute `is_punct` tells you whether a token is punctuation.
- You can also use Python built-in functions.
- Use the punctuation list `string.punctuation`.
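A built-in alternative using `string.punctuation` with `str.translate` (note this deletes punctuation outright rather than replacing it with spaces):

```python
import string

def strip_punctuation(text):
    """Delete every character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))

print(strip_punctuation("Hello, world!"))
```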