Stopwords
ref: https://bkshin.tistory.com/entry/NLP-3-%EB%B6%88%EC%9A%A9%EC%96%B4Stop-word-%EC%A0%9C%EA%B1%B0
Stopwords are words that carry little analytical significance. Articles like a, an, the and pronouns like I, my fall into this category.
- spaCy exposes a boolean `is_stop` attribute on the tokens of an NLP object.
- NLTK provides a stopword dictionary.
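As a minimal illustration of stopword filtering (the tiny stopword set below is hand-rolled for the example; in practice you would use NLTK's or spaCy's list):

```python
# Illustrative stopword set only -- real pipelines use NLTK's or spaCy's list.
STOPWORDS = {"a", "an", "the", "i", "my", "is", "of"}

def remove_stopwords(text):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]

print(remove_stopwords("The cat sat on a mat"))
```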
```python
import nltk

nltk.download('stopwords')
print('Number of English stopwords:', len(nltk.corpus.stopwords.words('english')))
```

Lemmatization
ref: https://wikidocs.net/21707
Words consist of stems and affixes.
- Stem: the part that carries the word's core meaning
- Affix: the part that adds supplementary meaning
Lemmatization is the process of stripping affixes to recover a word's base form (lemma).
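Real lemmatizers (e.g. NLTK's WordNetLemmatizer or spaCy's `token.lemma_`) consult a vocabulary and morphological rules. As a toy sketch of the stem/affix split only (the suffix list below is an assumption for illustration, not a real algorithm):

```python
# Toy affix stripping to illustrate stems vs. affixes. Real lemmatizers
# look words up in a vocabulary instead of blindly cutting suffixes.
SUFFIXES = ["ing", "ed", "es", "s"]  # checked in this order, longest first

def strip_affix(word):
    """Return (stem, affix), removing the first matching suffix."""
    for suffix in SUFFIXES:
        # Require a minimum stem length so short words are left intact.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)], suffix
    return word, ""

print(strip_affix("walked"))
print(strip_affix("cats"))
```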
Punctuation
Removing punctuation is one of the most common text normalization steps.
- Remove with a regex
  - `text = re.sub(r"[^a-zA-Z0-9]", " ", text)`
  - Replaces everything except letters and digits with spaces.
  - Replacing with spaces (rather than deleting) preserves word boundaries and sentence structure as much as possible.
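The regex approach above as a runnable snippet:

```python
import re

def normalize(text):
    """Replace every character that is not a letter or digit with a space."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text)

print(normalize("Hello, world!"))
```

Note that runs of punctuation and spaces become runs of spaces; a follow-up `" ".join(text.split())` collapses them if needed.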
- spaCy's token attribute `is_punct` tells you whether a token is punctuation.
- You can also use Python built-in functions.
- Use the punctuation list `string.punctuation`.
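A built-in alternative using `string.punctuation` with `str.translate` (note this deletes punctuation outright rather than replacing it with spaces):

```python
import string

def strip_punctuation(text):
    """Delete every character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))

print(strip_punctuation("Hello, world!"))
```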