sunghogigio

NLP Preprocessing

Stopwords

ref: https://bkshin.tistory.com/entry/NLP-3-불용어Stop-word-제거
Stopwords are words that carry little meaning for analysis, such as the articles a, an, and the, or pronouns like I and my.

spaCy exposes a boolean is_stop attribute on each token produced by the nlp object.
NLTK ships a stopword list.

import nltk
nltk.download('stopwords')  # fetch the stopword corpus (only needed once)
print('Number of English stopwords:', len(nltk.corpus.stopwords.words('english')))
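
As a rough illustration of what such a list is used for, here is a minimal sketch that filters stopwords out of a tokenized sentence. The small SAMPLE_STOPWORDS set and the remove_stopwords helper are hypothetical stand-ins so the snippet runs without any download; in practice you would probably pass set(nltk.corpus.stopwords.words('english')) instead.

```python
# Hypothetical hand-picked subset for illustration; this is NOT NLTK's actual list.
SAMPLE_STOPWORDS = {"a", "an", "the", "i", "my", "is", "on"}


def remove_stopwords(tokens, stopwords=SAMPLE_STOPWORDS):
    """Return the tokens whose lowercase form is not in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


# Naive whitespace tokenization, just to keep the example self-contained.
tokens = "I left my keys on the table".split()
filtered = remove_stopwords(tokens)
print(filtered)  # ['left', 'keys', 'table']
```

Comparing the lowercase form means "The" and "the" are treated the same, which matches how most stopword lists are stored.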

Lemmatization

ref: https://wikidocs.net/21707

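A word is made up of a stem and affixes.

Stem: the part that carries the word's core meaning
Affix: the part that adds extra meaning to the word

Lemmatization reduces a word to its lemma, the dictionary base form (for example, "are" becomes "be"). Simply stripping affixes to get the stem, without checking that the result is a real word, is called stemming.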
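
To make the stem/affix idea concrete, here is a toy sketch that contrasts crude suffix stripping (stemming) with a dictionary lookup (lemmatization). The suffix list, the LEMMAS table, and both helper functions are made up for illustration; real stemmers and lemmatizers, such as the ones in NLTK or spaCy, rely on much richer rules and dictionaries.

```python
# Hypothetical suffixes (affixes) to strip, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical lemma table covering a few irregular forms.
LEMMAS = {"are": "be", "is": "be", "has": "have", "mice": "mouse"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the lemma table; fall back to the word itself."""
    return LEMMAS.get(word, word)


for w in ["watched", "watches", "has", "mice"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer handles regular forms like "watched" but cannot map "mice" to "mouse"; the lookup handles irregular forms but only those it has an entry for. That trade-off is why real lemmatizers combine rules with a dictionary.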

Punctuation

ref: https://www.delftstack.com/ko/howto/python/how-to-strip-punctuation-from-a-string-in-python/#파이썬에서-문자열에서-구두점을-제거하기-위해-string-클래스-메서드-사용

Removing punctuation is one of the most common text normalization steps.

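Removing it with a regex
- text = re.sub(r"[^a-zA-Z0-9]", " ", text)
- Replaces every character other than ASCII letters and digits with a space.
- Substituting a space, rather than deleting the character, usually keeps the sentence structure as intact as possible.
In spaCy, a token's is_punct attribute tells you whether the token is punctuation.
Python's standard library works too.
- Use string.punctuation, the built-in string of punctuation characters.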
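
The regex and string.punctuation approaches above can be sketched side by side as follows. The sample sentence and variable names are only for illustration, and pairing string.punctuation with str.translate is one common way to use it, not necessarily what the referenced article does.

```python
import re
import string

text = "Hello, world! It's 2024."

# Regex: replace every character that is not an ASCII letter or digit with a space.
regex_clean = re.sub(r"[^a-zA-Z0-9]", " ", text)

# str.translate: delete exactly the characters listed in string.punctuation.
table = str.maketrans("", "", string.punctuation)
translate_clean = text.translate(table)

print(regex_clean.split())  # ['Hello', 'world', 'It', 's', '2024']
print(translate_clean)      # Hello world Its 2024
```

Note the difference on "It's": the regex version splits it into "It" and "s", while deleting the apostrophe joins it into "Its". Which one is preferable depends on the tokenizer that runs afterwards. The regex also replaces non-ASCII letters, so it is only suitable for English-only text.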
