NLP (Natural Language Processing)
Divided into NLU and NLG.
- NLU (Natural Language Understanding): understanding the intent behind language
- NLG (Natural Language Generation): teaching machines how to generate natural language
- Major conferences: ACL, EMNLP, NAACL
Low-level Parsing
Low-level tasks for meaning extraction:
Tokenization
- Token: the smallest grammatically indivisible unit of language
- Corpus: a body of text; a text sample
- Tokenization: splitting a corpus into tokens
In other words, a sentence is understood as a sequence of tokens.
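As a minimal sketch of the idea, a toy tokenizer can split a corpus into tokens with a regular expression (real tokenizers such as BPE or WordPiece are far more sophisticated; the regex here is an illustrative assumption):

```python
import re

def tokenize(corpus: str) -> list[str]:
    # Toy tokenizer: lowercase the text, then pull out runs of
    # letters, digits, and apostrophes as tokens.
    return re.findall(r"[a-z0-9']+", corpus.lower())

# The sentence becomes a sequence of tokens.
print(tokenize("A sentence is understood as a sequence of tokens."))
```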
Stemming
- Stem: the root of a word
- Stemming: extracting the root
In both English and Korean, words can take various inflectional endings attached to the stem. Stemming strips these inflections away to extract only the core meaning.
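A crude stemmer can be sketched by stripping a few common English suffixes (this is a hand-made toy, not a real algorithm like the Porter stemmer, which applies a carefully ordered rule set):

```python
def stem(word: str) -> str:
    # Toy stemmer: remove one common inflectional suffix, but only
    # if a reasonably long stem (3+ characters) remains.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["studies", "studied", "studying"]])
```

Note that even this simple version exposes the classic weakness of rule-based stemming: "studied" and "studying" reduce to slightly different stems ("studi" vs. "study").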
Word and Phrase Level
NER (Named Entity Recognition)
The process of recognizing named entities (proper nouns and similar expressions), which may span multiple words: people’s names, times, company names, etc.
POS (Part of Speech) Tagging
Figuring out the part of speech of words.
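As a sketch, a dictionary-based tagger looks each token up in a lexicon (the tiny lexicon and tag names below are made up for illustration; real taggers also use the surrounding context, since a word like "study" can be a noun or a verb):

```python
# Hypothetical miniature lexicon mapping words to part-of-speech tags.
LEXICON = {"i": "PRON", "study": "VERB", "math": "NOUN"}

def pos_tag(tokens: list[str]) -> list[tuple[str, str]]:
    # Tag each token by lexicon lookup; unknown words get "UNK".
    return [(t, LEXICON.get(t.lower(), "UNK")) for t in tokens]

print(pos_tag(["I", "study", "math"]))
```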
Sentence Level
Sentiment Analysis
Analyzing the sentiment of sentences. Evaluating positive/negative, etc.
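A minimal lexicon-based sketch of sentiment analysis counts positive versus negative word hits (the word lists are hypothetical; modern systems use trained models rather than fixed lexicons):

```python
# Toy sentiment lexicons (illustrative, not from any real resource).
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "terrible", "hate"}

def sentiment(sentence: str) -> str:
    # Score = (# positive words) - (# negative words).
    words = sentence.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this great phone"))  # positive
```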
Machine Translation
Translating text from one language to another, performed with careful consideration of the target language’s grammar and word order.
Multi-sentence and Paragraph Level
Entailment Prediction
Predicting logical relationships, such as entailment or contradiction, between two sentences.
Question Answering
Comprehending the meaning of sentences and providing the answer the user wants.
Dialog System
A task for handling conversations, like chatbots.
Summarization
A task for summarizing given documents.
Text Mining
- Major conferences: KDD, The WebConf (formerly WWW), WSDM, CIKM, ICWSM
- Extracting useful information and insights from text and document data
  - e.g., analyzing a public figure’s image over time, analyzing keyword frequency to gauge public reaction
- Document clustering (topic modeling)
  - Grouping terms with similar meanings into the same group
  - e.g., using keywords like value-for-money, durability, A/S to explore reactions to a product
- Highly related to computational social science
  - e.g., discovering social insights by analyzing social media data
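The keyword-frequency idea above can be sketched with a simple counter over a handful of hypothetical product reviews (the review strings are invented for illustration):

```python
from collections import Counter

# Hypothetical product reviews mentioning value-for-money,
# durability, and A/S (after-sales service).
reviews = [
    "great value-for-money but poor durability",
    "durability is fine and a/s was quick",
    "value-for-money product, a/s took too long",
]

# Count how often each keyword appears across all reviews.
counts = Counter(word for review in reviews for word in review.split())
print(counts.most_common(3))
```

Real text mining would first apply tokenization, stemming, and stop-word removal so that surface variants of the same keyword are counted together.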
Information Retrieval
Search-related technologies.
- Search technology used by Google, Naver, etc. has matured to the point that the pace of progress has slowed.
- The most active research area is recommendation systems.
- e.g., a search engine proactively suggests content the user might look for
Trends of NLP
CV development 2-3 years ago:
- Rapid progress through new convolution layer stacking methods and GAN usage
NLP:
- Developed relatively slowly compared to CV before the Transformer.
- RNN-based models like LSTM and GRU were predominantly used.
- After the 2017 “Attention is All You Need” paper, almost all NLP models use self-attention-based transformers.
Transformer
Originally designed for machine translation. Before deep learning, machine translation required experts to manually define and map all linguistic rules.
After deep learning, RNNs were trained so that the input and output were sentences in different languages sharing the same meaning. With the help of many techniques, RNN-based machine translation steadily improved until its performance hit a ceiling.
The transformer showed even better performance than RNNs for machine translation. After its introduction, it’s also been applied to image processing, time-series forecasting, drug discovery, material discovery, etc.
Normally, specialized models are developed for each domain and situation. But after the transformer, large models built by stacking self-attention were trained with self-supervised learning to handle general-purpose tasks. These models, applied to specific domains via transfer learning with minimal structural changes, were proven to outperform domain-specific models.
Self-Supervised Learning in NLP
It’s similar to fill-in-the-blank. For example, in “I study math,” you blank out “study” and train the model to infer what word goes there. In this example, the blank for “study” is where a verb should go, and verbs that can take “math” as an object would be candidates.
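The fill-in-the-blank objective can be sketched with simple context counting: score each candidate word by how often it appears in the same slot in a tiny corpus (the corpus below is made up, and counting is only a crude stand-in for what BERT-style neural models learn at scale):

```python
from collections import Counter

# Hypothetical miniature training corpus.
corpus = [
    "i study math",
    "i study physics",
    "i love music",
    "i study math",
]

def predict_blank(prefix: str, suffix: str) -> str:
    # Count every word that has appeared between this exact
    # prefix and suffix, and return the most frequent one.
    candidates = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i, w in enumerate(words):
            if " ".join(words[:i]) == prefix and " ".join(words[i + 1:]) == suffix:
                candidates[w] += 1
    return candidates.most_common(1)[0][0]

print(predict_blank("i", "math"))  # "study"
```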
Models trained this way include BERT, GPT, etc.
We can see technology progressing from AI that handles only specific tasks toward more general-purpose AI.
But this self-supervised learning requires massive data and GPU resources. Even Tesla reportedly spent billions on electricity for model training alone…
Word Embedding
Representing each word as a vector in a vector space, so that a sentence becomes a sequence of word vectors.
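A word embedding can be sketched as a lookup table from words to vectors (the 3-dimensional vectors below are hand-made for illustration; real embeddings such as Word2Vec are learned from data and have hundreds of dimensions):

```python
# Hypothetical embedding table: word -> 3-dimensional vector.
EMBEDDING = {
    "i":     [0.1, 0.0, 0.2],
    "study": [0.7, 0.3, 0.1],
    "math":  [0.6, 0.9, 0.4],
}

def embed(sentence: str) -> list[list[float]]:
    # A sentence becomes a sequence of word vectors.
    return [EMBEDDING[word] for word in sentence.lower().split()]

vectors = embed("I study math")
print(len(vectors), len(vectors[0]))  # 3 tokens, each a 3-dim vector
```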