Question Answering
This is arguably the area that benefited the most from self-supervised learning with models like BERT and GPT.
- A question and context are given.
- Context can be understood as the surrounding information, though the exact meaning varies slightly by domain.
- Multiple-choice: several answer candidates are provided, and the model picks one.
- Span-based: the answer is extracted as a contiguous span of the passage, e.g., the tokens between positions 10 and 30 of the text.
- Yes/No: binary answer to the question.
- Generation-based: the answer itself is treated as a language generation task, as in GPT.
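The span-based format above can be sketched as picking the highest-scoring (start, end) pair from a model's per-token logits. A minimal sketch; the tokens and logit values below are made up for illustration:

```python
import numpy as np

def best_span(start_logits, end_logits, max_len=30):
    """Pick the (start, end) pair maximizing start+end score, with end >= start."""
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["The", "hotel", "lobby", "opens", "at", "9", "am"]
start = np.array([0.1, 0.2, 0.1, 0.1, 0.3, 2.5, 0.2])  # fake start logits
end   = np.array([0.1, 0.1, 0.2, 0.1, 0.2, 0.4, 2.2])  # fake end logits
s, e = best_span(start, end)
print(tokens[s:e + 1])  # -> ['9', 'am']
```

Real readers vectorize this search, but the selection rule is the same.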
Open-Domain Question Answering

Information is extracted from external knowledge such as knowledge bases and knowledge graphs (structured DBs) to perform QA. 
- Knowledge tuple (triple): a unit of information linking entities through a relation, e.g., ([Hotel], [HasA], [Lobby]).
- Knowledge graph: a structure composed of knowledge tuples.
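A minimal sketch of how such tuples can be stored and queried as a graph, extending the [Hotel]/[HasA]/[Lobby] example with invented triples:

```python
from collections import defaultdict

# Knowledge tuples as (head, relation, tail) triples; contents are illustrative.
triples = [
    ("Hotel", "HasA", "Lobby"),
    ("Hotel", "HasA", "Room"),
    ("Lobby", "LocatedIn", "GroundFloor"),
]

# Knowledge graph: adjacency from each head entity to its (relation, tail) edges.
graph = defaultdict(list)
for head, rel, tail in triples:
    graph[head].append((rel, tail))

# Answer "What does a hotel have?" by following HasA edges.
print([tail for rel, tail in graph["Hotel"] if rel == "HasA"])  # -> ['Lobby', 'Room']
```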

Open-domain QA retrieves information from external knowledge. Most NLP tasks learn from natural language sequence data. A recent trend is to use structured data like knowledge bases when fine-tuning these models.
Open-domain QA has the retriever extract evidence from external knowledge (whether a knowledge base or natural language data like Wikipedia) and then performs MRC on the retrieved passages.
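The retrieve-then-read pipeline can be sketched with a toy word-overlap retriever; real systems use TF-IDF/BM25 or dense embeddings, and a reader model would then run MRC on the top passage. The passages below are invented:

```python
def retrieve(question, passages, k=1):
    """Rank passages by word overlap with the question.
    A toy stand-in for TF-IDF/BM25 or dense retrieval."""
    q = set(question.lower().split())
    scored = sorted(passages,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return scored[:k]

passages = [
    "The hotel lobby is on the ground floor.",
    "Wikipedia is a free online encyclopedia.",
]
top = retrieve("Where is the hotel lobby?", passages)
print(top[0])  # the hotel passage; a reader would then extract the answer span
```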
Retrieval-augmented Language Model pre-training/fine-tuning

Instead of finding the answer in a given passage, it combines the pre-trained model’s internal knowledge with retrieved external knowledge to find answers. A form of zero-shot learning.
Open-domain Chatbot
There’s no standardized approach for chatbots yet; they’re usually built with Seq2Seq.
- Open-domain chatbot: can converse about open-ended, unconstrained topics.
  - Much harder than closed-domain.
- Closed-domain chatbot: designed for specific topics and purposes. Often uses human-designed models.
  - Limited freedom.
  - Usually classification-based.

Facebook’s Blender Bot 2.0 combines the model’s pre-existing knowledge with information retrieved from the internet to answer queries.
Unsupervised Neural Machine Translation
Standard translation tasks use labeled (parallel) data. This field aims to learn translation from unlabeled, monolingual data.
Back-translation

The same technique used in CycleGAN, StarGAN, etc.
- Parallel corpus: a document set aligned across languages, e.g., (English, Korean) sentence pairs.
The idea behind back-translation: translate English to French, then French back to English, and check whether the original sentence is recovered. The model is trained to minimize this round-trip difference. It resembles an AutoEncoder in that input and output should be similar, but unlike an AutoEncoder, we do care about the intermediate output: the French translation in between should actually be correct.
Of course, the intermediate result could be garbage while the final output still matches the input. To rule out this degenerate case, techniques such as denoising autoencoding or constraints on the decoder output are used.
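The round-trip idea can be sketched with toy dictionary “translators” standing in for real translation models; everything here is hypothetical:

```python
# Toy word-for-word "translators"; real back-translation uses learned models.
EN2FR = {"hello": "bonjour", "world": "monde"}
FR2EN = {v: k for k, v in EN2FR.items()}

def translate(sentence, table):
    """Map each word through the table, leaving unknown words unchanged."""
    return " ".join(table.get(w, w) for w in sentence.split())

src = "hello world"
round_trip = translate(translate(src, EN2FR), FR2EN)

# Back-translation trains on the difference between src and round_trip;
# a denoising objective keeps the intermediate French meaningful.
print(src == round_trip)  # -> True
```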
Text Style Transfer

A task that converts a source sentence into a desired style. For example, reordering words, converting casual text to formal text, etc. 
Implemented by inserting style information between encoder and decoder, or by feeding both x and style information to the transformer. Also called a conditional model or conditional generator.
- (a) Disentanglement: context (z) and style (s) are separated.
- (b) Entanglement: context and style are not separated.
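One way to feed style information to the decoder is to concatenate a style embedding onto the encoder context. A minimal sketch assuming a simple concatenation scheme; real models may instead add the vector or use a style start-token, and all names and sizes here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def condition_on_style(z, style_id, style_table):
    """Concatenate the encoder context z with a (hypothetical) learned
    style embedding before decoding."""
    return np.concatenate([z, style_table[style_id]])

# Stand-ins for learned parameters and an encoder output.
style_table = {"formal": rng.normal(size=8), "casual": rng.normal(size=8)}
z = rng.normal(size=32)  # encoder output for the source sentence

dec_input = condition_on_style(z, "formal", style_table)
print(dec_input.shape)  # -> (40,)
```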
Quality Estimation

BLEU score is a metric for NLG (Natural Language Generation). But it’s just a human-designed score and not a holistic metric for models. A model with a BLEU score under 50 might still be more than adequate for production. Designing a score that captures how well a model performs a domain-specific task is very hard.
Quality Estimation aims to evaluate using diverse, non-standardized factors about sentences. It’s a challenging field because evaluating whether output sentences are good is inherently difficult.
BERTScore

Uses BERT encoding to perform evaluation. Compares the ground truth and the sentence being evaluated via similarity.
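The greedy-matching similarity behind BERTScore can be sketched in plain NumPy; here `ref_emb` and `cand_emb` stand in for the contextual token embeddings that BERT would produce:

```python
import numpy as np

def bertscore_recall(ref_emb, cand_emb):
    """Greedy-matching recall as in BERTScore: for each reference token,
    take its best cosine similarity against candidate tokens, then average.
    ref_emb, cand_emb: (tokens, dim) arrays of contextual embeddings."""
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = ref @ cand.T              # pairwise cosine similarities
    return sim.max(axis=1).mean()   # best match per reference token

rng = np.random.default_rng(0)
ref, cand = rng.normal(size=(5, 16)), rng.normal(size=(7, 16))
print(bertscore_recall(ref, ref))  # identical sentences -> 1.0 (up to float error)
```

BERTScore also computes the symmetric precision term over candidate tokens and combines the two into an F1.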
In-Context Learning
A field that aims to handle all NLP tasks as natural language generation tasks. GPT is the representative example.

This is somewhat different from few-shot learning in CNNs. Since all tasks are treated as language generation, even the task description, examples, and prompt are plain natural language. It’s still few-shot in the sense that only a handful of examples are shown in the prompt and no translation-specific data was used for training.
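Assembling such a prompt is plain string construction; a minimal sketch (the `=>` separator format is just one common convention, not a fixed API):

```python
def build_prompt(task_description, examples, query):
    """Assemble an in-context-learning prompt: the task description,
    the few-shot examples, and the query are all just text."""
    lines = [task_description]
    for src, tgt in examples:
        lines.append(f"{src} => {tgt}")
    lines.append(f"{query} =>")   # the model continues from here
    return "\n".join(lines)

prompt = build_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "hello",
)
print(prompt)
```

The model is never fine-tuned on translation; it only sees these examples at inference time.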
Prompt Tuning
The task of figuring out how to write the task description, examples, and prompt so that the model best performs the desired task.

It optimizes all the text used in the query to get an answer. A separate model can be trained for prompt tuning, while the original model (e.g., GPT) is not fine-tuned at all.
Language Models Trained on Code
Codex: a language model fine-tuned on publicly available Python code from GitHub. Applying In-Context Learning and Prompt Tuning to code also produced good coding results.
Multi-Modal Models
Models that combine multiple types of information.
DALL-E
Generating images from text descriptions. A conditional generator. Images are split into n patches, and each patch is treated as an embedding vector. DALL-E collects these n patches and processes them as a single sequence. It’s transformer-based.
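The patch-to-sequence step can be sketched as follows. This is a simplification: DALL-E itself tokenizes images with a discrete VAE, so the code shows only the idea of turning a 2-D image into one flat sequence of patch “tokens”:

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping flattened patches,
    ordered row by row, so a transformer can treat them as one sequence."""
    h, w, c = image.shape
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

img = np.zeros((32, 32, 3))
seq = patchify(img, 8)
print(seq.shape)  # -> (16, 192): 16 patch "tokens", each of dimension 8*8*3
```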
CLIP

A model based on the logic that if text and image are semantically similar, they should be close in embedding space.
Training data consists of image–caption pairs. An image and its matching caption are pulled together in embedding space, while mismatched image–caption pairs are pushed apart. Also transformer-based.
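This training objective can be sketched as a symmetric contrastive (InfoNCE-style) loss; a minimal NumPy sketch with random stand-in embeddings:

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temp=0.07):
    """Symmetric contrastive loss sketch: matching image/caption pairs
    (the diagonal) get high similarity, mismatched pairs get low."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temp           # (batch, batch) cosine similarities
    labels = np.arange(len(logits))       # pair i matches caption i

    def xent(l):                          # row-wise softmax cross-entropy
        p = np.exp(l - l.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
imgs, txts = rng.normal(size=(4, 32)), rng.normal(size=(4, 32))
print(clip_loss(imgs, txts))
```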
Heavily used as a pre-trained model. It’s highly versatile, usable in both CV and NLP, and can encode text, images, or both.
NeRF-based 3D generation models are said to build on transfer learning from CLIP.
Q&A
- Have traditional RNN-family models (RNN, LSTM, GRU) been entirely replaced by transformers?
  - No. RNNs are still used in many domains for predicting future values.
  - Fields centered on signal patterns are representative RNN users.
  - Transformers don’t have that many parameters per se, but they’re usually stacked with many layers, and the intermediate products (Q, K, V) require a lot of memory: self-attention needs memory proportional to the square of the sequence length, O(n^2).
  - So RNN-family models are used when lightweight models are needed.
- Word2Vec, GloVe, RNN, LSTM seem to be used less and less. Should I focus on recent tech?
  - Recently, pre-trained models work well even with data that hasn’t gone through elaborate embedding. But Word2Vec and GloVe are arguably embedding layers too, so understanding how they work is still helpful.
- Wouldn’t it be enough if the model had all the information?
  - Models aren’t built that way because unexpected inputs also need handling. Also, it’s realistically impossible for one model to know all the information in the world.