Recent trends
- Transformer and self-attention are being used beyond machine translation.
- Experiments have shown that simply stacking more Transformer layers (12, 24, or more, instead of the 6 proposed in the original paper) improves performance without any special architectural changes. These models are pre-trained on large-scale data using self-supervised learning frameworks. e.g., BERT, GPT-3, XLNet, ALBERT, RoBERTa, Reformer, T5, ELECTRA
- After training this way, just applying transfer learning to various domains and tasks outperforms models specifically designed for those areas.
- Applications: recommender systems, drug discovery, computer vision…
- Limitation: decoding is still greedy. Generation proceeds left to right, committing to the single best choice at each step.
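As a toy illustration of this limitation, greedy decoding can be sketched in a few lines; the score table below is a made-up stand-in for a trained model's next-token scores, not any real model.

```python
def greedy_decode(step_scores, sos="[SOS]", eos="[EOS]", max_len=10):
    """Greedy left-to-right decoding: at each step, take the single
    highest-scoring next token and never revisit the choice."""
    tokens = [sos]
    for _ in range(max_len):
        scores = step_scores(tokens)        # token -> score for the next position
        best = max(scores, key=scores.get)  # greedy: argmax only, no search
        tokens.append(best)
        if best == eos:
            break
    return tokens[1:]

# Hypothetical score table standing in for a trained model's logits:
table = {
    ("[SOS]",): {"I": 0.9, "You": 0.1},
    ("[SOS]", "I"): {"go": 0.8, "am": 0.2},
    ("[SOS]", "I", "go"): {"home": 0.7, "[EOS]": 0.3},
    ("[SOS]", "I", "go", "home"): {"[EOS]": 1.0},
}
print(greedy_decode(lambda t: table[tuple(t)]))  # ['I', 'go', 'home', '[EOS]']
```

Because each argmax is final, a locally best token can lock the decoder out of a globally better sequence; beam search mitigates but does not remove this.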
GPT-1
Unified multiple NLP tasks.
 Stacks 12 Transformer layers.
Standard Seq2Seq
The standard sequence training process is the same as the basic transformer covered in previous posts. To output ‘I go home’, the input starts with ‘[SOS]’ to produce ‘I’, then ‘I’ is input to produce ‘go’, and so on.
Classification
 Implemented by adding a start token at the beginning and an extract token at the end of the text while performing Seq2Seq.
- After transformer training, encoding vectors are formed with the same format as input.
- The value at the extract token position in the encoding vector is used for classification. e.g., sentence sentiment (positive/negative)
- Seq2Seq can continue using the remaining encoding vector values excluding the extract token.
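A minimal sketch of extract-token classification, assuming a hypothetical `encode` function that stands in for the trained Transformer (its vectors are deterministic dummies for illustration only):

```python
def encode(tokens):
    """Dummy encoder: one 4-dim vector per input token, same length as input."""
    return [[(sum(map(ord, tok)) + i) % 7 / 7.0 for i in range(4)]
            for tok in tokens]

def classify(text_tokens, weights):
    tokens = ["[Start]"] + text_tokens + ["[Extract]"]
    vectors = encode(tokens)                        # same format as the input
    h = vectors[-1]                                 # vector at the extract-token position
    score = sum(w * x for w, x in zip(weights, h))  # linear output layer -> scalar
    return "positive" if score > 0 else "negative"
```

Only the extract position feeds the classifier; the remaining positions' vectors stay available for the Seq2Seq objective.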
Entailment
 A task that determines whether a premise and hypothesis have a logical entailment or contradiction relationship. In GPT-1, the premise and hypothesis are combined into a single sequence to solve the task. The extract token acts as a query within the transformer, pulling relevant information from other positions in the sequence.
- A delimiter called Delim is placed between the two sentences, and an Extract token is placed at the end.
- The extract token from the encoding vector is passed through the output layer to determine the logical relationship.
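The sequence layout above can be sketched directly (token spellings like `[Start]` are illustrative; the paper's special tokens are learned embeddings, not literal strings):

```python
def build_entailment_input(premise, hypothesis):
    """Pack premise and hypothesis into one sequence, GPT-1 style:
    Start + premise + Delim + hypothesis + Extract."""
    return ["[Start]"] + premise + ["[Delim]"] + hypothesis + ["[Extract]"]

seq = build_entailment_input(["it", "rains"], ["the", "ground", "is", "wet"])
# ['[Start]', 'it', 'rains', '[Delim]', 'the', 'ground', 'is', 'wet', '[Extract]']
```

The encoding vector at the final `[Extract]` position is then fed to the output layer to predict entailment vs. contradiction.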
Transfer learning
A GPT-1 model trained for a specific task can be reused for other tasks. For example, suppose you want to repurpose a sentiment analysis model for topic classification.

The existing output layer is a linear neural network for sentiment classification. So you remove it and attach a new linear neural network for topic classification after the transformer.
This is the same idea as changing just the final classification layer’s number of outputs in a CNN to perform an arbitrary classification task. The pre-trained network is kept intact, and only the output layer is initialized and trained.
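The head-swap idea can be sketched as follows; the `Model`/`LinearHead` names and the stand-in backbone are hypothetical scaffolding, not GPT-1's actual code.

```python
import random

class LinearHead:
    """A fresh linear output layer: hidden vector -> per-class scores."""
    def __init__(self, hidden_dim, num_classes):
        self.w = [[random.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
                  for _ in range(num_classes)]
    def __call__(self, h):
        return [sum(wi * xi for wi, xi in zip(row, h)) for row in self.w]

class Model:
    def __init__(self, backbone, head):
        self.backbone = backbone  # pre-trained Transformer, kept intact
        self.head = head          # task-specific output layer

# Stand-in for a pre-trained encoder that maps tokens to a hidden vector:
backbone = lambda tokens: [0.1, 0.2, 0.3, 0.4]

sentiment_model = Model(backbone, LinearHead(4, 2))    # positive / negative
topic_model = Model(sentiment_model.backbone,          # reuse pre-trained weights
                    LinearHead(4, 10))                 # new head: 10 topics
```

Only the head is re-initialized and trained; the backbone's parameters carry over unchanged.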
Self-supervised learning
GPT pre-trains on unlabeled data with a Seq2Seq task. Since the target is simply the next word, no labeling is needed. This is where self-supervised learning comes in.
But topic classification requires labeled data. Labeled data is usually far smaller than unlabeled data, which is disadvantageous for model training.
Since the model was pre-trained with self-supervised learning on large data, most parameters are meaningfully initialized. So just using a small amount of labeled data for transfer learning can produce a well-performing model.

The table above compares task-specific model+data combinations with GPT. Pre-training on large data followed by transfer learning shows better performance.
BERT

Pre-training of Deep Bidirectional Transformers for Language Understanding. Previous attempts used LSTMs for self-supervised learning on large data, but BERT performed significantly better.
Motivation
RNN-family models acquire information in only one direction. This is a very weak approach for tasks that require understanding the full context. BERT’s Masked Language Model was created to acquire information bidirectionally.
Masked Language Model (MLM)
Words in the input sequence are randomly replaced with masks. The model is trained to infer the masked words.
Hyperparameter : the probability level for masking words.
- too high: too much information is hidden, making masked data hard to infer.
- too low: training takes too long or efficiency drops.
Typically, 15% is used.
Side effect: even if we intend to mask 15% of words, replacing all of that 15% with mask tokens causes issues.
The pre-trained model grows accustomed to seeing mask tokens in 15% of the data, but downstream fine-tuning and test data contain none. This discrepancy significantly hinders transfer learning.
Solution: the 15% of tokens selected for masking is divided as follows:
- 80% is replaced with mask tokens.
- 10% is replaced with random words. This helps the model handle strange input words.
- 10% keeps the original. This helps the model confidently assert that the original is correct.
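The 80/10/10 split above can be sketched as follows; the toy `vocab` list and fixed seed are illustrative assumptions.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """BERT-style masking: select ~15% of tokens as prediction targets,
    then replace 80% of those with [MASK], 10% with a random word,
    and keep 10% unchanged."""
    rng = random.Random(seed)
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue                    # not selected for masking
        targets[i] = tok                # the model must predict the original
        r = rng.random()
        if r < 0.8:
            out[i] = "[MASK]"
        elif r < 0.9:
            out[i] = rng.choice(vocab)  # random replacement
        # else: keep the original token as-is
    return out, targets

corrupted, targets = mask_tokens(["the", "cat", "sat", "on", "the", "mat"],
                                 vocab=["dog", "ran", "hat"])
```

The loss is computed only at the positions stored in `targets`, regardless of how each one was corrupted.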
Next Sentence Prediction
A method proposed by BERT for handling sentence-level tasks, similar to GPT’s approach.

Similar to GPT’s extract and delimiter tokens, BERT uses CLS and SEP tokens.
- SEP: separates sentences.
- CLS: holds classification information. Placed at the beginning.
- MASK: the mask used in the masked language model.
The task shown in the figure determines whether two sentences are adjacent, so the prediction at the CLS token is binary. The whole input is fed to the transformer at once, and the network outputs the prediction result at the CLS token's position.
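The input layout for Next Sentence Prediction can be sketched directly, following the token conventions above:

```python
def build_nsp_input(sentence_a, sentence_b):
    """BERT's NSP input: CLS first, SEP tokens closing each sentence."""
    return ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]

seq = build_nsp_input(["he", "went", "out"], ["she", "stayed", "home"])
# ['[CLS]', 'he', 'went', 'out', '[SEP]', 'she', 'stayed', 'home', '[SEP]']
```

The binary is-next / not-next prediction is read off at the `[CLS]` position, index 0.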
BERT architecture
Model Architecture
- L: Layer
- H: Attention encoding vector dimension
- A: Attention heads per layer
- BERT Base: L=12, A=12, H=768
- BERT Large: L=24, A=16, H=1024
Input Representation
- WordPiece embedding: subword-level embedding rather than word-level (a vocabulary of 30,000 WordPiece tokens)
- Learned positional embedding: the original Transformer used fixed sin/cos functions for positional encoding. BERT instead learns the positional embedding matrix end-to-end, like learning embedding vectors in Word2Vec.
- Segment Embedding
Segment Embedding
 Positional embedding provides ordering, but it doesn’t recognize sentence boundaries.
In Next Sentence Prediction, 'he' after a SEP token should be treated as the first word of a new sentence, but in terms of sequence position it is not first. Segment embedding solves this problem.
The distinction between sentences before and after SEP is computed via segment embedding and simply added.
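The "simply added" step can be made concrete with a toy example; the tiny integer embedding tables below are made up purely for illustration.

```python
def embed(tokens, token_emb, pos_emb, seg_emb):
    """BERT input representation: token, positional, and segment
    embeddings are element-wise summed at every position."""
    vectors, segment = [], 0
    for i, tok in enumerate(tokens):
        e = [t + p + s for t, p, s in
             zip(token_emb[tok], pos_emb[i], seg_emb[segment])]
        vectors.append(e)
        if tok == "[SEP]":
            segment += 1  # tokens after SEP belong to the next sentence
    return vectors

token_emb = {"a": [1, 0], "[SEP]": [0, 1], "b": [2, 2]}
pos_emb = [[1, 1], [2, 2], [3, 3]]      # one row per position
seg_emb = [[0, 0], [100, 100]]          # segment A vs. segment B
vectors = embed(["a", "[SEP]", "b"], token_emb, pos_emb, seg_emb)
```

Here 'b' sits at position 2 of the sequence, but its large segment-B component marks it as the start of a new sentence.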
Bidirectional

GPT uses masked self-attention so that each position cannot look at future information, since the next word must be predicted without seeing it.
BERT, in contrast, attends over the entire sequence, because its prediction targets are already masked out in the input. It needs the full context to fill in the masks, so BERT uses the standard self-attention of the original transformer.
Transfer learning

Given a pre-trained BERT from self-supervised learning, the following tasks are possible. Similar to GPT.
Sentence pair classification

- Two sentences are joined with a SEP token.
- A CLS token is placed at the first index and passed through BERT.
- The CLS token from the encoding vector is fed to the output layer to get a class label.
Single sentence classification
 Same as sentence pair classification, just with one sentence and only a CLS token.
Single Sentence Tagging

Each word has an encoding vector, and each one is passed through the output layer to determine POS, morpheme, and other information.
BERT vs GPT-1
- Training size
- GPT: BookCorpus (800M words)
- BERT: BookCorpus + Wikipedia (2,500M words)
- BERT: has SEP and CLS tokens. Uses segment embedding to distinguish sentences.
- Batch size
- BERT: 128,000 words
- GPT: 32,000 words
- A larger batch size generally makes training more stable and effective: a gradient computed over many examples at once is a less noisy estimate than the gradients from a series of small batches.
- Task-specific fine-tuning
- GPT: used 5e-5 learning rate across tasks.
- BERT: fine-tuned learning rate per task.
MRC (Machine Reading Comprehension), Question Answering
Reading a text and answering questions about it.  As shown above, proper comprehension of subjects and actions in the document is required to answer.
SQuAD 1.1
Stanford Question Answering Dataset for testing QA model performance with MRC. Version 2.0 also exists now. There’s a leaderboard for test set scores.
SQuAD 1.1 solution process
 Typically the answer sits at a specific location in the passage; the goal is to find that location.
- Concatenate the question and the passage using a SEP token.
- Obtain the encoding vector for the concatenated data.
- Add a fully connected layer that reduces each token's encoding vector to a scalar, then apply softmax over all positions to find the start point.
- Add another fully connected layer, with its own softmax over positions, to find the end point.
- Together, the two fully connected layers on the encoding vectors yield the start and end points of the answer span.
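The selection step can be sketched as follows, assuming the two per-position score lists have already been produced by the fully connected layers; the simple "best end at or after the best start" rule is one common decoding choice, not necessarily BERT's exact procedure.

```python
import math

def softmax(xs):
    m = max(xs)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def extract_span(start_scores, end_scores):
    """Softmax each score list over all positions, then take the most
    probable start and the most probable end at or after it."""
    p_start = softmax(start_scores)
    p_end = softmax(end_scores)
    start = max(range(len(p_start)), key=p_start.__getitem__)
    end = max(range(start, len(p_end)), key=lambda i: p_end[i])
    return start, end

print(extract_span([0.1, 2.0, 0.3, 0.2], [0.0, 0.1, 3.0, 0.5]))  # (1, 2)
```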
SQuAD 2.0 solution process
In 1.1 there’s always an answer, but 2.0 includes cases where no answer exists in the passage.
So a task to determine whether an answer exists must come first. If an answer exists, proceed with the SQuAD 1.1 process.
- Concatenate question and passage, add a CLS token.
- Add a fully connected layer for binary classification using the CLS token’s value.
- Classify using cross entropy.
SWAG
A task that picks the most likely next sentence given a preceding sentence. 
- Concatenate the premise with each answer choice separately. e.g., Premise + Choice 1, Premise + Choice 2, …
- Get encoding vectors for each concatenation.
- Pass each encoding vector through the output layer to get a scalar. The same output layer is used for all.
- Apply softmax to the scalar results and pick the choice with the highest probability.
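A sketch of this scoring scheme, with a hypothetical `score_fn` standing in for BERT plus the shared output layer (the `[SEP]` joiner follows BERT's pair format):

```python
import math

def pick_choice(premise, choices, score_fn):
    """Score each Premise + Choice concatenation with the same output
    layer, softmax the scalars, and pick the most probable choice."""
    scores = [score_fn(premise + ["[SEP]"] + c) for c in choices]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return max(range(len(choices)), key=probs.__getitem__)
```

Because the same output layer scores every concatenation, the model learns a single notion of "plausible continuation" rather than one head per choice slot.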
BERT: Ablation study
 BERT’s performance improved with more parameters.