XLNet
Problems with existing models:
- BERT
- Predicts [MASK] tokens independently, so it can’t learn dependencies between the masked tokens
- The 512-token input limit prevents learning relationships between long segments
- GPT
- Only trains in one direction
XLNet emerged to overcome these limitations.
Relative Positional Encoding
Introduced to break free from the 512-token training limit by making the existing absolute positional encoding relative.

- Existing positional encoding uses absolute positions: 0, 1, 2, 3…
- Relative positional encoding uses relative distances between tokens: …, -1, 0, +1, …
-> No more sequence length limitation.
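A toy illustration of why this lifts the limit (this is not XLNet's actual implementation, just the core idea): the pattern of relative offsets is the same at every sequence length, so nothing ties the model to a fixed maximum position index.

```python
# Toy illustration (not XLNet's actual implementation): attention conditioned
# on relative offsets i - j sees the same pattern at any sequence length.
def relative_offsets(seq_len):
    # entry [i][j] is the signed distance from query position i to key position j
    return [[i - j for j in range(seq_len)] for i in range(seq_len)]

# The top-left 3x3 block is identical for a length-3 and a length-8 sequence,
# so the encoding never references an absolute index that could exceed a
# fixed limit like 512.
short, longer = relative_offsets(3), relative_offsets(8)
```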
Permutation Language Modeling
Got rid of [MASK]. Instead, permutes the token prediction order during training (the input sequence itself is not shuffled), encouraging order-agnostic learning.
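A hypothetical sketch of the idea (illustrative only; the paper's actual mechanism uses two-stream attention): sample a factorization order, then predict each position conditioned only on positions earlier in that order.

```python
import random

# Permutation LM sketch: permute the prediction order, not the tokens.
def plm_contexts(tokens, seed=0):
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)  # a random factorization order over positions
    contexts = []
    for step, pos in enumerate(order):
        visible = sorted(order[:step])  # positions already conditioned on
        contexts.append((tokens[pos], [tokens[i] for i in visible]))
    return order, contexts
```

Across many sampled orders, every token is eventually predicted from many different subsets of the other tokens, which is how bidirectional context is learned without [MASK].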
Performance

Outperformed previous models on GLUE.
RoBERTa
Same architecture as BERT, but with changes in the training method.
- Increased model training time + batch size + training data
- Removed next sentence prediction
- NSP provides no benefit to downstream fine-tuning
- The paper argued it’s too easy, actually hurting model performance
- Trained on longer sequences
- Dynamic masking
- Generates a fresh masking pattern each time a sequence is fed to the model, instead of fixing the masks once in preprocessing (BERT’s static masking duplicated each sample with 10 fixed patterns)
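A minimal sketch of the difference, assuming BERT's 15% masking rate: static masking fixes one pattern per example in preprocessing, while dynamic masking samples a new pattern every time the example is seen.

```python
import random

# Dynamic masking sketch (assumed 15% rate): each pass over the data can
# draw a different mask pattern for the same sentence.
def mask_tokens(tokens, rng, rate=0.15):
    return [tok if rng.random() >= rate else "[MASK]" for tok in tokens]

rng = random.Random(0)
sentence = ["the", "cat", "sat", "on", "the", "mat"]
pass_1 = mask_tokens(sentence, rng)  # pattern seen on the first pass
pass_2 = mask_tokens(sentence, rng)  # a (usually different) pattern next pass
```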
BART
Applies both BERT-style (denoising encoder) and GPT-style (autoregressive decoder) training together. Reportedly outperformed BERT and RoBERTa.
T5

A unified Transformer encoder-decoder LM, the best-performing LM at the time. Masks multiple spans and reconstructs them all at once.
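A sketch of that span-masking objective (the sentinel names follow T5's `<extra_id_n>` convention; the example sentence and span indices are illustrative):

```python
# T5-style span corruption sketch: each masked span is replaced by one
# sentinel in the input, and the target reconstructs all spans at once.
def span_corrupt(tokens, spans):
    # spans: non-overlapping, sorted (start, end) index pairs
    src, tgt, sid, pos = [], [], 0, 0
    for start, end in spans:
        src.extend(tokens[pos:start])
        src.append(f"<extra_id_{sid}>")   # one sentinel per masked span
        tgt.append(f"<extra_id_{sid}>")
        tgt.extend(tokens[start:end])     # target carries the removed tokens
        sid, pos = sid + 1, end
    src.extend(tokens[pos:])
    return src, tgt

src, tgt = span_corrupt(["Thank", "you", "for", "inviting", "me"], [(1, 2), (3, 4)])
# src: ['Thank', '<extra_id_0>', 'for', '<extra_id_1>', 'me']
# tgt: ['<extra_id_0>', 'you', '<extra_id_1>', 'inviting']
```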


Best performing model on GLUE.
Meena
An LM designed specifically for conversation. 
Composed of one Transformer encoder block and 13 Transformer decoder blocks (an Evolved Transformer architecture).
- Trained on 341GB of social media data, 2.6 billion parameters.
- Proposed SSA (Sensibleness and Specificity Average) as a chatbot evaluation metric
- Higher SSA for specific and clear responses.
- Designed to close the loophole where vague answers could still produce a functioning chatbot.
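SSA itself is just the mean of the two per-response rates; the scores below are illustrative, not from the paper.

```python
# SSA: average of sensibleness and specificity rates (illustrative values).
def ssa(sensibleness, specificity):
    return (sensibleness + specificity) / 2

# A bot that always answers "I don't know" is sensible but never specific,
# so vagueness drags its SSA down even with perfect sensibleness.
vague_bot = ssa(1.0, 0.0)     # -> 0.5
specific_bot = ssa(0.9, 0.7)  # higher, despite occasional nonsense
```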

Controllable LM
LMs whose outputs can be steered to reflect human value judgments such as ethics.
PPLM (Plug and Play Language Model)

- Standard LM
- Predicts next word based on probability distribution
- PPLM
- Adjusts next-word prediction to match the developer’s intent
- Stores the desired target words in a bag of words
Suppose you want the sentence “The chicken tastes delicious.”, but the model outputs “ok” as the last word. PPLM backpropagates from an attribute model to perturb the hidden vector for “chicken” so that “delicious” comes out as the final word.
Advantage: the LM’s weights are never updated; only intermediate activations are nudged to guide generation toward the desired output.
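A deliberately simplified sketch of the steering idea: boost the probability mass of bag-of-words tokens and renormalize. (Real PPLM instead backpropagates gradients from the attribute model into the LM's hidden activations; in both views the LM weights stay frozen.)

```python
# Simplified bag-of-words steering (not PPLM's actual mechanism): tokens in
# the bag get their probability mass boosted, then the distribution is
# renormalized so the preferred word can overtake the default prediction.
def steer(probs, bag, boost=2.0):
    # probs: dict token -> probability; bag: set of desired topic words
    scores = {t: p * (boost if t in bag else 1.0) for t, p in probs.items()}
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}

probs = {"ok": 0.5, "delicious": 0.3, "bad": 0.2}
steered = steer(probs, {"delicious"})
# "delicious" overtakes "ok" once its mass is boosted and renormalized
```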
Applications
- Storing multiple category words in the bag of words can produce cross-category results.
- e.g., joy + surprise + gaming
- Steering outputs away from sensitive categories
- e.g., replacing religious, political, or racial keywords with neutral words
- Adjusting probability distributions to create gradient anger (an internet meme)
- e.g., gradually increasing the probability of anger-related words
Necessity
An LM trained only on bias-removed data doesn’t necessarily produce bias-free output. Methods like PPLM help overcome current LM limitations.
Multi-modal LM
LXMERT
LXMERT (Learning Cross-Modality Encoder Representations from Transformers): a cross-modal reasoning language model.
Image and language information are embedded separately and both fed to a cross-modality encoder that generates combined image and language information.

Given an image and a natural language question, it produces answers grounded in both the visual and the textual information.
BERT for Vision-and-Language
Same architecture as BERT. The only difference is that image and text inputs are joined with [SEP] during training. As before, a classifier is attached to the [CLS] token to classify the combined image and text information.

DALL-E
The third time encountering this model at Boostcamp. To generate images, image tokens must be learned. Even a (256, 256, 3) image is very large, so VQ-VAE is used to compress it into a small grid of discrete latent codes (dimensionality reduction).
Once images can be converted to discrete latent codes, the rest is the same as GPT. Just as GPT predicts the next token, DALL-E autoregressively predicts the image tokens that follow up to 256 text tokens.
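Under the sizes reported for DALL-E (up to 256 text tokens followed by a 32x32 = 1024 grid of VQ-VAE image codes), the training sequence is just a concatenation; a sketch with placeholder token values:

```python
# Sketch of DALL-E's input layout: text tokens then image tokens, modeled
# as one autoregressive sequence (token values here are placeholders).
def build_sequence(text_tokens, image_tokens, max_text=256, image_len=1024):
    assert len(text_tokens) <= max_text and len(image_tokens) == image_len
    return text_tokens + image_tokens

seq = build_sequence(list(range(8)), [0] * 1024)
# At generation time only the text prefix is given; the 1024 image positions
# are then filled one token at a time, exactly like GPT's next-token
# prediction, and the resulting codes are decoded back into pixels.
```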