
GPT Language Models

October 12, 2021
2 min read
  • BERT: an embedding model
    • Uses Transformer encoder
  • GPT: a generative model
    • Uses Transformer decoder
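The encoder/decoder distinction above boils down to the attention mask: BERT's encoder lets every token attend to every other token, while GPT's decoder masks out future positions. A minimal sketch (the mask shape, not either model's actual implementation):

```python
import numpy as np

def attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """True = this position may be attended to.

    BERT's encoder is bidirectional (all positions visible);
    GPT's decoder is causal (token i sees only tokens <= i).
    """
    if causal:
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))
    return np.ones((seq_len, seq_len), dtype=bool)
```

The causal mask is what makes the decoder usable as a generative language model: predictions for position i cannot peek at the answer.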

GPT Overview

![](/assets/images/GPT 언어 모델/2460ab0e-0bd1-426d-a664-61210c94f9f3-image.png)

The language generation process is the same as what you’d typically learn about language models. It sequentially predicts the most likely next words probabilistically.
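That sequential loop can be sketched as a greedy decoder. Here `next_token_probs` is a stand-in for a trained language model (it returns random probabilities); only the loop structure mirrors how GPT generates text:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<eos>", "the", "cat", "sat", "on", "mat"]

def next_token_probs(context):
    # Stand-in for a trained LM: any function mapping a token
    # sequence to a probability distribution over the vocabulary.
    logits = rng.normal(size=len(vocab))
    return np.exp(logits) / np.exp(logits).sum()

def generate(prompt, max_new_tokens=5):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        tok = vocab[int(np.argmax(probs))]  # greedy: pick the most likely next word
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens
```

Real systems usually sample from the distribution (with temperature or top-k) instead of always taking the argmax, but the predict-append-repeat loop is the same.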

![](/assets/images/GPT 언어 모델/e9d75507-faa0-4245-adaa-10d7239d739f-image.png)

GPT-1 was designed so you could attach a classifier at the end (like BERT) and fine-tune it for specific tasks. Chronologically, GPT-1 came before BERT.

GPT-1:

  • A decoder-based model well suited to natural language sentence classification.
  • Achieved high classification performance even with small amounts of data.
  • Achieved SOTA on various NLP tasks right away.
  • Paved the way for pre-trained language models and laid the groundwork for BERT.
  • Required supervised learning, so lots of labeled data was needed.
  • A model fine-tuned for a specific task couldn’t be used for other tasks.
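The fine-tuning setup described above wires a small classifier head onto the pretrained decoder's last-token representation. A minimal sketch, where `pretrained_decoder` is a hypothetical stand-in (random features here) for GPT-1's Transformer:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_classes = 8, 2

def pretrained_decoder(token_ids):
    # Stand-in for GPT-1's pretrained decoder: returns the final
    # hidden state of the last token (random here, learned in reality).
    return rng.normal(size=d_model)

# Task-specific classification head added for fine-tuning.
W = rng.normal(size=(d_model, n_classes)) * 0.02
b = np.zeros(n_classes)

def classify(token_ids):
    h = pretrained_decoder(token_ids)   # last-token representation
    logits = h @ W + b                  # linear head
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()          # softmax over classes
```

Because the head is task-specific, the fine-tuned weights serve exactly one task, which is the limitation the GPT authors later set out to remove.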

The GPT researchers’ new hypothesis:

Due to the nature of language, the objective function of supervised learning is the same as that of unsupervised learning. In other words, fine-tuning is unnecessary.

This is because the labels in labeled data are also language.

Put differently: a language model trained on a sufficiently large dataset can perform all NLP tasks.
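Because labels are themselves language, any supervised task can be rewritten as a text-completion problem for one next-word predictor. The prompt templates below are illustrative assumptions, not formats from the paper:

```python
def as_text_task(task: str, query: str) -> str:
    """Express a supervised task as plain text, so the 'label' is
    just the next words for the language model to predict."""
    templates = {
        "sentiment": "Review: {q}\nSentiment:",
        "translation": "Translate English to French:\n{q} =>",
        "qa": "Q: {q}\nA:",
    }
    return templates[task].format(q=query)
```

One model, many tasks: only the prompt changes, never the weights.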

Zero-shot, One-shot, Few-shot

![](/assets/images/GPT 언어 모델/d8863c8a-706a-45a6-bc3c-fb552188cb23-image.png)

Fine-tuning a separate model for each task was deemed unnecessary. Just as humans don’t need lots of data to learn a new task, the same approach was applied to language models: inference from zero, one, or a few examples.

That is, performing tasks without any gradient updates. To make this work, a model trained on massive datasets was developed — GPT-2.
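The difference between zero-, one-, and few-shot is just how many worked examples appear in the prompt; no gradient step ever runs. A sketch of such a prompt builder (the translation pairs are illustrative):

```python
def build_prompt(instruction, examples, query):
    """Build a k-shot prompt: k=0 is zero-shot, k=1 one-shot, etc.

    All 'learning' happens in-context at inference time;
    the model's weights are never updated.
    """
    lines = [instruction]
    for x, y in examples:               # k demonstrations
        lines.append(f"{x} => {y}")
    lines.append(f"{query} =>")         # the query the model must complete
    return "\n".join(lines)

few_shot = build_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)
```

Passing an empty `examples` list to the same function yields the zero-shot variant.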

GPT-2

![](/assets/images/GPT 언어 모델/b59b94a8-92a9-412f-bed9-7ac1a0cf90b0-image.png)

Minor decoder architecture changes compared to GPT-1.

Training data grew from 11GB to 40GB.
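One of those minor changes was moving layer normalization to the input of each sub-block (pre-LN), rather than applying it after the residual addition as in GPT-1. A simplified sketch of a pre-LN decoder block, with the attention and MLP sub-layers passed in as stand-in callables:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (scale/shift params omitted).
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def pre_ln_block(x, attn, mlp):
    # GPT-2 style: normalize *before* each sub-layer, then add residual.
    x = x + attn(layer_norm(x))
    x = x + mlp(layer_norm(x))
    return x

x = np.random.default_rng(0).normal(size=(4, 8))
# With zeroed-out sub-layers, the block reduces to the identity,
# which is part of why pre-LN stacks train stably at depth.
y = pre_ln_block(x, attn=lambda h: h * 0.0, mlp=lambda h: h * 0.0)
```

Real blocks use masked self-attention and a learned MLP, plus learned gain/bias in the layer norm; this only shows where the normalization sits.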

![](/assets/images/GPT 언어 모델/9fb5cd35-efa5-4193-8b92-46c112fdae40-image.png)

  • On NLP tasks like MRC, summarization, and translation, performance was on par with typical neural network models.
  • SOTA on next-word prediction.
  • Opened new horizons for zero, one, and few-shot learning.

GPT-3

![](/assets/images/GPT 언어 모델/8fbc30df-8739-4bdd-b7f3-f6b92e8bdcb1-image.png)

  • Training data: 570GB, refined from 45TB of raw text.
  • Parameters increased from 1.5B (GPT-2) to 175B.

![](/assets/images/GPT 언어 모델/a207ac5b-5f92-480d-8a6a-6a6b4c5025cb-image.png)

  • Modified initialization.
  • Used Sparse Transformer.
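Sparse attention replaces the full causal pattern with a cheaper one: each token attends only to a local window plus a strided subset of earlier positions. A simplified variant of such a mask (not the exact patterns from the Sparse Transformer paper):

```python
import numpy as np

def sparse_causal_mask(seq_len, window, stride):
    """Local + strided causal mask: token i attends to the last
    `window` tokens and to every `stride`-th earlier position,
    instead of all i previous positions."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        for j in range(i + 1):          # causal: only j <= i
            if i - j < window or j % stride == 0:
                mask[i, j] = True
    return mask
```

This drops the cost of attention from quadratic toward roughly O(n·√n) per the paper's analysis, which is what made the 175B-parameter context lengths tractable.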

Tasks of GPT-3

  • Writing articles
    • Human evaluators identified GPT-3’s articles as machine-written only about 52% of the time, barely above chance.
  • Arithmetic
    • Addition of 2-3 digit numbers performed almost perfectly.
  • QA performance exceeded some existing models.
  • Data parsing
    • Automatically parsed data from documents into tables.

Restrictions

GPT is, at its core, still a model pre-trained via next-word prediction (autoregressive language modeling).

  • No weight updates.
    • Cannot learn new knowledge.
  • Is just scaling model size the answer?
    • Nobody knows, but probably not.
  • Cannot use multi-modal information.