The AiEdge Newsletter

The ChatGPT Models Family

The language model household that changed the public perception

Damien Benveniste
Feb 1

ChatGPT really changed the public perception of Large Language Models, for better and for worse! Let’s dig deeper into the models that led to ChatGPT and into the alternative models that nobody talks about:

  1. The GPT-3 Family

  2. GPT-1 vs GPT-2 vs GPT-3

  3. ChatGPT’s competitors

[Image: GPT-3 architecture]

The GPT-3 Family

There is more to GPT-3 than just GPT-3! In the OpenAI API [1], GPT-3 refers to a fleet of different models that differ in size, training data, and training strategy. The core GPT-3 model [2] is the source of all those derived models.

[Image: The GPT-3 models family]

When it comes to size, OpenAI offers different models that balance the quality of Natural Language Generation against inference speed. The models are labeled with the names of scientists or inventors:

  • Davinci: 175B parameters

  • Curie: 6.7B parameters

  • Babbage: 1B parameters

  • Cushman: 12B parameters

The list is missing "Ada", which is supposed to be faster (so presumably smaller), but OpenAI does not document its size.

The models can be fine-tuned in a supervised manner with different datasets. The Codex family is specifically designed to generate code and is fine-tuned on data from public GitHub repositories [3]. Most of the text generation models (text-davinci-001, text-davinci-002, text-curie-001, text-babbage-001) are GPT-3 models fine-tuned with human-labeled data as well as with distillations of the best completions from all of their models. OpenAI describes those models as InstructGPT models [4], although the training process differs slightly from the one described in the paper. Text-davinci-002 is specifically described by OpenAI as being fine-tuned with text data starting from the Codex model code-davinci-002, so it presumably performs well on both code and text. Text-davinci-003 is a full InstructGPT model: it is text-davinci-002 further refined with Proximal Policy Optimization (PPO) [5], a Reinforcement Learning algorithm.
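
To make the "human-labeled data" part more concrete, here is a minimal sketch (not OpenAI's code) of the pairwise reward-model loss used in the InstructGPT and summarization papers [4, 7]: a reward model is trained to score the human-preferred completion higher than the rejected one, and the PPO step then fine-tunes the language model to maximize that learned reward. The names and tensors below are purely illustrative.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise preference loss: -log sigmoid(r_chosen - r_rejected),
    # averaged over a batch of human comparisons [4, 7].
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Dummy scalar rewards for a batch of 4 comparison pairs (illustrative only).
r_chosen, r_rejected = torch.randn(4), torch.randn(4)
print(reward_model_loss(r_chosen, r_rejected))
```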

The "GPT-3.5" label refers to models that have been trained on a blend of text and code from before Q4 2021 as opposed to October 2019 for the other models.

OpenAI has been using GPT-3 for many specific applications. For example, they trained text and code similarity models (text-similarity-davinci-001, text-similarity-curie-001) to learn embedding representations of that data [6], in a manner similar to the CLIP model powering DALL-E 2 and Stable Diffusion. They developed a model to summarize text with labeled data in a very similar manner to InstructGPT [7]. They also provide a way to extract the latent representation produced by GPT-like models (text-embedding-ada-002). And we know that ChatGPT is a sibling model to InstructGPT trained from GPT-3.5, so it is probably using text-davinci-003 as a seed.
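
For reference, here is a minimal sketch of how those models can be queried through the OpenAI Python client as it existed at the time of writing (pre-1.0 interface); the API key and prompt are placeholders.

```python
import openai

openai.api_key = "sk-..."  # placeholder key

# Text completion with the InstructGPT-style model discussed above.
completion = openai.Completion.create(
    model="text-davinci-003",
    prompt="Explain the difference between GPT-2 and GPT-3 in one sentence.",
    max_tokens=64,
)
print(completion["choices"][0]["text"])

# Latent representation from the embedding model.
embedding = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="The AiEdge Newsletter",
)
print(len(embedding["data"][0]["embedding"]))  # 1536-dimensional vector
```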

GPT-1 vs GPT-2 vs GPT-3

It is actually trivial to build a GPT-3-like model: ~100 lines of code would do it (see the minimal sketch after the list below). Training that thing is another story though! GPT-1, GPT-2 and GPT-3 are very similar in terms of architecture and differ mostly in the size and nature of the training data, the number of Transformer blocks, and the number of incoming tokens (context size).

[Image: GPT-1 vs GPT-2 vs GPT-3]
  • GPT-1 is mostly a stack of 12 decoder Transformer blocks. The text data is encoded with Byte Pair Encoding [8]. The position embedding is learned instead of the typical static sinusoidal one [9]. The maximum number of consecutive tokens is 512. The top layer is simply a softmax layer adapted to the specific learning task.

    => 117 million parameters [10].

  • GPT-2 has basically the same architecture as GPT-1, but the biggest model contains 48 Transformer blocks. The layer normalization is moved to the input of each sub-block (pre-norm), and an additional normalization layer is added after the final block. The weights are initialized slightly differently and the vocabulary size is increased. The number of consecutive tokens is increased to 1024.

    => 1.5 billion parameters [11].

  • GPT-3 has the same architecture as GPT-2, but the number of blocks is increased to 96 in the biggest model and the context size (number of consecutive tokens) is increased to 2048. The multi-head self-attention layers alternate between the typical dense attention and locally banded sparse attention [12].

    => 175 billion parameters [2].
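
Here is the promised minimal sketch: an untrained GPT-style decoder-only model in PyTorch, following the pre-norm layout described above (learned position embeddings, causal self-attention, final layer norm). The hyperparameter defaults are illustrative, not the ones used by OpenAI, and GPT-3's sparse attention is omitted.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads, max_len):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # project to queries, keys, values
        self.proj = nn.Linear(d_model, d_model)
        # Causal mask so each token only attends to previous tokens.
        mask = torch.tril(torch.ones(max_len, max_len)).view(1, 1, max_len, max_len)
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape into (batch, heads, tokens, head_dim).
        q = q.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        k = k.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        v = v.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)

class Block(nn.Module):
    """GPT-2 style block: layer norms at the input of each sub-block (pre-norm)."""
    def __init__(self, d_model, n_heads, max_len):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads, max_len)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, d_model=768, n_heads=12, n_layers=12, max_len=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)        # token embeddings
        self.pos_emb = nn.Embedding(max_len, d_model)           # learned positions, as in GPT
        self.blocks = nn.Sequential(*[Block(d_model, n_heads, max_len) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)                       # GPT-2's additional final layer norm
        self.head = nn.Linear(d_model, vocab_size, bias=False)  # next-token logits

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        return self.head(self.ln_f(self.blocks(x)))
```

Stacking 12 such blocks with the right sizes gives roughly GPT-1, 48 blocks gives GPT-2, and 96 blocks with alternating dense and sparse attention gives GPT-3.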

GPT-1 is trained in a self-supervised manner (learning to predict the next token in text data) and fine-tuned in a supervised manner on downstream tasks. GPT-2 is trained in a fully self-supervised way, focusing on zero-shot transfer, and GPT-3 is pre-trained in a self-supervised manner and explores few-shot (in-context) learning rather than fine-tuning. A minimal sketch of this next-token objective follows the dataset list below.

  • GPT-1 is pre-trained on the BooksCorpus dataset, which contains ~7,000 books amounting to ~5 GB of data: https://huggingface.co/datasets/bookcorpus.

  • GPT-2 is pre-trained on the WebText dataset, a more diverse set of internet data containing ~8M documents for about ~40 GB of data: https://huggingface.co/datasets/openwebtext.

  • GPT-3 is pre-trained on a filtered version of Common Crawl, an expanded version of the WebText dataset, two undisclosed internet-based books corpora, and the English-language Wikipedia, for a total of ~600 GB of data.
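
And here is the promised sketch of the self-supervised objective itself: shift the tokens by one position and minimize the cross-entropy of predicting each next token. `MiniGPT` is the illustrative model sketched above; the batch below is random token ids, not a real corpus, and the sizes are toy values.

```python
import torch
import torch.nn.functional as F

vocab_size, context = 50257, 128                          # GPT-2's BPE vocabulary, toy context
model = MiniGPT(vocab_size, d_model=128, n_heads=4, n_layers=2, max_len=context)
tokens = torch.randint(0, vocab_size, (8, context + 1))   # fake batch of token ids

inputs, targets = tokens[:, :-1], tokens[:, 1:]           # predict token t+1 from tokens <= t
logits = model(inputs)                                    # (batch, context, vocab)
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.reshape(-1))
loss.backward()                                           # one self-supervised training step
print(loss.item())
```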

You can find implementations of GPT-2:

  • in TensorFlow by OpenAI: https://github.com/openai/gpt-2/blob/master/src/model.py

  • and in PyTorch by Andrej Karpathy: https://github.com/karpathy/minGPT/blob/master/mingpt/model.py

ChatGPT’s competitors

I think Microsoft partnering with OpenAI might have been one of the most successful publicity stunts ever! The current public perception is that nothing can compete with ChatGPT in terms of generative AI for text. But do you remember when Blake Lemoine was fired by Google back in 2022 after he leaked information about the LaMDA model because he thought it was sentient? Google has nothing to fear when it comes to relevance in text generation research, but Bing may now take a larger share of the search engine market thanks to OpenAI's clever marketing and the way Microsoft capitalized on that perception.

Here are a few direct competitors of ChatGPT, and today is Wednesday, so there will be a couple more by the end of the weekend:

  • PEER by Meta AI - a language model trained to imitate the writing process. It is trained on Wikipedia's edit history data [13]. It specializes in predicting edits and explaining the reasons for those edits, and it is capable of citing and quoting reference documents to back up the claims it generates. It is an 11B-parameter Transformer with the typical encoder-decoder architecture, and it outperforms GPT-3 on the tasks it specializes in [14].

  • LaMDA by Google AI - a language model trained for dialog applications. It is pre-trained on ~3B documents and ~1B dialogs and fine-tuned on human-generated data to improve the quality, safety, and truthfulness of the generated text. It is also fine-tuned to learn to call external tools such as an information retrieval system (e.g. Google Search), a calculator, and a translator, making it potentially a much stronger candidate than ChatGPT to replace Google Search. It is a 137B-parameter decoder-only Transformer [15].

  • PaLM by Google AI - the biggest of them all: 540B parameters! It shows breakthrough capabilities in arithmetic and common-sense reasoning. It is trained on 780 billion tokens coming from multilingual social media conversations, filtered multilingual webpages, books, GitHub repositories, multilingual Wikipedia, and news [16].

[Image: ChatGPT’s competition]

The more I read about those Large Language Models, the more I feel that very little has changed since 2017's "Attention Is All You Need" [9]! All those models follow essentially the same architecture with a couple of changes here and there. The advances are mostly in the scale of the models and the data, and in the domain specificity of the data. At those scales, much of the fun is in minimizing training costs. I wonder: if I were to train a HUGE XGBoost model on my own HUGE dataset, would I be able to name that model DamienBoost and publish a paper about it?

References


  1. OpenAI API: https://beta.openai.com/docs/introduction

  2. Language Models are Few-Shot Learners by Tom B. Brown et al: https://arxiv.org/abs/2005.14165

  3. Evaluating Large Language Models Trained on Code by Mark Chen et al: https://arxiv.org/pdf/2107.03374.pdf

  4. Training language models to follow instructions with human feedback by Long Ouyang et al: https://arxiv.org/pdf/2203.02155.pdf

  5. Proximal Policy Optimization Algorithms by John Schulman et al: https://arxiv.org/pdf/1707.06347.pdf

  6. Text and Code Embeddings by Contrastive Pre-Training by Arvind Neelakantan et al: https://arxiv.org/pdf/2201.10005.pdf

  7. Learning to summarize from human feedback by Nisan Stiennon et al: https://arxiv.org/pdf/2009.01325.pdf

  8. Neural Machine Translation of Rare Words with Subword Units by Rico Sennrich et al: https://arxiv.org/pdf/1508.07909.pdf

  9. Attention Is All You Need by Ashish Vaswani et al: https://arxiv.org/pdf/1706.03762.pdf

  10. Improving Language Understanding by Generative Pre-Training by Alec Radford et al: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

  11. Language Models are Unsupervised Multitask Learners by Alec Radford et al: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

  12. Generating Long Sequences with Sparse Transformers by Rewon Child et al: https://arxiv.org/pdf/1904.10509.pdf

  13. Wikipedia's edit history data: https://dumps.wikimedia.org/enwiki/

  14. PEER: A Collaborative Language Model by Timo Schick et al: https://arxiv.org/pdf/2208.11663.pdf

  15. LaMDA: Language Models for Dialog Applications by Romal Thoppilan et al: https://arxiv.org/pdf/2201.08239.pdf

  16. PaLM: Scaling Language Modeling with Pathways by Aakanksha Chowdhery et al: https://arxiv.org/pdf/2204.02311.pdf
