In 2022, the release of ChatGPT took the world by storm and marked the arrival of mainstream AI applications. Almost overnight, "LLM" became a common term across organizations worldwide. But was 2022 really the first time an LLM was released? No — the first LLM is generally considered to have been released by OpenAI in 2018. In this article we will look at the timeline of LLMs and how Transformers became LLMs.

Before we dive deep into Transformers and LLMs, let’s take a look at some terms and their meaning which we are going to use throughout the article.
Key terminology in AI:
- Neural Network: A computational model inspired by the human brain, consisting of layers of interconnected neurons that process data and learn patterns.
- Embedding: A numerical representation of words or data in a lower-dimensional space that captures semantic meaning and relationships.
- Tokens: Small units of text (words, subwords, or characters) that LLMs process individually to generate or understand language.
- Fine-Tuning: The process of further training a pre-trained model on a specific dataset to improve its performance on specialized tasks.
- Attention Mechanism: A technique that allows models to focus on the most relevant parts of input data when making predictions, improving contextual understanding.
- Encoders & Decoders: Components of Transformer models where encoders process and understand input, while decoders generate output based on encoded information.
- RNN (Recurrent Neural Network): A type of neural network designed for sequential data processing, where previous outputs influence current predictions, though it struggles with long-term dependencies.
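To make "tokens" and "embeddings" concrete, here is a toy sketch in NumPy. The word-level tokenizer and random embedding table are hypothetical stand-ins: real LLMs use learned subword vocabularies and learned embedding matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "tokenizer": split on whitespace and assign each word an integer ID.
sentence = "transformers process tokens in parallel"
vocab = {word: i for i, word in enumerate(sorted(set(sentence.split())))}
token_ids = [vocab[w] for w in sentence.split()]

# Toy embedding table: one fixed-size vector per vocabulary entry.
embed_dim = 4  # real models use hundreds or thousands of dimensions
embedding_table = rng.normal(size=(len(vocab), embed_dim))
embeddings = embedding_table[token_ids]  # shape: (num_tokens, embed_dim)

print(token_ids)         # one integer ID per token
print(embeddings.shape)  # (5, 4)
```

The key idea is only the lookup pattern: text becomes IDs, IDs become vectors, and everything downstream operates on those vectors.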
What are Transformers?
In 2017, a team at Google Research released a deep learning architecture in the paper "Attention Is All You Need", paving the way for major advances in deep neural networks and AI. The Transformer architecture is built around a self-attention mechanism, which processes all parts of the input simultaneously, in contrast to an RNN's sequential processing. Until the Transformer came along, Recurrent Neural Networks (RNNs) were the state-of-the-art architecture for sequence modeling. But with longer sequences, RNNs ran into memory constraints because they must carry forward information about all previously processed steps.
Let’s talk about the architecture of Transformers in detail now.
Transformers follow the familiar encoder-decoder neural network design, with the attention architecture processing data in parallel. In simple terms, for language processing, the encoder reads the input and converts it into a mathematical representation from which the decoder produces the output sequence.

Figure 1: From ‘Attention Is All You Need’ by Vaswani et al.
The encoder is the input component of the Transformer. It consists of 6 identical layers that transform words into embedding tokens and process the input text. The tokens are fixed-size vectors (lists of numbers) that are given an order through positional encoding. Positional encoding exists in the Transformer to convey the order of words; it is absent in conventional RNNs because their sequential processing implicitly preserves token order. Since Transformers process words in parallel, they depend on the positional encoding computation to know each token's position in the sequence. This is crucial for language applications such as translation and generating responses to text queries.
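The sinusoidal positional encoding from "Attention Is All You Need" can be sketched in a few lines of NumPy; the sequence length and model dimension below are purely illustrative.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(seq_len)[:, None]       # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]    # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angle)  # odd dimensions get cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16)
```

Each position gets a unique pattern of sines and cosines, and this matrix is simply added to the token embeddings, which is how a parallel model gets to know word order.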
Next in line is the most important component of the Transformer: the Multi-Head Self-Attention layer. The self-attention mechanism lets the encoder relate tokens across the sequence by analyzing how each word in a sentence relates to every other word, irrespective of the distance between them. For an input sequence Y, self-attention computes three matrices:
Q (queries), K (keys), and V (values).
These matrices are obtained by multiplying the sequence Y with learned weight matrices WQ, WK, and WV. A softmax over the scaled product of Q and K then weights the values V to produce the attention output:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
Attention is performed multiple times in parallel (the "heads") to learn different types of relationships. The resulting score for each word determines how much focus the model gives it while decoding.
The output of the attention layer is passed to a feed-forward network that refines the learned representation. Both the multi-head self-attention layer and the feed-forward network have layer normalization applied to them to stabilize training.
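The scaled dot-product attention formula above can be sketched directly in NumPy. This is a minimal single-head version: the input Y and the projection matrices WQ, WK, WV are random stand-ins for learned weights, used only to show the shapes and the flow of the computation.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1) # each row is a probability distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Y = rng.normal(size=(seq_len, d_k))
# Random placeholders for the learned projections WQ, WK, WV
W_Q, W_K, W_V = (rng.normal(size=(d_k, d_k)) for _ in range(3))

out, weights = attention(Y @ W_Q, Y @ W_K, Y @ W_V)
print(out.shape)             # (4, 8)
print(weights.sum(axis=-1))  # each row sums to ~1.0
```

A multi-head layer simply runs several such attentions with different projection matrices and concatenates the results.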
Decoder:
To generate output from the encoder's representation, the decoder uses similar layers with a couple of differences. The decoder's multi-head self-attention is masked to prevent the model from attending to future words when generating text. A second multi-head attention layer lets the decoder attend to the encoder's output. Finally, a linear layer followed by a softmax predicts the output one word at a time.
The Transformer's self-attention mechanism paved the way for many popular NLP models such as BERT (and its variants RoBERTa, ALBERT, DeBERTa), T5, BART, and more recently GPT.
For the complete mathematical treatment, I recommend reading "Attention Is All You Need" by Vaswani et al. in full.
LLMs: Transformation over Transformers
LLMs are sophisticated Transformer models trained on large datasets for broad application use, beyond specific tasks like text summarization or classification.
Transformers, although revolutionary, had limitations that held back their full potential. The main drawback was a lack of generalization: each task typically required its own model. The table below summarizes major improvements LLMs made over the original Transformer.
| Transformer limitation | LLM advancement |
|---|---|
| Restricted ability to process long texts | More efficient attention mechanisms extend context length |
| Requires a different model for each NLP task | Generalized tasks can be handled by the same LLM |
| High computation cost due to quadratic scaling of attention | Sparse and Flash attention reduce computational complexity, enabling training on very large numbers of tokens |
| Transformer models focused only on text | Multimodal LLMs support text, audio, and images |
| Lack of a feedback mechanism | Reinforcement learning improves response generation |
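To see why quadratic scaling bites, here is a back-of-envelope calculation: full attention materializes a seq_len × seq_len score matrix for every head in every layer, so doubling the context length quadruples the score memory. The float32 assumption and token counts below are illustrative.

```python
# Memory for one head's attention score matrix at various context lengths,
# assuming 4-byte float32 entries (a simplification; real systems vary).
for seq_len in (1_024, 4_096, 32_768):
    scores = seq_len * seq_len    # entries in the full attention matrix
    mib = scores * 4 / 2**20      # bytes -> MiB
    print(f"{seq_len:>6} tokens -> {scores:>13,} scores ({mib:,.0f} MiB)")
```

At 32K tokens a single head's score matrix already costs gigabytes, which is the motivation for sparse attention (attend to a subset of positions) and Flash attention (compute scores in tiles without materializing the full matrix).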
LLMs use the Transformer architecture with a major twist. The recently released GPT (3 and 4) and LLaMA models drop the encoder component entirely. The input is fed directly into the decoder, where a masked attention mechanism hides the future words of the input, sharpening the model's ability to predict the next word. Some application-specific models such as T5, BART, and Gemini, however, may still use an encoder, for example to process non-textual data such as audio and images. Efficiency in response generation was also improved through sparse attention, Mixture of Experts (MoE), and Flash Attention.
The other major change introduced in LLMs was scaling up model parameters, which enabled better understanding and reasoning across diverse contexts.
- GPT-1: 117M
- GPT-2: 1.5B
- GPT-3: 175B
- GPT-4: estimated over 1T parameters
Summary:
While Transformers brought major advancements to deep learning and specifically NLP (Natural Language Processing), LLMs were the need of the hour for AI.
They enhanced the capabilities of Transformers and enabled general-purpose models, scaling to hundreds of billions (and, by some estimates, trillions) of parameters. What looked like a distant possibility at the end of 2019 became reality in just 3-4 years: organizations now build large-scale, optimized AI applications that serve various automation tasks and bring efficiency to their operations. And this is just the start; by the end of this decade, AI will likely reach mass adoption and serve industries in many more ways.
References:
- Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017): https://arxiv.org/pdf/1706.03762
- Brown, Tom B., et al. "Language Models are Few-Shot Learners." Advances in Neural Information Processing Systems 33 (2020): https://arxiv.org/abs/2005.14165v4
- Wolfe, Cameron R. "Decoder-Only Transformers": https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse
- Radford, Alec, et al. "Improving Language Understanding by Generative Pre-Training." (2018).
- "Improving language understanding with unsupervised learning." openai.com, June 11, 2018.





