In 2022, the release of ChatGPT took the world by storm and marked the arrival of mainstream AI applications. Almost overnight, "LLM" became a common term across organizations worldwide. But was 2022 really the first time an LLM was released? No — the first LLM is generally considered to have been released by OpenAI in 2018. In this article we will look at the timeline of LLMs and how Transformers became LLMs.

Before we dive deep into Transformers and LLMs, let’s take a look at some terms and their meaning which we are going to use throughout the article.
Key terminology in AI:
- Neural Network: A computational model inspired by the human brain, consisting of layers of interconnected neurons that process data and learn patterns.
- Embedding: A numerical representation of words or data in a lower-dimensional space that captures semantic meaning and relationships.
- Tokens: Small units of text (words, subwords, or characters) that LLMs process individually to generate or understand language.
- Fine-Tuning: The process of further training a pre-trained model on a specific dataset to improve its performance on specialized tasks.
- Attention Mechanism: A technique that allows models to focus on the most relevant parts of input data when making predictions, improving contextual understanding.
- Encoders & Decoders: Components of Transformer models where encoders process and understand input, while decoders generate output based on encoded information.
- RNN (Recurrent Neural Network): A type of neural network designed for sequential data processing, where previous outputs influence current predictions, though it struggles with long-term dependencies.
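To make "tokens" and "embeddings" concrete, here is a toy sketch in NumPy. The word-level tokenizer and random embedding table are hypothetical stand-ins: real LLMs use learned subword vocabularies and learned embedding matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "tokenizer": split on whitespace and assign each word an integer ID.
sentence = "transformers process tokens in parallel"
vocab = {word: i for i, word in enumerate(sorted(set(sentence.split())))}
token_ids = [vocab[w] for w in sentence.split()]

# Toy embedding table: one fixed-size vector per vocabulary entry.
embed_dim = 4  # real models use hundreds or thousands of dimensions
embedding_table = rng.normal(size=(len(vocab), embed_dim))
embeddings = embedding_table[token_ids]  # shape: (num_tokens, embed_dim)

print(token_ids)         # one integer ID per token
print(embeddings.shape)  # (5, 4)
```

The key idea is only the lookup pattern: text becomes IDs, IDs become vectors, and everything downstream operates on those vectors.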
What are Transformers?
In 2017, a team at Google Research released a deep learning architecture in the paper "Attention Is All You Need", paving the way for major advances in deep neural networks and AI. The Transformer architecture is built around a self-attention mechanism, which processes all parts of the input simultaneously, in contrast to an RNN's sequential processing. Until the Transformer came along, Recurrent Neural Networks (RNNs) were the state-of-the-art architecture for sequence modeling. But with longer sequences, RNNs ran into memory constraints because they must carry forward information about all previously processed steps.
Let’s talk about the architecture of Transformers in detail now.
Transformers follow the familiar encoder-decoder neural network design, with the attention architecture processing data in parallel. In simple terms, for language processing, the encoder reads the input and converts it into a mathematical representation from which the decoder produces the output sequence.

Figure 1: From ‘Attention Is All You Need’ by Vaswani et al.
The encoder is the input component of the Transformer. It consists of 6 identical layers that transform words into embedding tokens and process the input text. The tokens are fixed-size vectors (lists of numbers) that are given an order through positional encoding. Positional encoding exists in the Transformer to convey the order of words; it is absent in conventional RNNs because their sequential processing implicitly preserves token order. Since Transformers process words in parallel, they depend on the positional encoding computation to know each token's position in the sequence. This is crucial for language applications such as translation and generating responses to text queries.
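The sinusoidal positional encoding from "Attention Is All You Need" can be sketched in a few lines of NumPy; the sequence length and model dimension below are purely illustrative.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(seq_len)[:, None]       # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]    # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angle)  # odd dimensions get cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16)
```

Each position gets a unique pattern of sines and cosines, and this matrix is simply added to the token embeddings, which is how a parallel model gets to know word order.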
Next in line is the most important component of the Transformer: the Multi-Head Self-Attention layer. The self-attention mechanism lets the encoder relate tokens across the sequence by analyzing how each word in a sentence relates to every other word, irrespective of the distance between them. For an input sequence Y, self-attention computes three matrices:
Q (queries), K (keys), and V (values).
These matrices are obtained by multiplying the sequence Y with learned weight matrices WQ, WK, and WV. A softmax over the scaled product of Q and K then weights the values V to produce the attention output:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
Attention is performed multiple times in parallel (the "heads") to learn different types of relationships. The resulting score for each word determines how much focus the model gives it while decoding.
The output of the attention layer is passed to a feed-forward network that refines the learned representation. Both the multi-head self-attention layer and the feed-forward network have layer normalization applied to them to stabilize training.
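The scaled dot-product attention formula above can be sketched directly in NumPy. This is a minimal single-head version: the input Y and the projection matrices WQ, WK, WV are random stand-ins for learned weights, used only to show the shapes and the flow of the computation.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1) # each row is a probability distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Y = rng.normal(size=(seq_len, d_k))
# Random placeholders for the learned projections WQ, WK, WV
W_Q, W_K, W_V = (rng.normal(size=(d_k, d_k)) for _ in range(3))

out, weights = attention(Y @ W_Q, Y @ W_K, Y @ W_V)
print(out.shape)             # (4, 8)
print(weights.sum(axis=-1))  # each row sums to ~1.0
```

A multi-head layer simply runs several such attentions with different projection matrices and concatenates the results.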
Decoder:
To generate output from the encoder's representation, the decoder uses similar layers with a couple of differences. The decoder's multi-head self-attention is masked to prevent the model from attending to future words when generating text. A second multi-head attention layer lets the decoder attend to the encoder's output. Finally, a linear layer followed by a softmax predicts the output one word at a time.
The Transformer's self-attention mechanism paved the way for many popular NLP models such as BERT (and its variants RoBERTa, ALBERT, DeBERTa), T5, BART, and more recently GPT.
For the complete mathematical treatment, I recommend reading "Attention Is All You Need" by Vaswani et al. in full.
LLMs: Transformation over Transformers
LLMs are sophisticated Transformer models trained on large datasets for broad application use, beyond specific tasks like text summarization or classification.
Transformers, although revolutionary, had limitations that held back their full potential. The main drawback was a lack of generalization: each task typically required its own model. The table below summarizes major improvements LLMs made over the original Transformer.
| Transformer limitation | LLM advancement |
|---|---|
| Restricted ability to process long texts | More efficient attention mechanisms extend context length |
| Requires a different model for each NLP task | Generalized tasks can be handled by the same LLM |
| High computation cost due to quadratic scaling of attention | Sparse and Flash attention reduce computational complexity, enabling training on very large numbers of tokens |
| Transformer models focused only on text | Multimodal LLMs support text, audio, and images |
| Lack of a feedback mechanism | Reinforcement learning improves response generation |
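To see why quadratic scaling bites, here is a back-of-envelope calculation: full attention materializes a seq_len × seq_len score matrix for every head in every layer, so doubling the context length quadruples the score memory. The float32 assumption and token counts below are illustrative.

```python
# Memory for one head's attention score matrix at various context lengths,
# assuming 4-byte float32 entries (a simplification; real systems vary).
for seq_len in (1_024, 4_096, 32_768):
    scores = seq_len * seq_len    # entries in the full attention matrix
    mib = scores * 4 / 2**20      # bytes -> MiB
    print(f"{seq_len:>6} tokens -> {scores:>13,} scores ({mib:,.0f} MiB)")
```

At 32K tokens a single head's score matrix already costs gigabytes, which is the motivation for sparse attention (attend to a subset of positions) and Flash attention (compute scores in tiles without materializing the full matrix).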
LLMs use the Transformer architecture with a major twist. The recently released GPT (3 and 4) and LLaMA models drop the encoder component entirely. The input is fed directly into the decoder, where a masked attention mechanism hides the future words of the input, sharpening the model's ability to predict the next word. Some application-specific models such as T5, BART, and Gemini, however, may still use an encoder, for example to process non-textual data such as audio and images. Efficiency in response generation was also improved through sparse attention, Mixture of Experts (MoE), and Flash Attention.
The other major change introduced in LLMs was scaling up model parameters, which enabled better understanding and reasoning across diverse contexts.
- GPT-1: 117M
- GPT-2: 1.5B
- GPT-3: 175B
- GPT-4: estimated over 1T parameters
Summary:
While Transformers brought major advancements to deep learning and specifically NLP (Natural Language Processing), LLMs were the need of the hour for AI.
They enhanced the capabilities of Transformers and enabled general-purpose models, scaling to hundreds of billions (and, by some estimates, trillions) of parameters. What looked like a distant possibility at the end of 2019 became reality in just 3-4 years: organizations now build large-scale, optimized AI applications that serve various automation tasks and bring efficiency to their operations. And this is just the start; by the end of this decade, AI will likely reach mass adoption and serve industries in many more ways.
References:
- Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017): https://arxiv.org/pdf/1706.03762
- Brown, Tom B., et al. "Language Models are Few-Shot Learners." Advances in Neural Information Processing Systems 33 (2020): https://arxiv.org/abs/2005.14165v4
- Wolfe, Cameron R. "Decoder-Only Transformers": https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse
- Radford, Alec, et al. "Improving Language Understanding by Generative Pre-Training." (2018).
- "Improving language understanding with unsupervised learning." openai.com, June 11, 2018.





