Large Language Models: The successor to Transformers

In 2022, the release of ChatGPT took the world by storm and marked the arrival of mainstream AI applications. Almost overnight, "LLM" became a common term across organizations worldwide. But was 2022 really the first time an LLM was released? No: the first LLM is generally considered to be GPT-1, released by OpenAI in 2018. In this article we are going to look at the timeline of LLMs and how Transformers became LLMs.

LLM timelines

Before we dive deep into Transformers and LLMs, let's look at some terms and their meanings that we will use throughout the article.

Key terminology in AI:

  • Neural Network: A computational model inspired by the human brain, consisting of layers of interconnected neurons that process data and learn patterns.
  • Embedding: A numerical representation of words or data in a lower-dimensional space that captures semantic meaning and relationships.
  • Tokens: Small units of text (words, subwords, or characters) that LLMs process individually to generate or understand language.
  • Fine-Tuning: The process of further training a pre-trained model on a specific dataset to improve its performance on specialized tasks.
  • Attention Mechanism: A technique that allows models to focus on the most relevant parts of input data when making predictions, improving contextual understanding.
  • Encoders & Decoders: Components of Transformer models where encoders process and understand input, while decoders generate output based on encoded information.
  • RNN (Recurrent Neural Network): A type of neural network designed for sequential data processing, where previous outputs influence current predictions, though it struggles with long-term dependencies.
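To make "tokens" and "embeddings" concrete, here is a toy sketch. The three-word vocabulary and the 4-dimensional random embedding table are made up for illustration; real models learn a large vocabulary and embedding table during training:

```python
import numpy as np

# Hypothetical tiny vocabulary: each word maps to a token ID
vocab = {"the": 0, "cat": 1, "sat": 2}
token_ids = [vocab[w] for w in "the cat sat".split()]

# Embedding table: one fixed-size vector per token ID
# (random here; a real model learns these values)
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))

# Embedding lookup: turn token IDs into vectors
embeddings = embedding_table[token_ids]
print(token_ids)         # [0, 1, 2]
print(embeddings.shape)  # (3, 4) — one 4-dim vector per token
```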

What are Transformers?

In 2017, a team at Google Research released a deep learning architecture in the paper "Attention Is All You Need". It paved the way for major advancements in deep neural networks and AI. The Transformer architecture is built around a self-attention mechanism, which processes all parts of the input simultaneously, in contrast to an RNN's sequential processing. Until the Transformer came along, Recurrent Neural Networks (RNNs) were the established state-of-the-art architecture for sequence data. But with longer sequences, RNNs ran into memory constraints because they must carry forward everything they have processed so far.

Let’s talk about the architecture of Transformers in detail now.

Transformers follow the familiar encoder-decoder neural network design, with the attention mechanism processing data in parallel. In simple terms, for language processing the encoder reads the input and converts it into a mathematical representation from which the decoder produces the output sequence.

Figure 1: From ‘Attention Is All You Need’ by Vaswani et al.

The encoder is the input component of the Transformer. It consists of 6 identical layers that transform words into embedding tokens and process the input text. The tokens are fixed-size vectors (lists of numbers) whose order is conveyed through positional encoding. Positional encoding exists in the Transformer to capture the order of words; it is absent in conventional RNNs because their sequential processing implicitly establishes token order. Since Transformers process words in parallel, they depend on a positional encoding computation to know where each token sits in the sequence. This is crucial for language applications such as translation and generating responses to text queries.
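The original paper uses fixed sinusoidal positional encodings: even dimensions get sin(pos/10000^(2i/d)), odd dimensions get the matching cosine. A minimal NumPy sketch of that formula:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16) — added to the token embeddings element-wise
```

Because each position gets a unique pattern of waves at different frequencies, the model can recover both absolute and relative positions from the sum of embedding and encoding.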

Next in line is the most important component of the Transformer: the multi-head self-attention layer. Self-attention lets the encoder relate tokens in the sequence by analyzing how each word in a sentence relates to every other word, regardless of the distance between them. For an input sequence Y, self-attention computes three vector matrices by multiplying Y with learned weight matrices: the queries Q = Y·W_Q, the keys K = Y·W_K, and the values V = Y·W_V. The output is then softmax(Q·Kᵀ/√d_k)·V, a weighted mix of the value vectors in which the weights measure how relevant each token is to each other token.
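A minimal NumPy sketch of single-head scaled dot-product attention; the weight matrices here are random stand-ins for parameters a real model would learn, and a real multi-head layer runs several of these in parallel:

```python
import numpy as np

def scaled_dot_product_attention(Y, Wq, Wk, Wv):
    """Single-head self-attention over input embeddings Y."""
    Q, K, V = Y @ Wq, Y @ Wk, Y @ Wv   # query, key, value projections
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # similarity of every token pair
    # Row-wise softmax turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                 # weighted mix of value vectors

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 8))            # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(Y, Wq, Wk, Wv)
print(out.shape)  # (5, 8) — one contextualized vector per token
```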

The output of the attention layer is passed to a feed-forward network that further refines the learned representation. Both the multi-head self-attention layer and the feed-forward network are wrapped with residual connections and a normalization layer to stabilize training.
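A rough sketch of the feed-forward sublayer with the residual connection and layer normalization just described; the dimensions and random weights are illustrative only:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: two linear layers with a ReLU in between."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))                      # 5 tokens, 8-dim
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)  # expand to 32 dims
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)   # project back to 8

# Residual connection + normalization: LayerNorm(x + FFN(x))
out = layer_norm(x + feed_forward(x, W1, b1, W2, b2))
print(out.shape)  # (5, 8)
```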

The Transformer's self-attention mechanism paved the way for many popular NLP models such as BERT (and its variants RoBERTa, ALBERT, DeBERTa), T5, BART, and more recently the GPT family.

For the complete mathematical treatment, I recommend reading "Attention Is All You Need" by Vaswani et al.

LLMs: Transformation over Transformers

LLMs are sophisticated Transformer models trained on very large datasets, aimed at broad application use rather than a single task such as text summarization or classification.

Transformers, although revolutionary, had limitations that held back their full potential. The main drawback was a lack of generalization: each task typically required its own trained model. The table below lists major improvements that LLMs brought over Transformers.

| Transformer limitation | LLM advancement |
| --- | --- |
| Restricted ability to process long texts | Improved efficiency in the attention mechanism extends context length |
| Requires a different model for each NLP task | Generalized tasks can be handled by the same LLM |
| High computation costs due to quadratic scaling | Sparse and Flash attention reduce computation complexity and enable training on large numbers of tokens |
| Focused only on text | Supports text, audio, and images |
| Lack of a feedback mechanism | Reinforcement learning improves response generation |

LLMs use the Transformer architecture with a major twist. The more recent GPT models (GPT-3 and GPT-4) and LLaMA skip the encoder component entirely. The input is fed directly into the decoder, where a masked attention mechanism hides the future words of the input, training the model to predict the next word. Some LLMs, such as T5 and BART, retain the encoder-decoder design, and multimodal models such as Gemini use encoders to process non-textual data such as audio and images. Response-generation efficiency was also improved through sparse attention, Mixture-of-Experts (MoE), and Flash Attention.
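The masked (causal) attention used in decoder-only models can be sketched as follows: future positions receive a large negative score before the softmax, so each token attends only to itself and earlier tokens. The uniform zero scores here are dummies standing in for real attention scores:

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may only attend to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -1e9)  # hide future tokens
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                  # dummy attention scores
weights = masked_softmax(scores, causal_mask(4))
print(np.round(weights, 2))
# Row i spreads its weight only over the first i+1 positions
```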

Another major change introduced in LLMs was scaling up model parameters, which enabled better understanding and reasoning across diverse contexts.

  • GPT-1: 117M
  • GPT-2: 1.5B
  • GPT-3: 175B
  • GPT-4: (estimated over 1T parameters)

Summary:

While Transformers brought major advancements to deep learning, and specifically to NLP (Natural Language Processing), LLMs were the need of the hour for AI.

They enhanced the capabilities of Transformers and enabled generalized applications at scales of up to trillions of parameters. What looked like a distant possibility at the end of 2019 has, in just 3-4 years, allowed organizations to build large-scale, optimized AI applications that serve various automation tasks and bring efficiency to businesses. And this is just the start: by the end of this decade, AI will likely have reached mass adoption and will serve industries in many more ways.
