
How LLMs Work!
As access to advanced AI models becomes a commodity, and developers such as myself find it easier than ever to build AI applications using the plethora of APIs available, it is increasingly important to understand how this technology works in order to better utilise it. This article is an attempt to go through the various resources available on the internet in order to understand the history, motivation and building blocks of modern LLMs.
There’s a fair chance that if you’ve been on the internet in 2023, you have come across or used ChatGPT and similar tools. LLMs, or Large Language Models, are the tech behind ChatGPT. They are a type of machine learning model used for natural language processing tasks such as language generation and text classification, and they are capable of generating new text that is similar in style and content to their training data. Because they are trained on large sets of human-generated data, they are able to mimic human intelligence.
Evolution of language models
Language modeling is the task of predicting what word comes next. A system that does language modeling is called a language model.
These models began with simple statistical approaches, such as counting word frequencies and utilizing techniques like N-grams, which consider the probabilities of word sequences based on historical data.

By understanding the statistical relationships between words and phrases, these models were able to perform tasks such as correcting grammar, suggesting next words, or even translating between languages at a rudimentary level.
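For intuition, here is a minimal sketch of a bigram (2-gram) model in Python. The toy corpus, function name and resulting probabilities are made up purely for illustration; real systems were trained on much larger text collections.

```python
from collections import Counter, defaultdict

# Toy corpus; in practice this would be a large collection of real text.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Count how often each word follows each other word (bigram counts).
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_probs(prev_word):
    """Estimate P(next word | previous word) from relative bigram frequencies."""
    counts = bigram_counts[prev_word]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_probs("the"))   # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```

The model only knows co-occurrence counts, which is exactly why this family of models struggles with anything beyond short, local patterns.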
As limitations of rule-based and statistical models became apparent, language models moved towards neural network-based approaches. These new models could understand nuances and context in language far beyond what early models could achieve.
Neural networks are inspired by the human brain, mimicking the way that biological neurons signal to one another. They are composed of layers of nodes: an input layer, one or more hidden layers, and an output layer. Each node, or artificial neuron, connects to others and has an associated weight and threshold. If the output of any individual node is above the specified threshold value, that node is activated, sending data to the next layer of the network. Otherwise, no data is passed along to the next layer of the network.
Neural networks rely on training data to learn and improve their accuracy over time.
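As a rough sketch of the structure described above, here is a tiny forward pass in Python/NumPy with one hidden layer. The layer sizes and random weights are arbitrary stand-ins; in practice the weights are learned from training data, and the hard threshold described above is usually replaced by a smooth activation such as ReLU.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy network: 3 inputs -> 4 hidden nodes -> 2 outputs. Weights would normally
# be learned from training data; here they are random for illustration.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

def relu(x):
    # Activation: a node passes a non-zero value on only when its input is above zero.
    return np.maximum(0, x)

def forward(x):
    hidden = relu(x @ W1 + b1)   # input layer -> hidden layer
    return hidden @ W2 + b2      # hidden layer -> output layer

print(forward(np.array([1.0, 0.5, -0.2])))
```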

Language models evolved towards using a specific type of neural network called the Recurrent Neural Network (RNN). This solved a particular problem that was limiting existing language models: memory. A defining feature of RNNs is that, unlike other neural networks, the inputs are not treated as independent of each other. The network remembers what it has seen so far using a simple loop that carries information from the previous time step and adds it to the input of the current time step. This knowledge of the relationship among all the previous words helps it predict a better output.
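Here is a minimal sketch of that loop in NumPy, with illustrative sizes and random weights standing in for learned parameters; the hidden state h is the network’s “memory”, carried from one time step to the next.

```python
import numpy as np

rng = np.random.default_rng(1)

hidden_size, input_size = 8, 5
W_xh = rng.normal(size=(input_size, hidden_size))   # current input -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))  # previous hidden -> hidden
b_h = np.zeros(hidden_size)

def rnn(inputs):
    h = np.zeros(hidden_size)                     # the "memory" starts empty
    for x_t in inputs:                            # process the sequence one step at a time
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)  # mix current input with previous state
    return h                                      # final state summarises the whole sequence

sequence = rng.normal(size=(4, input_size))       # e.g. 4 word vectors
print(rnn(sequence))
```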

But vanilla RNNs were not very good at retaining these dependencies over long sequences; this is known as the vanishing gradient problem. A particular type of RNN known as the Long Short-Term Memory network (LSTM) was introduced to bypass this problem. LSTMs store information from the previous state and the current input in a “cell” (memory) instead of only in the hidden state. This enables them to learn long-term dependencies from text without older states and inputs “vanishing”. LSTMs have been used for tasks like text classification, sentiment analysis, and language modelling.
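For a rough idea of what the “cell” and its gates look like, here is a heavily simplified single LSTM step in NumPy (biases omitted, weights random, sizes and names purely illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(2)
input_size, hidden_size = 5, 8

# One weight matrix per gate, acting on [previous hidden state, current input].
W_f, W_i, W_o, W_c = (rng.normal(size=(hidden_size + input_size, hidden_size)) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(z @ W_f)                    # forget gate: what to erase from the cell
    i = sigmoid(z @ W_i)                    # input gate: what new information to store
    o = sigmoid(z @ W_o)                    # output gate: what to expose as the hidden state
    c = f * c_prev + i * np.tanh(z @ W_c)   # the cell ("memory") carries long-term information
    h = o * np.tanh(c)
    return h, c

h = c = np.zeros(hidden_size)
for x_t in rng.normal(size=(4, input_size)):   # a toy sequence of 4 vectors
    h, c = lstm_step(x_t, h, c)
print(h)
```

The gates decide what to erase from the cell, what new information to write into it, and how much of it to expose at each step, which is what lets information survive over long sequences.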
However, since RNNs work by passing information along each step sequentially, they can only run in an ordered manner, which makes them slow. Another problem with RNNs is that they only get context from the previous steps; even bidirectional RNNs get context only from their immediate neighbours. Because of this, some meaning is lost down the line. These problems provide the motivation for the kind of neural network that forms the base of modern LLMs like GPT: Transformer Neural Networks.
Transformer Neural Networks
In their famously titled “Attention Is All You Need“ paper, the authors described the Transformer architecture for neural networks:
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
What transformer networks aim to do is use “attention” to understand the importance of each word in a sentence. Instead of getting context only from the previous encoder state, as is the case with RNNs, each state of the transformer decoder looks at all states of the encoder, extracting information from the whole sequence; this is what the authors term “attention”. It allows the decoder to assign greater weight, or importance, to certain elements of the input for each element of the output, learning at every step to focus on the right part of the input to predict the next output element. Since no recurrent steps are used and only weighted sums and activations are needed, the computation is highly parallelizable.
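As a concrete (and heavily simplified) sketch, here is dot-product attention as a weighted sum in NumPy, with random vectors standing in for learned encoder and decoder states:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(3)
d = 8                                      # vector size (illustrative)
encoder_states = rng.normal(size=(5, d))   # one vector per input word
decoder_state = rng.normal(size=d)         # state producing the next output word

# Score every encoder state against the decoder state, turn the scores into
# weights, and take a weighted sum: the decoder "attends" to the whole input.
scores = encoder_states @ decoder_state
weights = softmax(scores)                  # higher weight = more important input word
context = weights @ encoder_states         # weighted sum of all encoder states

print(weights, context.shape)
```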
Architecture
Let’s dive into the architecture of transformer neural networks and break it down:

1. Encoder:
The encoder consists of a stack of identical layers (e.g., 6 layers in the original Transformer model). Each layer has two main components:
- Multi-Head Self-Attention Mechanism: This allows the encoder to consider other parts of the input sequence when encoding a particular part, providing context for understanding the relationships between words or tokens.
- Feed-Forward Neural Network: After attention, the output passes through a feed-forward neural network (the same one for each position) followed by layer normalization.
Layer Normalisation: In deep learning, normalization is like adjusting the volume levels of different instruments in a band so that one doesn't drown out the others. It helps everything work together more smoothly. The layer normalisation technique adjusts the volume of each instrument (feature) for each song (input) individually, making sure that, within each song, no instrument is too loud or too soft compared to the others.
The above components are connected using residual connections, which helps in training deeper models (a sketch putting these sub-layers together appears below, after the encoder description).
Residual connections: They are an essential component in deep learning architectures. They allow the output of one layer to bypass one or more intermediate layers and be added directly to the output of later layers.
The encoder takes the input sequence and processes it through all its layers, producing a continuous representation that captures the context of each word in the sequence with respect to all other words.
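Putting the pieces above together, here is a minimal sketch of a single encoder layer in NumPy: one attention head instead of multi-head attention, random weights in place of learned ones, and illustrative sizes throughout.

```python
import numpy as np

rng = np.random.default_rng(4)
d_model, d_ff, seq_len = 16, 32, 6      # illustrative sizes

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise each position's feature vector to zero mean / unit variance.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

# Random weights stand in for learned parameters.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W_1, W_2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))

def encoder_layer(x):
    # 1) Self-attention sub-layer (single head here for brevity).
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    attn = softmax(q @ k.T / np.sqrt(d_model)) @ v
    x = layer_norm(x + attn)            # residual connection + layer norm
    # 2) Position-wise feed-forward sub-layer.
    ff = np.maximum(0, x @ W_1) @ W_2
    return layer_norm(x + ff)           # residual connection + layer norm

tokens = rng.normal(size=(seq_len, d_model))   # embeddings for 6 tokens
print(encoder_layer(tokens).shape)             # (6, 16): same shape in, same shape out
```

Because the output has the same shape as the input, several such layers can simply be stacked on top of each other, as in the original model’s stack of identical layers.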
2. Decoder:
The decoder also consists of a stack of identical layers (the same number as in the encoder), and each layer has three main components:
- Multi-Head Self-Attention Mechanism: Similar to the one in the encoder, but it operates on the outputs of the previous decoder layer, i.e. the output sequence generated so far. This allows the model to focus on different parts of its own output as needed.
- Multi-Head Cross-Attention Mechanism: This attends to the encoder's output, allowing the decoder to consider the entire input sequence when generating each word in the output sequence (a sketch of this appears after the decoder description below).
- Feed-Forward Neural Network: Like in the encoder, the output of the attention mechanisms passes through a feed-forward neural network followed by layer normalization.
Again, residual connections are used for ease of training.
The decoder's aim is to generate the output sequence, one symbol (e.g., word or character) at a time, using the continuous representations provided by the encoder and its own previous outputs.
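Here is a minimal sketch of the cross-attention step mentioned above, in NumPy with random stand-in weights: queries come from the decoder's states so far, while keys and values come from the encoder's output, so every generated token can look at the whole input sequence.

```python
import numpy as np

rng = np.random.default_rng(5)
d_model = 16

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def cross_attention(decoder_states, encoder_outputs):
    # Queries come from the decoder; keys and values come from the encoder,
    # so every output position can look at the entire input sequence.
    q = decoder_states @ W_q
    k = encoder_outputs @ W_k
    v = encoder_outputs @ W_v
    weights = softmax(q @ k.T / np.sqrt(d_model))   # (output positions, input positions)
    return weights @ v

encoder_outputs = rng.normal(size=(7, d_model))   # e.g. 7 input tokens
decoder_states = rng.normal(size=(3, d_model))    # e.g. 3 tokens generated so far
print(cross_attention(decoder_states, encoder_outputs).shape)  # (3, 16)
```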
3. Self-Attention:
Self-attention in Large Language Models (LLMs) like Transformers is a mechanism that helps the model pay attention to different parts of a sentence to better understand it. It looks at all the words in the sentence at the same time and figures out how much attention each word should pay to the other words to make sense of the whole sentence.
In simple terms, it's like the model giving a "score" to the relationships between words, so it knows which words are most relevant to each other. This helps the model understand the meaning of the sentence better and captures the relationships between words, even if they are far apart in the sentence.
This is the only operation in the whole architecture that propagates information between vectors. Every other operation in the transformer is applied to each vector in the input sequence without interactions between vectors.
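To make the “score” idea concrete, here is a small NumPy sketch that computes self-attention scores between the words of a toy sentence; the embeddings and projection matrices are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(6)
words = ["the", "cat", "sat", "down"]
d = 8

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stand-in embeddings and query/key/value projections (learned in a real model).
embeddings = rng.normal(size=(len(words), d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

q, k, v = embeddings @ W_q, embeddings @ W_k, embeddings @ W_v
scores = softmax(q @ k.T / np.sqrt(d))   # scores[i, j] = how much word i attends to word j
output = scores @ v                      # each word's new vector mixes in the others

for word, row in zip(words, scores):
    print(word, np.round(row, 2))
```

Each row of `scores` sums to 1 and says how much that word attends to every other word; the final multiplication `scores @ v` is the step where information actually flows between word vectors.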
4. Positional Encoding:
Models like the Transformer rely solely on attention mechanisms and do not use recurrent layers, so they do not inherently capture the order of the sequence; positional encoding is necessary to give the model information about the positions of the tokens in the sequence.
Without it, “I am groot” and “groot I am” would return the same results. Thus we need to create a representation of each word’s position and add it to the token embedding.
The encoding must have the same dimensionality as the embeddings so that the model can process both the content and position information together. The sinusoidal encoding used in the original Transformer paper can be computed for sequences of different lengths, making it flexible across tasks and inputs. Since the model now has both positional and input encodings, it is capable of a much more nuanced understanding and generation of data.
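Below is a sketch of that sinusoidal positional encoding, added to made-up token embeddings; since the encoding depends only on position and dimension, it can be computed for any sequence length.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding from the original Transformer paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = positions / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

d_model = 16
token_embeddings = np.random.default_rng(7).normal(size=(3, d_model))  # "I", "am", "groot"
inputs = token_embeddings + positional_encoding(3, d_model)  # same dimensionality, so we can simply add
print(inputs.shape)   # (3, 16)
```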