GPT in GPT-4 stands for Generative Pre-trained Transformer. We understand that Generative means it generates content. Pre-trained means that someone has already trained the LLM. But what does “Transformer” mean? The transformer is a recent invention, from 2017, that makes LLMs so powerful.
In Part 1, we learned that ChatGPT is just completing the sentence.
In Part 2, we learned that just this simple task can solve complex problems.
In Part 3, we learned that LLMs create a map of words.
In Part 4, we learned how LLMs calculate the distance between words in this map of words.
In Part 5, we learned how embeddings are GPS coordinates in this map of words.
All this is great for individual words, but now let’s move on to sentences, paragraphs, and documents.
We run into two problems: LLMs don’t have the memory capacity to remember embeddings for every single word in a large document, and they don’t have the processing capacity to do calculations across so many words.
LLMs solve these problems using the transformer.
Driving from San Francisco to Los Angeles
Let’s again start with an analogy that we understand: taking a journey from San Francisco to Los Angeles.
We start driving from San Francisco to Los Angeles. We’re now at the city of Monterey, partway along the route.
We have three choices of where to go next:
- Drive straight to Los Angeles
- Stop at a gas station next
- Stop at a restaurant next
How do we decide which one is the best choice?
To make this decision, we would need to remember whether we have enough gas and whether we have eaten lunch. (Ignore the fuel indicator and your feeling of hunger for this simplified exercise.)
Just the knowledge that we’re at Monterey is not enough to decide where to go next. We need the memory of what has happened since we left San Francisco.
It is not enough to know where you are. You also have to know where you have been.
How Our Brain Handles the Memory of Our Trip
Let’s say we had stopped at Salinas for gas and food, and our car’s manual says it goes 500 miles on a full tank. These are two pieces of information in our memory that can help us decide where to go next.
One option would be for our brain to remember every single detail that happened between our leaving San Francisco and arriving at Monterey: every single stop sign, every single car we passed, every single tree, and every single detail in our car’s manual.
Clearly we do not have the memory capacity in our brains to remember every detail.
Even if we could remember every detail, we do not have the processing capacity in our brain to sift through every detail in our memory just to decide whether to drive to LA, stop for gas, or stop for food.
Instead, our brain remembers only certain parts of what we see. If we are driving, we will probably remember the highways we’ve taken but not every stop sign we saw. Our brain, based on past experience, is always deciding what memories to store and what to forget.
Then, when we think about something, the brain goes through all our stored memories and chooses the ones that are relevant to our thoughts. From each chosen memory, it extracts only the relevant part.
These two processes, storing only highlights of what we see and then filtering those memories based on what we’re thinking, are what allow our brain to handle the task of deciding where to go from Monterey.
Since we already stopped for gas and food, and the distance from Monterey to LA is 320 miles (well within our fuel range of 500 miles on a full tank), our brain tells us we should drive straight to LA. We are not hungry and don’t need gas to make it to LA.
LLMs Have the Same Problem
LLMs run into the same two problems:
- The LLMs don’t have the memory capacity to remember embeddings for every single word in a large document.
  - Remember from Part 5 that the embedding of each word consists of 1,536 numbers. When you have a document with many words, you can imagine that the memory required grows very large very quickly.
- The LLMs don’t have the processing capacity to do calculations for so many words.
  - With approximately 40,000 words in common English usage, predicting the next word in a sentence based on the last word alone presents us with 40,000 potential outcomes. This task is akin to searching for one specific individual within San Francisco’s bustling Mission District.
  - Expanding our calculation to include the last two words increases our possibilities exponentially to 1.6 billion. This scenario can be likened to the daunting task of locating a single person somewhere within the vast populations of the United States and Europe combined.
  - Considering three words further multiplies our search to an astonishing 64 trillion possibilities, comparable to identifying a single individual among every human and mammal that has ever existed on Earth.
  - Now, imagine the computational complexity when extending this analysis to a document containing 100 words! (The short calculation after this list shows where these numbers come from.)
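To make these numbers concrete, here is that quick back-of-the-envelope calculation in Python. The 1,536-number embeddings and the roughly 40,000-word vocabulary come from the points above; the 100,000-word document size and the 4 bytes per stored number are illustrative assumptions, not figures from any actual model.

```python
# Rough arithmetic behind the two problems above.
# Assumptions: 4 bytes per stored number and a 100,000-word document
# (both illustrative); 1,536 numbers per embedding (Part 5) and a
# ~40,000-word vocabulary (the rough figure used in this article).

EMBEDDING_SIZE = 1536      # numbers per word embedding (Part 5)
BYTES_PER_NUMBER = 4       # assume 32-bit floats
VOCABULARY = 40_000        # rough working vocabulary used in this article

# Memory problem: storing an embedding for every word of a long document.
document_words = 100_000
memory_bytes = document_words * EMBEDDING_SIZE * BYTES_PER_NUMBER
print(f"Embeddings for one document: {memory_bytes / 1e9:.1f} GB")  # ~0.6 GB

# Processing problem: possible contexts to consider as we look back further.
for context_words in (1, 2, 3):
    print(f"{context_words} previous word(s): {VOCABULARY ** context_words:,} possibilities")
# 1 previous word(s): 40,000 possibilities
# 2 previous word(s): 1,600,000,000 possibilities
# 3 previous word(s): 64,000,000,000,000 possibilities
```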
LLMs solve this using the transformer. The transformer allows the LLM to “look back” in the document and extract highlights. These memory highlights can then be used in calculating the next word to finish the sentence.
The transformer model is a relatively recent invention. It was first proposed in 2017 in the landmark paper by Ashish Vaswani et al., “Attention Is All You Need.”
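To give a rough feel for what “looking back and extracting highlights” means as a calculation, here is a heavily simplified sketch of the attention idea from that paper; the real mechanism is the topic of the next article, and everything here, the three-number embeddings, the example words, and the hand-picked values, is invented purely for illustration.

```python
import numpy as np

# A toy sketch of the attention idea from "Attention Is All You Need".
# It is NOT how GPT-4 is implemented; it only shows the shape of the idea:
# the current word scores each past word for relevance and blends in
# a weighted mixture of them as its "memory".

def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    shifted = np.exp(scores - scores.max())
    return shifted / shifted.sum()

def attend(query, keys, values):
    """Blend the past 'values' according to how well each 'key' matches the query."""
    scores = keys @ query / np.sqrt(len(query))  # similarity of the query to each past word
    weights = softmax(scores)                    # relevance weights, summing to 1
    return weights @ values, weights             # weighted "essence" of the past

# Invented 3-number embeddings (instead of 1,536) for three earlier words.
past_words = ["Salinas", "lunch", "tank"]
keys = values = np.array([
    [0.9, 0.1, 0.0],   # "Salinas"
    [0.1, 0.8, 0.1],   # "lunch"
    [0.0, 0.2, 0.9],   # "tank"
])

# Invented query for the current word "Monterey": it cares about food and fuel.
query = np.array([0.1, 0.6, 0.7])

essence, weights = attend(query, keys, values)
print(dict(zip(past_words, weights.round(2))))  # how much each past word matters
print(essence.round(2))                         # the blended memory used for the next word
```

Even in this toy version, “lunch” and “tank” end up weighted more heavily than “Salinas”, which is exactly the kind of highlight we need for the decision at Monterey.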
How Transformers Act As the Memory of the LLM
Let’s start with a document:
“I started driving from San Francisco in my car that goes 500 miles on a full tank of gas. I took highway 101 and stopped for lunch at noon in Salinas. I also filled up my tank. Now I am at Monterey. I should next go to _______.”
To keep it simple, let’s say the LLM can only store ten words and can only calculate the next word using the ten previous words.
One option would be to just use the shortened sentence:
“Now I am at Monterey. I should next go to _______.”
Our choice here would not be ideal, since we don’t know whether the driver is hungry or whether the car has enough gas.
What if we could somehow read the previous sentences and extract the essence that the driver is not hungry and has enough gas?
The sentence would become:
“Now I am at Monterey. (Not hungry, enough fuel to get to LA). I should next go to _______.”
With this additional context, we can make a much better choice for the next word to complete the sentence.
This is what the transformer does. It looks back and brings in the essence of the past words/sentences to calculate the next word.
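Here is a toy sketch of the idea we just walked through: keep only the last ten words and splice in a compressed “essence” of everything that came before. The summarize() helper is a hypothetical placeholder; in a real transformer this compression is done with attention over embeddings, not with string edits.

```python
# A toy sketch of "look back and bring in the essence", assuming a made-up
# ten-word limit. summarize() is a hypothetical stand-in for the transformer
# extracting the relevant highlights from the text we cannot afford to keep.

def summarize(older_text: str) -> str:
    # Placeholder: pretend this distills the dropped text into its essence.
    return "(Not hungry, enough fuel to get to LA)"

def build_context(document: str, limit: int = 10) -> str:
    words = document.split()
    recent = words[-limit:]              # the last ten words we can afford to keep
    dropped = " ".join(words[:-limit])   # everything older that we had to drop
    essence = summarize(dropped)         # compressed memory of the dropped text
    return " ".join([essence] + recent)  # essence spliced in front of the recent words

document = (
    "I started driving from San Francisco in my car that goes 500 miles "
    "on a full tank of gas. I took highway 101 and stopped for lunch at noon "
    "in Salinas. I also filled up my tank. Now I am at Monterey. "
    "I should next go to"
)
print(build_context(document))
# (Not hungry, enough fuel to get to LA) Now I am at Monterey. I should next go to
```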
In the next article we will discuss how the transformer does this.
Summary
Take the example of driving from San Francisco to Los Angeles via Monterey. If you are at Monterey trying to decide where to go next, you need the memory of whether you have filled up on gas and whether you have eaten.
Our brain cannot remember everything from the past, nor can it process everything in our memories. Instead, it stores highlights from the past and then extracts the relevant parts when it needs them.
LLMs have the same problem. They cannot remember all the words in a document, nor can they process all those words.
The transformer, invented in 2017, allows LLMs to look back, extract relevant concepts and use them in figuring out the next word to complete the sentence.