In Part 6 – Transformer is the memory of the LLM, we learned that the transformer allows LLMs to remember the past: it looks back in the document, extracts the essence, and uses it to figure out the next word to complete the sentence.
But how does it do that?
Starting with an example of how our brain handles this task, we will build up to an explanation of how the transformer does it inside an LLM.
Past Articles in this Series
In part 1, we learned that ChatGPT is just completing the sentence.
In part 2, we learned that just this simple task can solve complex problems.
In part 3, we learned that LLMs create a map of words.
In part 4, we learned how LLMs calculate the distance between words in this map of words.
In part 5, we learned how embeddings are GPS coordinates in this map of words.
In part 6, we learned how the transformer is the memory of the LLM.
Driving Trip from San Francisco to Los Angeles
In Part 6, we discussed an example of driving from San Francisco to Los Angeles.
We are now at Monterey, roughly in the middle of the trip.
We have three choices of where to go next:
- Drive straight to Los Angeles
- Stop at a gas station next
- Stop at a restaurant next
How do we decide which one is the best choice?
The answer is dependent on (at least) four pieces of information from the past:
- Did we stop for gas on the way?
- Did we eat lunch recently?
- How far is Los Angeles from Monterey?
- What is the range of our car on a full tank of gas?
Of course, we didn’t know when we were driving from San Francisco to Monterey that we would be choosing where to go from Monterey. So we didn’t know that we had to remember these four things.
How did our brain remember the information to help us now?
We don’t have photographic memory, so our brain did not record everything it ever saw. It doesn’t remember every tree we passed or every word in the car manual. Yet it seems to have the information we need to make this decision.
There are two main things happening here.
Selective Memory
The first is that the brain remembers select things as it observes the world. It won’t remember each tree you pass or each stop sign on the road, but it will remember whether you stopped for gas, whether you ate lunch, and what highways you took.
How did the brain know to remember certain things and not other things?
One reason is that it was trained to do so. In the past, we likely tried to recall whether we stopped for gas, whether we ate lunch and what highways we took. So the brain learned that those are things it should remember in the future. We never tried to recall individual stop signs or individual trees so the brain learned to not remember them.
There are other reasons too, such as the brain being trained to remember anomalies rather than the steady state. For example, you will probably remember each stormy day but not each day when the weather was normal.
Filtered Recall
The second is that when you’re thinking, the brain recalls the parts of a memory that are relevant to the current thought, not the full memory. In fact, it creates an essence of that memory.
If you’re thinking about the decision of where to go next, the brain will remember that you ate. But it won’t provide you information on what you ate or how much the bill was. Of course, if you start thinking about those questions instead, the brain will now bring you those parts of your memory from lunch.
Let’s look at the example of reading the car manual and remembering the fuel range. There is a lot of other information in the manual, but your brain remembered only this fact. Again, this is based on training: you’ve likely wondered about the fuel range often in the past but never about the number of bolts in your seat, so your brain has learned to remember one and not the other.
If you are a mechanic, you think about different things, so your brain is trained to remember different parts of the car manual and will recall different details than the fuel range.
How Does the Transformer Work?
The transformer works very much like our brain did in the driving trip example above.
Let’s start with a text document:
“I started driving from San Francisco in my car. I took highway 101 and stopped for lunch at noon in Salinas. I also filled up my tank. Now I am at Monterey. I should next go to _______.”
To keep it simple, let’s say the LLM can only store ten words, meaning it can only use the ten previous words to calculate the next one.
Without the transformer, we would get this shortened sentence:
“Now I am at Monterey. I should next go to _______.”
Our choice here would be far from ideal, since we don’t know whether the driver is hungry or whether the car has enough gas. If we decide to go to LA, we may run out of gas or have to go out of our way to eat.
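To make the cutoff concrete, here is a minimal sketch in Python (not the LLM’s actual code; the ten-word limit is just our made-up number) of what a hard context limit does to the document:

```python
# A hard context limit: with no transformer, the model only sees the
# last N words, and everything earlier is simply lost.
CONTEXT_LIMIT = 10  # the made-up ten-word limit from the example

document = ("I started driving from San Francisco in my car. "
            "I took highway 101 and stopped for lunch at noon in Salinas. "
            "I also filled up my tank. Now I am at Monterey. "
            "I should next go to")

words = document.split()
visible = words[-CONTEXT_LIMIT:]  # keep only the ten most recent words

print(" ".join(visible))
# -> Now I am at Monterey. I should next go to
```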
What we need is a way to package up the previous sentences and include that summary in the context when making our decision.
Attention Heads
A transformer does this by using “attention heads”. An attention head looks at a portion of the text and extracts the essence of it.
What is an attention head?
One way to understand it is to imagine a student reading that text and writing a cheat sheet to take into their exam.
In our driving analogy above, we would have one attention head that reads the car manual. The attention head has been trained to extract the essence of that text. So it remembers the fuel range.
Another attention head reads “I took highway 101 and stopped for lunch at noon in Salinas.” and remembers that I had lunch recently.
Another attention head reads “I also filled up my tank.” and remembers that I have a full tank of gas.
So once the attention heads of the transformer have run, they have created snippets that can be included when the LLM is finding the next word to complete the sentence.
“Now I am at Monterey. [Not hungry, enough fuel to get to LA]. I should next go to _______.”
With this extra context included in brackets, the LLM can finally make a good decision.
The transformer looked at the past text, extracted the essence that relates to the current task, and included it in finding the next word. This is the role of the transformer.
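For readers who want to peek under the hood, here is a toy sketch of the attention calculation in Python. The words, vectors, and numbers are made up purely for illustration; a real head works on learned, high-dimensional embeddings, but the mechanics are the same: a “question” vector scores every past word, and the head carries forward a blend weighted by those scores.

```python
import numpy as np

def attention(query, keys, values):
    """One attention head: the query scores every past word, the scores
    become weights, and the output is a weighted blend of the values."""
    scores = keys @ query / np.sqrt(len(query))       # how relevant is each word?
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax: scores -> weights
    return weights @ values, weights                  # the "essence" carried forward

# Made-up 2-dimensional vectors for three past words (illustration only).
past_words = ["lunch", "tank", "Monterey"]
keys   = np.array([[2.0, 0.0],   # "lunch" advertises food information
                   [0.0, 2.0],   # "tank" advertises fuel information
                   [1.0, 1.0]])  # "Monterey" offers a bit of both
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.5, 0.5]])
query  = np.array([0.0, 2.0])    # this head is "asking" about fuel

essence, weights = attention(query, keys, values)
for word, w in zip(past_words, weights):
    print(f"{word:>9}: {w:.2f}")
# "tank" gets by far the largest weight, so the fuel fact is what this
# head extracts and passes along for predicting the next word.
```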
How do attention heads know what to extract?
The next question may be how the attention head knew what to extract from the piece of text it read.
Attention heads are AI models, so they are trained. In the training process, they learn what to extract.
Think back to the analogy of the student writing a cheat sheet before an exam. When the student takes the exam, he or she learns what information from the cheat sheet was useful and what was not.
The next time that student creates a cheat sheet, he or she will include things similar to the ones that were useful in the last exam. Repeat this process for a few exams, and the student gets better at extracting the essence from large amounts of text.
The attention heads learn the same way.
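As a rough sketch of what “learning the same way” means, here is a toy attention head written in PyTorch (the sizes, data, and loss below are stand-ins, not a real training setup). The matrices that decide what the head extracts are ordinary trainable weights, nudged a little after every “exam”:

```python
import torch
import torch.nn as nn

# A toy single attention head. The three matrices below decide what gets
# put on the "cheat sheet", and they are ordinary trainable weights.
class TinyAttentionHead(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)  # what is this head looking for?
        self.k = nn.Linear(dim, dim, bias=False)  # what does each past word offer?
        self.v = nn.Linear(dim, dim, bias=False)  # what is carried forward if chosen?

    def forward(self, x):                                    # x: (seq_len, dim)
        scores = self.q(x) @ self.k(x).T / x.shape[-1] ** 0.5
        weights = torch.softmax(scores, dim=-1)
        return weights @ self.v(x)

head = TinyAttentionHead(dim=8)
optimizer = torch.optim.SGD(head.parameters(), lr=0.1)

x = torch.randn(5, 8)        # five past "words" with made-up embeddings
target = torch.randn(5, 8)   # stand-in for the signal from next-word prediction

loss = ((head(x) - target) ** 2).mean()  # the "exam result"
loss.backward()                          # how should the cheat sheet change?
optimizer.step()                         # the head extracts differently next time
```

After each update, the head extracts slightly different information, the same way the student’s cheat sheets improve exam after exam.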
Are you up for reading the technical explanation of this process?