A Deeper Understanding of the LLM Transformer

In part 6 and part 7 we learned what the transformer is and how it works. If you haven’t read those, please read them first; this article will make more sense afterward.

In this section we will get into the technical details about how this works.

The process consists of three main steps:

  1. Convert Words to Terms
  2. Create the Embedding Vector
  3. Use the Transformer to create a final embedding vector

Convert Words to Terms

The first step is for the LLM to take the provided text and convert it to “terms”. The goal is to reduce the complexity of the text and to standardize words so that different variations of a word are recognized as the same term.

This process can include tokenization, normalization, stemming, and lemmatization.

Let’s start with the sentence, “The boys running the race achieve their personal bests.” and see how this works:

Tokenization

First, the raw text is split into individual words (tokens) using tokenization. This step isolates words from the surrounding punctuation and spaces.

The above sentence would be turned into [“The”, “boys”, “running”, “the”, “race”, “achieve”, “their”, “personal”, “bests”]
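A minimal sketch of this split in Python (real LLM tokenizers, such as byte-pair encoding, are more sophisticated than this simple regular expression):

```python
import re

def tokenize(text):
    # Keep runs of letters; punctuation and whitespace are dropped.
    return re.findall(r"[A-Za-z]+", text)

sentence = "The boys running the race achieve their personal bests."
tokens = tokenize(sentence)
# ['The', 'boys', 'running', 'the', 'race', 'achieve', 'their', 'personal', 'bests']
```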

Normalization

Normalization involves converting all the text to a consistent case (usually lowercase) to ensure that words are recognized as the same term regardless of their case. For instance, “Apple,” “APPLE,” and “apple” would all be normalized to “apple.”

The above tokens would be normalized into [“the”, “boys”, “running”, “the”, “race”, “achieve”, “their”, “personal”, “bests”]
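A sketch of this step:

```python
def normalize(tokens):
    # Lowercase every token so "The" and "the" map to the same term.
    return [t.lower() for t in tokens]

tokens = ["The", "boys", "running", "the", "race", "achieve", "their", "personal", "bests"]
normalized = normalize(tokens)
```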

Stemming

Stemming reduces words to their root form, chopping off the end of words in a somewhat heuristic manner. The purpose is to group related words together even if they don’t have exactly the same spelling. For example:

  • “running”, “runs”, “runner” → “run”

The stem might not always be a valid word itself but serves to consolidate variations of a word.

Stemming would transform the normalized tokens into [“the”, “boy”, “run”, “the”, “race”, “achiev”, “their”, “person”, “best”]
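Here’s a toy stemmer with just a handful of suffix rules (real stemmers such as the Porter stemmer use many more rules, but the idea is the same):

```python
def stem(word):
    # A toy, heuristic stemmer: strip common suffixes in order.
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) > 1 and word[-1] == word[-2]:  # undouble: "runn" -> "run"
            word = word[:-1]
    elif word.endswith("al") and len(word) > 5:
        word = word[:-2]
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    elif word.endswith("e") and len(word) > 5:
        word = word[:-1]
    return word

normalized = ["the", "boys", "running", "the", "race", "achieve", "their", "personal", "bests"]
stems = [stem(t) for t in normalized]
# ['the', 'boy', 'run', 'the', 'race', 'achiev', 'their', 'person', 'best']
```

Note how "achiev" and "person" are not valid words; the stemmer only cares about grouping variations together.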

Lemmatization

Lemmatization is a more sophisticated approach than stemming. It uses vocabulary and morphological analysis of words to remove only inflectional endings and return the base or dictionary form of a word, known as the lemma. Unlike stemming, lemmatization ensures that the resulting form is a valid word in the language.

  • “am”, “are”, “is” → “be”
  • “mice” → “mouse”

Lemmatization typically requires more computational resources than stemming because it involves understanding the context and the morphological analysis of words.

Lemmatization would give us the terms as [“the”, “boy”, “run”, “the”, “race”, “achieve”, “their”, “personal”, “best”]
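A sketch of dictionary-based lemmatization (the lookup table below is a tiny made-up stand-in for the full vocabulary and morphological rules a real lemmatizer uses):

```python
# Toy lemma dictionary; a real lemmatizer consults a full vocabulary
# plus part-of-speech information instead of a hard-coded table.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "mice": "mouse",
    "boys": "boy", "running": "run", "bests": "best",
}

def lemmatize(word):
    # Fall back to the word itself when it is already a lemma.
    return LEMMAS.get(word, word)

normalized = ["the", "boys", "running", "the", "race", "achieve", "their", "personal", "bests"]
terms = [lemmatize(t) for t in normalized]
# ['the', 'boy', 'run', 'the', 'race', 'achieve', 'their', 'personal', 'best']
```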

Embeddings

As discussed in part 5, LLMs convert text (set of terms) into embedding vectors. 

An embedding encodes a word like “Elephant” into an array of numbers where each number is the coordinate of the word in a particular dimension.  

For more info on embedding, see Embeddings are the GPS Coordinates of the LLM.

For example, an embedding for the word “Elephant” may look like an array of numbers { 22, 55, 23, 14 }.

Each number in this array can be used to find words that are close to the given word in that dimension.

For example, an embedding { 23, 77, 45, 19 } is close to the previous embedding in the first dimension because the values in that dimension, 22 and 23, are close to each other.

To compare text holistically we have to compare the coordinates in all the dimensions. This is essentially a vector distance calculation between the vector for the first word and the vector for the second word. Many such vector distance calculations exist, but cosine similarity is a popular one.
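A sketch of cosine similarity using the toy embeddings above:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); 1.0 means the vectors point
    # in exactly the same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

elephant = [22, 55, 23, 14]   # the toy embedding from above
other    = [23, 77, 45, 19]
similarity = cosine_similarity(elephant, other)
```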

Let’s take a small sentence like “Elephants have trunks”.

There are three words in this sentence:

  1. “Elephants”
  2. “have”
  3. “trunks”

Each of these words has an embedding vector.  This embedding consists of an array of numbers where each number is the position of this word in that dimension.

The LLM can now combine all three embeddings into a single embedding vector.
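One simple way to picture this: the per-word embeddings are stacked side by side, one row per word, into a matrix the transformer processes together. A tiny sketch with made-up 4-dimensional values (only "elephants" reuses the toy numbers from earlier):

```python
# Hypothetical 4-dimensional embeddings; the values are made up.
embeddings = {
    "elephants": [22, 55, 23, 14],
    "have":      [ 5, 12, 60,  8],
    "trunks":    [30, 44, 21, 17],
}

# Stack one row per word: a 3 x 4 matrix for the sentence.
sentence_matrix = [embeddings[w] for w in ["elephants", "have", "trunks"]]
```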

Positional Encoding

If we stop here then a sentence like “Trunks have elephants” would result in the same embedding vector as “Elephants have trunks” since both have the same set of words.

The LLM would have to treat them the same.  Clearly the sentence “Trunks have elephants” has a different meaning than “Elephants have trunks”.  So how does the LLM know they are different?

This is where the second part of the embedding vector comes in.  The LLM creates another embedding that captures the POSITION of each word in the sentence. 

This new positional embedding is then combined with the word-by-word embedding vector to create the final vector. (We don’t currently know why it is OK to just combine the two vectors even though they mean different things, but it seems to work… Maybe one day someone will show theoretically why this works!)

Ok so now we have an embedding vector that is a combination of the word-by-word embedding vector and the positional embedding vector.
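One common way to build the positional embedding is the sinusoidal scheme from the original transformer paper (Attention Is All You Need). A small sketch, reusing the toy 4-dimensional word embedding from earlier; the "combining" is just element-wise addition:

```python
import math

def positional_encoding(position, d_model):
    # Sinusoidal encoding: even dimensions use sine, odd dimensions use
    # cosine, with wavelengths forming a geometric progression.
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

d_model = 4
word_embedding = [22, 55, 23, 14]                 # toy embedding for token 0
pos_embedding = positional_encoding(0, d_model)   # encoding for position 0
final_vector = [w + p for w, p in zip(word_embedding, pos_embedding)]
```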

The Transformer

Now we run this embedding vector through the transformer.

The transformer has a set of attention blocks and each attention block has a set of attention heads.

The attention heads in the transformer focus on different parts of the embedding vector.  One attention head may look at one chunk of the embedding vector while another attention head looks at another chunk of the embedding vector. (Why is it ok to have attention heads look at different chunks? Again we don’t completely know but it seems to work).

The attention heads then read their chunk of the embedding vector and run their AI model to output another embedding vector that captures the essence of the original. Basically, attention heads are a way of looking back over the sequence of tokens and packaging up the past in a form that’s useful for finding the next token.

The final embedding vector after going through all the attention heads is then used by the LLM to find nearby words to complete the sentence.

Attention Heads

Attention heads have some key abilities:

  • Multiple Attention Heads: Each attention head in a Transformer layer computes a version of self-attention independently. Self-attention allows the model to weigh the importance of different words in the input sequence relative to each other. For a given token, this means considering how it relates to every other token in the sequence when generating its representation.
  • Query, Key, Value Vectors: Inside an attention head, the embedding vectors are transformed into three new sets of vectors: queries (Q), keys (K), and values (V). These transformations are done using learned weights. Essentially, for each token’s embedding vector, a separate query vector, key vector, and value vector are created.
    • The query vector represents the token for which we are trying to compute attention.
    • The key vectors are what the query vector is compared against to determine how much focus to put on other parts of the input.
    • The value vectors are what we actually sum up to compute the output of the attention mechanism, weighted by the attention scores.
  • Attention Scores and Weighted Sum: Attention scores are calculated by comparing query vectors with key vectors, typically using a dot product. These scores determine how much each token should attend to every other token. The scores are then used to create a weighted sum of the value vectors, which becomes the output of the attention head for each token.
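The steps above can be sketched as scaled dot-product attention. The Q, K, V values below are made up, and the learned weight matrices that would produce them from the embeddings are omitted:

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention: scores = softmax(Q.K^T / sqrt(d_k)),
    # output = attention-weighted sum of the value vectors.
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = softmax(
            [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        )
        out.append([sum(s * v[i] for s, v in zip(scores, V)) for i in range(len(V[0]))])
    return out

# Toy Q, K, V for a 2-token sequence with d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
contextualized = attention(Q, K, V)  # one output vector per token
```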

Effect of Attention Heads on the Embedding Vector

  • Combining Attention Heads: The outputs from all attention heads in a layer are concatenated and then linearly transformed back to the original embedding size. This ensures that the output of the multi-head attention layer can be processed by subsequent layers, which expect inputs of a consistent size.
  • Size of the Embedding Vectors: The size of the embedding vectors (i.e., the dimensionality of the embeddings) does not change as a result of passing through the attention heads. What changes is the information contained within these vectors. Through the attention mechanism, each embedding vector is updated to contain information that reflects the context provided by the entire sequence.
  • Information Integration: As embedding vectors pass through successive Transformer layers, they are enriched with more context, allowing the model to generate responses based on a complex understanding of the input text. Each layer’s attention heads and feed-forward networks further refine the representations, keeping the size of the vectors constant but enhancing their informational content.
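The concatenate-and-project step can be sketched like this (the head outputs and the projection matrix W_O are made up; in a real model W_O is learned):

```python
# Outputs of two heads (each of size 2) for a 2-token sequence.
head1 = [[1.0, 2.0], [3.0, 4.0]]
head2 = [[5.0, 6.0], [7.0, 8.0]]

# Concatenate per token: each token now has a vector of size 4.
concat = [h1 + h2 for h1, h2 in zip(head1, head2)]

# W_O projects back to the model's embedding size (4 here); the
# identity matrix stands in for the learned weights.
W_O = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

def matvec(W, x):
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]

projected = [matvec(W_O, row) for row in concat]
```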

For Even More Details

I’ve tried to keep this article simple so it can be understood. If your thirst is not quenched, you can read the original paper that invented the transformer (Attention Is All You Need) or the paper on GPT-4.

Summary

Text provided to the LLM goes through a three-step process: first the text is converted to a set of terms; then these terms are converted into a combined vector that includes the word embedding vector and a positional encoding vector; finally, this embedding vector is fed into a transformer.

The transformer uses attention heads that focus on different chunks of the embedding vector. The attention heads, using an AI model, transform the input vector into an output vector.

The output of all the attention heads is combined into a single vector that the LLM uses to find the next word.
