How ChatGPT Works (Part 3) – Creating a map of words (Data Scientists)

Explanation for Technical People

This explanation is for data scientists. For others, use these links:

Non-Technical | Technical

If you’re already comfortable with vector spaces and text embeddings, you only need to skim this part of the series. Otherwise, read on to learn about these concepts explained in a simple way.

Creating a map of words

In Part 1 – All ChatGPT does is complete sentences, we learned that an LLM like ChatGPT is, at its core, just completing a sentence. In Part 2 – Many tasks are simpler than we thought, we learned that, somewhat surprisingly, many complex tasks can actually be reduced to the simpler task of completing a sentence.

We discussed how ChatGPT completes a sentence by analyzing 250 billion sentences from the internet and digitized books.

The question now is how it is able to use all those sentences to complete the sentence.

The first step is to create a map of words.

Map of addresses

Photo by Dominika Roseclay on Pexels.com

Let’s first start with a concept we are all very familiar with. A map of addresses that we use to navigate every day.

Why do we need to create a map? Why can’t we just use a list of addresses?

For example, a list of addresses would be:

  1. 1523 Pine Street, San Francisco, CA 94109
  2. 100 Broadway, Oakland, CA 94607
  3. 1180 Oak Grove Road, Walnut Creek, CA 94598

A list of addresses doesn’t tell us which addresses are close to each other and which are farther apart. It doesn’t tell us the path we would have to take to get from one address to another.

So if our goal is to find addresses that are close to each other then a list of addresses is not that useful.

We use a map instead of a list because we need to understand the relationships between addresses and the paths that can be taken to get from one address to another.

The map below shows the three addresses above plotted as points.

Notice how we can now tell that the first two addresses are really close to each other while the third address is pretty far from those two.

And we can see the paths (i.e., roads) we can take to get from one address to another.

Converting an address to a position on the map

To place an address on the map, we need its location expressed in a form that tells us exactly where on the map it belongs.

We use longitude and latitude for this. If you remember your (middle) school science classes, you may remember that we can define the position of anything on the earth by using two numbers: longitude and latitude.

For more on latitude and longitude: https://www.techtarget.com/whatis/definition/latitude-and-longitude

So now we can define each address by just two numbers:

  1. 1523 Pine Street, San Francisco, CA 94109 (lat=37.79, lon=-122.42)
  2. 100 Broadway, Oakland, CA 94607 (lat=37.79, lon=-122.27)
  3. 1180 Oak Grove Road, Walnut Creek, CA 94598 (lat=37.93, lon=-122.02)

Notice that now, given just the latitude and longitude for ANY two addresses, we can easily tell whether the two addresses are close to each other or not.

The address in San Francisco and the address in Oakland have the same latitude (37.79), so we know that neither is north or south of the other.

The address in Walnut Creek however has a different latitude (37.93). Subtracting the latitude of 1180 Oak Grove in Walnut Creek from the latitude of 100 Broadway in Oakland:

37.93 – 37.79 = 0.14

This means the address in Walnut Creek is 0.14 degrees farther north. Since each degree of latitude is approximately 69 miles, we can calculate that the address in Walnut Creek is about 9.7 miles north of the address in Oakland.

When we look at the longitude, we see that the address in San Francisco is the most west (-122.42) while the address in Walnut Creek is the most east (-122.02). The address in Oakland is in the middle (-122.27).

The difference in longitude is:

-122.02 + 122.27 = 0.25

Multiplying by 69 miles per degree, this means the Walnut Creek address is about 17.25 miles to the east. (Strictly speaking, a degree of longitude is shorter than a degree of latitude away from the equator, roughly 55 miles at this latitude, so the true east–west distance is closer to 14 miles, but the approximation serves our purpose here.)

Thus we can tell the orientation and distance of each address from the others solely by the two numbers: latitude and longitude.
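The arithmetic above can be sketched in a few lines of Python (using the coordinates from the list above, and the same simplification of 69 miles per degree in both directions):

```python
# Coordinates from the address list above
oakland = (37.79, -122.27)       # 100 Broadway, Oakland
walnut_creek = (37.93, -122.02)  # 1180 Oak Grove Road, Walnut Creek

MILES_PER_DEGREE = 69  # approximate miles per degree of latitude

# North-south separation: difference in latitude times miles per degree
north_south = (walnut_creek[0] - oakland[0]) * MILES_PER_DEGREE

# East-west separation: same rough conversion (a simplification, since a
# degree of longitude is shorter than a degree of latitude this far north)
east_west = (walnut_creek[1] - oakland[1]) * MILES_PER_DEGREE

print(round(north_south, 2))  # about 9.66 miles north
print(round(east_west, 2))    # about 17.25 miles east
```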

Right now we’re working with a 2D map. We can add the altitude above sea level for each address and end up with a 3D map.

Map of addresses as a vector space

We can think of the map of addresses as a vector space. A flat 2D map is a vector space in two dimensions. A 3D map is a vector space in three dimensions.

Between any two points, there is a component distance in each dimension.

We can use various distance functions (e.g., Euclidean or Manhattan) to aggregate the per-dimension components into a single distance in multi-dimensional space.

Now we can just use vector math functions to calculate distances between addresses in any number of dimensions.
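As a minimal sketch, the two distance functions mentioned above look like this in Python (coordinates taken from the address list earlier):

```python
import math

def euclidean(p, q):
    # Straight-line distance: square root of the sum of squared components
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Grid distance: sum of the absolute per-dimension components
    return sum(abs(a - b) for a, b in zip(p, q))

san_francisco = (37.79, -122.42)
walnut_creek = (37.93, -122.02)

d_euclid = euclidean(san_francisco, walnut_creek)     # about 0.42 degrees
d_manhattan = manhattan(san_francisco, walnut_creek)  # about 0.54 degrees
```

Note that both functions work unchanged for 3D points (add altitude) or for vectors with many more dimensions, which is exactly why the vector-space view is so convenient.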

Map of words

Photo by Leah Newhouse on Pexels.com

A map of addresses is much more useful than a list of addresses because the longitude and latitude numbers allow us to figure out the relationships between the addresses.

The same concept can be applied to words. We can lay out words on a map such that words that are related to each other are closer to each other.

Let’s start with a list of words:

  1. Apple
  2. Tree
  3. Elephant
  4. Trunk

Now let’s say we scanned lots of sentences, and whenever we found two words in the same sentence we moved them closer together on the map of words.

We would find that “Apple” and “Tree” appear in the same sentence quite frequently, so we would put “Apple” and “Tree” close together on our map of words.

On the other hand, there are probably very few sentences that contain both “Apple” and “Elephant”, so we would put “Apple” and “Elephant” very far from each other on our map of words.

“Trunk” is an interesting one. There are probably lots of sentences that include “tree” and “trunk”, but there are also lots of sentences that include “elephant” and “trunk”. Hence the word “trunk” sits close to both “tree” and “elephant”.
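A minimal sketch of this co-occurrence counting, using a tiny toy corpus invented for illustration:

```python
from collections import Counter
from itertools import combinations

# Toy corpus (sentences invented for illustration)
sentences = [
    "the apple fell from the tree",
    "an apple tree grows slowly",
    "the elephant raised its trunk",
    "the tree trunk was thick",
]

# Count how often each pair of words appears in the same sentence
cooccur = Counter()
for sentence in sentences:
    words = set(sentence.split())
    for a, b in combinations(sorted(words), 2):
        cooccur[(a, b)] += 1

print(cooccur[("apple", "tree")])      # co-occur in 2 sentences: close on the map
print(cooccur[("apple", "elephant")])  # never co-occur: far apart on the map
print(cooccur[("tree", "trunk")])      # "trunk" co-occurs with "tree"...
print(cooccur[("elephant", "trunk")])  # ...and with "elephant": close to both
```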

Note that I’ve laid these out in 2D space to make them easier to see. Technically, right now this is a one-dimensional vector space, so it should really be a line.

If you are familiar with the concept of vector spaces and embeddings then this is exactly that.

In the next part in this series we will explain how distance is calculated between words on this map of words.

Once we calculate the distances we can plot these on the map of words.

Notice that now we can use the same math we were using in a map of addresses to find distances and nearby words.

Also notice that the units of distance do not matter as long as all distances are in the same unit, since our main goal is to find the nearest words.

In the example above we used a 1D model since there is only one dimension: how often the two words appear in the same sentence.

We can add additional dimensions to this map e.g.,

  1. How likely are the two words antonyms?
  2. Inverse of how often the two words never appear in a sentence
  3. How often one word appears BEFORE another word in a sentence
  4. And so on

Now you get distances (component vectors) in each dimension, just as on a map of addresses you have distances in the north–south and east–west directions.

Now you can just do simple vector math to calculate a total distance between two words that aggregates the component vectors in each dimension.
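The same Euclidean aggregation from the map of addresses can be sketched for multi-dimensional word vectors (the feature values below are invented purely for illustration):

```python
import math

# Hypothetical feature vectors for words, one value per dimension
# (all values invented for illustration)
word_vectors = {
    "apple":    [0.9, 0.1, 0.0],
    "tree":     [0.8, 0.2, 0.1],
    "elephant": [0.1, 0.9, 0.7],
}

def distance(u, v):
    # Aggregate the per-dimension components into one Euclidean distance
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d_apple_tree = distance(word_vectors["apple"], word_vectors["tree"])
d_apple_elephant = distance(word_vectors["apple"], word_vectors["elephant"])

# With these values, "apple" ends up much closer to "tree" than to "elephant"
```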

How a map of words enables an LLM to complete a sentence

Now that we’ve laid out all our words on a map of words, let’s start the task of completing a sentence.

Let’s say the sentence we’ve been given is “I want to grow an apple ___”.

We look in the map of words for what words are closest to “apple” and we find the word “tree”. So now we can use that word to fill in the blank.

“I want to grow an apple tree”.

So a simplified algorithm would take the last word before the blank (“apple”) and find all words in the map of words that have a link to it. It would sort this list in ascending order of distance and then use the top choice.

In the above map, “tree” is at distance 25 which is the shortest distance between “apple” and any other word on the map.
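That simplified algorithm fits in a few lines. The distance of 25 for “tree” matches the map above; the other distances are invented for illustration:

```python
# Distances from "apple" to other words on the map
# ("tree" at 25 is from the map above; the rest are invented for illustration)
distances_from_apple = {"tree": 25, "trunk": 60, "elephant": 90}

def complete(word_distances):
    # Sort candidate words by ascending distance and take the nearest one
    ranked = sorted(word_distances, key=word_distances.get)
    return ranked[0]

next_word = complete(distances_from_apple)
print("I want to grow an apple " + next_word)  # picks "tree"
```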

From Words to N-Grams

This is, of course, a very simplified vector space.

As we’ll discuss in future parts in this series, LLMs actually use tokens instead of words, use n-grams instead of tokens and use the concept of attention heads to focus on certain segments of the text.

Concept of distance

Photo by Markus Spiske on Pexels.com

As we learned above, the goal of a map is to place things that belong together close to each other and things that don’t far apart.

Closer and farther imply a difference in distance.

For a map of addresses, the distance is the physical distance between two points on earth.

What is the distance between two words? Read the next part to learn.

