In Part 3 – Creating a Map of Words and Part 4 – Finding Nearby Words By Distance, we used an example of how a GPS map allows us to find what’s nearby and how to get there.
Similarly, an LLM uses a Map of Words to find which words are near each other and uses those nearby words to complete a sentence.
We use the concept of embeddings to do this in an LLM. In this post, we continue the GPS map example to explain what an embedding is and how this concept is used in LLMs to find nearby words.
Note that to keep it simple, we will use “words” in this article. Technically, LLMs work on “tokens”: a word may map to one or more tokens, and variations of a word such as “eat”, “eats”, and “ate” may each be represented by different tokens.
The Coordinate System in a GPS Map
Let’s start with how a coordinate system is created for a GPS Map: the map we use every day.
Below is a map of three addresses:
- 1523 Pine Street, San Francisco, CA 94109
- 100 Broadway, Oakland, CA 94607
- 1180 Oak Grove Road, Walnut Creek, CA 94598
In this map, if we want to use some numbers to represent the location of any address, we can use latitude and longitude.
Middle school science flashback: Latitude is a measurement of a location north or south of the Equator. Longitude is a measurement of a location east or west of the prime meridian at Greenwich. Together, latitude and longitude can describe the exact location of any place on Earth.
We can find the latitude and longitude of our three addresses using Google Maps:
- 1523 Pine Street, San Francisco, CA 94109 (latitude=37.79, longitude=-122.42)
- 100 Broadway, Oakland, CA 94607 (latitude=37.79, longitude=-122.27)
- 1180 Oak Grove Road, Walnut Creek, CA 94598 (latitude=37.93, longitude=-122.02)
How the Coordinate System Helps Us
Now, given just the latitude and longitude for ANY two addresses, we can easily tell whether the two addresses are close to each other or not.
The address in San Francisco and the address in Oakland have the same latitude (37.79), so we know neither is north or south of the other. The address in Walnut Creek, however, has a different latitude (37.93), so we know Walnut Creek is north of the other two addresses.
When we look at the longitude, we see that the address in San Francisco is the farthest west (-122.42) while the address in Walnut Creek is the farthest east (-122.02). The address in Oakland is in between (-122.27).
Thus we can tell the direction and distance of each address from the others solely from two numbers: latitude and longitude.
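For readers who like to see the arithmetic, here is a minimal Python sketch using the coordinates above. The haversine formula is one standard way to turn latitude/longitude differences into a distance in miles; the function name and the dictionary are just for this example.

```python
import math

# Latitude/longitude for the three addresses above
addresses = {
    "San Francisco": (37.79, -122.42),
    "Oakland":       (37.79, -122.27),
    "Walnut Creek":  (37.93, -122.02),
}

def haversine_miles(a, b):
    """Approximate surface distance between two (lat, lon) points, in miles."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 3959 * 2 * math.asin(math.sqrt(h))  # 3959 is roughly Earth's radius in miles

print(haversine_miles(addresses["San Francisco"], addresses["Oakland"]))       # roughly 8 miles
print(haversine_miles(addresses["San Francisco"], addresses["Walnut Creek"]))  # roughly 24 miles
```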
Embedding in a GPS Map
An embedding is a way to represent real things in a coordinate system that captures their underlying relationships and patterns. Embeddings are often used to represent complex data types, such as images, text, or audio, in a way that machine learning algorithms can easily process.
As we discussed above, in a map of addresses, an address can be represented as two numbers: latitude and longitude.
So when we convert an address to its latitude and longitude values, we have actually created an embedding. For the address “1523 Pine Street, San Francisco, CA 94109”, the embedding is (lat=37.79, lon=-122.42).
We have taken something real (an address) and converted it to numbers (latitude and longitude) in a coordinate system so we can calculate the distance and direction between any two addresses.
The Actual Coordinates Don’t Matter
It’s important to understand that what matters is the distance and direction between two addresses, not the actual coordinates.
We could decide to measure longitude from San Francisco instead of from Greenwich. We could decide to measure latitude from the North Pole instead of from the Equator. It does not matter.
We could change the units of our measurement. We could measure latitude in inches. It does not matter.
All that matters is that we can compare coordinates of two addresses to calculate distance and direction from one to the other.
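A quick sketch makes this concrete. The shifted coordinates below are made up purely for illustration; the point is that moving the origin changes every coordinate by the same amount, so the gap between two addresses stays the same.

```python
import numpy as np

# (lat, lon) for two of the addresses above
sf  = np.array([37.79, -122.42])
oak = np.array([37.79, -122.27])

# Shift the origin: measure everything relative to San Francisco instead of
# the Equator / Greenwich. Both points move by the same amount.
origin = sf
sf_shifted  = sf - origin   # becomes (0.00, 0.00)
oak_shifted = oak - origin  # becomes (0.00, 0.15)

# The gap between the two addresses (direction and size) is identical either way.
print(oak - sf)                  # approximately [0, 0.15]
print(oak_shifted - sf_shifted)  # approximately [0, 0.15]
```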
An Embedding Is a Vector
Above we said that the only thing that matters is distance and direction between two addresses.
Some of you may have raised your hand to say “Isn’t that the definition of a vector?”
Yes, you’re right.
A vector is a quantity that has both magnitude and direction. Hence an embedding is a vector.
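In code, the vector from one address to another is just the difference of their coordinate pairs: its length is the magnitude and the normalized difference is the direction. Here is a small sketch using the numbers above (the variable names are only for this example):

```python
import numpy as np

sf           = np.array([37.79, -122.42])
walnut_creek = np.array([37.93, -122.02])

# The vector from San Francisco to Walnut Creek
displacement = walnut_creek - sf             # roughly [0.14, 0.40]: north and east
magnitude    = np.linalg.norm(displacement)  # how far
direction    = displacement / magnitude      # which way (a unit vector)

print(displacement, magnitude, direction)
```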
Adding in Altitude to the Coordinate System
Now, if we are also interested in the altitude, we can enhance the embedding by adding a third number: altitude above sea level.
So the embedding for “1523 Pine Street, San Francisco, CA 94109” would become (lat=37.79, lon=-122.42, alt=200), where 200 is the altitude above sea level.
Adding a new number to the embedding turns the embedding from a two dimensional (2D) value to a three dimensional (3D) value.
As you can see, more and more numbers can be added to an embedding to capture more and more information. And every time we add a new number to the embedding, we increase the number of dimensions.
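The distance calculation barely changes as we add dimensions; the same formula simply loops over one more number. Below is a minimal sketch of a Euclidean distance function that works for 2D, 3D, or any number of dimensions (the altitude values are made up, and in a real system all dimensions would need comparable units):

```python
import math

def distance(a, b):
    """Euclidean distance between two embeddings with the same number of dimensions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# 2D embeddings: (lat, lon)
print(distance((37.79, -122.42), (37.93, -122.02)))

# 3D embeddings: (lat, lon, altitude) -- the altitudes here are illustrative only
print(distance((37.79, -122.42, 200), (37.93, -122.02, 150)))
```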
The Coordinate System in a Map of Words
Now that we understand the coordinate system for addresses in a GPS Map and how to create an embedding for a given address, let’s shift our discussion to doing the same for a map of words.
We discussed in Part 3 – Creating a Map of Words and Part 4 – The concept of distance that we can create a map of words where the distance between two words is a measure of how often they appear in the same sentence across the 250 billion sentences on the internet and in digitized books.
Following the process defined in Part 4 – The concept of distance, once we have finished going through the 250 billion sentences (from the internet and digitized books), we will have figured out the distances between all the words.
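As a toy illustration of that counting process (three sentences instead of 250 billion, lower-cased and split on spaces), the sketch below tallies how often each pair of words appears in the same sentence. Pairs that co-occur more often would be placed closer together on the map.

```python
from collections import Counter
from itertools import combinations

# A tiny stand-in for the 250 billion sentences
sentences = [
    "I eat an orange every morning",
    "She drinks orange juice every morning",
    "The cat sleeps all day",
]

co_occurrence = Counter()
for sentence in sentences:
    words = set(sentence.lower().split())
    for pair in combinations(sorted(words), 2):
        co_occurrence[pair] += 1

# Words that appear together often would end up close on the map
print(co_occurrence[("morning", "orange")])  # 2
print(co_occurrence[("cat", "orange")])      # 0
```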
For reference, English has about 40,000 commonly used words, so our map will have about 40,000 words.
Once we have laid out all the words and the distances between them, we can now create a coordinate system.
There are many ways to create a coordinate system but we can keep it simple and use a distance from the left of the page and from the top of the page. Let’s call the first number “x” and the second number “y”.
Embedding in a Map of Words
Now that we’ve created a coordinate system of x (distance from left of page) and y (distance from top of page), we can define the embedding of each word as a combination of these two numbers.
For example, “orange” would be (x=10, y=5) and so on.
Notice that just having the x and y values of two words allows us to figure out the distance between one word and another.
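Here is what that might look like in code. The coordinates below are hypothetical (only “orange” at (10, 5) comes from the example above); what matters is that distances can be computed from the x and y values alone.

```python
import math

# Hypothetical 2D word embeddings: x = distance from left of page, y = from top
word_map = {
    "orange": (10, 5),
    "juice":  (11, 6),
    "cat":    (40, 30),
}

def word_distance(a, b):
    return math.dist(word_map[a], word_map[b])

print(word_distance("orange", "juice"))  # small -> often in the same sentences
print(word_distance("orange", "cat"))    # large -> rarely in the same sentences
```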
What Are The Units?
A common question is “What are the units in this coordinate system?”
The units do not matter!
We covered in the section above, The Actual Coordinates Don’t Matter, that the actual units do not matter. All that matters is that you can calculate the distance and direction from one word to another.
How Do We Calculate Embedding For a Sentence?
The above discussion has been focused on words. The same process can be followed for a whole sentence.
An embedding for a whole sentence is a combination of two things:
- Embeddings for each word in the sentence
- An embedding that captures the order of the words in the sentence
By combining these two we can get an embedding that captures a sentence.
The first one is pretty clear from our discussion above. We will cover the second one in a future part of this series.
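To give a flavor of the first ingredient only, here is a deliberately naive sketch that averages the word embeddings of a sentence. This is not how ChatGPT actually combines words, and it ignores word order entirely (the second ingredient above); the coordinates for “i” and “eat” are made up, while “orange” reuses the (10, 5) from earlier.

```python
# A naive illustration only: average the word embeddings.
# This ignores word order, which real models capture separately.
word_map = {
    "i":      (12, 7),
    "eat":    (9, 4),
    "orange": (10, 5),
}

def naive_sentence_embedding(sentence):
    vectors = [word_map[word] for word in sentence.lower().split()]
    n = len(vectors)
    # Average each dimension across all the words in the sentence
    return tuple(sum(dim) / n for dim in zip(*vectors))

print(naive_sentence_embedding("I eat orange"))  # ((12+9+10)/3, (7+4+5)/3)
```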
Adding More Distance Metrics
We saw in the example with the map of addresses that adding altitude to latitude and longitude turned our embedding from 2D to 3D.
Similarly, in the map of words we can add additional distance metrics such as an antonym distance (how often the two words are used as opposites). Each of these metrics increases the number of dimensions of the embedding.
In ChatGPT, the embeddings typically contain between 1,536 and 3,072 dimensions. That’s a LOT of dimensions!
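The math at 1,536 dimensions is the same as at 2; there are just more numbers per word. The sketch below uses random vectors as stand-ins for real model embeddings (a real embedding would come from a model, not from a random number generator) and computes two common ways of comparing them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for real embeddings: two 1,536-dimensional vectors
a = rng.normal(size=1536)
b = rng.normal(size=1536)

# Exactly the same distance/similarity math as in 2D, just with more numbers
euclidean_distance = np.linalg.norm(a - b)
cosine_similarity  = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean_distance, cosine_similarity)
```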
Summary
An embedding is a way to represent real things in a coordinate system that captures their underlying relationships and patterns.
In a map of addresses, the embedding may be a set of latitude, longitude and altitude values.
In a map of words, the embedding is a set of values that captures various relationships between words, such as how often they appear in the same sentence.
ChatGPT and LLMs use embeddings to find distances between words and choose words to complete the sentence.