How ChatGPT Works (Part 4) – Finding nearby words using distance

In Part 3 – Creating a map of words, we learned that ChatGPT builds a map of words.

The goal of a map is to put items close to one another if the distance between them is small and far from one another if the distance is large.

What does a distance between words mean? And how does that help us find nearby words?

Distance in a GPS Map

In Part 3 – Creating a map of words we discussed how a map of words is similar to a map of addresses (i.e., a GPS Map).

In a GPS map, we represent each location by its GPS coordinates, i.e., latitude and longitude.

To calculate distance between two sets of GPS coordinates, we can just use a simple mathematical formula (such as the Flat Earth Approximation Formula or the more accurate Haversine Formula).
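As a rough illustration, here is a minimal Python sketch of the Haversine Formula. The function name `haversine_km` and the Earth radius constant are assumptions chosen for this example, and the result is only as accurate as the assumption that the Earth is a perfect sphere.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius in kilometres


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1                   # difference in latitude (radians)
    dlambda = math.radians(lon2 - lon1)  # difference in longitude (radians)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Example: approximate coordinates for Paris and London
paris_to_london = haversine_km(48.8566, 2.3522, 51.5074, -0.1278)
```

Feeding in the coordinates of our current address and of every other address gives one number per address, and sorting those numbers tells us which addresses are nearest.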

This distance then tells us which addresses are near our current address and which addresses are far.

What is a distance between words

We all understand the concept of distance in a GPS Map. Distance tells us which addresses are close to each other and which ones are far.

Distance in a map of words is similarly used to figure out which words are close to each other and which words are far from each other.

But what does distance mean in a map of words? In Part 3 – Creating a map of words we explained that a map of words is a map where we place words in relationship to each other.

In Part 1, we covered how the main function of ChatGPT and LLMs is to find a reasonable word to complete the sentence.

The way that ChatGPT and LLMs do this is by finding the word nearest to the previous words in the sentence.

So if we are able to calculate distance between words we can find the nearest word.

Calculating distance between words

So how can we calculate the distance between words?

We know that in English there are about 40,000 words that comprise almost all conversations (other than very subject specific conversations).

So all we need to do is figure out the distance between each of these words. Then we can plot them all on a map of words and find nearest words.

The way ChatGPT and LLMs do this is by reading 250 billion sentences from the public internet and digitized books.

Let’s say the first sentence we ever saw was:

“I have an apple tree”

We would put “apple” and “tree” next to each other since we found them in the same sentence.

The next sentence we saw was:

“I have an orange tree”.

So now we would put “orange” next to “tree” but NOT next to “apple” since we have not seen both “apple” and “orange” in a sentence together.

Now we read the sentence:

“Elephants have a trunk”

We have not seen “elephant” or “trunk” in any sentences with “apple”, “tree” or “orange” so we put them far away.

Now we read the sentence:

“This tree has a large trunk”

We already have both “tree” and “trunk” in our map but now that we found a sentence that includes both of them we have to move them closer.
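The bookkeeping so far can be sketched in a few lines of Python. This is only an illustration of the simplified picture in this article: the five-word `VOCAB` list, the plural handling, and the function names are assumptions made for the example.

```python
from itertools import combinations

# Hypothetical mini-vocabulary: only the words this walkthrough tracks.
VOCAB = ["apple", "orange", "tree", "elephant", "trunk"]


def tracked_words(sentence):
    """Return the tracked words in a sentence, treating a trailing 's' as a plural."""
    found = set()
    for token in sentence.lower().split():
        token = token.strip(".,!?\"'")
        for word in VOCAB:
            if token == word or token == word + "s":
                found.add(word)
    return found


def pairs_seen_together(sentences):
    """Collect every pair of tracked words that shares at least one sentence."""
    pairs = set()
    for sentence in sentences:
        for a, b in combinations(sorted(tracked_words(sentence)), 2):
            pairs.add((a, b))
    return pairs


sentences = [
    "I have an apple tree",
    "I have an orange tree",
    "Elephants have a trunk",
    "This tree has a large trunk",
]
neighbours = pairs_seen_together(sentences)
```

After these four sentences, `neighbours` holds "apple"/"tree", "orange"/"tree", "elephant"/"trunk" and "tree"/"trunk", but not "apple"/"orange", which matches the map described above.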

Introducing a numerical distance metric

Let’s now introduce a number for the distance between two words, based on how many times the two words appear in the same sentence. The more times the two words appear together, the shorter the distance between them.

Right now, every pair of words we have seen together has a distance of 100, since each pair has appeared together in only one sentence. Every additional time we see two words appear in the same sentence, we will reduce the distance between them by half.

Let’s now say we read these sentences next:

  1. “Apples don’t fall far from the tree”
  2. “Apple trees are great”
  3. “Trees have trunks of varying lengths”

So now we’ve seen “apple” and “tree” together in three sentences, so we reduce the distance between “apple” and “tree” from 100 by half, and then by half again. That distance is now 25.

In addition, we have seen “tree” and “trunk” together in two sentences. So the distance between “tree” and “trunk” is halved once, to 50.
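The halving rule can be checked with a short Python sketch. It is a toy version of the simplified metric in this article, not how a real LLM is trained; the `VOCAB` list, the plural handling, and the function names are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-vocabulary: only the words this walkthrough tracks.
VOCAB = ["apple", "orange", "tree", "elephant", "trunk"]
START_DISTANCE = 100  # distance after the first shared sentence


def tracked_words(sentence):
    """Return the tracked words in a sentence, treating a trailing 's' as a plural."""
    found = set()
    for token in sentence.lower().split():
        token = token.strip(".,!?\"'")
        for word in VOCAB:
            if token == word or token == word + "s":
                found.add(word)
    return found


def word_distances(sentences):
    """Start each co-occurring pair at 100 and halve it for every further shared sentence."""
    counts = Counter()
    for sentence in sentences:
        for pair in combinations(sorted(tracked_words(sentence)), 2):
            counts[pair] += 1
    return {pair: START_DISTANCE / 2 ** (n - 1) for pair, n in counts.items()}


sentences = [
    "I have an apple tree",
    "I have an orange tree",
    "Elephants have a trunk",
    "This tree has a large trunk",
    "Apples don't fall far from the tree",
    "Apple trees are great",
    "Trees have trunks of varying lengths",
]
distances = word_distances(sentences)
# distances[("apple", "tree")] is 25.0 and distances[("tree", "trunk")] is 50.0
```

Pairs that never share a sentence, such as "apple" and "orange", do not appear in `distances` at all, which in this toy model means they stay far apart.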

Now if we repeat this process with billions of sentences you can see how we would end up with a more and more accurate map. A map where words that appear in the same sentence are closer than words that do not appear in the same sentence.
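Once the map holds distances, finding nearby words is just a matter of sorting. The sketch below uses the four distances from the worked example in this article; the function name `nearest_words` is an assumption made for the illustration.

```python
# Distances taken from the worked example in this article.
distances = {
    ("apple", "tree"): 25,
    ("tree", "trunk"): 50,
    ("orange", "tree"): 100,
    ("elephant", "trunk"): 100,
}


def nearest_words(word, distances):
    """Return (neighbour, distance) pairs for `word`, closest first."""
    neighbours = []
    for (a, b), d in distances.items():
        if word == a:
            neighbours.append((b, d))
        elif word == b:
            neighbours.append((a, d))
    return sorted(neighbours, key=lambda item: item[1])


closest_to_tree = nearest_words("tree", distances)
# closest_to_tree is [("apple", 25), ("trunk", 50), ("orange", 100)]
```

With billions of sentences the table would be far larger, but the lookup idea stays the same: the word with the smallest distance is the nearest one.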

More metrics

In this article we discussed one metric: a measure of how often two words appear in the same sentence across the 250 billion sentences on the internet and in digitized books.

This is just one of the many metrics used by LLMs.

LLMs use many other metrics like:

  1. How many words are in between the two words we’re analyzing?
  2. What is the role of each word in the sentence, e.g., is it the subject or the object, a noun or a verb?
  3. And so on.

In a future part we will explore some of these in more detail.

