Part 1: All ChatGPT does is complete the sentence

Explanation for Data Scientists


Main Idea

The fundamental idea of LLMs (Large Language Models) is to complete the sentence: given an unfinished sentence, they choose the next word to add to it. This simple task, it turns out, can enable all kinds of complex language tasks, as I will show in the next chapter.

MadLibs

Many of us played MadLibs when we were kids.

MadLibs provides sentences with missing words, and we write in words of our own to complete each sentence.

How does ChatGPT do this?

ChatGPT scans the text available on the internet. A current estimate is that there are 140 billion sentences on the public internet; including digitized books adds another 110 billion.

Given a phrase like “There are many ___ ways to choose…”, it can find all the sentences that match this pattern and pick out the words used in the blank space.
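
To make this concrete, here is a minimal sketch of the idea in Python. The tiny corpus, the prefix, and the matching rule are all made up for illustration; real systems work with vastly more data and far more sophisticated matching.

```python
from collections import Counter

# A tiny, hypothetical corpus standing in for the internet's sentences.
corpus = [
    "There are many good ways to choose a career.",
    "There are many good ways to choose a gift.",
    "There are many other ways to choose wisely.",
]

prefix = "There are many"

# Count which word follows the prefix in each matching sentence.
counts = Counter()
for sentence in corpus:
    if sentence.startswith(prefix):
        next_word = sentence[len(prefix):].split()[0]
        counts[next_word] += 1

# Normalize the counts into simple probabilities.
total = sum(counts.values())
for word, count in counts.items():
    print(word, count / total)  # good 0.67, other 0.33
```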

ChatGPT can calculate a probability score from this. There are various algorithms to calculate this.

A common one in NLP is TF-IDF (Term Frequency – Inverse Document Frequency). It measures how often a term (e.g., a word) appears in the given document versus how often it appears in other documents.

While there are many other algorithms, for simplicity let’s assume that ChatGPT uses this one. (When we get into the details of ChatGPT, we will go deeper into how it ACTUALLY does this.)

So for each choice we can calculate a TF-IDF score.

TF-IDF(t,d) = TF(t,d) x IDF(t)

TF(t,d) = Number of times term t appears in document d / Total number of terms in document d

IDF(t) = log( Total number of documents D / Number of documents with term t in it )
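
Here is a minimal sketch of these formulas in Python. The three-document corpus is invented for illustration, and note that real implementations usually smooth the IDF denominator to avoid dividing by zero for unseen terms.

```python
import math

def tf(term, doc):
    # Term frequency: how often the term appears in this document,
    # relative to the document's length.
    words = doc.lower().split()
    return words.count(term) / len(words)

def idf(term, docs):
    # Inverse document frequency: terms found in fewer documents score
    # higher. (Real implementations add smoothing to avoid division by zero.)
    containing = sum(1 for d in docs if term in d.lower().split())
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [
    "there are many good ways to choose",
    "good ideas are rare",
    "there are other ways",
]
print(tf_idf("good", docs[0], docs))  # ~0.058
```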

Now let’s say we end up with the choices:

  1. “Good” (TF-IDF= 0.1)
  2. “Other” (TF-IDF= 0.05)

Now let’s say we also have a list of antonyms, and we include those in the choices with a score of 1/10th of the original word’s TF-IDF. (Again, the actual calculation in ChatGPT is different; I’m simplifying.)

  1. “Good” (TF-IDF= 0.1)
  2. “Other” (TF-IDF= 0.05)
  3. “Bad” (TF-IDF=0.1, Antonym weight = 0.1, Total = 0.01)

Now it is a simple matter of choosing one of the choices.

How to be creative

The simple path would be to choose the one with the highest score. So we would always choose “Good” to complete this sentence. While correct, this gets boring, since all our output will look similar.

What if we introduce a bit of noise into the choice? Let’s say we generate a random number between 1 and 3. If it is 1, we choose “Good”; if it is 2, we choose “Other”; and if it is 3, we choose “Bad”.
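
In code, this uniform pick might look like the sketch below (the word list is just our toy example from above):

```python
import random

# Each word is equally likely, regardless of its TF-IDF score.
choices = ["Good", "Other", "Bad"]
print(random.choice(choices))
```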

The problem now is that we will end up with a lot of less-than-ideal answers, since every choice is equally likely.

We can fix that by using a weighted random choice. We use each choice’s score as its sampling weight (again, not the actual calculation; just simplifying):

  1. “Good” = 0.1/(0.1 + 0.05 + 0.01) = 0.625
  2. “Other” = 0.05/(0.1 + 0.05 + 0.01) = 0.3125
  3. “Bad” = 0.01/(0.1 + 0.05 + 0.01) = 0.0625

This way we will pick higher-scored items much more frequently than lower-scored ones.
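
Python’s standard library can do this for us: random.choices accepts per-item weights and normalizes them internally, so we can pass our toy scores in directly.

```python
import random

# Candidate words and their (simplified) scores from above.
candidates = {"Good": 0.1, "Other": 0.05, "Bad": 0.01}

words = list(candidates)
weights = list(candidates.values())  # normalized internally by random.choices

# Sample one completion: "Good" comes up ~62.5% of the time,
# "Other" ~31.25%, and "Bad" ~6.25%.
print(random.choices(words, weights=weights, k=1)[0])
```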

We want some control over this, so let’s add a configurable parameter called “temperature” that controls how often we pick a higher-scored item versus a lower-scored one.
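
Real LLMs apply temperature by dividing the model’s raw scores (logits) by it before converting them into probabilities. Here is an analogous sketch over our toy scores, where raising each score to the power 1/temperature has the same sharpening or flattening effect; the function name and values are illustrative only.

```python
import random

def sample_with_temperature(scores, temperature=1.0):
    # Temperature < 1 sharpens the distribution (favors the top score);
    # temperature > 1 flattens it (gives lower scores more of a chance).
    words = list(scores)
    weights = [s ** (1 / temperature) for s in scores.values()]
    return random.choices(words, weights=weights, k=1)[0]

scores = {"Good": 0.1, "Other": 0.05, "Bad": 0.01}
print(sample_with_temperature(scores, temperature=0.5))  # almost always "Good"
print(sample_with_temperature(scores, temperature=2.0))  # more variety
```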

And that’s it. Now we can complete any sentence with the best match most of the time, while once in a while choosing a lesser match to keep the output interesting.

Summary

The main function of ChatGPT (or LLMs) is to predict the next term given a set of terms.

ChatGPT scans all the sentences in the text available to it. It uses these to find choices for the next term and assign a score to each choice. It then samples from the choices with some noise, so that most of the time it picks a higher-scored choice, but once in a while it picks a lower-scored one.

This article explained the function in a highly simplified and inaccurate manner. In later chapters I will cover in more detail how this process works.

