Build your own GPT

We are going to write a mini GPT from scratch in JavaScript: one file including training and inference, no dependencies, following the GPT-2 architecture, based on Andrej Karpathy's microgpt.

Coming soon:

  1. Backpropagation
  2. Softmax & Log loss
  3. Embeddings
  4. Hidden layers
  5. Attention
  6. and more

Tokens

We are building a text model. Instead of seeing text as a sequence of characters or words, text models see a sequence of tokens. Tokens are the atomic parts that the model reads and writes. They could be characters, words, or something in between.

To keep it simple, we’ll start with six tokens: ["a", "i", "l", "e", "n", "."].

A model like GPT-4 uses around 100k different tokens. You can try OpenAI’s tokenizer to get a better sense of how each model tokenizes text.

We are building a generative model. Given a sequence of tokens, it produces one new token to continue the sequence. We then repeat that in a loop until the model produces the END OF SEQUENCE token. In our case, that’s ".".

With six equally likely tokens, this is the same as rolling a die over and over and mapping each face to a character:

[Interactive widget: click to roll a die whose six faces map to a, i, l, e, n, .]
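With no model yet, the whole generator is just that die roll in a loop, a sketch assuming "." is our end-of-sequence token:

```javascript
const vocab = ["a", "i", "l", "e", "n", "."];

// Pick uniformly random tokens until we hit the end token ".".
function generate() {
  let out = "";
  while (true) {
    const token = vocab[Math.floor(Math.random() * vocab.length)];
    if (token === ".") return out;
    out += token;
  }
}
```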

The rest of this post is about making that next-token prediction less naive.


Logits

Instead of choosing a token directly, the model can be more flexible if it assigns a score to each token. We call that list of scores the logits.

Once we have the logits, we want to convert them into probabilities. For now we normalize them by dividing each one by the total sum (this works because our scores are all positive; later we’ll switch to softmax).

After that transformation, every value in probs is between 0 and 1, and together they add up to 1.
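As a sketch, that normalization is one small function:

```javascript
// Turn raw scores (logits) into probabilities by dividing each one
// by the total sum. Assumes all logits are positive.
function normalize(logits) {
  const total = logits.reduce((sum, x) => sum + x, 0);
  return logits.map(x => x / total);
}
```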

Now we need to pick one of the tokens based on the probability distribution from probs.

We do it in the pickIndex() function. For example:

probs = [0.3, 0.1, 0.1, 0.1, 0.2, 0.2]
// we calculate the cumulative sums:
sums = [0.3, 0.4, 0.5, 0.6, 0.8, 1]
// we get a random target between 0 and 1:
target = 0.53
// index of first sum greater than target:
tokenIndex = 3
// use the token with that index:
token = "e"
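A minimal pickIndex() along those lines (the optional target argument is only there to make the trace above reproducible; by default it’s random):

```javascript
// Sample an index from a probability distribution: walk the cumulative
// sums and return the first index whose sum exceeds the random target.
function pickIndex(probs, target = Math.random()) {
  let sum = 0;
  for (let i = 0; i < probs.length; i++) {
    sum += probs[i];
    if (sum > target) return i;
  }
  return probs.length - 1; // guard against floating-point rounding
}
```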

Generating feels like spinning a wheel of fortune, every token gets a slice, and bigger probabilities get bigger slices:

[Interactive widget: click to spin a wheel over a, i, l, e, n, . with slice sizes proportional to the probabilities]


Weights

If we move the logits generation to its own gpt() function, the interface now matches the shape of a real GPT. That lets the rest of the loop stay the same as the model gets smarter.

  • We moved the logits outside of the function so they are stable and we can train them.
  • We call them weights: the values we can tweak to make the model better.
  • GPT-3 has 175 billion weights; so far we have 6.

The model is still tiny and still ignores context, but now it has stable behavior. That is the minimum we need before we can improve it.
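In code, a sketch of that stage (the starting weight values are the ones used later in the post):

```javascript
// The weights live outside gpt(), so they persist between calls
// and training can tweak them.
let weights = [0.62, 0.45, 0.85, 0.67, 0.17, 0.53];

// For now the model ignores its context: it returns the same logits
// no matter which tokens came before.
function gpt(context) {
  return weights;
}
```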

Before we can improve the model, we need a way to measure how well it is doing.

Let’s say we want a model that generates names using the five letters in vocab, ending with the "." token.

To evaluate the model, we test a real name one position at a time, and check how much probability the model gives to each token. That tells us how likely the model thinks the name is.

For example, for the name “ann”:

              a       i      l      e       n      .
  gpt("")    [0.18]  0.14   0.26   0.20    0.05   0.16    target: "a"
  gpt("a")    0.18   0.14   0.26   0.20   [0.05]  0.16    target: "n"
  gpt("an")   0.18   0.14   0.26   0.20   [0.05]  0.16    target: "n"
  gpt("ann")  0.18   0.14   0.26   0.20    0.05  [0.16]   target: "."

The bracketed value in each row is the probability the model gives to the actual next token.
High probabilities mean the model expects that name, so it is doing well. Low probabilities mean the name feels unexpected, so the model is doing poorly.

Loss

We want to capture that unexpectedness into one value, so we can later work on minimizing it.

We call that value the loss: a single score that tells us how badly the model predicted the name.

At each position, we turn the target token probability into a penalty using 1 - p, then we average those penalties across the whole name.

              a       i      l      e       n      .      penalty
  gpt("")    [0.18]  0.14   0.26   0.20    0.05   0.16    1 - 0.18 = 0.82
  gpt("a")    0.18   0.14   0.26   0.20   [0.05]  0.16    1 - 0.05 = 0.95
  gpt("an")   0.18   0.14   0.26   0.20   [0.05]  0.16    1 - 0.05 = 0.95
  gpt("ann")  0.18   0.14   0.26   0.20    0.05  [0.16]   1 - 0.16 = 0.84

"ann" loss = (0.82 + 0.95 + 0.95 + 0.84) / 4 = 0.89
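Putting the pieces together, a calcLoss() sketch (the gpt() and normalize() from earlier are repeated here so the snippet stands on its own):

```javascript
const vocab = ["a", "i", "l", "e", "n", "."];
let weights = [0.62, 0.45, 0.85, 0.67, 0.17, 0.53];

const gpt = (context) => weights; // still ignores context
const normalize = (logits) => {
  const total = logits.reduce((s, x) => s + x, 0);
  return logits.map(x => x / total);
};

// Average the per-position penalties 1 - p over the name plus its end token.
function calcLoss(name) {
  const targets = [...name, "."]; // predict each letter, then the end token
  let total = 0;
  for (let i = 0; i < targets.length; i++) {
    const probs = normalize(gpt(name.slice(0, i)));
    total += 1 - probs[vocab.indexOf(targets[i])];
  }
  return total / targets.length;
}
```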

Gradients

To improve the model we need to minimize the loss.

Let’s try modifying one weight at a time by adding a little epsilon, then recalculating the loss to see whether it got better or worse.

let weights = [0.62, 0.45, 0.85, 0.67, 0.17, 0.53]
let oldLoss = calcLoss("ann") // 0.8868
weights[0] += 0.01 // nudge the first weight by epsilon
let newLoss = calcLoss("ann") // 0.8864
let gradient = (newLoss - oldLoss) / 0.01 // -0.04

If a tiny change in a weight causes a big change in the loss, that weight has a strong effect on the model. The quantity that measures that sensitivity is called the gradient.

In practice, we estimate it by taking the change in loss and dividing it by the tiny change we made to the weight. Repeating that for every weight tells us which directions reduce the loss, even though it is still a slow and approximate method.

Training

Once we have a gradient for every weight, we nudge each weight a small step against its gradient (here with a learning rate of 0.1). That gives the "new" row in the table below, with a slightly lower loss.

          a      i      l      e      n      .      loss     gradient
  base   0.62   0.45   0.85   0.67   0.17   0.53    0.8868
  a     [0.63]  0.45   0.85   0.67   0.17   0.53    0.8864   -0.04
  i      0.62  [0.46]  0.85   0.67   0.17   0.53    0.8871   +0.03
  l      0.62   0.45  [0.86]  0.67   0.17   0.53    0.8871   +0.03
  e      0.62   0.45   0.85  [0.68]  0.17   0.53    0.8871   +0.03
  n      0.62   0.45   0.85   0.67  [0.18]  0.53    0.8856   -0.12
  .      0.62   0.45   0.85   0.67   0.17  [0.54]   0.8864   -0.04
  new    0.624  0.447  0.847  0.667  0.182  0.534   0.8847

Each middle row bumps one weight (bracketed) by 0.01 and records the resulting loss and gradient.

step 1 / 75
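One full finite-difference training step could look like this sketch (the learning rate 0.1 and epsilon 0.01 are read off the table above; gpt(), normalize(), and calcLoss() are repeated for self-containment):

```javascript
const vocab = ["a", "i", "l", "e", "n", "."];
let weights = [0.62, 0.45, 0.85, 0.67, 0.17, 0.53];

const gpt = (context) => weights;
const normalize = (logits) => {
  const total = logits.reduce((s, x) => s + x, 0);
  return logits.map(x => x / total);
};

function calcLoss(name) {
  const targets = [...name, "."];
  let total = 0;
  for (let i = 0; i < targets.length; i++) {
    const probs = normalize(gpt(name.slice(0, i)));
    total += 1 - probs[vocab.indexOf(targets[i])];
  }
  return total / targets.length;
}

// Estimate each weight's gradient with a finite difference,
// then step every weight against its gradient.
function trainStep(name, eps = 0.01, lr = 0.1) {
  const base = calcLoss(name);
  const grads = weights.map((w, i) => {
    weights[i] = w + eps;
    const g = (calcLoss(name) - base) / eps;
    weights[i] = w; // restore the weight after probing it
    return g;
  });
  weights = weights.map((w, i) => w - lr * grads[i]);
}
```

Note that every step needs one extra forward pass per weight, which is exactly why this approach does not scale.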

The escape hatch is to stop thinking in terms of one giant formula. Instead, we break the forward pass into tiny operations like add, divide, and subtract, and let each one report its own local derivative.

This step introduces the Value class to represent numbers together with the graph edges and local gradients needed for that bookkeeping. Training still uses the handwritten gradients from the previous step, but the forward pass now carries the information that backpropagation will need next.
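A minimal sketch of what such a Value class could look like (the field and method names here are assumptions, in the spirit of micrograd; the post hasn’t fixed them yet):

```javascript
// A number plus the bookkeeping backpropagation will need:
// which Values it was computed from, and the local derivative
// of the result with respect to each of them.
class Value {
  constructor(data, children = [], localGrads = []) {
    this.data = data;             // the actual number
    this.children = children;     // the Values this one was computed from
    this.localGrads = localGrads; // d(this)/d(child) for each child
    this.grad = 0;                // filled in later by backpropagation
  }
  add(other) {
    return new Value(this.data + other.data, [this, other], [1, 1]);
  }
  mul(other) {
    return new Value(this.data * other.data, [this, other],
                     [other.data, this.data]);
  }
}
```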

Coming soon:

  1. Backpropagation
  2. Softmax & Log loss
  3. Embeddings
  4. Hidden layers
  5. Attention
  6. and more