S I G N E T   L L M

Watch a language model learn.

This is a tiny AI that learns to write. Pick some text, hit Train, and watch it figure out language from scratch.

Training Data ?This is the text the model will learn from. The model breaks the text into individual characters and learns which characters tend to follow which. Pick a preset or paste your own text — longer text gives the model more to work with.
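A minimal sketch of that character-level setup in Python (names here are illustrative, not the app's actual code): build a vocabulary from the text, map each character to an id, and check the mapping round-trips.

```python
# Character-level tokenization: the vocabulary is just the set of
# distinct characters in the training text.
text = "to be or not to be"
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}

# Encode to integer ids, then decode back to text.
ids = [stoi[ch] for ch in text]
decoded = "".join(vocab[i] for i in ids)
```

Notice how small the vocabulary is compared to word-level models: this short sentence needs only seven symbols.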

Architecture (resets model) ?Heads: How many things the model can pay attention to at once. More heads = more patterns noticed simultaneously.

Layers: How many times the model re-processes the text. More layers = deeper understanding, but slower.

Embed: The size of each character's internal representation. Bigger = more nuance, but more to learn.

Context: How many characters the model can look back at when predicting the next one.

Training ?Learning rate: How big a step the model takes each time it learns from a mistake. Too high = unstable, too low = slow progress.
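The "step size" idea is literal: each weight moves against its gradient, scaled by the learning rate. A one-line sketch:

```python
def sgd_step(w, grad, lr):
    # One gradient-descent update: nudge the weight in the direction
    # that reduces the error, scaled by the learning rate.
    return w - lr * grad
```

With the default learning rate of 3e-4, a gradient of 2.0 moves a weight by only 0.0006, which is why thousands of steps are needed.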

Temperature: Controls randomness when the model writes. Low = safe, repetitive choices. High = creative, risky choices.
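Temperature works by dividing the model's raw scores (logits) before the softmax. A sketch of the standard formulation:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature before softmax: low temperature
    # sharpens the distribution (safe, repetitive), high temperature
    # flattens it toward uniform (creative, risky).
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

At temperature 0.4 the top choice dominates; at temperature 10 the same logits give nearly equal probabilities.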

Speed: How many training steps to run per animation frame. Higher = faster training, lower = smoother visualisations.

Learning rate: 3e-4
Temperature: 0.4
Speed: 100
Step 0 ?Each step, the model reads a chunk of training text, predicts the next character, checks how wrong it was, and adjusts its weights. Thousands of steps are needed before it learns anything useful.

Loss ?Loss measures how wrong the model's predictions are. High loss = mostly guessing. Low loss = confident and correct. Watch it drop as the model learns. Small spikes are normal — some text chunks are harder than others.

0 parameters
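The read–predict–measure–update loop can be sketched with a toy count-based model standing in for the network (the shape of the loop is the same even though the real model updates weights via gradients):

```python
import math

# A toy stand-in for the training loop: predict the next character
# from add-one-smoothed counts, measure cross-entropy loss, and
# "update" by reinforcing what was actually seen.
text = "abababab"
vocab = sorted(set(text))
counts = {c: {v: 1.0 for v in vocab} for c in vocab}

losses = []
for step in range(3):
    loss = 0.0
    for prev, nxt in zip(text, text[1:]):
        total = sum(counts[prev].values())
        p = counts[prev][nxt] / total    # model's predicted probability
        loss += -math.log(p)             # penalty for being unsure/wrong
        counts[prev][nxt] += 1.0         # reinforce the observed pair
    losses.append(loss / (len(text) - 1))
```

Just as in the panel, the loss falls across steps as the model's predictions concentrate on what actually follows.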

What the model writes ?Every 200 steps, the model tries to write new text by predicting one character at a time. Early on it's pure gibberish. As training progresses, you'll see real words and phrases emerge. Character colour shows confidence: white = sure, blue = moderate, grey = guessing.
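Writing "one character at a time" is autoregressive sampling: feed the text so far back in, draw the next character from the predicted distribution, repeat. A sketch with a lookup table standing in for the network:

```python
import random

def generate(next_probs, start, length, rng=random.Random(0)):
    # Autoregressive sampling: repeatedly draw the next character from
    # the model's distribution given the last one. next_probs is a
    # simple {char: {next_char: prob}} table standing in for the model.
    out = [start]
    for _ in range(length):
        chars, probs = zip(*next_probs[out[-1]].items())
        out.append(rng.choices(chars, weights=probs)[0])
    return "".join(out)
```

With a deterministic table (each character always followed by one other), the output is fully predictable; the real model's softer distributions are where the blue and grey "unsure" characters come from.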

Loss ?The loss curve shows the model's prediction error over time. It should trend downward as the model learns. The blue line is smoothed to show the trend; occasional spikes are normal and mean the model hit a tricky section of text.
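One common way to produce a smoothed trend line like the blue one is an exponential moving average (an assumption; the app may smooth differently):

```python
def ema(values, alpha=0.1):
    # Exponential moving average: each point is a blend of the new
    # value and the running average, so isolated spikes are damped.
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed
```

A single loss spike of 10 in a run of 1s barely moves the smoothed curve, which is why the trend stays readable.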

Next token ?This shows the model's confidence for each possible next character, right now. Tall bars mean high confidence. Early on, the bars are roughly equal (random guessing). As it learns, a few characters will dominate — the model has opinions about what comes next.
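The tallest bars are just the largest entries of that distribution. A small helper along these lines (illustrative, not the app's code) would pick them out:

```python
def top_predictions(probs, k=3):
    # Sort the next-character distribution by probability to find the
    # model's strongest guesses: what the tallest bars correspond to.
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
```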

Character embeddings ?Each character gets an internal representation (a list of numbers). This plot projects those representations to 2D so you can see which characters the model considers similar. Characters that appear in similar contexts drift together. Colours: blue = vowels, green = consonants, amber = whitespace, purple = punctuation.
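A crude sketch of the projection idea (the app likely uses a proper method such as PCA; this stand-in just keeps the two coordinates that vary most across characters):

```python
def project_2d(embeddings):
    # Reduce each character's embedding (a list of numbers) to 2D for
    # plotting by keeping the two highest-variance coordinates.
    n = len(embeddings)
    dims = len(embeddings[0])
    means = [sum(v[d] for v in embeddings) / n for d in range(dims)]
    variances = [sum((v[d] - means[d]) ** 2 for v in embeddings) / n
                 for d in range(dims)]
    top2 = sorted(range(dims), key=lambda d: variances[d], reverse=True)[:2]
    return [(v[top2[0]], v[top2[1]]) for v in embeddings]
```

Whatever the exact method, the point is the same: squash many-dimensional representations onto two axes so that characters used in similar contexts land near each other.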

Learned bigrams ?A bigram is a pair of characters that appear next to each other (like "t" followed by "h"). This heatmap shows which character pairs the model thinks are likely. Brighter cells = stronger associations. The sentences on the right describe the top patterns in plain English.
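Counting bigrams in raw text shows what the heatmap is built from; the model's version replaces counts with learned probabilities:

```python
def bigram_counts(text):
    # Count adjacent character pairs. The heatmap is essentially this
    # table, with the model's learned likelihoods instead of counts.
    counts = {}
    for a, b in zip(text, text[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts
```

In English text, pairs like ("t", "h") show up as bright cells very early in training, because they dominate the raw counts too.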

Attention ?Attention is how the model decides which earlier characters matter when predicting the next one. Each arc connects two characters — thicker arcs mean stronger connections. Different heads (colours) can learn different patterns: one might focus on the previous character, while another looks further back for context.
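The arc thicknesses correspond to attention weights. The standard scaled dot-product form, sketched for a single query position over the earlier characters:

```python
import math

def attention_weights(query, keys):
    # Scaled dot-product attention for one position: score each earlier
    # key against the query (dot product / sqrt(d)), then softmax.
    # Larger weights = thicker arcs in the panel.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Each head computes its own queries and keys, so different heads (the different arc colours) can weight the same earlier characters in entirely different ways.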