Toy model (n-grams + heuristics), but more “LLM-like”: readable subword-ish tokens (stems + -ing),
softmax-style sampling controls, and a toy attention strip. Predictions update as you type.
Analyze Text
Next-word Predictor
1) Paste training text
The walkthrough highlights the current token (blue), then +1 (green), +2 (yellow), +3 (red). It increments counts for 1–4-token phrases and stores
next-token maps for contexts of length 1, 2, 3, and 4. The loss plot is a toy online NLL under an interpolated n-gram model.
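The counting and loss described above can be sketched in a few lines. This is an illustrative reimplementation, not the demo's actual code; the interpolation weights and the `eps` floor are assumptions.

```python
from collections import defaultdict
import math

def ngram_counts(tokens, max_n=4):
    """Store next-token counts keyed by context tuples of length 1..max_n."""
    counts = {n: defaultdict(lambda: defaultdict(int)) for n in range(1, max_n + 1)}
    for i in range(1, len(tokens)):
        for n in range(1, max_n + 1):
            if i - n < 0:
                break
            ctx = tuple(tokens[i - n:i])
            counts[n][ctx][tokens[i]] += 1
    return counts

def interpolated_nll(counts, tokens, weights=(0.1, 0.2, 0.3, 0.4), eps=1e-9):
    """Average negative log-likelihood of each true next token under a
    linear interpolation of the 1..4-gram count distributions."""
    total, steps = 0.0, 0
    for i in range(1, len(tokens)):
        p = eps  # floor so log never sees zero
        for n, w in zip(range(1, 5), weights):
            if i - n < 0:
                continue
            nexts = counts[n].get(tuple(tokens[i - n:i]))
            if nexts:
                p += w * nexts[tokens[i]] / sum(nexts.values())
        total += -math.log(p)
        steps += 1
    return total / max(steps, 1)
```

As the same contexts recur, the count distributions sharpen around the true next token, so the average NLL drops without any gradient step.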
Animated walkthrough
Idle
Step — / —
Current: —
+1: —
+2: —
+3: —
Live counts
Vocab: —
Unique bigrams: —
Unique trigrams: —
Unique 4-grams: —
Just added
1: —
2: —
3: —
4: —
Toy loss (NLL)
avg: —
Lower = higher probability assigned to the true next token. This is not backprop, just counts getting better.
final view only
for heatmap + toy attention
optional
final heatmap
walkthrough
human-friendly subwords
2) What the model learned
No model yet
Tokens
—
Vocabulary
—
Unique bigrams
—
Unique trigrams
—
Unique 4-grams
—
Top tokens (click to highlight)
count
Token–token correlation heatmap
low → high
Click a top token to highlight its row/column in the heatmap.
Top 2-token phrases
Top 3-token phrases
Top 4-token phrases
Note
The predictor shows a single probability-ordered list, but still includes a “by context length” breakdown below for intuition.
Next-token predictor (updates as you type)
Needs a model
We compute candidate next tokens from contexts of length 4, 3, 2, and 1.
We blend them (or choose strictly) based on your settings (right column), then list predictions ordered by probability.
If you are mid-word, we treat the last fragment as a prefix filter (token completion).
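The blend-and-filter step above can be sketched as follows. The weights, the `counts` shape (n → context tuple → next-token counts), and the prefix handling are assumptions for illustration, not the demo's exact logic.

```python
from collections import defaultdict

def predict_next(counts, context, weights=(0.1, 0.2, 0.3, 0.4), prefix="", top=5):
    """Blend next-token distributions from contexts of length 1..4,
    optionally keeping only tokens that complete a typed prefix."""
    scores = defaultdict(float)
    for n, w in zip(range(1, 5), weights):
        if len(context) < n:
            continue
        nexts = counts[n].get(tuple(context[-n:]))
        if not nexts:
            continue
        total = sum(nexts.values())
        for tok, c in nexts.items():
            scores[tok] += w * c / total
    if prefix:  # mid-word: treat the fragment as a completion filter
        scores = {t: s for t, s in scores.items() if t.startswith(prefix)}
    z = sum(scores.values()) or 1.0
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [(t, s / z) for t, s in ranked[:top]]
```

A strict (non-blended) mode would instead return the distribution from the longest context that has any counts and fall back from 4 down to 1.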
No model yet — go to “Analyze Text” first.
Predictions (ordered by probability)
—
Toy “attention” over recent context
recency + co-occurrence with last token
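A heuristic like the one labeled above (recency plus co-occurrence with the last token) could look like this. The window size, decay rate, and `cooc` lookup are hypothetical; the demo's own formula is not shown.

```python
def toy_attention(context, cooc, window=16, decay=0.85):
    """Assign each recent token a weight: exponential recency decay
    multiplied by (1 + co-occurrence count with the last token),
    normalized to sum to 1. Not transformer attention."""
    recent = context[-window:]
    last = recent[-1]
    raw = []
    for age, tok in enumerate(reversed(recent)):  # age 0 = most recent
        recency = decay ** age
        affinity = 1.0 + cooc.get((tok, last), 0)
        raw.append((tok, recency * affinity))
    z = sum(w for _, w in raw)
    return [(tok, w / z) for tok, w in raw]
```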
Settings
higher = flatter distribution
used for “Insert top / Resample”
0 = off
1 = off
4→1 contexts
keeps list readable
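The sampling knobs hinted at above (temperature where higher = flatter, a filter where 0 = off, and one where 1 = off) behave like standard temperature, top-k, and top-p (nucleus) controls. A minimal sketch, assuming the demo applies them in that order:

```python
import math, random

def sample_with_controls(probs, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """probs: list of (token, probability). Apply softmax-style temperature,
    then optional top-k / top-p truncation, then sample.
    top_k=0 and top_p=1.0 leave those filters off."""
    # Temperature: rescale log-probs; higher T flattens the distribution.
    logits = [math.log(p) / temperature for _, p in probs]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    items = sorted(((t, e / z) for (t, _), e in zip(probs, exps)),
                   key=lambda kv: -kv[1])
    if top_k > 0:          # keep only the k most likely tokens
        items = items[:top_k]
    if top_p < 1.0:        # keep the smallest set covering mass >= top_p
        kept, cum = [], 0.0
        for t, p in items:
            kept.append((t, p))
            cum += p
            if cum >= top_p:
                break
        items = kept
    z = sum(p for _, p in items)  # renormalize and sample
    r = rng.random() * z
    for t, p in items:
        r -= p
        if r <= 0:
            return t
    return items[-1][0]
```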
What this still misses
Real LLMs learn patterns in weights, not explicit n-gram tables.
Real tokenizers are learned (BPE/Unigram). Here we use a readable suffix splitter.
Attention here is a heuristic, not transformer attention.
Real training is done via gradient descent on huge corpora; this tool is “count-and-lookup”.
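For contrast with learned BPE/Unigram tokenizers, a readable suffix splitter of the kind described can be sketched in a few lines. The suffix list and minimum stem length here are illustrative guesses, not the demo's actual rules.

```python
import re

# Hypothetical suffix inventory; the demo's splitter may use a different list.
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def subword_tokens(text, min_stem=3):
    """Split each word into stem + '-suffix' when the stem stays readable,
    e.g. 'running' -> ['runn', '-ing']. A toy stand-in for learned BPE."""
    out = []
    for word in re.findall(r"[a-zA-Z]+|[^\sa-zA-Z]", text.lower()):
        for suf in SUFFIXES:
            if word.endswith(suf) and len(word) - len(suf) >= min_stem:
                out.append(word[: -len(suf)])
                out.append("-" + suf)
                break
        else:
            out.append(word)
    return out
```

Unlike BPE, nothing here is learned from data: the split points come from a fixed list, which is what keeps the tokens human-friendly.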