2/16/2026 at 9:23:05 AM
Amazing work! Reminded me of LLM Visualization (https://bbycroft.net/llm), except this is a lot easier to wrap my head around and I can actually run the training loops, which makes sense given the simplicity of the original microgpt.

To give a sense of what the loss value means, maybe you could add a small explainer section as a question and include this explanation from Karpathy's blog:
> Over 1,000 steps the loss decreases from around 3.3 (random guessing among 27 tokens: −log(1/27)≈3.3) down to around 2.37.
to reiterate that the model is being trained to predict the next token out of 27 possible tokens and is now doing better than the random-guessing baseline.
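For reference, a minimal sketch of where that 3.3 baseline comes from (assuming the 27-token vocabulary from the quote and natural-log cross-entropy):

```python
import math

vocab_size = 27  # character-level vocabulary from the quoted example

# Cross-entropy of a uniform, random-guessing prediction: -log(1/V) = log(V)
baseline_loss = -math.log(1.0 / vocab_size)
print(f"random-guess baseline: {baseline_loss:.2f}")  # ~3.30

# A trained loss of ~2.37 means the model assigns, on average,
# exp(-2.37) ~ 0.09 probability to the correct next token,
# versus 1/27 ~ 0.04 under random guessing.
print(f"avg prob of correct token after training: {math.exp(-2.37):.3f}")
```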
by kengoa