4/20/2026 at 10:23:06 PM
> 25K parameters is about 70 million times smaller than GPT-4. It will produce broken sentences. That's the point - the architecture works at this scale.

Since it seems to just produce broken and nonsensical sentences (at least based on the one example given), I'm not sure if it does work at this scale.
Anyway, as written this passage doesn't really make a whole lot of sense (the point is that it produces broken sentences?), and given that it was almost certainly written by an AI, it demonstrates that the architecture doesn't work especially well at any scale (I kid, I kid).
by wk_end
4/20/2026 at 10:38:03 PM
How does it compare to a Markov chain generator, I wonder.
by forinti
4/21/2026 at 12:20:36 AM
The Transformer is a more powerful model than a Markov chain, but on a machine as weak as the C64, an MC could output text faster - though it would surely sound "psychedelic", as memory limits an MC to a first- or second-order model: to predict one word, only the one or two words before it are taken into account as context (and there is no attention).

On a plain vanilla C64, the Transformer cannot really show what it's capable of. An implementation using 2 bits per weight (vectorized) could be slightly better, perhaps.
by jll29
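The trade-off described above can be sketched in a few lines. This is a minimal second-order word-level Markov chain, not any implementation mentioned in the thread; the names `build_chain` and `generate` are illustrative:

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each context of `order` consecutive words to the words seen after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        chain[context].append(words[i + order])
    return chain

def generate(chain, length=20, seed=None):
    """Extend the seed by repeatedly sampling a continuation of the last
    `order` words; stop early if the context was never seen in training."""
    context = seed if seed is not None else random.choice(list(chain))
    out = list(context)
    for _ in range(length):
        candidates = chain.get(tuple(out[-len(context):]))
        if not candidates:
            break
        out.append(random.choice(candidates))
    return " ".join(out)
```

Note that the only state kept per prediction is the last two words, which is exactly why the output drifts "psychedelically" - anything further back is forgotten.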
4/21/2026 at 8:47:22 AM
You can build an unlimited-order Markov chain by, instead of pre-computing a table of counts for all possible contexts, using a substring-search index on the training data to count possible continuations on the fly: https://arxiv.org/abs/2401.17377

That paper uses suffix arrays, but more compact indices are possible: https://arxiv.org/abs/2506.12229
by yorwba
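The on-the-fly counting idea can be sketched as follows. `continuations` is an illustrative name, and a naive linear substring scan stands in for the suffix-array index the paper actually builds - the retrieval logic is the same:

```python
from collections import Counter

def continuations(corpus_tokens, context):
    """Return (matched_suffix, counts of next tokens) for the longest suffix
    of `context` that occurs anywhere in the corpus. A linear scan replaces
    the suffix-array lookup a real implementation would use."""
    corpus_tokens = tuple(corpus_tokens)
    context = tuple(context)
    for start in range(len(context)):              # try the longest suffix first
        suffix = context[start:]
        n = len(suffix)
        counts = Counter(
            corpus_tokens[i + n]
            for i in range(len(corpus_tokens) - n)
            if corpus_tokens[i:i + n] == suffix
        )
        if counts:
            return suffix, counts
    return (), Counter(corpus_tokens)              # back off to unigram counts
```

Because the effective order is "however much of the context actually occurs in the training data", no count table has to fit in memory - only the index over the corpus.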