2/22/2026 at 8:06:59 AM
8B coefficients are packed into 53B transistors, about 6.6 transistors per coefficient. A two-input NAND gate takes four transistors, and a register takes about as many, so one coefficient gets processed (multiplied, with the result added to a sum) within less than two two-input NAND gates' worth of area.

I think they used block quantization: one can enumerate all possible blocks over all (sorted) combinations of coefficient values and, for each layer, place only the blocks that are actually needed there. For 3-bit coefficients and a block size of 4 coefficients, only 330 distinct blocks are needed.
Matrices in Llama 3.1 are 4096x4096, about 16.8M coefficients each. They can be compressed down to just those 330 blocks, if we assume all coefficient combinations occur there, plus a routing network that applies the correct permutations to inputs and outputs.
Assuming the blocks are the most area-consuming part, we get a per-block transistor budget of about 250 thousand transistors, or about 30 thousand two-input NAND gates per block.
250K transistors per block * 330 blocks = about 82.5M transistors, divided by ~16.8M coefficients = about 5 transistors per coefficient.
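Both numbers can be checked directly: the count of distinct sorted blocks is the number of multisets of 4 values drawn from the 8 possible 3-bit codes, and the per-coefficient budget follows from the arithmetic above. A quick sketch:

```python
from math import comb

bits, block = 3, 4
values = 2 ** bits  # 8 possible 3-bit coefficient values

# Distinct sorted blocks = multisets of size 4 drawn from 8 values:
# C(8 + 4 - 1, 4) = C(11, 4) = 330
print(comb(values + block - 1, block))  # 330

coeffs = 4096 * 4096                    # ~16.8M coefficients per matrix
transistors = 250_000 * 330             # total transistor budget for all blocks
print(round(transistors / coeffs, 1))   # ~4.9 transistors per coefficient
```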
Looks very, very doable.
It does look doable even for FP4 - these are 3-bit coefficients in disguise.
by thesz
2/22/2026 at 11:45:26 AM
I'm looking forward to the model.toVHDL() method in PyTorch.
by amelius
2/22/2026 at 5:50:36 PM
Ugh, quick, everyone start panic-buying FPGAs now.
by sowbug
2/22/2026 at 7:50:40 PM
The largest FPGAs have on the order of tens of millions of logic cells/elements. They're not even remotely big enough to emulate these designs, except to validate small parts of one at a time, and unlike memory chips or GPUs, companies don't need millions of them to scale infrastructure. (The chips also cost tens of thousands of dollars each.)
by throwup238
2/22/2026 at 8:10:54 PM
They also aren't power friendly.
by 8note
2/23/2026 at 4:34:21 AM
Pretty close to what you describe: https://github.com/fastmachinelearning/hls4ml
by p0u4a
2/22/2026 at 3:07:13 PM
Deep Differentiable Logic Gate Networks
by Simboo
2/22/2026 at 10:28:21 PM
I see you and I raise approximate logic synthesis [1] [2].

[1] https://www.sciencedirect.com/science/article/pii/S138376212...
[2] https://arxiv.org/abs/2506.22772
You can synthesize a logic circuit that is only as complex as it needs to be to reach a given accuracy.
Deep differentiable logic networks, in my experience, do not scale well to larger logic elements (ones with more inputs), and one still has to apply logic optimization and synthesis afterwards. So why not synthesize one's own approximate circuit to the accuracy one desires?
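As a toy illustration of that accuracy/complexity trade-off (a hypothetical sketch, not the method from the cited papers): a 4-bit multiplier can be approximated by zeroing each input's low bit, which removes partial-product logic at the cost of a bounded error.

```python
# Toy approximate-circuit example (hypothetical, for illustration only):
# approximate a 4-bit x 4-bit multiply by zeroing each operand's low bit,
# trading partial-product logic for a bounded worst-case error.
def exact(a, b):
    return a * b

def approx(a, b):
    return (a & 0b1110) * (b & 0b1110)

# Exhaustive check of the worst-case absolute error over all 4-bit inputs
max_err = max(abs(exact(a, b) - approx(a, b))
              for a in range(16) for b in range(16))
print(max_err)  # 29, at a = b = 15
```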
by thesz
2/22/2026 at 2:33:40 PM
Is this a thing?
by androiddrew
2/22/2026 at 7:35:50 PM
I gave a short talk about compiling PyTorch to Verilog at Latte '22. Back then we were just looking at a simple dot product operation, but the approach could theoretically scale up to whole models.
by mikeurbach
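For context, the operation being lowered in such a flow is essentially a fixed-point multiply-accumulate. A hypothetical sketch (not the talk's actual code; the Q8 fixed-point format is an assumption) of what the hardware would compute:

```python
# Hypothetical sketch of a fixed-point dot product of the kind such a
# PyTorch-to-Verilog flow would lower to hardware: an integer
# multiply-accumulate followed by a rescaling shift.
FRAC_BITS = 8  # assumed Q8 fixed-point format: 8 fractional bits

def fixed_dot(xs, ws):
    acc = 0
    for x, w in zip(xs, ws):
        acc += x * w           # full-precision partial products
    return acc >> FRAC_BITS    # rescale back to the Q8 format

# 1.0 and 0.5 in Q8 are 256 and 128; 1.0*0.5 + 1.0*0.5 = 1.0
print(fixed_dot([256, 256], [128, 128]))  # 256, i.e. 1.0 in Q8
```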
2/22/2026 at 4:37:22 PM
They mentioned that they're using strong quantization (IIRC 3-bit) and that the model was degraded by it. Also, they don't have to use transistors to store the bits.
by cpldcpu
2/22/2026 at 6:41:43 PM
I think they are talking about the transistors that apply the weights to the inputs.
by amelius
2/22/2026 at 8:24:17 PM
gpt-oss is FP4. They're saying they'll try a mid-size model next (I'm guessing gpt-oss-20b), then a large one (I'm guessing gpt-oss-120b), since their hardware is FP4-friendly.
by mirekrusin
2/22/2026 at 4:41:40 PM
What's the theoretical full-wafer-scale model they could produce?
by cyanydeez