5/16/2026 at 5:06:26 PM
I would love for the standard to be to ALWAYS report the memory required to load and run a model, in bytes of RAM, alongside any other metrics. I'd love to see time to first token, token throughput, and token latency as well, but I'd settle for memory size as described above. Essentially, many people want to know the minimum amount of memory needed to run a particular model.
Parameter count obscures important details: what are the sizes of the parameters? A parameter isn't rigorously defined. This also gets folks into trouble because a 4B param model with FP16 params is very different from a 4B param model with INT4 params. The former obviously should be a LOT better than the latter, but its weights also take about four times as much memory.
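To put rough numbers on that, here is a minimal sketch of the weights-only arithmetic. The helper name `weight_bytes` is made up for illustration, and it ignores activations, KV cache, and quantization metadata such as scales, so treat the result as a lower bound rather than the real requirement.

```python
def weight_bytes(num_params: int, bits_per_param: int) -> int:
    """Bytes needed to store the weights alone (hypothetical helper)."""
    return num_params * bits_per_param // 8


GB = 10**9  # decimal gigabytes, as download pages usually report them

fp16_4b = weight_bytes(4_000_000_000, 16)
int4_4b = weight_bytes(4_000_000_000, 4)

print(f"4B params @ FP16: {fp16_4b / GB:.1f} GB")  # 8.0 GB
print(f"4B params @ INT4: {int4_4b / GB:.1f} GB")  # 2.0 GB
```

Same "4B" label, a 4x difference in the memory you need just to hold the weights.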
This would also help with MoE models: if memory is my constraint, it doesn't matter if the MoE version, which requires much more RAM, is faster or has better evals.
I'm waiting for someone to ship, in anger, the 1-parameter model whose single parameter, according to PyTorch, is 4 GB in size.
by djoldman
5/17/2026 at 8:13:07 AM
As a proxy for the total size of the parameters, you can just look at the download size of a model on Huggingface.co. Because for most models the weights are provided in many *.safetensors files of approximately the same size, you can estimate the total size without adding up all the file sizes, by multiplying the number of *.safetensors files by the approximate size of one file.
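A minimal sketch of that estimate, assuming equally sized shards except possibly the last one, which is often smaller. The function name and the shard numbers are hypothetical, not taken from any real model page.

```python
def estimate_total_bytes(num_shards: int, approx_shard_bytes: int) -> tuple[int, int]:
    """Return a (low, high) range for the total size of a sharded checkpoint.

    low assumes the last shard is nearly empty; high assumes it is full.
    """
    low = max(num_shards - 1, 0) * approx_shard_bytes
    high = num_shards * approx_shard_bytes
    return low, high


GB = 10**9

# Hypothetical example: 12 shards of roughly 5 GB each.
low, high = estimate_total_bytes(12, 5 * GB)
print(f"between {low / GB:.0f} and {high / GB:.0f} GB")
```

For a quick "will it fit" check, the upper end of the range is the safer number to use.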
For quantized models, estimating the size is simpler, because there is typically just one GGUF file. It also includes metadata, but most of the file is occupied by the parameters.
While there are models where all parameters are natively BF16, there are also models that mix parameter sizes, e.g. a large number of small parameters, even down to 4 bits, together with a small number of bigger parameters, up to FP32. Therefore, as you say, the number of parameters is much less informative about memory requirements than the file sizes.
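As a rough illustration of why the mix matters, the sketch below sums the size of each precision group. The split (3.8B parameters at 4 bits plus 0.2B at 32 bits) is invented for the example and does not describe any particular model.

```python
# Each entry is (number of parameters, bits per parameter).
# Hypothetical mix: most weights at 4 bits, a small remainder kept at FP32.
groups = [
    (3_800_000_000, 4),
    (200_000_000, 32),
]

GB = 10**9

mixed_total = sum(count * bits // 8 for count, bits in groups)
naive_total = sum(count for count, _ in groups) * 4 // 8  # "4B params at 4 bits"

print(f"mixed precision: {mixed_total / GB:.1f} GB")  # 2.7 GB
print(f"naive estimate:  {naive_total / GB:.1f} GB")  # 2.0 GB
```

In this made-up case, 5% of the parameters account for roughly 30% of the bytes, which is exactly what a bare parameter count hides.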
While the download size of the *.safetensors files or GGUF files is not the same as the total memory requirement, it gives an approximate estimate and can be used to assess which of two models will need more memory. It becomes more complicated when you must use multiple kinds of memory, e.g. GPU memory and CPU memory, or even SSDs, because then you must know more about the structure of the model to determine how much of each kind of memory is needed.
by adrian_b
5/17/2026 at 11:32:48 AM
The KV cache size is a wild card though. Different models use very different amounts of memory per token in the KV cache. The VRAM requirements for, say, 64k context can vary by almost an order of magnitude. So the download size might indicate you should have room for the model, but how much context you can fit in the leftover VRAM budget is harder to predict at a glance. That some models like Qwen3.6 27B seem not to be very affected by Q8 quantized KV cache while others degrade heavily doesn't make it easier.
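For a feel of the spread, here is a back-of-the-envelope sketch. It assumes plain multi-head or grouped-query attention, where every layer caches one key and one value vector per KV head per token; models using sliding windows, latent attention, or other schemes will not follow this formula. The two configurations are hypothetical, not real model cards.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for one sequence: keys + values, all layers, all tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len


GiB = 2**30
CTX = 64 * 1024  # 64k tokens

# Hypothetical configs: same depth and head size, different KV head counts.
full_mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, context_len=CTX)
gqa_4 = kv_cache_bytes(n_layers=32, n_kv_heads=4, head_dim=128, context_len=CTX)
# 8-bit cache: one byte per element, ignoring any per-block scale overhead.
full_mha_q8 = kv_cache_bytes(32, 32, 128, CTX, bytes_per_elem=1)

print(f"32 KV heads, FP16: {full_mha / GiB:.0f} GiB")     # 32 GiB
print(f" 4 KV heads, FP16: {gqa_4 / GiB:.0f} GiB")        # 4 GiB
print(f"32 KV heads, Q8:   {full_mha_q8 / GiB:.0f} GiB")  # 16 GiB
```

Two models with similar download sizes can therefore differ by 8x in cache footprint at the same context length, before quantizing the cache even enters the picture.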
by magicalhippo