Sam's previous posts are well worth digging up too. This one is outstanding, but they're all good. I really enjoyed this and learned a lot.
I'm a bit envious of his job. Learning to teach others, and building out such cool interactive, visual documents to do it? He makes it look easier than it is, of course. A lot of effort and imagination went into this, and I'm sure it wasn't a walk in the park. Still, it seems so gratifying.
I read the entire thing top to bottom; as a visual learner, I found it superb.
One nitpick -- in the "asymmetric quantization" code, shouldn't "zero" be called "midpoint" or similar? Or is "zero" an accepted mathematics term in this domain?
“Zero point” is how I saw it referred to in the literature, so that’s what I went with. I personally prefer to think of it as an offset, but I try to stick with terms folks are likely to see in the wild.
Fair enough, thanks!
You’re welcome! Thanks so much for the kind words.
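For readers following the zero-point discussion above, here's a minimal sketch of asymmetric quantization with an explicit zero point (the names are illustrative, not the article's actual code):

```python
import numpy as np

def asymmetric_quantize(x: np.ndarray, bits: int = 8):
    """Map floats in [x.min(), x.max()] onto unsigned integers [0, 2^bits - 1].

    The "zero point" is the integer code that real-valued 0.0 maps to; it acts
    as an offset so the integer grid can cover an asymmetric range exactly.
    """
    qmin, qmax = 0, (1 << bits) - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = round(qmin - x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-0.5, 0.0, 0.3, 1.5], dtype=np.float32)
q, scale, zp = asymmetric_quantize(x)
x_hat = dequantize(q, scale, zp)  # close to x, within one quantization step
```

One nice consequence of the zero-point formulation: real 0.0 lands exactly on an integer code, so it round-trips with no error, which matters for things like zero-padding.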
My word... samwho is doing some of the best technical explainers on the internet right now.
Leading to my question: keeping a zero and a minus-zero does make sense for some limits calculations, but when all you have is 4 bits, is this not quite wasteful? Would using the bits for, e.g., a 2.5 not improve the model?
Oh, that's a rabbit hole: NVIDIA Blackwell has this, and GGUFs sidestep it with Qi_j / Qi_K... Great article, sparks curiosity!
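On the 4-bit question: assuming the E2M1 layout used by FP4 formats such as NVIDIA's (1 sign bit, 2 exponent bits, 1 mantissa bit), you can enumerate all 16 codes and confirm that exactly one pair of codes collides on zero, so 16 codes only buy 15 distinct values:

```python
def e2m1_value(code: int) -> float:
    """Decode a 4-bit E2M1 float: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:                              # subnormal: no implicit leading 1
        return sign * man * 0.5
    return sign * (1.0 + man * 0.5) * 2.0 ** (exp - 1)

values = [e2m1_value(c) for c in range(16)]
# +0.0 and -0.0 compare equal, so the set collapses to 15 distinct values:
# ±{0.5, 1, 1.5, 2, 3, 4, 6} plus zero
distinct = len(set(values))
```

So the commenter's intuition is right that one code point is "spent" on -0; whether reclaiming it for an extra magnitude would help in practice is a separate question, since hardware decoders favor the regular sign-magnitude layout.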
Heartily second that! It was cool to see a combination of DOM, SVG, and canvas visualization all in use for this post.
Quantization is important to me because it's the only way out I can see for a future of programming that doesn't involve going through a giant bigco that can run, as the article says, a machine with 2TB of memory. And not just any memory: my understanding is that for the model to be performant, it has to fit in VRAM, to boot.
This comes as the latest concern of mine in a long line around "how software gets written" remaining free-as-in-freedom. I've always been really uneasy about how reliant many programming languages were on JetBrains editors, only vaguely comforted by their "open-core" offering, which naturally only existed for languages with strong OSS competition for IDEs (so... Java and Python, really). "IntelliSense" seemed very expensive to implement and was hugely helpful in writing programs without stopping every 4 seconds to look up whether removing whitespace at the end of a line is trim, strip, or something else in this language. I was naturally pleased to see language servers take off, even if it was much to my chagrin that it came from Microsoft, who was clearly out of open standards to EEE and decided to speed up the process by making some new ones.
Now LLMs are the next big worry of mine. It seems pretty bad for free and open software if the "2-person project, funded indirectly by the welfare state of a nordic or eastern-european nation" model that drives ridiculously important core libre/OSS libraries now is even less able to compete with trillion dollar corporations.
Open-weight, quantized, but still __good__ models seem like the only way out. I remain somewhat hopeful just from how far local models have come - they're significantly more usable than they were a year ago, and we've got more tools like LM Studio etc making running them easy. But there's still a good way to go.
I'll be sad if a "programming laptop" ends up going from "literally anything that can run debian" to "yeah you need an RTX 7090, 128GB of VRAM, and the 2kW wearable power supply backpack addon at a minimum".
I've been watching the drizzle of LLM papers come through, and I think we're going to hit a 1T param MoE on consumer hardware before this year is out. It'll still be behind the bigco models, but it'll be a force multiplier. Ideally, we'd get these models to run on a CPU. MS BitNet is one way to do this. You can already run ternary LLMs on consumer CPUs with a decent tps.
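For the curious, the ternary idea mentioned above can be sketched in a few lines. This follows the "absmean" scheme described in the BitNet b1.58 paper, heavily simplified (the real recipe quantizes on the fly during training with a straight-through estimator):

```python
import numpy as np

def absmean_ternary(W: np.ndarray):
    """Quantize a weight matrix to {-1, 0, +1} times a per-tensor scale.

    Scale by the mean absolute value, then round and clip to the ternary set.
    """
    gamma = float(np.mean(np.abs(W))) + 1e-8    # per-tensor scale
    W_q = np.clip(np.round(W / gamma), -1, 1)   # entries in {-1, 0, +1}
    return W_q.astype(np.int8), gamma

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)).astype(np.float32)
W_q, gamma = absmean_ternary(W)
W_hat = W_q * gamma  # dequantized approximation used at inference time
```

The payoff at inference time is that matrix multiplies against {-1, 0, +1} weights reduce to additions and subtractions, which is why ternary models run tolerably on CPUs.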
You can still continue to master actual software engineering while others spend their time turning their minds into a palimpsest of tricks and lessons on how to coax one model after another after another into giving reasonable output, output you'd still have to vet yourself anyway.
This is beautifully written and visualised, well done! The KL divergence comparisons between the original and different quantisation levels are on point. I'm not sure people realise how powerful quantisation methods are and what they've done for democratising local AI. And there are some great players out there, like Unsloth and Pruna.
Thank you! I was really surprised how robust models are to losing information. It seems wrong that they can be compressed so much and still function at all, never mind function nearly as well as the full-size original.
Think we're only going to keep seeing more progress in this area on the research side, too.
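For anyone wondering what the KL comparison mentioned above actually measures: run the same prompt through the original and the quantized model and compare their next-token distributions. A toy sketch with made-up logits (the numbers are illustrative, not from any real model):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()    # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): the expected extra nats incurred when the quantized
    model's distribution q stands in for the original model's p."""
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical next-token logits from the full-precision and quantized model
logits_fp16 = np.array([2.0, 1.0, 0.1, -1.0])
logits_q4 = np.array([1.9, 1.1, 0.0, -0.8])  # slightly perturbed by quantization

kl = kl_divergence(softmax(logits_fp16), softmax(logits_q4))
# a small value means the quantized model's predictions barely moved
```

Averaging this over many tokens gives a single number for how far a quantization level drifts from the original model, which is more informative than eyeballing sample outputs.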
You can even train in 4 & 8 bits with newer microscaled formats! From https://arxiv.org/pdf/2310.10537 to gpt-oss being trained (partially) natively in MXFP4 - https://huggingface.co/blog/RakshitAralimatti/learn-ai-with-...
To Nemotron 3 Super, which had 25T of native NVFP4 pretraining! https://docs.nvidia.com/nemotron/0.1.0/nemotron/super3/pretr...
5-10% accuracy is like the difference between a usable model and an unusable one.
Definitely could be, but in the time I spent talking to the 4-bit models in comparison to the 16-bit original it seemed surprisingly capable still. I do recommend benchmarking quantized models at the specific tasks you care about.
Yes, I was wondering why they mentioned those numbers without mentioning their practical significance.
Something I have been wondering about is doing regressive, layer-specific quantization based on large test sets, i.e. aggressively reducing precision only in the layers that don't contribute to general quality.
This is a thing! For example, https://arxiv.org/abs/2511.06516
That's brilliant; I wonder why we haven't seen much use of it for very heavy quantization.
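A back-of-envelope version of the per-layer search being discussed: quantize one layer at a time, measure how much the output moves on a held-out batch, and rank layers by sensitivity. Everything here is a toy stand-in (random linear layers, MSE against the full-precision output) rather than any particular paper's method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": a stack of linear layers applied in sequence
layers = [rng.normal(size=(8, 8)).astype(np.float32) for _ in range(4)]
X = rng.normal(size=(32, 8)).astype(np.float32)

def forward(ws, x):
    for w in ws:
        x = np.tanh(x @ w)
    return x

def quantize(w, bits=3):
    """Crude symmetric round-to-nearest quantization of one weight matrix."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

reference = forward(layers, X)

# Quantize one layer at a time and measure the output error it causes
sensitivity = []
for i in range(len(layers)):
    trial = [quantize(w) if j == i else w for j, w in enumerate(layers)]
    err = float(np.mean((forward(trial, X) - reference) ** 2))
    sensitivity.append((i, err))

# Layers with the smallest error are the safest to quantize aggressively
ranking = sorted(sensitivity, key=lambda t: t[1])
```

Mixed-precision quantizers do something in this spirit at much larger scale, using calibration sets and perplexity rather than MSE on random inputs.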
Man, what a brilliant technical essay. Hats off to the writer for the clarity and visualizations.
Thank you!