Glossary

What is quantization?

Quantization shrinks a model by storing its weights at lower numeric precision (for example, 4-bit instead of 16-bit), cutting memory use and typically speeding up inference at a small cost in quality.

Quantization is what lets large open models run on modest hardware. By representing weights with fewer bits, a model that needed a server can fit on a consumer GPU or a laptop. Formats like GGUF, GPTQ, and AWQ package quantized weights for different runtimes.
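To make the memory saving concrete, here is a rough back-of-the-envelope sketch. The 7-billion-parameter model and the helper name `weight_memory_gib` are assumptions chosen for illustration. The estimate covers weights only, and real quantized formats also store scales and other metadata, so actual files tend to be somewhat larger than this suggests.

```python
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


# Hypothetical 7B-parameter model, used only as an example.
params = 7_000_000_000

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gib(params, bits):.1f} GiB")
# 16-bit: 13.0 GiB
#  8-bit: 6.5 GiB
#  4-bit: 3.3 GiB
```

Under these assumptions, going from 16-bit to 4-bit cuts weight memory to a quarter, which is roughly the difference between needing a server-class GPU and fitting on a consumer card.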

The trade-off is size and speed versus accuracy: aggressive quantization saves the most memory but can degrade quality, so runtimes offer several levels. It is a core enabling technique for the local-LLM ecosystem.
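The accuracy side of the trade-off can be illustrated with a minimal sketch of symmetric uniform quantization: weights are scaled to a small signed-integer range, rounded, and scaled back. This is a simplified per-tensor scheme written for illustration, not the method used by GGUF, GPTQ, or AWQ, which rely on more elaborate techniques. The function names and the random "weights" are assumptions.

```python
import random


def quantize_roundtrip(weights, bits):
    """Quantize floats to signed integers with one shared scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in quantized]


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored weights."""
    return sum(abs(x - y) for x, y in zip(original, restored)) / len(original)


# Synthetic stand-in for a weight tensor (illustrative data, not a real model).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

errors = {
    bits: mean_abs_error(weights, quantize_roundtrip(weights, bits))
    for bits in (8, 4, 2)
}
for bits, err in errors.items():
    print(f"{bits}-bit mean abs error: {err:.6f}")
```

On this synthetic data the reconstruction error grows as the bit width drops, which is the reason runtimes expose several quantization levels rather than a single setting.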


Trending quantization projects

AlexsJones/llmfit
Hundreds of models & providers. One command to find what runs on your hardware.
★ 26.7K · +24 in 24h · momentum 9 · Rust
MAC-AutoML/MindPipe
A powerful model compression framework for LLMs and LVLMs, adapted for NVIDIA GPUs and Huawei Ascend NPUs.
★ 1K · 0 in 24h · momentum 2 · Python
Michael-A-Kuykendall/shimmy
⚡ Python-free Rust inference server — OpenAI-API compatible. GGUF + SafeTensors, hot model swap, auto-discovery, single binary. FREE now, FREE forever.
★ 5.3K · 0 in 24h · momentum 2 · Rust
