Glossary

What is LLM inference?

Inference is the process of running a trained model to generate output. For an LLM, that means turning a prompt into tokens, and it is the stage where latency, throughput, and cost are decided.

Inference is the serving side of a model, distinct from training. For LLMs, serving performance is dominated by how fast tokens are generated and how many requests can be batched together. Techniques such as KV caching, continuous batching, and streaming responses are what make it usable at scale.
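To make the batching point concrete, here is a minimal sketch that counts decode steps for a toy workload. It is not taken from any particular engine: the function names are hypothetical, every request is assumed to need a positive number of output tokens, and one decode step is assumed to produce one token for every active request.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request frees its slot immediately."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every active request emits one token;
        # requests with nothing left to generate drop out.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Toy workload: one long request followed by three short ones.
output_lengths = [8, 2, 2, 2]
print(static_batching_steps(output_lengths, batch_size=2))
print(continuous_batching_steps(output_lengths, batch_size=2))
```

With static batching, the short request batched with the long one leaves its slot idle until the long one finishes. Refilling that slot right away is where continuous batching saves steps in this toy model.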

Open-source inference engines optimise throughput and memory use so that capable models run on the hardware you have. Quantization is often used alongside them to fit a model onto a single GPU or a laptop.
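As a rough illustration of why quantization helps a model fit, the sketch below estimates the memory needed for weights alone. The 7-billion-parameter model is a hypothetical example. The estimate ignores the KV cache, activations, and the extra metadata that quantization formats store (such as scales), so real footprints will be somewhat higher.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical 7-billion-parameter model at different precisions.
params = 7_000_000_000
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(params, bits):.1f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight footprint to a quarter.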


Trending LLM inference projects

jundot/omlx
LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar
★ 15.2K · +92 in 24h · momentum 12 · Python
antirez/ds4
DeepSeek 4 Flash local inference engine for Metal and CUDA
★ 11.9K · +69 in 24h · momentum 12 · C
raullenchai/Rapid-MLX
The fastest local AI engine for Apple Silicon. 4.2x faster than Ollama, 0.08s cached TTFT, 100% tool calling. 17 tool parsers, prompt cache, reasoning separation, cloud routing. Drop-in OpenAI replacement. Works with Claude Code, Cursor, Aider.
★ 2.4K · 0 in 24h · momentum 6 · Python
lucienhuangfu/eLLM
eLLM can infer LLM on CPUs faster than on GPUs
★ 417 · 0 in 24h · momentum 5 · Rust
cheahjs/free-llm-api-resources
A list of free LLM inference resources accessible via API.
★ 22.3K · +56 in 24h · momentum 4 · Python
Light-Heart-Labs/DreamServer
Turn your PC, Mac, or Linux box into a private AI server. LLM inference, chat UI, voice, agents, workflows, RAG, and image generation.
★ 1.8K · +56 in 24h · momentum 4 · Python
Michael-A-Kuykendall/shimmy
⚡ Python-free Rust inference server — OpenAI-API compatible. GGUF + SafeTensors, hot model swap, auto-discovery, single binary. FREE now, FREE forever.
★ 5.3K · 0 in 24h · momentum 2 · Rust
