Self-Hosting LLMs: A Practical Guide
How to run large language models locally — hardware requirements, quantization, and serving with vLLM.
Running your own LLMs gives you privacy, cost control, and zero rate limits. Here’s how to get started.
Why Self-Host?
- Privacy — data never leaves your network
- Cost — no per-token billing after hardware investment
- Control — custom models, fine-tuning, no censorship
- Latency — local inference can be faster than API calls
Hardware Requirements
| Model Size | VRAM Needed | Example GPU |
|---|---|---|
| 7B (Q4) | 6GB | RTX 3060 |
| 13B (Q4) | 10GB | RTX 3080 |
| 70B (Q4) | 40GB | A100 / 2x 3090 |
| 70B (FP16) | 140GB | Multi-GPU setup |
Quantization
GGUF format with llama.cpp is the easiest path:
- Q4_K_M — best balance of quality vs size
- Q5_K_M — slightly better quality
- Q8_0 — near-FP16 quality
Serving Options
- llama.cpp — CPU/GPU inference, great for single user
- vLLM — high-throughput GPU serving, OpenAI-compatible API
- Ollama — easiest setup, good for local dev
- TGI — Hugging Face’s production serving solution
My Setup
I run DeepSeek V4 via vLLM behind a reverse proxy, with a 9Router instance load-balancing between multiple providers. Works great for daily coding assistance.
Start with Ollama for experimentation, graduate to vLLM when you need production throughput.