June 25, 2026

Self-Hosting LLMs: A Practical Guide

How to run large language models locally — hardware requirements, quantization, and serving with vLLM.

#ai#llm#self-hosting#infrastructure

Running your own LLMs gives you privacy, cost control, and zero rate limits. Here’s how to get started.

Why Self-Host?

Privacy — data never leaves your network
Cost — no per-token billing after hardware investment
Control — custom models, fine-tuning, no censorship
Latency — local inference can be faster than API calls

Hardware Requirements

Model Size	VRAM Needed	Example GPU
7B (Q4)	6GB	RTX 3060
13B (Q4)	10GB	RTX 3080
70B (Q4)	40GB	A100 / 2x 3090
70B (FP16)	140GB	Multi-GPU setup

Quantization

GGUF format with llama.cpp is the easiest path:

Q4_K_M — best balance of quality vs size
Q5_K_M — slightly better quality
Q8_0 — near-FP16 quality

Serving Options

llama.cpp — CPU/GPU inference, great for single user
vLLM — high-throughput GPU serving, OpenAI-compatible API
Ollama — easiest setup, good for local dev
TGI — Hugging Face’s production serving solution

My Setup

I run DeepSeek V4 via vLLM behind a reverse proxy, with a 9Router instance load-balancing between multiple providers. Works great for daily coding assistance.

Start with Ollama for experimentation, graduate to vLLM when you need production throughput.