Blog Portfolio About

Self-Hosting LLMs: A Practical Guide

How to run large language models locally — hardware requirements, quantization, and serving with vLLM.

#ai#llm#self-hosting#infrastructure

Running your own LLMs gives you privacy, cost control, and zero rate limits. Here’s how to get started.

Why Self-Host?

  • Privacy — data never leaves your network
  • Cost — no per-token billing after hardware investment
  • Control — custom models, fine-tuning, no censorship
  • Latency — local inference can be faster than API calls

Hardware Requirements

Model SizeVRAM NeededExample GPU
7B (Q4)6GBRTX 3060
13B (Q4)10GBRTX 3080
70B (Q4)40GBA100 / 2x 3090
70B (FP16)140GBMulti-GPU setup

Quantization

GGUF format with llama.cpp is the easiest path:

  • Q4_K_M — best balance of quality vs size
  • Q5_K_M — slightly better quality
  • Q8_0 — near-FP16 quality

Serving Options

  1. llama.cpp — CPU/GPU inference, great for single user
  2. vLLM — high-throughput GPU serving, OpenAI-compatible API
  3. Ollama — easiest setup, good for local dev
  4. TGI — Hugging Face’s production serving solution

My Setup

I run DeepSeek V4 via vLLM behind a reverse proxy, with a 9Router instance load-balancing between multiple providers. Works great for daily coding assistance.

Start with Ollama for experimentation, graduate to vLLM when you need production throughput.