
The Self-Hosted AI Stack in 2026: MoE Models, Ollama v0.22, and the Hardware That Actually Runs Them

MoE architectures have made frontier-class open-weight models runnable on a single GPU. Here's your complete guide to the 2026 local AI landscape — from Ollama v0.22 setup and model selection to honest hardware sizing and the full self-hosted stack.

Ivan Horvatić 01 May 2026 10 min read

Something fundamental shifted in the local LLM world over the past six months, and most people haven't caught up yet. The headline isn't that models got smarter — they did, but that's not the story. The real story is that frontier-class open-weight models now fit on a single GPU, thanks to mixture-of-experts architectures that activate only a fraction of their total parameters per token.

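Let me put this in concrete terms: GLM-5 has 744 billion total parameters but activates only 40B per token. Llama 4 Scout has 109B total but runs with 17B active. Qwen 3.6 Plus, gpt-oss-120b, and Mistral Small 4 all follow the same pattern. In practice, per-token compute tracks the active count, so a model that would have required an eight-GPU cluster a year ago can now generate at usable speeds on the kind of hardware you'd use for Llama 3 70B, and it runs better. One caveat to keep in mind throughout: every expert still has to sit in memory somewhere (VRAM, or system RAM with offloading), so size your memory for the total parameter count, not the active one.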

If you've been waiting for local AI to cross the "good enough" threshold for real work, that moment is now. Let's walk through what's changed, which models to actually use, and the hardware you need to run them — with real numbers, not marketing fantasies.

The MoE Revolution: Why Everything Changed

Mixture-of-experts isn't a new idea — it's been floating around machine learning research for years. But 2026 is the year it became the default architecture for open-weight models, and that changes the self-hosting calculus entirely.

How MoE Works (In 30 Seconds)

A traditional dense model activates every parameter for every token. A 70B dense model uses all 70B parameters on every single word it generates. An MoE model, by contrast, has a large pool of "expert" sub-networks but routes each token through only a small subset. The result:

- Total parameters are high (giving the model broad knowledge)
- Active parameters are low (keeping inference costs manageable)
- Quality approaches what you'd expect from the full parameter count
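
To make the routing idea concrete, here is a toy sketch of top-k expert routing. It is an illustration under loud assumptions, not how any of the models in this article is implemented: the experts are plain scalar functions, the gate scores are passed in by hand (a real model computes them with a learned router), and the names `route_token` and `EXPERTS` are made up for this example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[float], float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(
    x: float,
    gate_scores: Sequence[float],
    experts: Sequence[Expert],
    top_k: int = 2,
) -> Tuple[float, List[int]]:
    """Send one token through only the top_k highest-scoring experts.

    Returns the weighted output and the indices of the experts that ran.
    All other experts are skipped entirely, which is where the compute
    saving of an MoE layer comes from.
    """
    ranked = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalise the gate scores over the chosen experts only.
    weights = softmax([gate_scores[i] for i in chosen])
    output = sum(w * experts[i](x) for w, i in zip(weights, chosen))
    return output, chosen


# Hypothetical stand-ins for expert feed-forward networks.
EXPERTS: List[Expert] = [
    lambda x: x + 1,
    lambda x: x * 2,
    lambda x: x - 3,
    lambda x: x * x,
]

if __name__ == "__main__":
    out, used = route_token(3.0, [0.1, 2.0, -1.0, 2.0], EXPERTS, top_k=2)
    print(f"experts used: {used}, output: {out}")
```

Only two of the four experts execute for this token, yet all four must exist in memory so the router can pick any of them for the next token. That is the same split between compute cost and memory cost that the total and active parameter counts describe.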

Here's where the major open models stand as of April 2026:

Model	Total Params	Active Params	Architecture	Context Window	License
Llama 4 Scout	109B	17B	MoE	10M	Meta Community
Mistral Small 4	119B	6.5B	MoE	128K	Apache 2.0
gpt-oss-120b	117B	5.1B	MoE	128K	Apache 2.0
GLM-5	744B	40B	MoE	128K	Apache 2.0
Qwen 3.6 Plus	397B	Not stated	MoE + Linear Attention	128K	Apache 2.0
Gemma 4	Varies	All (dense)	Dense	128K	Apache 2.0

Notice the licenses. Five of these six models ship under Apache 2.0 — a massive shift from the restrictive, confusing license terms that plagued open-weight models even a year ago. Only Meta's Llama 4 retains a custom community license (with a 700M monthly active user threshold that most of us will never hit). This licensing liberalization removes the last major barrier to enterprise adoption.

What This Means for Your Homelab

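Last year, running anything competitive with GPT-4 class models locally meant multi-GPU setups costing $15,000+. Today, an MoE model with 100B+ total parameters and 5-17B active parameters can run on a single GPU with 24GB VRAM, provided the experts that don't fit are offloaded to system RAM: at 4-bit quantization the full weights of a 109B model still come to roughly 65GB. That GPU can be a used RTX 3090, which costs $600-$900 on the used market. (The RTX 4070 Ti Super is a 16GB card, so it belongs a tier lower.)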

Let me be clear: these aren't "toy" models anymore. Qwen 3.6 leads or ties on coding benchmarks like LiveCodeBench and SWE-bench. Llama 4 Scout's 10-million-token context window enables use cases that are literally impossible with proprietary APIs (which max out at 1-2M tokens and charge astronomical per-token prices at those lengths).

Ollama in 2026: The Default Local AI Runtime

Ollama just released v0.22.1 (April 28, 2026), and it has quietly become the default way to run local LLMs. With 170K GitHub stars and a release cadence that ships meaningful improvements every few weeks, Ollama has earned its place at the center of the self-hosted AI stack.

What's New in v0.22.x

The latest releases bring several features worth highlighting:

1. Gemma 4 Renderer Improvements — Ollama v0.22.1 updated the Gemma 4 renderer for better thinking and tool calling support. If you're using Google's open models (which you should be — they're excellent and Apache 2.0 licensed), this fixes the tool calling quirks that made Gemma 3 frustrating to work with in agentic workflows.

2. Dynamic Model Recommendations — Model recommendations are now updated without needing to update Ollama itself. This is a small but meaningful quality-of-life improvement; as the model landscape shifts rapidly, you no longer need to wait for an Ollama release to discover new models.

3. ollama launch Integration — The ollama launch command has been tightened up across the desktop app, making it easier to spin up models and connect them to external tools without manual configuration.

Setting Up Ollama: A 2026 Quickstart

If you're new to Ollama or haven't updated in a while, here's the current setup:

# Install Ollama (Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Check the installed version; re-running the install script above updates an existing install
ollama --version

# Pull and run the current best small model
ollama pull qwen3:8b
ollama run qwen3:8b

# For MoE models that need more memory
# (tag names change between releases; confirm the exact tag in the Ollama model library)
ollama pull llama4-scout:17b
ollama run llama4-scout:17b

# For coding tasks: Qwen is empirically strongest
ollama pull qwen3.6-coder:14b

# List what you have installed
ollama list
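
If you'd rather check installed models from a script than from the terminal, the same information is available over HTTP. The sketch below assumes Ollama is listening on its default port and uses the `/api/tags` endpoint; the helper names `fetch_tags` and `summarize_models` are invented for this example, and the sample size in the usage section is made up.

```python
import json
import urllib.request
from typing import Dict, List


def fetch_tags(base_url: str = "http://localhost:11434") -> str:
    """Return the raw JSON listing of locally installed models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return resp.read().decode("utf-8")


def summarize_models(tags_json: str) -> List[Dict[str, object]]:
    """Reduce the /api/tags payload to model name and size in GB."""
    data = json.loads(tags_json)
    return [
        {"name": m["name"], "size_gb": round(m["size"] / 1e9, 1)}
        for m in data.get("models", [])
    ]


if __name__ == "__main__":
    # Offline example with a made-up payload; swap in fetch_tags() on a live server.
    sample = '{"models": [{"name": "qwen3:8b", "size": 5200000000}]}'
    for model in summarize_models(sample):
        print(model["name"], model["size_gb"], "GB")
```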

Pro tip: Ollama's Modelfile system lets you create custom models with system prompts, temperature settings, and template overrides. This is how you turn a generic base model into a purpose-built tool:

# Create a Modelfile (the quoted 'EOF' stops the shell from expanding anything inside)
cat > Modelfile <<'EOF'
FROM qwen3:8b

# Set a system prompt for consistent behavior
SYSTEM """You are a senior DevOps engineer. Answer questions about 
infrastructure, Docker, Kubernetes, and CI/CD pipelines. Be concise 
and provide working code examples. When unsure, say so explicitly."""

# Override generation parameters
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
EOF

# Build and run your custom model
ollama create devops-assistant -f Modelfile
ollama run devops-assistant

Ollama as an API Server

One of Ollama's most powerful features is its built-in REST API. Start the server, and any application on your network can query your local models:

# Start Ollama server (usually auto-starts, but just in case)
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# Query from another machine
curl http://your-server:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "Write a Docker Compose file for a WordPress + MySQL stack with automatic backups",
  "stream": false
}'

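This turns your homelab server into a private AI API endpoint: no API keys, no rate limits, no data leaving your network. The flip side of "no API keys" is that Ollama's API has no built-in authentication, so binding to 0.0.0.0 exposes it to every device that can reach the port. Keep it behind your firewall, or put a reverse proxy with auth in front of it.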
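
The same call is easy to make from application code. Here is a minimal standard-library sketch of a client for the `/api/generate` endpoint shown above; the function names are invented for this example, the default host is an assumption about your setup, and the `options` values simply mirror the Modelfile parameters from earlier.

```python
import json
import urllib.request


def build_generate_request(
    prompt: str,
    model: str = "qwen3:8b",
    host: str = "http://localhost:11434",
    temperature: float = 0.3,
    num_ctx: int = 8192,
) -> urllib.request.Request:
    """Build (but do not send) a non-streaming /api/generate request."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Per-request overrides, equivalent to PARAMETER lines in a Modelfile.
        "options": {"temperature": temperature, "num_ctx": num_ctx},
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=300) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    # Requires a running Ollama server with the model already pulled.
    print(generate("Explain what a Docker volume is in two sentences."))
```

Splitting request construction from sending keeps the payload easy to inspect and log without touching the network.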

Hardware Sizing: The Honest Guide

Let's talk hardware. The biggest mistake people make when sizing local LLM hardware is only looking at model weight size and ignoring the KV cache and overhead. Here's the actual formula:

Memory needed ≈ (parameters × bytes per parameter) + KV cache + overhead

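The overhead term isn't optional: budget roughly 25% on top of the weights for KV cache, activations, and runtime buffers at modest context lengths. A 7B model in FP16 is 14GB on paper but needs ~18GB in practice. With quantization, things get more interesting: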
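
The formula above is simple enough to turn into a quick estimator. This is a rough sketch, not a measurement: it folds KV cache and overhead into a single 25% factor, which is an assumption that only holds at modest context lengths, and the function names are made up for this example.

```python
def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Size of the weights alone, in GB (billions of params x bytes per param)."""
    return params_billion * bits_per_param / 8


def estimate_memory_gb(
    params_billion: float,
    bits_per_param: float,
    overhead_fraction: float = 0.25,
) -> float:
    """Weights plus a flat allowance for KV cache and runtime overhead."""
    return weights_gb(params_billion, bits_per_param) * (1 + overhead_fraction)


if __name__ == "__main__":
    # A 7B model in FP16: 14 GB on paper, 17.5 GB with the allowance.
    print(weights_gb(7, 16), estimate_memory_gb(7, 16))
    # For MoE models, pass the TOTAL parameter count, not the active one.
    print(round(estimate_memory_gb(109, 4.8), 1))
```

For MoE models the total count is the one that matters here, since every expert has to be resident in VRAM or system RAM even though only a few run per token.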

Quantization: The Single Most Important Concept

Quantization compresses model weights from FP16 (16-bit) to lower-precision integer formats. The quality loss is surprisingly small at reasonable levels:

Format	Bits/Param	7B Model	14B Model	32B Model	70B Model	Quality vs FP16
FP16	16	14.0 GB	28.0 GB	64.0 GB	140 GB	Reference
Q8_0	8.5	7.5 GB	15.0 GB	34.0 GB	75 GB	~99%
Q6_K	6.6	5.8 GB	11.5 GB	26.5 GB	58 GB	~98%
Q5_K_M	5.7	5.0 GB	10.0 GB	23.0 GB	50 GB	~97%
Q4_K_M	4.8	4.2 GB	8.5 GB	19.5 GB	42 GB	~95%
Q3_K_M	3.9	3.4 GB	7.0 GB	16.0 GB	35 GB	~90%

Practical rules:
- Q4_K_M is the default sweet spot — use this unless you have a reason not to
- Q5_K_M or Q6_K for RAG, code generation, or reasoning tasks where quality matters
- Q8_0 only when you have abundant memory and need near-lossless quality
- Q3_K_M and below — avoid unless nothing else fits; the quality drop is visible
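
One way to apply these rules mechanically is to pick the highest-precision format whose estimated footprint fits your memory. The sketch below uses the bits-per-parameter figures from the table and the same flat 25% overhead assumption as the sizing formula; `pick_quant` is a hypothetical helper, and real GGUF file sizes will differ a little from these estimates.

```python
from typing import List, Optional, Tuple

# (format name, approximate bits per parameter), highest precision first.
QUANT_BITS: List[Tuple[str, float]] = [
    ("Q8_0", 8.5),
    ("Q6_K", 6.6),
    ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.8),
    ("Q3_K_M", 3.9),
]


def pick_quant(
    params_billion: float,
    memory_gb: float,
    overhead_fraction: float = 0.25,
) -> Optional[str]:
    """Return the most precise format that fits, or None if nothing does."""
    for name, bits in QUANT_BITS:
        needed_gb = params_billion * bits / 8 * (1 + overhead_fraction)
        if needed_gb <= memory_gb:
            return name
    return None


if __name__ == "__main__":
    for params, memory in [(14, 24), (8, 8), (70, 48), (70, 24)]:
        print(f"{params}B in {memory}GB -> {pick_quant(params, memory)}")
```

A result of `None` means the model does not fit even at Q3_K_M, which is the signal to choose a smaller model or more memory instead of a more aggressive quantization.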

Hardware Tiers for 2026

Here's my honest hardware recommendation matrix, updated for the MoE era:

Tier 1: Laptop/Casual (8-16GB unified memory)
- Apple M2/M3/M4 MacBook with 16GB+ RAM
- Any laptop with 16GB RAM for CPU inference
- Best models: Qwen3 1.7B-4B (Q4), Gemma 4 4B. Mistral Small 4 does not fit at this tier: even at Q3, its 119B total parameters need roughly 58GB for weights alone
- Use case: Quick queries, writing assistance, light coding help
- Speed: 5-15 tokens/sec on small models

Tier 2: Mid-Range Workstation (24GB VRAM)
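- NVIDIA RTX 3090 or 4090 (used 3090s are ~$600); the RTX 4070 Ti Super has 16GB, so it only suits the smaller models here
- Apple M4 Pro/Max with 48GB+ unified memory
- Best models: Qwen3 8B (Q6_K) fully in VRAM; Llama 4 Scout 17B active (Q4_K_M) and Mistral Small 4 (Q4_K_M) with experts offloaded to system RAM, since their Q4_K_M weights come to roughly 65-71GB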
- Use case: Coding assistant, document analysis, RAG pipelines
- Speed: 15-40 tokens/sec

Tier 3: Power User (48-80GB VRAM)
- 2× RTX 3090/4090 or 1× RTX 4090 + 1× A6000
- Apple M4 Ultra with 192GB unified memory
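- Best models: Qwen3.6 32B (Q5_K_M), any 70B dense model (Q4_K_M). GLM-5 (40B active, 744B total) needs roughly 450GB for Q4_K_M weights, so at this tier it only runs with most experts offloaded to a large pool of system RAM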
- Use case: Production workloads, complex reasoning, fine-tuning
- Speed: 10-25 tokens/sec on large models

Tier 4: Multi-GPU Server (96-192GB+ VRAM)
- 4× RTX 3090, 2× A100 80GB, or 8× RTX 4090
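- Best models: 70B+ dense models at high quant, plus the larger MoE models as far as memory allows. By the sizing formula, Qwen 3.6 Plus (397B) needs roughly 240GB at Q4_K_M and GLM-5 (744B) roughly 450GB, so both still lean on system RAM offloading at this tier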
- Use case: Serving multiple users, production API, complex agentic workflows
- Speed: Varies; expect 5-15 tokens/sec on the largest models

Apple Silicon note: Apple's unified memory architecture remains the best value proposition for large-model local inference. An M4 Ultra Mac Studio with 192GB of unified memory can run models that would require a multi-GPU NVIDIA setup costing 3-5x as much. The tradeoff is raw speed: Apple Silicon typically generates tokens at 30-50% of the speed of an NVIDIA setup with equivalent VRAM. For many self-hosting use cases, that's an acceptable tradeoff.

The KV Cache Gotcha

Here's something most guides gloss over: the KV cache grows linearly with context length. At 8K context on a 7B model, it's 1-2GB. At 32K context on a 70B model, it ranges from roughly 10GB with grouped-query attention to 20-40GB or more on architectures that cache more per token. At Llama 4 Scout's 10M token context? The KV cache alone would be hundreds of gigabytes.

In practice, no one runs Llama 4 Scout at full 10M context on local hardware — you'd need server-grade memory. But the principle matters: if you're working with long documents or running RAG with large context windows, budget at least 50% more memory than the model weights alone would suggest.
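
The linear growth is easy to see from the arithmetic. The sketch below uses the standard per-token accounting for a transformer KV cache (one key and one value vector per layer per KV head) with FP16 cache entries. The two example configurations are the published Llama 3 8B and 70B shapes, used here only as familiar reference points; other architectures, cache quantization, or runtime padding will move the numbers, so treat the output as an estimate.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # FP16 cache entries
) -> int:
    """Bytes needed to cache keys and values for context_tokens tokens."""
    # The leading 2 accounts for one key tensor plus one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


def to_gib(num_bytes: int) -> float:
    return num_bytes / 2**30


if __name__ == "__main__":
    # Llama 3 8B shape: 32 layers, 8 KV heads, head dimension 128.
    print(to_gib(kv_cache_bytes(32, 8, 128, 8_192)))    # 1.0
    # Llama 3 70B shape: 80 layers, 8 KV heads, head dimension 128.
    print(to_gib(kv_cache_bytes(80, 8, 128, 32_768)))   # 10.0
    # Same 70B shape with 64 KV heads (no grouped-query attention).
    print(to_gib(kv_cache_bytes(80, 64, 128, 32_768)))  # 80.0
```

Doubling the context doubles the cache, and the number of KV heads matters as much as the parameter count, which is why two models of the same size can have very different long-context memory bills.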

The Self-Hosted AI Stack: Beyond Just Models

Running a model is only half the equation. Here's the complete self-hosted AI stack I recommend in 2026, with each component doing a specific job:

1. Ollama — Inference Engine

We covered this above. It's the default for good reason: simple installation, pre-quantized model downloads, a built-in API server, and support for 100+ models out of the box.

2. Open WebUI — Chat Interface

If you're running Ollama, you need Open WebUI. It provides a ChatGPT-style interface that turns your terminal-only setup into something your whole household or team can use. Key features:

- Multi-user authentication and conversation history
- Document upload with RAG (Retrieval-Augmented Generation)
- Image generation integration
- Custom system prompts per conversation
- Connects to multiple Ollama instances

Deploy it with Docker Compose:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui-data:/app/backend/data
    depends_on:
      - ollama

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    # Uncomment for NVIDIA GPU support
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]

volumes:
  open-webui-data:
  ollama-data:

3. ChromaDB or Qdrant — Vector Database

For RAG workflows, you need a vector database to store document embeddings. ChromaDB is the simplest to get started with (it runs in-process or as a Docker container), while Qdrant offers better performance at scale:

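# Add under `services:` in your Docker Compose
  chromadb:
    image: chromadb/chroma:latest
    ports:
      - "8000:8000"
    volumes:
      - chroma-data:/chroma/chroma

# ...and declare the named volume under the top-level `volumes:` key
#   chroma-data: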
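
If it's unclear what a vector database actually does in a RAG pipeline, this toy in-memory version shows the core operation: store text alongside an embedding, then return the entries closest to a query embedding. It is a teaching sketch and not ChromaDB's or Qdrant's API; the class name is invented, and the three-dimensional vectors are hand-written stand-ins for what an embedding model would produce.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Brute-force nearest-neighbour search over (text, embedding) pairs."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, List[float]]] = []

    def add(self, text: str, embedding: Sequence[float]) -> None:
        self._items.append((text, list(embedding)))

    def query(self, embedding: Sequence[float], top_k: int = 2) -> List[Tuple[float, str]]:
        scored = [(cosine_similarity(embedding, emb), text) for text, emb in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


if __name__ == "__main__":
    store = TinyVectorStore()
    store.add("Docker volumes persist data", [1.0, 0.0, 0.0])
    store.add("Kubernetes schedules pods", [0.0, 1.0, 0.0])
    store.add("Compose files define volumes", [0.9, 0.1, 0.0])
    for score, text in store.query([1.0, 0.0, 0.0], top_k=2):
        print(f"{score:.3f}  {text}")
```

A real vector database replaces the brute-force scan with an approximate index and adds persistence and filtering, which is what you are paying for once the collection grows past a few thousand chunks.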

4. n8n or Dify — Workflow Orchestration

For building AI agents and complex workflows, n8n (self-hosted automation) or Dify (purpose-built LLM workflow builder) connect your local models to external APIs, databases, and tools. This is where self-hosted AI starts to rival proprietary platforms — your private models, connected to your private data, orchestrated on your own infrastructure.

Model Selection Guide: Which Model for What

With six major model families competing, choosing is harder than ever. Here's my practical recommendation matrix:

Use Case	Recommended Model	Why
General chat & writing	Qwen3 8B	Best quality-to-size ratio, Apache 2.0
Coding & development	Qwen3.6 Coder 14B or 32B	Leads LiveCodeBench & SWE-bench
Long-context RAG	Llama 4 Scout	10M token context window
Reasoning & analysis	GLM-5 (if you have the hardware)	Largest total parameter count (744B)
Low-resource devices	Gemma 4 4B or Qwen3 4B	Efficient, well-supported, Apache 2.0
Tool-calling agents	Mistral Small 4 or Gemma 4	Best tool-calling support in latest Ollama
Privacy-first deployments	Any — they all run locally	The whole point of self-hosting

If you can only install one model right now: Make it qwen3:8b. It's the best general-purpose model that runs on modest hardware, with Apache 2.0 licensing, strong multilingual support, and excellent coding capabilities.

If you have 24GB+ VRAM: Add llama4-scout for long-context work and qwen3.6-coder:14b as a dedicated coding assistant. The combination of a general model + a specialist model covers 95% of what most people use AI for.

The Cost Argument: Local vs. Cloud in 2026

Let's run the numbers. Suppose you're a developer who uses AI heavily: maybe 500K input tokens and 250K output tokens per day across coding, writing, and research.

Cloud API costs (GPT-4o class):
- Input: ~$2.50 per 1M tokens → $1.25/day
- Output: ~$10.00 per 1M tokens → $2.50/day
- Daily total: ~$3.75
- Annual total: ~$1,370

Local inference (used RTX 3090 + electricity):
- GPU: ~$600 (one-time, used)
- Electricity (250W × 8 hours/day × $0.12/kWh): ~$88/year
- Annual running cost: ~$88 (plus the one-time $600 for the GPU in year 1)

Even factoring in the hardware cost, you break even in about 6 months. And the local setup gives you things no API can: zero data leakage, no rate limits, no surprise billing spikes, and the ability to fine-tune models on your own data.
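
Here is the same arithmetic as a small script so you can plug in your own usage and prices. The defaults reproduce the figures above, where the daily cloud cost corresponds to 500K input and 250K output tokens; the function names are made up for this example, and the model ignores the rest of the PC, idle power draw, and any change in API pricing.

```python
import math


def cloud_cost_per_day(
    input_tokens: float,
    output_tokens: float,
    input_price_per_m: float = 2.50,
    output_price_per_m: float = 10.00,
) -> float:
    """Daily API spend in dollars for the given token volumes."""
    return (
        input_tokens / 1e6 * input_price_per_m
        + output_tokens / 1e6 * output_price_per_m
    )


def electricity_cost_per_day(
    watts: float = 250, hours: float = 8, price_per_kwh: float = 0.12
) -> float:
    """Daily electricity cost in dollars for the GPU under load."""
    return watts / 1000 * hours * price_per_kwh


def break_even_days(hardware_cost: float, cloud_per_day: float, local_per_day: float) -> float:
    """Days until the hardware pays for itself; infinite if local is not cheaper."""
    daily_saving = cloud_per_day - local_per_day
    if daily_saving <= 0:
        return math.inf
    return hardware_cost / daily_saving


if __name__ == "__main__":
    cloud = cloud_cost_per_day(500_000, 250_000)
    local = electricity_cost_per_day()
    days = break_even_days(600, cloud, local)
    print(f"cloud ${cloud:.2f}/day, local ${local:.2f}/day, break-even in {days:.0f} days")
```

With the defaults the break-even lands a little under six months. Halve the daily token volume and it stretches to roughly a year, so the conclusion depends heavily on how much you actually use the model.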

Getting Started This Weekend

If you've been on the fence about local AI, here's your action plan:

1. Install Ollama: 5 minutes on any machine with 8GB+ RAM
2. Pull Qwen3 8B: ollama pull qwen3:8b (about 4.7GB download)
3. Set up Open WebUI with the Docker Compose above: 10 minutes
4. Try a real task: upload a PDF to Open WebUI and ask questions about it, or have the model review your code
5. If you're hooked, start shopping for a used RTX 3090 to unlock the larger MoE models

The gap between what you can run locally and what the cloud offers has never been narrower. The MoE architecture shift means today's single-GPU setups run models that would have been science fiction two years ago. The licensing landscape has cleared up. The tools are mature. And your data stays yours.

That's not just a cost argument — it's a control argument. In a world where AI companies are increasingly opaque about how they use your prompts, self-hosting isn't just economical. It's principled.

Have questions about setting up your own local AI stack? Drop them in the comments — I read every one. And if you found this guide useful, share it with someone who's still paying per token.