Running LLMs Locally with Ollama: Complete Setup Guide

Ivan Horvatić 25 Apr 2026 4 min read

---

**Ollama** is the easiest way to run large language models locally. Think "Docker for LLMs" — download a model, run a command, and you've got a local AI assistant. No cloud APIs, no data leaks, just your hardware.

This guide covers **everything**: installation, model selection, performance tuning, and integrating with your favorite tools.

Why Ollama?

- ✅ **One-command model downloads** (no model file hunting)
- ✅ **Automatic GPU acceleration** (CUDA, Metal, ROCm)
- ✅ **OpenAI-compatible API** (works with clients written for the OpenAI API)
- ✅ **Cross-platform** (macOS, Linux, Windows)
- ✅ **Model library** with 100+ pre-configured models

Installation (3 Minutes)

macOS

brew install ollama

Linux

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download installer: [https://ollama.com/download/windows](https://ollama.com/download/windows)

Verify Installation

ollama --version

Example output (your version number will differ): ollama version is 0.6.2

Your First Model (30 Seconds)

Download and run Mistral 7B (4.1GB)

ollama run mistral

You'll see:

pulling manifest

pulling layers... 100% [████████████]

success

Now chat:

>>> Write a Python function to calculate Fibonacci

**That's it.** Ollama handles:

- Model download (with progress bar)
- GPU detection and setup
- Server initialization
- Chat interface

Choosing the Right Model

Beginner-Friendly Models

| Model | Size | RAM | Best For | Speed |
|-------|------|-----|----------|-------|
| **Mistral 7B** | 4.1GB | 8GB | General use, fast responses | ⚡⚡⚡ |
| **Llama 3.1 8B** | 4.7GB | 8GB | Balanced quality/speed | ⚡⚡⚡ |
| **Gemma 2 9B** | 5.4GB | 12GB | Google-trained, efficient | ⚡⚡ |
| **Phi-4** | 7.9GB | 16GB | Reasoning, small but mighty | ⚡⚡ |

Advanced Models (Better Quality, More Resources)

| Model | Size | RAM | Best For | Speed |
|-------|------|-----|----------|-------|
| **Mistral 34B** | 19GB | 32GB | Near-GPT-4 quality | ⚡ |
| **Llama 3.3 70B** | 40GB | 64GB | Research, complex tasks | 🐌 |
| **DeepSeek Coder 33B** | 18GB | 32GB | Code generation | ⚡ |
| **Mixtral 8x7B** | 26GB | 48GB | Mixture-of-experts, versatile | ⚡⚡ |
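If you'd rather filter the tables than eyeball them, here is a minimal sketch. The `ModelSpec` type and `models_that_fit` helper are illustrative names of my own, and the figures are a subset copied from the tables above, so treat the result as rough guidance rather than a guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    download_gb: float
    min_ram_gb: int


# Subset of the tables above (hypothetical catalog, not an Ollama API).
CATALOG = [
    ModelSpec("Mistral 7B", 4.1, 8),
    ModelSpec("Gemma 2 9B", 5.4, 12),
    ModelSpec("Phi-4", 7.9, 16),
    ModelSpec("DeepSeek Coder 33B", 18.0, 32),
    ModelSpec("Mixtral 8x7B", 26.0, 48),
    ModelSpec("Llama 3.3 70B", 40.0, 64),
]


def models_that_fit(ram_gb: int, catalog=CATALOG) -> list[str]:
    """Return names of models whose suggested RAM is within ram_gb, largest first."""
    fitting = [m for m in catalog if m.min_ram_gb <= ram_gb]
    fitting.sort(key=lambda m: m.download_gb, reverse=True)
    return [m.name for m in fitting]
```

Calling `models_that_fit(16)` lists Phi-4 first, then Gemma 2 9B and Mistral 7B. Leave headroom for your OS and other apps before trusting the largest pick.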

Browse All Models

ollama list # Installed models

The CLI has no search command, so browse and search the full library on the website: [https://ollama.com/library](https://ollama.com/library)

Model Management

Download Models

Pull specific version

ollama pull llama3.3:70b-instruct-q4_K_M

Pull latest (default)

ollama pull mistral

List Installed Models

ollama list

Delete Models

ollama rm mistral:7b

Check Disk Usage

du -sh ~/.ollama/models/*
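The same information is available over HTTP. Here's a small sketch that reads the server's `/api/tags` endpoint and sums the reported sizes. It assumes the server is running on the default port. The helper names (`fetch_tags`, `summarize_models`, `total_gb`) are mine, not part of Ollama.

```python
import json
import urllib.request


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """GET /api/tags from a running Ollama server and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)


def summarize_models(tags: dict) -> list[tuple[str, float]]:
    """Return (name, size in GB) pairs, largest first, from an /api/tags payload."""
    rows = [(m["name"], m["size"] / 1e9) for m in tags.get("models", [])]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def total_gb(tags: dict) -> float:
    """Sum the sizes of all installed models in GB."""
    return sum(size for _, size in summarize_models(tags))
```

With the server up, `summarize_models(fetch_tags())` gives you a sorted list to decide what to `ollama rm` first.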

Running Models

Interactive Chat

ollama run llama3.3

>>> Hello!

Hello! How can I assist you today?

>>> /bye # Exit chat

API Server Mode

Start server (runs in the foreground; skip this if the desktop app or the Linux installer's systemd service is already running it)

ollama serve

Test API

curl http://localhost:11434/api/generate -d '{ "model": "mistral", "prompt": "Why is the sky blue?", "stream": false }'
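The curl call turns streaming off. With `"stream": true` the server instead sends newline-delimited JSON chunks, each carrying a piece of text in `response` and a final one with `"done": true`. Below is a minimal standard-library sketch of consuming that stream. The function names are illustrative, and it assumes a server on the default port.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def build_generate_request(prompt: str, model: str = "mistral",
                           base_url: str = "http://localhost:11434",
                           stream: bool = True) -> urllib.request.Request:
    """Build the POST request for /api/generate without sending it."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from newline-delimited JSON chunks until done is true."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break


def generate(prompt: str, **kwargs) -> str:
    """Send the request and join the streamed fragments into one string."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs), timeout=300) as resp:
        return "".join(iter_tokens(resp))
```

For a live typing effect, loop over `iter_tokens(resp)` yourself and print each fragment with `end=""` instead of joining them.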

One-Shot Prompts

ollama run mistral "Explain quantum computing in 50 words"

Performance Tuning

GPU Acceleration

- **macOS (Metal):** Automatic, no config needed
- **Linux (NVIDIA):** Requires CUDA drivers
- **Linux (AMD):** Requires ROCm

Check GPU usage:

NVIDIA

nvidia-smi

AMD

rocm-smi

macOS

sudo powermetrics --samplers gpu_power
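You can also ask Ollama itself how much of a loaded model sits on the GPU. `ollama ps` prints this, and the sketch below does the same through the `/api/ps` endpoint. I'm assuming each entry in the response carries `size` and `size_vram` byte counts, so check those field names against the API docs for your version. The helper names are mine.

```python
import json
import urllib.request


def fetch_running(base_url: str = "http://localhost:11434") -> dict:
    """GET /api/ps: the models currently loaded by the server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=10) as resp:
        return json.load(resp)


def gpu_fraction(model: dict) -> float:
    """Share of the loaded model held in GPU memory (assumed fields: size, size_vram)."""
    size = model.get("size", 0)
    return model.get("size_vram", 0) / size if size else 0.0


def describe(ps: dict) -> list[str]:
    """One human-readable line per loaded model."""
    return [f"{m['name']}: {gpu_fraction(m):.0%} on GPU" for m in ps.get("models", [])]
```

Anything well below 100% means part of the model runs on the CPU, which is usually why generation feels slow.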

Adjust Context Window

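Increase from the default (2048 tokens in many releases) to 8192 tokens. `ollama run` has no `--num_ctx` flag, so set it inside the chat session. The same value can also be set with `PARAMETER num_ctx 8192` in a Modelfile or `"options": {"num_ctx": 8192}` in an API request.

ollama run mistral

>>> /set parameter num_ctx 8192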

Larger contexts = more RAM, slower inference. Find your sweet spot.
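To put a number on "more RAM": most of the extra memory is the key/value cache, which grows linearly with context length. Here is a back-of-the-envelope sketch, assuming a 16-bit cache and Mistral 7B's published shape (32 layers, 8 key/value heads, head dimension 128). It ignores other buffers, so read it as a lower bound rather than what Ollama will report.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_value: int = 2) -> int:
    """Rough KV cache size: one key and one value tensor per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value


# Mistral 7B shape (grouped-query attention with 8 KV heads).
MISTRAL_7B = {"n_layers": 32, "n_kv_heads": 8, "head_dim": 128}


def kv_cache_gib(n_ctx: int, **shape: int) -> float:
    """Same estimate expressed in GiB."""
    return kv_cache_bytes(n_ctx=n_ctx, **shape) / 2**30
```

Under these assumptions, `kv_cache_gib(2048, **MISTRAL_7B)` comes out at 0.25 GiB and `kv_cache_gib(8192, **MISTRAL_7B)` at 1 GiB, on top of the model weights.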

Quantization (Speed vs Quality)

Models come in different quantization levels:

| Quant | Quality | Speed | RAM | Use Case |
|-------|---------|-------|-----|----------|
| **Q2_K** | ⭐⭐ | ⚡⚡⚡ | Low | Fast prototyping |
| **Q4_K_M** | ⭐⭐⭐⭐ | ⚡⚡ | Medium | **Recommended default** |
| **Q6_K** | ⭐⭐⭐⭐⭐ | ⚡ | High | Quality-sensitive work |
| **Q8_0** | ⭐⭐⭐⭐⭐ | 🐌 | Very high | Research |

**Example:**

Default (Q4_K_M)

ollama pull mistral

High quality (Q8)

ollama pull mistral:7b-instruct-q8_0
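A quick way to predict download size before pulling is parameters times bits per weight. The sketch below uses the two simplest GGUF formats, where each block of 32 weights stores one 16-bit scale (so Q4_0 costs 4.5 bits per weight and Q8_0 costs 8.5). The roughly 7.24 billion parameter count for Mistral 7B is the commonly quoted figure. K-quants such as Q4_K_M mix block types, so I've left them out rather than guess.

```python
# Bits per weight implied by the block layout, scale included.
BITS_PER_WEIGHT = {"Q4_0": 4.5, "Q8_0": 8.5, "F16": 16.0}


def approx_size_gb(n_params: float, quant: str) -> float:
    """Estimate model file size in GB (10^9 bytes) for a given quantization."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


MISTRAL_7B_PARAMS = 7.24e9  # approximate parameter count
```

`approx_size_gb(MISTRAL_7B_PARAMS, "Q4_0")` lands near 4.1 GB, and the Q8_0 estimate is close to 7.7 GB, almost double.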

Integrating with Tools

OpenAI-Compatible API

Ollama's API mimics OpenAI's format:

import openai

client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required but ignored
)

response = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Works with:

- LangChain
- LlamaIndex
- Anything built for OpenAI API
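If you don't want the `openai` package as a dependency, the compatible endpoint is plain HTTP. Here's a standard-library sketch of the same chat call. The helper names are illustrative, and the `Authorization` header is a placeholder that Ollama does not check.

```python
import json
import urllib.request


def build_chat_request(messages: list, model: str = "mistral",
                       base_url: str = "http://localhost:11434/v1") -> urllib.request.Request:
    """Build a POST to the OpenAI-style /chat/completions route without sending it."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json", "Authorization": "Bearer ollama"},
        method="POST",
    )


def first_reply(completion: dict) -> str:
    """Pull the assistant text out of an OpenAI-style completion payload."""
    return completion["choices"][0]["message"]["content"]


def chat(messages: list, **kwargs) -> str:
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(messages, **kwargs), timeout=300) as resp:
        return first_reply(json.load(resp))
```

Usage mirrors the client library: `chat([{"role": "user", "content": "Hello!"}])`.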

VS Code / Cursor

Install [Continue.dev](https://continue.dev) extension:

// ~/.continue/config.json
{
  "models": [
    {
      "title": "Ollama Mistral",
      "provider": "ollama",
      "model": "mistral"
    }
  ]
}

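Now you have a local chat model in your editor! For tab autocomplete, Continue reads a separate `tabAutocompleteModel` entry in the same file.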

LiteLLM (Unified Router)

pip install 'litellm[proxy]'

litellm --model ollama/mistral

Routes requests to Ollama, OpenAI, Claude, etc. via one API.

n8n Automation

Use the Ollama node in n8n for AI-powered workflows:

1. Add "Ollama" node
2. Set base URL: `http://localhost:11434`
3. Select model: `mistral`
4. Connect to your automation

Common Issues & Fixes

"Model not found"

Re-pull the model

ollama pull mistral

Slow inference on macOS

Check Activity Monitor → GPU → ensure "Ollama" is using GPU.

Out of memory

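Use a smaller model or a lower quant. The default `mistral` tag is already 4-bit, so switching to `q4_0` saves little; a smaller model frees far more memory

ollama pull llama3.2:3b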

Port already in use

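Change the default port (11434) with `OLLAMA_HOST`; there is no `OLLAMA_PORT` variable. Set the same value when running client commands such as `ollama run`

OLLAMA_HOST=127.0.0.1:8080 ollama serve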
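Once the server moves off the default port, your own scripts need the new address too. Here's a small sketch that turns an `OLLAMA_HOST`-style value (the variable the Ollama CLI reads) into a base URL. The function is a simplified illustration of mine, not Ollama's own parsing, and it ignores IPv6 literals.

```python
from typing import Optional
from urllib.parse import urlsplit

DEFAULT_PORT = 11434


def base_url_from_host(value: Optional[str]) -> str:
    """Turn a value like '127.0.0.1:8080' or 'http://box:9000' into a base URL."""
    if not value:
        return f"http://127.0.0.1:{DEFAULT_PORT}"
    if "://" not in value:
        value = "http://" + value  # bare host[:port] form
    parts = urlsplit(value)
    host = parts.hostname or "127.0.0.1"
    port = parts.port or DEFAULT_PORT
    return f"{parts.scheme}://{host}:{port}"
```

In a script, call it as `base_url_from_host(os.environ.get("OLLAMA_HOST"))` and pass the result wherever you would otherwise hard-code `http://localhost:11434`.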

Models won't download (firewall)

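Route downloads through your proxy. Pulls go over HTTPS and the server process does the downloading, so set `HTTPS_PROXY` where `ollama serve` runs

HTTPS_PROXY=http://proxy.example.com:8080 ollama serve

Then pull from another terminal

ollama pull mistral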

Advanced: Custom Models

Import GGUF Files

Create Modelfile

cat > Modelfile <<EOF
FROM ./custom-model.gguf
PARAMETER temperature 0.7
SYSTEM You are a helpful assistant.
EOF

Build custom model

ollama create mymodel -f Modelfile

Run it

ollama run mymodel
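If you build several variants of a model, generating the Modelfile from a script saves retyping. Here's a minimal sketch that covers only the three instructions used above (`FROM`, `PARAMETER`, `SYSTEM`). The `render_modelfile` helper is my own illustration, not an Ollama tool.

```python
from typing import Optional


def render_modelfile(base: str, parameters: Optional[dict] = None,
                     system: Optional[str] = None) -> str:
    """Render Modelfile text with FROM, PARAMETER and SYSTEM instructions."""
    lines = [f"FROM {base}"]
    for name, value in (parameters or {}).items():
        lines.append(f"PARAMETER {name} {value}")
    if system:
        # Triple quotes let the system prompt span multiple lines.
        lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"
```

Write the result to a file named `Modelfile`, then run `ollama create mymodel -f Modelfile` as before.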

Fine-Tuned Models

If you've fine-tuned a model:

1. Export to GGUF format
2. Import via Modelfile (above)
3. Optionally push to Ollama registry

Security & Privacy

- ✅ **All inference happens locally**: no data sent to cloud
- ✅ **Models stored in `~/.ollama/models/`**: encrypted at rest (if disk encryption enabled)
- ✅ **API runs on localhost** by default (not exposed to internet)

**Pro tip:** Run Ollama in a VM or container for extra isolation.

Comparing Ollama to Alternatives

| Tool | Ease of Use | Performance | Model Selection |
|------|-------------|-------------|-----------------|
| **Ollama** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| **llama.cpp** | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| **vLLM** | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| **LM Studio** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |

**Ollama wins on ease of use.** For raw speed, use vLLM. For GUI, try LM Studio.

Hardware Recommendations

If you want to upgrade your setup for better performance:

For Larger Models (30B+)

- **GPU:** [NVIDIA RTX 4070 Ti](https://www.amazon.com/s?k=rtx+4070+ti&tag=techkutak-20) (12GB VRAM; the Ti SUPER has 16GB) - Fits 7B-14B models entirely; 30B models spill into system RAM
- **GPU:** [NVIDIA RTX 4090](https://www.amazon.com/s?k=rtx+4090&tag=techkutak-20) (24GB VRAM) - Fits 30B-class models at 4-bit; 70B models (about 40GB) need partial CPU offload
- **RAM:** [Crucial 64GB DDR5 Kit](https://www.amazon.com/s?k=crucial+64gb+ddr5&tag=techkutak-20) - For system memory
- **Storage:** [Samsung 980 PRO 2TB NVMe](https://www.amazon.com/dp/B08RK2SR23?tag=techkutak-20) - Fast model loading

For Learning & Small Models

- **SBC:** [Raspberry Pi 5 (8GB)](https://www.amazon.com/dp/B0CTQ3BQLS?tag=techkutak-20) - Handles 1B-3B models comfortably; 7B models load but generate slowly
- **Storage:** [Samsung 1TB USB SSD](https://www.amazon.com/s?k=samsung+t7+1tb&tag=techkutak-20) - Portable model storage

Learning Resources

- **Book:** [Hands-On Machine Learning](https://www.amazon.com/dp/1492032646?tag=techkutak-20) - Understand how LLMs work
- **Book:** [Deep Learning with Python](https://www.amazon.com/dp/1617296864?tag=techkutak-20) - Build your own models

Next Steps

1. **Try different models:** Compare Mistral, Llama, Gemma on your tasks
2. **Integrate with tools:** VS Code, n8n, LangChain
3. **Optimize performance:** Test different quants, adjust context size
4. **Build something:** Local chatbot, code assistant, document analyzer

Resources

- [Official Ollama Docs](https://github.com/ollama/ollama/tree/main/docs)
- [Model Library](https://ollama.com/library)
- [Ollama Discord](https://discord.com/invite/ollama)
- [Continue.dev (VS Code Extension)](https://continue.dev)

---

**Got stuck?** Drop a comment or ping me — I've run Ollama on everything from Raspberry Pis to H100 clusters.

---

**💡 Affiliate Disclosure:** This article contains Amazon affiliate links. If you purchase through these links, we earn a small commission at no extra cost to you. We only recommend products we personally use and trust.