Running LLMs Locally with Ollama: Complete Setup Guide
---
**Ollama** is the easiest way to run large language models locally. Think "Docker for LLMs" — download a model, run a command, and you've got a local AI assistant. No cloud APIs, no data leaks, just your hardware.
This guide covers **everything**: installation, model selection, performance tuning, and integrating with your favorite tools.
Why Ollama?
- ✅ **One-command model downloads** (no model file hunting)
- ✅ **Automatic GPU acceleration** (CUDA, Metal, ROCm)
- ✅ **OpenAI-compatible API** (works with existing OpenAI client libraries)
- ✅ **Cross-platform** (macOS, Linux, Windows)
- ✅ **Model library** with 100+ pre-configured models
Installation (3 Minutes)
macOS
brew install ollama
Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download installer: [https://ollama.com/download/windows](https://ollama.com/download/windows)
Verify Installation
ollama --version
Output: ollama version 0.6.2
Your First Model (30 Seconds)
Download and run Mistral 7B (4.1GB)
ollama run mistral
You'll see:
pulling manifest
pulling layers... 100% [████████████]
success
Now chat:
>>> Write a Python function to calculate Fibonacci
**That's it.** Ollama handles:
- Model download (with progress bar)
- GPU detection and setup
- Server initialization
- Chat interface
Choosing the Right Model
Beginner-Friendly Models
| Model | Size | RAM | Best For | Speed |
|-------|------|-----|----------|-------|
| **Mistral 7B** | 4.1GB | 8GB | General use, fast responses | ⚡⚡⚡ |
| **Llama 3.1 8B** | 4.7GB | 8GB | Balanced quality/speed | ⚡⚡⚡ |
| **Gemma 2 9B** | 5.4GB | 12GB | Google-trained, efficient | ⚡⚡ |
| **Phi-4** | 7.9GB | 16GB | Reasoning, small but mighty | ⚡⚡ |
Advanced Models (Better Quality, More Resources)
| Model | Size | RAM | Best For | Speed |
|-------|------|-----|----------|-------|
| **Mistral 34B** | 19GB | 32GB | Near-GPT-4 quality | ⚡ |
| **Llama 3.3 70B** | 40GB | 64GB | Research, complex tasks | 🐌 |
| **DeepSeek Coder 33B** | 18GB | 32GB | Code generation | ⚡ |
| **Mixtral 8x7B** | 26GB | 48GB | Mixture-of-experts, versatile | ⚡⚡ |
Browse All Models
ollama list # Installed models
The CLI has no `search` subcommand. To browse or search the library, visit: [https://ollama.com/library](https://ollama.com/library)
Model Management
Download Models
Pull specific version
ollama pull llama3.3:70b-instruct-q4_K_M
Pull latest (default)
ollama pull mistral
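A model reference such as `llama3.3:70b-instruct-q4_K_M` is a name plus a tag, and omitting the tag is equivalent to asking for `latest`. The sketch below splits a reference that way. The helper names `parse_model_ref` and `guess_quant` are hypothetical, and the quantization guess relies on the informal convention that tags end in a `q...` segment, which not every tag follows.

```python
from __future__ import annotations


def parse_model_ref(ref: str) -> tuple[str, str]:
    """Split 'name:tag' into (name, tag); a missing tag means 'latest'."""
    name, sep, tag = ref.partition(":")
    return name, (tag if sep and tag else "latest")


def guess_quant(tag: str) -> str | None:
    """Best-effort guess: return the trailing 'q<digit>...' segment of a tag.

    Assumption: tags follow the informal '<size>-<variant>-<quant>' pattern.
    Returns None when the last segment does not look like a quant level.
    """
    last = tag.rsplit("-", 1)[-1]
    if len(last) > 1 and last[0].lower() == "q" and last[1].isdigit():
        return last
    return None


name, tag = parse_model_ref("llama3.3:70b-instruct-q4_K_M")
print(name, tag, guess_quant(tag))
```

This does not handle references that include a registry host with a port, since those contain an extra colon.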
List Installed Models
ollama list
Delete Models
ollama rm mistral:7b
Check Disk Usage
du -sh ~/.ollama/models/*
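If you would rather see sizes per model than per file on disk, the local server reports them through its `/api/tags` endpoint. This is a minimal sketch, assuming the server is reachable at the default `http://localhost:11434`; the function names are hypothetical, and the sample data at the bottom is made up to show the shape of the result.

```python
from __future__ import annotations

import json
import urllib.request


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Ask a running Ollama server for its installed models (needs the server up)."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return json.load(resp)


def summarize_sizes(tags: dict) -> list[tuple[str, float]]:
    """Return (model name, size in GB) pairs, largest first."""
    rows = [(m["name"], m["size"] / 1e9) for m in tags.get("models", [])]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up sample in the shape /api/tags returns (sizes are in bytes).
sample = {
    "models": [
        {"name": "mistral:latest", "size": 4_100_000_000},
        {"name": "llama3.3:latest", "size": 42_000_000_000},
    ]
}
for model_name, size_gb in summarize_sizes(sample):
    print(f"{model_name:<20} {size_gb:6.1f} GB")
```

With a server running, `summarize_sizes(fetch_tags())` gives the same view for your real models.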
Running Models
Interactive Chat
ollama run llama3.3
>>> Hello!
Hello! How can I assist you today?
>>> /bye # Exit chat
API Server Mode
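Start server (runs in the foreground; the desktop app and the Linux install script's systemd service already run it in the background)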
ollama serve
Test API
curl http://localhost:11434/api/generate -d '{ "model": "mistral", "prompt": "Why is the sky blue?", "stream": false }'
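The same request can be made from Python with only the standard library. This is a sketch rather than an official client: the function names are hypothetical, and it assumes the server is listening on the default `http://localhost:11434`. With `"stream": true` the endpoint returns one JSON object per line, each carrying a `response` fragment and a `done` flag, so `join_stream` shows how those fragments fit back together.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address; adjust if you changed it


def build_generate_request(model, prompt, stream=False, base_url=OLLAMA_URL):
    """Build (but do not send) a POST request for /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": stream})
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt):
    """Send a non-streaming request and return the generated text."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.load(resp)["response"]


def join_stream(lines):
    """Concatenate 'response' fragments from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```

Call `generate("mistral", "Why is the sky blue?")` once the server is running; nothing in the block contacts the server until you do.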
One-Shot Prompts
ollama run mistral "Explain quantum computing in 50 words"
Performance Tuning
GPU Acceleration
- **macOS (Metal):** Automatic, no config needed
- **Linux (NVIDIA):** Requires CUDA drivers
- **Linux (AMD):** Requires ROCm
Check GPU usage:
NVIDIA
nvidia-smi
AMD
rocm-smi
macOS
sudo powermetrics --samplers gpu_power
Adjust Context Window
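Increase from default 2048 to 8192 tokens (`ollama run` has no `--num_ctx` flag; set the parameter inside the interactive session, or use `PARAMETER num_ctx 8192` in a Modelfile)
ollama run mistral
>>> /set parameter num_ctx 8192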
Larger contexts = more RAM, slower inference. Find your sweet spot.
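To put a rough number on the RAM cost, you can estimate the key/value cache, which grows linearly with context length. The sketch below is back-of-the-envelope arithmetic, not what Ollama reports: the defaults (32 layers, 8 KV heads, head dimension 128) are assumptions taken from Mistral 7B's published architecture, and it assumes 16-bit cache entries. Actual usage also includes compute buffers and depends on cache settings.

```python
def kv_cache_bytes(num_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Estimate KV-cache size: one key and one value vector per layer per token.

    Defaults are assumed Mistral 7B values with a 16-bit cache; swap in the
    numbers for your model.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * num_ctx


for ctx in (2048, 8192, 32768):
    print(f"num_ctx={ctx:>6}: {kv_cache_bytes(ctx) / 2**20:,.0f} MiB")
```

Under these assumptions the cache costs 128 KiB per token, so 2048 tokens take 256 MiB and 8192 tokens take 1 GiB on top of the model weights.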
Quantization (Speed vs Quality)
Models come in different quantization levels:
| Quant | Quality | Speed | RAM | Use Case |
|-------|---------|-------|-----|----------|
| **Q2_K** | ⭐⭐ | ⚡⚡⚡ | Low | Fast prototyping |
| **Q4_K_M** | ⭐⭐⭐⭐ | ⚡⚡ | Medium | **Recommended default** |
| **Q6_K** | ⭐⭐⭐⭐⭐ | ⚡ | High | Quality-sensitive work |
| **Q8_0** | ⭐⭐⭐⭐⭐ | 🐌 | Very high | Research |
**Example:**
Default (Q4_K_M)
ollama pull mistral
High quality (Q8)
ollama pull mistral:7b-instruct-q8_0
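The download size of a quantized model follows from its parameter count and the bits stored per weight. The sketch below is an estimate under stated assumptions: the bits-per-weight figures come from the GGUF block layouts for these three formats, the 7.24 billion parameter count for Mistral 7B is an assumption from its model card, and real files differ slightly because some tensors are kept at higher precision.

```python
# Effective bits per weight, including per-block scale factors.
BITS_PER_WEIGHT = {
    "Q4_0": 4.5,     # 32 four-bit weights + one 16-bit scale per block
    "Q6_K": 6.5625,  # 210 bytes per 256-weight super-block
    "Q8_0": 8.5,     # 32 eight-bit weights + one 16-bit scale per block
}

MISTRAL_7B_PARAMS = 7.24e9  # assumed parameter count


def estimate_size_gb(n_params, quant):
    """Approximate file size in GB for a model quantized to `quant`."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{estimate_size_gb(MISTRAL_7B_PARAMS, quant):.1f} GB")
```

The 4-bit estimate lands at about 4.1 GB, in line with the Mistral 7B size quoted earlier, while Q8_0 comes out near 7.7 GB, which is why higher-quality quants need noticeably more RAM.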
Integrating with Tools
OpenAI-Compatible API
Ollama's API mimics OpenAI's format:
import openai

client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required but ignored
)

response = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
Works with:
- LangChain
- LlamaIndex
- Anything built for the OpenAI API
VS Code / Cursor
Install [Continue.dev](https://continue.dev) extension:
// ~/.continue/config.json
{
  "models": [
    {
      "title": "Ollama Mistral",
      "provider": "ollama",
      "model": "mistral"
    }
  ]
}
Now you have a local chat model in your editor! Tab autocomplete is configured separately in Continue, under `tabAutocompleteModel`.
LiteLLM (Unified Router)
pip install 'litellm[proxy]'
litellm --model ollama/mistral
Routes requests to Ollama, OpenAI, Claude, etc. via one API.
n8n Automation
Use the Ollama node in n8n for AI-powered workflows:
1. Add "Ollama" node
2. Set base URL: `http://localhost:11434`
3. Select model: `mistral`
4. Connect to your automation
Common Issues & Fixes
"Model not found"
Re-pull the model
ollama pull mistral
Slow inference on macOS
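Open Activity Monitor and choose Window → GPU History to confirm the GPU is busy during generation, or run `ollama ps` and check that the PROCESSOR column reports GPU rather than CPU.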
Out of memory
Use smaller model or lower quant
ollama pull mistral:7b-instruct-q4_0
Port already in use
Change default port (11434) via OLLAMA_HOST
OLLAMA_HOST=127.0.0.1:8080 ollama serve
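Ollama takes its listen address from the `OLLAMA_HOST` environment variable, so a client script can read the same variable and follow the server when the port changes. This is a minimal sketch with a hypothetical helper name; it handles `host`, `host:port`, and `scheme://host:port` values, and it does not attempt IPv6 literals or trailing paths.

```python
import os

DEFAULT_HOST = "127.0.0.1"
DEFAULT_PORT = 11434


def ollama_base_url(env=None):
    """Derive the API base URL from OLLAMA_HOST, falling back to the defaults."""
    env = os.environ if env is None else env
    raw = env.get("OLLAMA_HOST", "").strip()
    if not raw:
        return f"http://{DEFAULT_HOST}:{DEFAULT_PORT}"
    scheme, sep, rest = raw.partition("://")
    if not sep:  # no scheme given, e.g. "127.0.0.1:8080"
        scheme, rest = "http", raw
    host, _, port = rest.partition(":")
    return f"{scheme}://{host or DEFAULT_HOST}:{port or DEFAULT_PORT}"


print(ollama_base_url({"OLLAMA_HOST": "127.0.0.1:8080"}))
```

Pass the result as the base URL of whatever client you use, for example with `/v1` appended for OpenAI-compatible clients.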
Models won't download (firewall)
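Use a proxy via HTTPS_PROXY (model pulls go over HTTPS, and the server process performs the download, so set the variable where the server starts)
HTTPS_PROXY=http://proxy.example.com:8080 ollama serve
ollama pull mistral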
Advanced: Custom Models
Import GGUF Files
Create Modelfile
cat > Modelfile <<EOF
FROM ./custom-model.gguf
PARAMETER temperature 0.7
SYSTEM You are a helpful assistant.
EOF
Build custom model
ollama create mymodel -f Modelfile
Run it
ollama run mymodel
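If you build several variants of a custom model, generating the Modelfile from a script keeps them consistent. The sketch below uses a hypothetical helper, `render_modelfile`, and emits only the `FROM`, `PARAMETER`, and `SYSTEM` instructions shown above. It wraps the system prompt in triple quotes, which the Modelfile format accepts for multi-line text, and assumes the prompt itself contains no triple quotes.

```python
def render_modelfile(base, system, parameters=None):
    """Return Modelfile text for a base model, system prompt, and parameters."""
    lines = [f"FROM {base}"]
    for key, value in (parameters or {}).items():
        lines.append(f"PARAMETER {key} {value}")
    # Triple quotes let the system prompt span several lines.
    lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    "./custom-model.gguf",
    "You are a helpful assistant.",
    {"temperature": 0.7, "num_ctx": 8192},
)
print(modelfile_text)
```

Write the returned text to a file named `Modelfile` and pass it to `ollama create` as before.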
Fine-Tuned Models
If you've fine-tuned a model:
1. Export to GGUF format
2. Import via Modelfile (above)
3. Optionally push to Ollama registry
Security & Privacy
- ✅ **All inference happens locally**: no data sent to the cloud
- ✅ **Models stored in `~/.ollama/models/`**: plain files on disk, encrypted at rest only if you enable disk encryption
- ✅ **API runs on localhost** by default (not exposed to the network unless you change `OLLAMA_HOST`)
**Pro tip:** Run Ollama in a VM or container for extra isolation.
Comparing Ollama to Alternatives
| Tool | Ease of Use | Performance | Model Selection |
|------|-------------|-------------|-----------------|
| **Ollama** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| **llama.cpp** | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| **vLLM** | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| **LM Studio** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
**Ollama wins on ease of use.** For raw speed, use vLLM. For GUI, try LM Studio.
Hardware Recommendations
If you want to upgrade your setup for better performance:
For Larger Models (30B+)
- **GPU:** [NVIDIA RTX 4070 Ti](https://www.amazon.com/s?k=rtx+4070+ti&tag=techkutak-20) (12GB VRAM; the 4070 Ti Super has 16GB) - Fits 7B to 14B models entirely in VRAM; 30B-class models at 4-bit (about 18-19GB) spill into system RAM
- **GPU:** [NVIDIA RTX 4090](https://www.amazon.com/s?k=rtx+4090&tag=techkutak-20) (24GB VRAM) - Fits 30B-class models at 4-bit; 70B models (about 40GB) need partial CPU offload and run noticeably slower
- **RAM:** [Crucial 64GB DDR5 Kit](https://www.amazon.com/s?k=crucial+64gb+ddr5&tag=techkutak-20) - For system memory
- **Storage:** [Samsung 980 PRO 2TB NVMe](https://www.amazon.com/dp/B08RK2SR23?tag=techkutak-20) - Fast model loading
For Learning & Small Models
- **SBC:** [Raspberry Pi 5 (8GB)](https://www.amazon.com/dp/B0CTQ3BQLS?tag=techkutak-20) - Can load 7B models at 4-bit for testing, though CPU-only generation is slow; 1B to 3B models are more practical
- **Storage:** [Samsung 1TB USB SSD](https://www.amazon.com/s?k=samsung+t7+1tb&tag=techkutak-20) - Portable model storage
Learning Resources
- **Book:** [Hands-On Machine Learning](https://www.amazon.com/dp/1492032646?tag=techkutak-20) - Understand how LLMs work
- **Book:** [Deep Learning with Python](https://www.amazon.com/dp/1617296864?tag=techkutak-20) - Build your own models
Next Steps
1. **Try different models:** Compare Mistral, Llama, Gemma on your tasks
2. **Integrate with tools:** VS Code, n8n, LangChain
3. **Optimize performance:** Test different quants, adjust context size
4. **Build something:** Local chatbot, code assistant, document analyzer
Resources
- [Official Ollama Docs](https://github.com/ollama/ollama/tree/main/docs)
- [Model Library](https://ollama.com/library)
- [Ollama Discord](https://discord.com/invite/ollama)
- [Continue.dev (VS Code Extension)](https://continue.dev)
---
**Got stuck?** Drop a comment or ping me — I've run Ollama on everything from Raspberry Pis to H100 clusters.
---
**💡 Affiliate Disclosure:** This article contains Amazon affiliate links. If you purchase through these links, we earn a small commission at no extra cost to you. We only recommend products we personally use and trust.