Running LLMs Locally with Ollama: Complete Setup Guide
Ollama is the easiest way to run large language models locally. Think "Docker for LLMs" — download a model, run a command, and you've got a local AI assistant. No cloud APIs, no data leaks, just your hardware.
This guide covers everything: installation, model selection, performance tuning, and integrating with your favorite tools.
Why Ollama?
✅ One-command model downloads (no model file hunting)
✅ Automatic GPU acceleration (CUDA, Metal, ROCm)
✅ OpenAI-compatible API (drop-in replacement for ChatGPT)
✅ Cross-platform (macOS, Linux, Windows)
✅ Model library with 100+ pre-configured models
Installation (3 Minutes)
macOS
brew install ollama
Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download installer: https://ollama.com/download/windows
Verify Installation
ollama --version
# Output: ollama version 0.6.2
Your First Model (30 Seconds)
# Download and run Mistral 7B (4.1GB)
ollama run mistral
# You'll see:
# pulling manifest
# pulling layers... 100% [████████████]
# success
# Now chat:
>>> Write a Python function to calculate Fibonacci
That's it. Ollama handles:
- Model download (with progress bar)
- GPU detection and setup
- Server initialization
- Chat interface
Choosing the Right Model
Beginner-Friendly Models
Advanced Models (Better Quality, More Resources)
Browse All Models
ollama list   # Installed models
To search the full library, visit: https://ollama.com/library
Model Management
Download Models
# Pull specific version
ollama pull llama3.3:70b-instruct-q4_K_M
# Pull latest (default)
ollama pull mistral
List Installed Models
ollama list
Delete Models
ollama rm mistral:7b
Check Disk Usage
du -sh ~/.ollama/models/*
Running Models
Interactive Chat
ollama run llama3.3
>>> Hello!
Hello! How can I assist you today?
>>> /bye  # Exit chat
API Server Mode
# Start the server (runs in the foreground; the desktop app starts it automatically)
ollama serve
# Test API
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
One-Shot Prompts
ollama run mistral "Explain quantum computing in 50 words"
Performance Tuning
GPU Acceleration
macOS (Metal): Automatic, no config needed
Linux (NVIDIA): Requires CUDA drivers
Linux (AMD): Requires ROCm
Check GPU usage:
# NVIDIA
nvidia-smi
# AMD
rocm-smi
# macOS
sudo powermetrics --samplers gpu_power
Adjust Context Window
# Increase from the default 2048 to 8192 tokens (inside an interactive session)
ollama run mistral
>>> /set parameter num_ctx 8192
Larger contexts mean more RAM use and slower inference. Find your sweet spot.
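You can also set the context per request through the REST API's `options` field. A minimal standard-library sketch (the helper names here are mine, not part of Ollama):

```python
import json
import urllib.request

def build_generate_payload(model, prompt, num_ctx=8192):
    """Build a /api/generate request body with a per-request context window."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # overrides the model's default context
    }

def generate(payload, base_url="http://localhost:11434"):
    """POST the payload to a running Ollama server and return the response text."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

if __name__ == "__main__":
    payload = build_generate_payload("mistral", "Summarize: ...", num_ctx=8192)
    # Requires `ollama serve` running locally:
    # print(generate(payload))
```

The per-request override only lasts for that call, which makes it handy for the occasional long document without paying the RAM cost all the time.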
Quantization (Speed vs Quality)
Models come in different quantization levels (e.g. Q4_0, Q4_K_M, Q5_K_M, Q8_0): fewer bits per weight means a smaller, faster model at some cost in quality.
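As a rough rule of thumb, file size scales with bits per weight. A back-of-envelope sketch (the ~4.5 effective bits for Q4_K_M and ~8.5 for Q8_0 are approximations, not official figures):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Back-of-envelope GGUF size: parameters x bits per weight, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# Mistral 7B has roughly 7.24B parameters
print(round(model_size_gb(7.24e9, 4.5), 2))  # Q4_K_M: ~4.07 GB (matches the ~4.1GB download)
print(round(model_size_gb(7.24e9, 8.5), 2))  # Q8_0: ~7.69 GB
```

Same arithmetic works for VRAM planning: if the file doesn't fit in GPU memory with room to spare for the KV cache, expect layers to spill to CPU.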
Example:
# Default (Q4_K_M)
ollama pull mistral
# High quality (Q8)
ollama pull mistral:7b-instruct-q8_0
Integrating with Tools
OpenAI-Compatible API
Ollama's API mimics OpenAI's format:
import openai

client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but ignored
)

response = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Works with:
- LangChain
- LlamaIndex
- Anything built for OpenAI API
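Because the endpoint is plain HTTP, you don't even need an SDK. A standard-library sketch against the same `/v1/chat/completions` path (function names are mine):

```python
import json
import urllib.request

def build_chat_request(messages, model="mistral",
                       base_url="http://localhost:11434/v1"):
    """Build a request for Ollama's OpenAI-compatible chat endpoint."""
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps({"model": model, "messages": messages}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer ollama",  # expected by the format, ignored by Ollama
        },
    )

def chat(messages, **kwargs):
    """Send the request and return the assistant's reply (needs `ollama serve` running)."""
    with urllib.request.urlopen(build_chat_request(messages, **kwargs)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat([{"role": "user", "content": "Hello!"}])  # requires a local server
```

Any tool that lets you override the base URL can be pointed at Ollama the same way.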
VS Code / Cursor
Install Continue.dev extension:
// ~/.continue/config.json
{
  "models": [
    {
      "title": "Ollama Mistral",
      "provider": "ollama",
      "model": "mistral"
    }
  ]
}
Now you have local code autocomplete!
LiteLLM (Unified Router)
pip install litellm
litellm --model ollama/mistral
Routes requests to Ollama, OpenAI, Claude, etc. via one API.
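The core idea of a unified router can be sketched in a few lines: split a `provider/model` string and dispatch to the matching backend. This is a conceptual illustration only (LiteLLM's real implementation handles far more; the mappings below are mine):

```python
# Illustrative backend table; LiteLLM supports many more providers.
BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "openai": "https://api.openai.com/v1",
}

def resolve_route(model_string: str) -> tuple[str, str]:
    """Split 'ollama/mistral' into (backend base URL, model name)."""
    provider, _, model = model_string.partition("/")
    return BACKENDS[provider], model

print(resolve_route("ollama/mistral"))
# → ('http://localhost:11434/v1', 'mistral')
```

Swapping a local model for a hosted one then becomes a one-string change in your application.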
n8n Automation
Use the Ollama node in n8n for AI-powered workflows:
- Add the "Ollama" node
- Set base URL: http://localhost:11434
- Select model: mistral
- Connect to your automation
Common Issues & Fixes
"Model not found"
# Re-pull the model
ollama pull mistral
Slow inference on macOS
Check Activity Monitor → GPU → ensure "Ollama" is using GPU.
Out of memory
# Use smaller model or lower quant
ollama pull mistral:7b-instruct-q4_0
Port already in use
# Change the default port (11434) via OLLAMA_HOST
OLLAMA_HOST=127.0.0.1:8080 ollama serve
Models won't download (firewall)
# Use an HTTPS proxy (pulls go over HTTPS)
export HTTPS_PROXY=http://proxy.example.com:8080
ollama pull mistral
Advanced: Custom Models
Import GGUF Files
# Create Modelfile
cat > Modelfile <<EOF
FROM ./custom-model.gguf
PARAMETER temperature 0.7
SYSTEM You are a helpful assistant.
EOF
# Build custom model
ollama create mymodel -f Modelfile
# Run it
ollama run mymodel
Fine-Tuned Models
If you've fine-tuned a model:
- Export to GGUF format
- Import via Modelfile (above)
- Optionally push to Ollama registry
Security & Privacy
✅ All inference happens locally — no data sent to cloud
✅ Models stored in `~/.ollama/models/` — encrypted at rest (if disk encryption enabled)
✅ API runs on localhost by default (not exposed to internet)
Pro tip: Run Ollama in a VM or container for extra isolation.
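If you do rebind the server (e.g. when changing ports or containerizing), it's worth double-checking that the new `OLLAMA_HOST` value still binds to loopback rather than `0.0.0.0`. A quick sketch (helper name is mine; IPv6 bracket syntax not handled):

```python
import ipaddress

def is_loopback_bind(host_port: str) -> bool:
    """Return True if an OLLAMA_HOST-style 'host:port' value binds only to loopback."""
    host = host_port.rsplit(":", 1)[0] if ":" in host_port else host_port
    if host in ("localhost", ""):
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # other hostnames: assume not loopback

print(is_loopback_bind("127.0.0.1:11434"))  # → True
print(is_loopback_bind("0.0.0.0:11434"))    # → False (exposed on all interfaces)
```

Binding to `0.0.0.0` exposes an unauthenticated API to your whole network, so keep it behind a reverse proxy or firewall if you need remote access.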
Comparing Ollama to Alternatives
Ollama wins on ease of use. For raw speed, use vLLM. For GUI, try LM Studio.
Next Steps
- Try different models: Compare Mistral, Llama, Gemma on your tasks
- Integrate with tools: VS Code, n8n, LangChain
- Optimize performance: Test different quants, adjust context size
- Build something: Local chatbot, code assistant, document analyzer
Resources
Got stuck? Drop a comment or ping me — I've run Ollama on everything from Raspberry Pis to H100 clusters.
(Affiliate disclosure: Some links may include referral codes. I only recommend tools I actually use.)