Why Run LLMs Locally?
As Large Language Models (LLMs) become more widespread, there are several compelling reasons to run them locally:
- Privacy: sensitive data never leaves your environment
- Cost: no per-token or API-call fees
- Latency: fast responses without internet dependency
- Control: full customization of the model
What is Ollama?
Ollama is a tool that simplifies running LLMs locally, offering:
- Simple CLI interface
- Support for multiple models
- Built-in REST API
- Automatic resource management
Installation and Setup
1. Installation on Linux
```bash
curl -fsSL https://ollama.ai/install.sh | sh
```
2. Installation on macOS
```bash
brew install ollama
```
3. Installation on Windows
Download the installer at ollama.ai
4. Verification
```bash
ollama --version
```
Available Models
Popular Models
| Model | Size | Required RAM | Recommended Use |
|---|---|---|---|
| Llama 2 7B | 3.8GB | 8GB | General purpose |
| Llama 2 13B | 7.3GB | 16GB | Complex tasks |
| Code Llama | 3.8GB | 8GB | Programming |
| Mistral 7B | 4.1GB | 8GB | Multilingual |
| Phi-2 | 1.7GB | 4GB | Limited devices |
Installing a Model
```bash
# Install Llama 2 7B
ollama pull llama2

# Install Code Llama for programming
ollama pull codellama

# Install Mistral 7B
ollama pull mistral
```
Basic Usage
1. Interactive Chat
```bash
ollama run llama2
```
2. Single Prompt
```bash
ollama run llama2 "Explain what DevOps is"
```
3. Via REST API
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why use Kubernetes?",
  "stream": false
}'
```
Application Integration
Python Example
```python
import requests
import json

def query_ollama(prompt, model="llama2"):
    url = "http://localhost:11434/api/generate"
    data = {
        "model": model,
        "prompt": prompt,
        "stream": False
    }
    response = requests.post(url, json=data)
    return json.loads(response.text)["response"]

# Usage
result = query_ollama("How to implement CI/CD?")
print(result)
```
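Both the curl call and the function above set `"stream"` to `false`, so the whole answer arrives in a single JSON object. For long answers it is often nicer to print tokens as they are generated; below is a minimal sketch of how the newline-delimited streaming responses from `/api/generate` could be consumed (the helper name `stream_ollama` is just illustrative):

```python
import json
import requests

def stream_ollama(prompt, model="llama2"):
    """Stream a response from the local Ollama API, printing tokens as they arrive."""
    url = "http://localhost:11434/api/generate"
    payload = {"model": model, "prompt": prompt, "stream": True}
    with requests.post(url, json=payload, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)              # one JSON object per line
            print(chunk.get("response", ""), end="", flush=True)
            if chunk.get("done"):                 # last chunk signals completion
                print()
                break

stream_ollama("Explain blue-green deployments")
```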
Node.js Example
```javascript
const axios = require('axios');

async function queryOllama(prompt, model = 'llama2') {
  try {
    const response = await axios.post('http://localhost:11434/api/generate', {
      model: model,
      prompt: prompt,
      stream: false
    });
    return response.data.response;
  } catch (error) {
    console.error('Error:', error);
  }
}

// Usage
queryOllama('Explain Docker containers')
  .then(result => console.log(result));
```
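The examples above use the `/api/generate` endpoint. Ollama's built-in REST API also exposes a chat-style endpoint that works with a list of role/content messages, which is convenient for multi-turn conversations. A hedged Python sketch, assuming a recent Ollama version that provides the documented `/api/chat` endpoint:

```python
import requests

def chat_ollama(messages, model="llama2"):
    """Send a role/content message list to the local chat endpoint."""
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": model, "messages": messages, "stream": False},
    )
    response.raise_for_status()
    return response.json()["message"]["content"]

history = [{"role": "user", "content": "What is infrastructure as code?"}]
answer = chat_ollama(history)
print(answer)

# Keep the reply in the history so follow-up questions have context
history.append({"role": "assistant", "content": answer})
history.append({"role": "user", "content": "Give a short Terraform example"})
print(chat_ollama(history))
```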
Customization and Fine-tuning
1. Creating a Modelfile
```
FROM llama2

# Set temperature (creativity)
PARAMETER temperature 0.7

# Set system prompt
SYSTEM """
You are a specialist in DevOps and Cloud Computing.
Always respond in a technical and practical manner.
"""
```
2. Creating a Custom Model
```bash
ollama create devops-expert -f ./Modelfile
ollama run devops-expert
```
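Once created, the custom model can be queried like any other, including over the REST API. A short sketch, assuming the `devops-expert` model from the step above exists locally:

```python
import requests

# Query the custom model created above (assumes `ollama create devops-expert` succeeded)
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "devops-expert",  # name passed to `ollama create`
        "prompt": "Design a CI/CD pipeline for a containerized application",
        "stream": False,
    },
)
print(response.json()["response"])
```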
Performance and Optimization
1. Hardware Requirements
Minimum:
- RAM: 8GB
- CPU: 4 cores
- Storage: 10GB free
Recommended:
- RAM: 16GB+
- CPU: 8+ cores
- GPU: NVIDIA with CUDA (optional)
- SSD: For better I/O
2. Performance Settings
```bash
# Set number of threads
export OLLAMA_NUM_THREADS=8

# Use GPU (if available)
export OLLAMA_GPU=1

# Configure memory
export OLLAMA_MAX_LOADED_MODELS=2
```
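Generation parameters can also be tuned per request instead of globally, by passing an `options` object to the REST API; the keys mirror the `PARAMETER` names used in a Modelfile. A hedged sketch with a few commonly used options:

```python
import requests

# Pass Modelfile-style parameters per request via the "options" field
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": "Summarize the benefits of GitOps",
        "stream": False,
        "options": {
            "temperature": 0.2,  # lower temperature for more deterministic output
            "num_ctx": 4096,     # context window size
            "num_thread": 8,     # CPU threads to use for this request
        },
    },
)
print(response.json()["response"])
```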
3. Monitoring
```bash
# List installed models
ollama list

# Monitor resource usage
htop
nvidia-smi  # For GPU
```
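For scripted monitoring, the same model inventory that `ollama list` prints is available over the REST API from the `/api/tags` endpoint; a small sketch:

```python
import requests

# GET /api/tags returns the models installed locally
models = requests.get("http://localhost:11434/api/tags").json()["models"]
for m in models:
    size_gb = m["size"] / 1024 ** 3  # size is reported in bytes
    print(f'{m["name"]:30s} {size_gb:6.1f} GB')
```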
Practical Use Cases
1. Code Assistant
```bash
ollama run codellama "Create a Python function to validate a CPF"
```
2. Log Analysis
```bash
ollama run llama2 "Analyze this error: $(cat error.log)"
```
3. Automated Documentation
```bash
ollama run codellama "Document this function: $(cat function.py)"
```
4. Code Review
```bash
ollama run codellama "Review this code and suggest improvements: $(cat script.sh)"
```
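Any of these one-liners can also be embedded in an application. As an illustration, a sketch of the log-analysis case in Python (`error.log` is a placeholder path):

```python
import requests

# Read a log file and ask the local model to analyze it (error.log is a placeholder)
with open("error.log") as f:
    log_excerpt = f.read()[-4000:]  # keep the prompt short: last ~4000 characters

prompt = f"Analyze this error log and suggest a likely root cause:\n\n{log_excerpt}"
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": prompt, "stream": False},
)
print(response.json()["response"])
```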
Comparison with External APIs
| Aspect | Local Ollama | External APIs |
|---|---|---|
| Privacy | Full | Limited |
| Cost | Free | Pay per use |
| Latency | Low | Variable |
| Models | Limited selection | More options |
| Setup | Complex | Simple |
Troubleshooting
1. Model won’t load
```bash
# Check disk space
df -h

# Check available RAM
free -h

# Check the Ollama server logs (Linux with systemd)
journalctl -u ollama
```
2. Low performance
```bash
# Check if GPU is being used
nvidia-smi

# Adjust threads
export OLLAMA_NUM_THREADS=4
```
3. API not responding
```bash
# Check if the service is running
ps aux | grep ollama

# Start the server if it is not running
ollama serve
```
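These checks can be wrapped into a small health-check script, for example as part of a deployment pipeline. A minimal sketch, relying on the fact that the root endpoint of a running Ollama server returns a short status message:

```python
import sys
import requests

# Simple health check for a local Ollama server
try:
    r = requests.get("http://localhost:11434", timeout=5)
    r.raise_for_status()
    print("Ollama is reachable:", r.text.strip())
except requests.RequestException as exc:
    print("Ollama is not responding, start it with `ollama serve`:", exc)
    sys.exit(1)
```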
Conclusion
Ollama democratizes access to local LLMs, offering a private and cost-effective alternative to external APIs. It is ideal for:
- Prototype development
- Corporate environments with sensitive data
- Learning and experimentation
- Applications that require low latency
Next Steps
- Experiment with different models
- Integrate with your applications
- Explore fine-tuning
- Set up for production
Additional Resources: