Why Run LLMs Locally?

As Large Language Models (LLMs) become more widespread, there are several good reasons to run them locally:

  • πŸ”’ Privacy: Sensitive data never leaves your environment
  • πŸ’° Cost: No per-token or API call fees
  • ⚑ Latency: Faster responses without internet dependency
  • πŸŽ›οΈ Control: Full model customization

What is Ollama?

Ollama is a tool that simplifies running LLMs locally, offering:

  • Simple CLI interface
  • Support for multiple models
  • Built-in REST API
  • Automatic resource management

Installation and Setup

1. Installation on Linux

curl -fsSL https://ollama.ai/install.sh | sh

2. Installation on macOS

brew install ollama

3. Installation on Windows

Download the installer at ollama.ai

4. Verification

ollama --version

Available Models

Model          Size     Required RAM   Recommended Use
Llama 2 7B     3.8GB    8GB            General purpose
Llama 2 13B    7.3GB    16GB           Complex tasks
Code Llama     3.8GB    8GB            Programming
Mistral 7B     4.1GB    8GB            Multilingual
Phi-2          1.7GB    4GB            Limited devices

Installing a Model

# Install Llama 2 7B
ollama pull llama2

# Install Code Llama for programming
ollama pull codellama

# Install Mistral 7B
ollama pull mistral
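
After pulling models you can confirm what is available locally. Besides ollama list on the CLI, the REST API exposes the same information through GET /api/tags; the short Python sketch below assumes the default port 11434 and a local, unauthenticated install.

import requests

# List the models installed locally via the Ollama REST API (GET /api/tags)
response = requests.get("http://localhost:11434/api/tags")
response.raise_for_status()

for model in response.json().get("models", []):
    size_gb = model.get("size", 0) / 1e9
    print(f"{model['name']}: {size_gb:.1f} GB")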

Basic Usage

1. Interactive Chat

ollama run llama2

2. Single Prompt

ollama run llama2 "Explain what DevOps is"

3. Via REST API

curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why use Kubernetes?",
  "stream": false
}'

Application Integration

Python Example

import requests

def query_ollama(prompt, model="llama2"):
    """Send a prompt to the local Ollama API and return the complete answer."""
    url = "http://localhost:11434/api/generate"
    data = {
        "model": model,
        "prompt": prompt,
        "stream": False  # return the whole answer in a single JSON object
    }

    response = requests.post(url, json=data)
    response.raise_for_status()  # fail fast on HTTP errors
    return response.json()["response"]

# Usage
result = query_ollama("How to implement CI/CD?")
print(result)
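
The example above sets "stream": False to receive the whole answer at once. By default the API streams the response as newline-delimited JSON, which is useful for showing partial output as it is generated. A minimal sketch of consuming the stream, using the same endpoint and model as above:

import json
import requests

def stream_ollama(prompt, model="llama2"):
    """Print tokens as they arrive from the local /api/generate endpoint."""
    url = "http://localhost:11434/api/generate"
    data = {"model": model, "prompt": prompt, "stream": True}

    with requests.post(url, json=data, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)  # one JSON object per line
            print(chunk.get("response", ""), end="", flush=True)
            if chunk.get("done"):
                print()
                break

# Usage
stream_ollama("Summarize the benefits of Infrastructure as Code")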

Node.js Example

const axios = require('axios');

async function queryOllama(prompt, model = 'llama2') {
    try {
        // Non-streaming request to the local Ollama API
        const response = await axios.post('http://localhost:11434/api/generate', {
            model: model,
            prompt: prompt,
            stream: false
        });

        return response.data.response;
    } catch (error) {
        console.error('Error querying Ollama:', error.message);
    }
}

// Usage
queryOllama('Explain Docker containers')
    .then(result => console.log(result));

Customization and Fine-tuning

1. Creating a Modelfile

FROM llama2

# Set temperature (creativity)
PARAMETER temperature 0.7

# Set system prompt
SYSTEM """
You are a specialist in DevOps and Cloud Computing.
Always respond in a technical and practical manner.
"""

2. Creating a Custom Model

ollama create devops-expert -f ./Modelfile
ollama run devops-expert
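
The custom model can be used through the API like any other model. Recent Ollama versions also expose a chat-style endpoint (/api/chat) that takes the conversation as a list of messages; the sketch below assumes the devops-expert model created above and the default local port.

import requests

def chat(messages, model="devops-expert"):
    """Send a chat-style conversation to the local Ollama /api/chat endpoint."""
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": model, "messages": messages, "stream": False},
    )
    response.raise_for_status()
    return response.json()["message"]["content"]

# Usage: the SYSTEM prompt from the Modelfile is applied automatically
history = [{"role": "user", "content": "How should I structure Terraform modules?"}]
print(chat(history))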

Performance and Optimization

1. Hardware Requirements

Minimum:

  • RAM: 8GB
  • CPU: 4 cores
  • Storage: 10GB free

Recommended:

  • RAM: 16GB+
  • CPU: 8+ cores
  • GPU: NVIDIA with CUDA (optional)
  • SSD: For better I/O

2. Performance Settings

# Set number of threads
export OLLAMA_NUM_THREADS=8

# Use GPU (if available)
export OLLAMA_GPU=1

# Configure memory
export OLLAMA_MAX_LOADED_MODELS=2
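
Environment variables can vary between Ollama versions, so a more portable way to tune performance is to pass per-request options in the API payload. The sketch below uses the standard model options num_thread, num_ctx, and temperature; the values are only illustrative and should be adjusted to your hardware.

import requests

# Per-request tuning via the "options" field of /api/generate
payload = {
    "model": "llama2",
    "prompt": "Why use Kubernetes?",
    "stream": False,
    "options": {
        "num_thread": 8,     # CPU threads used for generation
        "num_ctx": 4096,     # context window size (uses more RAM when larger)
        "temperature": 0.7   # sampling temperature
    },
}

response = requests.post("http://localhost:11434/api/generate", json=payload)
print(response.json()["response"])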

3. Monitoring

# List installed models
ollama list

# Show models currently loaded in memory
ollama ps

# Monitor resource usage
htop
nvidia-smi  # For GPU

Practical Use Cases

1. Code Assistant

ollama run codellama "Create a Python function to validate a CPF"

2. Log Analysis

ollama run llama2 "Analyze this error: $(cat error.log)"

3. Automated Documentation

ollama run codellama "Document this function: $(cat function.py)"

4. Code Review

ollama run codellama "Review this code and suggest improvements: $(cat script.sh)"
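
These use cases can also be scripted. For example, the log-analysis case above can be wrapped in a small Python helper that reads a file and sends it to the model; this is a sketch, and the file name and prompt wording are just placeholders.

import requests

def analyze_log(path, model="llama2"):
    """Ask a local model to analyze the contents of a log file."""
    with open(path, encoding="utf-8", errors="replace") as f:
        log_text = f.read()

    prompt = f"Analyze this error log and suggest likely causes:\n\n{log_text}"
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    )
    response.raise_for_status()
    return response.json()["response"]

# Usage
print(analyze_log("error.log"))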

Comparison with External APIs

Aspect     Local Ollama     External APIs
Privacy    ✅ Full          ❌ Limited
Cost       ✅ Free          💰 Pay per use
Latency    ⚡ Low           🌐 Variable
Models     🔄 Limited       🚀 More options
Setup      🛠️ Complex       ✅ Simple

Troubleshooting

1. Model won’t load

# Check disk space
df -h

# Check available RAM
free -h

# View Ollama logs (systemd-based Linux installs)
journalctl -u ollama -n 50 --no-pager

2. Low performance

# Check if GPU is being used
nvidia-smi

# Adjust threads
export OLLAMA_NUM_THREADS=4

3. API not responding

# Check if the service is running
ps aux | grep ollama

# Restart service
ollama serve
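
A quick programmatic health check helps distinguish "service down" from "model problem". The sketch below only verifies that the API answers on the default port, using the lightweight /api/tags endpoint; it assumes a local install with no authentication.

import requests

def ollama_is_up(base_url="http://localhost:11434"):
    """Return True if the local Ollama API answers on the default port."""
    try:
        # /api/tags simply lists installed models, so it is cheap to call
        response = requests.get(f"{base_url}/api/tags", timeout=2)
        return response.status_code == 200
    except requests.RequestException:
        return False

print("Ollama API reachable:", ollama_is_up())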

Conclusion

Ollama democratizes access to local LLMs, offering a private and cost-effective alternative to external APIs. It is ideal for:

  • Prototype development
  • Corporate environments with sensitive data
  • Learning and experimentation
  • Applications that require low latency

Next Steps

  1. Experiment with different models
  2. Integrate with your applications
  3. Explore fine-tuning
  4. Set up for production
