Why Run LLMs Locally?
As Large Language Models (LLMs) become more widespread, there are several compelling reasons to run them locally:
- Privacy: sensitive data never leaves your environment
- Cost: no per-token or API-call fees
- Latency: fast responses without internet dependency
- Control: full customization of the model
What is Ollama?
Ollama is a tool that simplifies running LLMs locally, offering:
- Simple CLI interface
- Support for multiple models
- Built-in REST API
- Automatic resource management
Installation and Setup
1. Installation on Linux
```bash
curl -fsSL https://ollama.ai/install.sh | sh
```
2. Installation on macOS
```bash
brew install ollama
```
3. Installation on Windows
Download the installer at ollama.ai
4. Verification
```bash
ollama --version
```
Available Models
Popular Models
| Model | Size | Required RAM | Recommended Use |
|---|---|---|---|
| Llama 2 7B | 3.8GB | 8GB | General purpose |
| Llama 2 13B | 7.3GB | 16GB | Complex tasks |
| Code Llama | 3.8GB | 8GB | Programming |
| Mistral 7B | 4.1GB | 8GB | Multilingual |
| Phi-2 | 1.7GB | 4GB | Limited devices |
Installing a Model
```bash
# Install Llama 2 7B
ollama pull llama2

# Install Code Llama for programming
ollama pull codellama

# Install Mistral 7B
ollama pull mistral
```
Basic Usage
1. Interactive Chat
```bash
ollama run llama2
```
2. Single Prompt
```bash
ollama run llama2 "Explain what DevOps is"
```
3. Via REST API
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why use Kubernetes?",
  "stream": false
}'
```
Application Integration
Python Example
```python
import requests
import json

def query_ollama(prompt, model="llama2"):
    url = "http://localhost:11434/api/generate"
    data = {
        "model": model,
        "prompt": prompt,
        "stream": False
    }
    response = requests.post(url, json=data)
    return json.loads(response.text)["response"]

# Usage
result = query_ollama("How to implement CI/CD?")
print(result)
```
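Both the curl call and the function above set `"stream"` to `false`, so the whole answer arrives in a single JSON object. For long answers it is often nicer to print tokens as they are generated; below is a minimal sketch of how the newline-delimited streaming responses from `/api/generate` could be consumed (the helper name `stream_ollama` is just illustrative):

```python
import json
import requests

def stream_ollama(prompt, model="llama2"):
    """Stream a response from the local Ollama API, printing tokens as they arrive."""
    url = "http://localhost:11434/api/generate"
    payload = {"model": model, "prompt": prompt, "stream": True}
    with requests.post(url, json=payload, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)              # one JSON object per line
            print(chunk.get("response", ""), end="", flush=True)
            if chunk.get("done"):                 # last chunk signals completion
                print()
                break

stream_ollama("Explain blue-green deployments")
```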
Node.js Example
```javascript
const axios = require('axios');

async function queryOllama(prompt, model = 'llama2') {
  try {
    const response = await axios.post('http://localhost:11434/api/generate', {
      model: model,
      prompt: prompt,
      stream: false
    });
    return response.data.response;
  } catch (error) {
    console.error('Error:', error);
  }
}

// Usage
queryOllama('Explain Docker containers')
  .then(result => console.log(result));
```
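The examples above use the `/api/generate` endpoint. Ollama's built-in REST API also exposes a chat-style endpoint that works with a list of role/content messages, which is convenient for multi-turn conversations. A hedged Python sketch, assuming a recent Ollama version that provides the documented `/api/chat` endpoint:

```python
import requests

def chat_ollama(messages, model="llama2"):
    """Send a role/content message list to the local chat endpoint."""
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": model, "messages": messages, "stream": False},
    )
    response.raise_for_status()
    return response.json()["message"]["content"]

history = [{"role": "user", "content": "What is infrastructure as code?"}]
answer = chat_ollama(history)
print(answer)

# Keep the reply in the history so follow-up questions have context
history.append({"role": "assistant", "content": answer})
history.append({"role": "user", "content": "Give a short Terraform example"})
print(chat_ollama(history))
```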
Customization and Fine-tuning
1. Creating a Modelfile
```
FROM llama2

# Set temperature (creativity)
PARAMETER temperature 0.7

# Set system prompt
SYSTEM """
You are a specialist in DevOps and Cloud Computing.
Always respond in a technical and practical manner.
"""
```
2. Creating a Custom Model
```bash
ollama create devops-expert -f ./Modelfile
ollama run devops-expert
```
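Once created, the custom model can be queried like any other, including over the REST API. A short sketch, assuming the `devops-expert` model from the step above exists locally:

```python
import requests

# Query the custom model created above (assumes `ollama create devops-expert` succeeded)
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "devops-expert",  # name passed to `ollama create`
        "prompt": "Design a CI/CD pipeline for a containerized application",
        "stream": False,
    },
)
print(response.json()["response"])
```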
Performance and Optimization
1. Hardware Requirements
Minimum:
- RAM: 8GB
- CPU: 4 cores
- Storage: 10GB free
Recommended:
- RAM: 16GB+
- CPU: 8+ cores
- GPU: NVIDIA with CUDA (optional)
- SSD: For better I/O
2. Performance Settings
```bash
# Set number of threads
export OLLAMA_NUM_THREADS=8

# Use GPU (if available)
export OLLAMA_GPU=1

# Configure memory
export OLLAMA_MAX_LOADED_MODELS=2
```
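Generation parameters can also be tuned per request instead of globally, by passing an `options` object to the REST API; the keys mirror the `PARAMETER` names used in a Modelfile. A hedged sketch with a few commonly used options:

```python
import requests

# Pass Modelfile-style parameters per request via the "options" field
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": "Summarize the benefits of GitOps",
        "stream": False,
        "options": {
            "temperature": 0.2,  # lower temperature for more deterministic output
            "num_ctx": 4096,     # context window size
            "num_thread": 8,     # CPU threads to use for this request
        },
    },
)
print(response.json()["response"])
```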
3. Monitoring
```bash
# List installed models
ollama list

# Monitor resource usage
htop
nvidia-smi  # For GPU
```
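For scripted monitoring, the same model inventory that `ollama list` prints is available over the REST API from the `/api/tags` endpoint; a small sketch:

```python
import requests

# GET /api/tags returns the models installed locally
models = requests.get("http://localhost:11434/api/tags").json()["models"]
for m in models:
    size_gb = m["size"] / 1024 ** 3  # size is reported in bytes
    print(f'{m["name"]:30s} {size_gb:6.1f} GB')
```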
Practical Use Cases
1. Code Assistant
```bash
ollama run codellama "Create a Python function to validate a CPF"
```
2. Log Analysis
```bash
ollama run llama2 "Analyze this error: $(cat error.log)"
```
3. Automated Documentation
```bash
ollama run codellama "Document this function: $(cat function.py)"
```
4. Code Review
```bash
ollama run codellama "Review this code and suggest improvements: $(cat script.sh)"
```
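Any of these one-liners can also be embedded in an application. As an illustration, a sketch of the log-analysis case in Python (`error.log` is a placeholder path):

```python
import requests

# Read a log file and ask the local model to analyze it (error.log is a placeholder)
with open("error.log") as f:
    log_excerpt = f.read()[-4000:]  # keep the prompt short: last ~4000 characters

prompt = f"Analyze this error log and suggest a likely root cause:\n\n{log_excerpt}"
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": prompt, "stream": False},
)
print(response.json()["response"])
```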
Comparison with External APIs
| Aspect | Local Ollama | External APIs |
|---|---|---|
| Privacy | Full | Limited |
| Cost | Free | Pay per use |
| Latency | Low | Variable |
| Models | Limited selection | More options |
| Setup | Complex | Simple |
Troubleshooting
1. Model won’t load
```bash
# Check disk space
df -h

# Check available RAM
free -h

# Check the Ollama server logs (Linux with systemd)
journalctl -u ollama
```
2. Low performance
```bash
# Check if GPU is being used
nvidia-smi

# Adjust threads
export OLLAMA_NUM_THREADS=4
```
3. API not responding
```bash
# Check if the service is running
ps aux | grep ollama

# Start the server if it is not running
ollama serve
```
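These checks can be wrapped into a small health-check script, for example as part of a deployment pipeline. A minimal sketch, relying on the fact that the root endpoint of a running Ollama server returns a short status message:

```python
import sys
import requests

# Simple health check for a local Ollama server
try:
    r = requests.get("http://localhost:11434", timeout=5)
    r.raise_for_status()
    print("Ollama is reachable:", r.text.strip())
except requests.RequestException as exc:
    print("Ollama is not responding, start it with `ollama serve`:", exc)
    sys.exit(1)
```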
Conclusion
Ollama democratizes access to local LLMs, offering a private and cost-effective alternative to external APIs. It is ideal for:
- Prototype development
- Corporate environments with sensitive data
- Learning and experimentation
- Applications that require low latency
Next Steps
- Experiment with different models
- Integrate with your applications
- Explore fine-tuning
- Set up for production
Additional Resources: