API-Based LLM Access with a Self-Hosted Model

Architecture Overview

Text
┌──────────────────────────────────────────────────────────┐
│                    REMOTE CLIENT                         │
│              (Laptop / Phone / Tablet)                   │
│                  Twingate Client App                     │
└──────────────────┬───────────────────────────────────────┘
                   │  Encrypted Zero Trust Tunnel

┌──────────────────────────────────────────────────────────┐
│               HOSTINGER VPS (Cloud Relay)                │
│          Docker: twingate/connector container            │
│      Authenticates via Twingate Identity Provider        │
└──────────────────┬───────────────────────────────────────┘
                   │  Secure Tunnel (No open ports needed)

┌──────────────────────────────────────────────────────────┐
│            HOME UBUNTU WORKSTATION                       │
│                                                          │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐  │
│  │  Open WebUI  │──►│    Ollama    │──►│  AMD ROCm    │  │
│  │  :9090       │   │  :11434      │   │  gfx1100     │  │
│  └──────────────┘   └──────────────┘   └──────────────┘  │
│                                                          │
│  GPU: Sapphire NITRO+ RX 7900 XTX Vapor-X (24 GB VRAM)  │
└──────────────────────────────────────────────────────────┘

Hardware Specifications

| Component   | Specification                                        |
|-------------|------------------------------------------------------|
| OS          | Ubuntu 24.04.3 LTS x86_64                            |
| Kernel      | 6.14.0-37-generic                                    |
| Motherboard | Gigabyte X870E AORUS ELITE WIFI7                     |
| CPU         | AMD Ryzen 7 7800X3D (8 cores, 16 threads @ 5.05 GHz) |
| L3 Cache    | 96 MB (3D V-Cache)                                   |
| RAM         | 64 GB DDR5                                           |
| GPU         | Sapphire NITRO+ AMD Radeon RX 7900 XTX Vapor-X       |
| VRAM        | 24 GB GDDR6                                          |
| GPU Arch    | RDNA 3 (gfx1100, 48 compute units, 2526 MHz)         |
| Shell       | Bash 5.2.21                                          |
| Resolution  | 1920x1080                                            |

Geekbench 6 OpenCL Performance

| Benchmark         | Score   | Throughput          |
|-------------------|---------|---------------------|
| Overall           | 216,220 |                     |
| Background Blur   | 100,194 | 414.7 images/sec    |
| Horizon Detection | 335,188 | 10.4 Gpixels/sec    |
| Edge Detection    | 376,170 | 14.0 Gpixels/sec    |
| Stereo Matching   | 831,163 | 790.1 Gpixels/sec   |
| Particle Physics  | 622,445 | 27,394.3 FPS        |

Installing ROCm for GPU Acceleration

The RX 7900 XTX is natively supported as gfx1100 under ROCm. This setup enables GPU-accelerated LLM inference.

Add ROCm Repository

Bash
# Update system packages
sudo apt update && sudo apt upgrade -y

# Install prerequisites
sudo apt install -y wget gnupg2

# Add ROCm repository
sudo mkdir --parents --mode=0755 /etc/apt/keyrings
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | \
  gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null

echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/6.1 noble main" | \
  sudo tee /etc/apt/sources.list.d/rocm.list

sudo apt update

Install and Verify ROCm

Bash
# Install ROCm
sudo apt install -y rocm

# Add to PATH
echo 'export PATH=/opt/rocm/bin:/opt/rocm/opencl/bin:$PATH' >> ~/.profile
source ~/.profile

# Verify GPU detection
sudo /opt/rocm/bin/rocminfo | grep gfx
# Expected: gfx1100

# Add user to required groups (log out and back in for this to take effect)
sudo usermod -aG render,video $USER

The RX 7900 XTX is recognized as gfx1100 and requires no version overrides.

Setting Up Ollama with ROCm

Install Ollama

Bash
curl -fsSL https://ollama.com/install.sh | sh

Ollama automatically detects AMD GPUs when ROCm drivers are installed. The 24 GB VRAM allows running models up to approximately 30B parameters with 4-bit quantization.
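The "approximately 30B at 4-bit" figure follows from simple arithmetic: at 4-bit quantization each weight occupies roughly half a byte, and the KV cache plus runtime overhead consume the remaining headroom. A back-of-envelope sketch (the 0.5 bytes/weight figure is an approximation; Q4_K_M is closer to 0.6):

```shell
# Back-of-envelope VRAM estimate for a quantized model:
# weights ~= parameters (in billions) * bytes per weight.
params=30   # billions of parameters
bpw=0.5     # bytes per weight at ~4-bit quantization
awk -v p="$params" -v b="$bpw" 'BEGIN {
  gb = p * b
  printf "weights: ~%.1f GB; leaves ~%.1f GB of 24 GB for KV cache/overhead\n",
         gb, 24 - gb
}'
```

Larger contexts grow the KV cache, so a model that fits at short context can still run out of VRAM at long context.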

Configure Network Binding

Ollama binds to 127.0.0.1 by default. To allow Docker containers to connect, expose it on all interfaces:

Bash
sudo systemctl edit ollama.service

Add this configuration:

Text
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

Apply changes:

Bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
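To confirm the new binding took effect, check that Ollama answers on the machine's LAN address as well as loopback. A quick sketch; the IP is queried from the host rather than hard-coded:

```shell
# Grab the first LAN address and probe the Ollama port on it.
lan_ip=$(hostname -I 2>/dev/null | awk '{print $1}')
echo "Probing http://$lan_ip:11434 ..."
curl -s "http://$lan_ip:11434" || echo "not reachable - check OLLAMA_HOST"
# A correctly bound service replies with: Ollama is running
```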

Test GPU Acceleration

Bash
# Download a model
ollama pull llama3.1:8b

# Run inference
ollama run llama3.1:8b "Explain how GPU acceleration works"

# Monitor GPU utilization
watch -n 1 rocm-smi

You should observe VRAM allocation on the GPU, confirming hardware acceleration.
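The same model is also reachable over Ollama's REST API on port 11434, which is what Open WebUI (and any remote client) ultimately talks to. A non-streaming generation request looks like this:

```shell
# JSON request against Ollama's /api/generate endpoint.
payload='{"model": "llama3.1:8b", "prompt": "Explain how GPU acceleration works", "stream": false}'
curl -s http://localhost:11434/api/generate -d "$payload" \
  || echo "request failed - is the ollama service running?"
```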

Deploying Open WebUI

Open WebUI provides a web interface for interacting with Ollama models.

Install Docker

Bash
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Enable non-root Docker usage
sudo usermod -aG docker $USER
newgrp docker

Run Open WebUI Container

Bash
docker run -d \
  -p 9090:8080 \
  --name open-webui \
  --restart unless-stopped \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v $HOME/.open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

Access the interface at http://localhost:9090. Create an admin account on first launch.

Configuring Twingate Zero Trust Access

This setup uses Twingate to provide secure remote access without opening inbound ports on the home network.

Architecture Details

  1. Hostinger VPS runs a Twingate connector that maintains an outbound connection to Twingate's control plane
  2. Home workstation can optionally run a second connector or be accessed via the VPS relay
  3. Remote clients authenticate through an identity provider and connect via the Twingate client
  4. Result: Secure access to the workstation's Open WebUI without port forwarding

Deploy Connector on Hostinger VPS

SSH into the VPS and run:

Bash
# Install Docker if needed
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Deploy Twingate connector
# Obtain credentials from Twingate Admin Console → Add Connector
docker run -d \
  --sysctl net.ipv4.ping_group_range="0 2147483647" \
  --env TWINGATE_NETWORK="<YOUR_NETWORK_NAME>" \
  --env TWINGATE_ACCESS_TOKEN="<ACCESS_TOKEN>" \
  --env TWINGATE_REFRESH_TOKEN="<REFRESH_TOKEN>" \
  --env TWINGATE_LABEL_HOSTNAME="$(hostname)" \
  --env TWINGATE_LOG_ANALYTICS="v2" \
  --name twingate-connector \
  --restart unless-stopped \
  --pull always \
  twingate/connector:1

Configure Access in Twingate Console

  1. Navigate to Remote Networks and select your network
  2. Add a Resource pointing to workstation-local-ip:9090
  3. Configure access policies and assign to appropriate user groups
  4. Install the Twingate client on devices that need access

Docker Compose Configuration

For unified management, use this docker-compose.yml:

YAML
version: "3.8"

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "9090:8080"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    volumes:
      - open-webui-data:/app/backend/data

  twingate-connector:
    image: twingate/connector:1
    container_name: twingate-home-connector
    restart: unless-stopped
    pull_policy: always
    sysctls:
      - net.ipv4.ping_group_range=0 2147483647
    environment:
      - TWINGATE_NETWORK=<YOUR_NETWORK>
      - TWINGATE_ACCESS_TOKEN=<ACCESS_TOKEN>
      - TWINGATE_REFRESH_TOKEN=<REFRESH_TOKEN>
      - TWINGATE_LABEL_HOSTNAME=home-workstation
      - TWINGATE_LOG_ANALYTICS=v2

volumes:
  open-webui-data:

Deploy with:

Bash
docker compose up -d

Verification

| Component                | Command                       | Expected Output          |
|--------------------------|-------------------------------|--------------------------|
| ROCm GPU detection       | rocminfo \| grep gfx          | gfx1100                  |
| Ollama service status    | systemctl status ollama       | active (running)         |
| GPU utilization          | rocm-smi (while model loaded) | VRAM usage on device 0   |
| Open WebUI accessibility | curl http://localhost:9090    | HTML response            |
| Twingate connectivity    | Check Twingate Admin Console  | Connector status: online |
| Remote access            | Access via Twingate client    | Open WebUI login page    |
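The local checks above can be collapsed into a small script run on the workstation. This is a sketch; it assumes local execution with the ports and names configured earlier:

```shell
# Print OK/FAIL for each local health check.
check() {
  name=$1; shift
  if "$@" >/dev/null 2>&1; then echo "$name: OK"; else echo "$name: FAIL"; fi
}
check "ROCm GPU"   sh -c 'rocminfo | grep -q gfx1100'
check "Ollama"     systemctl is-active --quiet ollama
check "Ollama API" curl -fsS http://localhost:11434/api/tags
check "Open WebUI" curl -fsS http://localhost:9090
```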

Model Performance

Tested configurations with 24 GB VRAM and 64 GB system RAM:

| Model         | Parameters | Quantization | VRAM Usage | Tokens/sec |
|---------------|------------|--------------|------------|------------|
| Llama 3.1 8B  | 8B         | Q4_K_M       | ~5 GB      | 80-100     |
| DeepSeek-R1   | 14B        | Q4_K_M       | ~9 GB      | 45-60      |
| Qwen 2.5      | 32B        | Q4_K_M       | ~20 GB     | 20-30      |
| Llama 3.1 70B | 70B        | Q4_K_M       | ~22 GB*    | 8-12       |

Key Takeaways

  • No port forwarding required: Twingate connectors establish outbound-only connections
  • Identity-based authentication: Integrates with Google Workspace, Okta, Azure AD, or any OIDC provider
  • Native GPU support: The RX 7900 XTX (gfx1100) is officially supported by ROCm without workarounds
  • Automatic GPU detection: Ollama automatically uses ROCm-compatible GPUs when available
  • VPS role: The Hostinger VPS acts as a relay and does not perform inference computations
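The throughput figures in the table above are consistent with single-GPU decoding being memory-bandwidth bound: generating each token streams the whole quantized model through the memory bus, so bandwidth divided by model size gives a rough upper bound on tokens/sec. The ~960 GB/s bandwidth figure for the RX 7900 XTX is an assumption taken from public spec sheets:

```shell
# Bandwidth-bound upper bound on decode speed: tokens/sec <= bw / model_gb.
awk 'BEGIN {
  bw = 960                              # GB/s, RX 7900 XTX memory bandwidth
  for (gb = 5; gb <= 20; gb += 5)
    printf "%2d GB model: <= ~%d tokens/sec\n", gb, bw / gb
}'
```

Measured rates land well under these ceilings, as expected once kernel launch overheads and cache effects are accounted for.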