API-Based LLM Access with a Self-Hosted Model

Architecture Overview

Text
┌──────────────────────────────────────────────────────────┐
│                    REMOTE CLIENT                         │
│              (Laptop / Phone / Tablet)                   │
│                  Twingate Client App                     │
└──────────────────┬───────────────────────────────────────┘
                   │  Encrypted Zero Trust Tunnel

┌──────────────────────────────────────────────────────────┐
│               HOSTINGER VPS (Cloud Relay)                │
│          Docker: twingate/connector container            │
│      Authenticates via Twingate Identity Provider        │
└──────────────────┬───────────────────────────────────────┘
                   │  Secure Tunnel (No open ports needed)

┌──────────────────────────────────────────────────────────┐
│            HOME UBUNTU WORKSTATION                       │
│                                                          │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐  │
│  │  Open WebUI  │──►│    Ollama    │──►│  AMD ROCm    │  │
│  │  :9090       │   │  :11434      │   │  gfx1100     │  │
│  └──────────────┘   └──────────────┘   └──────────────┘  │
│                                                          │
│  GPU: Sapphire NITRO+ RX 7900 XTX Vapor-X (24 GB VRAM)  │
└──────────────────────────────────────────────────────────┘

Hardware Specifications

| Component   | Specification                                        |
|-------------|------------------------------------------------------|
| OS          | Ubuntu 24.04.3 LTS x86_64                            |
| Kernel      | 6.14.0-37-generic                                    |
| Motherboard | Gigabyte X870E AORUS ELITE WIFI7                     |
| CPU         | AMD Ryzen 7 7800X3D (8 cores, 16 threads @ 5.05 GHz) |
| L3 Cache    | 96 MB (3D V-Cache)                                   |
| RAM         | 64 GB DDR5                                           |
| GPU         | Sapphire NITRO+ AMD Radeon RX 7900 XTX Vapor-X       |
| VRAM        | 24 GB GDDR6                                          |
| GPU Arch    | RDNA 3 (gfx1100, 48 compute units, 2526 MHz)         |
| Shell       | Bash 5.2.21                                          |
| Resolution  | 1920x1080                                            |

Geekbench 6 OpenCL Performance

| Benchmark         | Score   | Throughput          |
|-------------------|---------|---------------------|
| Overall           | 216,220 |                     |
| Background Blur   | 100,194 | 414.7 images/sec    |
| Horizon Detection | 335,188 | 10.4 Gpixels/sec    |
| Edge Detection    | 376,170 | 14.0 Gpixels/sec    |
| Stereo Matching   | 831,163 | 790.1 Gpixels/sec   |
| Particle Physics  | 622,445 | 27,394.3 FPS        |

Installing ROCm for GPU Acceleration

The RX 7900 XTX is natively supported as gfx1100 under ROCm. This setup enables GPU-accelerated LLM inference.

Add ROCm Repository

Bash
# Update system packages
sudo apt update && sudo apt upgrade -y

# Install prerequisites
sudo apt install -y wget gnupg2

# Add ROCm repository
sudo mkdir --parents --mode=0755 /etc/apt/keyrings
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | \
  gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null

echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/6.1 noble main" | \
  sudo tee /etc/apt/sources.list.d/rocm.list

sudo apt update

Install and Verify ROCm

Bash
# Install ROCm
sudo apt install -y rocm

# Add to PATH
echo 'export PATH=/opt/rocm/bin:/opt/rocm/opencl/bin:$PATH' >> ~/.profile
source ~/.profile

# Verify GPU detection
sudo /opt/rocm/bin/rocminfo | grep gfx
# Expected: gfx1100

# Add user to required groups (log out and back in for this to take effect)
sudo usermod -aG render,video $USER

The RX 7900 XTX is recognized as gfx1100 and requires no version overrides.

Setting Up Ollama with ROCm

Install Ollama

Bash
curl -fsSL https://ollama.com/install.sh | sh

Ollama automatically detects AMD GPUs when ROCm drivers are installed. The 24 GB VRAM allows running models up to approximately 30B parameters with 4-bit quantization.
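The "approximately 30B at 4-bit" figure follows from simple arithmetic: at 4-bit quantization each weight occupies roughly half a byte, and the KV cache plus runtime overhead consume the remaining headroom. A back-of-envelope sketch (the 0.5 bytes/weight figure is an approximation; Q4_K_M is closer to 0.6):

```shell
# Back-of-envelope VRAM estimate for a quantized model:
# weights ~= parameters (in billions) * bytes per weight.
params=30   # billions of parameters
bpw=0.5     # bytes per weight at ~4-bit quantization
awk -v p="$params" -v b="$bpw" 'BEGIN {
  gb = p * b
  printf "weights: ~%.1f GB; leaves ~%.1f GB of 24 GB for KV cache/overhead\n",
         gb, 24 - gb
}'
```

Larger contexts grow the KV cache, so a model that fits at short context can still run out of VRAM at long context.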

Configure Network Binding

Ollama binds to 127.0.0.1 by default. To allow Docker containers to connect, expose it on all interfaces:

Bash
sudo systemctl edit ollama.service

Add this configuration:

Text
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

Apply changes:

Bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
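To confirm the new binding took effect, check that Ollama answers on the machine's LAN address as well as loopback. A quick sketch; the IP is queried from the host rather than hard-coded:

```shell
# Grab the first LAN address and probe the Ollama port on it.
lan_ip=$(hostname -I 2>/dev/null | awk '{print $1}')
echo "Probing http://$lan_ip:11434 ..."
curl -s "http://$lan_ip:11434" || echo "not reachable - check OLLAMA_HOST"
# A correctly bound service replies with: Ollama is running
```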

Test GPU Acceleration

Bash
# Download a model
ollama pull llama3.1:8b

# Run inference
ollama run llama3.1:8b "Explain how GPU acceleration works"

# Monitor GPU utilization
watch -n 1 rocm-smi

You should observe VRAM allocation on the GPU, confirming hardware acceleration.
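The same model is also reachable over Ollama's REST API on port 11434, which is what Open WebUI (and any remote client) ultimately talks to. A non-streaming generation request looks like this:

```shell
# JSON request against Ollama's /api/generate endpoint.
payload='{"model": "llama3.1:8b", "prompt": "Explain how GPU acceleration works", "stream": false}'
curl -s http://localhost:11434/api/generate -d "$payload" \
  || echo "request failed - is the ollama service running?"
```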

Deploying Open WebUI

Open WebUI provides a web interface for interacting with Ollama models.

Install Docker

Bash
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Enable non-root Docker usage
sudo usermod -aG docker $USER
newgrp docker

Run Open WebUI Container

Bash
docker run -d \
  -p 9090:8080 \
  --name open-webui \
  --restart unless-stopped \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v $HOME/.open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

Access the interface at http://localhost:9090. Create an admin account on first launch.

Configuring Twingate Zero Trust Access

This setup uses Twingate to provide secure remote access without opening inbound ports on the home network.

Architecture Details

  1. Hostinger VPS runs a Twingate connector that maintains an outbound connection to Twingate's control plane
  2. Home workstation can optionally run a second connector or be accessed via the VPS relay
  3. Remote clients authenticate through an identity provider and connect via the Twingate client
  4. Result: Secure access to the workstation's Open WebUI without port forwarding

Deploy Connector on Hostinger VPS

SSH into the VPS and run:

Bash
# Install Docker if needed
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Deploy Twingate connector
# Obtain credentials from Twingate Admin Console → Add Connector
docker run -d \
  --sysctl net.ipv4.ping_group_range="0 2147483647" \
  --env TWINGATE_NETWORK="<YOUR_NETWORK_NAME>" \
  --env TWINGATE_ACCESS_TOKEN="<ACCESS_TOKEN>" \
  --env TWINGATE_REFRESH_TOKEN="<REFRESH_TOKEN>" \
  --env TWINGATE_LABEL_HOSTNAME="$(hostname)" \
  --env TWINGATE_LOG_ANALYTICS="v2" \
  --name twingate-connector \
  --restart unless-stopped \
  --pull always \
  twingate/connector:1

Configure Access in Twingate Console

  1. Navigate to Remote Networks and select your network
  2. Add a Resource pointing to workstation-local-ip:9090
  3. Configure access policies and assign to appropriate user groups
  4. Install the Twingate client on devices that need access

Docker Compose Configuration

For unified management, use this docker-compose.yml:

YAML
version: "3.8"

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "9090:8080"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    volumes:
      - open-webui-data:/app/backend/data

  twingate-connector:
    image: twingate/connector:1
    container_name: twingate-home-connector
    restart: unless-stopped
    pull_policy: always
    sysctls:
      - net.ipv4.ping_group_range=0 2147483647
    environment:
      - TWINGATE_NETWORK=<YOUR_NETWORK>
      - TWINGATE_ACCESS_TOKEN=<ACCESS_TOKEN>
      - TWINGATE_REFRESH_TOKEN=<REFRESH_TOKEN>
      - TWINGATE_LABEL_HOSTNAME=home-workstation
      - TWINGATE_LOG_ANALYTICS=v2

volumes:
  open-webui-data:

Deploy with:

Bash
docker compose up -d

Verification

| Component                | Command                       | Expected Output          |
|--------------------------|-------------------------------|--------------------------|
| ROCm GPU detection       | rocminfo \| grep gfx          | gfx1100                  |
| Ollama service status    | systemctl status ollama       | active (running)         |
| GPU utilization          | rocm-smi (while model loaded) | VRAM usage on device 0   |
| Open WebUI accessibility | curl http://localhost:9090    | HTML response            |
| Twingate connectivity    | Check Twingate Admin Console  | Connector status: online |
| Remote access            | Access via Twingate client    | Open WebUI login page    |
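The local checks above can be collapsed into a small script run on the workstation. This is a sketch; it assumes local execution with the ports and names configured earlier:

```shell
# Print OK/FAIL for each local health check.
check() {
  name=$1; shift
  if "$@" >/dev/null 2>&1; then echo "$name: OK"; else echo "$name: FAIL"; fi
}
check "ROCm GPU"   sh -c 'rocminfo | grep -q gfx1100'
check "Ollama"     systemctl is-active --quiet ollama
check "Ollama API" curl -fsS http://localhost:11434/api/tags
check "Open WebUI" curl -fsS http://localhost:9090
```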

Model Performance

Tested configurations with 24 GB VRAM and 64 GB system RAM:

| Model         | Parameters | Quantization | VRAM Usage | Tokens/sec |
|---------------|------------|--------------|------------|------------|
| Llama 3.1 8B  | 8B         | Q4_K_M       | ~5 GB      | 80-100     |
| DeepSeek-R1   | 14B        | Q4_K_M       | ~9 GB      | 45-60      |
| Qwen 2.5      | 32B        | Q4_K_M       | ~20 GB     | 20-30      |
| Llama 3.1 70B | 70B        | Q4_K_M       | ~22 GB*    | 8-12       |

Key Takeaways

  • No port forwarding required: Twingate connectors establish outbound-only connections
  • Identity-based authentication: Integrates with Google Workspace, Okta, Azure AD, or any OIDC provider
  • Native GPU support: The RX 7900 XTX (gfx1100) is officially supported by ROCm without workarounds
  • Automatic GPU detection: Ollama automatically uses ROCm-compatible GPUs when available
  • VPS role: The Hostinger VPS acts as a relay and does not perform inference computations
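The throughput figures in the table above are consistent with single-GPU decoding being memory-bandwidth bound: generating each token streams the whole quantized model through the memory bus, so bandwidth divided by model size gives a rough upper bound on tokens/sec. The ~960 GB/s bandwidth figure for the RX 7900 XTX is an assumption taken from public spec sheets:

```shell
# Bandwidth-bound upper bound on decode speed: tokens/sec <= bw / model_gb.
awk 'BEGIN {
  bw = 960                              # GB/s, RX 7900 XTX memory bandwidth
  for (gb = 5; gb <= 20; gb += 5)
    printf "%2d GB model: <= ~%d tokens/sec\n", gb, bw / gb
}'
```

Measured rates land well under these ceilings, as expected once kernel launch overheads and cache effects are accounted for.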