WZ-IT Logo

Self-Hosted ChatGPT: vLLM with Qwen3.5-122B on Two RTX PRO 6000 Blackwell GPUs

Timo Wevelsiep
Timo Wevelsiep
#vLLM #Qwen #SelfHosted #RTXPRO6000 #Blackwell #OpenWebUI #LLM #GDPR

Editorial note: The information in this article was compiled to the best of our knowledge at the time of publication. Technical details, prices, versions, licensing terms, and external content may change. Please verify the information provided independently, particularly before making business-critical or security-related decisions. This article does not replace individual professional, legal, or tax advice.

Self-Hosted ChatGPT: vLLM with Qwen3.5-122B on Two RTX PRO 6000 Blackwell GPUs

Have vLLM set up and operated - WZ-IT sets up production LLM inference, including on your own GPU hardware: model selection, tensor parallelism, OpenWebUI integration, monitoring and operations. Schedule a free consultation

Your own ChatGPT, running entirely in-house: a familiar chat interface, a large language model on your own hardware behind it, without a single prompt going to an external service. This is no longer a future scenario but readily achievable with current hardware and open-source software.

This article shows what such a setup looks like in practice: the inference framework vLLM, which distributes a 122-billion-parameter model across two NVIDIA RTX PRO 6000 Blackwell GPUs via tensor parallelism, plus OpenWebUI as the web interface. Instead of a polished glossy guide, the focus is on the points that actually cost time in practice: the right model choice, vLLM parameters, a typical startup gotcha with hybrid models, memory planning and clean operations.

Table of Contents

What This Setup Delivers

The goal is a data-sovereign AI platform: a ChatGPT-like web interface with a large, locally running language model, on-premise and without cloud dependency. Two open-source components carry it:

  • vLLM as the inference server. It loads the model, distributes it across the GPUs and provides an OpenAI-compatible API. Existing tools and SDKs talk to the local model without rework.
  • OpenWebUI as the web interface. Login, user management, chat, document upload with RAG. For users it feels like a familiar chat frontend.

The appeal lies in the combination of convenience and control: employees get a familiar interface, while the company retains full authority over model, data and infrastructure.

The Architecture: Two Servers, Cleanly Separated

Separating compute-intensive inference from the lightweight frontend onto two machines has proven its worth:

Role Task Profile
Inference backend vLLM, model serving, OpenAI API on port 8000 GPU server with two large cards
Web frontend OpenWebUI on port 3000, users and sessions CPU server, no GPU needed

This separation has tangible advantages: both components can be updated, restarted and secured independently. The GPU backend stays lean and does one thing only, the frontend can address multiple backends or models. Communication runs over the internal network, secured with an API key.

Hardware and Drivers: RTX PRO 6000 Blackwell

The basis is two NVIDIA RTX PRO 6000 Blackwell with 96 GB GDDR7 each (with ECC). Together that is 192 GB of VRAM, enough for a 122B model in FP8 plus cache.

A current driver stack is important for Blackwell GPUs. In practice this setup runs with the NVIDIA driver from the 580 branch and CUDA 13. Older drivers do not yet know the Blackwell architecture or lack the necessary FP8 support. Add a current Docker stack and the NVIDIA Container Toolkit so the GPUs are visible inside the container.

One detail that matters later: for tensor parallelism the two GPU worker processes must communicate via shared memory. In the container this means --ipc=host and a generously sized --shm-size.

The Model: Qwen3.5-122B-A10B-FP8

The model in use is Qwen/Qwen3.5-122B-A10B-FP8, released under the Apache 2.0 license. Its key data explain why it fits this hardware well:

  • Mixture of experts (MoE): around 122 billion total parameters, but only about 10 billion active per token (hence "A10B"). This delivers the quality of a large model at the compute load of a much smaller one.
  • FP8 quantization: the weights take up around 125 GB instead of roughly 245 GB in BF16. Only this makes operation on two 96 GB cards realistic.
  • Hybrid architecture: Qwen3.5 mixes Gated DeltaNet layers (a linear attention variant with a recurrent state) and classic full-attention layers in a 3:1 ratio. This significantly lowers the KV cache requirement compared to a pure attention model and makes long contexts on this hardware practical in the first place.
  • Reasoning mode: Qwen3.5 generates a thinking block before the actual answer by default. This can be disabled per request via enable_thinking: false.

A common misconception: the native context window is not 64k but 256k tokens (considerably more via YaRN). 64k is a deliberate limit in the setup, not the model's limit.

Configuring vLLM: Tensor Parallelism and FP8

The core is a Docker Compose file for vLLM. The decisive parameter is --tensor-parallel-size 2: vLLM splits the single model across both GPUs (tensor parallelism, not two copies). The API key comes via an environment variable, never as plaintext in the file.

services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm-qwen
    restart: unless-stopped
    gpus: all
    ipc: host
    shm_size: "64gb"
    ports:
      - "8000:8000"
    environment:
      HF_HOME: "/models/huggingface"
      VLLM_API_KEY: "${VLLM_API_KEY}"
    volumes:
      - /srv/hf-cache:/models/huggingface
      - /srv/vllm-cache:/root/.cache/vllm
    command:
      - "--model"
      - "Qwen/Qwen3.5-122B-A10B-FP8"
      - "--served-model-name"
      - "qwen35-122b"
      - "--tensor-parallel-size"
      - "2"
      - "--max-model-len"
      - "65536"
      - "--gpu-memory-utilization"
      - "0.90"
      - "--max-num-seqs"
      - "16"
      - "--reasoning-parser"
      - "qwen3"

The most important levers:

  • --max-model-len 65536 deliberately limits context to 64k. More context costs more cache and reduces the possible parallelism.
  • --gpu-memory-utilization 0.90 reserves 90 percent of VRAM for vLLM. The vLLM default is 0.92; 0.90 leaves a little more headroom.
  • --max-num-seqs 16 is the point where most setups fail on the first start. See the next section.
  • --reasoning-parser qwen3 makes vLLM cleanly separate Qwen3.5's thinking block from the answer.

The Gotcha: max-num-seqs with Hybrid Models

The typical first start attempt ends with an abort pointing to too many parallel sequences and missing state cache blocks. The reason is the hybrid architecture.

max_num_seqs sets how many sequences vLLM processes at most concurrently. The current V1 engine default is 1024. For pure attention models this is usually unproblematic. For hybrid models like Qwen3.5, however, the number of recurrent-state slots depends on this value, and vLLM manages these slots via its Mamba cache subsystem (which is why "Mamba" terms appear at startup, even though this is not a classic Mamba model). If the available state cache blocks are not enough for 1024 sequences, the start aborts.

The fix is to explicitly lower max_num_seqs. For the first successful start a conservative value makes sense:

--max-num-seqs 16

This is no loss: 16 concurrent sequences are plenty for many internal scenarios, and the value can be raised in a controlled way later. It is important not to confuse this state-slot topic with a classic VRAM bottleneck. If vLLM later runs out of memory for the KV cache during inference, the lever is a different one (smaller context or fewer parallel sequences), not raising max_num_seqs further.

Where the Model Lands: Cache and Storage Planning

A question that comes up regularly in operations: where does vLLM actually download the model to? The answer is the HuggingFace cache. The path is set via HF_HOME, here to a mounted volume inside the container. The model then lands in a structure like:

$HF_HOME/hub/models--Qwen--Qwen3.5-122B-A10B-FP8/
  ├── blobs/
  ├── snapshots/
  └── refs/

During the download, temporary files sit there and only transition into the final blobs once fully loaded. As long as incomplete files exist, vLLM keeps downloading, even when the GPUs already hold memory. This is normal and not a hang.

For planning this means concretely: the FP8 weights take up around 125 GB. The cache should live on fast storage (NVMe), with plenty of headroom for future models. A download of this size takes noticeable time depending on the connection, and a second model change should not fail on disk space.

Connecting OpenWebUI as the Web Interface

On the frontend server, OpenWebUI runs as a second container. It is attached to vLLM via the OpenAI-compatible API. Ollama is deliberately disabled, since inference runs entirely through vLLM.

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    environment:
      WEBUI_NAME: "Internal AI"
      WEBUI_AUTH: "true"
      WEBUI_SECRET_KEY: "${OPENWEBUI_SECRET_KEY}"
      ENABLE_OPENAI_API: "true"
      # IP/hostname of the inference server, placeholder here
      OPENAI_API_BASE_URL: "http://10.0.0.11:8000/v1"
      OPENAI_API_KEY: "${VLLM_API_KEY}"
      DEFAULT_MODELS: "qwen35-122b"
      ENABLE_OLLAMA_API: "false"
      ENABLE_SIGNUP: "false"
      RAG_FULL_CONTEXT: "false"
    volumes:
      - open-webui-data:/app/backend/data
volumes:
  open-webui-data:

Three points from practice:

  • Self-registration off: ENABLE_SIGNUP: "false" prevents arbitrary users from signing up. The admin creates users.
  • RAG instead of full context: with RAG_FULL_CONTEXT: "false", uploaded documents are processed via retrieval. Loading the entire document content into the context only makes sense for individual small files, otherwise it quickly blows the context window.
  • CORS and URL: as soon as the interface is reachable via a fixed hostname or later over HTTPS, WEBUI_URL and CORS_ALLOW_ORIGIN must be set to exactly those addresses, otherwise WebSocket connections and streaming break.

Security and Operations

A working setup is not the same as a production-ready one. Three topics need clarifying before rollout:

Network and access. The vLLM endpoint on port 8000 must not sit open on the network. It should only be reachable for the frontend and authorized admin networks. OpenWebUI on port 3000 belongs behind a VPN or at least in a segmented network. Firewall and network segmentation are a deliberate task here, not a side effect of the installation.

Reverse proxy and TLS. As a first step, access often runs directly on port 3000 without encryption. For production operation a reverse proxy with TLS belongs in front. Important here: enable WebSocket forwarding and disable proxy buffering for streaming, otherwise output stalls.

Backup and updates. OpenWebUI's users, chats and settings live in a Docker volume. This volume belongs in a regular, tested backup strategy; a backup without a restore test is not a backup. Equally important: a plan for updates and ongoing CVE monitoring for the containers in use.

Our Approach at WZ-IT

We have implemented a setup like this in practice, and the path from "runs on the bench" to "runs cleanly in operation" is the actual effort. Here is how we approach it:

  1. Architecture and sizing. Clarify model choice, quantization, context length, GPU topology and realistic user count before anything is installed. That prevents most later surprises.

  2. Setup on your or our hardware. We set up vLLM and OpenWebUI cleanly, whether on your existing GPU servers (bring your own infrastructure) or on a GPU server or AI Cube operated by us. Tensor parallelism, FP8, cache paths and API integration included.

  3. Hardening and handover. Access concept, reverse proxy, backup, monitoring and complete operational documentation. So the system remains operable even without us.

  4. Managed operations on request. Updates, model changes, CVE tracking and support via our managed AI service, if you do not want to handle ongoing operations yourself.

The value lies not in a license but in clean design and reliable operations. With large, self-hosted models, that is exactly the difference between a demo and a platform that employees can rely on.

Further Reading


Own GPU hardware but no team for LLM operations? We set up vLLM and OpenWebUI production-ready, document the setup and take over operations on request - GDPR-compliant and without data leaving your house. Schedule an intro call

As of June 2026. Version and hardware details refer to the stated state. Open-source projects and models evolve quickly - for the latest details, check the official vLLM and model documentation.

Sources

Frequently Asked Questions

Answers to important questions about this topic

Ollama is ideal for development and single users. As soon as many employees access a large model concurrently, vLLM is the better choice: it is built for high throughput (PagedAttention, continuous batching) and distributes a model that does not fit on one GPU across multiple cards via tensor parallelism. For a 122B model on two GPUs, there is practically no way around vLLM or a comparable serving engine.

Qwen/Qwen3.5-122B-A10B-FP8 is a mixture-of-experts model with around 122 billion total and about 10 billion active parameters. In FP8 the weights take up roughly 125 GB. Together with KV and state cache this fits well into 2x 96 GB VRAM (192 GB) if you size context length and parallelism sensibly.

Qwen3.5 is a hybrid model with a recurrent-state component. vLLM manages this state via its Mamba cache subsystem, and the number of state slots depends on max_num_seqs. If max_num_seqs is too high (the V1 engine default is 1024), the available state cache blocks are not enough and vLLM aborts at startup. The fix is to explicitly lower max_num_seqs to a realistic value such as 16.

Natively Qwen3.5-122B-A10B supports up to 256k tokens, and considerably more via YaRN scaling. In a production setup you deliberately limit the context length via --max-model-len, for example to 64k, because larger context costs more cache memory and reduces parallelism. So 64k is a setup decision, not the model's limit.

vLLM downloads the model on first start via the HuggingFace Hub into the local cache, by default under the path configured via HF_HOME, in the structure hub/models--Qwen--Qwen3.5-122B-A10B-FP8/. During the download temporary files sit there; the FP8 weights take up around 125 GB. The cache should live on fast, sufficiently large storage.

Yes. The entire setup - model, inference and web interface - runs on your own hardware. There is no third-country transfer and no data processing agreement with a US provider. Employees use a ChatGPT-like interface without prompts or documents leaving the company network.

Both. If you already have GPU servers, the setup is built on them (bring your own infrastructure). If you do not want to operate your own hardware, you can rent and have GPU servers or an AI Cube operated. What matters is clean configuration and ongoing operations, not who owns the cards.

Timo Wevelsiep

Written by

Timo Wevelsiep

Co-Founder & CEO

Co-Founder of WZ-IT. Specialized in cloud infrastructure, open-source platforms and managed services for SMEs and enterprise clients worldwide.

LinkedIn

Let's Talk About Your Idea

Whether a specific IT challenge or just an idea - we look forward to the exchange. In a brief conversation, we'll evaluate together if and how your project fits with WZ-IT.

E-Mail
[email protected]

Leading companies trust WZ-IT

  • Rekorder
  • Keymate
  • Führerscheinmacher
  • SolidProof
  • ARGE
  • Boese VA
  • NextGym
  • Maho Management
  • Golem.de
  • Millenium
  • Paritel
  • Yonju
  • EVADXB
  • Mr. Clipart
  • Aphy
  • Negosh
  • ABCO Water
Timo Wevelsiep & Robin Zins - CEOs of WZ-IT

Timo Wevelsiep & Robin Zins

Managing Directors of WZ-IT

1/3 – Topic Selection33%

What is your inquiry about?

Select one or more areas where we can support you.