GPT-OSS 120B on AI Cube Pro: Run OpenAI's Open-Source Model Locally

Timo Wevelsiep
#AI #OpenAI #GPTOSS #SelfHosting #LocalAI #GPU #AIServer #OnPremise #OpenSource

In August 2025, OpenAI released GPT-OSS 120B, their first open-weight model since GPT-2 – and it's impressive. The model achieves near-o4-mini performance yet runs entirely on your own hardware. In this article, we show you how to run GPT-OSS 120B on our AI Cube Pro.

What is GPT-OSS 120B?

GPT-OSS marks OpenAI's return to open source. After years of closed models (GPT-3, GPT-4, o-Series), OpenAI released two open-weight models under the permissive Apache 2.0 license:

  • GPT-OSS 120B: The large model with 117 billion parameters
  • GPT-OSS 20B: The smaller variant for edge devices

What makes it special: GPT-OSS uses a Mixture-of-Experts (MoE) architecture. Of the 117 billion parameters, only 5.1 billion are active per token. This makes the model extremely efficient – it runs on a single GPU with 80+ GB VRAM.

Technical Specifications

| Property | GPT-OSS 120B |
| --- | --- |
| Parameters (total) | 117 billion |
| Active parameters | 5.1 billion |
| Architecture | Mixture of Experts (MoE) |
| Context length | up to 128k tokens |
| Quantization | MXFP4 |
| VRAM requirement | ~80 GB |
| License | Apache 2.0 |
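
Where does the ~80 GB figure come from? A rough back-of-envelope calculation (our own estimate, not an official OpenAI number) shows that the MXFP4-quantized weights alone account for most of it:

```python
# Back-of-envelope VRAM estimate -- a rough sketch, not an official figure.
# MXFP4 stores weights in roughly 4.25 bits per parameter once the shared
# per-block scale factors are included.
total_params = 117e9     # total parameters of GPT-OSS 120B
bits_per_param = 4.25    # MXFP4: 4-bit values plus block scales

weight_bytes = total_params * bits_per_param / 8
print(f"Weights alone: ~{weight_bytes / 1e9:.0f} GB")  # ~62 GB

# KV cache, activations, and runtime buffers add the rest, which is how
# the practical requirement lands near the ~80 GB mark.
```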

Why GPT-OSS 120B on the AI Cube Pro?

Our AI Cube Pro with the RTX PRO 6000 Blackwell (96 GB VRAM) is perfectly suited for GPT-OSS 120B. The 96 GB VRAM provides enough headroom for the model plus context buffer.

Advantages over Cloud APIs

  1. No token costs: One-time hardware investment instead of pay-per-use
  2. Full data control: Your data never leaves your company network
  3. GDPR compliance: No data transfer to US servers
  4. Unlimited usage: No rate limits or usage caps
  5. Customizable: Fine-tuning for your own use cases is possible

Performance on the AI Cube Pro

Our benchmarks with GPT-OSS 120B on the AI Cube Pro show impressive results:

| Scenario | Throughput |
| --- | --- |
| Single user, small context | ~150 tokens/s |
| 20 users in parallel, small contexts | ~1,050 tokens/s total |
| 20 users in parallel, mixed contexts | ~300–500 tokens/s total |

These numbers make the AI Cube Pro ideal for team deployments: helpdesk bots, code review assistants, or internal knowledge bases with RAG.

The Pre-installed Software Stack

Every AI Cube Pro ships with a fully pre-configured stack:

Inference Engines

  • Ollama: Simple model management, quick start
  • vLLM: Maximum throughput, optimized for multi-user scenarios (see the client sketch below)
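
Both engines expose an OpenAI-compatible HTTP API, so existing client libraries work unchanged. Below is a minimal client sketch against a local vLLM instance; it assumes vLLM is already serving on the default port 8000 and that the model is registered as openai/gpt-oss-120b – adjust both to your setup.

```python
# Minimal client sketch against a local vLLM server (OpenAI-compatible API).
# Assumptions: vLLM serves on localhost:8000 and the model is registered
# as "openai/gpt-oss-120b" -- adjust both to your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local endpoint, not api.openai.com
    api_key="not-needed",                 # vLLM ignores the key by default
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize our VPN setup guide."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```

Ollama exposes a compatible endpoint as well (typically http://localhost:11434/v1), so the same code works there with a different base_url and model name.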

Web Interfaces

  • Open WebUI: Browser-based chat interface for day-to-day interaction
  • AnythingLLM: All-in-one workspace with built-in RAG support

Base System

  • Ubuntu Server LTS
  • Optimized CUDA drivers for RTX Blackwell
  • Docker for container deployments
  • nvidia-smi exporter for monitoring

GPT-OSS 120B in Practice

Use Case 1: Internal Helpdesk Bot

A mid-sized company uses GPT-OSS 120B as first-level support. The model answers employee questions about internal processes, IT issues, and HR topics. Through RAG integration with the internal knowledge base, it delivers context-aware answers – without sensitive company data ever leaving the firewall.
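
Stripped to its core, such a RAG flow is only a few steps: retrieve the most relevant internal documents, prepend them to the prompt, and let the model answer. The sketch below is deliberately simplified – naive keyword matching stands in for a real embedding model and vector database, and all document names and endpoints are illustrative:

```python
# Simplified RAG sketch: keyword overlap stands in for real vector search.
# A production setup would use an embedding model and a vector database
# (e.g. via AnythingLLM); endpoint and model name are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

knowledge_base = {  # illustrative stand-in for the internal knowledge base
    "vpn.md": "To set up VPN access, open a ticket with IT support and ...",
    "leave.md": "Vacation requests are submitted through the HR portal ...",
}

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the question."""
    words = set(question.lower().split())
    ranked = sorted(
        knowledge_base.values(),
        key=lambda text: len(words & set(text.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def answer(question: str) -> str:
    context = "\n---\n".join(retrieve(question))
    response = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[
            {"role": "system",
             "content": "Answer using only this internal context:\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer("How do I get VPN access?"))
```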

Use Case 2: Code Review Assistant

A development team uses GPT-OSS 120B for code reviews. The model analyzes pull requests, finds potential bugs, and suggests improvements. At up to ~150 tokens/s, the interaction feels fluid – like having an experienced senior developer looking over your shoulder.

Use Case 3: Contract Analysis

A law firm uses GPT-OSS 120B to search through thousands of pages of contract documents. The 128k token context length allows analysis of entire contracts in one pass. Client data stays on-premise – a must for attorney-client privilege.
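
Whether a given document really fits into a single pass can be estimated up front. Here is a small sketch using the common rule of thumb of roughly four characters per token for English text (the model's actual tokenizer may count differently):

```python
# Rough pre-check: does a document fit into the 128k-token context window?
# Four characters per token is a rule of thumb for English text; the
# model's tokenizer may count differently. The file path is illustrative.
CONTEXT_WINDOW = 128_000
CHARS_PER_TOKEN = 4

def fits_in_context(text: str, reserve_for_answer: int = 4_000) -> bool:
    estimated_tokens = len(text) / CHARS_PER_TOKEN
    return estimated_tokens + reserve_for_answer <= CONTEXT_WINDOW

with open("contract.txt", encoding="utf-8") as f:
    contract = f.read()

print("Fits in one pass:", fits_in_context(contract))
```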

Getting Started with GPT-OSS 120B

After delivery of the AI Cube Pro, getting started is simple:

  1. Connect: Plug in power and network
  2. Power on: The system boots with pre-configured stack
  3. Load model: Download GPT-OSS 120B via Ollama or vLLM (see the sketch after this list)
  4. Start using: Interact with the model through Open WebUI or AnythingLLM
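
For step 3 (and a first programmatic test), here is a minimal sketch using the official ollama Python client (pip install ollama). It assumes the Ollama service is running on the appliance and that the model is available under the tag gpt-oss:120b; check ollama list for the exact tag on your system.

```python
# Quick-start sketch with the ollama Python client (pip install ollama).
# Assumes the Ollama service runs locally and the model tag is
# "gpt-oss:120b" -- verify the tag with `ollama list`.
import ollama

ollama.pull("gpt-oss:120b")  # one-time download of the model weights

reply = ollama.chat(
    model="gpt-oss:120b",
    messages=[{"role": "user", "content": "Hello! What can you run locally?"}],
)
print(reply["message"]["content"])
```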

Optionally, we can configure the following before delivery:

  • SSO integration (LDAP, SAML, OAuth)
  • Prometheus/Grafana monitoring
  • RAG pipelines with your data sources
  • Backup strategies

Conclusion

GPT-OSS 120B is a milestone: OpenAI's first real open-source model in years, with o4-mini-level performance. On the AI Cube Pro, it runs stably and quickly – and, most importantly, completely under your control.

No cloud dependency. No token costs. No privacy concerns.

Interested? Schedule a consultation or check out our AI Cube models.


Select one or more areas where we can support you.