
GPT-OSS 120B on AI Cube Pro: Run OpenAI's Open-Source Model Locally

Timo Wevelsiep
#AI #OpenAI #GPTOSS #SelfHosting #LocalAI #GPU #AIServer #OnPremise #OpenSource

Editorial note: The information in this article was compiled to the best of our knowledge at the time of publication. Technical details, prices, versions, licensing terms, and external content may change. Please verify the information provided independently, particularly before making business-critical or security-related decisions. This article does not replace individual professional, legal, or tax advice.


In August 2025, OpenAI released GPT-OSS, its first open-weight models since GPT-2 – and the larger one is impressive. GPT-OSS 120B achieves near-o4-mini performance yet runs entirely on your own hardware. In this article, we show you how to run GPT-OSS 120B on our AI Cube Pro.

What is GPT-OSS 120B?

GPT-OSS marks OpenAI's return to open source. After years of closed models (GPT-3, GPT-4, the o-series), OpenAI released two open-weight models under the permissive Apache 2.0 license:

  • GPT-OSS 120B: The large model with 117 billion parameters
  • GPT-OSS 20B: The smaller variant for edge devices

What makes it special: GPT-OSS uses a Mixture-of-Experts (MoE) architecture. Of the 117 billion parameters, only 5.1 billion are active per token. This makes the model extremely efficient – it runs on a single GPU with 80+ GB VRAM.
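The efficiency claim is easy to sanity-check with a bit of arithmetic. A back-of-envelope sketch (our own assumptions, not official measurements: MXFP4 packs weights as 4-bit values with one shared 8-bit scale per 32-element block, i.e. roughly 4.25 bits per parameter, and we ignore non-quantized layers and runtime overhead):

```python
# Back-of-envelope estimate of GPT-OSS 120B's footprint (illustrative only).
TOTAL_PARAMS = 117e9     # total parameters
ACTIVE_PARAMS = 5.1e9    # parameters active per token (MoE routing)
MXFP4_BITS = 4 + 8 / 32  # 4-bit values + one 8-bit scale per 32-value block

active_share = ACTIVE_PARAMS / TOTAL_PARAMS
weight_gb = TOTAL_PARAMS * MXFP4_BITS / 8 / 1e9

print(f"Active parameters per token: {active_share:.1%}")   # 4.4%
print(f"Approx. quantized weight size: {weight_gb:.0f} GB")  # ~62 GB
```

Roughly 62 GB of weights is why ~80 GB VRAM (weights plus KV cache and runtime buffers) is a realistic floor – and why the 96 GB of the AI Cube Pro leaves comfortable headroom.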

Technical Specifications

| Property           | GPT-OSS 120B             |
|--------------------|--------------------------|
| Parameters (total) | 117 billion              |
| Active parameters  | 5.1 billion              |
| Architecture       | Mixture of Experts (MoE) |
| Context length     | up to 128k tokens        |
| Quantization       | MXFP4                    |
| VRAM requirement   | ~80 GB                   |
| License            | Apache 2.0               |

Why GPT-OSS 120B on the AI Cube Pro?

Our AI Cube Pro with the RTX PRO 6000 Blackwell (96 GB VRAM) is perfectly suited for GPT-OSS 120B. The 96 GB VRAM provides enough headroom for the model plus context buffer.

Advantages over Cloud APIs

  1. No token costs: One-time hardware investment instead of pay-per-use
  2. Full data control: Your data never leaves your company network
  3. GDPR compliance: No data transfer to US servers
  4. Unlimited usage: No rate limits or usage caps
  5. Customizable: Fine-tuning for your own use cases possible

Performance on the AI Cube Pro

Our benchmarks with GPT-OSS 120B on the AI Cube Pro show impressive results:

| Scenario                           | Throughput               |
|------------------------------------|--------------------------|
| Single user, small context         | ~150 tokens/s            |
| 20 users parallel, small contexts  | ~1,050 tokens/s total    |
| 20 users parallel, mixed contexts  | ~300-500 tokens/s total  |

These numbers make the AI Cube Pro ideal for team deployments: helpdesk bots, code review assistants, or internal knowledge bases with RAG.

The Pre-installed Software Stack

Every AI Cube Pro ships with a fully pre-configured stack:

Inference Engines

  • Ollama: Simple model management, quick start
  • vLLM: Maximum throughput, optimized for multi-user scenarios

Web Interfaces

  • Open WebUI: ChatGPT-style chat interface in the browser
  • AnythingLLM: Workspaces with built-in RAG for your own documents

Base System

  • Ubuntu Server LTS
  • Optimized CUDA drivers for RTX Blackwell
  • Docker for container deployments
  • nvidia-smi exporter for monitoring

GPT-OSS 120B in Practice

Use Case 1: Internal Helpdesk Bot

A mid-sized company uses GPT-OSS 120B as first-level support. The model answers employee questions about internal processes, IT issues, and HR topics. Through RAG integration with the internal knowledge base, it delivers context-aware answers – without sensitive company data ever leaving the firewall.
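In a setup like this, the application retrieves matching knowledge-base chunks and assembles them into the prompt. A minimal sketch of that assembly step (the retrieval itself, the hostname, and the example texts are placeholders of ours, not part of the shipped stack):

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt from retrieved knowledge-base chunks
    (retrieval itself, e.g. via a vector database, is out of scope here)."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    "How do I request vacation?",
    ["Vacation requests are filed in the HR portal.",
     "Approval usually takes two working days."],
)

# Sending the prompt to the local model via Ollama's OpenAI-compatible
# endpoint (hypothetical hostname; requires the `openai` package and a
# running server, so it is left commented out here):
# from openai import OpenAI
# client = OpenAI(base_url="http://ai-cube.local:11434/v1", api_key="ollama")
# answer = client.chat.completions.create(
#     model="gpt-oss:120b",
#     messages=[{"role": "user", "content": prompt}],
# )
```

The point of the fixed instruction block is that answers stay grounded in the retrieved context instead of the model's general knowledge – the data never has to leave the network for that.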

Use Case 2: Code Review Assistant

A development team uses GPT-OSS 120B for code reviews. The model analyzes pull requests, finds potential bugs, and suggests improvements. At ~150 tokens/s per user, the interaction feels fluid – like having an experienced senior developer looking over your shoulder.
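One practical detail in such a pipeline is splitting a large pull request into per-file pieces so each file can be reviewed in its own request. A hypothetical helper to illustrate the idea (our sketch, not part of any shipped tooling):

```python
def split_diff_by_file(diff: str) -> dict[str, str]:
    """Split a unified git diff into per-file chunks, keyed by file path."""
    files: dict[str, list[str]] = {}
    current = None
    for line in diff.splitlines():
        if line.startswith("diff --git "):
            # e.g. "diff --git a/app.py b/app.py" -> "app.py"
            current = line.split(" b/")[-1]
            files[current] = []
        if current is not None:
            files[current].append(line)
    return {name: "\n".join(lines) for name, lines in files.items()}

sample_diff = """\
diff --git a/app.py b/app.py
--- a/app.py
+++ b/app.py
@@ -1 +1 @@
-print('hi')
+print('hello')
diff --git a/util.py b/util.py
--- a/util.py
+++ b/util.py
@@ -0,0 +1 @@
+def f(): pass
"""

chunks = split_diff_by_file(sample_diff)
print(list(chunks))  # ['app.py', 'util.py']
```

Each chunk can then be sent to the model with a review instruction, and the responses merged back into one review comment.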

Use Case 3: Contract Analysis

A law firm uses GPT-OSS 120B to search through thousands of pages of contract documents. The 128k-token context length allows analysis of entire contracts in one pass. Client data stays on-premise – a must for attorney-client privilege.
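Whether an entire contract actually fits into the window is simple arithmetic. With an assumed average of ~500 tokens per page (our rule of thumb; real density varies with tokenizer and layout):

```python
# Rough check of how many contract pages fit into one 128k-token pass.
CONTEXT_TOKENS = 128_000
TOKENS_PER_PAGE = 500        # assumed average for legal prose
RESERVED_FOR_ANSWER = 4_000  # leave room for the model's output

max_pages = (CONTEXT_TOKENS - RESERVED_FOR_ANSWER) // TOKENS_PER_PAGE
print(max_pages)  # 248
```

So on the order of 250 pages per pass under these assumptions – longer documents would need chunking or a RAG setup.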

Getting Started with GPT-OSS 120B

After delivery of the AI Cube Pro, getting started is simple:

  1. Connect: Plug in power and network
  2. Power on: The system boots with pre-configured stack
  3. Load model: Download GPT-OSS 120B via Ollama or vLLM
  4. Start using: Interact with the model through Open WebUI or AnythingLLM
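Step 3 above can look like this in the terminal (model identifiers as published at the time of writing – verify them against the Ollama library and Hugging Face before running):

```shell
# Pull and test the model via Ollama (the weights are ~65 GB,
# so the first pull takes a while):
ollama pull gpt-oss:120b
ollama run gpt-oss:120b "Say hello in one sentence."

# Alternatively, serve it with vLLM for multi-user workloads
# (exposes an OpenAI-compatible API on port 8000 by default):
vllm serve openai/gpt-oss-120b
```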

Optionally, we can configure the following before delivery:

  • SSO integration (LDAP, SAML, OAuth)
  • Prometheus/Grafana monitoring
  • RAG pipelines with your data sources
  • Backup strategies

Conclusion

GPT-OSS 120B is a milestone: OpenAI's first open-weight model in years, with o4-mini-level performance. On the AI Cube Pro, it runs stably, fast, and – most importantly – completely under your control.

No cloud dependency. No token costs. No privacy concerns.

Interested? Schedule a consultation or check out our AI Cube models.



Written by

Timo Wevelsiep

Co-Founder & CEO

Co-Founder of WZ-IT. Specialized in cloud infrastructure, open-source platforms and managed services for SMEs and enterprise clients worldwide.

