GPT-OSS 120B on AI Cube Pro: Run OpenAI's Open-Source Model Locally

Editorial note: The information in this article was compiled to the best of our knowledge at the time of publication. Technical details, prices, versions, licensing terms, and external content may change. Please verify the information provided independently, particularly before making business-critical or security-related decisions. This article does not replace individual professional, legal, or tax advice.

In August 2025, OpenAI released GPT-OSS, its first open-weight models since GPT-2 – and the results are impressive. The larger of the two, GPT-OSS 120B, achieves near o4-mini performance yet runs entirely on your own hardware. In this article, we show you how to run GPT-OSS 120B on our AI Cube Pro.
What is GPT-OSS 120B?
GPT-OSS marks OpenAI's return to open source. After years of closed models (GPT-3, GPT-4, the o-series), OpenAI released two open-weight models under the permissive Apache 2.0 license:
- GPT-OSS 120B: The large model with 117 billion parameters
- GPT-OSS 20B: The smaller variant (21 billion parameters) for consumer hardware and edge devices
What makes it special: GPT-OSS uses a Mixture-of-Experts (MoE) architecture. Of the 117 billion parameters, only 5.1 billion are active per token. This makes the model extremely efficient – it runs on a single GPU with 80+ GB VRAM.
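To make the routing idea concrete, here is a toy sketch of top-k expert routing in plain NumPy. It is purely illustrative – the dimensions, expert count, and router below are made up for readability, while GPT-OSS 120B itself reportedly routes each token to 4 of 128 experts per MoE layer:

```python
import numpy as np

def moe_layer(x, experts, router_w, top_k=4):
    """Toy MoE forward pass for a single token vector x.

    Only the top_k experts chosen by the router are evaluated; the
    rest stay idle. That is why so few of the 117B total parameters
    are active per token.
    """
    logits = router_w @ x                    # one router score per expert
    top = np.argsort(logits)[-top_k:]        # indices of the top_k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over selected experts
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
dim, n_experts = 16, 32                      # made-up sizes for readability
mats = [rng.standard_normal((dim, dim)) / dim**0.5 for _ in range(n_experts)]
experts = [lambda x, M=M: M @ x for M in mats]
router_w = rng.standard_normal((n_experts, dim))

y = moe_layer(rng.standard_normal(dim), experts, router_w)
print(y.shape)  # (16,) -- produced by 4 of the 32 experts
```

The key point sits in `moe_layer`: only the selected experts are evaluated at all, so compute per token scales with `top_k`, not with the total parameter count.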
Technical Specifications
| Property | GPT-OSS 120B |
|---|---|
| Parameters (total) | 117 billion |
| Active parameters | 5.1 billion |
| Architecture | Mixture of Experts (MoE) |
| Context length | up to 128k tokens |
| Quantization | MXFP4 |
| VRAM requirement | ~80 GB |
| License | Apache 2.0 |
Why GPT-OSS 120B on the AI Cube Pro?
Our AI Cube Pro with the RTX PRO 6000 Blackwell (96 GB VRAM) is perfectly suited for GPT-OSS 120B. The 96 GB VRAM provides enough headroom for the model plus context buffer.
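A rough back-of-envelope calculation illustrates that headroom. The layer and head counts below are assumptions taken from the published model configuration (36 layers, 8 grouped-query KV heads, head dimension 64) – verify them against the configuration you actually deploy:

```python
# Back-of-envelope VRAM sizing. The ~80 GB figure comes from the spec
# table above; the layer/head counts are assumed from the published
# model configuration -- verify against your deployment.
GB = 1024**3

weights_and_runtime = 80 * GB            # model weights + runtime overhead
layers, kv_heads, head_dim = 36, 8, 64   # assumed GQA geometry
bytes_per_entry = 2                      # 16-bit KV cache entries

kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_entry  # K and V
kv_cache = 128 * 1024 * kv_per_token     # full 128k-token context

print(f"KV cache at 128k tokens: {kv_cache / GB:.1f} GB")             # ~9.0 GB
print(f"Total: {(weights_and_runtime + kv_cache) / GB:.1f} / 96 GB")  # ~89 GB
```

Under these assumptions, even the full 128k context adds only about 9 GB of KV cache, which is why the 96 GB card comfortably fits model plus context buffer.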
Advantages over Cloud APIs
- No token costs: One-time hardware investment instead of pay-per-use
- Full data control: Your data never leaves your company network
- GDPR compliance: No data transfer to US servers
- Unlimited usage: No rate limits or usage caps
- Customizable: Fine-tuning for your own use cases possible
Performance on the AI Cube Pro
Our benchmarks with GPT-OSS 120B on the AI Cube Pro show impressive results:
| Scenario | Throughput |
|---|---|
| Single user, small context | ~150 tokens/s |
| 20 parallel users, small contexts | ~1,050 tokens/s total |
| 20 parallel users, mixed contexts | ~300-500 tokens/s total |
These numbers make the AI Cube Pro ideal for team deployments: helpdesk bots, code review assistants, or internal knowledge bases with RAG.
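If you want to reproduce such numbers on your own machine, a minimal load-test sketch against a local OpenAI-compatible endpoint looks like this. The endpoint URL and model name are assumptions for a default vLLM setup – adjust them to yours:

```python
import asyncio, time
from openai import AsyncOpenAI

# Assumes a default vLLM server; adjust base_url and model to your setup.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

async def one_request(i: int) -> int:
    resp = await client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": f"Explain topic #{i} in three sentences."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def main(n_users: int = 20):
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request(i) for i in range(n_users)))
    elapsed = time.perf_counter() - start
    print(f"{sum(tokens)} tokens in {elapsed:.1f}s "
          f"-> {sum(tokens) / elapsed:.0f} tokens/s total")

asyncio.run(main())
```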
The Pre-installed Software Stack
Every AI Cube Pro ships with a fully pre-configured stack:
Inference Engines
- Ollama: Simple model management, quick start
- vLLM: Maximum throughput, optimized for multi-user scenarios (see the client sketch after this list)
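Both engines expose an OpenAI-compatible HTTP API, so client code is interchangeable between them. A minimal sketch, assuming Ollama's default port and the `gpt-oss:120b` model tag:

```python
from openai import OpenAI

# Ollama's OpenAI-compatible endpoint; for vLLM use
# base_url="http://localhost:8000/v1" and the model name you serve.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="gpt-oss:120b",
    messages=[{"role": "user", "content": "Summarize the Apache 2.0 license in two sentences."}],
)
print(resp.choices[0].message.content)
```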
Web Interfaces
- Open WebUI: ChatGPT-like interface with team features
- AnythingLLM: Workspace-based with integrated RAG
Base System
- Ubuntu Server LTS
- Optimized CUDA drivers for RTX Blackwell
- Docker for container deployments
- nvidia-smi exporter for monitoring
GPT-OSS 120B in Practice
Use Case 1: Internal Helpdesk Bot
A mid-sized company uses GPT-OSS 120B as first-level support. The model answers employee questions about internal processes, IT issues, and HR topics. Through RAG integration with the internal knowledge base, it delivers context-aware answers – without sensitive company data ever leaving the firewall.
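The pattern behind this is easy to sketch. Below, a toy hashed bag-of-words function stands in for a real embedding model and a Python list stands in for a vector database – both are placeholders, and the endpoint and model tag assume a default Ollama setup:

```python
import numpy as np
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def embed(text: str) -> np.ndarray:
    """Toy stand-in: hashed bag-of-words. Swap in a real embedding model."""
    v = np.zeros(256)
    for word in text.lower().split():
        v[hash(word) % 256] += 1.0
    return v

# Embed the knowledge base once; a list stands in for a vector database.
docs = [
    "VPN access: request a certificate via the IT portal, then ...",
    "Vacation requests are submitted through the HR self-service ...",
]
index = [(doc, embed(doc)) for doc in docs]

def answer(question: str) -> str:
    q = embed(question)
    # Retrieve the most similar document by cosine similarity.
    best_doc, _ = max(
        index,
        key=lambda it: float(it[1] @ q) / (np.linalg.norm(it[1]) * np.linalg.norm(q) + 1e-9),
    )
    resp = client.chat.completions.create(
        model="gpt-oss:120b",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{best_doc}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(answer("How do I get VPN access?"))
```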
Use Case 2: Code Review Assistant
A development team uses GPT-OSS 120B for code reviews. The model analyzes pull requests, finds potential bugs, and suggests improvements. At up to ~150 tokens/s, the interaction feels fluid – like having an experienced senior developer looking over your shoulder.
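In its simplest form, such an assistant is one prompt away – a sketch, again assuming a default Ollama endpoint and a made-up diff; in practice you would trigger this from your CI or Git hosting webhooks:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

diff = """\
--- a/billing.py
+++ b/billing.py
@@ -10,7 +10,7 @@
-    total = sum(item.price for item in items)
+    total = sum(item.price * item.qty for item in items if item.qty)
"""

resp = client.chat.completions.create(
    model="gpt-oss:120b",
    messages=[
        {"role": "system", "content": "You are a strict senior code reviewer. "
                                      "List bugs, edge cases, and concrete improvements."},
        {"role": "user", "content": f"Review this diff:\n{diff}"},
    ],
)
print(resp.choices[0].message.content)
```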
Use Case 3: Document Analysis in Legal
A law firm uses GPT-OSS 120B to search through thousands of pages of contract documents. The 128k token context length allows analysis of entire contracts in one pass. Client data stays on-premise – a must for attorney-client privilege.
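A single-pass analysis can look like the following sketch (the file name and question are made up; note that serving engines often default to a much smaller context window than 128k, so raise it in the engine configuration before feeding whole contracts):

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# A hypothetical contract file; at roughly 3-4 characters per token,
# 128k tokens cover several hundred pages in one pass.
contract = Path("contract.txt").read_text(encoding="utf-8")

resp = client.chat.completions.create(
    model="gpt-oss:120b",
    messages=[
        {"role": "system", "content": "You are a contract analyst. Quote the clause you rely on."},
        {"role": "user", "content": f"{contract}\n\nQuestion: Which termination notice periods apply?"},
    ],
)
print(resp.choices[0].message.content)
```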
Getting Started with GPT-OSS 120B
After delivery of the AI Cube Pro, getting started is simple:
- Connect: Plug in power and network
- Power on: The system boots with pre-configured stack
- Load model: Download GPT-OSS 120B via Ollama or vLLM (see the sketch after this list)
- Start using: Interact with the model through Open WebUI or AnythingLLM
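For the Ollama route, the first pull and a quick smoke test can be scripted with the official Python client (`pip install ollama`); the `gpt-oss:120b` tag is the one published in the Ollama model library:

```python
import ollama  # official client: pip install ollama

# One-time download of the model weights (tens of gigabytes).
ollama.pull("gpt-oss:120b")

# Quick smoke test to confirm the model answers.
resp = ollama.chat(
    model="gpt-oss:120b",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(resp["message"]["content"])
```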
Optionally, we can configure the following before delivery:
- SSO integration (LDAP, SAML, OAuth)
- Prometheus/Grafana monitoring
- RAG pipelines with your data sources
- Backup strategies
Conclusion
GPT-OSS 120B is a milestone: OpenAI's first open-weight model in years, with near o4-mini performance. On the AI Cube Pro, it runs reliably, fast, and – most importantly – completely under your control.
No cloud dependency. No token costs. No privacy concerns.
Interested? Schedule a consultation or check out our AI Cube models.
Written by
Timo Wevelsiep
Co-Founder & CEO
Co-Founder of WZ-IT. Specialized in cloud infrastructure, open-source platforms and managed services for SMEs and enterprise clients worldwide.