GPT-OSS 120B on AI Cube Pro: Run OpenAI's Open-Source Model Locally

Editorial note: The information in this article was compiled to the best of our knowledge at the time of publication. Technical details, prices, versions, licensing terms, and external content may change. Please verify the information provided independently, particularly before making business-critical or security-related decisions. This article does not replace individual professional, legal, or tax advice.

In August 2025, OpenAI released GPT-OSS, its first open-weight models since GPT-2 – and the results are impressive. The larger of the two, GPT-OSS 120B, achieves near o4-mini performance yet runs entirely on your own hardware. In this article, we show you how to run GPT-OSS 120B on our AI Cube Pro.
What is GPT-OSS 120B?
GPT-OSS marks OpenAI's return to open source. After years of closed models (GPT-3, GPT-4, the o-series), OpenAI released two open-weight models under the permissive Apache 2.0 license:
- GPT-OSS 120B: The large model with 117 billion parameters
- GPT-OSS 20B: The smaller variant (21 billion parameters) for consumer hardware and edge devices
What makes it special: GPT-OSS uses a Mixture-of-Experts (MoE) architecture. Of the 117 billion parameters, only 5.1 billion are active per token. This makes the model extremely efficient – it runs on a single GPU with 80+ GB VRAM.
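To make the routing idea concrete, here is a toy sketch of top-k expert routing in plain NumPy. It is purely illustrative – the dimensions, expert count, and router below are made up for readability, while GPT-OSS 120B itself reportedly routes each token to 4 of 128 experts per MoE layer:

```python
import numpy as np

def moe_layer(x, experts, router_w, top_k=4):
    """Toy MoE forward pass for a single token vector x.

    Only the top_k experts chosen by the router are evaluated; the
    rest stay idle. That is why so few of the 117B total parameters
    are active per token.
    """
    logits = router_w @ x                    # one router score per expert
    top = np.argsort(logits)[-top_k:]        # indices of the top_k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over selected experts
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
dim, n_experts = 16, 32                      # made-up sizes for readability
mats = [rng.standard_normal((dim, dim)) / dim**0.5 for _ in range(n_experts)]
experts = [lambda x, M=M: M @ x for M in mats]
router_w = rng.standard_normal((n_experts, dim))

y = moe_layer(rng.standard_normal(dim), experts, router_w)
print(y.shape)  # (16,) -- produced by 4 of the 32 experts
```

The key point sits in `moe_layer`: only the selected experts are evaluated at all, so compute per token scales with `top_k`, not with the total parameter count.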
Technical Specifications
| Property | GPT-OSS 120B |
|---|---|
| Parameters (total) | 117 billion |
| Active parameters | 5.1 billion |
| Architecture | Mixture of Experts (MoE) |
| Context length | up to 128k tokens |
| Quantization | MXFP4 |
| VRAM requirement | ~80 GB |
| License | Apache 2.0 |
Why GPT-OSS 120B on the AI Cube Pro?
Our AI Cube Pro with the RTX PRO 6000 Blackwell (96 GB VRAM) is perfectly suited for GPT-OSS 120B. The 96 GB VRAM provides enough headroom for the model plus context buffer.
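A rough back-of-envelope calculation illustrates that headroom. The layer and head counts below are assumptions taken from the published model configuration (36 layers, 8 grouped-query KV heads, head dimension 64) – verify them against the configuration you actually deploy:

```python
# Back-of-envelope VRAM sizing. The ~80 GB figure comes from the spec
# table above; the layer/head counts are assumed from the published
# model configuration -- verify against your deployment.
GB = 1024**3

weights_and_runtime = 80 * GB            # model weights + runtime overhead
layers, kv_heads, head_dim = 36, 8, 64   # assumed GQA geometry
bytes_per_entry = 2                      # 16-bit KV cache entries

kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_entry  # K and V
kv_cache = 128 * 1024 * kv_per_token     # full 128k-token context

print(f"KV cache at 128k tokens: {kv_cache / GB:.1f} GB")             # ~9.0 GB
print(f"Total: {(weights_and_runtime + kv_cache) / GB:.1f} / 96 GB")  # ~89 GB
```

Under these assumptions, even the full 128k context adds only about 9 GB of KV cache, which is why the 96 GB card comfortably fits model plus context buffer.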
Advantages over Cloud APIs
- No token costs: One-time hardware investment instead of pay-per-use
- Full data control: Your data never leaves your company network
- GDPR compliance: No data transfer to US servers
- Unlimited usage: No rate limits or usage caps
- Customizable: Fine-tuning for your own use cases possible
Performance on the AI Cube Pro
Our benchmarks with GPT-OSS 120B on the AI Cube Pro show impressive results:
| Scenario | Throughput |
|---|---|
| Single user, small context | ~150 tokens/s |
| 20 parallel users, small contexts | ~1,050 tokens/s total |
| 20 parallel users, mixed contexts | ~300-500 tokens/s total |
These numbers make the AI Cube Pro ideal for team deployments: helpdesk bots, code review assistants, or internal knowledge bases with RAG.
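If you want to reproduce such numbers on your own machine, a minimal load-test sketch against a local OpenAI-compatible endpoint looks like this. The endpoint URL and model name are assumptions for a default vLLM setup – adjust them to yours:

```python
import asyncio, time
from openai import AsyncOpenAI

# Assumes a default vLLM server; adjust base_url and model to your setup.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

async def one_request(i: int) -> int:
    resp = await client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": f"Explain topic #{i} in three sentences."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def main(n_users: int = 20):
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request(i) for i in range(n_users)))
    elapsed = time.perf_counter() - start
    print(f"{sum(tokens)} tokens in {elapsed:.1f}s "
          f"-> {sum(tokens) / elapsed:.0f} tokens/s total")

asyncio.run(main())
```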
The Pre-installed Software Stack
Every AI Cube Pro ships with a fully pre-configured stack:
Inference Engines
- Ollama: Simple model management, quick start
- vLLM: Maximum throughput, optimized for multi-user scenarios (see the client sketch after this list)
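Both engines expose an OpenAI-compatible HTTP API, so client code is interchangeable between them. A minimal sketch, assuming Ollama's default port and the `gpt-oss:120b` model tag:

```python
from openai import OpenAI

# Ollama's OpenAI-compatible endpoint; for vLLM use
# base_url="http://localhost:8000/v1" and the model name you serve.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="gpt-oss:120b",
    messages=[{"role": "user", "content": "Summarize the Apache 2.0 license in two sentences."}],
)
print(resp.choices[0].message.content)
```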
Web Interfaces
- Open WebUI: ChatGPT-like interface with team features
- AnythingLLM: Workspace-based with integrated RAG
Base System
- Ubuntu Server LTS
- Optimized CUDA drivers for RTX Blackwell
- Docker for container deployments
- nvidia-smi exporter for monitoring
GPT-OSS 120B in Practice
Use Case 1: Internal Helpdesk Bot
A mid-sized company uses GPT-OSS 120B as first-level support. The model answers employee questions about internal processes, IT issues, and HR topics. Through RAG integration with the internal knowledge base, it delivers context-aware answers – without sensitive company data ever leaving the firewall.
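The pattern behind this is easy to sketch. Below, a toy hashed bag-of-words function stands in for a real embedding model and a Python list stands in for a vector database – both are placeholders, and the endpoint and model tag assume a default Ollama setup:

```python
import numpy as np
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def embed(text: str) -> np.ndarray:
    """Toy stand-in: hashed bag-of-words. Swap in a real embedding model."""
    v = np.zeros(256)
    for word in text.lower().split():
        v[hash(word) % 256] += 1.0
    return v

# Embed the knowledge base once; a list stands in for a vector database.
docs = [
    "VPN access: request a certificate via the IT portal, then ...",
    "Vacation requests are submitted through the HR self-service ...",
]
index = [(doc, embed(doc)) for doc in docs]

def answer(question: str) -> str:
    q = embed(question)
    # Retrieve the most similar document by cosine similarity.
    best_doc, _ = max(
        index,
        key=lambda it: float(it[1] @ q) / (np.linalg.norm(it[1]) * np.linalg.norm(q) + 1e-9),
    )
    resp = client.chat.completions.create(
        model="gpt-oss:120b",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{best_doc}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(answer("How do I get VPN access?"))
```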
Use Case 2: Code Review Assistant
A development team uses GPT-OSS 120B for code reviews. The model analyzes pull requests, finds potential bugs, and suggests improvements. At up to ~150 tokens/s, the interaction feels fluid – like having an experienced senior developer looking over your shoulder.
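In its simplest form, such an assistant is one prompt away – a sketch, again assuming a default Ollama endpoint and a made-up diff; in practice you would trigger this from your CI or Git hosting webhooks:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

diff = """\
--- a/billing.py
+++ b/billing.py
@@ -10,7 +10,7 @@
-    total = sum(item.price for item in items)
+    total = sum(item.price * item.qty for item in items if item.qty)
"""

resp = client.chat.completions.create(
    model="gpt-oss:120b",
    messages=[
        {"role": "system", "content": "You are a strict senior code reviewer. "
                                      "List bugs, edge cases, and concrete improvements."},
        {"role": "user", "content": f"Review this diff:\n{diff}"},
    ],
)
print(resp.choices[0].message.content)
```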
Use Case 3: Document Analysis in Legal
A law firm uses GPT-OSS 120B to search through thousands of pages of contract documents. The 128k token context length allows analysis of entire contracts in one pass. Client data stays on-premise – a must for attorney-client privilege.
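A single-pass analysis can look like the following sketch (the file name and question are made up; note that serving engines often default to a much smaller context window than 128k, so raise it in the engine configuration before feeding whole contracts):

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# A hypothetical contract file; at roughly 3-4 characters per token,
# 128k tokens cover several hundred pages in one pass.
contract = Path("contract.txt").read_text(encoding="utf-8")

resp = client.chat.completions.create(
    model="gpt-oss:120b",
    messages=[
        {"role": "system", "content": "You are a contract analyst. Quote the clause you rely on."},
        {"role": "user", "content": f"{contract}\n\nQuestion: Which termination notice periods apply?"},
    ],
)
print(resp.choices[0].message.content)
```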
Getting Started with GPT-OSS 120B
After delivery of the AI Cube Pro, getting started is simple:
- Connect: Plug in power and network
- Power on: The system boots with pre-configured stack
- Load model: Download GPT-OSS 120B via Ollama or vLLM (see the sketch after this list)
- Start using: Interact with the model through Open WebUI or AnythingLLM
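For the Ollama route, the first pull and a quick smoke test can be scripted with the official Python client (`pip install ollama`); the `gpt-oss:120b` tag is the one published in the Ollama model library:

```python
import ollama  # official client: pip install ollama

# One-time download of the model weights (tens of gigabytes).
ollama.pull("gpt-oss:120b")

# Quick smoke test to confirm the model answers.
resp = ollama.chat(
    model="gpt-oss:120b",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(resp["message"]["content"])
```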
Optionally, we can configure the following before delivery:
- SSO integration (LDAP, SAML, OAuth)
- Prometheus/Grafana monitoring
- RAG pipelines with your data sources
- Backup strategies
Conclusion
GPT-OSS 120B is a milestone: OpenAI's first open-weight model in years, with near o4-mini performance. On the AI Cube Pro, it runs reliably, fast, and – most importantly – completely under your control.
No cloud dependency. No token costs. No privacy concerns.
Interested? Schedule a consultation or check out our AI Cube models.
Written by
Timo Wevelsiep
Co-Founder & CEO
Co-Founder of WZ-IT. Specialized in cloud infrastructure, open-source platforms and managed services for SMEs and enterprise clients worldwide.