GPT-OSS 120B on AI Cube Pro: Run OpenAI's Open-Source Model Locally

In August 2025, OpenAI released GPT-OSS 120B, its first open-weight model since GPT-2 – and it's impressive. The model comes close to o4-mini performance yet runs entirely on your own hardware. In this article, we show you how to run GPT-OSS 120B on our AI Cube Pro.
What is GPT-OSS 120B?
GPT-OSS marks OpenAI's return to open weights. After years of closed models (GPT-3, GPT-4, the o-series), OpenAI has released two open-weight models under the permissive Apache 2.0 license:
- GPT-OSS 120B: The large model with 117 billion parameters
- GPT-OSS 20B: The smaller variant for edge devices
What makes it special: GPT-OSS uses a Mixture-of-Experts (MoE) architecture. Of the 117 billion parameters, only 5.1 billion are active per token. This makes the model extremely efficient – it runs on a single GPU with 80+ GB VRAM.
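The arithmetic behind that claim can be sketched in a few lines. One assumption on our part: MXFP4 stores 4-bit values plus a shared 8-bit scale per 32-element block, i.e. roughly 4.25 bits per parameter.

```python
# Back-of-the-envelope VRAM estimate for GPT-OSS 120B in MXFP4.
# Assumption: ~4.25 bits per parameter (4-bit values + one 8-bit
# scale per 32-element block).

TOTAL_PARAMS = 117e9
ACTIVE_PARAMS = 5.1e9
BITS_PER_PARAM = 4 + 8 / 32          # ~4.25 bits with block scaling

weights_gb = TOTAL_PARAMS * BITS_PER_PARAM / 8 / 1e9
active_share = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"Quantized weights: ~{weights_gb:.0f} GB")    # ~62 GB
print(f"Active per token:  {active_share:.1%} of all parameters")
```

The gap between the ~62 GB of weights and the ~80 GB figure in the table goes to the KV cache, activations, and runtime overhead – which is why the 96 GB of the AI Cube Pro leaves comfortable headroom for long contexts.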
Technical Specifications
| Property | GPT-OSS 120B |
|---|---|
| Parameters (total) | 117 billion |
| Active parameters | 5.1 billion |
| Architecture | Mixture of Experts (MoE) |
| Context length | up to 128k tokens |
| Quantization | MXFP4 |
| VRAM requirement | ~80 GB |
| License | Apache 2.0 |
Why GPT-OSS 120B on the AI Cube Pro?
Our AI Cube Pro with the RTX PRO 6000 Blackwell (96 GB VRAM) is perfectly suited for GPT-OSS 120B. The 96 GB VRAM provides enough headroom for the model plus context buffer.
Advantages over Cloud APIs
- No token costs: One-time hardware investment instead of pay-per-use
- Full data control: Your data never leaves your company network
- GDPR compliance: No data transfer to US servers
- Unlimited usage: No rate limits or usage caps
- Customizable: Fine-tuning for your own use cases possible
Performance on the AI Cube Pro
Our benchmarks with GPT-OSS 120B on the AI Cube Pro show impressive results:
| Scenario | Throughput |
|---|---|
| Single user, small context | ~150 tokens/s |
| 20 users parallel, small contexts | ~1,050 tokens/s total |
| 20 users parallel, mixed contexts | ~300-500 tokens/s total |
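To put the multi-user numbers in perspective, a quick calculation helps. The reading-speed figure (~5 words/s, ~1.3 tokens per word) is a rough rule of thumb on our part:

```python
# What the aggregate throughput from the table means per user.
total_tps = 1050          # 20 users, small contexts
users = 20

per_user_tps = total_tps / users      # ~52.5 tokens/s each
reading_tps = 5 * 1.3                 # assumed human reading speed in tokens/s

print(f"Per user: ~{per_user_tps:.0f} tokens/s "
      f"(~{per_user_tps / reading_tps:.0f}x faster than reading speed)")
```

Even under full load from 20 parallel users, each individual user still gets answers far faster than they can read them.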
These numbers make the AI Cube Pro ideal for team deployments: helpdesk bots, code review assistants, or internal knowledge bases with RAG.
The Pre-installed Software Stack
Every AI Cube Pro ships with a fully pre-configured stack:
Inference Engines
- Ollama: Simple model management, quick start
- vLLM: Maximum throughput, optimized for multi-user scenarios
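As a sketch, loading and serving the model could look like this. The model tags `gpt-oss:120b` (Ollama) and `openai/gpt-oss-120b` (Hugging Face) reflect the naming at the time of writing – check before copying:

```bash
# Option 1: Ollama – simple model management, quick start
ollama pull gpt-oss:120b
ollama run gpt-oss:120b "Summarize our vacation policy in three sentences."

# Option 2: vLLM – OpenAI-compatible server for multi-user workloads
vllm serve openai/gpt-oss-120b --port 8000
```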
Web Interfaces
- Open WebUI: ChatGPT-like interface with team features
- AnythingLLM: Workspace-based with integrated RAG
Base System
- Ubuntu Server LTS
- Optimized CUDA drivers for RTX Blackwell
- Docker for container deployments
- nvidia-smi exporter for monitoring
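For a quick look at GPU load without opening Grafana, `nvidia-smi` itself is enough; the query flags below are standard `nvidia-smi` options:

```bash
# Per-second snapshot of load and VRAM usage on the RTX PRO 6000
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
           --format=csv,noheader -l 1
```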
GPT-OSS 120B in Practice
Use Case 1: Internal Helpdesk Bot
A mid-sized company uses GPT-OSS 120B as first-level support. The model answers employee questions about internal processes, IT issues, and HR topics. Through RAG integration with the internal knowledge base, it delivers context-aware answers – without sensitive company data ever leaving the firewall.
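Such a helpdesk bot talks to the model over the OpenAI-compatible API that both vLLM and Ollama expose. A minimal sketch of how the RAG prompt could be assembled – endpoint URL, model name, and the `retrieved_docs` are placeholders for your own setup:

```python
def build_helpdesk_request(question: str, retrieved_docs: list[str],
                           model: str = "gpt-oss-120b") -> dict:
    """Assemble an OpenAI-style chat completion payload with RAG context."""
    context = "\n\n".join(retrieved_docs)
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Answer employee questions using only the provided "
                        "internal documents.\n\n" + context},
            {"role": "user", "content": question},
        ],
        "temperature": 0.2,   # keep answers factual
    }

# Documents would come from your vector store; POST the payload as JSON
# to http://<ai-cube>:8000/v1/chat/completions
payload = build_helpdesk_request(
    "How do I request vacation?",
    retrieved_docs=["HR handbook, section 4: vacation requests go through ..."],
)
print(payload["model"])
```

The retrieved documents land in the system prompt, so the model answers strictly from internal sources – and the whole exchange never leaves your network.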
Use Case 2: Code Review Assistant
A development team uses GPT-OSS 120B for code reviews. The model analyzes pull requests, finds potential bugs, and suggests improvements. At ~150 tokens/s per user, the interaction feels fluid – like having an experienced senior developer looking over your shoulder.
Use Case 3: Document Analysis in Legal
A law firm uses GPT-OSS 120B to search through thousands of pages of contract documents. The 128k token context length allows analysis of entire contracts in one pass. Client data stays on-premise – a must for attorney-client privilege.
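How far "in one pass" carries depends on document length. A rough estimate – the ~350 tokens per contract page is our assumption; real values vary by tokenizer and layout:

```python
# Rough fit check: how many contract pages fit into the 128k context?
CONTEXT_TOKENS = 128_000
TOKENS_PER_PAGE = 350        # assumed: dense legal text

max_pages = CONTEXT_TOKENS // TOKENS_PER_PAGE
print(f"~{max_pages} pages per pass")
```

A single long contract of a few hundred pages fits comfortably; collections of thousands of pages are handled by chunking across multiple passes with RAG.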
Getting Started with GPT-OSS 120B
After delivery of the AI Cube Pro, getting started is simple:
1. Connect: Plug in power and network
2. Power on: The system boots with the pre-configured stack
3. Load model: Download GPT-OSS 120B via Ollama or vLLM
4. Start using: Interact with the model through Open WebUI or AnythingLLM
Optionally, we configure before delivery:
- SSO integration (LDAP, SAML, OAuth)
- Prometheus/Grafana monitoring
- RAG pipelines with your data sources
- Backup strategies
Conclusion
GPT-OSS 120B is a milestone: OpenAI's first open-weight model in years, with o4-mini-level performance. On the AI Cube Pro, it runs stably, quickly, and – most importantly – completely under your control.
No cloud dependency. No token costs. No privacy concerns.
Interested? Schedule a consultation or check out our AI Cube models.
Let's Talk About Your Idea
Whether you have a specific IT challenge or just an idea – we look forward to hearing from you. In a brief conversation, we'll work out together whether and how your project fits with WZ-IT.



