Upgrade Announcement: Our Cloud GPU Servers Now Run on NVIDIA RTX 6000 Blackwell

Need more GPU power for your AI applications? Our AI Server Pro now runs on the NVIDIA RTX 6000 Blackwell Max-Q with 96 GB VRAM. Schedule a consultation →
We are now running the NVIDIA RTX 6000 Blackwell Max-Q instead of the RTX 6000 Ada in our AI Server Pro. This gives you significantly more headroom in the cloud for large models, longer contexts, and higher throughput requirements – without having to rethink your setup (Ollama/vLLM/OpenWebUI).
Table of Contents
- Key Changes at a Glance
- Why This Step
- Performance: Blackwell vs. Ada
- What 96 GB VRAM Enables
- What Stays the Same
- Availability
Key Changes at a Glance
1) Significantly More VRAM: 96 GB Instead of 48 GB
Our Pro instances now come with 96 GB GDDR7 VRAM (ECC) in the Blackwell generation – the RTX 6000 Ada had 48 GB GDDR6 (ECC). This is the most important change for LLMs, because VRAM in practice determines model size, context length, batch size, and parallelism.
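To get a feel for those numbers, here is a minimal Python sketch of the usual back-of-the-envelope estimate: quantized weight size plus FP16 KV cache. The model dimensions in the example are typical for a 70B-class model and are assumptions for illustration, not measured values.

```python
# Rough VRAM estimate for an LLM: quantized weights + KV cache.
# All model dimensions below are illustrative assumptions, not vendor specs.

def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate FP16 KV cache for one sequence (K and V per layer)."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / 1e9

# Example: a 70B-class model at ~4.5 bits/weight with a 32K context window
# (80 layers, 8 KV heads, head_dim 128 are typical for this class).
total = weights_gb(70, 4.5) + kv_cache_gb(80, 8, 128, 32_768)
print(f"~{total:.1f} GB needed -> comfortable in 96 GB, tight in 48 GB")
```

With roughly 50 GB for a single 70B sequence at 32K context, the jump from 48 GB to 96 GB is exactly the difference between "tight" and "comfortable" in the comparison table further below.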
2) New Architecture (Blackwell) and New Core Generations
The RTX PRO 6000 Blackwell series brings Blackwell architecture, 5th-Gen Tensor Cores, and 4th-Gen RT Cores. The RTX 6000 Ada is based on Ada Lovelace with 4th-Gen Tensor Cores and 3rd-Gen RT Cores. For AI workloads, the jump in Tensor Cores is particularly relevant.
3) Platform Modernization: PCIe Gen 5 vs. PCIe Gen 4
The RTX PRO 6000 Blackwell variants are designed for PCIe Gen 5 x16, the RTX 6000 Ada for PCIe Gen 4 x16. This can provide additional reserves depending on workload (e.g., data streaming, multi-node pipelines).
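For a rough sense of scale: the theoretical peak bandwidths (PCIe Gen 4 x16 ≈ 32 GB/s, Gen 5 x16 ≈ 64 GB/s) translate into transfer times like the following quick calculation. Actual throughput is lower in practice; this is only an order-of-magnitude sketch.

```python
# Theoretical minimum time to move 96 GB of weights/data from host to GPU
# over PCIe. Peak bandwidths; effective rates in practice are lower.
PCIE_GEN4_X16_GBPS = 32   # ~31.5 GB/s theoretical
PCIE_GEN5_X16_GBPS = 64   # ~63 GB/s theoretical

payload_gb = 96  # e.g. filling the full VRAM once

for name, bw in [("Gen 4 x16", PCIE_GEN4_X16_GBPS), ("Gen 5 x16", PCIE_GEN5_X16_GBPS)]:
    print(f"PCIe {name}: {payload_gb / bw:.1f} s minimum transfer time")
```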
4) Max-Q: Density & Efficiency for Scalable Setups
We use the Max-Q variant in the AI Server Pro. NVIDIA explicitly positions this for dense configurations and as a balance of performance and energy efficiency – ideal when you want to optimize performance per rack/server.
Why This Step: Larger Models, Longer Contexts, More Concurrent Users
In practice, we see three typical bottlenecks in productive LLM applications:
- VRAM Limit – Model doesn't fit completely in the GPU
- Time-to-First-Token – Prompt processing / prefill is too slow
- Throughput – Token generation under load is too low
With Blackwell, we address exactly these points: more VRAM for large models/long context and noticeably more performance in both prompt processing and token generation.
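If you want to check where your own deployment stands on the last two points, a minimal sketch like the following measures time-to-first-token and generation rate against an OpenAI-compatible endpoint (both vLLM and Ollama expose one). The base URL, API key, and model name are placeholders for your own setup.

```python
# Minimal sketch: measure time-to-first-token (TTFT) and generation rate
# against an OpenAI-compatible endpoint; vLLM and Ollama both expose one.
# The base URL, API key, and model name below are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="qwen3:32b",  # placeholder: whatever model your server has loaded
    messages=[{"role": "user", "content": "Summarize the benefits of more VRAM."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1  # streamed chunks are a rough proxy for tokens

total = time.perf_counter() - start
if first_token_at is not None:
    ttft = first_token_at - start
    print(f"TTFT: {ttft:.2f} s, ~{chunks / max(total - ttft, 1e-6):.1f} chunks/s generation")
```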
Performance: Blackwell vs. Ada (LLM Inference Benchmarks)
For external, verifiable context, we use measurements from Hardware Corner. They test with llama.cpp / llama-bench on Ubuntu 24.04 with CUDA 12.8; token generation is measured in tokens per second at 4-bit quantization (Q4_K_XL).
Token Generation at 16K Context
| Model (16K Context, 4-bit) | RTX 6000 Ada (t/s) | RTX 6000 Blackwell (t/s) | Blackwell Advantage |
|---|---|---|---|
| Qwen3 8B | 98.68 | 140.62 | +42.5% |
| Qwen3 14B | 58.51 | 96.86 | +65.5% |
| Qwen3 30B | 120.12 | 139.76 | +16.4% |
| Qwen3 32B | 25.08 | 45.72 | +82.3% |
| gpt-oss 20B | 137.10 | 237.92 | +73.5% |
| Llama 70B | 13.65 | 28.24 | +106.9% |
Source: Hardware Corner Token Generation Table (16K Context)
Important for practice: The larger the model and context, the more Blackwell pays off. At 65K context, you see nearly a doubling (or more) of token generation in several cases – e.g., Qwen3 14B +123%, Qwen3 32B +173%.
Prompt Processing (Time-to-First-Token)
Prefill/prompt processing is often the underestimated factor. At 16K context:
| Model | RTX 6000 Ada (t/s) | RTX 6000 Blackwell (t/s) | Blackwell Advantage |
|---|---|---|---|
| Qwen3 8B | 4,096 | 7,588 | +85% |
| Llama 3.3 70B | 526 | 1,355 | +158% |
This is particularly relevant for RAG (long prompts), tool use, or multi-turn chats – because it directly affects perceived responsiveness.
What 96 GB VRAM Enables for You
The biggest jump isn't "just" more speed – it's what becomes possible on a single GPU:
- 100B+ class without offloading or mandatory multi-GPU: 96 GB is the practical single-GPU size when models in the 100B+ parameter range need to run without offloading to system RAM.
- With Blackwell, large models like GPT-OSS 120B (~150 t/s) or Mistral Large 123B run directly on a single GPU.
Comparison: What Fits on Which GPU?
| Model | RTX 6000 Ada (48 GB) | RTX 6000 Blackwell (96 GB) |
|---|---|---|
| Llama 3.1 8B (Q4) | ✅ | ✅ |
| Llama 3.1 70B (Q4) | ⚠️ tight | ✅ |
| Qwen3 32B (Q4) | ✅ | ✅ |
| GPT-OSS 120B (MXFP4) | ❌ | ✅ |
| Mistral Large 123B (Q4) | ❌ | ✅ |
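If you want to try one of the larger rows yourself, a minimal vLLM sketch (offline API) could look like the following. The model name, context length, and memory fraction are example values under the assumption that the model fits in 96 GB, not a fixed recommendation.

```python
# Minimal vLLM sketch for single-GPU inference with a large model.
# Model name, context length, and memory fraction are example values.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",      # example; any model that fits in 96 GB
    max_model_len=32768,              # reserve KV cache for a 32K context
    gpu_memory_utilization=0.90,      # leave some VRAM headroom
)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain the difference between prefill and decode."], params)
print(outputs[0].outputs[0].text)
```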
What Stays the Same: Your Stack, Your Control, Your Location
The GPU change doesn't affect the basic principles of our AI servers:
- GDPR-compliant hosting in Germany and ISO 27001 certified data center operations
- Dedicated GPU resources without sharing – optimized for low latency and high throughput
- On request: Ollama & vLLM setup, OpenWebUI installation, GPU optimization
- With AI Server Pro: model training (fine-tuning) also possible
Note on Benchmark Transferability
The values quoted above come from a clearly defined test setup (llama.cpp/llama-bench, CUDA 12.8, 4-bit quantization). In production environments (e.g., vLLM, other quantizations, batch settings, KV cache strategies), absolute numbers may vary – but the direction (more VRAM + significant performance jump) is consistent.
Availability & Next Step
The AI Server Pro with NVIDIA RTX 6000 Blackwell Max-Q (96 GB GDDR7) is available now (on request, limited availability).
If you already have a specific goal (e.g., "70B chat with 32K context and X concurrent users" or "RAG with large document corpus"), we can quickly derive a suitable configuration and inference engine recommendation (Ollama vs. vLLM).
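As an illustration of that sizing exercise, here is a rough sketch for the "70B chat with 32K context" case, using the same rule of thumb as above. The architecture numbers (80 layers, 8 KV heads, head dim 128) are typical for current 70B models and are assumptions, not measurements.

```python
# Rough concurrency estimate: how many 32K-context sequences fit next to
# a ~4-bit 70B model in 96 GB VRAM? Architecture values are assumptions
# typical for current 70B models (80 layers, 8 KV heads, head_dim 128).
VRAM_GB = 96
weights_gb = 70e9 * 4.5 / 8 / 1e9                      # ~39 GB at ~4.5 bits/weight
kv_per_seq_gb = 2 * 80 * 8 * 128 * 32_768 * 2 / 1e9    # FP16 K+V cache, ~10.7 GB

overhead_gb = 6  # activations, CUDA context, fragmentation (rough allowance)
concurrent = int((VRAM_GB - weights_gb - overhead_gb) / kv_per_seq_gb)
print(f"~{concurrent} concurrent 32K sequences "
      f"(more with KV cache quantization or shorter contexts)")
```

KV cache quantization, shorter contexts, or continuous batching in vLLM shift this number upward – exactly the kind of trade-off we walk through when deriving a configuration.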
Frequently Asked Questions
Answers to important questions about this topic
What exactly is changing?
We're switching from the RTX 6000 Ada (48 GB VRAM) to the RTX 6000 Blackwell Max-Q (96 GB VRAM). Double the VRAM, new architecture, more performance.

How much faster is the new GPU?
40-170% faster depending on model and context. For large models like Llama 70B, token generation doubles.

Which models can I run with 96 GB VRAM?
With 96 GB VRAM, even 100B+ models like GPT-OSS 120B or Mistral Large 123B run on a single GPU without offloading.

Do I need to change my setup?
No. Ollama, vLLM, and OpenWebUI work as before. The GPU change is transparent to your applications.

What does the AI Server Pro cost?
€1,549.90/month – same as before, but with significantly more performance.

Is the upgrade available now?
Yes, available now on request. Limited availability.
Let's Talk About Your Idea
Whether a specific IT challenge or just an idea – we look forward to the exchange. In a brief conversation, we'll evaluate together if and how your project fits with WZ-IT.

Timo Wevelsiep & Robin Zins
CEOs of WZ-IT



