
Upgrade Announcement: Our Cloud GPU Servers Now Run on NVIDIA RTX 6000 Blackwell

Timo Wevelsiep
#AI #GPU #NVIDIA #Blackwell #LLM #Inference #CloudServer #GDPR #AIServer

More GPU power for your AI applications? Our AI Server Pro now runs on the NVIDIA RTX 6000 Blackwell Max-Q with 96 GB VRAM. Schedule a consultation →

We are now running the NVIDIA RTX 6000 Blackwell Max-Q instead of the RTX 6000 Ada in our AI Server Pro. This gives you significantly more headroom in the cloud for large models, longer contexts, and higher throughput requirements – without having to rethink your setup (Ollama/vLLM/OpenWebUI).



Key Changes at a Glance

1) Significantly More VRAM: 96 GB Instead of 48 GB

Our Pro instances now come with 96 GB GDDR7 VRAM (ECC) in the Blackwell generation – the RTX 6000 Ada had 48 GB GDDR6 (ECC). This is the most important change for LLMs, because VRAM in practice determines model size, context length, batch size, and parallelism.
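
As a rough rule of thumb: the weights take about parameters × bytes per parameter, and the KV cache grows with context length, batch size, and the model's layer/head geometry. The following back-of-envelope sketch illustrates the math; the 70B-class dimensions in the example (80 layers, 8 KV heads, head dimension 128) are illustrative assumptions, and real memory use additionally depends on the inference engine's own overhead.

```python
# Back-of-envelope VRAM estimate for LLM inference (illustrative only).
# Real usage also depends on the engine (llama.cpp, vLLM), the exact
# quantization, activation buffers, and CUDA overhead.

def estimate_vram_gb(
    params_billion: float,   # model size in billions of parameters
    bytes_per_param: float,  # ~0.55 for 4-bit (Q4_K), 2.0 for FP16
    n_layers: int,           # transformer layers
    n_kv_heads: int,         # KV heads (GQA models have fewer than attention heads)
    head_dim: int,           # dimension per head
    context_len: int,        # tokens of context to reserve KV cache for
    batch_size: int = 1,     # concurrent sequences
    kv_bytes: float = 2.0,   # FP16 KV cache; 1.0 for 8-bit KV cache
) -> float:
    weights = params_billion * 1e9 * bytes_per_param
    # KV cache: 2 (K and V) x layers x kv_heads x head_dim x bytes x tokens x batch
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * context_len * batch_size
    overhead = 1.5e9  # rough allowance for activations and CUDA context
    return (weights + kv_cache + overhead) / 1e9

# Example: a 70B-class model (assumed: 80 layers, 8 KV heads, head_dim 128)
# at 4-bit quantization with 32K context and 4 concurrent requests.
print(f"{estimate_vram_gb(70, 0.55, 80, 8, 128, 32_768, batch_size=4):.1f} GB")
```

With these assumptions the estimate lands at around 83 GB – comfortably within 96 GB, but well beyond the 48 GB of the previous generation.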

2) New Architecture (Blackwell) and New Core Generations

The RTX PRO 6000 Blackwell series brings Blackwell architecture, 5th-Gen Tensor Cores, and 4th-Gen RT Cores. The RTX 6000 Ada is based on Ada Lovelace with 4th-Gen Tensor Cores and 3rd-Gen RT Cores. For AI workloads, the jump in Tensor Cores is particularly relevant.

3) Platform Modernization: PCIe Gen 5 vs. PCIe Gen 4

The RTX PRO 6000 Blackwell variants are designed for PCIe Gen 5 x16, the RTX 6000 Ada for PCIe Gen 4 x16. This can provide additional reserves depending on workload (e.g., data streaming, multi-node pipelines).

4) Max-Q: Density & Efficiency for Scalable Setups

We use the Max-Q variant in the AI Server Pro. NVIDIA explicitly positions this for dense configurations and as a balance of performance and energy efficiency – ideal when you want to optimize performance per rack/server.


Why This Step: Larger Models, Longer Contexts, More Concurrent Users

In practice, we see three typical bottlenecks in productive LLM applications:

  1. VRAM Limit – Model doesn't fit completely in the GPU
  2. Time-to-First-Token – Prompt processing / prefill is too slow
  3. Throughput – Token generation under load is too low

With Blackwell, we address exactly these points: more VRAM for large models/long context and noticeably more performance in both prompt processing and token generation.
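
To see where your own workload stands, it is worth measuring time-to-first-token and generation throughput directly against the server. Here is a minimal sketch using the OpenAI-compatible streaming API that both Ollama and vLLM expose; the base URL and model name are placeholders for your own deployment:

```python
# Measure time-to-first-token (TTFT) and generation throughput against an
# OpenAI-compatible endpoint (both Ollama and vLLM provide one).
# Base URL and model name are placeholders - adjust them to your deployment.
import time

from openai import OpenAI

client = OpenAI(base_url="http://your-ai-server:8000/v1", api_key="unused")

start = time.perf_counter()
first_token_at = None
chunks = []

stream = client.chat.completions.create(
    model="llama3.1:70b",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize the benefits of more GPU VRAM."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks.append(delta)

total = time.perf_counter() - start
ttft = first_token_at - start
# Most servers send roughly one token per chunk, so chunks/s approximates tokens/s.
print(f"TTFT: {ttft:.2f} s")
print(f"Generation: ~{len(chunks) / (total - ttft):.1f} chunks/s")
```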


Performance: Blackwell vs. Ada (LLM Inference Benchmarks)

For external, verifiable context, we use measurements from Hardware Corner. They test with llama.cpp / llama-bench on Ubuntu 24.04 with CUDA 12.8; token generation is reported in tokens per second at 4-bit quantization (Q4_K_XL).

Token Generation at 16K Context

| Model (16K Context, 4-bit) | RTX 6000 Ada (t/s) | RTX 6000 Blackwell (t/s) | Blackwell Advantage |
|---|---|---|---|
| Qwen3 8B | 98.68 | 140.62 | +42.5% |
| Qwen3 14B | 58.51 | 96.86 | +65.5% |
| Qwen3 30B | 120.12 | 139.76 | +16.4% |
| Qwen3 32B | 25.08 | 45.72 | +82.3% |
| gpt-oss 20B | 137.10 | 237.92 | +73.5% |
| Llama 70B | 13.65 | 28.24 | +106.9% |

Source: Hardware Corner Token Generation Table (16K Context)

Important for practice: The larger the model and context, the more Blackwell pays off. At 65K context, you see nearly a doubling (or more) of token generation in several cases – e.g., Qwen3 14B +123%, Qwen3 32B +173%.

Prompt Processing (Time-to-First-Token)

Prefill/prompt processing is often an underestimated factor. At 16K context:

| Model | RTX 6000 Ada (tok/s) | RTX 6000 Blackwell (tok/s) | Advantage |
|---|---|---|---|
| Qwen3 8B | 4,096 | 7,588 | +85% |
| Llama 3.3 70B | 526 | 1,355 | +158% |

This is particularly relevant for RAG (long prompts), tool use, or multi-turn chats – because it directly affects perceived responsiveness.
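
A quick back-of-envelope example of what this means for a full 16K-token prompt (e.g., a RAG context) on Llama 3.3 70B, using the prefill rates from the table above:

```python
# Rough prefill latency for a full 16K-token prompt, using the
# prompt-processing rates from the table above (Llama 3.3 70B).
prompt_tokens = 16_000
prefill_rates = {"RTX 6000 Ada": 526, "RTX 6000 Blackwell": 1355}  # tokens/second

for gpu, rate in prefill_rates.items():
    print(f"{gpu}: ~{prompt_tokens / rate:.0f} s until the first output token")
```

That is roughly 30 seconds versus roughly 12 seconds before the answer even starts to appear – a difference users feel immediately.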


What 96 GB VRAM Enables for You

The biggest jump isn't "just" more speed – it's that more becomes possible on a single GPU:

  • 100B+ class without offloading and without multi-GPU as a requirement: 96 GB is the practical single-GPU option when models in the 100B+ parameter range need to run without system-RAM offloading.
  • With Blackwell, large models like GPT-OSS 120B (~150 t/s) or Mistral Large 123B run directly on a single GPU – a minimal example follows below.
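
For illustration, here is a minimal sketch of talking to such a model via the Ollama Python client; host and model tag are placeholders and assume gpt-oss:120b has already been pulled on the server:

```python
# Chat with a 100B+ class model served by Ollama on a single 96 GB GPU.
# Host and model tag are placeholders - adjust them to your deployment;
# the model must already be pulled (e.g. via "ollama pull gpt-oss:120b").
from ollama import Client

client = Client(host="http://your-ai-server:11434")

response = client.chat(
    model="gpt-oss:120b",
    messages=[{"role": "user", "content": "Name three use cases for self-hosted LLMs."}],
)
print(response["message"]["content"])
```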

Comparison: What Fits on Which GPU?

| Model | RTX 6000 Ada (48 GB) | RTX 6000 Blackwell (96 GB) |
|---|---|---|
| Llama 3.1 8B (Q4) | ✅ | ✅ |
| Llama 3.1 70B (Q4) | ⚠️ tight | ✅ |
| Qwen3 32B (Q4) | ✅ | ✅ |
| GPT-OSS 120B (MXFP4) | ❌ | ✅ |
| Mistral Large 123B (Q4) | ❌ | ✅ |

What Stays the Same: Your Stack, Your Control, Your Location

The GPU change doesn't affect the basic principles of our AI servers:

  • GDPR-compliant hosting in Germany and ISO 27001 certified data center operations
  • Dedicated GPU resources without sharing – optimized for low latency and high throughput
  • On request: Ollama & vLLM setup, OpenWebUI installation, GPU optimization
  • With AI Server Pro: model training (fine-tuning) also possible

Note on Benchmark Transferability

The values quoted above come from a clearly defined test setup (llama.cpp/llama-bench, CUDA 12.8, 4-bit quantization). In production environments (e.g., vLLM, other quantizations, batch settings, KV cache strategies), absolute numbers may vary – but the direction (more VRAM + significant performance jump) is consistent.
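
For orientation, these are roughly the knobs involved on the vLLM side – a minimal offline-inference sketch, not a tuned production configuration; model ID, context length, batch limit, and memory fraction are example values:

```python
# Minimal vLLM offline-inference sketch on a single 96 GB GPU.
# Model ID, context length, batch limit, and memory fraction are example
# values - not a tuned production configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",  # example Hugging Face model ID
    max_model_len=32_768,         # context window to reserve KV cache for
    max_num_seqs=8,               # upper bound on concurrently batched requests
    gpu_memory_utilization=0.90,  # fraction of the 96 GB handed to vLLM
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain the difference between prefill and decode."], params)
print(outputs[0].outputs[0].text)
```

The same knobs are available as CLI flags (--max-model-len, --max-num-seqs, --gpu-memory-utilization) when serving an OpenAI-compatible endpoint with vllm serve.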


Availability & Next Step

The AI Server Pro with NVIDIA RTX 6000 Blackwell Max-Q (96 GB GDDR7) is available now (on request, limited availability).

If you already have a specific goal (e.g., "70B chat with 32K context and X concurrent users" or "RAG with large document corpus"), we can quickly derive a suitable configuration and inference engine recommendation (Ollama vs. vLLM).

Schedule a consultation →



Frequently Asked Questions

Answers to important questions about this topic

What exactly is changing with the upgrade?
We're switching from the RTX 6000 Ada (48 GB VRAM) to the RTX 6000 Blackwell Max-Q (96 GB VRAM). Double the VRAM, new architecture, more performance.

How much faster is the new GPU?
40-170% faster depending on model and context. For large models like Llama 70B, token generation doubles.

Which models can now run on a single GPU?
With 96 GB VRAM, even 100B+ models like GPT-OSS 120B or Mistral Large 123B run on a single GPU without offloading.

Do I need to change my setup (Ollama, vLLM, OpenWebUI)?
No. Ollama, vLLM, and OpenWebUI work as before. The GPU change is transparent to your applications.

What does the AI Server Pro cost?
€1,549.90/month – same as before, but with significantly more performance.

Is the upgraded AI Server Pro available now?
Yes, available now on request. Limited availability.

