
Upgrade Announcement: Our Cloud GPU Servers Now Run on NVIDIA RTX 6000 Blackwell

Timo Wevelsiep
#AI #GPU #NVIDIA #Blackwell #LLM #Inference #CloudServer #GDPR #AIServer

More GPU power for your AI applications? Our AI Server Pro now runs on the NVIDIA RTX 6000 Blackwell Max-Q with 96 GB VRAM. Schedule a consultation →

We are now running the NVIDIA RTX 6000 Blackwell Max-Q instead of the RTX 6000 Ada in our AI Server Pro. This gives you significantly more headroom in the cloud for large models, longer contexts, and higher throughput requirements – without having to rethink your setup (Ollama/vLLM/OpenWebUI).



Key Changes at a Glance

1) Significantly More VRAM: 96 GB Instead of 48 GB

Our Pro instances now come with 96 GB GDDR7 VRAM (ECC) in the Blackwell generation – the RTX 6000 Ada had 48 GB GDDR6 (ECC). This is the most important change for LLMs, because VRAM in practice determines model size, context length, batch size, and parallelism.
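
As a rough rule of thumb: the weights take about parameters × bytes per parameter, and the KV cache grows with context length, batch size, and the model's layer/head geometry. The following back-of-envelope sketch illustrates the math; the 70B-class dimensions in the example (80 layers, 8 KV heads, head dimension 128) are illustrative assumptions, and real memory use additionally depends on the inference engine's own overhead.

```python
# Back-of-envelope VRAM estimate for LLM inference (illustrative only).
# Real usage also depends on the engine (llama.cpp, vLLM), the exact
# quantization, activation buffers, and CUDA overhead.

def estimate_vram_gb(
    params_billion: float,   # model size in billions of parameters
    bytes_per_param: float,  # ~0.55 for 4-bit (Q4_K), 2.0 for FP16
    n_layers: int,           # transformer layers
    n_kv_heads: int,         # KV heads (GQA models have fewer than attention heads)
    head_dim: int,           # dimension per head
    context_len: int,        # tokens of context to reserve KV cache for
    batch_size: int = 1,     # concurrent sequences
    kv_bytes: float = 2.0,   # FP16 KV cache; 1.0 for 8-bit KV cache
) -> float:
    weights = params_billion * 1e9 * bytes_per_param
    # KV cache: 2 (K and V) x layers x kv_heads x head_dim x bytes x tokens x batch
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * context_len * batch_size
    overhead = 1.5e9  # rough allowance for activations and CUDA context
    return (weights + kv_cache + overhead) / 1e9

# Example: a 70B-class model (assumed: 80 layers, 8 KV heads, head_dim 128)
# at 4-bit quantization with 32K context and 4 concurrent requests.
print(f"{estimate_vram_gb(70, 0.55, 80, 8, 128, 32_768, batch_size=4):.1f} GB")
```

With these assumptions the estimate lands at around 83 GB – comfortably within 96 GB, but well beyond the 48 GB of the previous generation.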

2) New Architecture (Blackwell) and New Core Generations

The RTX PRO 6000 Blackwell series brings Blackwell architecture, 5th-Gen Tensor Cores, and 4th-Gen RT Cores. The RTX 6000 Ada is based on Ada Lovelace with 4th-Gen Tensor Cores and 3rd-Gen RT Cores. For AI workloads, the jump in Tensor Cores is particularly relevant.

3) Platform Modernization: PCIe Gen 5 vs. PCIe Gen 4

The RTX PRO 6000 Blackwell variants are designed for PCIe Gen 5 x16, the RTX 6000 Ada for PCIe Gen 4 x16. This can provide additional reserves depending on workload (e.g., data streaming, multi-node pipelines).

4) Max-Q: Density & Efficiency for Scalable Setups

We use the Max-Q variant in the AI Server Pro. NVIDIA explicitly positions this for dense configurations and as a balance of performance and energy efficiency – ideal when you want to optimize performance per rack/server.


Why This Step: Larger Models, Longer Contexts, More Concurrent Users

In practice, we see three typical bottlenecks in productive LLM applications:

  1. VRAM Limit – Model doesn't fit completely in the GPU
  2. Time-to-First-Token – Prompt processing / prefill is too slow
  3. Throughput – Token generation under load is too low

With Blackwell, we address exactly these points: more VRAM for large models/long context and noticeably more performance in both prompt processing and token generation.
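
To see where your own workload stands, it is worth measuring time-to-first-token and generation throughput directly against the server. Here is a minimal sketch using the OpenAI-compatible streaming API that both Ollama and vLLM expose; the base URL and model name are placeholders for your own deployment:

```python
# Measure time-to-first-token (TTFT) and generation throughput against an
# OpenAI-compatible endpoint (both Ollama and vLLM provide one).
# Base URL and model name are placeholders - adjust them to your deployment.
import time

from openai import OpenAI

client = OpenAI(base_url="http://your-ai-server:8000/v1", api_key="unused")

start = time.perf_counter()
first_token_at = None
chunks = []

stream = client.chat.completions.create(
    model="llama3.1:70b",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize the benefits of more GPU VRAM."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks.append(delta)

total = time.perf_counter() - start
ttft = first_token_at - start
# Most servers send roughly one token per chunk, so chunks/s approximates tokens/s.
print(f"TTFT: {ttft:.2f} s")
print(f"Generation: ~{len(chunks) / (total - ttft):.1f} chunks/s")
```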


Performance: Blackwell vs. Ada (LLM Inference Benchmarks)

For external, verifiable context, we use measurements from Hardware Corner. They test with llama.cpp / llama-bench on Ubuntu 24.04 with CUDA 12.8; token generation is reported in tokens per second at 4-bit quantization (Q4_K_XL).

Token Generation at 16K Context

| Model (16K Context, 4-bit) | RTX 6000 Ada (t/s) | RTX 6000 Blackwell (t/s) | Blackwell Advantage |
|---|---|---|---|
| Qwen3 8B | 98.68 | 140.62 | +42.5% |
| Qwen3 14B | 58.51 | 96.86 | +65.5% |
| Qwen3 30B | 120.12 | 139.76 | +16.4% |
| Qwen3 32B | 25.08 | 45.72 | +82.3% |
| gpt-oss 20B | 137.10 | 237.92 | +73.5% |
| Llama 70B | 13.65 | 28.24 | +106.9% |

Source: Hardware Corner Token Generation Table (16K Context)

Important for practice: The larger the model and context, the more Blackwell pays off. At 65K context, you see nearly a doubling (or more) of token generation in several cases – e.g., Qwen3 14B +123%, Qwen3 32B +173%.

Prompt Processing (Time-to-First-Token)

Prefill/prompt processing is often an underestimated factor. At 16K context:

| Model | RTX 6000 Ada (tok/s) | RTX 6000 Blackwell (tok/s) | Advantage |
|---|---|---|---|
| Qwen3 8B | 4,096 | 7,588 | +85% |
| Llama 3.3 70B | 526 | 1,355 | +158% |

This is particularly relevant for RAG (long prompts), tool use, or multi-turn chats – because it directly affects perceived responsiveness.
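
A quick back-of-envelope example of what this means for a full 16K-token prompt (e.g., a RAG context) on Llama 3.3 70B, using the prefill rates from the table above:

```python
# Rough prefill latency for a full 16K-token prompt, using the
# prompt-processing rates from the table above (Llama 3.3 70B).
prompt_tokens = 16_000
prefill_rates = {"RTX 6000 Ada": 526, "RTX 6000 Blackwell": 1355}  # tokens/second

for gpu, rate in prefill_rates.items():
    print(f"{gpu}: ~{prompt_tokens / rate:.0f} s until the first output token")
```

That is roughly 30 seconds versus roughly 12 seconds before the answer even starts to appear – a difference users feel immediately.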


What 96 GB VRAM Enables for You

The biggest jump isn't "just" more speed – it's that more becomes possible on a single GPU:

  • 100B+ class without offloading and without multi-GPU as a requirement: 96 GB is the practical single-GPU option when models in the 100B+ parameter range need to run without system-RAM offloading.
  • With Blackwell, large models like GPT-OSS 120B (~150 t/s) or Mistral Large 123B run directly on a single GPU – a minimal example follows below.
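
For illustration, here is a minimal sketch of talking to such a model via the Ollama Python client; host and model tag are placeholders and assume gpt-oss:120b has already been pulled on the server:

```python
# Chat with a 100B+ class model served by Ollama on a single 96 GB GPU.
# Host and model tag are placeholders - adjust them to your deployment;
# the model must already be pulled (e.g. via "ollama pull gpt-oss:120b").
from ollama import Client

client = Client(host="http://your-ai-server:11434")

response = client.chat(
    model="gpt-oss:120b",
    messages=[{"role": "user", "content": "Name three use cases for self-hosted LLMs."}],
)
print(response["message"]["content"])
```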

Comparison: What Fits on Which GPU?

| Model | RTX 6000 Ada (48 GB) | RTX 6000 Blackwell (96 GB) |
|---|---|---|
| Llama 3.1 8B (Q4) | ✅ | ✅ |
| Llama 3.1 70B (Q4) | ⚠️ tight | ✅ |
| Qwen3 32B (Q4) | ✅ | ✅ |
| GPT-OSS 120B (MXFP4) | ❌ | ✅ |
| Mistral Large 123B (Q4) | ❌ | ✅ |

What Stays the Same: Your Stack, Your Control, Your Location

The GPU change doesn't affect the basic principles of our AI servers:

  • GDPR-compliant hosting in Germany and ISO 27001 certified data center operations
  • Dedicated GPU resources without sharing – optimized for low latency and high throughput
  • On request: Ollama & vLLM setup, OpenWebUI installation, GPU optimization
  • With AI Server Pro: model training (fine-tuning) also possible

Note on Benchmark Transferability

The values quoted above come from a clearly defined test setup (llama.cpp/llama-bench, CUDA 12.8, 4-bit quantization). In production environments (e.g., vLLM, other quantizations, batch settings, KV cache strategies), absolute numbers may vary – but the direction (more VRAM + significant performance jump) is consistent.
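
For orientation, these are roughly the knobs involved on the vLLM side – a minimal offline-inference sketch, not a tuned production configuration; model ID, context length, batch limit, and memory fraction are example values:

```python
# Minimal vLLM offline-inference sketch on a single 96 GB GPU.
# Model ID, context length, batch limit, and memory fraction are example
# values - not a tuned production configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",  # example Hugging Face model ID
    max_model_len=32_768,         # context window to reserve KV cache for
    max_num_seqs=8,               # upper bound on concurrently batched requests
    gpu_memory_utilization=0.90,  # fraction of the 96 GB handed to vLLM
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain the difference between prefill and decode."], params)
print(outputs[0].outputs[0].text)
```

The same knobs are available as CLI flags (--max-model-len, --max-num-seqs, --gpu-memory-utilization) when serving an OpenAI-compatible endpoint with vllm serve.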


Availability & Next Step

The AI Server Pro with NVIDIA RTX 6000 Blackwell Max-Q (96 GB GDDR7) is available now (on request, limited availability).

If you already have a specific goal (e.g., "70B chat with 32K context and X concurrent users" or "RAG with large document corpus"), we can quickly derive a suitable configuration and inference engine recommendation (Ollama vs. vLLM).

Schedule a consultation →



Frequently Asked Questions

Answers to important questions about this topic

What exactly is changing with the upgrade?
We're switching from the RTX 6000 Ada (48 GB VRAM) to the RTX 6000 Blackwell Max-Q (96 GB VRAM). Double the VRAM, new architecture, more performance.

How much faster is the new GPU?
40-170% faster depending on model and context. For large models like Llama 70B, token generation doubles.

Which models can now run on a single GPU?
With 96 GB VRAM, even 100B+ models like GPT-OSS 120B or Mistral Large 123B run on a single GPU without offloading.

Do I need to change my setup (Ollama, vLLM, OpenWebUI)?
No. Ollama, vLLM, and OpenWebUI work as before. The GPU change is transparent to your applications.

What does the AI Server Pro cost?
€1,549.90/month – same as before, but with significantly more performance.

Is the upgraded AI Server Pro available now?
Yes, available now on request. Limited availability.

