Simulate token generation speed for Large Language Models and understand how different speeds affect user experience.
Local AI inference without cloud dependency. Full data control.
| Model | AI Cube Basic (NVIDIA RTX PRO 4000 Blackwell) | AI Cube Pro (NVIDIA RTX PRO 6000 Blackwell) |
|---|---|---|
| GPT-OSS 20B (~20 billion parameters) | 50 tok/s | 200 tok/s |
| GPT-OSS 120B (~120 billion parameters) | — (not enough VRAM) | 150 tok/s |
* All values measured with batch size 1. Performance may vary depending on configuration.
- Quick responses for fluid conversation
- Sufficient for email generation
- Longer texts with good performance
Everything you need to know about token generation
Topics
Token generation speed (measured in tokens per second or tok/s) indicates how fast an AI model can generate text. One token corresponds to approximately 4 characters or 0.75 words. At 100 tok/s, about 75 words are generated per second.
Simulation helps developers and businesses understand how different speeds affect user experience. Slow generation (under 30 tok/s) feels sluggish, while fast generation (over 100 tok/s) provides a fluid experience.
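A minimal sketch of what such a simulation does: it streams fixed-size chunks at a chosen rate so you can feel the difference between, say, 20 and 100 tok/s. The 4-characters-per-token heuristic and the sample sentence are illustrative assumptions, not a real model's output.

```python
import time

def stream_tokens(text: str, tok_per_s: float) -> None:
    """Print ~4-character chunks ("tokens") at a fixed rate to
    mimic how an LLM response feels at a given speed."""
    delay = 1.0 / tok_per_s  # seconds between tokens
    # Rough heuristic from above: 1 token is about 4 characters.
    tokens = [text[i:i + 4] for i in range(0, len(text), 4)]
    for tok in tokens:
        print(tok, end="", flush=True)
        time.sleep(delay)
    print()

# 20 tok/s feels sluggish; try 100 to see the difference.
stream_tokens("Token speed shapes how responsive a chat assistant feels.", 20)
```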
A token is the smallest unit an LLM processes. It can be a word, word part, or punctuation mark. 'Hello' is one token, 'championship' might be split into 'champion' + 'ship'. Most LLMs use about 1 token per 4 characters.
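To see real splits, you can run a tokenizer yourself. A small sketch using OpenAI's tiktoken library and its cl100k_base vocabulary; other model families use different vocabularies, so the exact splits and counts will vary:

```python
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["Hello", "championship"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]  # each id back to its text piece
    print(f"{word!r} -> {len(ids)} token(s): {pieces}")
```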
Key factors include: 1) GPU performance and VRAM, 2) Model size (7B, 70B, 120B parameters), 3) Quantization (FP16, INT8, INT4), 4) Batch size, 5) Context length, 6) Inference backend (vLLM, Ollama, TensorRT).
Larger models are slower. A 7B model can achieve 200+ tok/s, a 70B model about 50-100 tok/s, and a 120B model typically 30-60 tok/s on consumer hardware. However, response quality increases with model size.
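A useful back-of-envelope check on why this is: at batch size 1, decoding is typically memory-bandwidth-bound, since each generated token reads roughly all model weights once. The numbers below are illustrative assumptions (a 20B model at ~1 byte per parameter on a GPU with ~1000 GB/s of memory bandwidth), not measured values:

```python
def rough_decode_ceiling(params_b: float, bytes_per_param: float,
                         bandwidth_gb_s: float) -> float:
    """Upper-bound tok/s for batch-1 decoding: every new token
    must stream (roughly) all weights from GPU memory once."""
    weight_gb = params_b * bytes_per_param  # total weight size in GB
    return bandwidth_gb_s / weight_gb

# 20B params * 1 byte = 20 GB; 1000 GB/s / 20 GB ≈ 50 tok/s ceiling.
print(f"{rough_decode_ceiling(20, 1.0, 1000):.0f} tok/s ceiling")
```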
No, speed varies. Before the first token appears, the model processes the entire prompt (the prefill phase), which adds initial latency; after that, the per-token rate largely stabilizes. Query complexity, context length, and system load also affect speed.
Under 20 tok/s: Noticeably slow, frustrating. 20-50 tok/s: Acceptable for most applications. 50-100 tok/s: Good experience, feels fluid. Over 100 tok/s: Excellent, text appears almost instantly.
Chatbots: 50-100 tok/s for fluid conversation. Document generation: 30-50 tok/s sufficient. Real-time translation: 100+ tok/s recommended. Code completion: 100+ tok/s for best developer experience.
1) Use better GPU with more VRAM, 2) Quantization (INT4/INT8) for smaller models, 3) Use optimized inference engines like vLLM, 4) Batch processing for multiple requests, 5) Optimize KV-cache, 6) Enable continuous batching.
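As a sketch of point 3, this is roughly what serving a model through vLLM's offline Python API looks like. The model id, quantization choice, and memory setting are placeholders to adapt to your own checkpoint, not a recommended configuration:

```python
# Requires: pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-20b",   # placeholder model id; substitute your own
    quantization="awq",           # INT4-style quantization (needs an AWQ checkpoint)
    gpu_memory_utilization=0.90,  # leave headroom for the KV cache
)
params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain tokens in one sentence."], params)
print(outputs[0].outputs[0].text)
```

vLLM applies continuous batching automatically when multiple requests arrive, which is why it appears twice in the list above.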
AI Cube Basic (RTX 4060 Ti 16GB): ~50 tok/s with 20B models. AI Cube Pro (RTX 5090 48GB): ~200 tok/s with 20B models, ~150 tok/s with 120B models. These values may vary depending on configuration and optimization.
Cloud APIs have network latency, rate limits, and share resources with other users. Local hardware like the AI Cube offers dedicated resources, no network delay, and consistent performance without queues.
Cloud APIs charge per token (e.g., $0.002/1K tokens). With high usage, this adds up quickly. Local hardware has one-time acquisition costs but no ongoing token costs. From about 500,000 tokens/month, local hardware often pays off.
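The break-even point is simple arithmetic: divide the one-time hardware cost by the monthly API spend you would otherwise incur. All inputs below are illustrative assumptions, not WZ-IT pricing:

```python
def breakeven_months(hardware_cost: float, tokens_per_month: float,
                     price_per_1k: float) -> float:
    """Months until one-time hardware cost equals cumulative API spend."""
    monthly_api_cost = tokens_per_month / 1000 * price_per_1k
    return hardware_cost / monthly_api_cost

# Example: $10,000 hardware vs. 100M tokens/month at $0.01 per 1K tokens
# -> $1,000/month in API fees -> break-even after 10 months.
print(f"{breakeven_months(10_000, 100_000_000, 0.01):.1f} months")
```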
The AI Cube offers local AI inference up to 200 tok/s – without cloud dependency.
Discover AI Cube
Whether it's a specific IT challenge or just an idea – we look forward to talking with you. In a brief conversation, we'll evaluate together whether and how your project fits with WZ-IT.
Timo Wevelsiep & Robin Zins
CEOs of WZ-IT
