At AIFinitee, we’ve spent months chasing tokens per second. Our two-node AMD Ryzen AI MAX cluster hits 17-20 tok/s with MiniMax-M2. Our Linux-vs-Windows benchmarks showed how your OS quietly taxes performance by 5-30%.
But here’s the question nobody asks: what if faster isn’t always better?
This post answers that question with real-world benchmark data from our dual-node AMD Ryzen AI MAX 395+ setup. We’ll compare models across speed tiers, show exactly where model size beats raw velocity, and help you decide when to run a 27B model at 25+ tokens/s versus a 397B MoE at 10-12 tokens/s for AI coding workflows like Claude Code or Hermes Agent.
The Models: What We’re Comparing
| Model | Total Params | Active Params | Architecture | LiveCodeBench | SWE-bench Verified |
|---|---|---|---|---|---|
| MiniMax-M2 | 230B | 10B (MoE) | Mixture of Experts | 83% | 69.4% |
| Qwen3.5-397B-A17B | 397B | 17B (MoE) | Mixture of Experts | ~85% (est.) | TBD |
Source: Hugging Face model cards for Qwen3.5-27B and MiniMax-M2
Benchmarks: AMD Ryzen AI MAX 395+ Dual-Node
Here’s what we measured on our two-node AMD Ryzen AI MAX 395+ cluster:
| Model | Quantization | Tokens/Second (Dual Node) | Tokens/Second (Single Node, Est.) |
|---|---|---|---|
| MiniMax-M2 | Q4_K_M | 17-20 tok/s | ~8-10 tok/s |
| Qwen3.5-397B-A17B | Q4_K_XL | 10-12 tok/s (confirmed) | Not feasible (VRAM) |
Notes:
- Single-node figures are estimates extrapolated from dual-node scaling patterns, not direct measurements
- Qwen3.5-397B-A17B requires distributed inference due to VRAM requirements (~100GB+ at Q4_K_XL)
- All benchmarks use llama.cpp with GGUF quantized models
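
To see why the 397B model needs more than one node, here is a rough back-of-the-envelope sketch of the weight footprint. The 4.5 bits-per-weight figure is our assumption for a Q4-class quantization, not a measured GGUF file size, and the function names are hypothetical. It also ignores KV cache and runtime overhead, so treat the output as a lower bound.

```python
def weight_footprint_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return total_params_billions * bits_per_weight / 8


def per_node_share_gb(footprint_gb: float, num_nodes: int) -> float:
    """Weights each node holds if the model is split evenly across nodes."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return footprint_gb / num_nodes


# Assumption: effective bits per weight for a Q4-class quant (real files vary).
ASSUMED_BITS_PER_WEIGHT = 4.5

total = weight_footprint_gb(397, ASSUMED_BITS_PER_WEIGHT)
print(f"Qwen3.5-397B-A17B weights: ~{total:.0f} GB total")
print(f"Split across 2 nodes: ~{per_node_share_gb(total, 2):.0f} GB per node")
```

Note that MoE models only activate a fraction of their parameters per token, but all experts still have to sit in memory, which is why total parameters drive the footprint rather than active parameters.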
The Core Question: What Do You Gain at 10 tok/s vs. 20+ tok/s?
| Speed | Experience | When It Feels Right |
|---|---|---|
| 2-4 tok/s | Painfully slow | Only for specialized tasks where accuracy dramatically outweighs wait time |
| 10-12 tok/s | Usable, deliberate | Complex reasoning where you’d rather ask once than iterate five times |
| 17-20 tok/s | Comfortable sweet spot | Explanations flow at reading pace; code feels responsive |
| 20-25+ tok/s | Fluid, disappears | Output races ahead of your eyes; ideal for rapid prototyping |
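
To put the tiers in wall-clock terms, a quick sketch of how long one response takes at each speed. The 600-token response length is a hypothetical example, and the calculation ignores prompt processing time, which adds to the wait in practice.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady generation rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


RESPONSE_TOKENS = 600  # hypothetical length of one code-heavy answer

# One representative speed from each tier in the table above.
for tps in (3, 11, 18, 25):
    wait = seconds_to_generate(RESPONSE_TOKENS, tps)
    print(f"{tps:>2} tok/s -> {wait:.0f} s for {RESPONSE_TOKENS} tokens")
```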
Where Model Size Actually Matters: Three Concrete Examples
Example 1: The “Photoshop-esque” Codebase
You’re building a web app with 40+ files: layer management, filter pipelines, undo/redo, and export workflows.
Qwen3.5-27B at 25 tok/s: Fast, but misses hidden connections. Refactor the filter system and it breaks the undo stack.
Qwen3.5-397B-A17B at 10 tok/s: Slower, but sees the full picture. Updates filters, undo, and exports in one pass.
Verdict: Multi-file refactors favor size over speed. Use 397B for architecture; 27B for simple components.
Example 2: The “Inherit a Mess” Debugging Session
Your team inherits a Node.js microservices monorepo with 12 services, shared utility packages, and circular dependency hell. A production bug surfaces: user sessions expire randomly in staging but never in dev.
Qwen3.5-27B at 25 tok/s: Suggests checking the Redis TTL config, then the JWT expiry, then the load balancer timeout—each guess is reasonable but shallow. After 5 iterations and 15 minutes, you’re still hunting.
Qwen3.5-397B-A17B at 10 tok/s: Asks to see how session tokens are generated vs. validated across services. Spots that the auth service writes timestamps in UTC milliseconds but the gateway parses them as seconds: a timestamp unit mismatch. One shot, correct fix.
Verdict: Debugging unknown systems rewards deep inference over fast iteration. The 397B model’s extra reasoning capacity pays off in reduced total time-to-fix.
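
For readers who haven't hit this class of bug, here is a minimal sketch of a milliseconds-vs-seconds mix-up in an expiry check. Everything here is illustrative: the function names, the threshold heuristic, and the timestamps are our assumptions, not code from the scenario. In this sketch the misread makes an expired session look valid; which way it fails in a real system depends on which side misreads the value.

```python
def is_expired_naive(expires_at: float, now_seconds: float) -> bool:
    """Buggy check: assumes expires_at is already in epoch seconds."""
    return now_seconds >= expires_at


# Heuristic (assumption): epoch values this large are treated as milliseconds.
# As seconds, 1e11 would be thousands of years in the future.
MS_THRESHOLD = 100_000_000_000


def to_epoch_seconds(timestamp: float) -> float:
    """Normalize an epoch timestamp that may be in seconds or milliseconds."""
    return timestamp / 1000 if timestamp >= MS_THRESHOLD else float(timestamp)


def is_expired(expires_at: float, now_seconds: float) -> bool:
    """Expiry check that normalizes units before comparing."""
    return now_seconds >= to_epoch_seconds(expires_at)


# Token issued with a one-hour lifetime, stored in milliseconds.
issued_ms = 1_700_000_000_000
expires_ms = issued_ms + 3600 * 1000

# The validating service checks two hours later, using seconds.
now_s = 1_700_000_000 + 2 * 3600

print(is_expired_naive(expires_ms, now_s))  # False: expired session looks valid
print(is_expired(expires_ms, now_s))        # True: units normalized first
```

A stricter fix is to agree on one unit at the service boundary rather than guessing from magnitude; the threshold above is only a defensive fallback.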
Example 3: The “Migration Marathon”
Your team is migrating from Next.js 13 to 15: App Router restructuring, Server Components adoption, API route changes, and third-party library updates across 80+ files.
Qwen3.5-27B at 25 tok/s: Handles file-by-file conversions well. But when "use client" directives need to move because of a prop type change three levels up the tree, it loses track. You become the integration layer.
Qwen3.5-397B-A17B at 10 tok/s: Tracks the cascade. It knows which components need the "use client" directive, updates the shared types, fixes the Server Component boundaries, and adjusts the loading states for streaming compatibility. Slower token output, but coherent across the whole migration.
Verdict: Cross-file migrations require holding “before” and “after” states simultaneously—a strength of larger architectures.
When Speed Beats Size: The Counterexamples
Not every task needs a 397B model. Here’s when faster wins:
| Scenario | Why Faster Models Win |
|---|---|
| “Write a React form” | 27B knows forms. No need to wait for 397B to think. |
| “Explain this error” | Most error messages are lookup tasks, not reasoning tasks. |
| “Add logging here” | Mechanical edits favor tokens/second over depth. |
| “Generate unit tests” | Volume matters more than insight for coverage. |
| Exploration / prototyping | You’re testing ideas; fast iteration beats perfect answers. |
The Pattern: When Does Size Win?
| Scenario Type | Why Larger Models Win | Recommended Model |
|---|---|---|
| Multi-file refactors | Track dependencies across files simultaneously | Qwen3.5-397B-A17B @ 10 tok/s |
| Debugging unknown systems | Infer hidden couplings from limited context | Qwen3.5-397B-A17B @ 10 tok/s |
| Architecture decisions | Weigh tradeoffs across multiple dimensions | MiniMax-M2 @ 17-20 tok/s or Qwen3.5-397B-A17B @ 10-12 tok/s |
| Cross-service integration | Understand protocol interactions and failure modes | Qwen3.5-397B-A17B @ 10 tok/s |
| Rapid prototyping | Fast iteration > perfect answers | Qwen3.5-27B @ 25+ tok/s |
| Mechanical edits | Tokens/second directly impacts workflow speed | Qwen3.5-27B @ 25+ tok/s |
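
If you script your tooling, the table above can be turned into a simple lookup. This is a sketch only: the task-type keys and the `pick_model` helper are hypothetical names we made up for illustration, and for architecture decisions we arbitrarily pick MiniMax-M2 from the two options in the table.

```python
FAST_MODEL = "Qwen3.5-27B"
BALANCED_MODEL = "MiniMax-M2"
DEEP_MODEL = "Qwen3.5-397B-A17B"

# Hypothetical task-type keys, one per row of the table above.
ROUTES = {
    "multi_file_refactor": DEEP_MODEL,
    "debug_unknown_system": DEEP_MODEL,
    "architecture_decision": BALANCED_MODEL,  # the table also allows DEEP_MODEL
    "cross_service_integration": DEEP_MODEL,
    "rapid_prototyping": FAST_MODEL,
    "mechanical_edit": FAST_MODEL,
}


def pick_model(task_type: str) -> str:
    """Return the model for a task type, defaulting to the fast model."""
    return ROUTES.get(task_type, FAST_MODEL)


print(pick_model("multi_file_refactor"))
print(pick_model("mechanical_edit"))
print(pick_model("something_else"))
```

Defaulting unknown tasks to the fast model is a design choice: a wrong guess costs a few seconds, and you can always escalate to the larger model by hand.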
The Practical Verdict
| Model | Tokens/Second (Dual Node) | Coding Quality | Overall Value | Best Use Case |
|---|---|---|---|---|
| Qwen3.5-27B Q4 | 20-25+ tok/s | Good (80% of max) | ★★★★★ | Daily driver for 90% of tasks |
| MiniMax-M2 Q4 | 17-20 tok/s | Very Good (83% LiveCodeBench) | ★★★★ | Quality/speed sweet spot |
| Qwen3.5-397B-A17B Q4_K_XL | 10-12 tok/s | Excellent (~85%+ est.) | ★★★ | Complex refactors, debugging, architecture |
Our Recommendation: Run both. Keep Qwen3.5-27B at 20-25+ tok/s as your “pair programmer” for daily work—fast enough that the model disappears. When you hit a wall (complex refactor, mysterious bug, architecture decision), switch to Qwen3.5-397B-A17B at 10-12 tok/s as your “senior engineer consult.”
The speed gap is real. But so is the complexity gap. The trick isn’t choosing one—it’s knowing when to pay the token tax for deeper reasoning.
Hardware Guide: What You Need for Each Tier
| Target Speed | Model | Minimum Hardware |
|---|---|---|
| 20-25+ tok/s | Qwen3.5-27B Q4_K_M | Dual-node AMD Ryzen AI MAX 395+ (or single high-end GPU) |
| 17-20 tok/s | MiniMax-M2 Q4_K_M | Dual-node AMD Ryzen AI MAX 395+ cluster |
| 10-12 tok/s | Qwen3.5-397B-A17B Q4_K_XL | Dual-node AMD Ryzen AI MAX 395+ with distributed inference |
Methodology Notes
- Hardware: Two-node AMD Ryzen AI MAX 395+ cluster
- Inference Engine: llama.cpp with GGUF quantized models
- Measurements: Tokens/second averaged over multiple generations (code, explanations, mixed content)
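
One detail worth knowing when you reproduce numbers like these: "averaged" can mean two different things, and they disagree when runs have different lengths. The sketch below shows both. The run data is made up for illustration, and the function names are ours; this is not a description of the exact script behind the figures in this post.

```python
from statistics import mean

# Each run is (tokens generated, seconds elapsed). Hypothetical sample data.
Run = tuple[int, float]


def mean_of_rates(runs: list[Run]) -> float:
    """Average the per-run tok/s values; every run counts equally."""
    return mean(tokens / seconds for tokens, seconds in runs)


def aggregate_rate(runs: list[Run]) -> float:
    """Total tokens over total time; longer runs carry more weight."""
    total_tokens = sum(tokens for tokens, _ in runs)
    total_seconds = sum(seconds for _, seconds in runs)
    return total_tokens / total_seconds


runs = [(200, 10.0), (600, 40.0)]
print(f"mean of per-run rates: {mean_of_rates(runs):.1f} tok/s")
print(f"aggregate rate:        {aggregate_rate(runs):.1f} tok/s")
```

Whichever you choose, report it alongside the number, since short runs tend to inflate the mean of rates.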
Coming Next
We’re testing additional workloads across all three speed tiers and measuring not just accuracy but time-to-solution—because sometimes asking once beats iterating five times, even if each iteration is faster.
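
The "ask once vs. iterate five times" tradeoff is easy to sketch as arithmetic. The model below is deliberately crude: each round is generation time plus the time you spend reading and testing the answer. The 500-token reply and 160-second review time are assumptions chosen for illustration, not measurements.

```python
def time_to_solution(
    iterations: int,
    tokens_per_reply: int,
    tokens_per_second: float,
    review_seconds: float,
) -> float:
    """Total seconds: each round is generation time plus human review time."""
    per_round = tokens_per_reply / tokens_per_second + review_seconds
    return iterations * per_round


# Hypothetical inputs: 500-token replies, 160 s to read and test each one.
fast = time_to_solution(iterations=5, tokens_per_reply=500,
                        tokens_per_second=25, review_seconds=160)
slow = time_to_solution(iterations=1, tokens_per_reply=500,
                        tokens_per_second=10, review_seconds=160)

print(f"fast model, 5 rounds: {fast / 60:.1f} min")
print(f"slow model, 1 round:  {slow / 60:.1f} min")
```

In this model, human review time dominates, so the number of rounds matters far more than tokens per second. The comparison flips as soon as the fast model gets it right in one or two tries.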
Until then: test your workflows at different speeds. You might find that slower isn’t always worse—it’s just different, and sometimes worth the wait.
