AMD Ryzen AI Max+ 395 Two-Node Cluster Guide
A procedural instruction set for setting up a two-node cluster to run MiniMax M2.7 Q4
5/4/2026 · 3 min read
One of the beliefs we hold at AIfinitee is that LLMs will only get bigger. Bigger LLMs mean more memory, whether you run them in the cloud or locally. We skew towards local compute because we believe token costs will rise commensurately with hardware prices.
Assuming the above plays out in the coming quarters, those who choose to deploy their own hardware will likely need ways to scale, which means knowledge of how to distribute inference across clusters will be increasingly in demand.
In particular, this post provides an instruction set for distributing inference across a two-node cluster of AMD Ryzen AI Max+ 395 computers on Linux. The goal is to teach a repeatable process that scales beyond two machines of the same AMD chipset, even up to four.
Want to run MiniMax M2.7 Q4? No problem! This guide will set you up with ~17-20 tokens per second using flash attention and Q4 KV-cache quantization (i.e. faster throughput plus memory savings).
Want to run GLM 5.1 or even Kimi K2.6? Add more nodes to the cluster. We won't get in your way, and with that, let's start!
Step 1.
First of all, use the BIOS to set your dedicated iGPU memory size as low as possible. On the Minisforum MS-S1 Max, the minimum is 1GB. Setting it this low leaves the rest of the unified memory available as dynamically allocated GTT memory, which works out to roughly 120GB for inference after Linux (Ubuntu in this test case) takes its share.
Apply this setting on both PCs, leaving roughly 240GB of theoretically usable memory across the cluster. In this case, we will use the unsloth MiniMax-M2.7 Q4_K_XL model as the example.
Step 2.
Once that is done, run the command below to confirm memory availability; you should see at least 120000M of GTT memory reported.
$ sudo dmesg | grep "amdgpu.*memory"
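For reference, the relevant lines should look roughly like the following (illustrative output; exact figures depend on your BIOS setting and kernel version):
[drm] amdgpu: 1024M of VRAM memory ready
[drm] amdgpu: 126976M of GTT memory ready.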
Step 3.
Head to https://github.com/lemonade-sdk/llamacpp-rocm/releases/latest/ to obtain the latest Lemonade SDK llama.cpp build, which is the easiest setup path. For the AMD Ryzen AI Max+ 395 series, the iGPU architecture is denoted gfx1151, and since this guide uses Ubuntu Linux, download the Ubuntu build. (Note that if you hit technical difficulties, you may need to consult an AI model or other LLM resources.)
Specific file to download where xxxx is the latest version: llama-bxxxx-ubuntu-rocm-gfx1151-x64.zip
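If you prefer to fetch it from the terminal, a hypothetical wget invocation looks like this; replace xxxx with the current build number from the releases page, since GitHub's latest/download path requires the exact asset name:
wget https://github.com/lemonade-sdk/llamacpp-rocm/releases/latest/download/llama-bxxxx-ubuntu-rocm-gfx1151-x64.zip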
Once downloaded, run these commands (this unzips the archive and makes the binaries executable for the next steps):
unzip llama-bxxxx-ubuntu-rocm-gfx1151-x64.zip
chmod +x llama-cli llama-server rpc-server
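As a quick sanity check that the binaries run on your system, print the build version (the exact string will differ by release):
./llama-cli --version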
Step 4.
Run ./rpc-server -p 50053 -c --host 0.0.0.0 on both the main and auxiliary nodes. This starts an RPC worker on each machine so the main server can reach it over the network. Note that you can link the nodes via either Ethernet or USB4 networking (USB4 will be faster).
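Before moving on, it is worth confirming the main node can actually reach each worker's RPC port. One quick check, assuming netcat is installed (the worker IP is a placeholder):
nc -zv <RPC_WORKER_IP> 50053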
Step 5.
Run the command below to start a server with distributed inference: a 195,000-token context limit, flash attention on, and other standard settings that should let it launch. Once it is running, point your CLI (or any other client) at the server's local IP on the network to run agentic coding. In this use case, we achieved ~17-20 tokens per second (dropping as the context fills).
./llama-server \
-m /path/to/MiniMax-M2.7-Q4_K_XL-00001-of-00008.gguf \
-c 195000 \
-fa on \
-ngl 999 \
--no-mmap \
--host 0.0.0.0 \
--port 8081 \
--rpc <RPC_WORKER_1_IP>:50053,<RPC_WORKER_2_IP>:50053
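To enable the Q4 KV-cache quantization mentioned at the top, llama.cpp exposes cache-type flags; if your build supports them, append the following to the command above (a quantized V-cache requires flash attention, which is already on):
--cache-type-k q4_0 \
--cache-type-v q4_0
Once the server is up, a quick end-to-end check is to hit its OpenAI-compatible endpoint from any machine on the network, replacing <MAIN_NODE_IP> with the main node's address:
curl http://<MAIN_NODE_IP>:8081/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'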
The above is an instruction set you can copy and paste into your own agentic AI to procedurally set up multiple clusters; save it as a markdown or instruction .txt file for future expansion (e.g. adding an RPC_WORKER_3_IP when AMD Gorgon Point with that nifty 192GB of unified memory comes out!), as sketched below.
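Scaling to a third worker only changes the --rpc list in the launch command (placeholder IPs as before):
--rpc <RPC_WORKER_1_IP>:50053,<RPC_WORKER_2_IP>:50053,<RPC_WORKER_3_IP>:50053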
Thanks for tuning in, and we will showcase exo in our next post.