AMD Ryzen AI MAX Two-Node Cluster Guide | LLM Setup (17-20 tok/s)

One of the beliefs we hold at AIfinitee is that LLMs will only get bigger. Bigger LLMs need more memory, whether that memory sits in the cloud or on local hardware. We lean towards local compute because we expect token costs to rise in step with hardware prices.

If that plays out over the coming quarters, those who choose to deploy their own hardware will likely need ways to scale it. Knowing how to distribute a model across several machines will be increasingly in demand.

This post is a set of instructions for distributing inference across a two-node cluster of AMD Ryzen AI Max+ 395 computers running Linux. The process is meant to be repeatable, and it should scale beyond two machines of the same AMD chipset to as many as four.

Want to run MiniMax M2.7 Q4? No problem! This guide targets roughly 17-20 tokens per second with flash attention and Q4 KV-cache quantization (i.e., faster throughput plus memory savings).

Want to run GLM 5.1 or even Kimi K2.6? Add more nodes to the cluster. We won’t get in your way, and with that, let’s start!

Step 1

First, use the BIOS to set the dedicated iGPU memory size as low as possible. On the Minisforum MS-S1 Max, the minimum is 1GB. The rest of the unified memory then stays with the system, where the GPU can draw on it as GTT memory (checked in Step 2). Expect around 120GB per machine, shared between Linux (Ubuntu in this test case) and LLM inference.

Apply the same setting on both PCs, which gives a theoretical 240GB of usable memory across the cluster. We will use the unsloth MiniMax-M2.7 Q4_K_XL model as the example.
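
The memory budget above can be checked against the actual download before going further. The sketch below is an illustration rather than part of the original setup: the helper names total_bytes and fits_in_budget are hypothetical, the 120GB-per-node figure comes from Step 1, and only the model files themselves are counted. The KV cache and compute buffers need room on top of that.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of llama.cpp): compare the size of the
# GGUF shards on disk against the cluster's theoretical memory budget.

# Print the combined size in bytes of every file passed as an argument.
total_bytes() {
  local total=0 size f
  for f in "$@"; do
    size=$(wc -c < "$f")          # byte count of one shard
    total=$((total + size))
  done
  echo "$total"
}

# Succeed when the needed bytes fit within the budget (both in bytes).
fits_in_budget() {
  [ "$1" -le "$2" ]
}

# Usage sketch: two nodes at roughly 120GB each, as in Step 1.
# need=$(total_bytes /path/to/MiniMax-M2.7-Q4_K_XL-*.gguf)
# budget=$((2 * 120 * 1000 * 1000 * 1000))
# fits_in_budget "$need" "$budget" && echo "weights fit" || echo "too large"
```

If the weights alone come close to the budget, a smaller context size in Step 5 is the first thing to try.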

Step 2

Once that is done, run the command below on each machine to confirm memory availability. You should see at least 120000M of GTT memory reported:

sudo dmesg | grep "amdgpu.*memory"
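
To turn that check into a pass/fail result, the GTT figure can be pulled out of the log. This is a rough sketch: it assumes the amdgpu driver prints a line of the form "... 122880M of GTT memory ready." (the number here is illustrative), and gtt_mib is a hypothetical helper name.

```shell
# Read kernel log lines on stdin and print the GTT size in MiB
# from the last matching amdgpu line (prints nothing if none match).
gtt_mib() {
  sed -n 's/.*[^0-9]\([0-9][0-9]*\)M of GTT memory.*/\1/p' | tail -n 1
}

# Usage sketch:
# mib=$(sudo dmesg | gtt_mib)
# [ "${mib:-0}" -ge 120000 ] && echo "GTT OK: ${mib}M" || echo "GTT too small"
```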

Step 3

Head to https://github.com/lemonade-sdk/llamacpp-rocm/releases/latest/ to download the latest llama.cpp ROCm build published by the Lemonade SDK project, which is the easiest way to get set up. The iGPU in the AMD Ryzen AI Max+ 395 is identified as gfx1151, and in this case we are using Ubuntu Linux, so download the Ubuntu gfx1151 build.

Note: If you run into technical difficulties, you may need to consult an AI search mode or another LLM for further assistance.

The specific file to download, where xxxx is the latest build number:

llama-bxxxx-ubuntu-rocm-gfx1151-x64.zip

Once downloaded, run these commands on both machines to unzip the archive and make the binaries executable for the next steps:

unzip llama-bxxxx-ubuntu-rocm-gfx1151-x64.zip
chmod +x llama-cli llama-server rpc-server

Step 4

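Run the following on both the main and auxiliary nodes. This starts an RPC server on each machine so that it can be reached over the network: -p sets the port, -c enables the local file cache, and --host 0.0.0.0 listens on every interface, so only do this on a network you trust. The nodes can be linked over either Ethernet or USB4, but USB4 will be faster: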

./rpc-server -p 50053 -c --host 0.0.0.0
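
Before moving on, it can help to confirm from the main node that each RPC server is actually reachable. A minimal sketch, assuming bash and the coreutils timeout command are available (both ship with Ubuntu); port_open is a hypothetical helper, and the IP address in the usage line is a placeholder.

```shell
# Succeed if a TCP connection to host:port can be opened within 2 seconds.
# Uses bash's built-in /dev/tcp pseudo-device, so no extra tools are needed.
port_open() {
  local host="$1" port="$2"
  timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Usage sketch (replace with the address of each node):
# port_open 192.0.2.11 50053 && echo "rpc-server reachable" || echo "not reachable"
```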

Step 5

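Run the command below on the main node, replacing the model path and the <RPC_WORKER_n_IP> placeholders with your own values. It starts a server with distributed inference, a 195000-token context limit, flash attention on, all layers offloaded (-ngl 999), and other settings that should allow it to launch. Once it is up, point your CLI or any other client at the main node's IP address on port 8081 to run agentic coding. In this use case, we achieved ~17-20 tokens per second (lower as the context expands). Note that the command as shown does not set KV-cache quantization; for the Q4 KV cache mentioned in the introduction, llama.cpp provides the -ctk q4_0 and -ctv q4_0 options.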

./llama-server \
  -m /path/to/MiniMax-M2.7-Q4_K_XL-00001-of-00008.gguf \
  -c 195000 \
  -fa on \
  -ngl 999 \
  --no-mmap \
  --host 0.0.0.0 \
  --port 8081 \
  --rpc <RPC_WORKER_1_IP>:50053,<RPC_WORKER_2_IP>:50053
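
Loading a model of this size across two machines takes a while, so a client may want to wait until the server reports ready. The sketch below polls llama-server's /health endpoint with curl; the function names, retry counts, and the <MAIN_NODE_IP> placeholder are assumptions for illustration.

```shell
# Succeed once llama-server answers its /health endpoint with HTTP 200.
server_ready() {
  curl -fsS --max-time 5 "$1/health" >/dev/null 2>&1
}

# Poll until ready: base URL, number of attempts, delay in seconds.
wait_for_server() {
  local base="$1" tries="${2:-60}" delay="${3:-5}" i=0
  while [ "$i" -lt "$tries" ]; do
    server_ready "$base" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage sketch (port 8081 matches the command above):
# wait_for_server "http://<MAIN_NODE_IP>:8081" && echo "server is up"
```

Once the check passes, any OpenAI-compatible client can be pointed at the same base URL.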

The steps above form an instruction set that you can copy and paste into your own agentic AI to set up multi-node clusters procedurally. You can save it as a markdown or plain-text instruction file for future expansion (e.g., add an RPC_WORKER_3_IP when AMD Gorgon Point with that nifty 192GB unified memory comes out!).
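
As more workers join, the --rpc value is just a comma-separated list of host:port pairs. Here is a small sketch of building it from a list of addresses; rpc_arg is a hypothetical helper, and the 192.0.2.x addresses are documentation placeholders rather than real nodes.

```shell
# Join worker addresses into the host:port,host:port list used by --rpc.
# First argument is the shared port; the rest are worker IPs.
rpc_arg() {
  local port="$1" out="" ip
  shift
  for ip in "$@"; do
    out="${out:+$out,}$ip:$port"
  done
  echo "$out"
}

# Usage sketch: three workers on the port from Step 4.
# ./llama-server ... --rpc "$(rpc_arg 50053 192.0.2.11 192.0.2.12 192.0.2.13)"
```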

Thanks for tuning in, and we will showcase exos in our next post.
