My first impressions on ROCm and Strix Halo

4/19/2026 at 3:41:17 AM

I'm somewhat confused as to why this is on the front page. It doesn't go into any real detail, and the advice it gives is... not good. You should definitely not be quantizing your own gguf's using an old method like that hf script. There are lots of ways to run LLMs via podman (some even officially recommended by the project!). The chip has been out for almost a year now, and its most notable (and relevant-to-AI) feature is not mentioned in this article (it's the only x86_64 chip below workstation/server grade that has quad-channel RAM-- and inference is generally RAM constrained). I'm also quite puzzled about this bit about running pytorch via uv.

Anyway. I wouldn't recommend following the steps posted in there. Poke around google, or ask your friendly neighborhood LLM for some advice on how to set up your Strix Halo laptop/desktop for the tasks described. A good resource to start with would probably be the unsloth page for whichever model you are trying to run. (There are a few quantization groups that are competing for top-place with gguf's, and unsloth is regularly at the top-- with incredible documentation on inference, training, etc.)

Anyway, sorry to be harsh. I understand that this is just a blog for jotting down stuff you're doing, which is a great thing to do. I'm mostly just commenting on the fact that this is on the front page of hn for some reason.

by spoaceman7777

4/19/2026 at 4:50:39 AM

Thanks for writing this comment, I think seeing someone’s “first impressions” and then seeing someone else’s response to those thoughts is more interesting and feels more connected socially than just reading a “correct” guide or similar especially when it’s something I’m curious about but wouldn’t necessarily be motivated enough to actually try out myself.

by pierrekin

4/19/2026 at 1:54:36 PM

Agreed. Been running a Strix Halo box since mid-2025. Lemonade builds of llama.cpp with Unsloth or Bartowski quants have proven to be excellent.

by rpdillon

4/19/2026 at 5:50:30 AM

Quad-channel RAM is common on consumer desktops. Strix Halo has *8* channels, and also very fast RAM (soldered RAM can be faster than dimms because the traces are shorter.)

by fwipsy

4/19/2026 at 6:16:12 AM

Quad channel memory is not common on consumer desktops, it's a strictly HEDT and above feature. The vast majority of consumer desktops have 2 channels or fewer.

by fluoridation

4/19/2026 at 9:52:42 AM

One should no longer use the word "channel" because the width of a channel differs between various kinds of memories, even among those that can be used with the same CPU (e.g. between DDR and LPDDR or between DDR4 and DDR5).

For instance, now the majority of desktops with DDR5 have 4 channels, not 2 channels, but the channels are narrower, so the width of the memory interface is the same as before.

To avoid ambiguities, one should always write the width of the memory interface.

Most desktop computers and laptop computers have 128-bit memory interfaces.

The cheapest desktop computers and laptop computers, e.g. those with Intel Alder Lake N/Twin Lake CPUs, and also many smartphones and Arm-based SBCs, have 64-bit memory interfaces.

Cheaper smartphones and Arm-based SBCs have 32-bit memory interfaces.

Strix Halo and many older workstations and many cheaper servers have 256-bit memory interfaces.

High-end servers and workstations have 768-bit or 512-bit memory interfaces.

It is expected that future high-end servers will have 1024-bit memory interfaces per socket.

GPUs with private memory have usually memory interfaces between 192-bit and 1024-bit, but newer consumer GPUs have usually narrower memory interfaces than older consumer GPUs, to reduce cost. The narrower memory interface is compensated by faster memories, so the available bandwidth in consumer GPUs has been increased much slower than the increase in GDDR memory speed would have allowed.

by adrian_b

4/19/2026 at 7:33:21 PM

>now the majority of desktops with DDR5 have 4 channels, not 2 channels

Source? I just looked up two random X870E boards from Gigabyte and both are dual channel.

>To avoid ambiguities, one should always write the width of the memory interface.

They're incomparable quantities. More channels support more parallel operations, while a wider bus at a constant frequency supports higher throughput.

The bus width is not even that useful of a metric. It's more useful to talk about bits per second, which is the product of bus width and frequency.

by fluoridation

4/19/2026 at 6:59:33 AM

4 DIMMS =/= 4 channels

by phonon

4/19/2026 at 2:51:13 PM

I knew that, but I still thought most desktops with 4 dimm slots supported quad-channel memory. I guess I was wrong.

by fwipsy

4/19/2026 at 6:41:02 AM

If you are using quants below Q8 then get them from Unsloth or Bartowski.

They are higher quality than the quants you can make yourself due to their imatrix datasets and selective quantisation of different parts of the model.

For Qwen 3.5 Unsloth did 9 terabytes of quants to benchmark the effects of this:

https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

by suprjami

4/19/2026 at 8:34:36 AM

That used to be a good suggestion, and it still most likely is if you're using a recent Nvidia dGPU, but absolutely not for iGPUs like the Halo/Point or Arc LPG. The problem is bf16.

In short, even lower quants leave some layers at original precision and llama.cpp in its endless wisdom does not do any conversion when loading weights and seeing what your card supports, so every time you run inference it gets so surprised and hits a brick wall when there's no bf16 acceleration. Then it has to convert to fp16 on the fly or something else which can literally drop tg by half or even more. I've seen fp16 models literally run faster than Q8 on Arc despite being twice the size with the same bandwidth and it's expectedly similar [0] on AMD.

Models used to be released as fp16 which was fine, then Gemma did native bf16 and Bartowski initially came up with a compatibility thing where they converted bf16 to fp32 then fp16 and used that for quants. Most models are released as bf16 these days though and Bartowski's given up on doing that (while Unsloth never did that to begin with). So if you do want max speed, you kinda have to do static quants yourself and follow the same multi-step process to remove all the stupid bf16 weights from the model. I don't get why this can't be done once at model load ffs, but this is what we've got.

[0] https://old.reddit.com/r/LocalLLaMA/comments/1r0b7p8/free_st...

by moffkalast

4/19/2026 at 3:34:04 PM

At least for qwen3.5, it looks like unsloth has updated their quantization algorithms to avoid bf16. See the march 5th update:

https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/discussi...

I assume they're applying the same technique going forward, but I have no idea how to determine if this is the case.

by stebalien

4/19/2026 at 10:08:58 AM

The CPU of Strix Halo has good BF16 acceleration, like any other Zen 4/Zen 5 CPU (the future Zen 6 will add FP16 acceleration).

I do not know about its GPU, which might have only FP16.

So it is likely that the right inference strategy would be to run any BF16 computations on the Strix Halo CPU, while running the quantized computations on its GPU.

by adrian_b

4/19/2026 at 11:13:17 AM

The GPU has INT4, INT8, BF16 and FP16. Notably no FP8 or FP4.The official GPTQ-Int4 release from Qwen is a great quant for this but custom kernels are still rare for this hardware.

by tssge

4/19/2026 at 5:20:37 PM

Must be a case of the hardware being there and the software not actually supporting it then.

by moffkalast

4/19/2026 at 5:16:12 AM

Check out the officially supported project Lemonade[0] by AMD. It has gfx1151 specific builds of vLLM, llama.cpp, comfy-ui, and even a PR to merge a Strix Halo port of Apple’s MLX[1] with a quick and easy install.

[0] https://www.amd.com/en/developer/resources/technical-article...

[1] https://github.com/lemonade-sdk/lemonade/issues/1642

by seemaze

4/19/2026 at 2:23:49 PM

I don’t think lemonade includes a comfyui wrapper, it does have stable diffusion support built in though.

by data-ottawa

4/20/2026 at 4:15:39 AM

I think you are correct. I’ve mostly been working with plain llama.cpp, but recently started looking into lemonade for the baked-in NPU support.

by seemaze

4/19/2026 at 6:36:09 AM

> It seems that things wouldn't work without a BIOS update: PyTorch was unable to find the GPU. This was easily done on the BIOS settings: it was able to connect to my Wifi network and download it automatically.

Call me traditional but I find it a bit scary for my BIOS to be connecting to WiFi and doing the downloading. Makes me wonder if the new BIOS blob would be secure i.e. did the BIOS connect over securely over https ? Did it check for the appropriate hash/signature etc. ? I would suppose all this is more difficult to do in the BIOS. I would expect better security if this was done in user space in the OS.

I'm much prefer if the OS did the actual downloading followed by the BIOS just doing the installation of the update.

by sidkshatriya

4/19/2026 at 7:12:47 AM

I have never seen a BIOS that didn't allow offline updates? However SSL is much less processing then a WPA2 WiFi stack, I would certainly expect this to be fully secure and boycot a manufacturer who failed. Conversely updating your BIOS without worrying if your OS is rooted is nice.

by ZiiS

4/19/2026 at 10:17:17 AM

Updating your BIOS without worrying if your OS is rooted can be easily and more securely done from an USB memory.

The BIOSes recent enough to be able to connect through the Internet normally have the option to use a USB memory from inside the BIOS setup.

Some motherboards can update the BIOS from a USB memory even without a CPU in the socket.

by adrian_b

4/19/2026 at 11:15:22 AM

You don't HAVE to update the bios over wifi, fwupd is perfectly able to do it as well.

by bityard

4/19/2026 at 6:49:18 AM

Isn't this pretty much standard in this day and age? HP for example also has this option in BIOS for their laptops (but you still can either download the BIOS blob manually in Linux or use the automatic updater in Windows if you want).

by imp0cat

4/19/2026 at 7:09:25 AM

> Isn't this pretty much standard in this day and age?

If something is "standard" nowadays does it mean it is the right way to go ?

One of my main issues is that this means your BIOS has to have a WiFi software stack in it, have a TLS stack in it etc. Basically millions of lines of extra code. Most of it in a blob never to be seen by more than a few engineers.

Though in another a way allowing BIOS to perform self updates is good because it doesn't matter if you've installed FreeBSD, OpenBSD, Linux, Windows, <any other os> you will be able to update your BIOS.

by sidkshatriya

4/19/2026 at 6:22:50 PM

> If something is "standard" nowadays does it mean it is the right way to go ?

Next thing you'll be telling me that you have a problem piping internet hosted install scripts directly into shell!

by ethbr1

4/19/2026 at 8:34:46 AM

I fully expect any BIOS to have millions of unnecessary lines of code already though. May as well have a bit more for user convenience.

by trvz

4/19/2026 at 3:02:40 AM

I would be interested to know what speeds you can get from gemma4 26b + 31b from this machine. also how rocm compares to triton.

by anko

4/19/2026 at 8:28:58 AM

## performance data for token generation using lmstudio

- gemma4-31b normal q8 -> 5.1 tok/s

- gemma4-31b normal q16 -> 3.7 t/s

- gemma4-31b distil q16 -> 3.6 t/s

- gemma4-31b distil q8 -> 5.7 tok/s (!)

- gemma4-26b-a4b ud q8kxl -> 38 t/s (!)

- gemma4-26b-a4b ud q16 -> 12 t/s

- gemma4-26b-a4b cl q8 -> 42 t/s (!)

- gemma4-26b-a4b cl q16 -> 12 t/s

- qwen3.5-35b-a3b-UD@q6_k -> 52 t/s (!)

- qwen3.5-35b-a3b-uncensored-hauhaucs-aggressive@q8_0 -> 34 tok/s (!)

- qwen3.5-35b-a3b-uncensored-hauhaucs-aggressive@bf16 -> 11 tok/s

- qwen3.5-27b-claude-4.6-opus-reasoning-distilled-v2 q8 -> 8 tok/s

- qwen3.5 122B A10B MXFP4 Mo qwen3.5-122b-a10b (q4) -> 11 tok/s

- qwen3.5-122b-a10b-uncensored-hauhaucs-aggressive (q6) -> 10 tok/s

by rdslw

4/19/2026 at 4:28:32 AM

Currently running Gemma 4 26B A4B 8-bit quantization, reasoning off, and the most recent job performed thus (which seems about average, though these are short running tasks, <2 seconds for each prompt):

prompt eval time = 315.66 ms / 221 tokens ( 1.43 ms per token, 700.13 tokens per second)

eval time = 1431.96 ms / 58 tokens ( 24.69 ms per token, 40.50 tokens per second)

total time = 1747.62 ms / 279 tokens

With reasoning enabled, it's about a quarter or fifth of that performance, quite a lot slower, but still reasonably comfortable to use interactively. The dense model is even slower. For some reason, Gemma 4 is pretty slow on the Strix Halo with reasoning enabled, compared to similar other models. It reasons really hard, I guess. I don't understand what makes models slower or faster given similar sizes, it surprised me.

Qwen 3.5 and 3.6 in the similar sized MoE versions at 8-bit quantization are notably faster on this hardware. If I were using Gemma 4 31B with reasoning interactively, I'd use a smaller 6-bit or even 5-bit quantization, to speed it up to something sort of comfortable to use. Because it is dog slow at 8-bit quantization, but shockingly smart and effective for such a tiny model.

Edit: Here's some benchmarks which feel right, based on my own experiences. https://kyuz0.github.io/amd-strix-halo-toolboxes/

by SwellJoe

4/19/2026 at 11:31:22 AM

If you just want to run models, most of TFA is taking the scenic route.

All you really need is podman, toolbx, and the Strix Halo toolbox images from https://github.com/kyuz0/amd-strix-halo-toolboxes. Then you just download your ggufs and hand them to llama-server.

Yes, there are other solutions that are a bit more hand-holdy, but if you already know how to use docker/podman and just want to get something working in an evening, this works too.

by bityard

4/19/2026 at 2:44:19 AM

owning GGUF conversion step is good in sone circumstances, but running in fp16 is below optimal for this hardware due to low-ish bandwidth.

It looks like context is set to 32k which is the bare minimum needed for OpenCode with its ~10k initial system prompt. So overall, something like Unsloth's UD q8 XL or q6 XL quants free up a lot of memory and bandwidth moving into the next tier of usefulness.

by everlier

4/20/2026 at 12:41:26 AM

The unified memory architecture is what makes Strix Halo interesting for inference workloads. No PCIe bottleneck moving weights between CPU and GPU memory. For anyone getting started, the Unsloth UD quants are the way to go their imatrix calibration makes a real difference in output quality at Q6/Q8 compared to naive quantization. Curious about the ROCm vs Vulkan situation though. Has anyone benchmarked the prompt processing speed difference? For agentic workflows where you're constantly feeding new context, first-token latency matters more than raw tok/s.

by thr3at-surfac3

4/19/2026 at 2:32:13 PM

Linux kernel 7 enables the NPU on Linux. You can use fastflowLM with lemonade now.

It is quite slow, but if you want to compute embeddings in the background it’s fine.

I didn’t find it more energy efficient than just using the GPU for time insensitive tasks though.

by data-ottawa

4/19/2026 at 2:53:15 AM

Nice. Thanks for the writeup. My Strix Halo machine is arriving next week. This is handy and helpful.

by IamTC

4/19/2026 at 3:01:44 AM

I thought the point of something like Strix Halo was to avoid ROCm all together? AMDs strategy seems to have been to unify GPU/CPU memory then let people write their own libraries.

The industry looks like it's started to move towards Vulkan. If AMD cards have figured out how to reliably run compute shaders without locking up (never a given in my experience, but that was some time ago) then there shouldn't be a reason to use speciality APIs or software written by AMD outside of drivers.

ROCm was always a bit problematic, but the issue was if AMD card's weren't good enough for AMD engineers to reliably support tensor multiplication then there was no way anyone else was going to be able to do it. It isn't like anyone is confused about multiplying matricies together, it isn't for everyone but the naive algorithm is a core undergrad topic and the advanced algorithms surely aren't that crazy to implement. It was never a library problem.

by roenxi

4/19/2026 at 4:20:44 AM

You misunderstand the point, and ROCm. The GPU and CPU share memory, that doesn't mean you don't need to interact with the GPU, anymore.

You can use Vulkan instead of ROCm on Radeon GPUs, including on the Strix Halo (and for a while, Vulkan was more likely to work on the Strix Halo, as ROCm support was slow to arrive and stabilize), but you need something that talks to the GPU.

Current ROCm, 7.2.1, works quite well on the Strix Halo. Vulkan does, too. ROCm tends to be a little faster, though. Not always, but mostly. People used to benchmark to figure out which was the best for a given model/workload, but now, I think most folks just assume ROCm is the better choice and use it exclusively. That's what I do, though I did find Gemma 4 wouldn't work on ROCm for a little bit after release (I think that was a llamma.cpp issue, though).

by SwellJoe

4/19/2026 at 1:28:53 PM

This hasn't been my experience, ROCm is usually not only a bit slower for me (~32 t/s vs ~43 t/s on the main model I use), it is way less reliable; any upgrade in kernel version or AMD driver and suddenly everything is broken

by anaisbetts

4/19/2026 at 5:28:52 PM

It can be tricky to get/keep ROCm working, but around 7.2 it became reliable and as fast as or faster than ROCm 6.4.

And, I think the first response time of ROCm is pretty consistently faster than Vulkan, even if Vulkan has a slightly higher token rate. Though I don't see that big of a different on token rates, either. Honestly, though, I haven't done enough real testing to know for sure. The benchmarks Donato Capitella posts (https://kyuz0.github.io/amd-strix-halo-toolboxes/) have been my guide on what to run in what way, and the performance of most things that can run on the Strix Halo are Fast Enough(tm) such that I don't agonize about performance. When Vulkan was all that worked with llama.cpp, that's what I used. Now that ROCm is reliable, I'm using ROCm. ROCm feels faster, maybe just because it processes prompts faster and starts typing the answer fast (at a rate faster than I can read it, so when it starts answering is the more important metric even if faster token rate would lead to it finishing faster).

In short: If ever I'm doing something that will take many hours to complete, and I need to optimize it, I'll do some tests first to be sure I'm using the optimal path. Otherwise, as long as ROCm is working, I'll probably just keep using it.

by SwellJoe

4/19/2026 at 4:34:38 AM

> The GPU and CPU share memory, that doesn't mean you don't need to interact with the GPU, anymore.

But we already have software that talks to the GPU; mesa3d and the ecosystem around that. It has existed for decades. My understanding was that the main reasons not to use it was that memory management was too complicated and CUDA solved that problem.

If memory gets unified, what is the value proposition of ROCm supposed to be over mesa3d? Why does AMD need to invent some new way to communicate with GPUs? Why would it be faster?

by roenxi

4/19/2026 at 4:48:53 AM

> CUDA solved that problem.

CUDA is a proprietary Nvidia product. CUDA solved the problem for Nvidia chips.

On AMD GPUs, you use ROCm. On Intel, you use OpenVINO. On Apple silicon you use MLX. All work fine with all the common AI tasks you'd want to do on self-hosted hardware. CUDA was there first and so it has a more mature ecosystem, but, so far, I've found 0 models or tasks I haven't been able to use with ROCm. llama.cpp works fine. ComfyUI works fine. Transformers library works fine. LM Studio works fine.

Unless you believe Nvidia having a monopoly on inference or training AI models is good for the world, you can't oppose all the other GPU makers having a way for their chips to be used for those purposes. CUDA is a proprietary vendor-specific solution.

Edit: But, also, Vulkan works fine on the Strix Halo. It is reliable and usually not that much slower than ROCm (and occasionally faster, somehow). Here's some benchmarks: https://kyuz0.github.io/amd-strix-halo-toolboxes/

by SwellJoe

4/19/2026 at 5:02:46 AM

Why? What is the point of focusing on something that seems to be a memory management solution when the memory management problem theoretically just went away?

That has been one of the big themes in GPU hardware since around 2010 era when AMD committed to ATI. Nvidia tried to solve the memory management problem in the software layer, AMD committed to doing it in hardware. Software was a better bet by around a trillion dollars so far, but if the hardware solutions have finally come to fruit then why the focus on ROCm?

by roenxi

4/19/2026 at 6:21:42 AM

I dunno. GPU programming and performance is above my pay grade. I assume the reason every GPU maker is investing in software is because they understand the problems to be solved and feel it's worth the investment to solve them. I like AMD because their Linux drivers are open source. I like Intel because all their stuff is Open Source. I like Nvidia notably less because none of their stuff is Open Source, not even the Linux drivers.

by SwellJoe

4/19/2026 at 5:00:01 AM

The problem with ROCm, unlike CUDA, is that it doesn’t run on much of AMDs own hardware, most notably their iGPU.

by sabedevops

4/19/2026 at 6:19:13 AM

Yeah, that kinda sucks, but, all their new generation onboard GPUs are supported by ROCm. e.g. Ryzen AI 395 and 400 series which will be found in mid-to-high end laptops and desktops and motherboards. They seem to have realized that the reason Nvidia is kicking their ass is that people can develop with CUDA on all sorts of hardware, including their personal laptop or desktop.

by SwellJoe

4/19/2026 at 5:31:23 AM

> If memory gets unified, what is the value proposition of ROCm supposed to be over mesa3d? Why does AMD need to invent some new way to communicate with GPUs? Why would it be faster?

And the memory barriers? How do you sync up the L1/L2 cache of a CPU core with the GPU's cache?

Exactly. With a ROCm memory barrier, ensuring parallelism between CPU + GPU, while also providing a mechanism for synchronization.

GPU and CPU can share memory, but they do not share caches. You need programming effort to make ANY of this work.

by dragontamer

4/19/2026 at 1:53:28 AM

Thanks for sharing. However, this missed being a good writeup due to lack of numbers and data.

I'll give a specific example in my feedback, You said:

``` so far, so good, I was able to play with PyTorch and run Qwen3.6 on llama.cpp with a large context window ```

But there are no numbers, results or output paste. Performance, or timings.

Anyone with ram can run these models, it will just be impracticably slow. The halo strix is for a descent performance, so you sharing numbers will be valuable here.

Do you mind sharing these? Thanks!

by timmy777

4/19/2026 at 2:33:12 AM

This is more of a “succeeding to get anywhere close to messing around” rather than “it works so now I can run some benchmarks” type of article.

by gessha

4/19/2026 at 2:33:27 AM

To give benefit of doubt, author does state multiple times (including in the title) that these were "first impressions", so perhaps they should have mentioned something like "...In the next post, we'll explore performance and numbers" to avoid a cliffhanger situation, or do a part 1 (assuming the intention was to follow-up with a part 2).

by l33tfr4gg3r

4/19/2026 at 2:45:17 AM

Perfect. No fluff, just the minimum needed to get things working.

by JSR_FDED

4/19/2026 at 5:34:30 AM

No benchmarks?

by aappleby

4/19/2026 at 5:51:04 AM

First impressions

by politelemon