There has been at least one report of this in Fedora, as negative karma on the latest build: https://bodhi.fedoraproject.org/updates/FEDORA-2025-698dc1bbfa There have been various issues filed upstream about some newer AMD platforms (Strix Point and Strix Halo APUs): https://gitlab.freedesktop.org/drm/amd/-/issues/4751 https://gitlab.freedesktop.org/drm/amd/-/issues/4738 https://gitlab.freedesktop.org/drm/amd/-/issues/4737 The commits that address those issues are: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=3d5c8135206cef364e7d353711b3e7358a90d152 https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=c092c7487eb7c3d58697f490ff605bc38f4cc947 https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=baf6c2f67a247eba7f298ed74bc471de43ad632d Please make a new build with at least those commits in it to address those upstream issues.
Groan, so good to see the AMD CI is non-existent :-/
Unfortunately, linux-firmware is a total blind spot. There is ROCm CI for all things ROCm, and there is IGT for all things GFX. The interactions between them are where the problems are :/ I think the right answer is going to be adding runners to the ROCm CI that linux-firmware CI can contact.
I think I'm hitting this bug on Fedora 43. With amd-ucode-firmware-20251125 and amd-gpu-firmware-20251125, GNOME was crashing multiple times. I tried KDE and Xfce with the same result. When I downgraded to amd-ucode-firmware-20251021 and amd-gpu-firmware-20251021 (the firmware that was released with Fedora 43) and rebuilt the initramfs, the system was stable. This is the second time AMD firmware has bitten me.
I don't think amd-ucode-firmware has anything to do with the crashing. I suspect that if you upgrade amd-ucode-firmware and leave the GPU firmware downgraded, you'll be fine. But this bug is also about issues with ROCm, so I'm not sure your issue is directly related.
Two of the linked issues are ROCm-specific, and they are why I started digging into the negative karma, but https://gitlab.freedesktop.org/drm/amd/-/issues/4737 is more general, AFAIK. It details crashes and system freezes during normal graphical usage.
> I don't think amd-ucode-firmware has anything to do with the crashing. I suspect that if you upgrade amd-ucode-firmware and leave the GPU firmware downgraded, you'll be fine.
I agree.
> But this bug is also about issues with ROCm, so I'm not sure your issue is directly related.
> Two of the linked issues are ROCm-specific, and they are why I started digging into the negative karma, but https://gitlab.freedesktop.org/drm/amd/-/issues/4737 is more general, AFAIK. It details crashes and system freezes during normal graphical usage.
There's definitely a real issue. ROCm probably just tickled it more easily.
Hello everyone, Merry Christmas! Is there any way we could push out the reverted version from Mario? I am currently stuck with old packages thanks to an atomic system (silverblue). I am new to the processes here - if there is anything I can do to help out / expedite, just leave me some info.
> Is there any way we could push out the reverted version from Mario? I am currently stuck with old packages thanks to an atomic system (silverblue).
You should be able to do a dnf downgrade to drop back to working firmware. The FW will be updated when things are coordinated; it is holiday season, so things take a little longer at times.
Happy New Year! Just checking in to see if there's an update on the expected timeline. I am part of a community of Strix Halo users and have written many tutorials based on Fedora, and right now every user who's installing an updated version of Fedora has a broken ROCm implementation. I have been advising users to downgrade the Linux firmware, but as you can imagine this creates a lot of confusion.
Aiming for Friday
@Tim Flink (and AMD team) -- A new release came thru today but failed
```
sid@vega:~$ ./run-rocm-smoketest.sh
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
Memory access fault by GPU node-1 (Agent handle: 0x2ccbbe20) on address 0x7f45a612a000. Reason: Page not present or supervisor privilege.
sid@vega:~$ rpm-ostree status
State: idle
Deployments:
● fedora:fedora/43/x86_64/silverblue
    Version: 43.20260110.0 (2026-01-10T00:28:27Z)
    Commit: 3a0477ec79f1edb269ba6ed2844d86777b2d5a0d70624be1efa4c5530c9161c6
    GPGSignature: Valid signature by C6E7F081CF80E13146676E88829B606631645531
  fedora:fedora/43/x86_64/silverblue
    Version: 43.1.6 (2025-10-23T03:11:18Z)
    Commit: 4d40d281be93a88f3d559b5756df602f454f932f3c809a6a4250b91049ce40e8
    GPGSignature: Valid signature by C6E7F081CF80E13146676E88829B606631645531
  fedora:fedora/43/x86_64/silverblue
    Version: 43.1.6 (2025-10-23T03:11:18Z)
    Commit: 4d40d281be93a88f3d559b5756df602f454f932f3c809a6a4250b91049ce40e8
    GPGSignature: Valid signature by C6E7F081CF80E13146676E88829B606631645531
    Pinned: yes
sid@vega:~$
```
Rollback makes it working again
```
sid@vega:~$ rpm-ostree status
State: idle
Deployments:
● fedora:fedora/43/x86_64/silverblue
    Version: 43.1.6 (2025-10-23T03:11:18Z)
    Commit: 4d40d281be93a88f3d559b5756df602f454f932f3c809a6a4250b91049ce40e8
    GPGSignature: Valid signature by C6E7F081CF80E13146676E88829B606631645531
  fedora:fedora/43/x86_64/silverblue
    Version: 43.20260110.0 (2026-01-10T00:28:27Z)
    Commit: 3a0477ec79f1edb269ba6ed2844d86777b2d5a0d70624be1efa4c5530c9161c6
    GPGSignature: Valid signature by C6E7F081CF80E13146676E88829B606631645531
  fedora:fedora/43/x86_64/silverblue
    Version: 43.1.6 (2025-10-23T03:11:18Z)
    Commit: 4d40d281be93a88f3d559b5756df602f454f932f3c809a6a4250b91049ce40e8
    GPGSignature: Valid signature by C6E7F081CF80E13146676E88829B606631645531
    Pinned: yes
sid@vega:~$ ./run-rocm-smoketest.sh
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| gemma3 4B Q4_K - Medium | 2.31 GiB | 3.88 B | ROCm | 99 | 1 | 0 | pp512 | 2541.22 ± 29.97 |
| gemma3 4B Q4_K - Medium | 2.31 GiB | 3.88 B | ROCm | 99 | 1 | 0 | tg128 | 68.67 ± 0.10 |
build: 9e41884dc (7687)
sid@vega:~$
```
What the quick smoke test runs ...
```
sid@vega:~$ cat ./run-rocm-smoketest.sh
#!/usr/bin/env bash
set -uo pipefail
toolbox run -c llama-rocm-7.1.1 -- /usr/local/bin/llama-bench -fa 1 -ngl 99 -mmp 0 -m /mnt/data/models/hub/models--ggml-org--gemma-3-4b-it-GGUF/snapshots/d0976223747697cb51e056d85c532013931fe52e/gemma-3-4b-it-Q4_K_M.gguf
```
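For anyone who wants to turn the smoke test above into a pass/fail signal, here is a minimal sketch in Python. It only classifies captured `llama-bench` output using the failure strings that appear in this bug; the function name and the marker list are illustrative assumptions, not part of llama.cpp or ROCm.

```python
import re

# Failure signatures copied from the llama-bench runs quoted in this bug.
# The list is an assumption for illustration and is certainly not exhaustive.
FAILURE_MARKERS = (
    "Memory access fault by GPU",
    "Segmentation fault",
    "ROCm error",
)

# A result row ends like "| pp512 | 2541.22 ± 29.97 |".
RESULT_ROW = re.compile(r"\|\s*(pp\d+|tg\d+)\s*\|\s*[\d.]+\s*±\s*[\d.]+\s*\|")


def classify_bench_output(text: str) -> str:
    """Return 'fail', 'pass', or 'unknown' for captured llama-bench output."""
    if any(marker in text for marker in FAILURE_MARKERS):
        return "fail"
    if RESULT_ROW.search(text):
        return "pass"
    # Header printed but neither a result row nor a known failure string.
    return "unknown"
```

The output would be captured by whatever runs the wrapper script (for example with `subprocess.run(..., capture_output=True)`); treating "unknown" as a failure is probably the safer default for a gate.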
FEDORA-2026-1d240112ff (linux-firmware-20260110-1.fc42) has been submitted as an update to Fedora 42. https://bodhi.fedoraproject.org/updates/FEDORA-2026-1d240112ff
FEDORA-2026-2cebf295af (linux-firmware-20260110-1.fc43) has been submitted as an update to Fedora 43. https://bodhi.fedoraproject.org/updates/FEDORA-2026-2cebf295af
> @Tim Flink (and AMD team) -- A new release came thru today but failed
New release of what? A Silverblue snapshot? I need to know what details are in this "release":
* MES F/W version
* rocr-runtime version (7.1.1-XXX)
What's the XXX? It needs to be -2 or newer to pick up the GFX1151 patch, IIUC.
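The "-2 or newer" check can be sketched as below, assuming the package string has the usual `VERSION-RELEASE` shape (for example `7.1.1-2.fc43`). The helper name is hypothetical, and it only looks at the release number, so it is meaningful only when the version part is already 7.1.1.

```python
def release_at_least(version_release: str, minimum: int = 2) -> bool:
    """True if the release part of 'VERSION-RELEASE' is >= minimum.

    Only the leading digits of the release are compared, so "2.fc43" reads as 2.
    The version part before the dash is deliberately ignored.
    """
    _version, _, release = version_release.partition("-")
    digits = ""
    for ch in release:
        if not ch.isdigit():
            break
        digits += ch
    return bool(digits) and int(digits) >= minimum
```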
(In reply to Mario Limonciello from comment #14)
> > @Tim Flink (and AMD team) -- A new release came thru today but failed
>
> New release of what? A Silverblue snapshot? I need to know what details
I suspect they mean the new upstream linux-firmware.
It's in updates-testing so will be a stable update later this week
FEDORA-2026-2cebf295af has been pushed to the Fedora 43 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2026-2cebf295af` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2026-2cebf295af See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
FEDORA-2026-1d240112ff has been pushed to the Fedora 42 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2026-1d240112ff` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2026-1d240112ff See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
Tim: can you confirm this looks good to you?
@mario - I meant the new Silverblue upgrade/snapshot that was pushed out; it was 43.20260110.0 when I originally messaged. Since then 43.20260112.0 has been pushed out, which is also broken. I'll list the underlying components, but as a Silverblue user I'm treating it as "one release". From what I'm reading, this could be an issue in the kernel (6.18.4-200.fc43.x86_64 vs 6.17.1-300.fc43.x86_64). Details below.
Side note: we actually moved this lab machine from Workstation to Silverblue for "greater stability" (which is sort of true? we can quickly rollback/upgrade trivially). I'll do my best to gather more helpful info, but that AMD Strix Halo box is the only one we have right now, so it is intrusive to stop everything -> upgrade -> test -> rollback and resume actual workloads.
-------------------------------------------------------------
Working: Fedora 43 Silverblue snapshot: 43.1.6
sid@vega:~$ rpm-ostree status
State: idle
Deployments:
● fedora:fedora/43/x86_64/silverblue
    Version: 43.1.6 (2025-10-23T03:11:18Z)
    Commit: 4d40d281be93a88f3d559b5756df602f454f932f3c809a6a4250b91049ce40e8
    GPGSignature: Valid signature by C6E7F081CF80E13146676E88829B606631645531
sid@vega:~$ rpm -q linux-firmware
Kernel: Linux 6.17.1-300.fc43.x86_64
Firmware: linux-firmware-20251021-1.fc43.noarch
-------------------------------------------------------------
Broken: Fedora 43 Silverblue snapshot: 43.20260112.0
sid@vega:~$ rpm-ostree status
State: idle
Deployments:
● fedora:fedora/43/x86_64/silverblue
    Version: 43.20260112.0 (2026-01-12T00:27:07Z)
    Commit: 15edf9df6181db7fe6d70cd704d4dfda85edaf64f95317698386495b6f00e99a
    GPGSignature: Valid signature by C6E7F081CF80E13146676E88829B606631645531
Kernel: Linux 6.18.4-200.fc43.x86_64
Firmware: linux-firmware-20251125-1.fc43.noarch
Failure Rate: 100% (10/10; immediately)
sid@vega:~$ toolbox enter llama-rocm-7.1.1
⬢ [sid@toolbx ~]$ llama-bench -fa 1 -ngl 99 -mmp 0 -m /mnt/data/projects/ai/models/hub/models--ggml-org--gemma-3-4b-it-GGUF/snapshots/d0976223747697cb51e056d85c532013931fe52e/gemma-3-4b-it-Q4_K_M.gguf
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
Segmentation fault (core dumped) llama-bench -fa 1 -ngl 99 -mmp 0 -m /mnt/data/projects/ai/models/hub/models--ggml-org--gemma-3-4b-it-GGUF/snapshots/d0976223747697cb51e056d85c532013931fe52e/gemma-3-4b-it-Q4_K_M.gguf
⬢ [sid@toolbx ~]$ llama-bench -fa 1 -ngl 99 -mmp 0 -m /mnt/data/projects/ai/models/hub/models--ggml-org--gemma-3-4b-it-GGUF/snapshots/d0976223747697cb51e056d85c532013931fe52e/gemma-3-4b-it-Q4_K_M.gguf
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
Segmentation fault (core dumped) llama-bench -fa 1 -ngl 99 -mmp 0 -m /mnt/data/projects/ai/models/hub/models--ggml-org--gemma-3-4b-it-GGUF/snapshots/d0976223747697cb51e056d85c532013931fe52e/gemma-3-4b-it-Q4_K_M.gguf
⬢ [sid@toolbx ~]$
OK, that confirms you don't have the updated linux-firmware with the fix in the broken image. Once it migrates out of testing and you get a new snapshot you /should/ be good to go.
Thanks Mario. Are there any indicators of that 'stable build'? Like a version # (of the immutable Silverblue snapshot or linux-firmware) or an expected date? Could I also recommend a gatekeeper test around llama-bench in your CI? It's quick yet stressful, and relatively isolated for end-to-end tests, as llama-cpp is a single-folder app. kyuz0's `amd-strix-halo-toolboxes` make runtime switching trivial. Even a small model would work (e.g. gemma3 4b Q4_K_M is 2.4GB).
(In reply to Peter Robinson from comment #19)
> Tim: can you confirm this looks good to you?
The changelog has all the relevant changes listed, so everything should be good. I don't have access to any of the relevant HW myself, but I'm trying to find someone who can at least run some basic tests to confirm that the quickly-testable issues have disappeared.
(In reply to Sid from comment #22)
> Thanks Mario. Are there any indicators of that 'stable build'? Like a version # (of the immutable Silverblue snapshot or linux-firmware) or an expected date?
>
> Could I also recommend a gatekeeper test around llama-bench in your CI? It's quick yet stressful, and relatively isolated for end-to-end tests, as llama-cpp is a single-folder app. kyuz0's `amd-strix-halo-toolboxes` make runtime switching trivial. Even a small model would work (e.g. gemma3 4b Q4_K_M is 2.4GB).
I'm not terribly familiar with Silverblue, but I believe that it's built on the bits that have gone stable in the relevant Fedora release repos. I doubt that you'll see a change in Silverblue until this linux-firmware update has gone stable, but once it does get pushed stable, I imagine that the next Silverblue build/update after that will have the firmware changes in it.
As far as an indication, the "rpm-ostree status" command shows the linux-firmware build used. Once that says "linux-firmware-20260110-1.fc43.noarch", the firmware fixes should be present.
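The check described above can be sketched as a small Python helper that reads a package string such as the output of `rpm -q linux-firmware`. It assumes the datestamp in the package version is a reliable ordering and that any build at or after 20260110 carries the fixes; both are assumptions for illustration, and the function names are hypothetical. It says nothing about the kernel or ROCm userspace.

```python
import re
from typing import Optional

# First build reported in this bug as containing the firmware fixes.
FIXED_DATESTAMP = 20260110

# Matches "...-firmware-20260110-1..." and the epoch form "...-firmware-0:20260110-1...".
_FIRMWARE_RE = re.compile(r"-firmware-(?:\d+:)?(\d{8})-")


def firmware_datestamp(nvr: str) -> Optional[int]:
    """Extract the YYYYMMDD datestamp from a *-firmware package string."""
    match = _FIRMWARE_RE.search(nvr)
    return int(match.group(1)) if match else None


def has_firmware_fix(nvr: str) -> bool:
    """True if the package datestamp is at or after the fixed build."""
    stamp = firmware_datestamp(nvr)
    return stamp is not None and stamp >= FIXED_DATESTAMP
```

For example, `has_firmware_fix("linux-firmware-20251125-1.fc43.noarch")` is false, which matches the broken snapshot reported earlier in this bug.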
That is correct, Silverblue is behind the rest of Fedora. It goes updates-testing -> updates -> silverblue. That last step happens at some point in the future because they have their own testing cycles. We are currently at updates-testing; I will push it to updates later in the week once I am happy that the firmware update as a whole has had wide enough testing. Go and ask the Silverblue people what happens from there, because it's out of scope for this bug.
(In reply to Sid from comment #20)
> @mario - I meant the new silverblue upgrade/snapshot pushed out, it was
> 43.20260110.0 when originally messaged. Since then 43.20260112.0 has been
For future reference: as a Silverblue user, if the bug you want fixed is in any state other than CLOSED -> ERRATA, you won't have the fix. The fact that this one is currently ON_QA means it's not even in Fedora stable updates yet. Silverblue always trails.
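The rule of thumb above can be written down as a tiny lookup. This is only a sketch of the mapping described in this bug (ON_QA means updates-testing, CLOSED with resolution ERRATA means stable updates); the function names are hypothetical and no Bugzilla API is used.

```python
from typing import Optional

# Order in which a fix travels, as described in this bug.
PIPELINE = ("updates-testing", "updates", "silverblue")


def stage_for_bug(status: str, resolution: str = "") -> Optional[str]:
    """Map a bug state to the furthest repository stage the fix has reached."""
    if status == "CLOSED" and resolution == "ERRATA":
        return "updates"
    if status == "ON_QA":
        return "updates-testing"
    return None  # fix not built or not submitted yet


def silverblue_can_pick_up_fix(status: str, resolution: str = "") -> bool:
    """Silverblue composes only from stable updates, so it needs stage 'updates'."""
    return stage_for_bug(status, resolution) == "updates"
```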
So I've clarified that atomic desktops will get the update when it goes stable with the rest of Fedora. CoreOS releases every two weeks, so it will get the update on its next release after the update goes stable.
Thanks Peter, 20260110-1 is working for my workflows on 44 rawhide with rocm 7.11.0. Sid: I've also run llama-bench with your parameters without issue.
I have been running amd-gpu-firmware-20260110-1.fc43.noarch for about 24 hours, and it has been stable, with no crashes or system freezes during normal graphical usage.
Thanks for the help guys. I'm coming from Debian (personal) and RHEL (enterprise) backgrounds and am new to Silverblue processes. If there's a better place to discuss this, please redirect me. I thought kernels and linux-firmware were reasonably decoupled, but I'm following up since I read at https://github.com/kyuz0/amd-strix-halo-toolboxes:
> Stable configuration
> Kernel: 6.18.3-200
> Firmware: 20251111
> "NEWER KERNELS SUCH AS 6.18.4 BREAKS ROCm except for nightly builds"
(summarized, emphasis mine)
Tim mentioned linux-firmware 20260110-1.fc43.noarch contains the needed fixes. So, is it a linux-firmware only fix? If not, what kernel version pairs correctly with this firmware? Arthur is using a rawhide kernel (F44), and Louis' kernel version is unclear. Silverblue updates are now kernel:6.18.4-200.fc43 (today's prod push; also broken), so when Silverblue ships with the updated firmware (>= 20260110-1.fc43.noarch), can you all verify it pairs with a compatible kernel to avoid re-spinning? Appreciate it!
FEDORA-2026-1d240112ff (linux-firmware-20260110-1.fc42) has been pushed to the Fedora 42 stable repository. If problem still persists, please make note of it in this bug report.
FEDORA-2026-2cebf295af (linux-firmware-20260110-1.fc43) has been pushed to the Fedora 43 stable repository. If problem still persists, please make note of it in this bug report.
> Tim mentioned linux-firmware 20260110-1.fc43.noarch contains the needed fixes. So, is it a linux-firmware only fix? If not, what kernel version pairs correctly with this firmware? Arthur is using a rawhide kernel (F44), and Louis' kernel version is unclear.
It's independent to linux-firmware. rawhide has had the new firmware since about last Sat.
> Silverblue updates are now kernel:6.18.4-200.fc43 (today's prod push; also broken), so when Silverblue ships with the updated firmware (>= 20260110-1.fc43.noarch), can you all verify it pairs with a compatible kernel to avoid re-spinning? Appreciate it!
It's in today's stable push (Thurs), so it will be in updates heading out to mirrors later; check for updates later today or tomorrow and it should be good. All those details are in the above comment ;-)
New firmware arrived, but sadly freezes / crashes are still happening:
$ rpm -qa --last | grep -i -E "amd.*(firmware|microcode)|kernel-[0-9]"
kernel-6.18.5-200.fc43.x86_64 Do 15 Jan 2026 23:13:31 CET
amd-ucode-firmware-20260110-1.fc43.noarch Do 15 Jan 2026 23:13:31 CET
amd-gpu-firmware-20260110-1.fc43.noarch Do 15 Jan 2026 23:13:31 CET
kernel-6.17.1-300.fc43.x86_64 Mo 12 Jan 2026 04:30:47 CET
kernel-6.18.4-200.fc43.x86_64 Mo 12 Jan 2026 03:39:37 CET
$ uname -r
6.18.5-200.fc43.x86_64
$ sudo dmidecode -t system -t baseboard -t processor | grep -E "Manufacturer|Product Name|Version|Family"
Family: Zen
Manufacturer: Advanced Micro Devices, Inc.
Signature: Family 26, Model 96, Stepping 0
Version: AMD Ryzen AI 7 PRO 350 w/ Radeon 860M
Manufacturer: LENOVO
Product Name: 21RMCTO1WW
Version: ThinkPad X13 Gen 6
Family: ThinkPad X13 Gen 6
Manufacturer: LENOVO
Product Name: 21RMCTO1WW
Version: Not Defined
$ journalctl -k --since 2026-01-14 --grep=amdgpu --case-sensitive=no
[...]
Jan 15 23:07:27 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32780)
Jan 15 23:07:27 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: Process code pid 385501 thread code:cs0 pid 385506
Jan 15 23:07:27 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: in page starting at address 0x000000003f800000 from client 10
Jan 15 23:07:27 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00201430
Jan 15 23:07:27 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: Faulty UTCL2 client ID: SQC (data) (0xa)
Jan 15 23:07:27 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: MORE_FAULTS: 0x0
Jan 15 23:07:27 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: WALKER_ERROR: 0x0
Jan 15 23:07:27 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: PERMISSION_FAULTS: 0x3
Jan 15 23:07:27 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: MAPPING_ERROR: 0x0
Jan 15 23:07:27 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: RW: 0x0
Jan 15 23:07:38 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: Dumping IP State
Jan 15 23:07:38 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: Dumping IP State Completed
Jan 15 23:07:38 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
Jan 15 23:07:38 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
Jan 15 23:07:38 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=7522206, emitted seq=7522208
Jan 15 23:07:38 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: Process code pid 385501 thread code:cs0 pid 385506
Jan 15 23:07:38 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: Starting gfx_0.0.0 ring reset
Jan 15 23:07:40 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=RESET
Jan 15 23:07:40 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: failed to reset legacy queue
Jan 15 23:07:40 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: reset via MES failed and try pipe reset -110
Jan 15 23:07:40 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: Ring gfx_0.0.0 reset failed
Jan 15 23:07:40 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: GPU reset begin!
Jan 15 23:07:42 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
Jan 15 23:07:42 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: failed to unmap legacy queue
Jan 15 23:07:42 tpx13 kernel: [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
Jan 15 23:07:42 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: MODE2 reset
Jan 15 23:07:42 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: GPU reset succeeded, trying to resume
Jan 15 23:07:42 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: SMU is resuming...
Jan 15 23:07:42 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: SMU is resumed successfully!
Jan 15 23:07:42 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x09003100
Jan 15 23:07:43 tpx13 kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 15 23:07:47 tpx13 kernel: amdgpu 0000:c4:00.0: [drm] *ERROR* Step 2 of creating MST payload for 00000000fe84e6e9 failed: -5
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: [drm] *ERROR* Step 2 of creating MST payload for 00000000a79f0fff failed: -5
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring jpeg_dec_0 uses VM inv eng 1 on hub 8
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring vpe uses VM inv eng 4 on hub 8
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: GPU reset(4) succeeded!
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: [drm] device wedged, but recovered through reset
Jan 15 23:07:51 tpx13 gnome-shell[3985]: amdgpu: The CS has cancelled because the context is lost. This context is innocent.
Module libdrm_amdgpu.so.1 from rpm libdrm-2.4.128-3.fc43.x86_64
Module libdrm_amdgpu.so.1 from rpm libdrm-2.4.128-3.fc43.x86_64
#1 0x00007f41e9820dc0 _ZL30amdgpu_ctx_set_sw_reset_statusP17radeon_winsys_ctx17pipe_reset_statusPKcz (libgallium-25.2.7.so + 0xa20dc0)
#2 0x00007f41e9825021 _Z19amdgpu_cs_submit_ibIL10queue_type0EEvPvS1_i (libgallium-25.2.7.so + 0xa25021)
#2 0x00007f41e981ee3c amdgpu_bo_destroy (libgallium-25.2.7.so + 0xa1ee3c)
#4 0x00007f41e981f7a0 amdgpu_bo_create (libgallium-25.2.7.so + 0xa1f7a0)
Jan 15 23:07:56 tpx13 abrt-notification[658572]: Process 2524 (gnome-shell) crashed in amdgpu_ctx_query_reset_status(radeon_winsys_ctx*, bool, bool*, bool*)()
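A long kernel log like the one above is easier to compare across boots when it is reduced to a few counters. The following Python sketch does that for text saved from `journalctl -k`; the event names and the choice of patterns are assumptions for illustration, based only on the messages visible in this log.

```python
import re
from typing import Dict

# Patterns taken from the amdgpu messages quoted in this bug; not exhaustive.
PATTERNS = {
    "page_fault": re.compile(r"amdgpu: \[gfxhub\] page fault"),
    "ring_timeout": re.compile(r"amdgpu: ring (\S+) timeout"),
    "mes_no_response": re.compile(r"MES failed to respond to msg=(\w+)"),
    # Counts only the final "GPU reset(N) succeeded!" line, not the
    # intermediate "GPU reset succeeded, trying to resume" message.
    "gpu_reset_done": re.compile(r"GPU reset(?:\(\d+\))? succeeded!"),
}


def summarize_amdgpu_log(text: str) -> Dict[str, int]:
    """Count how many log lines match each known amdgpu event pattern."""
    counts = {name: 0 for name in PATTERNS}
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

A summary like this does not say which component regressed; it only makes it quicker to see whether a given firmware or kernel combination still produces faults and resets.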
You have what appears to be a different issue. You're on a Krackan Point system. Are you 100% sure it's the firmware that caused the regression and not Mesa or the kernel? Please open a new issue and CC me, and let's work through the details there.
No crashes for 4 days on an AMD Ryzen 9 7950X3D 16-Core Processor | fedora 6.18.5-300.vanilla.fc43.x86_64 | amd-gpu-firmware-0:20260110-1.fc43.noarch, though I still get `amdgpu 0000:15:00.0: [drm] *ERROR* LTTPR count is nonzero but invalid lane count reported. Assuming no LTTPR present.`, which I believe is an unrelated issue.
Sorry guys - still fails as before. IMO, it's clear Fedora Silverblue's quality pipeline needs a retrospective.
☠️ rocm 6.4.4
☠️ rocm 7.1.1
✅ rocm 7 nightlies
----------------------------------------------------
sid@vega:~$ rpm -q linux-firmware kernel
linux-firmware-20260110-1.fc43.noarch
kernel-6.18.5-200.fc43.x86_64
----------------------------------------------------
sid@vega:~$ ./run-rocm-smoketest.sh
Model: /mnt/data/projects/ai/models/hub/models--ggml-org--gemma-3-4b-it-GGUF/snapshots/d0976223747697cb51e056d85c532013931fe52e/gemma-3-4b-it-Q4_K_M.gguf
Toolbox: llama-rocm-6.4.4
toolbox run -c llama-rocm-6.4.4 -- llama-bench -fa 1 -ngl 99 -mmp 0 -m /mnt/data/projects/ai/models/hub/models--ggml-org--gemma-3-4b-it-GGUF/snapshots/d0976223747697cb51e056d85c532013931fe52e/gemma-3-4b-it-Q4_K_M.gguf
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
/opt/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:96: ROCm error
/usr/local/lib64/libggml-base.so.0(+0x35a5) [0x7f54d0ec85a5]
/usr/local/lib64/libggml-base.so.0(ggml_print_backtrace+0x1eb) [0x7f54d0ec896b]
/usr/local/lib64/libggml-base.so.0(ggml_abort+0x11f) [0x7f54d0ec8aef]
/usr/local/lib64/libggml-hip.so.0(+0x1c07e2) [0x7f54d11457e2]
/usr/local/lib64/libggml-hip.so.0(+0x1c5774) [0x7f54d114a774]
/usr/local/lib64/libllama.so.0(_ZN18llama_model_loader13load_all_dataEP12ggml_contextRSt13unordered_mapIjP19ggml_backend_bufferSt4hashIjESt8equal_toIjESaISt4pairIKjS4_EEEPSt6vectorISt10unique_ptrI11llama_mlockSt14default_deleteISH_EESaISK_EEPFbfPvESO_+0x10ad) [0x7f54d3fce8ad]
/usr/local/lib64/libllama.so.0(_ZN11llama_model12load_tensorsER18llama_model_loader+0x3cafe) [0x7f54d4025a5e]
/usr/local/lib64/libllama.so.0(+0x264e8) [0x7f54d3f424e8]
/usr/local/lib64/libllama.so.0(llama_model_load_from_file+0xac) [0x7f54d3f4334c]
llama-bench() [0x40787d]
/lib64/libc.so.6(+0x35b5) [0x7f54d085e5b5]
/lib64/libc.so.6(__libc_start_main+0x88) [0x7f54d085e668]
llama-bench() [0x409e85]
sid@vega:~$
> Sorry guys - still fails as before. IMO, it's clear Fedora Silverblue's quality pipeline needs a retrospect.
I believe you're seeing a different regression; the actual issue is a mismatch between the kernel and userspace. Can you rebuild your ROCm containers with a patch? If so, add this patch to your ROCm containers and it should fix the issue: https://github.com/ROCm/rocm-systems/commit/09ba45b3f43ec333a84a0ca178fcd1e3ea9400a9
(In reply to Sid from comment #37) > Sorry guys - still fails as before. IMO, it's clear Fedora Silverblue's > quality pipeline needs a retrospect. > > ☠️ rocm 6.4.4 > ☠️ rocm 7.1.1 > ✅ rocm 7 nightlies > ---------------------------------------------------- > sid@vega:~$ rpm -q linux-firmware kernel > linux-firmware-20260110-1.fc43.noarch > kernel-6.18.5-200.fc43.x86_64 > ---------------------------------------------------- > sid@vega:~$ ./run-rocm-smoketest.sh > Model: > /mnt/data/projects/ai/models/hub/models--ggml-org--gemma-3-4b-it-GGUF/ > snapshots/d0976223747697cb51e056d85c532013931fe52e/gemma-3-4b-it-Q4_K_M.gguf > Toolbox: llama-rocm-6.4.4 > toolbox run -c llama-rocm-6.4.4 -- llama-bench -fa 1 -ngl 99 -mmp 0 -m > /mnt/data/projects/ai/models/hub/models--ggml-org--gemma-3-4b-it-GGUF/ > snapshots/d0976223747697cb51e056d85c532013931fe52e/gemma-3-4b-it-Q4_K_M.gguf > ggml_cuda_init: found 1 ROCm devices: > Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32 > | model | size | params | backend | > ngl | fa | mmap | test | t/s | > | ------------------------------ | ---------: | ---------: | ---------- | > --: | -: | ---: | --------------: | -------------------: | > /opt/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:96: ROCm error > /usr/local/lib64/libggml-base.so.0(+0x35a5) [0x7f54d0ec85a5] > /usr/local/lib64/libggml-base.so.0(ggml_print_backtrace+0x1eb) > [0x7f54d0ec896b] > /usr/local/lib64/libggml-base.so.0(ggml_abort+0x11f) [0x7f54d0ec8aef] > /usr/local/lib64/libggml-hip.so.0(+0x1c07e2) [0x7f54d11457e2] > /usr/local/lib64/libggml-hip.so.0(+0x1c5774) [0x7f54d114a774] > /usr/local/lib64/libllama.so. > 0(_ZN18llama_model_loader13load_all_dataEP12ggml_contextRSt13unordered_mapIjP > 19ggml_backend_bufferSt4hashIjESt8equal_toIjESaISt4pairIKjS4_EEEPSt6vectorISt > 10unique_ptrI11llama_mlockSt14default_deleteISH_EESaISK_EEPFbfPvESO_+0x10ad) > [0x7f54d3fce8ad] > /usr/local/lib64/libllama.so. 
> 0(_ZN11llama_model12load_tensorsER18llama_model_loader+0x3cafe) > [0x7f54d4025a5e] > /usr/local/lib64/libllama.so.0(+0x264e8) [0x7f54d3f424e8] > /usr/local/lib64/libllama.so.0(llama_model_load_from_file+0xac) > [0x7f54d3f4334c] > llama-bench() [0x40787d] > /lib64/libc.so.6(+0x35b5) [0x7f54d085e5b5] > /lib64/libc.so.6(__libc_start_main+0x88) [0x7f54d085e668] > llama-bench() [0x409e85] > sid@vega:~$ As far as I understand that's expected. Older versions of ROCm are not compatible any longer with the newer kernels. Right now the ROCm nightly builds are what works on Linux, until the stable 7.2 is released. Mario, am I correct? Might be worth also pushing the fix to the 6.4 branch, maybe giving people a 6.4.5 option with the patch.
That's correct. For more context, this patch fixes stability issues that many people have raised with workloads that need a context switch. It's unfortunate that this creates a hard dependency where the kernel and userspace have to move together. In terms of using pre-built binaries for the legacy releases, our hands are tied: either rebuild them with the patch or switch to TheRock releases. TheRock-based releases will work with either new or older kernels.
Could you please clarify
> Older versions of ROCm are not compatible any longer with the newer kernels.
The crashes are happening with rocm 6.4.4 and 7.1.1. Looking at https://rocm.docs.amd.com/en/latest/release/versions.html, 6.4.4 is missing, but 6.4.3 was an August 7, 2025 release and 7.1.1 was a Nov 26, 2025 release. Less than _two months old_. Are you saying these are considered "older unsupported versions" on the current kernels being pushed out? That doesn't add up for me, especially considering NVIDIA's CUDA support cycles.
To be fair, I also noticed that Fedora is NOT officially supported at https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html. kyuz0 has made very helpful Strix Halo toolboxes https://github.com/kyuz0/amd-strix-halo-toolboxes and those are Fedora-centric, drawing many devs that way. It's not bad, but it would be best if AMD had its own fully supported Docker/distrobox containers with different llama.cpp + rocm/vulkan runtimes. Maybe contract/hire kyuz0 for that :) ?
Personally, I care less about Fedora/Ubuntu etc.; they're tools, and I can adjust. But vendor-supported reliability is important, so we can experiment on the layers above.
(In reply to Mario Limonciello from comment #35)
> You have what appears to be a different issue. You're on a Krackan Point system.
> Are you 100% sure it's the firmware that caused the regression and not mesa
> or kernel? Please open a new issue and CC me and let's work through details
> on it.
Ah OK. Sorry, I didn't mean to hijack the issue. I think I was distracted by the fact that discussion threads about the GPU freezing referred to this very bug... :-| So I'll open a new one if I run into problems again.
But (good news): I'm currently having no more issues. Apparently, the hardware doesn't just need a reset; a complete power-off (cold boot) is also a good idea after applying the patches. The latest firmware with the latest kernel seems to have fixed some bugs after a complete cold boot:
$ rpm -qa --last | grep -i -E "amd.*(firmware|microcode)|kernel-[0-9]"
amd-ucode-firmware-20260110-1.fc43.noarch Fri Jan 16, 2026 2:10:32 PM CET
amd-gpu-firmware-20260110-1.fc43.noarch Fri Jan 16, 2026 2:10:32 PM CET
kernel-6.18.5-200.fc43.x86_64 Thu Jan 15, 2026 11:13:31 PM CET
kernel-6.17.1-300.fc43.x86_64 Mon Jan 12 2026 04:30:47 CET
kernel-6.18.4-200.fc43.x86_64 Mon, Jan 12, 2026 03:39:37 CET
$ uname -r
6.18.5-200.fc43.x86_64
Anyway, the laptop has been running continuously for over 48 hours without crashing (just like it always did before, except for December). Hope it stays like that. :-)
> But (good news): I'm currently having no more issues. Apparently, the hardware doesn't just need a reset; at least a complete power-off (cold boot) is also a good idea after applying the patches. The latest firmware with the latest kernel seems to have fixed some bugs after a complete cold boot:
Happy to hear.
> The crashes are happening with ROCm 6.4.4 and 7.1.1. Looking at https://rocm.docs.amd.com/en/latest/release/versions.html, 6.4.4 is missing, but 6.4.3 was an August 7, 2025 release and 7.1.1 was a Nov 26, 2025 release. Less than _two months old_. Are you saying these are considered "older unsupported versions" on the current kernels being pushed out? That doesn't add up for me, especially considering NVIDIA's CUDA support cycles.
There has been a fundamental mistake in the VGPR size for Strix Halo for a very long time, and it has been causing instability for a while as well. It is specific to workloads with context switches. For example, ComfyUI could reproduce it easily. We tried a lot of things to fix this, but it eventually boiled down to the VGPR size being hardcoded in both userspace and kernel space. If the values are wrong or out of sync, things don't work properly.
So we've fixed it in the kernel to use the correct size: https://github.com/gregkh/linux/commit/7445db6a7d5a0242d8214582b480600b266cba9e
We've also added support to export that size to userspace so that it doesn't need to be hardcoded in userspace anymore: https://github.com/gregkh/linux/commit/7445db6a7d5a0242d8214582b480600b266cba9e
TheRock builds use this new interface if available, and thus they will "work" with both older and newer kernels, but the fundamental stability issue I mentioned above still exists. If the VGPR size is wrong, context switching doesn't work.
> To be fair, I did also notice that Fedora is NOT officially supported at https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html.
The compatibility matrix reflects what is tested and what AMD officially supports. But AMD does also work on native packaging in distros; these just don't get official support. FWIW the Fedora ROCm packages ARE picking up the VGPR size patch.
> But vendor-supported reliability is important, so we can experiment on the layers above.
I wish we could have fixed this 6 months ago. The biggest challenge is that using a debugger like rocgdb ALSO causes a context switch, so this required some even lower-level tools to identify the mismatch. Using a container built from the older series branch is totally fine; just pick up that patch and add it while building. It's literally a one-line change to take the correct VGPR size.
> kyuz0 has made very helpful Strix Halo toolboxes https://github.com/kyuz0/amd-strix-halo-toolboxes
These are phenomenal and I am really glad they make ROCm more accessible.
But I do want to say: the Fedora bug tracker is not a discussion forum. If you want to keep talking about this, we should move the conversation somewhere else. I would love any creative ideas that would allow us to make this work in more combinations, if you have them. Feel free to tag me somewhere else if you want to continue the conversation.