Bug 2420062 - requesting updated build to fix issues with AMD APU platforms
Summary: requesting updated build to fix issues with AMD APU platforms
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: linux-firmware
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: David Woodhouse
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2025-12-08 17:33 UTC by Tim Flink
Modified: 2026-01-19 16:04 UTC

Fixed In Version: linux-firmware-20260110-1.fc42 linux-firmware-20260110-1.fc43
Clone Of:
Environment:
Last Closed: 2026-01-15 00:52:34 UTC
Type: Bug
Embargoed:


Comment 1 Peter Robinson 2025-12-08 17:47:57 UTC
Groan, so good to see the AMD CI is non-existent :-/

Comment 2 Mario Limonciello 2025-12-08 18:13:33 UTC
Unfortunately, linux-firmware is a total blind spot.  There is ROCm CI for all things ROCm, and there is IGT for all things GFX.

These interactions are where the problems are :/

I think the right answer is going to be adding runners to the ROCm CI that linux-firmware CI can contact.

Comment 3 louisgtwo 2025-12-11 18:27:26 UTC
I think I'm hitting this bug on Fedora 43. With amd-ucode-firmware-20251125 and amd-gpu-firmware-20251125, GNOME was crashing multiple times. I tried KDE and Xfce with the same result. When I downgraded to amd-ucode-firmware-20251021 and amd-gpu-firmware-20251021 (the firmware that was released with Fedora 43) and rebuilt the initramfs, the system was stable. This is the second time AMD firmware has bitten me.

Comment 4 Peter Robinson 2025-12-11 18:45:36 UTC
I don't think amd-ucode-firmware has anything to do with the crashing. I suspect if you upgrade amd-ucode-firmware and leave the GPU FW downgraded you'll be fine.

But also this bug is about issues with ROCm, so I'm not sure if your issue is directly related.

Comment 5 Tim Flink 2025-12-11 18:49:58 UTC
2 of the linked issues are rocm specific and are why I started digging into the negative karma but https://gitlab.freedesktop.org/drm/amd/-/issues/4737 is more general, AFAIK. It details crashes and system freezes during normal graphical usage.

Comment 6 Mario Limonciello 2025-12-11 19:06:39 UTC
> I don't think amd-ucode-firmware has anything to do with the crashing. I suspect if you upgrade amd-ucode-firmware and leave the GPU FW downgraded you'll be fine.

I agree.

> But also this bug is about issues with ROCm, so I'm not sure if your issue is directly related.
> 2 of the linked issues are rocm specific and are why I started digging into the negative karma but https://gitlab.freedesktop.org/drm/amd/-/issues/4737 is more general, AFAIK. It details crashes and system freezes during normal graphical usage.

There's definitely a real issue.  ROCm probably just tickled it more easily.

Comment 7 luke 2025-12-26 00:09:40 UTC
Hello everyone, 
Merry Christmas!

Is there any way we could push out the reverted version from Mario? I am currently stuck with old packages thanks to an atomic system (silverblue).

I am new to the processes here - if there is anything I can do to help out / expedite, just leave me some info.

Comment 8 Peter Robinson 2025-12-26 05:13:41 UTC
> Is there any way we could push out the reverted version from Mario? I am
> currently stuck with old packages thanks to an atomic system (silverblue).

You should be able to do a dnf downgrade to drop back to working firmware. The FW will be updated when things are coordinated, it is holiday season so things take a little longer at times.
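
As a rough illustration of the downgrade described above (a sketch, not a command posted in this thread): the helper below only prints the commands rather than running them. The function name `print_downgrade_plan` is hypothetical, the `20251021` stamp is the build reported stable in comment 3, and the initramfs rebuild mirrors what that comment describes. It assumes a dnf-based install and that the older build is still available in an enabled repository.

```shell
#!/usr/bin/env bash
# Sketch: print (do not run) the steps for dropping back to an older
# AMD GPU firmware build on a dnf-based Fedora install.
# print_downgrade_plan is a hypothetical helper name.
print_downgrade_plan() {
    local stamp="$1"   # firmware date stamp, e.g. 20251021
    # Downgrade only the GPU firmware subpackage to the requested build.
    echo "sudo dnf downgrade amd-gpu-firmware-${stamp}"
    # Rebuild the initramfs so the older firmware is picked up at boot.
    echo "sudo dracut --force"
}

print_downgrade_plan 20251021
```

On atomic desktops such as Silverblue, dnf does not manage the base image, so this sketch would not apply there as-is; later comments in this thread roll back to an earlier deployment instead.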

Comment 9 Donato Capitella 2026-01-07 12:12:06 UTC
Happy New Year! Just checking in to see if there's an update on the expected timeline. I am part of a community of Strix Halo users and have written many tutorials based on Fedora, and right now every user who's installing an updated version of Fedora has a broken ROCm implementation. I have been advising users to downgrade the Linux firmware, but as you can imagine this creates a lot of confusion.

Comment 10 Peter Robinson 2026-01-07 12:14:53 UTC
Aiming for Friday

Comment 11 Sid 2026-01-11 02:09:01 UTC
@Tim Flink (and AMD team) -- A new release came thru today but failed

```
sid@vega:~$ ./run-rocm-smoketest.sh 
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
Memory access fault by GPU node-1 (Agent handle: 0x2ccbbe20) on address 0x7f45a612a000. Reason: Page not present or supervisor privilege.
sid@vega:~$ rpm-ostree status
State: idle
Deployments:
● fedora:fedora/43/x86_64/silverblue
                  Version: 43.20260110.0 (2026-01-10T00:28:27Z)
                   Commit: 3a0477ec79f1edb269ba6ed2844d86777b2d5a0d70624be1efa4c5530c9161c6
             GPGSignature: Valid signature by C6E7F081CF80E13146676E88829B606631645531

  fedora:fedora/43/x86_64/silverblue
                  Version: 43.1.6 (2025-10-23T03:11:18Z)
                   Commit: 4d40d281be93a88f3d559b5756df602f454f932f3c809a6a4250b91049ce40e8
             GPGSignature: Valid signature by C6E7F081CF80E13146676E88829B606631645531

  fedora:fedora/43/x86_64/silverblue
                  Version: 43.1.6 (2025-10-23T03:11:18Z)
                   Commit: 4d40d281be93a88f3d559b5756df602f454f932f3c809a6a4250b91049ce40e8
             GPGSignature: Valid signature by C6E7F081CF80E13146676E88829B606631645531
                   Pinned: yes
sid@vega:~$
```

Rollback makes it working again
```
sid@vega:~$ rpm-ostree status
State: idle
Deployments:
● fedora:fedora/43/x86_64/silverblue
                  Version: 43.1.6 (2025-10-23T03:11:18Z)
                   Commit: 4d40d281be93a88f3d559b5756df602f454f932f3c809a6a4250b91049ce40e8
             GPGSignature: Valid signature by C6E7F081CF80E13146676E88829B606631645531

  fedora:fedora/43/x86_64/silverblue
                  Version: 43.20260110.0 (2026-01-10T00:28:27Z)
                   Commit: 3a0477ec79f1edb269ba6ed2844d86777b2d5a0d70624be1efa4c5530c9161c6
             GPGSignature: Valid signature by C6E7F081CF80E13146676E88829B606631645531

  fedora:fedora/43/x86_64/silverblue
                  Version: 43.1.6 (2025-10-23T03:11:18Z)
                   Commit: 4d40d281be93a88f3d559b5756df602f454f932f3c809a6a4250b91049ce40e8
             GPGSignature: Valid signature by C6E7F081CF80E13146676E88829B606631645531
                   Pinned: yes
sid@vega:~$ ./run-rocm-smoketest.sh 
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | ROCm       |  99 |  1 |    0 |           pp512 |      2541.22 ± 29.97 |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | ROCm       |  99 |  1 |    0 |           tg128 |         68.67 ± 0.10 |

build: 9e41884dc (7687)
sid@vega:~$
```
What the quick smoke test runs ... 
```
sid@vega:~$ cat ./run-rocm-smoketest.sh 
#!/usr/bin/env bash
set -uo pipefail

toolbox run -c llama-rocm-7.1.1 -- /usr/local/bin/llama-bench  -fa 1 -ngl 99 -mmp 0 -m /mnt/data/models/hub/models--ggml-org--gemma-3-4b-it-GGUF/snapshots/d0976223747697cb51e056d85c532013931fe52e/gemma-3-4b-it-Q4_K_M.gguf
```

Comment 12 Fedora Update System 2026-01-11 04:14:13 UTC
FEDORA-2026-1d240112ff (linux-firmware-20260110-1.fc42) has been submitted as an update to Fedora 42.
https://bodhi.fedoraproject.org/updates/FEDORA-2026-1d240112ff

Comment 13 Fedora Update System 2026-01-11 04:14:26 UTC
FEDORA-2026-2cebf295af (linux-firmware-20260110-1.fc43) has been submitted as an update to Fedora 43.
https://bodhi.fedoraproject.org/updates/FEDORA-2026-2cebf295af

Comment 14 Mario Limonciello 2026-01-11 18:17:58 UTC
> @Tim Flink (and AMD team) -- A new release came thru today but failed

New release of what?  A Silverblue snapshot?  I need to know what details are in this "release".
* MES F/W version
* rocr-runtime version (7.1.1-XXX)  What's the XXX?  It needs to be -2 or newer to pick up the GFX1151 patch IIUC.
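
One possible way to collect the MES firmware version asked for above, offered as a sketch under assumptions rather than a confirmed procedure: the amdgpu driver exposes a firmware summary in debugfs, and the filter below keeps only the MES entries. The helper name `mes_firmware_lines` is hypothetical, and the DRM device index (`0` in the usage comment) is an assumption that may differ per machine.

```shell
#!/usr/bin/env bash
# Sketch: filter amdgpu firmware info (read from stdin) down to MES entries.
# mes_firmware_lines is a hypothetical helper name.
mes_firmware_lines() {
    # Keep lines that begin with "MES" (for example MES and MES_KIQ entries).
    grep -E '^MES'
}

# Usage on a live system (needs root and a mounted debugfs; the index 0
# is an assumption and may need adjusting):
#   sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info | mes_firmware_lines
```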

Comment 15 Peter Robinson 2026-01-12 00:27:58 UTC
(In reply to Mario Limonciello from comment #14)
> > @Tim Flink (and AMD team) -- A new release came thru today but failed
> 
> New release of what?  A Silverblue snapshot?  I need to know what details

I suspect they mean the new upstream linux-firmware.

Comment 16 Peter Robinson 2026-01-12 00:34:00 UTC
It's in updates-testing so will be a stable update later this week

Comment 17 Fedora Update System 2026-01-12 01:34:11 UTC
FEDORA-2026-2cebf295af has been pushed to the Fedora 43 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2026-2cebf295af`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2026-2cebf295af

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 18 Fedora Update System 2026-01-12 01:55:46 UTC
FEDORA-2026-1d240112ff has been pushed to the Fedora 42 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2026-1d240112ff`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2026-1d240112ff

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 19 Peter Robinson 2026-01-12 10:33:47 UTC
Tim: can you confirm this looks good to you?

Comment 20 Sid 2026-01-12 17:40:06 UTC
@mario - I meant the new silverblue upgrade/snapshot pushed out, it was 43.20260110.0 when originally messaged. Since then 43.20260112.0 has been pushed out, which is also broken. I'll list the underlying components, but as a silverblue user, treating it as "one release". From what I'm reading this could be an issue in kernel (6.18.4-200.fc43.x86_64 vs 6.17.1-300.fc43.x86_64). Details below.

Side note, we actually moved this lab machine from workstation to silverblue for "greater stability" (which is sort of true? we can quickly rollback/upgrade trivially). I'll do my best to gather more helpful info, that AMD Strix Halo box is the only one right now, so intrusive to stop everything -> upgrade -> test -> rollback and resume actual workloads. 
-------------------------------------------------------------
Working:
Fedora 43 Silverblue snapshot: 43.1.6
sid@vega:~$ rpm-ostree status
State: idle
Deployments:
● fedora:fedora/43/x86_64/silverblue
                  Version: 43.1.6 (2025-10-23T03:11:18Z)
                   Commit: 4d40d281be93a88f3d559b5756df602f454f932f3c809a6a4250b91049ce40e8
             GPGSignature: Valid signature by C6E7F081CF80E13146676E88829B606631645531

sid@vega:~$ rpm -q linux-firmware
Kernel: Linux 6.17.1-300.fc43.x86_64
Firmware: linux-firmware-20251021-1.fc43.noarch
-------------------------------------------------------------
Broken:
Fedora 43 Silverblue snapshot: 43.20260112.0 
sid@vega:~$ rpm-ostree status
State: idle
Deployments:
● fedora:fedora/43/x86_64/silverblue
                  Version: 43.20260112.0 (2026-01-12T00:27:07Z)
                   Commit: 15edf9df6181db7fe6d70cd704d4dfda85edaf64f95317698386495b6f00e99a
             GPGSignature: Valid signature by C6E7F081CF80E13146676E88829B606631645531

Kernel: Linux 6.18.4-200.fc43.x86_64
Firmware: linux-firmware-20251125-1.fc43.noarch

Failure Rate: 100% (10/10; immediately )
sid@vega:~$ toolbox enter llama-rocm-7.1.1
⬢ [sid@toolbx ~]$ llama-bench  -fa 1 -ngl 99 -mmp 0 -m /mnt/data/projects/ai/models/hub/models--ggml-org--gemma-3-4b-it-GGUF/snapshots/d0976223747697cb51e056d85c532013931fe52e/gemma-3-4b-it-Q4_K_M.gguf
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
Segmentation fault         (core dumped) llama-bench -fa 1 -ngl 99 -mmp 0 -m /mnt/data/projects/ai/models/hub/models--ggml-org--gemma-3-4b-it-GGUF/snapshots/d0976223747697cb51e056d85c532013931fe52e/gemma-3-4b-it-Q4_K_M.gguf
⬢ [sid@toolbx ~]$ llama-bench  -fa 1 -ngl 99 -mmp 0 -m /mnt/data/projects/ai/models/hub/models--ggml-org--gemma-3-4b-it-GGUF/snapshots/d0976223747697cb51e056d85c532013931fe52e/gemma-3-4b-it-Q4_K_M.gguf
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
Segmentation fault         (core dumped) llama-bench -fa 1 -ngl 99 -mmp 0 -m /mnt/data/projects/ai/models/hub/models--ggml-org--gemma-3-4b-it-GGUF/snapshots/d0976223747697cb51e056d85c532013931fe52e/gemma-3-4b-it-Q4_K_M.gguf
⬢ [sid@toolbx ~]$

Comment 21 Mario Limonciello 2026-01-12 18:48:18 UTC
OK, that confirms you don't have the updated linux-firmware with the fix in the broken image.  Once it migrates out of testing and you get a new snapshot you /should/ be good to go.

Comment 22 Sid 2026-01-12 19:16:17 UTC
Thanks Mario. Any indicators of that 'stable build'? Like a version # (of immutable silverblue snapshot or linux-firmware) or an expected date?

Could I also recommend a gatekeeper test around llama-bench on your CI? It's quick, yet stressful, and relatively isolated for end-to-end tests as llama-cpp is a single-folder app. kyuz0's `amd-strix-halo-toolboxes` make runtime switching trivial. An even smaller model would work (e.g. gemma3 4b Q4_K_M is 2.4GB).
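
A minimal sketch of what such a gate could look like, assuming the benchmark is run as an ordinary command and that failure shows up either as a non-zero exit status or as one of the fault strings seen in the logs in this thread. The function name `smoke_gate` and the pass/fail messages are hypothetical, not part of any existing CI.

```shell
#!/usr/bin/env bash
# Sketch of a CI gate: run a benchmark command and fail on a non-zero exit
# status or on fault strings that appear in this bug's logs.
# smoke_gate is a hypothetical helper name.
smoke_gate() {
    local out status
    # Capture stdout and stderr together; remember the exit status either way.
    out=$("$@" 2>&1) && status=0 || status=$?
    printf '%s\n' "$out"
    if [ "$status" -ne 0 ]; then
        echo "GATE FAIL: exit status $status" >&2
        return 1
    fi
    # A clean exit is not enough: also reject known GPU fault messages.
    if printf '%s\n' "$out" | grep -Eq 'Memory access fault|Segmentation fault|ROCm error'; then
        echo "GATE FAIL: fault string in output" >&2
        return 1
    fi
    echo "GATE PASS"
}

# Usage, with MODEL standing in for a local GGUF path (placeholder):
#   smoke_gate toolbox run -c llama-rocm-7.1.1 -- llama-bench -fa 1 -ngl 99 -mmp 0 -m "$MODEL"
```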

Comment 23 Tim Flink 2026-01-12 19:17:00 UTC
(In reply to Peter Robinson from comment #19)
> Tim: can you confirm this looks good to you?

The changelog has all the relevant changes listed so everything should be good.

I don't have access to any of the relevant HW myself but I'm trying to find someone who can at least run some basic tests to confirm that the quickly-testable issues have disappeared.

Comment 24 Tim Flink 2026-01-12 19:21:01 UTC
(In reply to Sid from comment #22)
> Thanks Mario. Any indicators of that 'stable build'? Like a version # (of
> immutable silverblue snapshot or linux-firmware) or an expected date?
> 
> Could I also recommend a gatekeeper test around llama-bench on your CI? It's
> quick, yet stressful, relatively isolated for end to end tests as llama-cpp
> is a single folder app. kyuz0's `amd-strix-halo-toolboxes` make it very
> trivial for runtime switching. And even smaller model would work (e.g.
> gemma3 4b Q4_K_M is 2.4GB).

I'm not terribly familiar with silverblue but I believe that it's built on the bits that have gone stable in the relevant Fedora release repos. I doubt that you'll see a change in silverblue until this linux-firmware update has gone stable but once it does get pushed stable, I imagine that the next silverblue build/update after that will have the firmware changes in it.

As far as an indication, the "rpm-ostree status" command shows the linux-firmware build used. Once that says "linux-firmware-20260110-1.fc43.noarch", the firmware fixes should be present.
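
The check described above can be sketched as a small helper that compares the date stamp in the package name against the fixed build. This is an illustration only: `firmware_at_least_fixed_build` is a hypothetical name, and it assumes the package string has the `linux-firmware-YYYYMMDD-...` shape that `rpm -q linux-firmware` prints elsewhere in this thread. Older builds such as 20251021 predate the regression, so a "no" from this helper means "not the fixed build", not necessarily "broken".

```shell
#!/usr/bin/env bash
# Sketch: succeed when a linux-firmware package name carries a date stamp
# at or after the build listed in this bug's "Fixed In Version" field.
# firmware_at_least_fixed_build is a hypothetical helper name.
FIXED_STAMP=20260110   # from linux-firmware-20260110-1.fc43

firmware_at_least_fixed_build() {
    local nvr="$1" stamp
    # Pull out the 8-digit date stamp that follows "linux-firmware-".
    stamp=$(printf '%s\n' "$nvr" | sed -n 's/^linux-firmware-\([0-9]\{8\}\)-.*/\1/p')
    # Fail on unparseable input; otherwise compare numerically.
    [ -n "$stamp" ] && [ "$stamp" -ge "$FIXED_STAMP" ]
}

# Usage on a live system:
#   firmware_at_least_fixed_build "$(rpm -q linux-firmware)" && echo "fixed build installed"
```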

Comment 25 Peter Robinson 2026-01-13 00:46:35 UTC
That is correct, silverblue is behind the rest of Fedora. It goes updates-testing -> updates -> silverblue. That last bit is at some point in the future because they have their own testing cycles. We are currently at updates-testing, I will push it to updates later in the week once I am happy the firmware update as a whole has had wide enough testing. Go and ask the silverblue people what happens from there because it's out of scope for this bug.

Comment 26 Peter Robinson 2026-01-13 00:49:47 UTC
(In reply to Sid from comment #20)
> @mario - I meant the new silverblue upgrade/snapshot pushed out, it was
> 43.20260110.0 when originally messaged. Since then 43.20260112.0 has been

For future reference, if a bug you are looking for a fix for as a silverblue user is anything but CLOSED -> ERRATA you won't have the fix. The fact this is currently ON_QA means it's not even in Fedora stable updates yet. Silverblue always trails.

Comment 27 Peter Robinson 2026-01-13 02:37:15 UTC
So I've clarified that atomic desktops will get the update when it goes stable with the rest of Fedora, CoreOS releases every two weeks so will get the update on their next release after the update goes stable.

Comment 28 Arthur Sore 2026-01-13 08:27:52 UTC
Thanks Peter, 20260110-1 is working on my workflows, on 44 rawhide with rocm 7.11.0.

Sid: I've also run llama-bench with your parameters without issue.

Comment 29 louisgtwo 2026-01-13 16:44:31 UTC
I have been running amd-gpu-firmware-20260110-1.fc43.noarch for about 24 hours and it has been stable, with no crashes or system freezes during normal graphical usage.

Comment 30 Sid 2026-01-14 20:22:55 UTC
Thanks for the help guys. I'm coming from Debian (personal) and RHEL (enterprise) backgrounds, new to Silverblue processes. If there's a better place to discuss this, please redirect me. I thought kernels and linux-firmware's were reasonably decoupled but I'm following up since I read at https://github.com/kyuz0/amd-strix-halo-toolboxes:

> Stable configuration
> Kernel: 6.18.3-200
> Firmware: 20251111
> "NEWER KERNELS SUCH AS 6.18.4 BREAKS ROCm except for nightly builds" (summarized emphasis mine) 

Tim mentioned linux-firmware 20260110-1.fc43.noarch contains the needed fixes. So, is it a linux-firmware only fix? If not, what kernel version pairs correctly with this firmware? Arthur is using a rawhide kernel (F44), and Louis' kernel version is unclear.

Silverblue updates are now kernel:6.18.4-200.fc43 (today's prod push; also broken) so when Silverblue ships with the updated firmware (>= 20260110-1.fc43.noarch), can you verify it pairs with a compatible kernel to avoid re-spinning? Appreciate it!

Comment 31 Fedora Update System 2026-01-15 00:52:34 UTC
FEDORA-2026-1d240112ff (linux-firmware-20260110-1.fc42) has been pushed to the Fedora 42 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 32 Fedora Update System 2026-01-15 01:12:54 UTC
FEDORA-2026-2cebf295af (linux-firmware-20260110-1.fc43) has been pushed to the Fedora 43 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 33 Peter Robinson 2026-01-15 04:16:05 UTC
> Tim mentioned linux-firmware 20260110-1.fc43.noarch contains the needed
> fixes. So, is it a linux-firmware only fix? If not, what kernel version
> pairs correctly with this firmware? Arthur is using a rawhide kernel (F44),
> and Louis' kernel version is unclear.

It's independent to linux-firmware. rawhide has had the new firmware since about last Sat.

> Silverblue updates are now kernel:6.18.4-200.fc43 (today's prod push; also
> broken) so when Silverblue ships with the updated firmware (>=
> 20260110-1.fc43.noarch), can you verify it pairs with a compatible kernel
> to avoid re-spinning? Appreciate it!

It's in today's stable push (Thurs) so it will be in updates heading out to mirrors later, likely check updates later today or tomorrow and it should be good. All those details are in the above comment ;-)

Comment 34 Andreas Haerter 2026-01-16 01:15:02 UTC
New firmware arrived, but sadly still freezes / crashes happening:


$ rpm -qa --last | grep -i -E "amd.*(firmware|microcode)|kernel-[0-9]"
kernel-6.18.5-200.fc43.x86_64                 Do 15 Jan 2026 23:13:31 CET
amd-ucode-firmware-20260110-1.fc43.noarch     Do 15 Jan 2026 23:13:31 CET
amd-gpu-firmware-20260110-1.fc43.noarch       Do 15 Jan 2026 23:13:31 CET
kernel-6.17.1-300.fc43.x86_64                 Mo 12 Jan 2026 04:30:47 CET
kernel-6.18.4-200.fc43.x86_64                 Mo 12 Jan 2026 03:39:37 CET


$ uname -r
6.18.5-200.fc43.x86_64


$ sudo dmidecode -t system -t baseboard -t processor | grep -E "Manufacturer|Product Name|Version|Family"
	Family: Zen
	Manufacturer: Advanced Micro Devices, Inc.
	Signature: Family 26, Model 96, Stepping 0
	Version: AMD Ryzen AI 7 PRO 350 w/ Radeon 860M          
	Manufacturer: LENOVO
	Product Name: 21RMCTO1WW
	Version: ThinkPad X13 Gen 6
	Family: ThinkPad X13 Gen 6
	Manufacturer: LENOVO
	Product Name: 21RMCTO1WW
	Version: Not Defined


$  journalctl -k --since 2026-01-14 --grep=amdgpu --case-sensitive=no
[...]
Jan 15 23:07:27 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32780)
Jan 15 23:07:27 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu:  Process code pid 385501 thread code:cs0 pid 385506
Jan 15 23:07:27 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu:   in page starting at address 0x000000003f800000 from client 10
Jan 15 23:07:27 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00201430
Jan 15 23:07:27 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu:          Faulty UTCL2 client ID: SQC (data) (0xa)
Jan 15 23:07:27 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu:          MORE_FAULTS: 0x0
Jan 15 23:07:27 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu:          WALKER_ERROR: 0x0
Jan 15 23:07:27 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Jan 15 23:07:27 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu:          MAPPING_ERROR: 0x0
Jan 15 23:07:27 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu:          RW: 0x0
Jan 15 23:07:38 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: Dumping IP State
Jan 15 23:07:38 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: Dumping IP State Completed
Jan 15 23:07:38 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
Jan 15 23:07:38 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
Jan 15 23:07:38 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=7522206, emitted seq=7522208
Jan 15 23:07:38 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu:  Process code pid 385501 thread code:cs0 pid 385506
Jan 15 23:07:38 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: Starting gfx_0.0.0 ring reset
Jan 15 23:07:40 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=RESET
Jan 15 23:07:40 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: failed to reset legacy queue
Jan 15 23:07:40 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: reset via MES failed and try pipe reset -110
Jan 15 23:07:40 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: Ring gfx_0.0.0 reset failed
Jan 15 23:07:40 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: GPU reset begin!
Jan 15 23:07:42 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
Jan 15 23:07:42 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: failed to unmap legacy queue
Jan 15 23:07:42 tpx13 kernel: [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
Jan 15 23:07:42 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: MODE2 reset
Jan 15 23:07:42 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: GPU reset succeeded, trying to resume
Jan 15 23:07:42 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: SMU is resuming...
Jan 15 23:07:42 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: SMU is resumed successfully!
Jan 15 23:07:42 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: [drm] DMUB hardware initialized: version=0x09003100
Jan 15 23:07:43 tpx13 kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jan 15 23:07:47 tpx13 kernel: amdgpu 0000:c4:00.0: [drm] *ERROR* Step 2 of creating MST payload for 00000000fe84e6e9 failed: -5
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: [drm] *ERROR* Step 2 of creating MST payload for 00000000a79f0fff failed: -5
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring jpeg_dec_0 uses VM inv eng 1 on hub 8
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: ring vpe uses VM inv eng 4 on hub 8
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: amdgpu: GPU reset(4) succeeded!
Jan 15 23:07:51 tpx13 kernel: amdgpu 0000:c4:00.0: [drm] device wedged, but recovered through reset
Jan 15 23:07:51 tpx13 gnome-shell[3985]: amdgpu: The CS has cancelled because the context is lost. This context is innocent.
                                                Module libdrm_amdgpu.so.1 from rpm libdrm-2.4.128-3.fc43.x86_64
                                                Module libdrm_amdgpu.so.1 from rpm libdrm-2.4.128-3.fc43.x86_64
                                                #1  0x00007f41e9820dc0 _ZL30amdgpu_ctx_set_sw_reset_statusP17radeon_winsys_ctx17pipe_reset_statusPKcz (libgallium-25.2.7.so + 0xa20dc0)
                                                #2  0x00007f41e9825021 _Z19amdgpu_cs_submit_ibIL10queue_type0EEvPvS1_i (libgallium-25.2.7.so + 0xa25021)
                                                #2  0x00007f41e981ee3c amdgpu_bo_destroy (libgallium-25.2.7.so + 0xa1ee3c)
                                                #4  0x00007f41e981f7a0 amdgpu_bo_create (libgallium-25.2.7.so + 0xa1f7a0)
Jan 15 23:07:56 tpx13 abrt-notification[658572]: Process 2524 (gnome-shell) crashed in amdgpu_ctx_query_reset_status(radeon_winsys_ctx*, bool, bool*, bool*)()

Comment 35 Mario Limonciello 2026-01-16 03:42:50 UTC
You have what appears to be a different issue.  You're on a Krackan Point system.
Are you 100% sure it's the firmware that caused the regression and not mesa or kernel?  Please open a new issue and CC me and let's work through details on it.

Comment 36 devitt.cs 2026-01-16 22:52:39 UTC
no crashes for 4 days on AMD Ryzen 9 7950X3D 16-Core Processor | fedora 6.18.5-300.vanilla.fc43.x86_64 | amd-gpu-firmware-0:20260110-1.fc43.noarch

though I still get `amdgpu 0000:15:00.0: [drm] *ERROR* LTTPR count is nonzero but invalid lane count reported. Assuming no LTTPR present.`, which I believe is an unrelated issue

Comment 37 Sid 2026-01-17 01:05:37 UTC
Sorry guys - still fails as before. IMO, it's clear Fedora Silverblue's quality pipeline needs a retrospect. 

☠️ rocm 6.4.4 
☠️ rocm 7.1.1
✅ rocm 7 nightlies
----------------------------------------------------
sid@vega:~$ rpm -q linux-firmware kernel
linux-firmware-20260110-1.fc43.noarch
kernel-6.18.5-200.fc43.x86_64
----------------------------------------------------
sid@vega:~$ ./run-rocm-smoketest.sh 
Model: /mnt/data/projects/ai/models/hub/models--ggml-org--gemma-3-4b-it-GGUF/snapshots/d0976223747697cb51e056d85c532013931fe52e/gemma-3-4b-it-Q4_K_M.gguf
  Toolbox: llama-rocm-6.4.4
toolbox run -c llama-rocm-6.4.4 -- llama-bench -fa 1 -ngl 99 -mmp 0 -m /mnt/data/projects/ai/models/hub/models--ggml-org--gemma-3-4b-it-GGUF/snapshots/d0976223747697cb51e056d85c532013931fe52e/gemma-3-4b-it-Q4_K_M.gguf
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
/opt/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:96: ROCm error
/usr/local/lib64/libggml-base.so.0(+0x35a5) [0x7f54d0ec85a5]
/usr/local/lib64/libggml-base.so.0(ggml_print_backtrace+0x1eb) [0x7f54d0ec896b]
/usr/local/lib64/libggml-base.so.0(ggml_abort+0x11f) [0x7f54d0ec8aef]
/usr/local/lib64/libggml-hip.so.0(+0x1c07e2) [0x7f54d11457e2]
/usr/local/lib64/libggml-hip.so.0(+0x1c5774) [0x7f54d114a774]
/usr/local/lib64/libllama.so.0(_ZN18llama_model_loader13load_all_dataEP12ggml_contextRSt13unordered_mapIjP19ggml_backend_bufferSt4hashIjESt8equal_toIjESaISt4pairIKjS4_EEEPSt6vectorISt10unique_ptrI11llama_mlockSt14default_deleteISH_EESaISK_EEPFbfPvESO_+0x10ad) [0x7f54d3fce8ad]
/usr/local/lib64/libllama.so.0(_ZN11llama_model12load_tensorsER18llama_model_loader+0x3cafe) [0x7f54d4025a5e]
/usr/local/lib64/libllama.so.0(+0x264e8) [0x7f54d3f424e8]
/usr/local/lib64/libllama.so.0(llama_model_load_from_file+0xac) [0x7f54d3f4334c]
llama-bench() [0x40787d]
/lib64/libc.so.6(+0x35b5) [0x7f54d085e5b5]
/lib64/libc.so.6(__libc_start_main+0x88) [0x7f54d085e668]
llama-bench() [0x409e85]
sid@vega:~$

Comment 38 Mario Limonciello 2026-01-17 02:39:05 UTC
> Sorry guys - still fails as before. IMO, it's clear Fedora Silverblue's quality pipeline needs a retrospect. 

I believe you're seeing a different regression, and it's actually an issue that there is a mismatch with kernel and userspace.  Can you rebuild your ROCm containers with a patch?  If so - add this patch to your ROCm containers and it should fix the issue.

https://github.com/ROCm/rocm-systems/commit/09ba45b3f43ec333a84a0ca178fcd1e3ea9400a9

Comment 39 Donato Capitella 2026-01-17 07:01:01 UTC
(In reply to Sid from comment #37)
> Sorry guys - still fails as before. IMO, it's clear Fedora Silverblue's
> quality pipeline needs a retrospect. 
> 
> ☠️ rocm 6.4.4 
> ☠️ rocm 7.1.1
> ✅ rocm 7 nightlies
> ----------------------------------------------------
> sid@vega:~$ rpm -q linux-firmware kernel
> linux-firmware-20260110-1.fc43.noarch
> kernel-6.18.5-200.fc43.x86_64
> ----------------------------------------------------
> sid@vega:~$ ./run-rocm-smoketest.sh 
> Model:
> /mnt/data/projects/ai/models/hub/models--ggml-org--gemma-3-4b-it-GGUF/
> snapshots/d0976223747697cb51e056d85c532013931fe52e/gemma-3-4b-it-Q4_K_M.gguf
>   Toolbox: llama-rocm-6.4.4
> toolbox run -c llama-rocm-6.4.4 -- llama-bench -fa 1 -ngl 99 -mmp 0 -m
> /mnt/data/projects/ai/models/hub/models--ggml-org--gemma-3-4b-it-GGUF/
> snapshots/d0976223747697cb51e056d85c532013931fe52e/gemma-3-4b-it-Q4_K_M.gguf
> ggml_cuda_init: found 1 ROCm devices:
>   Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
> | model                          |       size |     params | backend    |
> ngl | fa | mmap |            test |                  t/s |
> | ------------------------------ | ---------: | ---------: | ---------- |
> --: | -: | ---: | --------------: | -------------------: |
> /opt/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:96: ROCm error
> /usr/local/lib64/libggml-base.so.0(+0x35a5) [0x7f54d0ec85a5]
> /usr/local/lib64/libggml-base.so.0(ggml_print_backtrace+0x1eb)
> [0x7f54d0ec896b]
> /usr/local/lib64/libggml-base.so.0(ggml_abort+0x11f) [0x7f54d0ec8aef]
> /usr/local/lib64/libggml-hip.so.0(+0x1c07e2) [0x7f54d11457e2]
> /usr/local/lib64/libggml-hip.so.0(+0x1c5774) [0x7f54d114a774]
> /usr/local/lib64/libllama.so.
> 0(_ZN18llama_model_loader13load_all_dataEP12ggml_contextRSt13unordered_mapIjP
> 19ggml_backend_bufferSt4hashIjESt8equal_toIjESaISt4pairIKjS4_EEEPSt6vectorISt
> 10unique_ptrI11llama_mlockSt14default_deleteISH_EESaISK_EEPFbfPvESO_+0x10ad)
> [0x7f54d3fce8ad]
> /usr/local/lib64/libllama.so.
> 0(_ZN11llama_model12load_tensorsER18llama_model_loader+0x3cafe)
> [0x7f54d4025a5e]
> /usr/local/lib64/libllama.so.0(+0x264e8) [0x7f54d3f424e8]
> /usr/local/lib64/libllama.so.0(llama_model_load_from_file+0xac)
> [0x7f54d3f4334c]
> llama-bench() [0x40787d]
> /lib64/libc.so.6(+0x35b5) [0x7f54d085e5b5]
> /lib64/libc.so.6(__libc_start_main+0x88) [0x7f54d085e668]
> llama-bench() [0x409e85]
> sid@vega:~$

As far as I understand that's expected. Older versions of ROCm are not compatible any longer with the newer kernels. Right now the ROCm nightly builds are what works on Linux, until the stable 7.2 is released.

Mario, am I correct? Might be worth also pushing the fix to the 6.4 branch, maybe giving people a 6.4.5 option with the patch.

Comment 40 Mario Limonciello 2026-01-17 12:13:00 UTC
That's correct. For more context, this patch fixes stability issues that many people have raised with workloads that need a context switch. It's unfortunate that this creates a hard dependency where kernel and userspace have to move together.

In terms of using prebuilt binaries for the legacy releases, our hands are tied. Either rebuild them with the patch or switch to the TheRock releases.

The TheRock-based releases will work with either newer or older kernels.

Comment 41 Sid 2026-01-17 21:36:54 UTC
Could you please clarify 

> Older versions of ROCm are not compatible any longer with the newer kernels. 

The crashes are happening with rocm 6.4.4 and 7.1.1. Looking at https://rocm.docs.amd.com/en/latest/release/versions.html, 6.4.4 is missing but 6.4.3 was an August 7, 2025 release and 7.1.1 was a Nov 26, 2025 release. Less than _two months old_. Are you saying these are considered "older unsupported versions" on the current kernels being pushed out? That doesn't add up for me, especially considering nvidia's cuda support cycles.

To be fair, I also noticed that Fedora is NOT officially supported at https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html.

kyuz0 has really made very helpful Strix Halo toolboxes https://github.com/kyuz0/amd-strix-halo-toolboxes and those are Fedora-centric, drawing many devs that way. It's not bad, but it would be best if AMD had their own fully supported Docker/distrobox containers with different llama.cpp + rocm/vulkan runtimes. Maybe contract/hire kyuz0 for that :) ? Personally, I care less about Fedora/Ubuntu etc, they're tools, I can adjust. But vendor-supported reliability is important, so we can experiment on the layers above.

Comment 42 Andreas Haerter 2026-01-17 22:49:27 UTC
(In reply to Mario Limonciello from comment #35)
> You have what appears to be a different issue.  You're on a Krackan Point system.
> Are you 100% sure it's the firmware that caused the regression and not mesa
> or kernel?  Please open a new issue and CC me and let's work through details
> on it.


Ah OK. Sorry, I didn't mean to hijack the issue. I think I was distracted by the fact that discussion threads about the GPU freezing referred to this very bug... :-|
So I'll open a new one if I run into problems again.

But (good news): I'm currently having no more issues. Apparently, the hardware doesn't just need a reset; a complete power-off (cold boot) is also a good idea after applying the patches. The latest firmware with the latest kernel seems to have fixed some bugs after a complete cold boot:

$ rpm -qa --last | grep -i -E "amd.*(firmware|microcode)|kernel-[0-9]"
amd-ucode-firmware-20260110-1.fc43.noarch Fri Jan 16, 2026 2:10:32 PM CET
amd-gpu-firmware-20260110-1.fc43.noarch Fri Jan 16, 2026 2:10:32 PM CET
kernel-6.18.5-200.fc43.x86_64 Thu Jan 15, 2026 11:13:31 PM CET
kernel-6.17.1-300.fc43.x86_64 Mon Jan 12 2026 04:30:47 CET
kernel-6.18.4-200.fc43.x86_64 Mon, Jan 12, 2026 03:39:37 CET

$ uname -r

6.18.5-200.fc43.x86_64

Anyway, the laptop has been running continuously for over 48 hours without crashing (just like it always did before, except in December). Hope it stays like that. :-)
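For anyone who wants to script the same check, here is a minimal Python sketch that parses package strings like the ones listed above and reports whether the firmware datestamp is at least 20260110, the "Fixed In Version" of this bug. The function names and the regular expression are illustrative assumptions based on the package strings quoted in this thread; they are not part of any Fedora tooling.

```python
import re

# Datestamp of the linux-firmware build listed under "Fixed In Version" for this bug.
FIXED_DATESTAMP = 20260110

# Matches e.g. "amd-gpu-firmware-20260110-1.fc43.noarch". The pattern is an
# assumption derived from the package strings quoted in this thread.
_FIRMWARE_RE = re.compile(r"^(?P<name>[a-z0-9-]*firmware)-(?P<date>\d{8})(?:-|$)")


def firmware_datestamp(package):
    """Return (name, datestamp) for a firmware package string, or None."""
    match = _FIRMWARE_RE.match(package.strip())
    if match is None:
        return None
    return match.group("name"), int(match.group("date"))


def has_fixed_firmware(package):
    """True if the package is a firmware build at or after FIXED_DATESTAMP."""
    parsed = firmware_datestamp(package)
    return parsed is not None and parsed[1] >= FIXED_DATESTAMP
```

Each line of `rpm -qa` output can be passed to `has_fixed_firmware`; non-firmware packages such as the kernel simply return False. This only checks the installed package version and says nothing about whether the machine has been cold booted since the update.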

Comment 43 Mario Limonciello 2026-01-19 16:04:06 UTC
> But (good news): I'm currently having no more issues. Apparently, the hardware doesn't just need a reset; a complete power-off (cold boot) is also a good idea after applying the patches. The latest firmware with the latest kernel seems to have fixed some bugs after a complete cold boot:

Happy to hear.

> The crashes are happening with rocm 6.4.4 and 7.1.1. Looking at https://rocm.docs.amd.com/en/latest/release/versions.html, 6.4.4 is missing but 6.4.3 was an August 7, 2025 release and 7.1.1 was a Nov 26, 2025 release. Less than _two months old_. Are you saying these are considered "older unsupported versions" on the current kernels being pushed out? That doesn't add up for me, especially considering nvidia's cuda support cycles.

The VGPR size for Strix Halo has been fundamentally wrong for a very long time, and it has been causing instability for a while as well.  It is specific to workloads with context switches.  For example, ComfyUI could reproduce it easily.

We tried a lot of things to fix this issue, but it eventually boiled down to the VGPR size being hardcoded in both userspace and kernel space.  If the two values are wrong or out of sync, things don't work properly.

So we've fixed it in the kernel to use the correct size:
https://github.com/gregkh/linux/commit/7445db6a7d5a0242d8214582b480600b266cba9e

We've also added support to export that size to userspace so that it doesn't need to be hardcoded in userspace anymore:

https://github.com/gregkh/linux/commit/7445db6a7d5a0242d8214582b480600b266cba9e

TheRock builds use this new interface if it is available, and thus they will "work" with both older and newer kernels, but the fundamental stability issue I mention above still exists.  If the VGPR size is wrong, context switching doesn't work.
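The "use the kernel-reported value if present, otherwise fall back to the built-in one" pattern described above can be sketched roughly as follows. This is only an illustration in Python, not the actual ROCm code: the property name `vgpr_size_placeholder`, the "key value" text format, and the function names are all assumptions, since the real key and interface are not spelled out in this thread.

```python
# Hypothetical property name: the real key exported by the kernel is not
# given in this thread, so treat this string as a placeholder.
VGPR_SIZE_KEY = "vgpr_size_placeholder"


def parse_properties(text):
    """Parse 'key value' lines into a dict of ints, skipping malformed lines."""
    props = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            props[parts[0]] = int(parts[1])
    return props


def pick_vgpr_size(properties_text, hardcoded_default):
    """Prefer the kernel-reported size; fall back to the built-in default.

    Returns (size, source) where source is "kernel" or "hardcoded".
    """
    props = parse_properties(properties_text)
    if VGPR_SIZE_KEY in props:
        return props[VGPR_SIZE_KEY], "kernel"
    # Older kernel: nothing is exported, so the userspace default is used and
    # has to match whatever the kernel itself assumes.
    return hardcoded_default, "hardcoded"
```

The fallback branch is where the mismatch can come from: when nothing is exported, userspace and kernel each rely on their own hardcoded number, and nothing checks that the two agree.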

> To be fair, I also noticed that Fedora is NOT officially supported at https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html.

The compatibility matrix reflects what is tested and what AMD officially supports.  But AMD does also work on native packaging in distros; these just don't get official support.

FWIW the Fedora ROCm packages ARE picking up the VGPR size patch.

> But vendor-supported reliability is important, so we can experiment on the layers above.

I wish we could have fixed this 6 months ago.  The biggest challenge is that using a debugger like rocgdb ALSO causes a context switch.  So, this required some even lower-level tools to identify the mismatch.

Using a container built from the older series branch is totally fine, just pick up that patch and add it while building.  It's literally a one-line change to take the correct VGPR size.

> kyuz0 has really made very helpful Strix Halo toolboxes https://github.com/kyuz0/amd-strix-halo-toolboxes

These are phenomenal and I am really glad they make ROCm more accessible.

But I do want to say: the Fedora bug tracker is not a discussion forum.  If you want to keep talking about this, we should move the conversation somewhere else.

I would love any creative ideas that would allow us to let this work in more combinations if you have them.

Feel free to tag me somewhere else if you want to continue the conversation.

