Bug 2240859 - amdgpu crash: kernel 6.5.x ([drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 39
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: AcceptedBlocker
Duplicates: 2242506
Depends On:
Blocks: F39FinalBlocker
Reported: 2023-09-27 00:35 UTC by Warren Togami
Modified: 2023-10-11 22:40 UTC (History)
CC List: 28 users

Fixed In Version: kernel-6.5.6-300.fc39
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-10-09 22:26:09 UTC
Type: Bug
Embargoed:


Links
System: freedesktop.org Gitlab
ID: drm/amd/issues/2830
Private: 0
Priority: None
Status: opened
Summary: Computer randomly crashes with kernel 6.5 when browsing the web ([drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low...
Last Updated: 2023-09-27 01:47:51 UTC

Description Warren Togami 2023-09-27 00:35:47 UTC
Reproduce Procedure
===================
1. Boot kernel-6.5.x on an affected AMD APU.
2. Run Google Chrome or Chromium.
3. Open maps.google.com.
4. Zoom all the way out. Search for a street address. Pan and zoom the map. Plot a driving route.
5. After one or more tries, amdgpu crashes with the log shown below.

* Other people have reported this crash in Firefox and in other desktop apps such as Discord or Vulkan-enabled games.
* I was not able to reliably reproduce the crash in Firefox myself. The procedure above reproduces it 100% of the time in Google Chrome or Chromium.

Sep 24 21:58:52 thinkpad kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=1817, emitted seq=1819
Sep 24 21:58:52 thinkpad kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process chromium-browse pid 4219 thread chromium-b:cs0 pid 4293
Sep 24 21:58:52 thinkpad kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
Sep 24 21:58:53 thinkpad kernel: amdgpu 0000:05:00.0: amdgpu: MODE2 reset
Sep 24 21:58:53 thinkpad kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset succeeded, trying to resume
Sep 24 21:58:53 thinkpad kernel: [drm] PCIE GART of 1024M enabled.
Sep 24 21:58:53 thinkpad kernel: [drm] PTB located at 0x000000F43FC00000
Sep 24 21:58:53 thinkpad kernel: [drm] PSP is resuming...
Sep 24 21:58:53 thinkpad kernel: [drm] reserve 0x400000 from 0xf43f800000 for PSP TMR
Sep 24 21:58:54 thinkpad kernel: amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
Sep 24 21:58:54 thinkpad kernel: amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
Sep 24 21:58:54 thinkpad kernel: amdgpu 0000:05:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Sep 24 21:58:54 thinkpad kernel: amdgpu 0000:05:00.0: amdgpu: SMU is resuming...
Sep 24 21:58:54 thinkpad kernel: amdgpu 0000:05:00.0: amdgpu: SMU is resumed successfully!

CONFIRMED WORKING KERNEL VERSIONS
=================================
kernel-6.4.x
kernel-6.6-rc2 and later: the bug was fixed somewhere between 6.5 and 6.6-rc2, but the fixing commit has not yet been identified for backport

CONFIRMED BROKEN KERNEL VERSIONS
================================
kernel-6.5.0
kernel-6.5.1
kernel-6.5.2
kernel-6.5.3
kernel-6.5.4
kernel-6.5.5
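
A rough way to check whether a given kernel release string falls inside the range listed above is sketched below. The lower bound comes from this list and the upper bound from the "Fixed In Version" field of this report (kernel-6.5.6-300.fc39). The helper names are hypothetical. The check looks only at the upstream version number, so it cannot tell that a patched build such as a 6.5.5 scratch build carries the fix, and it says nothing about 6.6 release candidates earlier than rc2.

```python
import re

FIRST_BROKEN = (6, 5, 0)  # earliest kernel listed as broken in this report
FIRST_FIXED = (6, 5, 6)   # from "Fixed In Version: kernel-6.5.6-300.fc39"

def kernel_version(release):
    """Extract (major, minor, patch) from e.g. '6.5.5-200.fc38.x86_64'."""
    m = re.match(r"(?:kernel-)?(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing patch level (e.g. '6.5') is treated as 0.
    return int(m[1]), int(m[2]), int(m[3] or 0)

def in_reported_broken_range(release):
    """True if the upstream version lies in the 6.5.0 to 6.5.5 range."""
    return FIRST_BROKEN <= kernel_version(release) < FIRST_FIXED

print(in_reported_broken_range("6.5.5-200.fc38.x86_64"))  # True
print(in_reported_broken_range("kernel-6.5.6-300.fc39"))  # False
```

On a running system the release string could come from `platform.release()`.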

Confirmed Crashing AMD APUs
===========================
Ryzen 5 3550H https://gitlab.freedesktop.org/drm/amd/-/issues/2830#note_2102054
Ryzen 7 3700U https://fedorapeople.org/~wtogami/a/2023/kernel-6.5.4-amdgpu-crash-ryzen-3700U.log
Ryzen 5 4600GE
Ryzen 7 5850U https://fedorapeople.org/~wtogami/a/2023/kernel-6.5.5-200.fc38-amdgpu-crash.log

Confirmed Unaffected AMD APUs
=============================
Ryzen 5 5650GE (unclear why; it is a Cezanne part similar to the 5850U)
Ryzen 7 6850U

Xorg Possible Workaround?
=========================
* In my experience, amdgpu does not seem to crash under GNOME on Xorg, while it crashes consistently under GNOME on Wayland. Setting chromium > about:flags > Ozone X11 mode makes the browser use Xwayland instead of native Wayland, but Chromium under Xwayland is just as crashy as native Wayland Chromium. The same browser in an Xorg session seems to work.
* The 3550H user says it crashes for him in Xorg too, so this may not be a real workaround.

Upstream ticket: https://gitlab.freedesktop.org/drm/amd/-/issues/2830

Comment 1 Neal Gompa 2023-09-27 01:48:20 UTC
This affects 39 too, and for procedural reasons, I'm shifting it there.

Comment 2 Fedora Blocker Bugs Application 2023-09-27 01:52:22 UTC
Proposed as a Blocker for 39-final by Fedora user ngompa using the blocker tracking app because:

 This violates the criterion for default application functionality, as usage of preloaded applications using GPU functionality can cause graphical system freezes and crashes, leading to unrecoverable situations.

Comment 3 Neal Gompa 2023-09-29 15:03:32 UTC
I've tested this scratch build from Justin Forbes: https://koji.fedoraproject.org/koji/taskinfo?taskID=106879719

So far, things have been good and I have not experienced any crashes playing games, video calls, or anything else.

Operating System: Fedora Linux 39
KDE Plasma Version: 5.27.8
KDE Frameworks Version: 5.109.0
Qt Version: 5.15.10
Kernel Version: 6.5.5-301.fc39.x86_64 (64-bit)
Graphics Platform: Wayland
Processors: 8 × AMD Ryzen 5 3550H with Radeon Vega Mobile Gfx
Memory: 13.3 GiB of RAM
Graphics Processor: AMD Radeon Vega 8 Graphics
Manufacturer: BESSTAR TECH LIMITED
Product Name: DMAF5
System Version: V1.0

Tested apps: Firefox, Chrome, Discord
Tested games: Sonic Origins, and Sonic Adventure 2

Comment 4 richou672005 2023-09-29 16:55:19 UTC
Affected too, on Fedora 38 with a Ryzen 7 5700U: every Vulkan game crashes on 6.5.x kernels. 6.6 not tested.

Comment 5 Dylan Soesman 2023-10-02 03:25:56 UTC
Unaffected on Fedora 39 with a Ryzen 7 7840U and Radeon 780M.
 
ThinkPad P14s Gen 4 
OS: Fedora release 39 (Thirty Nine) x86_64 
Kernel: 6.5.5-300.fc39.x86_64 
DE: Plasma 5.27.8 
WM: kwin 
CPU: AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics (16) @ 5.289GHz 
GPU: AMD ATI 64:00.0 Phoenix1 
Memory: 32 GiB of RAM (including video memory)

Comment 6 Kamil Páral 2023-10-02 12:21:02 UTC
Accepted as F39 Final blocker in https://pagure.io/fedora-qa/blocker-review/issue/1348

Comment 8 Justin M. Forbes 2023-10-03 17:03:30 UTC
The fix has also been in the Fedora 6.5 tree for a few days; I was waiting for 6.5.6 to do the build:

https://gitlab.com/cki-project/kernel-ark/-/commit/afdab9b20ab7455f752527125b57c92d24601c6e

The scratch build Neal linked above has it included.

Comment 9 Florian Apolloner 2023-10-04 07:34:46 UTC
I am seeing the same or a similar issue, but with a gfx_high ring timeout. Is this the same bug, or should I open a new one?
Okt 04 09:27:04 apollo13 kernel: [drm] ring 0 timeout to preempt ib
Okt 04 09:27:14 apollo13 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_high timeout, signaled seq=69159, emitted seq=69161
Okt 04 09:27:14 apollo13 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 3036 thread gnome-shel:cs0 pid 3105
Okt 04 09:27:14 apollo13 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
Okt 04 09:27:14 apollo13 kernel: amdgpu 0000:07:00.0: amdgpu: MODE2 reset
Okt 04 09:27:14 apollo13 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
Okt 04 09:27:14 apollo13 kernel: [drm] PCIE GART of 1024M enabled.
Okt 04 09:27:14 apollo13 kernel: [drm] PTB located at 0x000000F43FC00000
Okt 04 09:27:14 apollo13 kernel: [drm] PSP is resuming...
Okt 04 09:27:15 apollo13 kernel: [drm] reserve 0x400000 from 0xf43f400000 for PSP TMR
Okt 04 09:27:15 apollo13 kernel: amdgpu 0000:07:00.0: amdgpu: RAS: optional ras ta ucode is not available
Okt 04 09:27:15 apollo13 kernel: amdgpu 0000:07:00.0: amdgpu: RAP: optional rap ta ucode is not available
Okt 04 09:27:15 apollo13 kernel: amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Okt 04 09:27:15 apollo13 kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resuming...
Okt 04 09:27:15 apollo13 kernel: amdgpu 0000:07:00.0: amdgpu: SMU is resumed successfully!
Okt 04 09:27:15 apollo13 kernel: [drm] DMUB hardware initialized: version=0x01010027
Okt 04 09:27:16 apollo13 kernel: [drm] kiq ring mec 2 pipe 1 q 0
Okt 04 09:27:16 apollo13 kernel: amdgpu 0000:07:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
Okt 04 09:27:16 apollo13 kernel: [drm:amdgpu_gfx_enable_kcq [amdgpu]] *ERROR* KCQ enable failed
Okt 04 09:27:16 apollo13 kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_0> failed -110
Okt 04 09:27:16 apollo13 kernel: [drm] Skip scheduling IBs!
Okt 04 09:27:16 apollo13 kernel: [drm] Skip scheduling IBs!
Okt 04 09:27:16 apollo13 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset(2) failed
Okt 04 09:27:16 apollo13 kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset end with ret = -110
Okt 04 09:27:17 apollo13 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -110
Okt 04 09:27:17 apollo13 firefox.desktop[4167]: amdgpu: amdgpu_cs_query_fence_status failed.
Okt 04 09:27:17 apollo13 firefox.desktop[4167]: Crash Annotation GraphicsCriticalError: |[0][GFX1-]: GFX: RenderThread detected a device reset in PostUpdate (t=3350.44) [GFX1-]: GFX: RenderThread detected a device reset in PostUpdate
Okt 04 09:27:17 apollo13 gnome-shell[3036]: amdgpu: amdgpu_cs_query_fence_status failed.
Okt 04 09:27:17 apollo13 kernel: amdgpu_cs_ioctl: 46 callbacks suppressed
Okt 04 09:27:17 apollo13 kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
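
Unlike the log in the description, this one ends with the GPU reset itself failing. Below is a small sketch of telling the two cases apart from the "GPU reset end with ret" line. The function name is made up for illustration. The errno names are the standard Linux values for the two codes seen here (110 and 125). Treating `ret = 0` as a successful recovery is an assumption, since no such line appears in this report.

```python
import re

# Linux errno names for the two negative return codes seen in the log above.
LINUX_ERRNO_NAMES = {110: "ETIMEDOUT", 125: "ECANCELED"}

RESET_END_RE = re.compile(r"GPU reset end with ret = (-?\d+)")

def reset_outcome(lines):
    """Classify a GPU reset as 'recovered', 'failed (...)' or 'unknown'."""
    for line in lines:
        m = RESET_END_RE.search(line)
        if m:
            ret = int(m.group(1))
            if ret == 0:  # assumed success form, not shown in this report
                return "recovered"
            name = LINUX_ERRNO_NAMES.get(-ret, "unknown errno")
            return f"failed ({ret}, {name})"
    # No final status line found, e.g. because the log excerpt was cut short.
    return "unknown"

log = [
    "amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume",
    "amdgpu 0000:07:00.0: amdgpu: GPU reset(2) failed",
    "amdgpu 0000:07:00.0: amdgpu: GPU reset end with ret = -110",
]
print(reset_outcome(log))  # failed (-110, ETIMEDOUT)
```

Note that "GPU reset succeeded, trying to resume" appears in both logs, so that line alone does not say whether recovery completed.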

Comment 10 Adam Williamson 2023-10-04 15:36:33 UTC
Hum, good question. I guess the easiest way to test would be to try the scratch build Neal linked. If that works, I guess it *was* the same problem. If not, new bug.

Comment 11 Florian Apolloner 2023-10-05 06:48:08 UTC
The scratch build seems to help. I can't say for sure yet, since I installed it after the first crash and so don't know how frequently it would have crashed, but let's see.

Comment 12 Justin M. Forbes 2023-10-06 16:50:25 UTC
*** Bug 2242506 has been marked as a duplicate of this bug. ***

Comment 13 Fedora Update System 2023-10-06 22:05:53 UTC
FEDORA-2023-830d9ec624 has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2023-830d9ec624

Comment 14 Fedora Update System 2023-10-06 22:06:18 UTC
FEDORA-2023-50bd7c9c12 has been submitted as an update to Fedora 37. https://bodhi.fedoraproject.org/updates/FEDORA-2023-50bd7c9c12

Comment 15 Fedora Update System 2023-10-06 22:09:09 UTC
FEDORA-2023-c3bb819677 has been submitted as an update to Fedora 39. https://bodhi.fedoraproject.org/updates/FEDORA-2023-c3bb819677

Comment 16 Adam Williamson 2023-10-06 23:27:30 UTC
I dropped the association of the F37 and F38 updates with this bug, as this bug is an F39 release blocker, so we do not want it being closed by an F37 or F38 update being pushed stable. Now only the F39 update going stable will close this report.

Comment 17 Fedora Update System 2023-10-07 02:33:19 UTC
FEDORA-2023-c3bb819677 has been pushed to the Fedora 39 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2023-c3bb819677`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2023-c3bb819677

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 18 Fedora Update System 2023-10-09 22:26:09 UTC
FEDORA-2023-c3bb819677 has been pushed to the Fedora 39 stable repository.
If the problem still persists, please make note of it in this bug report.

Comment 19 Maksym Putkaradze 2023-10-11 21:38:22 UTC
Swapping mesa-va-drivers-freeworld with mesa-va-drivers fixed this issue.

ThinkPad P14s Gen 2
OS: Fedora release 39 (Thirty Nine) x86_64 
Kernel: 6.5.6-300.fc39.x86_64 
DE: GNOME 45 
WM: mutter
CPU: AMD Ryzen 5 PRO 5650U
GPU: AMD ATI Cezanne
Memory: 32 GiB of RAM (including video memory)

Comment 20 Adam Williamson 2023-10-11 22:40:39 UTC
That would point to https://fosstodon.org/@knurd42@social.linux.pizza/111215664021438216 , a slightly different bug.

