1966384 – amdgpu: regression: daily display lockup requires hard reset

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	33
Hardware:	x86_64
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-06-01 03:40 UTC by Dimitris
Modified:	2021-11-30 19:09 UTC
CC List:	21 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2021-11-30 19:09:46 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
kernel oops at time of hard lockup (3.04 KB, text/plain) 2021-06-01 03:40 UTC, Dimitris
Another instance a few minutes ago (12.51 KB, text/plain) 2021-06-07 22:59 UTC, Dimitris
journalctl after screen freeze happened (17.10 KB, text/plain) 2021-07-02 10:06 UTC, Norbert Jurkeit

Description Dimitris 2021-06-01 03:40:17 UTC

Created attachment 1788410 [details]
kernel oops at time of hard lockup

1. Please describe the problem:

On a roughly daily basis, sometimes more often, the display locks up and the machine becomes unresponsive. The only way to recover is to hold down the power button.

This is on a ThinkPad T495, "AMD Ryzen 7 PRO 3700U w/ Radeon Vega Mobile Gfx"

2. What is the Version-Release number of the kernel:

5.12.8-200.fc33.x86_64, but it may have started with 5.12.7.  Definitely not seen with the 5.11 series.

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

Yes. Before this kernel series the lockups were extremely sporadic (about once a month or even less), nowhere near the current frequency.


4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

It seems to happen when I'm on a Firefox page with a certain kind of animation, specifically the GitHub "spinner" icons that indicate running GitHub Actions.

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Haven't been able to test yet.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Comment 1 Dimitris 2021-06-01 23:46:30 UTC

Also found this in the logs:

May 31 20:08:15 angua firefox-wayland.desktop[6063]: amdgpu: amdgpu_cs_query_fence_status failed.

Comment 2 billgrzanich 2021-06-02 01:07:05 UTC

I believe I'm also seeing this problem with my Lenovo E585, AMD Ryzen 7 2700U with Radeon Vega Mobile Gfx, running:
Linux  5.12.7-300.fc34.x86_64 #1 SMP Wed May 26 12:58:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

I'm running the default Gnome Shell and Wayland.

The problem began after the upgrade to Fedora 34, perhaps only in the past week or so. I see that Arch users are also experiencing similar problems:
https://bbs.archlinux.org/viewtopic.php?id=266358

They seem to think it's mesa or linux-firmware related.  In my case, I have:
linux-firmware.noarch                             20210511-120.fc34                    @updates                  
linux-firmware-whence.noarch                      20210511-120.fc34       
mesa-dri-drivers.i686                             21.1.1-1.fc34                        @updates                  
mesa-dri-drivers.x86_64                           21.1.1-1.fc34                        @updates                  
mesa-filesystem.i686                              21.1.1-1.fc34                        @updates                  
mesa-filesystem.x86_64                            21.1.1-1.fc34                        @updates                  
mesa-libEGL.x86_64                                21.1.1-1.fc34                        @updates                  
mesa-libGL.i686                                   21.1.1-1.fc34                        @updates                  
mesa-libGL.x86_64                                 21.1.1-1.fc34                        @updates                  
mesa-libgbm.x86_64                                21.1.1-1.fc34                        @updates                  
mesa-libglapi.i686                                21.1.1-1.fc34                        @updates                  
mesa-libglapi.x86_64                              21.1.1-1.fc34                        @updates                  
mesa-libxatracker.x86_64                          21.1.1-1.fc34                        @updates                  
mesa-vulkan-drivers.i686                          21.1.1-1.fc34                        @updates                  
mesa-vulkan-drivers.x86_64                        21.1.1-1.fc34                        @updates   

The most recent instance occurred when I moved the mouse cursor after leaving the machine idle for many minutes, perhaps 30.  The screen froze, then went black and I was forced to power down and reboot.  On previous occasions, the lockup was preceded by severe performance degradation and corrupted screen image.  This has happened several times today alone. 

The most recent log contains the following (newest entries first):

19:00:18 kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
19:00:18 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
19:00:08 kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset end with ret = -110
19:00:08 kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_0> failed -110
19:00:08 kernel: amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
19:00:08 kernel: [drm] kiq ring mec 2 pipe 1 q 0
19:00:07 kernel: amdgpu 0000:05:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
19:00:07 kernel: [drm] reserve 0x400000 from 0xf40fc00000 for PSP TMR
19:00:07 kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset succeeded, trying to resume
19:00:07 kernel: [drm] free PSP TMR buffer
19:00:07 kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=05:00.0 domain=0x0000 address=0x10dc40000 flags=0x0070]
19:00:07 kernel: amd_iommu_report_page_fault: 21 callbacks suppressed
19:00:07 kernel: amdgpu 0000:05:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x10dc40000 flags=0x0070]
19:00:07 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 3042 thread firefox:cs0 pid 3115
19:00:07 kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x5
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00641051
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800110c07000 from client 27
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:6 pasid:32774, for process firefox pid 3042 thread firefox:cs0 pid 3115)
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x5
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00641051
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800110c09000 from client 27
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:6 pasid:32774, for process firefox pid 3042 thread firefox:cs0 pid 3115)
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x5
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00641051
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800110c04000 from client 27
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:6 pasid:32774, for process firefox pid 3042 thread firefox:cs0 pid 3115)
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x5
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00641051
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800110c05000 from client 27
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:6 pasid:32774, for process firefox pid 3042 thread firefox:cs0 pid 3115)
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x5
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00641051
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800110c06000 from client 27
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:6 pasid:32774, for process firefox pid 3042 thread firefox:cs0 pid 3115)
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x5
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00641051
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800110c02000 from client 27
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:6 pasid:32774, for process firefox pid 3042 thread firefox:cs0 pid 3115)
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x5
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00641051
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800110c08000 from client 27
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:6 pasid:32774, for process firefox pid 3042 thread firefox:cs0 pid 3115)
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x5
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00641051
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800110c03000 from client 27
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:6 pasid:32774, for process firefox pid 3042 thread firefox:cs0 pid 3115)
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x5
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00641051
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800110c00000 from client 27
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:6 pasid:32774, for process firefox pid 3042 thread firefox:cs0 pid 3115)
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 RW: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x5
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 WALKER_ERROR: 0x0
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 MORE_FAULTS: 0x1
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00641051
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800110c01000 from client 27
18:59:57 kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:6 pasid:32774, for process firefox pid 3042 thread firefox:cs0 pid 3115)
18:59:57 kernel: gmc_v9_0_process_interrupt: 105 callbacks suppressed
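
For anyone comparing these entries with their own logs, the decoded fields can be recomputed from the raw VM_L2_PROTECTION_FAULT_STATUS value. This is a minimal Python sketch, not driver code. The bit positions are an assumption: they follow the gfx9 register layout as commonly described and have only been cross-checked against the values printed in this thread (0x00641051 yields PERMISSION_FAULTS 0x5, client ID 0x8, RW 0x1, and bits 20-23 yield 6, matching vmid:6). The names decode_fault_status, FIELDS, CID and VMID are hypothetical.

```python
# Sketch: split a VM_L2_PROTECTION_FAULT_STATUS value into bit fields.
# ASSUMPTION: the (shift, width) pairs below are not taken from the driver
# source; they were only checked against values printed in this thread.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),    # printed by the kernel as "Faulty UTCL2 client ID"
    "RW": (18, 1),
    "VMID": (20, 4),  # hypothetical field name; matches "vmid:N" in the log
}


def decode_fault_status(status: int) -> dict:
    """Return {field name: value} for one raw status word."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Both raw values appear in log excerpts posted to this bug.
    for raw in (0x00641051, 0x00101031):
        decoded = decode_fault_status(raw)
        print(f"{raw:#010x}", {k: hex(v) for k, v in decoded.items()})
```

If the decoded values disagree with what the kernel printed next to the same status word, the assumed layout does not apply to that hardware and the sketch should not be trusted there.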

Comment 3 Dimitris 2021-06-07 22:59:55 UTC

Created attachment 1789296 [details]
Another instance a few minutes ago

Just captured one more instance, see attachment.  Looks very similar to the one by billgrzanich.

Comment 4 Dimitris 2021-06-08 21:05:41 UTC

This kernel.org bug seems related and has a possible workaround: https://bugzilla.kernel.org/show_bug.cgi?id=211157

Comment 5 Dimitris 2021-06-11 16:45:26 UTC

FWIW the workaround reported there doesn't work.  I added this line to the TLP conf:

RUNTIME_PM_DRIVER_BLACKLIST="mei_me nouveau nvidia pcieport radeon"

but I'm still experiencing this. Kind of makes sense: I don't see why or how *enabling* PM on the driver would have improved stability.
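
To double-check which drivers a setting like the one above actually names, the line can be parsed mechanically. This is a small Python sketch under stated assumptions: it assumes the TLP configuration uses shell-style KEY="value" syntax and that drivers on this list are the ones TLP excludes from runtime power management. The helper name blacklisted_drivers is hypothetical.

```python
import shlex


def blacklisted_drivers(conf_line: str) -> set:
    """Return the driver names in a RUNTIME_PM_DRIVER_BLACKLIST=... line.

    Returns an empty set for any other line.
    """
    key, _, value = conf_line.strip().partition("=")
    if key != "RUNTIME_PM_DRIVER_BLACKLIST":
        return set()
    # shlex.split removes the shell-style quotes; the quoted value comes
    # back as a single token, so split it again on whitespace.
    return {name for token in shlex.split(value) for name in token.split()}


if __name__ == "__main__":
    # The exact line quoted in this comment.
    line = 'RUNTIME_PM_DRIVER_BLACKLIST="mei_me nouveau nvidia pcieport radeon"'
    drivers = blacklisted_drivers(line)
    print(sorted(drivers))
    print("amdgpu listed:", "amdgpu" in drivers)
```

With the quoted line, amdgpu is not on the list, which is consistent with the remark that the workaround amounts to enabling PM for that driver.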

Comment 6 Dimitris 2021-06-11 23:21:24 UTC

"Better", identical really, including machine type (ThinkPad T495) upstream bug here: https://bugzilla.kernel.org/show_bug.cgi?id=213391

Comment 7 Aaron Sowry 2021-06-13 21:31:09 UTC

Same issue on F34, under swaywm, on a ThinkPad X395:

$ rpm -qa | grep mesa
mesa-libGLU-9.0.1-4.fc34.x86_64
mesa-libglapi-21.1.1-2.fc34.x86_64
mesa-libgbm-21.1.1-2.fc34.x86_64
mesa-filesystem-21.1.1-2.fc34.x86_64
mesa-dri-drivers-21.1.1-2.fc34.x86_64
mesa-libEGL-21.1.1-2.fc34.x86_64
mesa-libGL-21.1.1-2.fc34.x86_64
mesa-libxatracker-21.1.1-2.fc34.x86_64
mesa-vulkan-drivers-21.1.1-2.fc34.x86_64

$ rpm -qa | grep linux-firmware
linux-firmware-whence-20210511-120.fc34.noarch
linux-firmware-20210511-120.fc34.noarch

$ glxinfo
...
    Vendor: AMD (0x1002)
    Device: AMD Radeon(TM) Vega 10 Graphics (RAVEN, DRM 3.40.0, 5.12.9-300.fc34.x86_64, LLVM 12.0.0) (0x15d8)
    Version: 21.1.1
    Accelerated: yes
    Video memory: 2048MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
...

$ journalctl -b-1
...
kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=36189, emitted seq=36190
kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:67:crtc-0] flip_done timed out

$ journalctl -b-4
...
kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32769, for process sway pid 1913 thread sway:cs0 pid 1917)
kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800104a00000 from client 27
kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00101031
kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x0
kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32769, for process sway pid 1913 thread sway:cs0 pid 1917)
kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address 0x800104a01000 from client 27
kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00101031
kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x0
...

$ journalctl -b-5
...
kernel: amdgpu_cs_ioctl: 5 callbacks suppressed
kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
...

$ journalctl -b-9
...
kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:67:crtc-0] flip_done timed out
kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CONNECTOR:78:eDP-1] flip_done timed out
kernel: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:55:plane-3] flip_done timed out
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 4 PID: 243367 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:7960 amdgpu_dm_atomic_commit_tail+0x2529/0x25a0 [amdgpu]
kernel: Modules linked in: uas usb_storage uinput rfcomm snd_seq_dummy snd_hrtimer xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp bridge stp llc ccm cmac uv>
kernel:  kvm snd_seq snd_seq_device irqbypass iwlwifi rapl snd_pcm squashfs joydev pcspkr loop snd_rn_pci_acp3x wmi_bmof k10temp cfg80211 i2c_piix4 snd_pci_acp3x thinkpad_acpi snd_timer pla>
kernel: CPU: 4 PID: 243367 Comm: kworker/4:2 Tainted: G        W         5.12.9-300.fc34.x86_64 #1
kernel: Hardware name: LENOVO 20NM000FAU/20NM000FAU, BIOS R13ET49W(1.23 ) 11/24/2020
kernel: Workqueue: events console_callback
kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x2529/0x25a0 [amdgpu]
kernel: Code: b8 fd ff ff 01 c7 85 b4 fd ff ff 37 00 00 00 c7 85 bc fd ff ff 20 00 00 00 e8 83 94 12 00 e9 08 fb ff ff 0f 0b e9 33 f9 ff ff <0f> 0b e9 a5 f9 ff ff 0f 0b 0f 0b e9 bc f9 ff ff>
kernel: RSP: 0018:ffffb2834b0bf8b8 EFLAGS: 00010002
kernel: RAX: 0000000000000002 RBX: 0000000000003f1d RCX: ffff894a4e92c918
kernel: RDX: 0000000000000001 RSI: 0000000000000297 RDI: ffff894a4eb80178
kernel: RBP: ffffb2834b0bfba0 R08: ffffb2834b0bf80c R09: 0000000000000000
kernel: R10: ffffb2834b0bf838 R11: ffffb2834b0bf83c R12: 0000000000000206
kernel: R13: ffff894a4e92c800 R14: ffff894a73d12600 R15: ffff894b5709b100
kernel: FS:  0000000000000000(0000) GS:ffff894cf0b00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000561da559d188 CR3: 00000001238ba000 CR4: 00000000003506e0
kernel: Call Trace:
kernel:  commit_tail+0x94/0x120 [drm_kms_helper]
kernel:  drm_atomic_helper_commit+0x113/0x140 [drm_kms_helper]
kernel:  drm_client_modeset_commit_atomic+0x1c4/0x200 [drm]
kernel:  drm_client_modeset_commit_locked+0x56/0x150 [drm]
kernel:  drm_fb_helper_pan_display+0xdc/0x210 [drm_kms_helper]
kernel:  fb_pan_display+0x83/0x100
kernel:  bit_update_start+0x1a/0x40
kernel:  fbcon_switch+0x31d/0x4c0
kernel:  redraw_screen+0xd7/0x210
kernel:  ? fbcon_cursor+0x109/0x130
kernel:  complete_change_console+0x3a/0x120
kernel:  console_callback+0x14b/0x150
kernel:  ? __cond_resched+0x16/0x40
kernel:  process_one_work+0x1ec/0x380
kernel:  worker_thread+0x53/0x3e0
kernel:  ? process_one_work+0x380/0x380
kernel:  kthread+0x11b/0x140
kernel:  ? kthread_associate_blkcg+0xa0/0xa0
kernel:  ret_from_fork+0x22/0x30
kernel: ---[ end trace ae9524d7c29cb9eb ]---
...
kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:67:crtc-0] flip_done timed out
...

As you can see, I get a fun variety of different error messages in my logs, but the same result each time - a frozen or black screen which requires either a hard power down, or SSHing into the laptop to try and reboot it.
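
Because the failure shows up under several different messages, it can help to tally which signatures a given boot contains before comparing reports. The following Python sketch counts lines by substring. The substrings are copied from the excerpts in this thread, while the labels, the function name count_signatures and the sample list are illustrative assumptions, not an official taxonomy of amdgpu errors.

```python
from collections import Counter

# Substrings copied from log excerpts in this thread; labels are arbitrary.
SIGNATURES = {
    "flip_done": "flip_done timed out",
    "fence_timeout": "Waiting for fences timed out",
    "page_fault": "retry page fault",
    "ring_timeout": "timeout, signaled seq",
    "gpu_reset": "GPU reset begin",
    "cs_parser": "Failed to initialize parser",
}


def count_signatures(lines) -> Counter:
    """Count how many lines contain each known signature substring."""
    counts = Counter()
    for line in lines:
        for label, needle in SIGNATURES.items():
            if needle in line:
                counts[label] += 1
    return counts


# A few lines taken from the excerpts above, used as a self-test sample.
SAMPLE = [
    "kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!",
    "kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=36189, emitted seq=36190",
    "kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset begin!",
    "kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:67:crtc-0] flip_done timed out",
]

if __name__ == "__main__":
    for label, n in count_signatures(SAMPLE).most_common():
        print(f"{n:4d}  {label}")
```

To apply it to a real boot, save the kernel log first (for example with ``journalctl --no-hostname -k -b -1 > dmesg.txt``) and pass the opened file to count_signatures in place of SAMPLE.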

Comment 8 Aaron Sowry 2021-06-18 01:41:40 UTC

After downgrading linux-firmware and running an older kernel, I've not had a crash now for about a week. The versions I am running are:

kernel-5.11.12-300.fc34.x86_64
linux-firmware-20210315-119.fc34.noarch

Comment 9 Norbert Jurkeit 2021-07-02 10:06:07 UTC

Created attachment 1797116 [details]
journalctl after screen freeze happened

Same issue here with the latest Fedora 34 kernels on a desktop PC with Ryzen 3 3200G using integrated graphics, although not that often, perhaps once a week. For no obvious reason the screen freezes (although mouse pointer still moves) and sometimes the screen goes black.

The last occurrence, captured in the enclosed attachment, happened during a banking session in Firefox on Xfce with kernel 5.12.13-300.fc34.x86_64.

No stability issues encountered before with this PC since I got it a year ago and installed Fedora 32.

Comment 10 Brian Lane 2021-07-08 13:47:56 UTC

Same issue here with a ThinkPad E595 running Fedora 33 with 5.12.12-200.fc33.x86_64.

I'll try downgrading the kernel and firmware and see if it improves things.

Comment 11 Norbert Jurkeit 2021-09-14 08:42:22 UTC

(In reply to Norbert Jurkeit from comment #9)

The issue seems to be related to firmware rather than the kernel, at least in my case with Picasso hardware. It started around the time linux-firmware-20210511-120.fc34 became available and has not occurred since I upgraded to linux-firmware-20210818-122.fc34 three weeks ago; the latter reverted some amdgpu files to those of linux-firmware-20210315-119.fc34.
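
A quick way to compare an installed package against the builds named in this bug is to extract the date-release part of the package name. The Python sketch below does only that. The labels in REPORTED summarize anecdotal reports from this thread and are an assumption, not an authoritative good/bad list; other comments in this thread report freezes with 20210818-122 as well. The function names are hypothetical.

```python
import re

# Builds named in this thread and how reporters described them.
# ASSUMPTION: anecdotal only; experiences differ between reporters.
REPORTED = {
    "20210315-119": "reported stable",
    "20210511-120": "suspect",
    "20210818-122": "reported improved",
}


def firmware_build(package: str):
    """Extract 'YYYYMMDD-REL' from a name such as
    'linux-firmware-20210511-120.fc34.noarch'; return None if absent."""
    match = re.search(r"linux-firmware-(\d{8}-\d+)\.fc\d+", package)
    return match.group(1) if match else None


def classify(package: str) -> str:
    """Map a package name to the label reporters gave that build."""
    return REPORTED.get(firmware_build(package), "not mentioned in this thread")


if __name__ == "__main__":
    # Package names as they would appear in `rpm -q linux-firmware` output.
    for pkg in (
        "linux-firmware-20210315-119.fc34.noarch",
        "linux-firmware-20210511-120.fc34.noarch",
        "linux-firmware-20210818-122.fc34.noarch",
    ):
        print(pkg, "->", classify(pkg))
```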

Comment 12 billgrzanich 2021-09-21 02:27:06 UTC

I'm reluctant to say it's fixed, but I can say that I have not experienced the problem in several days, perhaps even since the firmware package update that Norbert describes in comment 11. Fingers crossed.

Comment 13 Aaron Sowry 2021-09-21 02:47:19 UTC

(In reply to billgrzanich from comment #12)
> I'm reluctant to say it's fixed, I can say that I have not experienced the
> problem in several days, perhaps even since the  firmware package update
> that Norbert describes in comment 11.  Fingers crossed.

I was just about to uncork the champagne as well, but I'm still seeing intermittent freezes with linux-firmware-20210818-122.fc34.noarch. They are less common, but they don't seem to soft recover like before either.

These crashes are all similar (identical?) to the "flip_done timed out" trace shown in comment #7.

Comment 14 Norbert Jurkeit 2021-09-21 09:27:02 UTC

(In reply to Aaron Sowry from comment #13)
> 
> I was just about to uncork the champagne as well, but I'm still seeing
> intermittent freezes with linux-firmware-20210818-122.fc34.noarch. They are
> less common, but they don't seem to soft recover like before either.
> 
> These crashes are all similar (identical?) to the "flip_done timed out"
> trace shown in comment #7.

With the questionable firmware I only got "VM_L2_PROTECTION_FAULT_STATUS:0x00101031" or "VM_L2_PROTECTION_FAULT_STATUS:0x00141051" in the journal, but nothing with "flip_done timed out", although my graphics hardware looks similar to yours according to glxinfo:

    Vendor: AMD (0x1002)
    Device: AMD Radeon(TM) Vega 8 Graphics (RAVEN, DRM 3.41.0, 5.13.16-200.fc34.x86_64, LLVM 12.0.1) (0x15d8)
    Version: 21.1.8
    Accelerated: yes
    Video memory: 2048MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2

The desktop in use is Xfce, which might also make a difference.

Perhaps it helps to post your comprehensive information from comment #7 to gitlab.freedesktop.org where it can get the attention of upstream maintainers. See e.g. https://gitlab.freedesktop.org/drm/amd/-/issues/1609.

Comment 15 Ben Cotton 2021-11-04 13:43:00 UTC

This message is a reminder that Fedora 33 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 33 on 2021-11-30.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '33'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 33 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 18 Ben Cotton 2021-11-30 19:09:46 UTC

Fedora 33 changed to end-of-life (EOL) status on 2021-11-30. Fedora 33 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.
