Bug 2224121

Summary: Radeon RX6600 GPU hang leading to Xserver crash
Product: [Fedora] Fedora
Reporter: cb-rhbugz
Component: mesa
Assignee: Adam Jackson <ajax>
Status: NEW
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 38
CC: ajax, bskeggs, igor.raits, j, lyude, mail, ofourdan, rhughes, rstrode, tstellar, walter.pete, xgl-maint
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Attachments:
Description                   Flags
full dmesg output             none
Xorg log of crashed session   none

Description cb-rhbugz 2023-07-19 21:02:33 UTC
Every 2-3 days or so, my X server freezes for about a minute and then exits. The freeze seems to be triggered by new output causing scrolling in xterm, although I have occasionally seen it in Ghidra as well.

My hardware is an old i7-3770k with a recently fitted Radeon RX6600, so I don't know if this is a regression. It did not happen with the Intel iGPU.


Reproducible: Sometimes

Steps to Reproduce:
1. Run XFCE desktop with extensive use of xterm (the real old-fashioned X11 xterm)
2. Generally use the desktop for web browsing, YouTube, software development for a couple of days, using suspend-to-RAM overnight
3. Every so often, do something which results in the output scrolling in xterm
Actual Results:  
The X server freezes and, after a minute or so, crashes back to the greeter/user login prompt.

In the frozen state, there is almost always an xterm in the process of scrolling where the new line is a corrupted black+white pattern instead of new text.

The dmesg has a lot of repeated instances of this:
[458041.735598] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32772, for process Xorg pid 390300 thread Xorg:cs0 pid 390320)
[458041.735611] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000800109e12000 from client 0x1b (UTCL2)
[458041.735615] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
[458041.735618] amdgpu 0000:03:00.0: amdgpu:     Faulty UTCL2 client ID: TCP (0x8)
[458041.735621] amdgpu 0000:03:00.0: amdgpu:     MORE_FAULTS: 0x1
[458041.735623] amdgpu 0000:03:00.0: amdgpu:     WALKER_ERROR: 0x0
[458041.735625] amdgpu 0000:03:00.0: amdgpu:     PERMISSION_FAULTS: 0x3
[458041.735627] amdgpu 0000:03:00.0: amdgpu:     MAPPING_ERROR: 0x0
[458041.735629] amdgpu 0000:03:00.0: amdgpu:     RW: 0x0
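
For anyone triaging similar faults, the status word in the log above can be unpacked by hand. The following is a minimal Python sketch, not driver code: the bit layout in FIELDS is an assumption based on the amdgpu GC 10.x register headers and should be checked against the kernel source, and the helper names (decode_status, decode_dmesg) are hypothetical. The layout is at least consistent with this report, since decoding 0x00701031 reproduces the values the kernel printed (client ID 0x8, MORE_FAULTS 0x1, PERMISSION_FAULTS 0x3, vmid 7).

```python
import re

# ASSUMPTION: bit layout of GCVM_L2_PROTECTION_FAULT_STATUS as
# name -> (shift, width). Taken to match the GC 10.x register headers;
# verify against the kernel tree before relying on it.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),       # UTCL2 client ID (0x8 is reported as TCP in the log)
    "RW": (18, 1),
    "VMID": (20, 4),
}

# Matches the status line as it appears in the dmesg excerpt.
STATUS_RE = re.compile(r"GCVM_L2_PROTECTION_FAULT_STATUS:\s*0x([0-9a-fA-F]+)")


def decode_status(value):
    """Split a raw status word into its named bit fields."""
    return {
        name: (value >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


def decode_dmesg(text):
    """Decode every fault status word found in a block of dmesg text."""
    return [decode_status(int(m.group(1), 16)) for m in STATUS_RE.finditer(text)]


if __name__ == "__main__":
    sample = (
        "[458041.735615] amdgpu 0000:03:00.0: amdgpu: "
        "GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031"
    )
    for fields in decode_dmesg(sample):
        for name, val in fields.items():
            print(f"{name}: {val:#x}")
```

Feeding the full dmesg attachment to decode_dmesg would show whether every fault carries the same client ID and permission bits, which may help when comparing against other reports.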


Expected Results:  
The desktop should not crash.

$ uname -a
Linux stando.fishzet.co.uk 6.3.11-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Jul  2 13:17:31 UTC 2023 x86_64 GNU/Linux

$ rpm -qa | grep Xorg
xorg-x11-server-Xorg-1.20.14-23.fc38.x86_64

Comment 1 cb-rhbugz 2023-07-19 21:03:30 UTC
Created attachment 1976606
full dmesg output

Comment 2 Olivier Fourdan 2023-07-20 07:32:08 UTC
Can you please also post the Xorg logs for completeness?

Comment 3 cb-rhbugz 2023-07-20 08:36:21 UTC
Created attachment 1976661
Xorg log of crashed session

Comment 4 Olivier Fourdan 2023-07-20 09:27:35 UTC
To my untrained eyes, the error looks like https://gitlab.freedesktop.org/drm/amd/-/issues/1598

There are a few other similar reports around as well.

Comment 5 Michel Dänzer 2023-07-20 11:04:17 UTC
It's a GPU hang, likely caused by a Mesa issue.

Comment 6 Olivier Fourdan 2023-07-20 11:37:16 UTC
(In reply to Michel Dänzer from comment #5)
> It's a GPU hang, likely caused by a Mesa issue.

Alright, let's move the bug to Mesa then.