Bug 2224121 - Radeon RX6600 GPU hang leading to Xserver crash
Summary: Radeon RX6600 GPU hang leading to Xserver crash
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: mesa
Version: 38
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Adam Jackson
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-07-19 21:02 UTC by cb-rhbugz
Modified: 2023-07-20 11:37 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: ---
Embargoed:


Attachments
full dmesg output (302.91 KB, text/plain)
2023-07-19 21:03 UTC, cb-rhbugz
Xorg log of crashed session (102.06 KB, text/plain)
2023-07-20 08:36 UTC, cb-rhbugz

Description cb-rhbugz 2023-07-19 21:02:33 UTC
Every 2-3 days or so, my X server freezes for about a minute and then exits. It seems to be triggered by new output scrolling in xterm, although I have also occasionally seen it in Ghidra.

My hardware is an old i7-3770k with a recently fitted Radeon RX6600, so I don't know if this is a regression. It did not happen with the Intel iGPU.


Reproducible: Sometimes

Steps to Reproduce:
1. Run the XFCE desktop with extensive use of xterm (the real old-fashioned X11 xterm)
2. Generally use the desktop for web browsing, YouTube, and software development for a couple of days, using suspend-to-RAM overnight
3. Every so often, do something which results in the output scrolling in xterm
Actual Results:  
The X server freezes and after a minute or so, crashes back to the greeter/user login prompt.

In the frozen state, there is almost always an xterm in the process of scrolling where the new line is a corrupted black+white pattern instead of new text.

The dmesg output contains many repeated instances of this:
[458041.735598] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32772, for process Xorg pid 390300 thread Xorg:cs0 pid 390320)
[458041.735611] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000800109e12000 from client 0x1b (UTCL2)
[458041.735615] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
[458041.735618] amdgpu 0000:03:00.0: amdgpu:     Faulty UTCL2 client ID: TCP (0x8)
[458041.735621] amdgpu 0000:03:00.0: amdgpu:     MORE_FAULTS: 0x1
[458041.735623] amdgpu 0000:03:00.0: amdgpu:     WALKER_ERROR: 0x0
[458041.735625] amdgpu 0000:03:00.0: amdgpu:     PERMISSION_FAULTS: 0x3
[458041.735627] amdgpu 0000:03:00.0: amdgpu:     MAPPING_ERROR: 0x0
[458041.735629] amdgpu 0000:03:00.0: amdgpu:     RW: 0x0
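
To see which processes and addresses the repeated faults involve, messages in this format can be grouped mechanically. The following is a minimal Python sketch and not part of the original report: the function name `parse_page_faults`, the regular expressions, and the `SAMPLE` text (copied from the excerpt above) are illustrative assumptions based only on the message layout shown here, and other kernel versions may print these messages differently.

```python
import re
from collections import Counter

# Header line of one fault report, e.g.
# "[458041.735598] amdgpu ...: [gfxhub] page fault (src_id:0 ring:24 ...)"
FAULT_RE = re.compile(
    r"\[\s*(?P<timestamp>\d+\.\d+)\].*?\[(?P<hub>\w+)\] page fault "
    r"\(src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) vmid:(?P<vmid>\d+) "
    r"pasid:(?P<pasid>\d+), for process (?P<process>.+?) pid (?P<pid>\d+) "
    r"thread (?P<thread>.+?) pid (?P<tid>\d+)\)"
)
# "in page starting at address 0x... from client 0x..."
ADDR_RE = re.compile(
    r"in page starting at address (0x[0-9a-fA-F]+) from client (0x[0-9a-fA-F]+)"
)
# Upper-case "NAME: 0x..." detail lines such as PERMISSION_FAULTS or RW
FIELD_RE = re.compile(r"amdgpu:\s+([A-Z][A-Z0-9_]*):\s*(0x[0-9a-fA-F]+)")


def parse_page_faults(dmesg_text):
    """Return one dict per page-fault header found in dmesg_text."""
    faults = []
    current = None
    for line in dmesg_text.splitlines():
        header = FAULT_RE.search(line)
        if header:
            current = header.groupdict()
            current["fields"] = {}
            faults.append(current)
            continue
        if current is None:
            continue  # ignore lines before the first fault header
        addr = ADDR_RE.search(line)
        if addr:
            current["address"] = int(addr.group(1), 16)
            current["client"] = int(addr.group(2), 16)
            continue
        field = FIELD_RE.search(line)
        if field:
            current["fields"][field.group(1)] = int(field.group(2), 16)
    return faults


# Sample text copied from the dmesg excerpt in this report
SAMPLE = """\
[458041.735598] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32772, for process Xorg pid 390300 thread Xorg:cs0 pid 390320)
[458041.735611] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000800109e12000 from client 0x1b (UTCL2)
[458041.735615] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
[458041.735618] amdgpu 0000:03:00.0: amdgpu:     Faulty UTCL2 client ID: TCP (0x8)
[458041.735625] amdgpu 0000:03:00.0: amdgpu:     PERMISSION_FAULTS: 0x3
[458041.735629] amdgpu 0000:03:00.0: amdgpu:     RW: 0x0
"""

faults = parse_page_faults(SAMPLE)
per_process = Counter(f["process"] for f in faults)
for f in faults:
    print(f["timestamp"], f["process"], f["pid"], hex(f["address"]), f["fields"])
```

Run over the full attached dmesg output, the `per_process` counter would show whether every fault is attributed to Xorg or whether other processes are affected as well.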


Expected Results:  
The desktop should not crash.

$ uname -a
Linux stando.fishzet.co.uk 6.3.11-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Jul  2 13:17:31 UTC 2023 x86_64 GNU/Linux

$ rpm -qa | grep Xorg
xorg-x11-server-Xorg-1.20.14-23.fc38.x86_64

Comment 1 cb-rhbugz 2023-07-19 21:03:30 UTC
Created attachment 1976606
full dmesg output

Comment 2 Olivier Fourdan 2023-07-20 07:32:08 UTC
Can you please also post the Xorg logs for completeness?

Comment 3 cb-rhbugz 2023-07-20 08:36:21 UTC
Created attachment 1976661
Xorg log of crashed session

Comment 4 Olivier Fourdan 2023-07-20 09:27:35 UTC
To my untrained eyes, the error looks like https://gitlab.freedesktop.org/drm/amd/-/issues/1598

There are a few other similar reports around as well.

Comment 5 Michel Dänzer 2023-07-20 11:04:17 UTC
It's a GPU hang, likely caused by a Mesa issue.

Comment 6 Olivier Fourdan 2023-07-20 11:37:16 UTC
(In reply to Michel Dänzer from comment #5)
> It's a GPU hang, likely caused by a Mesa issue.

Alright, let's move the bug to Mesa then.

