Bug 1646796 - Frequent amdgpu GPU crashes during desktop usage and gaming
Summary: Frequent amdgpu GPU crashes during desktop usage and gaming
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 29
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-11-06 02:58 UTC by Stewart Smith
Modified: 2019-05-06 09:45 UTC (History)
CC List: 19 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-21 21:06:34 UTC
Type: Bug
Embargoed:


Description Stewart Smith 2018-11-06 02:58:07 UTC
Description of problem:
I am experiencing frequent GPU crashes with amdgpu and a Radeon RX580.

The symptom is that the display stops updating. I can still SSH to the machine and otherwise use it, but all graphical activity stops.

While this *can* (and indeed does, after 20-100 minutes of use) occur during normal desktop activity (using Firefox to read text on web sites, using Shotwell, or a terminal), it has proven most reproducible with the game "Cities: Skylines" when starting a new game with the "Alpine Village" Scenario. In this case, it crashes within the first few frames of gameplay, or best case scenario, a couple of minutes in.

The end result is a system that is near useless as a desktop computer, as you have to get *really* familiar with the reboot button (a simple "sudo reboot" via ssh does not reboot the machine).

Version-Release number of selected component (if applicable):
Fedora 28: stock fedora kernel and Mesa
Fedora 29: (late beta), stock fedora kernel and Mesa
Fedora 29: 4.19 kernel, stock Mesa
Fedora 29: 4.19 kernel, upstream Mesa
Fedora 29: 4.20 snapshot kernel, upstream Mesa
Plus the above Fedora 29 combinations with the latest (as of 1 week ago) amdgpu firmware blobs from linux-firmware.git

Basically, pick a combination and it crashes.

It will also occur with Ubuntu 18.04 and the open source drivers.

It does *NOT* occur with Ubuntu 18.04 and the proprietary drivers. The AMDGPU-Pro drivers do not experience such a problem, and playing the game for at least an hour is possible.

How reproducible:
100%

Steps to Reproduce:
Method 1:
1. Use random desktop applications for 10 to 100 minutes
Method 2:
1. Install Cities: Skylines from Steam, start a new game with the "Alpine Village" scenario, and wait 4 frames to 2 minutes.

Actual results:
Graphics freeze, a Sunday full of pain and suffering.

Expected results:
A Sunday of gaming.

Additional info:

I've tried various troubleshooting things from around the internet, none of which have helped.

Here's what I tried to little/no avail:

amdgpu.vm_update_mode=3:

[  213.870923] gmc_v8_0_process_interrupt: 63 callbacks suppressed
[  213.870927] amdgpu 0000:20:00.0: GPU fault detected: 146 0x0c00480c
[  213.870929] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000180
[  213.870930] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A04800C
[  213.870932] amdgpu 0000:20:00.0: VM fault (0x0c, vmid 5, pasid 32773) at page 384, read from 'TC4' (0x54433400) (72)
[  213.870936] amdgpu 0000:20:00.0: GPU fault detected: 146 0x0c00440c
[  213.870937] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0000019F
[  213.870938] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A00400C
[  213.870939] amdgpu 0000:20:00.0: VM fault (0x0c, vmid 5, pasid 32773) at page 415, read from 'TC1' (0x54433100) (4)
[  213.870943] amdgpu 0000:20:00.0: GPU fault detected: 146 0x0c00080c
[  213.870944] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000001DE
[  213.870944] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A04400C
[  213.870946] amdgpu 0000:20:00.0: VM fault (0x0c, vmid 5, pasid 32773) at page 478, read from 'TC5' (0x54433500) (68)
[  226.681747] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=108386, last emitted seq=108389
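For anyone comparing the fault lines across the runs in this report, here is a minimal Python sketch that pulls the fields out of a "VM fault" line. The regular expression is only fitted to the lines quoted in this report, and the function name `parse_vm_fault` is hypothetical. It also shows two things that can be checked by hand: the page number is the decimal form of VM_CONTEXT1_PROTECTION_FAULT_ADDR (384 is 0x180), and the hex value after the client name is just that name in ASCII (0x54433400 is "TC4" plus a NUL byte).

```python
import re

# Fitted to the "VM fault" lines quoted in this report; other kernels may differ.
VM_FAULT_RE = re.compile(
    r"VM fault \((?P<status>0x[0-9a-fA-F]+), vmid (?P<vmid>\d+), pasid (?P<pasid>\d+)\) "
    r"at page (?P<page>\d+), (?P<access>read|write) from "
    r"'(?P<client>[^']*)' \((?P<tag>0x[0-9a-fA-F]+)\) \((?P<client_id>\d+)\)"
)


def parse_vm_fault(line):
    """Return the fields of a 'VM fault' line as a dict, or None if it does not match."""
    m = VM_FAULT_RE.search(line)
    if m is None:
        return None
    tag = int(m.group("tag"), 16)
    return {
        "vmid": int(m.group("vmid")),
        "pasid": int(m.group("pasid")),
        "page": int(m.group("page")),
        "access": m.group("access"),
        "client": m.group("client"),
        # The 32-bit tag is the client name as big-endian ASCII, NUL padded.
        "tag_ascii": tag.to_bytes(4, "big").rstrip(b"\x00").decode("ascii"),
    }


line = (
    "[  213.870932] amdgpu 0000:20:00.0: VM fault (0x0c, vmid 5, pasid 32773) "
    "at page 384, read from 'TC4' (0x54433400) (72)"
)
fault = parse_vm_fault(line)
print(fault["client"], fault["tag_ascii"], hex(fault["page"]))  # TC4 TC4 0x180
```

Lines that are not VM fault summaries (for example the ring timeout line) simply return None, so the function can be run over a whole dmesg capture.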


With the following script (and editing it accordingly) to clock down the GPU:

> #!/bin/bash
> # Force fixed amdgpu clock levels via sysfs (run as root).
> cd /sys/class/drm/card0/device || exit 1
> echo manual >power_dpm_force_performance_level
> # low: lowest memory (mclk) and core (sclk) levels
> echo 0 >pp_dpm_mclk
> echo 0 >pp_dpm_sclk
> # medium
> #echo 1 >pp_dpm_mclk
> #echo 1 >pp_dpm_sclk
> # high
> #echo 1 >pp_dpm_mclk
> #echo 6 >pp_dpm_sclk
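To confirm which level the script actually selected, the same sysfs files can be read back. The following Python sketch parses the listing text of pp_dpm_sclk or pp_dpm_mclk. It assumes the usual "index: frequency" layout with a trailing `*` on the active level; the sample frequencies are made up for illustration, and `parse_dpm_levels` is a hypothetical helper name.

```python
import re


def parse_dpm_levels(text):
    """Return ({level: MHz}, active_level) from a pp_dpm_sclk/pp_dpm_mclk listing."""
    levels = {}
    active = None
    for raw in text.splitlines():
        # Assumed layout: "<index>: <freq>Mhz" with an optional "*" on the active level.
        m = re.match(r"\s*(\d+):\s*(\d+)\s*MHz\s*(\*)?\s*$", raw, re.IGNORECASE)
        if m is None:
            continue
        index = int(m.group(1))
        levels[index] = int(m.group(2))
        if m.group(3):
            active = index
    return levels, active


# Illustrative sample only; these are not the RX580's real clock values.
SAMPLE = """0: 300Mhz *
1: 600Mhz
2: 900Mhz
"""
levels, active = parse_dpm_levels(SAMPLE)
print(levels, active)
```

On the affected machine the text would come from something like `pathlib.Path("/sys/class/drm/card0/device/pp_dpm_sclk").read_text()`, using the same directory as the script above.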

high perf:

[  321.232691] gmc_v8_0_process_interrupt: 27 callbacks suppressed
[  321.232695] amdgpu 0000:20:00.0: GPU fault detected: 146 0x0e00480c
[  321.232696] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000001C0
[  321.232697] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0604800C
[  321.232699] amdgpu 0000:20:00.0: VM fault (0x0c, vmid 3, pasid 32773) at page 448, read from 'TC4' (0x54433400) (72)
[  321.232703] amdgpu 0000:20:00.0: GPU fault detected: 146 0x0e00440c
[  321.232704] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000001DF
[  321.232705] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0604400C
[  321.232707] amdgpu 0000:20:00.0: VM fault (0x0c, vmid 3, pasid 32773) at page 479, read from 'TC5' (0x54433500) (68)
[  321.232710] amdgpu 0000:20:00.0: GPU fault detected: 146 0x0e00080c
[  321.232711] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000204
[  321.232712] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0600400C
[  321.232713] amdgpu 0000:20:00.0: VM fault (0x0c, vmid 3, pasid 32773) at page 516, read from 'TC1' (0x54433100) (4)
[  341.860631] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=57103, last emitted seq=57105
[  341.860633] [drm] GPU recovery disabled.

Both (vm_update_mode=3, force perf settings):
[  243.116382] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=61886, last emitted seq=61889
[  243.116384] [drm] GPU recovery disabled.


4.19.0 - upstream

[  138.359258] [drm:gfx_v8_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
[  138.359274] [drm] GPU recovery disabled.
[  148.839717] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=52269, emitted seq=52271
[  148.839722] [drm] GPU recovery disabled.
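The ring timeout line is worded slightly differently between kernels ("last signaled seq=" on the Fedora kernels, "signaled seq=" on 4.19.0 and later). A small Python sketch that accepts both wordings and reports how far the emitted sequence number is ahead of the signaled one follows. My reading is that this gap is the number of submitted fences that never completed, but treat that interpretation as an assumption; `pending_fences` is a hypothetical name.

```python
import re

# Accepts both "last signaled seq=..." and "signaled seq=..." wordings.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, (?:last )?signaled seq=(?P<signaled>\d+), "
    r"(?:last )?emitted seq=(?P<emitted>\d+)"
)


def pending_fences(line):
    """Return (ring, emitted - signaled) for a ring timeout line, else None."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    return m.group("ring"), int(m.group("emitted")) - int(m.group("signaled"))


older = "[  341.860631] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=57103, last emitted seq=57105"
newer = "[  148.839717] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=52269, emitted seq=52271"
print(pending_fences(older))  # ('gfx', 2)
print(pending_fences(newer))  # ('gfx', 2)
```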

mesa devel: (che/mesa repos for llvm and mesa)
[  835.915171] [drm:gfx_v8_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
[  835.915184] [drm] GPU recovery disabled.
[  846.358554] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=47496, last emitted seq=47498
[  846.358557] [drm] GPU recovery disabled.

4.19.0 + vm_update_mode=3 + force perf settings:
lasted a few frames longer:
[  147.717430] gmc_v8_0_process_interrupt: 64 callbacks suppressed
[  147.717434] amdgpu 0000:20:00.0: GPU fault detected: 147 0x00004802 for process Cities.x64 pid 3916 thread Cities.x64:cs0 pid 3918
[  147.717439] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0003F800
[  147.717440] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06048002
[  147.717443] amdgpu 0000:20:00.0: VM fault (0x02, vmid 3, pasid 32773) at page 260096, read from 'TC4' (0x54433400) (72)
[  151.253321] [drm:gfx_v8_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
[  151.253329] [drm] GPU recovery disabled.
[  161.478296] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=94836, emitted seq=94839
[  161.478300] [drm] GPU recovery disabled.

4.20-git + vm_update_mode=3 + force perf settings medium:
[  159.204731] gmc_v8_0_process_interrupt: 62 callbacks suppressed
[  159.204735] amdgpu 0000:20:00.0: GPU fault detected: 146 0x0ed0880c for process Cities.x64 pid 4047 thread Cities.x64:cs0 pid 4049
[  159.204740] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0024BFDA
[  159.204742] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A08800C
[  159.204745] amdgpu 0000:20:00.0: VM fault (0x0c, vmid 5, pasid 32773) at page 2408410, read from 'TC6' (0x54433600) (136)
[  159.204751] amdgpu 0000:20:00.0: GPU fault detected: 146 0x0ed0880c for process Cities.x64 pid 4047 thread Cities.x64:cs0 pid 4049
[  159.204753] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0024BFDC
[  159.204755] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A0C400C
[  159.204757] amdgpu 0000:20:00.0: VM fault (0x0c, vmid 5, pasid 32773) at page 2408412, read from 'TC3' (0x54433300) (196)

But it keeps going... for about 10 seconds

[  197.166697] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=49964, emitted seq=49967
[  197.166702] [drm] GPU recovery disabled.
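To put a number on how long the GPU keeps going after the first fault, the dmesg timestamps can be subtracted. This is a rough Python sketch, assuming the bracketed dmesg timestamp format shown above; the function name is hypothetical. Note that the result measures the time until the driver logs the ring timeout, which is later than the moment the display visibly freezes.

```python
import re

# dmesg prefix such as "[  159.204735]"
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def seconds_from_first_fault_to_timeout(lines):
    """Seconds between the first 'GPU fault detected' and the next 'ring gfx timeout'."""
    first_fault = None
    for line in lines:
        m = TS_RE.match(line)
        if m is None:
            continue
        ts = float(m.group(1))
        if first_fault is None and "GPU fault detected" in line:
            first_fault = ts
        elif first_fault is not None and "ring gfx timeout" in line:
            return ts - first_fault
    return None


log = [
    "[  159.204735] amdgpu 0000:20:00.0: GPU fault detected: 146 0x0ed0880c for process Cities.x64 pid 4047 thread Cities.x64:cs0 pid 4049",
    "[  197.166697] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=49964, emitted seq=49967",
]
delta = seconds_from_first_fault_to_timeout(log)
print(round(delta, 2))  # 37.96
```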

4.20-git + vm_update_mode=3 + dc=0 + vm_debug=1 + force perf medium:
[  172.290989] gmc_v8_0_process_interrupt: 26 callbacks suppressed
[  172.290994] amdgpu 0000:20:00.0: GPU fault detected: 146 0x0c684804 for process Cities.x64 pid 4059 thread Cities.x64:cs0 pid 4061
[  172.290999] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010218D
[  172.291001] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06048004
[  172.291003] amdgpu 0000:20:00.0: VM fault (0x04, vmid 3, pasid 32773) at page 1057165, read from 'TC4' (0x54433400) (72)
[  173.866580] amdgpu 0000:20:00.0: GPU fault detected: 146 0x0c68c804 for process Cities.x64 pid 4059 thread Cities.x64:cs0 pid 4061
[  173.866584] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010218D
[  173.866586] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x040C8004
[  173.866588] amdgpu 0000:20:00.0: VM fault (0x04, vmid 2, pasid 32773) at page 1057165, read from 'TC2' (0x54433200) (200)
[  173.866594] amdgpu 0000:20:00.0: GPU fault detected: 146 0x0c68c804 for process Cities.x64 pid 4059 thread Cities.x64:cs0 pid 4061
[  173.866595] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010218D
[  173.866596] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x04048004
[  173.866598] amdgpu 0000:20:00.0: VM fault (0x04, vmid 2, pasid 32773) at page 1057165, read from 'TC4' (0x54433400) (72)
[  199.627346] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=37648, emitted seq=37651
[  199.627350] [drm] GPU recovery disabled.

4.20-git + vm_update_mode=3 + gpu_recovery=1 + force perf medium:
[  161.307589] gmc_v8_0_process_interrupt: 64 callbacks suppressed
[  161.307594] amdgpu 0000:20:00.0: GPU fault detected: 146 0x0fd8c40c for process Cities.x64 pid 4098 thread Cities.x64:cs0 pid 4100
[  161.307599] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000001FB
[  161.307601] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x080C400C
[  161.307604] amdgpu 0000:20:00.0: VM fault (0x0c, vmid 4, pasid 32773) at page 507, read from 'TC3' (0x54433300) (196)
[  173.047423] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=69536, emitted seq=69539
[  173.047429] amdgpu 0000:20:00.0: GPU reset begin!
[  183.286696] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:43:crtc-0] hw_done or flip_done timed out

4.20-git + force perf low:
Worked for a period of time, enough to then try:
raise mclk, still okay.
raise sclk and hangs soon after
[  475.140263] amdgpu 0000:20:00.0: GPU fault detected: 147 0x0cd04801 for process Cities.x64 pid 3963 thread Cities.x64:cs0 pid 3965
[  475.140265] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x03F0259A
[  475.140266] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06048001
[  475.140269] amdgpu 0000:20:00.0: VM fault (0x01, vmid 3, pasid 32773) at page 66069914, read from 'TC4' (0x54433400) (72)
[  485.593133] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=44837, emitted seq=44839
[  485.593136] [drm] GPU recovery disabled.

4.20-git + force perf low, default boot options:
[  167.551963] [drm:gfx_v8_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
[  167.551969] [drm] GPU recovery disabled.
[  178.033662] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=19997, emitted seq=19999
[  178.033664] [drm] GPU recovery disabled.

4.20-git + vm_update_mode=3 + force perf low:
[  183.873105] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=47272, emitted seq=47273
[  183.873109] [drm] GPU recovery disabled.

4.20-git + iommu=off + export LIBGL_NO_DRAWARRAYS=true MESA_NO_ERROR=true MESA_GLSL_CACHE_DISABLE=true MESA_NO_MINMAX_CACHE=true RADEON_NO_TCL=true DRAW_NO_FSE=true DRAW_USE_LLVM=0
[  307.824148] amdgpu 0000:20:00.0: GPU fault detected: 146 0x0c023d14 for process Cities.x64 pid 5779 thread Cities.x64:cs0 pid 5782
[  307.824153] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00226D80
[  307.824155] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0503D014
[  307.824158] amdgpu 0000:20:00.0: VM fault (0x14, vmid 2, pasid 32773) at page 2256256, write from 'SDM1' (0x53444d31) (61)
[  446.714847] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=85368, emitted seq=85370
[  446.714852] [drm] GPU recovery disabled.

4.20 + mesa_glthread=false
while loading, not even into gameplay:
[  192.535850] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=38795, emitted seq=38798
[  192.535852] [drm] GPU recovery disabled.

4.20 + si_support=1 cik_support=1 vm_update_mode=3 + no vblank sync
[  196.611395] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=124856, emitted seq=124859
[  196.611398] [drm] GPU recovery disabled.

With some desperation, I installed the whole thing under WINE to see if we'd perhaps hit different code paths.

under WINE:
[ 1147.136434] [drm:generic_reg_wait [amdgpu]] *ERROR* REG_WAIT timeout 10us * 3000 tries - dce110_stream_encoder_dp_blank line:922
[ 1147.136500] WARNING: CPU: 5 PID: 2391 at drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:254 generic_reg_wait+0xe7/0x160 [amdgpu]
[ 1147.136500] Modules linked in: binfmt_misc fuse xt_CHECKSUM ipt_MASQUERADE tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast devlink xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables sunrpc vfat fat snd_hda_codec_realtek edac_mce_amd snd_usb_audio snd_hda_codec_generic ppdev wmi_bmof kvm_amd snd_hda_codec_hdmi snd_usbmidi_lib kvm snd_rawmidi snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq irqbypass snd_seq_device joydev snd_pcm pcspkr snd_timer snd k10temp sp5100_tco soundcore parport_pc i2c_piix4 parport wmi gpio_amdpt gpio_generic xfs libcrc32c dm_crypt amdkfd amd_iommu_v2 amdgpu chash i2c_algo_bit gpu_sched drm_kms_helper ttm crct10dif_pclmul crc32_pclmul crc32c_intel drm
[ 1147.136523]  ghash_clmulni_intel r8169 ccp nvme nvme_core pinctrl_amd
[ 1147.136527] CPU: 5 PID: 2391 Comm: gnome-shell Not tainted 4.20.0-0.rc0.git4.1.vanilla.knurd.1.fc29.x86_64 #1
[ 1147.136528] Hardware name: Micro-Star International Co., Ltd MS-7A34/B350 PC MATE (MS-7A34), BIOS A.E0 05/02/2018
[ 1147.136566] RIP: 0010:generic_reg_wait+0xe7/0x160 [amdgpu]
[ 1147.136567] Code: 44 24 58 8b 54 24 48 89 de 44 89 4c 24 08 48 8b 4c 24 50 48 c7 c7 68 8e 66 c0 e8 74 23 de ff 83 7d 18 01 44 8b 4c 24 08 74 02 <0f> 0b 48 83 c4 10 44 89 c8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 0f
[ 1147.136568] RSP: 0018:ffffa7940b0bb8a8 EFLAGS: 00010297
[ 1147.136569] RAX: 0000000000000000 RBX: 000000000000000a RCX: 0000000000000000
[ 1147.136570] RDX: 0000000000000000 RSI: ffff9b864eb56828 RDI: ffff9b864eb56828
[ 1147.136570] RBP: ffff9b863f8f0900 R08: 0000000000000084 R09: 0000000000010200
[ 1147.136571] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000bb9
[ 1147.136572] R13: 0000000000004fa4 R14: 0000000000010000 R15: 0000000000000000
[ 1147.136573] FS:  00007f4ab79f7d00(0000) GS:ffff9b864eb40000(0000) knlGS:0000000000000000
[ 1147.136573] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1147.136574] CR2: 00007f49ff2be000 CR3: 00000003c76b0000 CR4: 00000000003406e0
[ 1147.136575] Call Trace:
[ 1147.136620]  dce110_stream_encoder_dp_blank+0x12c/0x1a0 [amdgpu]
[ 1147.136662]  core_link_disable_stream+0x54/0x220 [amdgpu]
[ 1147.136704]  dce110_reset_hw_ctx_wrap+0xc1/0x1e0 [amdgpu]
[ 1147.136746]  dce110_apply_ctx_to_hw+0x45/0x650 [amdgpu]
[ 1147.136791]  ? dm_pp_apply_display_requirements+0x191/0x1a0 [amdgpu]
[ 1147.136832]  ? dce110_set_bandwidth+0x20b/0x230 [amdgpu]
[ 1147.136872]  dc_commit_state+0x2dc/0x550 [amdgpu]
[ 1147.136917]  amdgpu_dm_atomic_commit_tail+0x388/0xdb0 [amdgpu]
[ 1147.136921]  ? __wake_up_common_lock+0x89/0xc0
[ 1147.136923]  ? _cond_resched+0x15/0x30
[ 1147.136925]  ? wait_for_completion_timeout+0x3a/0x190
[ 1147.136926]  ? wait_for_completion_interruptible+0x35/0x1d0
[ 1147.136932]  commit_tail+0x3d/0x70 [drm_kms_helper]
[ 1147.136937]  drm_atomic_helper_commit+0x103/0x110 [drm_kms_helper]
[ 1147.136950]  drm_atomic_connector_commit_dpms+0xdb/0x100 [drm]
[ 1147.136961]  drm_mode_obj_set_property_ioctl+0x177/0x2a0 [drm]
[ 1147.136971]  ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
[ 1147.136980]  drm_ioctl_kernel+0xa1/0xf0 [drm]
[ 1147.136990]  drm_ioctl+0x1fc/0x390 [drm]
[ 1147.137000]  ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
[ 1147.137033]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[ 1147.137036]  do_vfs_ioctl+0xa4/0x620
[ 1147.137038]  ksys_ioctl+0x60/0x90
[ 1147.137039]  __x64_sys_ioctl+0x16/0x20
[ 1147.137041]  do_syscall_64+0x5b/0x160
[ 1147.137044]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1147.137045] RIP: 0033:0x7f4abb52af7b
[ 1147.137046] Code: 0f 1e fa 48 8b 05 0d bf 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd be 0c 00 f7 d8 64 89 01 48
[ 1147.137047] RSP: 002b:00007ffdf257c2c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1147.137048] RAX: ffffffffffffffda RBX: 0000555fca1c02c0 RCX: 00007f4abb52af7b
[ 1147.137049] RDX: 00007ffdf257c300 RSI: 00000000c01864ba RDI: 000000000000000b
[ 1147.137049] RBP: 00007ffdf257c300 R08: 0000000000000003 R09: 0000555fca1cbee0
[ 1147.137050] R10: 0000555fca1bc8a8 R11: 0000000000000246 R12: 00000000c01864ba
[ 1147.137050] R13: 000000000000000b R14: 00007ffdf257c500 R15: 00007f4abc35dda0
[ 1147.137052] ---[ end trace f894edad9e5f3f95 ]---
[ 1270.697892] gmc_v8_0_process_interrupt: 64 callbacks suppressed
[ 1270.697898] amdgpu 0000:20:00.0: GPU fault detected: 146 0x08e8c80c for process Cities.exe pid 6964 thread Cities.exe:cs0 pid 6974
[ 1270.697900] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0000031D
[ 1270.697902] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x020C800C
[ 1270.697904] amdgpu 0000:20:00.0: VM fault (0x0c, vmid 1, pasid 32769) at page 797, read from 'TC2' (0x54433200) (200)
[ 1295.099660] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=27890, emitted seq=27892
[ 1295.099663] [drm] GPU recovery disabled.


Under Xorg (after updating PC BIOS, just in case there was anything interesting not mentioned in the release notes):
[   48.727814] [drm:generic_reg_wait [amdgpu]] *ERROR* REG_WAIT timeout 10us * 3000 tries - dce110_stream_encoder_dp_blank line:922
[   48.727950] WARNING: CPU: 6 PID: 979 at drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:254 generic_reg_wait+0xe7/0x160 [amdgpu]
[   48.727951] Modules linked in: fuse xt_CHECKSUM ipt_MASQUERADE tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter devlink ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables sunrpc vfat fat ppdev wmi_bmof edac_mce_amd snd_hda_codec_realtek kvm snd_hda_codec_generic snd_hda_codec_hdmi snd_usb_audio snd_hda_intel snd_hda_codec snd_usbmidi_lib irqbypass snd_rawmidi snd_hda_core snd_hwdep snd_seq joydev snd_seq_device snd_pcm snd_timer pcspkr k10temp sp5100_tco snd i2c_piix4 soundcore parport_pc parport gpio_amdpt wmi pcc_cpufreq gpio_generic acpi_cpufreq binfmt_misc xfs libcrc32c dm_crypt amdkfd amd_iommu_v2 amdgpu chash i2c_algo_bit gpu_sched drm_kms_helper ttm crct10dif_pclmul crc32_pclmul drm
[   48.727994]  crc32c_intel ghash_clmulni_intel r8169 ccp uas nvme nvme_core usb_storage pinctrl_amd
[   48.728004] CPU: 6 PID: 979 Comm: kworker/6:2 Not tainted 4.20.0-0.rc0.git4.1.vanilla.knurd.1.fc29.x86_64 #1
[   48.728005] Hardware name: Micro-Star International Co., Ltd MS-7A34/B350 PC MATE (MS-7A34), BIOS A.G0 09/27/2018
[   48.728031] Workqueue: events drm_mode_rmfb_work_fn [drm]
[   48.728119] RIP: 0010:generic_reg_wait+0xe7/0x160 [amdgpu]
[   48.728122] Code: 44 24 58 8b 54 24 48 89 de 44 89 4c 24 08 48 8b 4c 24 50 48 c7 c7 68 de 61 c0 e8 74 33 e1 ff 83 7d 18 01 44 8b 4c 24 08 74 02 <0f> 0b 48 83 c4 10 44 89 c8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 0f
[   48.728124] RSP: 0018:ffffa4c042247a20 EFLAGS: 00010297
[   48.728126] RAX: 0000000000000000 RBX: 000000000000000a RCX: 0000000000000000
[   48.728127] RDX: 0000000000000000 RSI: ffff91800eb96828 RDI: ffff91800eb96828
[   48.728129] RBP: ffff9180010e2280 R08: 0000000000000084 R09: 0000000000010200
[   48.728130] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000bb9
[   48.728132] R13: 0000000000004fa4 R14: 0000000000010000 R15: 0000000000000000
[   48.728134] FS:  0000000000000000(0000) GS:ffff91800eb80000(0000) knlGS:0000000000000000
[   48.728136] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   48.728137] CR2: 0000565400f9bdbc CR3: 0000000407438000 CR4: 00000000003406e0
[   48.728138] Call Trace:
[   48.728243]  dce110_stream_encoder_dp_blank+0x12c/0x1a0 [amdgpu]
[   48.728334]  core_link_disable_stream+0x54/0x220 [amdgpu]
[   48.728425]  dce110_reset_hw_ctx_wrap+0xc1/0x1e0 [amdgpu]
[   48.728515]  dce110_apply_ctx_to_hw+0x45/0x650 [amdgpu]
[   48.728612]  ? dm_pp_apply_display_requirements+0x191/0x1a0 [amdgpu]
[   48.728701]  ? dce110_set_bandwidth+0x20b/0x230 [amdgpu]
[   48.728790]  dc_commit_state+0x2dc/0x550 [amdgpu]
[   48.728888]  amdgpu_dm_atomic_commit_tail+0x388/0xdb0 [amdgpu]
[   48.728894]  ? _cond_resched+0x15/0x30
[   48.728897]  ? wait_for_completion_timeout+0x3a/0x190
[   48.728900]  ? wait_for_completion_interruptible+0x35/0x1d0
[   48.728913]  commit_tail+0x3d/0x70 [drm_kms_helper]
[   48.728924]  drm_atomic_helper_commit+0x103/0x110 [drm_kms_helper]
[   48.728948]  drm_framebuffer_remove+0x357/0x3d0 [drm]
[   48.728972]  drm_mode_rmfb_work_fn+0x4f/0x60 [drm]
[   48.728978]  process_one_work+0x1a1/0x3a0
[   48.728982]  worker_thread+0x1c9/0x380
[   48.728986]  ? drain_workqueue+0x130/0x130
[   48.728988]  kthread+0x112/0x130
[   48.728992]  ? kthread_create_worker_on_cpu+0x70/0x70
[   48.728995]  ret_from_fork+0x22/0x40
[   48.728999] ---[ end trace 8d06bb780cd34c88 ]---
[  557.546258] gmc_v8_0_process_interrupt: 66 callbacks suppressed
[  557.546264] amdgpu 0000:20:00.0: GPU fault detected: 146 0x0e90880c for process Cities.x64 pid 6279 thread Cities.x64:cs0 pid 6298
[  557.546266] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000003D2
[  557.546267] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A08800C
[  557.546270] amdgpu 0000:20:00.0: VM fault (0x0c, vmid 5, pasid 32773) at page 978, read from 'TC6' (0x54433600) (136)
[  560.550236] amdgpu 0000:20:00.0: GPU fault detected: 146 0x0fd8080c for process Cities.x64 pid 6279 thread Cities.x64:cs0 pid 6298
[  560.550240] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000003FB
[  560.550242] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E00800C
[  560.550246] amdgpu 0000:20:00.0: VM fault (0x0c, vmid 7, pasid 32773) at page 1019, read from 'TC0' (0x54433000) (8)
[  572.234607] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=75649, emitted seq=75651
[  572.234610] [drm] GPU recovery disabled.

Comment 1 Stewart Smith 2018-11-06 04:01:41 UTC
With Mesa 19.0.0-0.57.git5d517a5.fc29 from https://copr.fedorainfracloud.org/coprs/che/mesa/ on kernel 4.18.16-300.fc29 today I managed to run the Cities Skylines test above for a decent amount of time (~1hr), although mild desktop usage resulted in:

[ 3389.767997] [drm:gfx_v8_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
[ 3389.768008] gmc_v8_0_process_interrupt: 66 callbacks suppressed
[ 3389.768013] amdgpu 0000:20:00.0: GPU fault detected: 146 0x0000480c
[ 3389.768015] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 3389.768018] amdgpu 0000:20:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0204800C
[ 3389.768022] amdgpu 0000:20:00.0: VM fault (0x0c, vmid 1, pasid 32770) at page 0, read from 'TC4' (0x54433400) (72)
[ 3389.768038] [drm] GPU recovery disabled.
[ 3400.178733] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=425508, last emitted seq=425510
[ 3400.178737] [drm] GPU recovery disabled.

Comment 2 Vasilis Keramidas 2018-11-11 16:01:30 UTC
Same problem here

Comment 3 Jeremy Cline 2018-12-03 17:35:16 UTC
We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.
 
Fedora 29 has now been rebased to 4.19.5-300.fc29.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you experience different issues, please open a new bug report for those.

Comment 4 Vasilis Keramidas 2018-12-04 17:18:35 UTC
Problem still persists with the latest kernel.

uname -a
Linux keramidopc 4.19.5-300.fc29.x86_64 #1 SMP Tue Nov 27 19:29:23 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Comment 5 Justin M. Forbes 2019-01-29 16:14:11 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.

Fedora 29 has now been rebased to 4.20.5-200.fc29.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 6 Justin M. Forbes 2019-02-21 21:06:34 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.

Comment 7 xom 2019-05-06 09:45:28 UTC
Problem is still here on 5.1.0-0.rc5

Apr 24 18:25:14 abyss kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0000480c for process gnome-shell pid 4536 thread gnome-shel:cs0 pi>
Apr 24 18:25:14 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Apr 24 18:25:14 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C04800C
Apr 24 18:25:14 abyss kernel: amdgpu 0000:01:00.0: VM fault (0x0c, vmid 6, pasid 32778) at page 0, read from 'TC4' (0x54433400) (72)
Apr 24 18:25:20 abyss /usr/libexec/gdm-x-session[4290]: (II) event5  - Kingsis Peripherals ZOWIE Gaming mouse: SYN_DROPPED event - some input e>
Apr 24 18:25:20 abyss kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
Apr 24 18:25:20 abyss kernel: amdgpu 0000:01:00.0: GPU fault detected: 147 0x0c023d10 for process Xorg pid 4292 thread Xorg:cs0 pid 4293
Apr 24 18:25:20 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101780
Apr 24 18:25:20 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0F03D010
Apr 24 18:25:20 abyss kernel: amdgpu 0000:01:00.0: VM fault (0x10, vmid 7, pasid 32777) at page 1054592, write from 'SDM1' (0x53444d31) (61)
Apr 24 18:25:24 abyss kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Apr 24 18:25:25 abyss /usr/libexec/gdm-x-session[4290]: (II) event5  - Kingsis Peripherals ZOWIE Gaming mouse: SYN_DROPPED event - some input e>
Apr 24 18:25:29 abyss firefox.desktop[4536]: Fontconfig warning: Directory/file mtime in the future. New fonts may not be detected.
Apr 24 18:25:30 abyss kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
Apr 24 18:25:35 abyss kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: GPU fault detected: 147 0x0d080408 for process Xorg pid 4292 thread Xorg:cs0 pid 4293
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000003A1
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A004008
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: VM fault (0x08, vmid 5, pasid 32777) at page 929, read from 'TC1' (0x54433100) (4)
Apr 24 18:25:35 abyss kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: GPU fault detected: 147 0x0d088808 for process Xorg pid 4292 thread Xorg:cs0 pid 4293
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000003A1
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A088008
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: VM fault (0x08, vmid 5, pasid 32777) at page 929, read from 'TC6' (0x54433600) (136)
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: GPU fault detected: 147 0x0d088808 for process Xorg pid 4292 thread Xorg:cs0 pid 4293
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000003A1
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A088008
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: VM fault (0x08, vmid 5, pasid 32777) at page 929, read from 'TC6' (0x54433600) (136)
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: GPU fault detected: 147 0x0d088408 for process Xorg pid 4292 thread Xorg:cs0 pid 4293
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000003A1
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A084008
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: VM fault (0x08, vmid 5, pasid 32777) at page 929, read from 'TC7' (0x54433700) (132)
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: GPU fault detected: 147 0x0d08c808 for process Xorg pid 4292 thread Xorg:cs0 pid 4293
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000003A1
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A0C8008
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: VM fault (0x08, vmid 5, pasid 32777) at page 929, read from 'TC2' (0x54433200) (200)
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: GPU fault detected: 147 0x0d08c808 for process Xorg pid 4292 thread Xorg:cs0 pid 4293
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000003A1
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A0C8008
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: VM fault (0x08, vmid 5, pasid 32777) at page 929, read from 'TC2' (0x54433200) (200)
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: GPU fault detected: 147 0x0d08c408 for process Xorg pid 4292 thread Xorg:cs0 pid 4293
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000003A1
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A0C4008
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: VM fault (0x08, vmid 5, pasid 32777) at page 929, read from 'TC3' (0x54433300) (196)
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: GPU fault detected: 147 0x0d08c408 for process Xorg pid 4292 thread Xorg:cs0 pid 4293
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000003A1
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A0C4008
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: VM fault (0x08, vmid 5, pasid 32777) at page 929, read from 'TC3' (0x54433300) (196)
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0d10480c for process Xorg pid 4292 thread Xorg:cs0 pid 4293
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000003A2
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A04800C
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32777) at page 930, read from 'TC4' (0x54433400) (72)
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0d10480c for process Xorg pid 4292 thread Xorg:cs0 pid 4293
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000003A2
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A04800C
Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32777) at page 930, read from 'TC4' (0x54433400) (72)
-- Reboot --
-- Reboot --
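Many of the fault entries in a capture like the one above are repeats of the same page. A short Python sketch for collapsing such a journal into a count per faulting page is given here; it only relies on the "VM fault" wording seen in this report, it works on both dmesg and journal prefixes, and `faults_per_page` is a hypothetical name.

```python
import re
from collections import Counter

# Only the page number is needed, so the pattern stays loose about the other fields.
PAGE_RE = re.compile(r"VM fault \([^)]*\) at page (\d+), (?:read|write) from")


def faults_per_page(lines):
    """Count 'VM fault' lines per faulting page number."""
    counts = Counter()
    for line in lines:
        m = PAGE_RE.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts


journal = [
    "Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: VM fault (0x08, vmid 5, pasid 32777) at page 929, read from 'TC1' (0x54433100) (4)",
    "Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: VM fault (0x08, vmid 5, pasid 32777) at page 929, read from 'TC6' (0x54433600) (136)",
    "Apr 24 18:25:35 abyss kernel: amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5, pasid 32777) at page 930, read from 'TC4' (0x54433400) (72)",
]
counts = faults_per_page(journal)
print(counts.most_common())  # [(929, 2), (930, 1)]
```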

