Bug 1780800
Summary: | [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out | ||||||
---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Chris Murphy <bugzilla> | ||||
Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> | ||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||
Severity: | urgent | Docs Contact: | |||||
Priority: | urgent | ||||||
Version: | 31 | CC: | airlied, amessina, asogukpi, bhoefer, bskeggs, cbredesen, chemobejk, diego.ce, dimhen, dkaylor, fahmed, fcami, fweimer, goodmirek, hdegoede, hkario, ichavero, itamar, ivica.perovic, iweiss, jarodwilson, jcubic, jeremy, jforbes, jglisse, jlmagee, john.j5live, jonathan, josef, kernel-maint, linville, lists, lmiccini, lslebodn, mailinglists35, marcel.raad, marko.bevc, masami256, massi.ergosum, mchehab, mihai, mjg59, mparkins, myeservices+fedoraproject.org, oholy, pachoramos1, pep, pweil, rafsoon, redhat, redhat, sanjay.ankur, sb, sbroz, seldridg, steved, szidek, vitaly, votava, vrutkovs, youling257 | ||||
Target Milestone: | --- | ||||||
Target Release: | --- | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2020-02-24 08:46:04 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Description
Chris Murphy
2019-12-07 00:50:29 UTC
00:02.0 VGA compatible controller [0300]: Intel Corporation Skylake GT2 [HD Graphics 520] [8086:1916] (rev 07) (prog-if 00 [VGA controller])
Subsystem: Hewlett-Packard Company Device [103c:81a0]
model name : Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz

Excerpts for search.

Dec 06 17:39:54 flap.local kernel: i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
Dec 06 17:39:54 flap.local kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Dec 06 17:39:54 flap.local kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Dec 06 17:39:57 flap.local kernel: Asynchronous wait on fence i915:gnome-shell[1470]:6952 timed out (hint:intel_atomic_commit_ready+0x0/0x50 [i915])

Remove dup bug ID, replace with actual.

Comment containing commit reference that fixes this:
https://gitlab.freedesktop.org/drm/intel/issues/673#note_359912

Still happens with 5.4.5-300.fc31.x86_64, but not 5.5.0rc1 or rc2.

The patch was submitted to stable and rejected because it doesn't apply to 5.4. I will give it a little time to see if it is properly backported before doing a 5.4.6 build.

I have similar problem with kernel 5.5 rc3.

[ 541.644847] Asynchronous wait on fence i915:surfaceflinger[1495]:119c8 timed out (hint:intel_atomic_commit_ready+0x0/0x50 [i915])
[ 546.268573] i915 0000:00:02.0: GPU HANG: ecode 8:1:0x84dfbffe, in surfaceflinger [1495], stopped heartbeat on rcs0
[ 546.268622] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 546.268689] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 546.268755] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 546.268821] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 546.268887] GPU crash dump saved to /sys/class/drm/card0/error
[ 546.372596] i915 0000:00:02.0: Resetting rcs0 for stopped heartbeat on rcs0

Hi, I'm experiencing the same issue on Fedora 31 with kernel 5.4.7-200.

Computer: Lenovo ThinkPad T580
GPU: Intel UHD 620
00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 620 (rev 07) (prog-if 00 [VGA controller])
Subsystem: Lenovo Device 225a
Flags: bus master, fast devsel, latency 0, IRQ 153
Memory at eb000000 (64-bit, non-prefetchable) [size=16M]
Memory at a0000000 (64-bit, prefetchable) [size=256M]
I/O ports at e000 [size=64]
[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: i915
Kernel modules: i915
CPU: Intel Core i7-8650U
BIOS version: N27ET36W (1.22)

Reverting to 5.3.16-300 for the time being since it doesn't have this issue.

Upstream issue reports that backporting the fix from 5.5 to 5.4 is non-trivial. And now there are a few attempts at reverting the change that introduced the problem, so even the revert is apparently not straightforward. Skylake and Kabylake CPUs are affected, but I'm not sure if it's all or a subset of those.

kernel: 5.4.8-200.fc31.x86_64
CPU: i5-8400
GPU: UHD Intel 630

i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
drm/i915 developers can then reassign to the right component if it's not a kernel issue.
The GPU crash dump is required to analyze GPU hangs, so please always attach it.
GPU crash dump saved to /sys/class/drm/card0/error
i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
i915 0000:00:02.0: Resetting chip for hang on rcs0
[drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}

I experience the same issue. It never happened with the 5.3 kernel.

My kernel:
```
Jan 08 10:04:23 localhost.localdomain kernel: microcode: microcode updated early to revision 0xca, date = 2019-09-26
Jan 08 10:04:23 localhost.localdomain kernel: Linux version 5.4.7-200.fc31.x86_64 (mockbuild.fedoraproject.org) (gcc version 9.2.1 20190827 (Red Hat 9.2.1-1) (GCC)) #1 SMP Tue Dec 31 22:25:12 UTC 2019
Jan 08 10:04:23 localhost.localdomain kernel: Command line: BOOT_IMAGE=(hd0,gpt5)/vmlinuz-5.4.7-200.fc31.x86_64 root=/dev/mapper/luks-b6994190-43c4-42f1-bc49-ab5cd4717038 ro rd.luks.uuid=luks-b6994190-43c4-42f1-bc49-ab5cd4717038 rd.lvm.lv=outer/fedora scsi_mod.use_blk_mq=1 noibrsnoibpb nopti nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off mitigations=off
```
My hardware is i5-7200U.

I have a similar issue on Fedora 30, kernel 5.4.8-100.fc30.x86_64.
Hardware: Laptop Dell Inspiron 15 5570, i7-8550U
But in my case I was able to switch to a tty after a few tries. It sometimes freezes on the screensaver and sometimes doesn't (it's random). I upgraded to Fedora 30 from 29 a few days ago and was not having issues with Fedora 29.

I experienced a complete hang without being able to do anything a few times over the past weeks with several kernel 5.4.X versions. Today, after update to 5.4.10, I experienced a hang which was released after a few seconds.
Logs:
Jan 14 16:56:00 x1 kernel: i915 0000:00:02.0: Resetting rcs0 for stuck wait on rcs0
Jan 14 16:56:03 x1 kernel: Asynchronous wait on fence i915:gnome-shell[2073]:2672e timed out (hint:intel_atomic_commit_ready+0x0/0x50 [i915])
Jan 14 16:56:08 x1 kernel: i915 0000:00:02.0: GPU HANG: ecode 9:1:0x85dfbfff, in code [3378], hang on rcs0
Jan 14 16:56:08 x1 kernel: GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Jan 14 16:56:08 x1 kernel: Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Jan 14 16:56:08 x1 kernel: drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Jan 14 16:56:08 x1 kernel: The GPU crash dump is required to analyze GPU hangs, so please always attach it.
Jan 14 16:56:08 x1 kernel: GPU crash dump saved to /sys/class/drm/card0/error
Jan 14 16:56:08 x1 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0

Hardware: Lenovo ThinkPad X1 Carbon 5th Gen

$ lspci -vs 00:02
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 620 (rev 02) (prog-if 00 [VGA controller])
Subsystem: Lenovo ThinkPad X1 Carbon 5th Gen
Flags: bus master, fast devsel, latency 0, IRQ 130
Memory at eb000000 (64-bit, non-prefetchable) [size=16M]
Memory at 60000000 (64-bit, prefetchable) [size=256M]
I/O ports at e000 [size=64]
[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: i915
Kernel modules: i915

$ lscpu | grep 'Model name'
Model name: Intel(R) Core(TM) i5-7300U CPU @ 2.60GHz

AFAICS, the backport patch "drm/i915/gt: Detect if we miss WaIdleLiteRestore" has been added to F31 2 days ago:
https://src.fedoraproject.org/rpms/kernel/c/9607b5faaa81022ed8b97f517c766202f9680744?branch=f31
It should be part of kernel-5.4.11-202.fc31:
https://bodhi.fedoraproject.org/updates/FEDORA-2020-3738c94456
And the new kernel-5.4.12-200.fc31:
https://bodhi.fedoraproject.org/updates/FEDORA-2020-e328697628

In my opinion this patch
should be reverted in Fedora kernels. It makes the problem unquestionably worse: it takes longer to experience the problem, but once it happens, it's a hard crash. I can't ssh in. I can't switch to a VT. System gets hot, fans go to max, and I have to force power off.

I always install new kernels, so I'm not sure which one it was (I think it was 5.4.10-100.fc30.x86_64; I had installed 5.4.11-102.fc30.x86_64 but hadn't rebooted for it to take effect), but I also got a hard crash. I was not able to switch to a TTY with a few key presses like before.

I experienced hard crashes exactly like you described (including overheating) also with earlier versions, I think in both 5.4.7 and 5.4.8.

Had that issue on Manjaro with KDE. Seems like the problem doesn't happen with the LTS Kernel Version 4.19.96-1-MANJARO

I face similar issues with 5.4.12-200.fc31.x86_64 but NOT with 5.4.10-200.fc31.x86_64
Using xorg-x11-drv-intel-2.99.917-43.20180618.fc31.x86_64
The issue seems to reproduce more easily with google-chrome or KDE kontact (QTWebEngine)

00:02.0 VGA compatible controller: Intel Corporation Iris Plus Graphics 650 (rev 06) (prog-if 00 [VGA controller])
DeviceName: CPU
Subsystem: Intel Corporation Device 2068
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 128
Region 0: Memory at db000000 (64-bit, non-prefetchable) [size=16M]
Region 2: Memory at 90000000 (64-bit, prefetchable) [size=256M]
Region 4: I/O ports at f000 [size=64]
[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express (v2) Root Complex Integrated Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0
ExtTag- RBE+
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
AtomicOpsCtl: ReqEn-
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable- 64bit-
Address: fee00018 Data: 0000
Capabilities: [d0] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Process Address Space ID (PASID)
PASIDCap: Exec- Priv-, Max PASID Width: 14
PASIDCtl: Enable- Exec- Priv-
Capabilities: [200 v1] Address Translation Service (ATS)
ATSCap: Invalidate Queue Depth: 00
ATSCtl: Enable-, Smallest Translation Unit: 00
Capabilities: [300 v1] Page Request Interface (PRI)
PRICtl: Enable- Reset-
PRISta: RF- UPRGI- Stopped+
Page Request Capacity: 00008000, Page Request Allocation: 00000000
Kernel driver in use: i915
Kernel modules: i915

[drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
[drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
perf: interrupt took too long (2512 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
perf: interrupt took too long (3155 > 3140), lowering kernel.perf_event_max_sample_rate to 63000
perf: interrupt took too long (3946 > 3943), lowering kernel.perf_event_max_sample_rate to 50000
i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel drm/i915 developers can then reassign to the right component if it's not a kernel issue. The GPU crash dump is required to analyze GPU hangs, so please always attach it. GPU crash dump saved to /sys/class/drm/card0/error i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001} i915 0000:00:02.0: Resetting chip for hang on rcs0 [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001} [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001} i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 show_signal_msg: 58 callbacks suppressed GpuWatchdog[17683]: segfault at 0 ip 00005591410d6ded sp 00007f61f5d9b500 error 6 in chrome[55913d19b000+7171000] Code: 48 c1 c9 03 48 81 f9 af 00 00 00 0f 87 c9 00 00 00 48 8d 15 a9 5a 9c fb f6 04 11 20 0f 84 b8 00 00 00 be 01 00 00 00 ff 50 30 <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 c1 6d a4 03 01 80 7d 8f 00 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for 
hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering. i915 0000:00:02.0: Resetting chip for hang on rcs0 i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering. i915 0000:00:02.0: Resetting chip for hang on rcs0 [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS. GpuWatchdog[18003]: segfault at 0 ip 0000557e368added sp 00007fc750c74500 error 6 in chrome[557e32972000+7171000] Code: 48 c1 c9 03 48 81 f9 af 00 00 00 0f 87 c9 00 00 00 48 8d 15 a9 5a 9c fb f6 04 11 20 0f 84 b8 00 00 00 be 01 00 00 00 ff 50 30 <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 c1 6d a4 03 01 80 7d 8f 00 [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS. 
i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering. 
i915 0000:00:02.0: Resetting chip for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering. i915 0000:00:02.0: Resetting chip for hang on rcs0 i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering. 
i915 0000:00:02.0: Resetting chip for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 Asynchronous wait on fence i915:kwin_x11[2220]:8ef3a timed out (hint:intel_atomic_commit_ready+0x0/0x50 [i915]) i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering. 
i915 0000:00:02.0: Resetting chip for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 Asynchronous wait on fence i915:Xorg[1360]:cfa86 timed out (hint:intel_atomic_commit_ready+0x0/0x50 [i915]) i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS. Lockdown: Xorg: raw io port access is restricted; see man kernel_lockdown.7 broken atomic modeset userspace detected, disabling atomic i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 i915 0000:00:02.0: GPU recovery timed out, cancelling all 
in-flight rendering.
i915 0000:00:02.0: Resetting chip for hang on rcs0

This issue affects me for at least all versions of 5.4.10 and upwards (just happened with 5.4.12 again). I've reverted to a 5.3.16 kernel which is stable from this point of view.

In my case, it always happens when I have an external monitor connected over USB-C through my Dell TB dock. It may take a few minutes, it may take an hour, but eventually the whole desktop will hang. It happens on both X11 and Wayland. I have not seen this happen when running without external monitors, but perhaps I've not used it untethered for long enough.

Perhaps unrelated: since the 5.4 kernels, I often have problems suspending (typical case: time to go home, disconnect dock, press power button to put laptop to sleep, fail).

With 5.4.10-100.fc30.x86_64 I got a different type of error. The screen was flickering (like a really slow refresh rate). I was able to move the cursor, and it was changing when I hovered over an input field, but it was not responsive in any other way (I was not able to move windows). I was able to switch to a TTY, and after restarting the display-server service my system continued running.

Here is the end of dmesg:

sty 21 16:00:36 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
sty 21 16:00:38 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
sty 21 16:00:40 kernel: i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering.
sty 21 16:00:40 kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0
sty 21 16:00:41 kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
sty 21 16:00:41 kernel: amdgpu: [powerplay] can't get the mac of 5 sty 21 16:00:47 kernel: amdgpu: [powerplay] VI should always have 2 performance levels sty 21 16:00:47 kernel: amdgpu 0000:01:00.0: GPU pci config reset sty 21 16:00:48 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:00:50 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:00:52 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:00:56 kernel: i915 0000:00:02.0: Resetting rcs0 for no progress on rcs0 sty 21 16:00:58 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:00 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:02 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:04 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:06 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:08 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:10 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:12 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:14 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:16 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:18 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:20 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:22 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:24 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:26 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:28 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:30 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:32 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:34 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:36 kernel: i915 0000:00:02.0: 
Resetting rcs0 for hang on rcs0 sty 21 16:01:38 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:40 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:42 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:01:44 kernel: i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering. sty 21 16:01:44 kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0 sty 21 16:01:45 kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000). sty 21 16:01:45 kernel: amdgpu: [powerplay] can't get the mac of 5 sty 21 16:01:46 kernel: i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering. sty 21 16:01:46 kernel: i915 0000:00:02.0: Resetting chip for no progress on rcs0 sty 21 16:01:52 kernel: amdgpu: [powerplay] VI should always have 2 performance levels sty 21 16:01:53 kernel: amdgpu 0000:01:00.0: GPU pci config reset sty 21 16:01:54 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:02:02 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0 sty 21 16:18:20 kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000). 
sty 21 16:18:20 kernel: amdgpu: [powerplay] can't get the mac of 5
sty 21 16:18:32 kernel: amdgpu: [powerplay] VI should always have 2 performance levels
sty 21 16:18:32 kernel: amdgpu 0000:01:00.0: GPU pci config reset

A few lines before that, I got this:

sty 21 15:55:05 kernel: GpuWatchdog[2933]: segfault at 0 ip 000055878691877d sp 00007fdd3139e480 error 6 in chrome[5587829dd000+7170000]
sty 21 15:55:05 kernel: Code: 48 c1 c9 03 48 81 f9 af 00 00 00 0f 87 c9 00 00 00 48 8d 15 19 61 9c fb f6 04 11 20 0f 84 b8 00 00 00 be 01 00 00 00 ff 50 30 <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 f1 6b a4 03 01 80 7d 8f 00

I'm not sure if this is related, but for some time I've been getting messages about the CPU temperature; I'm not sure if something is wrong with the fan:

sty 16 13:15:09 kernel: mce: CPU4: Core temperature above threshold, cpu clock throttled (total events = 17)
sty 16 13:15:09 kernel: mce: CPU0: Core temperature above threshold, cpu clock throttled (total events = 17)
sty 16 13:15:09 kernel: mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 20)
sty 16 13:15:09 kernel: mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 20)
sty 16 13:15:09 kernel: mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 20)
sty 16 13:15:09 kernel: mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 20)
sty 16 13:15:09 kernel: mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 20)
sty 16 13:15:09 kernel: mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 20)
sty 16 13:15:09 kernel: mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 20)
sty 16 13:15:09 kernel: mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 20)

I've seen the problem with and without an external display connected.
The consistent trigger in my case is a particular chat app (yakyak). It uses npm and electron. I'd characterize it as "will inevitably result in a GPU hang", whereas I can't right now think of any other activity that does trigger it.
https://github.com/yakyak/yakyak

I've seen the "hang on rcs0" messages since 5.4.8-200.fc31.x86_64 and they too appear to be triggered by a chat application ("Rambox") using the Electron framework. Now I have 5.4.12-200.fc31.x86_64 and experienced a complete hang of the machine, no log messages and no reaction to sysrq. Netconsole won't work because this laptop is connected via WiFi.

The Freedesktop issue #673 is closed - any chance we can get Linux 5.5 into updates-testing maybe?

# lspci -s 00:02.0 -vvv
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 620 (rev 02) (prog-if 00 [VGA controller])
Subsystem: Lenovo Device 2245
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 129
Region 0: Memory at eb000000 (64-bit, non-prefetchable) [size=16M]
Region 2: Memory at a0000000 (64-bit, prefetchable) [size=256M]
Region 4: I/O ports at e000 [size=64]
[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express (v2) Root Complex Integrated Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0
ExtTag- RBE+
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
AtomicOpsCtl: ReqEn-
Capabilities: [ac] MSI:
Enable+ Count=1/1 Maskable- 64bit- Address: fee00018 Data: 0000 Capabilities: [d0] Power Management version 2 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [100 v1] Process Address Space ID (PASID) PASIDCap: Exec- Priv-, Max PASID Width: 14 PASIDCtl: Enable- Exec- Priv- Capabilities: [200 v1] Address Translation Service (ATS) ATSCap: Invalidate Queue Depth: 00 ATSCtl: Enable-, Smallest Translation Unit: 00 Capabilities: [300 v1] Page Request Interface (PRI) PRICtl: Enable- Reset- PRISta: RF- UPRGI- Stopped+ Page Request Capacity: 00008000, Page Request Allocation: 00000000 Kernel driver in use: i915 Kernel modules: i915 5.5 kernels are only available in rawhide, so they won't go to updates-testing, but you can grab them off koji. They install and run fine on F31 (and I imagine F30 too but havne't tested it), which is what I'm doing. https://koji.fedoraproject.org/koji/packageinfo?packageID=8 My suggestion is grab kernel, kernel-core, kernel-modules arch specific RPMs, and just do $ sudo dnf install Downloads/*rpm The ones with git0 in the name are no debug (the debug version is explicitly named), where as git1, git2, git3 are all debug kernels and run a bit slower. So, today I tried 5.4.13-201. Took me all of 3 minutes to hang it completely. At this point, I have given up on the 5.4 kernel series completely. It's garbage. I don't care if it's only the Intel driver that's causing issues. It's making 5.4 unusable and therefore it's garbage. I tried 5.5 for a while, which works (no hangs), but causes suspend to fail at some point. This is very undesirable, because a running laptop in a snug backpack is a recipe for overheating. The workaround is to shut down when I leave the office and cold boot when I get home. So 5.5 is also garbage, although less toxic than 5.4 (and, incidentally, 5.4 had the smae suspend issues). I am now back on 5.3.16-300. 
No hangs, no suspend failures. I'm not installing kernel updates anymore.

The issue is still observed in 5.4.15, so I'm also staying with the 5.3 series for now.

I've got this also, with the same error listed above. Happy to provide any debugging information required. Workaround is the same for me, 5.3 kernel.

So I wasn't imagining things... In my case I've mainly experienced temporary hangs, i.e. I just needed to wait 30-60 seconds for the system to come back. Highly annoying when you are working on something. I'll install 5.3.16 and 5.5.2 from koji and re-test.

System Information
    Manufacturer: LENOVO
    Product Name: 20NX000EMX
    Version: ThinkPad T490s
    ...
    SKU Number: LENOVO_MT_20NX_BU_Think_FM_ThinkPad T490s
    Family: ThinkPad T490s

00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 620 (Whiskey Lake) (prog-if 00 [VGA controller])
    Subsystem: Lenovo Device 2286

model name : Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz

The 5.5.2 kernel build from https://koji.fedoraproject.org/koji/buildinfo?buildID=1457411 fixed the issue on my Thinkpad T490S, which exhibited the issue running 5.4.x.

Thank you Francois - did you install only kernel-5.5 or also kernel-modules-*? What version were those? I will probably wait till this hits updates-testing...

kernel, kernel-core and kernel-modules using rpm -ivh <rpms>, all downloaded from the URL above (so same version: 5.5.2-200).

I'm running kernel-5.5.2-200.fc31.x86_64 on my T490s, uptime about 6 days (OK, with hibernation periods in-between :-). The kernel log doesn't show the error message any more and I haven't experienced any desktop lockups during that time.

Stefan, exactly the same config here and 5.4 still seems broken :( Is this going to be backported, or do we need to wait until 5.5? Thanks.

Hi Stefan, Marko. Same here. Installed kernel 5.5.3-200.fc31.x86_64 and it has been running normally for the past 15 minutes, as compared to 5.4.x, which crashes within 5 mins of use.

Yeah, did the same here yesterday. 5.5.x tree seems fine so far.
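
For anyone who wants to check a saved kernel log for the messages discussed in this report, here is a minimal Python sketch. It is only an illustration: the function name `find_hang_lines` and the pattern list are hypothetical choices, and the patterns are simply fragments copied from the log excerpts quoted in this bug, not a complete list of i915 hang messages.

```python
import re

# Message fragments copied from the kernel log excerpts in this report.
# Illustrative only; other hang messages may exist that are not listed here.
HANG_PATTERNS = [
    re.compile(r"GPU HANG: ecode"),
    re.compile(r"reset request timed out"),
    re.compile(r"Asynchronous wait on fence i915:"),
]


def find_hang_lines(log_text):
    """Return the lines of log_text that match any of the patterns above."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in HANG_PATTERNS)
    ]


# Sample input taken from the description of this bug.
SAMPLE = """\
Dec 06 17:39:54 flap.local kernel: i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
Dec 06 17:39:54 flap.local kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Dec 06 17:39:54 flap.local kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
"""

for hit in find_hang_lines(SAMPLE):
    print(hit)
```

On a live system the text could come from the output of `journalctl -k` saved to a file; an empty result only means none of these particular messages were logged.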
A few days ago I installed 5.0.9, the latest in the main Fedora repo (as shown by dnf), and I also got a crash. In my case it takes a little longer, a day or two, before the crash, but it also happens with the stable Fedora repo. I had hoped that installing 5.0.9 would fix the issue, but I guess I need to try 5.5.x.

Yesterday morning I installed kernel-*-5.5.5-200.fc31 from koji. Haven't had a single issue since. That kernel appears to be in the updates repo this morning. FWIW, the system seems a bit snappier than it was on the last few 5.4.x kernels. This is on a Lenovo P51 Thinkpad with this hardware:

lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05)
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 630 (rev 04)
00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem (rev 31)
00:15.0 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #0 (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:1b.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #17 (rev f1)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1)
00:1c.2 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #3 (rev f1)
00:1c.4 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #5 (rev f1)
00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
00:1d.4 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #13 (rev f1)
00:1f.0 ISA bridge: Intel Corporation CM238 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation CM238 HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (5) I219-LM (rev 31)
01:00.0 3D controller: NVIDIA Corporation GM107GLM [Quadro M1200 Mobile] (rev a2)
01:00.1 Audio device: NVIDIA Corporation Device 0fbc (rev a1)
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961
04:00.0 Network controller: Intel Corporation Wireless 8265 / 8275 (rev 78)
3e:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
3f:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)

The fix is available in 5.5.x kernels, which are available now. Closing due to comments by Stefan, Wayne, Marko, & John + my own experience. Please feel free to reopen if you are running a 5.5.x Fedora kernel or later and experience the same issue.

Is the new kernel only available for Fedora 31? What about Fedora 30? I've just run `dnf update` and got an update with `5.4.21-100.fc30`. Do I need to install this one to get 5.5?
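
Since the closing comment says the fix is in 5.5.x kernels, a quick way to tell whether a given kernel release string belongs to that series is to compare its major and minor numbers. The sketch below is only an illustration: the helper names `kernel_series` and `in_fixed_series` are hypothetical, and the check compares version numbers only; it cannot confirm that a particular build carries the fix.

```python
import re


def kernel_series(release):
    """Extract (major, minor) from a release string such as '5.4.21-100.fc30'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def in_fixed_series(release):
    """True if the release is 5.5 or later, the series this report was closed against.

    Assumption: this is a plain version-number comparison, nothing more.
    """
    return kernel_series(release) >= (5, 5)


# Release strings mentioned in this thread.
for release in ("5.4.21-100.fc30", "5.5.5-200.fc31"):
    print(release, in_fixed_series(release))
```

For the running kernel, the release string is what `uname -r` prints (or `platform.release()` from Python). By this check, `5.4.21-100.fc30` is still a 5.4 kernel.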