Bug 1547612

Summary:	gpu hang
Product:	[Fedora] Fedora	Reporter:	Robert Story <rs>
Component:	xorg-x11-drv-intel	Assignee:	Adam Jackson <ajax>
Status:	CLOSED EOL	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	27	CC:	airlied, ajax, bskeggs, bugzillaredhat-56f0, ewk, hdegoede, ichavero, itamar, jarodwilson, jglisse, john.j5live, jonathan, josef, kernel-maint, labbott, linux, linville, martineau, mchehab, mjg59, steved, williams, xgl-maint
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-11-30 23:44:06 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Robert Story 2018-02-21 15:55:01 UTC

Description of problem:

The GPU hangs and kills my X server:

Feb 21 07:45:18 titan plasmashell[27211]: QXcbClipboard::setMimeData: Cannot set X11 selection owner
Feb 21 07:45:19 titan plasmashell[27211]: QXcbClipboard::setMimeData: Cannot set X11 selection owner
Feb 21 07:45:30 titan kernel: [drm] GPU HANG: ecode 9:0:0x85dffffb, in Xorg [26722], reason: Hang on rcs0, action: reset
Feb 21 07:45:30 titan kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Feb 21 07:45:30 titan kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Feb 21 07:45:30 titan kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Feb 21 07:45:30 titan kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Feb 21 07:45:30 titan kernel: [drm] GPU crash dump saved to /sys/class/drm/card1/error
Feb 21 07:45:30 titan kernel: i915 0000:00:02.0: Resetting rcs0 after gpu hang
Feb 21 07:45:38 titan kernel: i915 0000:00:02.0: Resetting rcs0 after gpu hang
Feb 21 07:45:46 titan kernel: i915 0000:00:02.0: Resetting rcs0 after gpu hang
Feb 21 07:45:54 titan kernel: i915 0000:00:02.0: Resetting rcs0 after gpu hang
Feb 21 07:46:02 titan kernel: i915 0000:00:02.0: Resetting rcs0 after gpu hang
Feb 21 07:46:03 titan konsole[29061]: The X11 connection broke: I/O error (code 1)
Feb 21 07:46:03 titan at-spi-bus-launcher[27258]: XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
Feb 21 07:46:03 titan at-spi-bus-launcher[27258]:      after 6070 requests (6070 known processed) with 0 events remaining.
Feb 21 07:46:03 titan konsole[28698]: The X11 connection broke: I/O error (code 1)
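
For anyone comparing their own logs against the excerpt above, here is a minimal sketch that pulls the i915 reset messages out of syslog-style lines and reports the gaps between them. The function names and the assumed year are illustrative and not part of any tool. The pattern also accepts the "Resetting rcs0 for hang on rcs0" wording that shows up with the 4.18 kernel later in this report.

```python
import re
from datetime import datetime

# Matches both reset message variants quoted in this report:
#   "Resetting rcs0 after gpu hang"   (4.15 kernel log)
#   "Resetting rcs0 for hang on rcs0" (4.18 kernel log)
RESET_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"i915 \S+ Resetting (?P<engine>\w+) (?:after gpu hang|for hang on \w+)"
)


def reset_times(lines, year=2018):
    """Return the timestamps of i915 engine reset messages."""
    times = []
    for line in lines:
        m = RESET_RE.match(line)
        if m:
            # Syslog timestamps carry no year, so one is assumed here.
            ts = " ".join(m.group("ts").split())  # collapse padded day field
            times.append(datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S"))
    return times


def reset_intervals(lines, year=2018):
    """Return the seconds elapsed between consecutive reset messages."""
    times = reset_times(lines, year)
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


# Lines copied from the log excerpt in this report.
SAMPLE = [
    "Feb 21 07:45:30 titan kernel: [drm] GPU crash dump saved to /sys/class/drm/card1/error",
    "Feb 21 07:45:30 titan kernel: i915 0000:00:02.0: Resetting rcs0 after gpu hang",
    "Feb 21 07:45:38 titan kernel: i915 0000:00:02.0: Resetting rcs0 after gpu hang",
    "Feb 21 07:45:46 titan kernel: i915 0000:00:02.0: Resetting rcs0 after gpu hang",
    "Feb 21 07:45:54 titan kernel: i915 0000:00:02.0: Resetting rcs0 after gpu hang",
    "Feb 21 07:46:02 titan kernel: i915 0000:00:02.0: Resetting rcs0 after gpu hang",
]

print(reset_intervals(SAMPLE))  # prints [8.0, 8.0, 8.0, 8.0]
```

In this excerpt the resets repeat every 8 seconds until the X server dies.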


Version-Release number of selected component (if applicable):

This all started happening after a dnf update, which included:

 kernel               4.15.3-300.fc27
 kernel-core          4.15.3-300.fc27
 kernel-modules       4.15.3-300.fc27
 kernel-modules-extra 4.15.3-300.fc27
 mesa-dri-drivers     17.3.4-1.fc27
 mesa-filesystem      17.3.4-1.fc27
 mesa-libEGL          17.3.4-1.fc27
 mesa-libGL           17.3.4-1.fc27
 mesa-libOpenCL       17.3.4-1.fc27
 mesa-libgbm          17.3.4-1.fc27
 mesa-libglapi        17.3.4-1.fc27
 mesa-libwayland-egl  17.3.4-1.fc27
 mesa-libxatracker    17.3.4-1.fc27
 xorg-x11-server-Xorg   1.19.6-3.fc27
 xorg-x11-server-common 1.19.6-3.fc27

How reproducible:
Easily.

Steps to Reproduce:
1. start emacs
2. try to copy/paste in a buffer

Note: this is with KDE spin of F27.

Actual results:
The screen freezes, then after several seconds X crashes and dumps me back at the login screen.

Expected results:
No crash

Additional info:

Comment 1 Robert Story 2018-02-21 16:33:08 UTC

I spent some more time trying to reproduce this, and came up with a sequence of events that causes the GPU hang every time:

1) start emacs
2) open a file with at least 4 'pages' of data, where a 'page' is the number of lines that your current emacs window displays. For testing I created a file of 500 80-character lines and tested with emacs window sizes of 33 and 96.
3) press page down once
4) press ctrl-space to start selection
5) press page down twice to select 2 'pages'
6) press ctrl-w to 'cut' selection
7) press up arrow to scroll up one line

At this point my screen hangs and /var/log/messages reports:

Feb 21 11:24:09 titan kernel: i915 0000:00:02.0: Resetting rcs0 after gpu hang
Feb 21 11:24:17 titan kernel: i915 0000:00:02.0: Resetting rcs0 after gpu hang
Feb 21 11:24:25 titan kernel: i915 0000:00:02.0: Resetting rcs0 after gpu hang
...
Feb 21 11:24:41 titan kernel: i915 0000:00:02.0: Resetting rcs0 after gpu hang
Feb 21 11:24:41 titan at-spi-bus-launcher[812]: XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
Feb 21 11:24:41 titan at-spi-bus-launcher[812]:      after 54 requests (54 known processed) with 0 events remaining.
and X crashes.
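
A test file like the one in step 2 can be generated with a short script. This is only a sketch: the report does not say what the lines contained, so the line content and the file name `gpu-hang-test.txt` are assumptions chosen for illustration.

```python
def write_test_file(path, lines=500, width=80):
    """Write `lines` lines of exactly `width` characters each (newline excluded)."""
    with open(path, "w", encoding="ascii") as f:
        for i in range(1, lines + 1):
            # Hypothetical content: a numbered prefix padded with filler
            # so that every line has the same length.
            prefix = f"line {i:04d} "
            f.write(prefix + "x" * (width - len(prefix)) + "\n")
    return path


if __name__ == "__main__":
    # Hypothetical file name; open the result in emacs and follow steps 3-7.
    write_test_file("gpu-hang-test.txt")
```

The numbered prefix makes it easy to see which 'page' is on screen while paging down.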

Comment 2 Laura Abbott 2018-02-21 16:35:15 UTC

Moving to the graphics team for tracking

Comment 3 Mat Martineau 2018-02-22 19:54:28 UTC

I have also experienced this bug, initially with the 4.15.3-300.fc27.x86_64 kernel.

Using Robert's steps, I can also reproduce the bug using 4.14.18-300.fc27.x86_64

I'm using i915 graphics on this CPU:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 94
model name	: Intel(R) Core(TM) i7-6770HQ CPU @ 2.60GHz
stepping	: 3
microcode	: 0xc2

Comment 4 Chris Adams 2018-02-22 20:50:05 UTC

I'm hitting this on an i7-6600U - it seems to hit me when I run lynx in an xterm and scroll rapidly (I don't have quite as reliable of a test case as above).  I have tried updating to the latest packages from F27 updates-testing and still have the same issue.

Comment 5 Mat Martineau 2018-02-22 21:34:25 UTC

Updating to mesa 17.3.5-1.fc27 did not fix the problem, but reverting to mesa 17.2.4-3.fc27 did.

Comment 6 Robert Story 2018-02-23 16:00:29 UTC

Upstream is asking if someone could try Mesa 18.0.0.rc4. I can't, but if anyone else can, please post results upstream (or here and I'll share upstream).

Comment 7 Chris Adams 2018-02-24 22:37:28 UTC

I grabbed mesa-18.0.0-0.1.rc4.fc28 from koji and rebuilt it for F27 in mock. It does appear to have fixed the lockups for me.

Comment 8 Clark Williams 2018-03-01 19:26:59 UTC

*** Bug 1550679 has been marked as a duplicate of this bug. ***

Comment 9 Roberto Ragusa 2018-11-05 10:38:35 UTC

Started happening to me on F27 after upgrading

from kernel-4.18.9-100.fc27.x86_64
  to kernel-4.18.16-100.fc27.x86_64

while keeping mesa at mesa-dri-drivers-17.3.9-1.fc27.

This was quite easily triggered by running an ancient executable of dubious quality (it apparently takes some kind of X locks while doing disk or network operations). In any case, crashing the X server only started to happen with the new kernel. After reverting to 4.18.9 the problem has not happened again (let's hope it will not...).

Fully updated F27 on a Lenovo P50.
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)

Nov  5 09:14:19 localhost kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Nov  5 09:14:27 localhost kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Nov  5 09:14:35 localhost kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Nov  5 09:14:43 localhost kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Nov  5 09:14:46 localhost kernel: asynchronous wait on fence i915:kwin_x11[168968]/1:ac1b timed out
Nov  5 09:14:51 localhost kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Nov  5 09:14:51 localhost kdeinit5[171604]: The X11 connection broke (error 1). Did the X11 server die?

Comment 10 Roberto Ragusa 2018-11-12 18:34:10 UTC

After more investigation, it doesn't depend on kernel versions.
I have an ancient closed source x11 app that is doing some CPU intensive work in the wrong place (taking locks or something similar).
This can apparently reliably trigger the GPU hang detection and crash the X session.
I'm now running this shameful app inside a nested Xephyr and I get no crashes anymore.
Of course, even if the app is bad, the X11 server should not crash so easily.

Comment 11 Ben Cotton 2018-11-27 13:35:04 UTC

This message is a reminder that Fedora 27 is nearing its end of life.
On 2018-Nov-30  Fedora will stop maintaining and issuing updates for
Fedora 27. It is Fedora's policy to close all bug reports from releases
that are no longer maintained. At that time this bug will be closed as
EOL if it remains open with a Fedora  'version' of '27'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 27 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 12 Ben Cotton 2018-11-30 23:44:06 UTC

Fedora 27 changed to end-of-life (EOL) status on 2018-11-30. Fedora 27 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.