684097 – [Sandybridge] Frequent GPU hangs (hangcheck timer elapsed)

Summary: [Sandybridge] Frequent GPU hangs (hangcheck timer elapsed)

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	xorg-x11-drv-intel
Sub Component:
Version:	rawhide
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Adam Jackson
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	696798
Depends On:
Blocks:

Reported:	2011-03-11 05:36 UTC by Amit Shah
Modified:	2018-04-11 16:16 UTC
CC List:	23 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2011-07-11 20:27:10 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
dmesg dump (72.92 KB, text/plain) 2011-04-02 11:22 UTC, Andy Lawrence	no flags
xorg.0.log (31.39 KB, text/plain) 2011-04-02 11:23 UTC, Andy Lawrence	no flags
messages dump (1.76 MB, text/plain) 2011-04-02 11:24 UTC, Andy Lawrence	no flags
SNB GPU hang fix, part 1 (2.20 KB, patch) 2011-06-24 05:06 UTC, Alex W. Jackson	no flags
SNB GPU hang fix, part 2 (1.61 KB, patch) 2011-06-24 05:08 UTC, Alex W. Jackson	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Linux Kernel	28882	0	None	None	None	Never

Description Amit Shah 2011-03-11 05:36:36 UTC

Description of problem:

I'm seeing this in my dmesg:


[95566.456090] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[95566.460453] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 (awaiting 276360 at 276359, next 276363)
[95568.716017] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[95569.225052] [drm:i915_reset] *ERROR* Failed to reset chip.

The X session just dies, though I can switch to VTs.
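
In the log above, the i915_do_wait_request line records the sequence number being waited for ("awaiting") and the last one the GPU reached ("at"). A minimal sketch for pulling those numbers out of dmesg output follows. The helper name seqno_gap is made up for illustration, and the message format is assumed to match the lines quoted in this report.

```shell
# seqno_gap: hypothetical helper, not part of any i915 tooling.
# Reads kernel log lines on stdin and, for each i915_do_wait_request error,
# prints "<awaiting> <at> <gap>", where gap = awaiting - at.
seqno_gap() {
    sed -n 's/.*i915_do_wait_request returns -\{0,1\}[0-9]* (awaiting \([0-9]*\) at \([0-9]*\),.*/\1 \2/p' |
        awk '{ print $1, $2, $1 - $2 }'
}

# Typical use on a live system (dmesg may need root on some setups):
#   dmesg | seqno_gap
```

A small gap only shows how far behind the GPU was when the wait gave up; it does not by itself say which component is at fault.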

I've put in some information here:

https://bugzilla.kernel.org/show_bug.cgi?id=28882#c4

Chris Wilson mentioned it's a userspace bug in a follow-up comment in that bug report.

I saw this with the kernel-2.6.38-0.rc6.git6.1.fc15.x86_64 kernel on an F14 userspace.  I've not seen this yet with kernel-2.6.38-0.rc8.git0.1.fc15.x86_64.

This bug has triggered thrice: once when mplayer was starting, once when totem was starting, once when I closed the laptop lid (which is set to just blank the screen).

H/w details:
00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07) (prog-if 00 [VGA controller])
    Subsystem: Lenovo Device 20e4
    Flags: bus master, fast devsel, latency 0, IRQ 47
    Memory at f2000000 (64-bit, non-prefetchable) [size=4M]
    Memory at d0000000 (64-bit, prefetchable) [size=256M]
    I/O ports at 1800 [size=8]
    Expansion ROM at <unassigned> [disabled]
    Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
    Capabilities: [d0] Power Management version 3
    Kernel driver in use: i915
    Kernel modules: i915

Comment 1 Matěj Cepl 2011-03-17 14:29:16 UTC

Thanks for the bug report.  We have reviewed the information you have provided above, and there is some additional information we require that will be helpful in our diagnosis of this issue.

Please add drm.debug=0x04 to the kernel command line, restart computer, and attach

* your X server config file (/etc/X11/xorg.conf, if available),
* X server log file (/var/log/Xorg.*.log)
* output of the dmesg command, and
* system log (/var/log/messages)

to the bug report as individual uncompressed file attachments using the bugzilla file attachment link above.

We will review this issue again once you've had a chance to attach this information.

Thanks in advance.
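
The request above could be scripted roughly as follows. This is only a sketch: the helper names and the output directory are made up for illustration, and the log paths are the ones listed in the comment.

```shell
# has_drm_debug: hypothetical helper. Succeeds if a drm.debug=... option is
# present on the kernel command line. $1 = file to read (default /proc/cmdline).
has_drm_debug() {
    grep -q 'drm\.debug=' "${1:-/proc/cmdline}"
}

# collect_gpu_logs: hypothetical helper. Copies the requested files into the
# directory given as $1, skipping any that are missing or unreadable.
collect_gpu_logs() {
    out=$1
    mkdir -p "$out" || return 1
    for f in /etc/X11/xorg.conf /var/log/Xorg.*.log /var/log/messages; do
        [ -r "$f" ] && cp "$f" "$out/"
    done
    # dmesg may be restricted to root on some systems; keep going if it fails.
    dmesg > "$out/dmesg.txt" 2>/dev/null || true
}

# Typical use after rebooting with drm.debug=0x04:
#   has_drm_debug && collect_gpu_logs "$HOME/bug684097-logs"
```

The files are copied uncompressed so they can be attached individually, as the comment asks.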

Comment 2 Amit Shah 2011-03-24 10:53:47 UTC

I've not seen this since the move to 2.6.38-rc8.  So it looks like this was a kernel issue that got fixed before rc8.

I'll re-open if I do see it again.

Comment 3 Peter Lemenkov 2011-03-30 11:58:44 UTC

I was just hit by this bug so it seems that it's back with the kernel-2.6.38.2-8.fc15.i686 package.

Downgrading to the previous kernel (kernel-2.6.38.1-6.fc15.i686) solves it.

Mar 29 16:58:16 work kernel: [    2.245334] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
Mar 30 12:18:51 work kernel: [75280.628024] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Mar 30 12:18:51 work kernel: [75280.630762] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 (awaiting 7550520 at 7550508, next 7550521)
Mar 30 12:18:54 work kernel: [75282.768014] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Mar 30 12:18:54 work kernel: [75282.769145] [drm:i915_reset] *ERROR* Failed to reset chip.

Comment 4 Amit Shah 2011-03-30 12:07:00 UTC

Re-opening.  When does it happen for you?  Any sure way of reproducing?

Comment 5 Stanislav Ochotnicky 2011-03-30 12:56:52 UTC

FYI I was able to reproduce this on vanilla 2.6.38 but not 2.6.38.2 (so far at least). This was not on Fedora, but I believe that's irrelevant here.

Comment 6 Peter Lemenkov 2011-03-30 14:40:01 UTC

Unfortunately, downgrading the kernel doesn't help. I was just hit by it again.

(In reply to comment #4)
> Re-opening.  When does it happen for you?  Any sure way of reproducing?

So far I don't know how to reproduce it. I'll try to downgrade mesa and see what happens.

Comment 7 Jason D. Clinton 2011-03-30 18:19:29 UTC

The same thing is happening to me. It is especially bad with GL-on-GL-compositor (like running a Clutter app inside GNOME Shell).

Comment 8 Andy Lawrence 2011-04-02 11:20:51 UTC

This is 100% reproducible on my Lenovo T520 Sandy Bridge.  Each time I move the mouse to the upper left in Gnome Shell it happens.  F15 all updates as of 4-1-2011.


kernel: [  329.203437] [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... blt ring idle [waiting on 72182, at 72182], missed IRQ?

I will upload the log files.

Comment 9 Andy Lawrence 2011-04-02 11:22:07 UTC

Created attachment 489571
dmesg dump

Comment 10 Andy Lawrence 2011-04-02 11:23:02 UTC

Created attachment 489572
xorg.0.log

Comment 11 Andy Lawrence 2011-04-02 11:24:03 UTC

Created attachment 489573
messages dump

Comment 12 Andy Lawrence 2011-04-02 11:26:58 UTC

Sorry, ^ Linux ace 2.6.38.2-10.fc15.x86_64 #1 SMP Thu Mar 31 03:11:25 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

Comment 13 Matěj Cepl 2011-04-15 22:13:29 UTC

*** Bug 696798 has been marked as a duplicate of this bug. ***

Comment 14 Jason D. Clinton 2011-05-03 15:18:55 UTC

No change with 2.6.38.5.

Comment 15 Amit Shah 2011-05-03 15:22:18 UTC

2.6.38.2-14.fc15.x86_64 was fine for me for a long time.  I then updated to 2.6.38.5-22.fc15.x86_64 and got this within 5 mins of booting.  I've now dropped to 2.6.38.4-20.fc15.x86_64, this one's working fine for a few hours now.

Comment 16 Peter Lemenkov 2011-05-05 09:41:08 UTC

(In reply to comment #15)
> 2.6.38.2-14.fc15.x86_64 was fine for me for a long time.  I then updated to
> 2.6.38.5-22.fc15.x86_64 and got this within 5 mins of booting.  I've now
> dropped to 2.6.38.4-20.fc15.x86_64, this one's working fine for a few hours
> now.

Unfortunately I can't confirm that any of the kernels listed above work for me. I'm still experiencing infrequent GPU hangs irrespective of kernel version.

FYI, I found that the GPU dies quite often when a window pops up (for example, popup windows in Transmission or Gajim). I've even stopped looking at contacts' details in Gajim :). I hope this helps in fixing this annoying issue.

Comment 17 Amit Shah 2011-05-05 10:39:55 UTC

Yes, it is a userspace bug (and this report is filed against the xorg-x11-drv-intel component).  However, for me, it looks like some kernels trigger it and some don't.  2.6.38.4 triggered this one as well after a while.  I'm back to 2.6.38.2.

Comment 18 Jason D. Clinton 2011-05-12 02:59:43 UTC

This issue is fixed in 2.6.39-rc6 and -rc7 from rawhide. (Although that kernel adds a new problem.)

Comment 19 Andy Lawrence 2011-05-15 22:21:05 UTC

With vanilla 2.6.39-rc7 (git pull from yesterday), this is still happening for me.

Comment 20 Jeremy Fitzhardinge 2011-05-16 18:54:30 UTC

I'm seeing this message with F15 beta, kernel kernel-2.6.38.6-26.rc1.fc15.x86_64

Mostly I see it when running Minecraft, but I just got it in normal gnome-shell use.

Comment 21 Reinhard 2011-05-22 12:54:42 UTC

Me too:
2.6.38.6-27.fc15.x86_64 kernel

Perhaps
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/761065
is related. They talk about a semaphore issue.

Comment 22 Andy Lawrence 2011-05-22 18:28:10 UTC

Doing the following seems to fix, prevent, or at least work around the issue for me.

As root:

echo 1 > /sys/module/i915/parameters/semaphores
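
A minimal sketch of checking the current value before changing it, and of keeping the setting across reboots. The helper name i915_param and the file name i915-semaphores.conf are assumptions made for illustration. `options i915 semaphores=1` is the usual modprobe.d form for a module parameter, and it only has an effect on kernels whose i915 module actually exposes `semaphores`.

```shell
# i915_param: hypothetical helper. Prints the current value of an i915 module
# parameter, or fails if the running kernel does not expose it.
# $1 = parameter name, $2 = sysfs module directory (defaults to the real one).
i915_param() {
    p="${2:-/sys/module/i915}/parameters/$1"
    [ -r "$p" ] && cat "$p"
}

# Show the current setting:
#   i915_param semaphores
#
# As root, persist it across reboots (file name chosen for illustration):
#   echo 'options i915 semaphores=1' > /etc/modprobe.d/i915-semaphores.conf
```

If i915 is loaded from the initramfs, the modprobe.d file may not be read early enough; passing i915.semaphores=1 on the kernel command line is the other standard way to set a module parameter.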

Comment 23 Laurent Aguerreche 2011-05-22 20:58:38 UTC

This workaround doesn't entirely fix the problem. Overall I get much better performance. However, I can now encounter bigger hangs in OpenGL apps.

I tried the Bullet demos. When using the program AppAllBulletDemos, switching to another demo takes a very long time (from a few seconds to minutes). The first time I saw that, I thought it was a hard lock and rebooted the computer!

When performance gets very bad, I see these messages from the kernel:


[  179.786033] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[  179.786059] [drm:kick_ring] *ERROR* Kicking stuck semaphore on render ring
[  186.139002] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[  186.139027] [drm:kick_ring] *ERROR* Kicking stuck semaphore on render ring
[  192.228321] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[  192.228345] [drm:kick_ring] *ERROR* Kicking stuck semaphore on render ring

and so on...


So the semaphore trick is related to the problem, but there is something else to take into account.
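
To see how often each kind of kick happens, the kick_ring messages can be tallied. This is a minimal sketch: kick_summary is a made-up name, and the message format is assumed to match the kick_ring lines quoted in this report.

```shell
# kick_summary: hypothetical helper. Reads kernel log lines on stdin and prints
# one line per distinct kick_ring message, in the form "<count> <message>".
kick_summary() {
    sed -n 's/.*\[drm:kick_ring\] \*ERROR\* \(Kicking stuck .*\)$/\1/p' |
        sort | uniq -c |
        awk '{ n = $1; $1 = ""; sub(/^ /, ""); print n, $0 }'
}

# Typical use on a live system:
#   dmesg | kick_summary
```

Lines from other drm functions, such as i915_hangcheck_elapsed, are ignored, so the counts cover kicks only and not every hangcheck event.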

Comment 24 Jeremy Fitzhardinge 2011-05-23 14:17:53 UTC

I tried "echo 1 > /sys/module/i915/parameters/semaphores" and it seemed to help for a while.  But then I got a total system lockup, which required a power-cycle reboot to recover from.  I had never had one of those before.

Comment 25 Laurent Aguerreche 2011-05-23 15:07:23 UTC

Jeremy, maybe it was not a system lockup but a very long hang like the ones I had. On my machine, I had to wait anywhere from a few seconds to several minutes (5 minutes?).

As I said, the program AppAllBulletDemos of Bullet demos shows this problem very well! http://bulletphysics.org/

Comment 26 Jeremy Fitzhardinge 2011-05-23 16:23:20 UTC

Perhaps; I only waited about 10-15 secs.  But it was a complete lockup: no pointer movement, no ping response, no Caps Lock toggle, no magic-sysrq, sound playback buffer looping.

Comment 27 Jason Haar 2011-05-23 19:44:11 UTC

I am seeing exactly this on my fully patched F15-beta2 system (i.e. it's more F15 than F15-beta2).

It's a Dell Latitude E6320 laptop with Sandybridge and every few days it totally locks up. I have walked up to it first thing in the morning to discover it's hung, so this is no "freezes for 10 minutes" thing. It's dead.

This is with 2.6.38.6-27.fc15.x86_64

Comment 28 Björn Ruberg 2011-06-11 13:45:08 UTC

I have it on a Dell D630 (i965 chipset, not Sandybridge) running the current Fedora 15 with kernel 2.6.38.7-30.fc15.x86_64.
After some time the GUI gets completely unresponsive. It can go away again, but often I just restart. Very annoying.

The Ubuntu bug report has a hint to activate semaphores in the intel driver to solve this. I'll try it.

Comment 29 Jason D. Clinton 2011-06-11 15:26:50 UTC

(In reply to comment #28)
> After some time the GUI gets completly unresponsive. It can go away again, but
> often I just restart. Very annoying.

That is not the same bug as this one.

Comment 30 Anthony Horton 2011-06-14 12:37:54 UTC

I'm seeing the same issue with fully updated F15 on a Dell E6320 (Sandybridge), i.e. kernel 2.6.38.7-30.fc15.x86_64, libdrm-2.4.26-1.fc15.x86_64, etc.

I can repeatably and consistently produce serious hanging by attempting to run some OpenGL apps, for example Oolite (http://www.oolite.org) renders both itself and the system unusable.  The suggested workaround simply turns a hard lockup into repeated lockups of several seconds at a time.

The Fedora version for this bug should be changed to 15 as it is definitely still present.

Comment 31 Alex W. Jackson 2011-06-19 02:44:35 UTC

This kernel patch is supposed to fix the issue:

https://patchwork.kernel.org/patch/879532/

Can we get it backported into F15 asap?

Comment 32 Alex W. Jackson 2011-06-19 02:59:17 UTC

Whoops, that appears to be a slightly outdated version of the patch.

The patch we want is 498e720b96379d8ee9c294950a01534a73defcf3 in Linus' git tree.

Comment 33 Alex W. Jackson 2011-06-24 05:05:07 UTC

I've compiled and am currently running a custom kernel based on 2.6.38.8-32.fc15 with the following two patches applied, and I get no more random GPU hangs. Please consider adding both patches to the next kernel update for F15.

Comment 34 Alex W. Jackson 2011-06-24 05:06:40 UTC

Created attachment 509674
SNB GPU hang fix, part 1

This patch is the above-mentioned 498e720b96379d8ee9c294950a01534a73defcf3.

Comment 35 Alex W. Jackson 2011-06-24 05:08:33 UTC

Created attachment 509675
SNB GPU hang fix, part 2

This patch is from the upstream bug report here: https://bugs.freedesktop.org/show_bug.cgi?id=38529

Comment 36 ranjith ruban 2011-06-28 10:49:06 UTC

I am getting the same errors on a Dell Latitude D630:

Jun 27 18:15:03  kernel: [12443.368076] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Jun 27 18:15:03  kernel: [12443.368089] [drm:kick_ring] *ERROR* Kicking stuck wait on render ring

If I lock the screen and then try to log in after around 10 minutes, the machine is hung and I have to do a hard reboot. This happens with the 2.6.38.8-32.fc15.x86_64 and 2.6.38.7-30.fc15.x86_64 kernels.

For now I have downgraded to kernel
2.6.38.6-27.fc15.x86_64

and the problem does not occur there. Can the above patches be added if they are the fix?

Regards

Ranjith

Comment 37 Philip Allison 2011-07-01 10:52:08 UTC

+1 for adding these patches in please.  I had intermittent hanging on a Dell Optiplex 790, reliably triggered by opening the activities overview or bringing up the notification area in GNOME Shell.  I'm now running a custom kernel (2.6.38.8-32.fc15 plus these two patches) and the problem is gone.

Comment 38 Chuck Ebbert 2011-07-11 20:27:10 UTC

Hmm, bug filed against F14, reporting errors in various F15 and rawhide kernels.

Comment 39 ranjith ruban 2011-09-17 13:48:44 UTC

Linux version 2.6.40.4-5.fc15.x86_64 

Sep 17 18:52:55 rruban kernel: [14580.168147] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Sep 17 18:52:55 rruban kernel: [14580.168160] [drm:kick_ring] *ERROR* Kicking stuck wait on render ring

It still hangs. Is there a fix yet? Can you tell me which kernel has this fix for F15?

Thanks 

Ranjith

Comment 40 Andrey Arapov 2011-10-12 08:04:57 UTC

I've got the same problem.
Every time I play Minecraft and pause the game to browse the internet or do something else, X randomly just freezes. Restarting GDM doesn't help.


Oct 12 09:52:50 arno-ThinkPad-T400 kernel: [ 2689.904090] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Oct 12 09:52:50 arno-ThinkPad-T400 kernel: [ 2689.908614] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -11 (awaiting 4288918 at 4288899, next 4288929)
Oct 12 09:52:52 arno-ThinkPad-T400 kernel: [ 2691.956013] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Oct 12 09:52:52 arno-ThinkPad-T400 kernel: [ 2692.464076] [drm:i915_reset] *ERROR* Failed to reset chip.
Oct 12 09:52:59 arno-ThinkPad-T400 kernel: Kernel logging (proc) stopped.

_____________

arno-ThinkPad-T400 ~ # lsb_release -a
No LSB modules are available.
Distributor ID:	LinuxMint
Description:	Linux Mint 11 Katya
Release:	11
Codename:	katya
arno-ThinkPad-T400 ~ # uname -a
Linux arno-ThinkPad-T400 2.6.38-11-generic-pae #50-Ubuntu SMP Mon Sep 12 22:21:04 UTC 2011 i686 i686 i386 GNU/Linux


Minecraft downloaded from Minecraft.net (paid version)

Running with: java -Xmx1024M -Xms512M -cp minecraft.jar net.minecraft.LauncherFrame >/dev/null 2>&1 &


_____________


arno-ThinkPad-T400 ~ # lspci |grep -i graph
00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
00:02.1 Display controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
arno-ThinkPad-T400 ~ # lspci -vvv -s 00:02.0
00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07) (prog-if 00 [VGA controller])
	Subsystem: Lenovo Device 20e4
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 46
	Region 0: Memory at f4400000 (64-bit, non-prefetchable) [size=4M]
	Region 2: Memory at d0000000 (64-bit, prefetchable) [size=256M]
	Region 4: I/O ports at 1800 [size=8]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee0300c  Data: 41a9
	Capabilities: [d0] Power Management version 3
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Kernel driver in use: i915
	Kernel modules: i915

arno-ThinkPad-T400 ~ # lspci -vvv -s 00:02.1
00:02.1 Display controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
	Subsystem: Lenovo Device 20e4
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Region 0: Memory at f4200000 (64-bit, non-prefetchable) [size=1M]
	Capabilities: [d0] Power Management version 3
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-



arno-ThinkPad-T400 ~ # lsmod |grep 915
i915                  451068  2 
drm_kms_helper         40745  1 i915
drm                   184164  3 i915,drm_kms_helper
i2c_algo_bit           13184  1 i915
video                  18951  1 i915



arno-ThinkPad-T400 ~ # dmesg |grep -Ei $(lsmod |grep 915 |awk '{print $1}' |tr '\n' '|' |sed 's/.$//g') |grep -i init
[    1.637683] [drm] Initialized drm 1.1.0 20060810
[    2.454311] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0


arno-ThinkPad-T400 ~ # dmesg |grep -Ei $(lsmod |grep 915 |awk '{print $1}' |tr '\n' '|' |sed 's/.$//g') 
[    0.526510] pci 0000:00:02.0: Boot video device
[    1.637683] [drm] Initialized drm 1.1.0 20060810
[    1.912331] i915 0000:00:02.0: power state changed by ACPI to D0
[    1.912335] i915 0000:00:02.0: power state changed by ACPI to D0
[    1.912340] i915 0000:00:02.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[    1.912345] i915 0000:00:02.0: setting latency timer to 64
[    1.944182] i915 0000:00:02.0: irq 46 for MSI/MSI-X
[    1.944188] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[    1.944190] [drm] Driver supports precise vblank timestamp query.
[    2.437758] fbcon: inteldrmfb (fb0) is primary device
[    2.437843] fb0: inteldrmfb frame buffer device
[    2.437844] drm: registered panic notifier
[    2.454144] input: Video Bus as /devices/LNXSYSTM:00/device:00/PNP0A08:00/LNXVIDEO:00/input/input4
[    2.454183] ACPI: Video Device [VID] (multi-head: yes  rom: no  post: no)
[    2.454311] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0

Comment 41 Andrey Arapov 2011-10-12 08:14:03 UTC

Seems that this is a post-2.6.37 regression. (Andrew Morton)
https://lkml.org/lkml/2011/3/2/518

Comment 42 Jeremy Fitzhardinge 2011-10-12 15:33:18 UTC

(In reply to comment #40)
> I've got the same problem.
> Everytime I play Minecraft and pause the game to browse the internet or
> something else, X randomly just freezes. GDM restart doesn't help.

I also get graphical glitches even when it's working OK.  I suspect this is actually a regression in Minecraft since 1.8, but I think the driver stack should be a little more resilient.  I often get complete system hangs after a couple of hangcheck warnings.
