859723 – 3.5.x kernel / Core i7 / pm-hibernate crashes before power down

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	17
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	kernel_hibernate

Reported:	2012-09-23 14:39 UTC by Arne Woerner
Modified:	2013-04-16 18:54 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-10-12 12:35:56 UTC
Type:	Bug
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
FreeDesktop.org	53137	0	None	None	None	2012-09-23 14:39:33 UTC

Description Arne Woerner 2012-09-23 14:39:33 UTC

Description of problem:
yesterday's kernel crashes on hibernate while GNOME is running,
on my dual-headed box: Core i7, HD Graphics only (no NVIDIA, no AMD), ASRock H61M-ITX...

Version-Release number of selected component (if applicable):
3.5.4-1.fc17.x86_64

How reproducible:
always

Steps to Reproduce:
1. freshly boot the kernel and wait for GNOME
2. # echo reboot > /sys/power/disk
3. # echo disk > /sys/power/state
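
The steps above could be wrapped in a small shell sketch that first checks whether the reboot mode is offered at all. This is only an illustration and not part of the original report: POWER_DIR and the function names are made up here, and the sketch assumes the usual sysfs layout in which /sys/power/disk marks the active mode with square brackets.

```shell
#!/bin/sh
# Sketch only: POWER_DIR is a hypothetical override so the helpers can be
# tried against a scratch directory instead of the real /sys/power.
POWER_DIR="${POWER_DIR:-/sys/power}"

# Print the hibernation mode currently selected in $POWER_DIR/disk.
# The kernel marks the active entry with square brackets, e.g.
# "platform shutdown [reboot] suspend".
current_disk_mode() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$POWER_DIR/disk"
}

# Succeed if the given mode is listed in $POWER_DIR/disk.
disk_mode_supported() {
    tr -d '[]' < "$POWER_DIR/disk" | tr ' ' '\n' | grep -qx "$1"
}

# Select reboot mode and start hibernation (needs root on a real system).
hibernate_then_reboot() {
    if ! disk_mode_supported reboot; then
        echo "reboot mode not offered by $POWER_DIR/disk" >&2
        return 1
    fi
    echo reboot > "$POWER_DIR/disk" &&
    echo disk > "$POWER_DIR/state"
}
```

Nothing runs when the file is sourced. Calling hibernate_then_reboot as root performs steps 2 and 3, and current_disk_mode can be used beforehand to note which mode was active.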
  
Actual results:
the box crashes with a blinking numlock LED and a kernel oops that mentions bad opcodes or fatal exceptions in intel_idle() or in the lzo compression functions...

Expected results:
it should just hibernate as with 3.4.6-2.fc17.x86_64...

Additional info:
1. `ls /sys/devices/system/cpu/cpu?/online | xargs -I% sh -c "echo 0 > %"` before pm-hibernate does not help
2. when i freshly boot directly to single user mode with kernel parameter i915.modeset=0, it can hibernate... :-)
3. it happens since 3.5.x...
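
The CPU-offlining one-liner in item 1 can also be written as a loop. A minimal sketch, assuming the standard sysfs CPU hotplug layout; CPU_DIR and the function names are hypothetical and exist only so the loop can be tried on a scratch directory. Unlike the cpu? pattern, the glob below also matches cpu10 and higher.

```shell
#!/bin/sh
# Sketch only: CPU_DIR is a hypothetical override so the loop can be tried
# against a scratch directory instead of the real sysfs tree.
CPU_DIR="${CPU_DIR:-/sys/devices/system/cpu}"

# Write 0 to every cpuN/online file that exists (needs root on a real
# system). CPUs without an "online" file (often cpu0 on x86) are skipped.
offline_secondary_cpus() {
    for f in "$CPU_DIR"/cpu[0-9]*/online; do
        [ -e "$f" ] || continue   # glob matched nothing
        echo 0 > "$f"
    done
}

# Print how many cpuN/online files currently read 1.
count_online_cpus() {
    n=0
    for f in "$CPU_DIR"/cpu[0-9]*/online; do
        [ -e "$f" ] || continue
        if [ "$(cat "$f")" = 1 ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

Running count_online_cpus before and after offline_secondary_cpus shows whether the secondary CPUs really went offline before hibernation is attempted.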

Comment 1 Arne Woerner 2012-09-30 15:59:30 UTC

kernel 3.4.6-2.fc17.x86_64
with 
intel-gpu-tools.x86_64                    2.20.8-1.fc17                 @updates
xorg-x11-drv-intel.x86_64                 2.20.8-1.fc17                 @updates
still can't hibernate
(this time the oops said "trying to kill the idle task" or so and
dumped the stack trace of a tcsh process)...

Comment 2 Arne Woerner 2012-10-08 13:02:41 UTC

kernel 3.5.5-2.fc17.x86_64 still bad...
is it related to bug #851739 ?

Comment 3 Arne Woerner 2012-10-10 08:02:04 UTC

kernel-3.5.6-1.fc17 still bad...

Comment 4 Bojan Smojver 2012-10-10 09:12:05 UTC

If you'd like to eliminate compression/threading issues, you can boot with hibernate=nocompress.
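
When testing with a boot parameter like this, it helps to confirm that the running kernel was really booted with it. A minimal sketch that checks the kernel command line; CMDLINE_FILE and has_boot_param are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Sketch only: CMDLINE_FILE is a hypothetical override so the check can be
# tried on a sample file instead of the real /proc/cmdline.
CMDLINE_FILE="${CMDLINE_FILE:-/proc/cmdline}"

# Succeed if the exact parameter (e.g. hibernate=nocompress) appears as a
# whole space-separated word on the kernel command line.
has_boot_param() {
    tr ' ' '\n' < "$CMDLINE_FILE" | grep -qxF -- "$1"
}
```

For example, `has_boot_param hibernate=nocompress && echo "compression disabled"` prints the message only if the parameter was passed at boot.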

Comment 5 aaronsloman 2012-10-10 23:08:16 UTC

(In reply to comment #4)
> If you'd like to eliminate compression/threading issues, you can boot with
> hibernate=nocompress.

After reading this comment by Bojan I tried that tip for Bug #862475: "Why do I need maxcpus=1 to resume from pm-hibernate in 32-bit Fedora 16 on Viglen Desktop PC, Fedora 17 on Dell E6410 laptop, both with intel core i5 cpu, intel graphics"

As reported in comment 5 in that bug the nocompress flag seemed to prevent resume crashing, but with a noticeable speed cost.

In contrast, maxcpus=1 (for resume only) had the same benefit, but without the speed cost. So that bug seems not to be caused by compression, but by multi-threaded compression.

Note: my experience is only with 32-bit linux. I am using fc17.i686 and fc16.i686.
So it is possible that the current bug is specific to 64-bit fedora, since for me pm-hibernate has powered down successfully since May on both my machines.

Comment 6 Bojan Smojver 2012-10-10 23:29:18 UTC

(In reply to comment #5)

Being the author of both compression and threading hibernation code, I can tell what changed in 3.5 (as compared to 3.4):

http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commitdiff;h=5a21d489fd9541a4a66b9a500659abaca1b19a51

Not much to do with threading - mostly about being more careful with memory allocations. Anyhow, I see from your comment that you are having trouble even with earlier kernels on thaw, so it's probably not 3.5 specific anyway.

You note in the other bug that this may be a bug related to decompression and threading (possibly even some kind of i915 interaction). This code does not interact with i915 directly. So, if there is a decompression/threading bug, it is in the hibernation decompression/threading code. If there is an i915 bug, that would be an entirely separate issue.

Without seeing some kind of screen dump or other kind of debugging info, it will be very difficult to say what could really be causing this.

Comment 7 aaronsloman 2012-10-11 00:23:10 UTC

(In reply to comment #6)

Thanks for rapid response. I suspect this should be in Bug #862475 since it concerns resume.
> (In reply to comment #5)
> 
> Being the author of both compression and threading hibernation code, I can
> tell what changed in 3.5 (as compared to 3.4):
> 
> http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commitdiff;h=5a21d489fd9541a4a66b9a500659abaca1b19a51
> 
> Not much to do with threading - mostly about being more careful with memory
> allocations. Anyhow, I see from your comment that you are having trouble
> even with earlier kernels on thaw, so it's probably not 3.5 specific anyway.

Apologies: I should have made that clear. The bug in resume became apparent only after you fixed the earlier bug that prevented hibernate completing (in May I think). After that I found hibernate totally reliable, but resume was unreliable, and tended to crash immediately after decompression -- screen going blank, and machine rebooting. The bug could not be manifested before that since hibernate did not shut down reliably.

Starting several months ago, I tried various cures for the resume problem, and for a while settled on using acpi=off for resume only.

That worked but seemed to be overkill, until someone mentioned maxcpus=1, which also works reliably, allowing resume to complete.

> You note in the other bug that this may be a bug related to decompression
> and threading (possibly even some kind of i915 interaction). This code does
> not interact with i915 directly. So, if there is a decompression/threading
> bug, it is in the hibernation decompression/threading code. If there is an
> i915 bug, that would be an entirely separate issue.

(I lack your expertise: I mentioned i915 only because it is used on both my machines with the resume bug, and i915 has often been referred to as a source of problems.)

> Without seeing some kind of screen dump or other kind of debugging info, it
> will be very difficult to say what could really be causing this.

I don't know how to get a screen dump, and I don't know if it would help: during the resume process everything proceeds normally and the decompression begins with the notification that 3 threads are being used. Everything seems to work perfectly with the percentage display increasing *very* quickly until decompression seems to be complete (I think it reaches 100%) and the screen goes blank. Then there's a pause and a full reboot starts. I presume there's nothing in any log file because logs can only be created after decompression and successful resume.

In contrast, if I use acpi=off or maxcpus=1 for resume after hibernate, the behaviour during resume is exactly the same except that only 1 thread is reported for decompression. When the 100% decompression is reached the screen still goes blank for an instant then almost immediately is restored to its previous state with an xterm window showing the pm-hibernate command.

So the crashing with multi-thread decompression seems to occur between decompression finishing and the system/screen being restored to its previous state.

This can happen even if I have not started graphic mode, i.e. boot to level 3, then login, run pm-hibernate, then restart, and resume gets to end of decompression and does not restore the previous screen but reboots. So it does not depend on whether X was running when hibernate started. (Resume sometimes succeeds without the special flag to limit the number of threads, but it's random. It crashes and reboots more often than it succeeds.)

I am not a system programmer, but I wonder whether it's possible to insert some instruction to ensure re-synchronisation of the cpus immediately after multi-threaded decompression is complete, and before the pre-hibernate state is reinstated?

Perhaps that question just displays my ignorance?

NOTE: because this discussion is about resume, not hibernate, I've copied this comment from here, Bug #859723 (about hibernate failing) to comment 6 in Bug #862475 (about resume failing), so that others looking at that bug will see this. I hope that causes no inconvenience.

Comment 8 Bojan Smojver 2012-10-11 00:29:11 UTC

(In reply to comment #7)
 
> I suspect this should be in Bug #862475 since it concerns resume.

Yes, we can continue this discussion there.

Comment 9 Arne Woerner 2012-10-11 05:49:43 UTC

this is different from that resume-crash-bug, because: it doesn't even turn off power before it crashes...

btw: that "hibernate=nocompress" trick didn't help... it still crashed and the oops said something about "invalid op"... seems to be a memory management problem...

Comment 10 Bojan Smojver 2012-10-11 05:56:50 UTC

(In reply to comment #9)
> this is different from that resume-crash-bug, because: it doesn't even turn
> off power before it crashes...
> 
> btw: that "hibernate=nocompress" trick didn't help... it still crashed and
> the oops said something about "invalid op"... seems to be a memory management
> problem...

OK, thanks. This is then most likely a problem in another part of the kernel, unrelated to hibernation. When hibernate=nocompress is passed to the kernel, no threading or compression is used at all (literally, old functions are being called, instead of the new ones).

Comment 11 Arne Woerner 2012-10-11 06:02:16 UTC

how can it be unrelated to hibernation when it happens only during hibernation?
it is not like it crashes every minute... :-)
and it can hibernate when i boot to single user mode (no GNOME)...
is it the "intel" driver again? "intel"linuxgrafix refuses to look at it, because i cant give them the full kernel oops...

Comment 12 Arne Woerner 2012-10-12 12:35:56 UTC

3.6.1-1.fc17.x86_64 hibernates nicely again... :-) *yay*
dunno why...
