Bug 788433 (thaw)

Summary:	Core i7 cannot pm-hibernate/pm-suspend/thaw properly
Product:	[Fedora] Fedora
Reporter:	Arne Woerner <arne_woerner>
Component:	kernel
Assignee:	Kernel Maintainer List <kernel-maint>
Status:	CLOSED WONTFIX
QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	high
Priority:	unspecified
Version:	17
CC:	a.sloman, bug, burghardt, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, marmalodak, maurizio.antillon, mishu
Keywords:	Reopened
Hardware:	x86_64
OS:	Linux
Doc Type:	Bug Fix
Last Closed:	2013-08-01 17:19:33 UTC
Bug Blocks:	781749

Description Arne Woerner 2012-02-08 08:26:56 UTC

Description of problem:
When I try to suspend after a fresh boot it suspends fine.
But when I thaw the box, I get a lot of weird problems that don't happen when I don't suspend...

Version-Release number of selected component (if applicable):
3.2.3-2.fc16.x86_64

How reproducible:
every time

Steps to Reproduce:
1. suspend
2. thaw
3. use it some time
  
Actual results:
(a) it complains about "list_del corruption. prev->next should be [...] but was [...]"
(b) can't log out of a GNOME session
(c) it just crashes with a kernel panic
(d) it can't suspend again
(e) it can't reboot
(when b-e happen, the numlock light won't change when i press numlock)

Expected results:
it should work after thaw like before suspend...

Additional info:
i read it might be kernel mode setting (KMS) related...
https://bugs.freedesktop.org/show_bug.cgi?id=40241

Comment 1 Arne Woerner 2012-02-11 15:37:36 UTC

3.2.5-3.fc16.x86_64 has this bug, too:
i just did a "find /sys | grep fan" after 2 otherwise successful hibernate/thaw cycles, and my GNOME crashed (b&w text mode with panic messages) and i had to reboot...
-arne

Comment 2 Arne Woerner 2012-03-02 20:57:26 UTC

3.2.7-1.fc16.x86_64 still allows that bug:
kernel BUG at fs/inode.c:429!
invalid opcode: 0000 [#1] SMP
CPU 1
Pid: 26157, comm: crond Tainted: G          I  3.2.7-1.fc16.x86_64 #1
[...]

after i changed the following:
1. hdd (WDC WD10EARS-00Y5B1) write cache disabled 10+ seconds before pm-hibernate call (my theory was that the harddisc doesnt flush)
2. HIBERNATE_MODE="shutdown" (before it was "platform")
3. HIBERNATE_RESUME_POST_VIDEO="yes" (before it was commented out)
i was able to thaw 7 times without intermediate reboot/panic,
which is much more than before (about every 3rd thaw crashed immediately and the others within some hours).
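
A rough sketch of how the three changes above could be written down, assuming pm-utils picks up overrides from /etc/pm/config.d/. The file name is made up for illustration, and /dev/sda is an assumption (it matches the swap partition shown later in this report, but the drive name is not stated here). It only restates the settings described in this comment and is not a confirmed fix.

```shell
# /etc/pm/config.d/hibernate-workaround   (hypothetical file name)

# Was "platform" before; "shutdown" powers the box off after the image is written.
HIBERNATE_MODE="shutdown"

# Was commented out before.
HIBERNATE_RESUME_POST_VIDEO="yes"

# Not a pm-utils setting: run as root 10+ seconds before pm-hibernate
# to switch the drive's write cache off (device name is an assumption):
#   hdparm -W0 /dev/sda
```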

w00t

-arne

Comment 3 Arne Woerner 2012-03-09 07:13:36 UTC

hum

i forgot to mention, that i also emptied the swap space shortly before hibernate:
        /sbin/swapoff LABEL=SWP1TB
        /sbin/swapon -a

when i didnt, i got this again:
[39770.118546] ------------[ cut here ]------------
[39770.118552] WARNING: at lib/list_debug.c:26 __list_add+0x6d/0xa0()
[39770.118554] Hardware name: To Be Filled By O.E.M. 
[39770.118556] list_add corruption. next->prev should be prev (ffff88022a04cbf8), but was           (null). (next=ffff88022a04cbf8).
[39770.118557] Modules linked in: tcp_lp bnep bluetooth rfkill ppdev parport_pc lp parport fuse nfs fscache auth_rpcgss nfs_acl lockd nf_conntrack_tftp ipt_LOG nf_conntrack_ipv4 ip6t_REJECT nf_defrag_ipv4 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables coretemp w83627ehf hwmon_vid snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore i2c_i801 snd_page_alloc r8169 iTCO_wdt iTCO_vendor_support mii cdc_acm microcode sunrpc uinput i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
[39770.118589] Pid: 455, comm: udevd Tainted: G          I  3.2.9-1.fc16.x86_64 #1    
[39770.118591] Call Trace:
[39770.118597]  [<ffffffff8106e53f>] warn_slowpath_common+0x7f/0xc0
[39770.118599]  [<ffffffff8106e636>] warn_slowpath_fmt+0x46/0x50
[39770.118602]  [<ffffffff812ca9bd>] __list_add+0x6d/0xa0
[39770.118605]  [<ffffffff8118ece8>] __d_instantiate+0x58/0xe0
[39770.118607]  [<ffffffff8118ef27>] d_instantiate+0x47/0x80
[39770.118610]  [<ffffffff811ec473>] sysfs_lookup+0xf3/0x110
[39770.118613]  [<ffffffff811845d5>] d_alloc_and_lookup+0x45/0x90 
[39770.118615]  [<ffffffff81190f05>] ? d_lookup+0x35/0x60
[39770.118618]  [<ffffffff81186ba1>] do_lookup+0x2b1/0x3a0
[39770.118620]  [<ffffffff81186ff1>] link_path_walk+0x141/0x880
[39770.118624]  [<ffffffff8116562c>] ? kmem_cache_alloc_trace+0x10c/0x140
[39770.118627]  [<ffffffff8126ea2a>] ? selinux_file_alloc_security+0x4a/0x80
[39770.118630]  [<ffffffff81185b1d>] ? path_init+0x2cd/0x3a0
[39770.118633]  [<ffffffff81188fa8>] path_openat+0xb8/0x3c0
[39770.118636]  [<ffffffff811b5eab>] ? fsnotify_put_event+0x5b/0xa0 
[39770.118639]  [<ffffffff811893d2>] do_filp_open+0x42/0xa0
[39770.118641]  [<ffffffff81184aeb>] ? getname_flags+0x3b/0x260
[39770.118644]  [<ffffffff8119510f>] ? alloc_fd+0x4f/0x150
[39770.118647]  [<ffffffff81178c57>] do_sys_open+0xf7/0x1d0
[39770.118650]  [<ffffffff81178d50>] sys_open+0x20/0x30
[39770.118653]  [<ffffffff815eaac2>] system_call_fastpath+0x16/0x1b
[39770.118655] ---[ end trace a7919e7f17c0a727 ]---

when i did it again last night, there was no oops...

i commented out the HIBERNATE_RESUME_POST_VIDEO line again...

-arne

Comment 4 Dave Jones 2012-03-09 15:18:17 UTC

that trace is very interesting. We've had other reports of it, and I suspected they might be hibernate related.

The finger of blame is pointing at i915 right now, as many people have noted that booting with nomodeset makes their hibernate problems go away.

Comment 5 Arne Woerner 2012-03-09 16:50:26 UTC

yup
but why does swapoff+swapon change anything here then?
or is it just coincidental, because the bug doesn't happen every time?

Comment 6 Dave Jones 2012-03-09 18:40:14 UTC

if you could run the kernel-debug build at
http://koji.fedoraproject.org/koji/buildinfo?buildID=304798 that might turn up
a different trace that might be helpful to us to track this down.

(it's going to be considerably slower than the regular build, due to the extra
checking).

Comment 7 Arne Woerner 2012-03-09 19:46:24 UTC

i installed that kernel-debug package...
but i cant reboot before tomorrow (ongoing tv recording)... :-)
as far as i can c the cores r rather idle...

in next night i will test if my hard disc write cache is flushed properly before the disc is powered down... :-)

could it be, that "nomodeset" is the cause for an empty swap area?
on my box the swap space is almost unused (currently just 60KiB), because it has 8GiB main mem... but it seems to be important that it is 100% unused, when hibernation begins...

could it be that my swap area is too big, so that it writes to the wrong parts?
# swapon -s
Filename				Type		Size	Used	Priority
/dev/sda3                               partition	8388604	60	0
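
To check whether swap really is 100% unused right before hibernating, the "Used" column can be summed instead of read by eye. This is only a sketch: the function name `swap_used_kib` is invented here, and it assumes the five-column layout shown above (the same layout as /proc/swaps).

```shell
#!/bin/sh
# Print the total swap in use (KiB), reading `swapon -s` style input on stdin.
swap_used_kib() {
    # Skip the header line; the 4th whitespace-separated column is "Used".
    awk 'NR > 1 { used += $4 } END { print used + 0 }'
}

# On a live system:
#   swap_used_kib < /proc/swaps
```

A result of 0 means nothing is swapped out at that moment; it says nothing about whether the hibernation image will be written correctly.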

Comment 8 Arne Woerner 2012-03-11 22:20:21 UTC

with 3.2.9-1.fc16.x86_64.debug thaw worked good (no oops/panic/crash since hours) without any trick (hdd write cache was on and swap space was not empty when hibernation started, but i used HIBERNATE_MODE "shutdown" instead of "platform")...

Comment 9 Arne Woerner 2012-03-12 20:08:14 UTC

with 3.2.9-2.fc16.x86_64.debug i could produce a bad kernel panic:
it didnt even log it to syslog and i couldnt c the head of the oops...
and i just terminated a process that was started shortly after thaw...

now i will again empty the swap space before hibernate and
c if it crashes...

Comment 10 Arne Woerner 2012-03-15 19:02:47 UTC

emptying the swap space is no workaround...
it crashed today...
now i test tuxonice from the atrpms repo...

Comment 11 Arne Woerner 2012-03-21 07:10:41 UTC

tuxonice causes crashes, too...

Comment 12 Dave Jones 2012-03-22 16:37:24 UTC

[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 15 Josh Boyer 2012-03-28 18:00:05 UTC

[Mass hibernate bug update]

Dave Airlied has found an issue causing some corruption in the i915 fbdev after a resume from hibernate.  I have included his patch in this scratch build:

http://koji.fedoraproject.org/koji/taskinfo?taskID=3940545

This will probably not solve all of the issues being tracked at the moment, but it is worth testing when the build completes.  If this seems to clear up the issues you see with hibernate, please report your results in the bug.

Comment 16 Arne Woerner 2012-03-28 20:25:51 UTC

seems to work on my box now...
no hibernate related crashes since i updated to 3.3.0-4...

Comment 17 Arne Woerner 2012-04-05 16:00:24 UTC

it did it again (after i used a non-debug kernel again *blush*):
kernel:[212214.655575] ------------[ cut here ]------------
kernel:[212214.655603] kernel BUG at mm/huge_memory.c:2394!
kernel:[212214.655624] invalid opcode: 0000 [#1] SMP 
kernel:[212214.655645] CPU 1 
kernel:[212214.655654] Modules linked in: binfmt_misc tcp_lp ppdev parport_pc lp parport fuse bnep bluetooth rfkill nfs fscache auth_rpcgss nfs_acl ip6t_REJECT nf_conntrack_tftp nf_conntrack_ipv6 nf_defrag_ipv6 ipt_LOG nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ip6table_filter ip6_tables lockd coretemp w83627ehf hwmon_vid snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd iTCO_wdt iTCO_vendor_support r8169 mii microcode soundcore snd_page_alloc i2c_i801 uinput sunrpc i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
kernel:[212214.655926] 
kernel:[212214.655936] Pid: 9468, comm: chrome Not tainted 3.3.0-8.fc16.x86_64 #1 To Be Filled By O.E.M. To Be Filled By O.E.M./H61M-ITX
kernel:[212214.655985] RIP: 0010:[<ffffffff81175aa3>]  [<ffffffff81175aa3>] __split_huge_page_pmd+0xc3/0xf0
kernel:[212214.656028] RSP: 0018:ffff88017fa2fcc8  EFLAGS: 00010282
kernel:[212214.656050] RAX: 80000000028000e7 RBX: ffff88022f7e2d80 RCX: 0000000000000000 
kernel:[212214.656079] RDX: 0000000000000001 RSI: ffff88022979ab98 RDI: 80000000028000e7 
kernel:[212214.656108] RBP: ffff88017fa2fce8 R08: ffff8801590dcff0 R09: 0000000000000100 
kernel:[212214.656136] R10: 0000000000000004 R11: 0000000000000206 R12: ffffea0001020000 
kernel:[212214.656167] R13: ffff8802255ad390 R14: ffff8802255ad390 R15: 0000000000000000 
kernel:[212214.656194] FS:  00007f78c5988980(0000) GS:ffff88023fa40000(0000) knlGS:0000000000000000
kernel:[212214.656215] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
kernel:[212214.656231] CR2: 00007f78cc92ba24 CR3: 000000010b50c000 CR4: 00000000000406e0 
kernel:[212214.656250] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
kernel:[212214.656270] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 
kernel:[212214.656290] Process chrome (pid: 9468, threadinfo ffff88017fa2e000, task ffff8802344f4590)
kernel:[212214.656311] Stack:
kernel:[212214.656318]  0000000000000008 00007f78ce48f000 00007f78ce48f000 00007f78ce49e000
kernel:[212214.656342]  ffff88017fa2fe08 ffffffff81145cda ffff88023fdece00 0000000000000000
kernel:[212214.656373]  000000007fa2fd38 ffffffff8112d0bd ffff88017fa2fd38 ffff88017fa2fe80
kernel:[212214.656404] Call Trace:
kernel:[212214.656415]  [<ffffffff81145cda>] unmap_vmas+0x8aa/0x900
kernel:[212214.656432]  [<ffffffff8112d0bd>] ? update_page_reclaim_stat+0x2d/0x70
kernel:[212214.656450]  [<ffffffff8112d87c>] ? __pagevec_lru_add+0x1c/0x20
kernel:[212214.656468]  [<ffffffff81145dd2>] zap_page_range+0xa2/0xd0
kernel:[212214.656485]  [<ffffffff815f7720>] ? do_page_fault+0x200/0x4f0
kernel:[212214.656501]  [<ffffffff81142c96>] sys_madvise+0x296/0x740
kernel:[212214.656517]  [<ffffffff815fbca9>] system_call_fastpath+0x16/0x1b
kernel:[212214.657205] Code: 89 e7 e8 41 7e fb ff 49 8b 7d 00 48 89 f8 66 66 66 90 a8 80 75 15 48 8b 5d e8 4c 8b 65 f0 4c 8b 6d f8 c9 c3 66 83 43 6c 01 eb eb <0f> 0b e8 a0 70 47 00 4c 89 e7 0f 1f 00 e8 5b 7e fb ff 84 c0 75 
kernel:[212214.658962] RIP  [<ffffffff81175aa3>] __split_huge_page_pmd+0xc3/0xf0
kernel:[212214.659895]  RSP <ffff88017fa2fcc8>
kernel:[212214.711344] HDMI hot plug event: Codec=3 Pin=7 Presence_Detect=0 ELD_Valid=1
kernel:[212214.712166] HDMI status: Codec=3 Pin=7 Presence_Detect=0 ELD_Valid=0
kernel:[212214.883850] HDMI hot plug event: Codec=3 Pin=7 Presence_Detect=1 ELD_Valid=1
kernel:[212214.884647] HDMI status: Codec=3 Pin=7 Presence_Detect=1 ELD_Valid=1
kernel:[212215.093331] ---[ end trace aaf1961c9b232428 ]---

Comment 18 Dave Jones 2012-04-09 14:03:00 UTC

3.3.0-8 still had the hibernate memory corruption bug.
Update to the latest 3.3.1 build, and see if it's still reproducible.

Comment 19 Arne Woerner 2012-04-09 15:49:38 UTC

yup - i did that already before it was in updates-testing... :-)
running 3.3.1-3.fc16.x86_64 currently...
no crash or oops with 3.3.1 until today...

btw: the cursor is still blinking when it goes to text mode... or is it turned off shortly before the power goes down? or did i misunderstand sth?

Comment 20 Josh Boyer 2012-04-09 16:34:38 UTC

(In reply to comment #19)
> yup - i did that already before it was in updates-testing... :-)
> running 3.3.1-3.fc16.x86_64 currently...
> no crash or oops with 3.3.1 until today...

Erm... but your crash in comment #17 was clearly not from 3.3.1.  So do you have another crash you haven't reported with 3.3.1?

Comment 21 Arne Woerner 2012-04-09 17:22:37 UTC

yes, my last crash was with a pre-3.3.1 kernel (3.3.0-8)...

my "yup" was to copy ur "update" counsel...
then i wanted to say, that there was _no_ crash/oops with 3.3.1 until today...

sorry - i m no native speaker...

Comment 22 Arne Woerner 2012-04-19 21:32:25 UTC

there was no hibernate/thaw related crash since i updated to 3.3.1 until today
(i hibernate 9 times per week).

Comment 23 Arne Woerner 2012-05-07 21:15:39 UTC

no hibernate/thaw related crash since i updated to 3.3.1 until today...

Comment 24 Arne Woerner 2012-05-11 06:10:42 UTC

it is back with 3.3.4-3.fc17.x86_64...
it didnt complete thaw today but rebooted automatically...

Comment 25 aaronsloman 2012-05-11 09:10:27 UTC

(In reply to comment #24)
> it is back with 3.3.4-3.fc17.x86_64...
> it didnt complete thaw today but rebooted automatically...

Your symptoms sound similar to my experiences with 32-bit fedora 16 on both a desktop PC (intel graphics and core i5) and dell latitude E6410 notebook (also intel graphics and core i5). I am currently using  3.3.4-3.fc16.i686 on both machines.

I depend a lot on pm-hibernate (previously tuxonice, but that no longer works for me). My normal mode of working is to boot up very rarely. When I do it's in level 3 (no graphics) which is useful for installing updates, or doing checking.

I then run 'startx' to invoke window manager, either openbox or ctwm (I prefer the latter though some would find it old-fashioned). Then usually at least twice a day I use pm-hibernate instead of shutdown. That way I can go for weeks or months without a reboot, and all my ten virtual desktops with various unfinished tasks keep their state. The laptop is sometimes left hibernated for several days, as I mainly use it for seminars and when travelling.

That mode of working served me well for several years with two previous dell laptops and my previous desktop PC.

But in the last couple of years there have been serious problems with hibernate, apparently related to the i915 module. One major problem recently fixed was that every now and again hibernate would fail to complete, requiring a forcible shutdown and reboot. That problem went away a month or two ago -- a very great improvement.

Since then I have had your problem on both machines -- thaw after hibernate sometimes works, but not always: in the exceptional cases it gets very near the end of resuming (shown by the percentage display) and then the screen goes blank and it reboots. There's no record in /var/log to indicate what went wrong.

The frequency with which this unwanted reboot happens seems to change with new kernel releases. A couple of times, after a kernel update, I thought the problem had gone away because hibernate/resume had worked on both machines for several days. Then one, and later the other, machine would fail to resume. The problem seems to be worse on the laptop: failure to complete thaw is more common.

I have found a workaround that I can live with, though it is a nuisance when there's a kernel upgrade. I have two menu entries in /boot/grub2/grub.cfg: the default, labelled RESUME, at the top of the list, and the alternative, labelled BOOT, as the next option. The only difference is that in the RESUME case I add 'acpi=off' at the end of the 'linux ... ' line. I can't do that when booting, as too many things stop working (e.g. screen brightness control, cpu throttling, checking state of battery, and others). But if I have it when resuming from hibernate there seem to be no detectable effects except that resume always works. (See also Bug #806315)

This is tolerable except that kernel updates are a great nuisance. I have to manually edit grub.cfg to recreate the two entries, then boot into the new kernel, then ensure that the default is set to boot with the acpi=off switch before I first use pm-hibernate.
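
The manual step of recreating the RESUME entry comes down to taking the BOOT entry's kernel command line and appending one parameter. A minimal sketch of that one step, with an invented helper name `add_kernel_param`; it does not touch grub.cfg itself, so the resulting line would still have to be pasted into the RESUME entry by hand.

```shell
#!/bin/sh
# Append a kernel parameter to a command line unless it is already present.
add_kernel_param() {
    cmdline=$1
    param=$2
    case " $cmdline " in
        *" $param "*) printf '%s\n' "$cmdline" ;;          # already there, leave as is
        *)            printf '%s\n' "$cmdline $param" ;;   # add it at the end
    esac
}

# Example: derive the RESUME arguments from the running kernel's command line.
#   add_kernel_param "$(cat /proc/cmdline)" acpi=off
```

Checking for the parameter first means the helper can be run again after every kernel update without doubling it up.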

I have not found any documentation that helps me understand what's going on, and how acpi=off can help. (I discovered the tip buried in a file on the internet, but can't recall where. But it gave no explanation.)

Like you I thought for a while that the failure to resume properly might have something to do with the swap area being too small or too big or needing to be refreshed, but eventually ruled that out.

My ideal solution would be for someone to alter the resume code to check if i915 module is in use, and if so resume with acpi=off (or its equivalent) -- perhaps re-setting it after resume has completed. But I am not a kernel developer and could not contribute to that.

Comment 26 Arne Woerner 2012-05-24 06:43:31 UTC

today thaw-ing crashed my box again...
it didnt even log a single syslog message and rebooted automatically...
but it seems like, it logged some messages during hibernation that should have been logged on thaw:

good hibernation+thaw:
May 22 23:15:34 vaako NetworkManager[562]: <info> (eth0): carrier now OFF (device state 10)
May 22 23:15:34 vaako dbus[586]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
May 22 23:15:34 vaako dbus-daemon[586]: dbus[586]: [system] Activating service name='org.freedesktop.nm_dispatcher' (using servicehelper)
May 22 23:15:34 vaako dbus-daemon[586]: dbus[586]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
May 22 23:15:37 vaako kernel:[711700.424187] HDMI hot plug event: Codec=3 Pin=7 Presence_Detect=0 ELD_Valid=1
May 22 23:15:37 vaako kernel:[711700.424232] HDMI status: Codec=3 Pin=7 Presence_Detect=0 ELD_Valid=0
May 22 23:15:37 vaako kernel:[711700.602662] HDMI hot plug event: Codec=3 Pin=7 Presence_Detect=1 ELD_Valid=1
May 22 23:15:37 vaako kernel:[711700.602705] HDMI status: Codec=3 Pin=7 Presence_Detect=1 ELD_Valid=1
May 23 06:01:10 vaako kernel:[711700.815770] PM: Syncing filesystems ... done.
May 23 06:01:10 vaako kernel:[711700.877306] Freezing user space processes ... (elapsed 0.01 seconds) done.
May 23 06:01:10 vaako kernel:[711700.888846] PM: Preallocating image memory... 


bad hibernation+thaw:
May 23 23:42:35 vaako NetworkManager[562]: <info> (eth0): carrier now OFF (device state 10)
May 23 23:42:38 vaako kernel:[775213.309193] HDMI hot plug event: Codec=3 Pin=7 Presence_Detect=0 ELD_Valid=1
May 23 23:42:38 vaako kernel:[775213.309248] HDMI status: Codec=3 Pin=7 Presence_Detect=0 ELD_Valid=0
May 23 23:42:38 vaako kernel:[775213.484684] HDMI hot plug event: Codec=3 Pin=7 Presence_Detect=1 ELD_Valid=1
May 23 23:42:38 vaako kernel:[775213.484727] HDMI status: Codec=3 Pin=7 Presence_Detect=1 ELD_Valid=1
May 23 23:42:38 vaako kernel:[775213.697724] PM: Syncing filesystems ... 
May 23 23:42:38 vaako kernel:[775213.784723] HDMI status: Codec=3 Pin=7 Presence_Detect=1 ELD_Valid=1
May 23 23:42:38 vaako kernel:[775213.788206] HDMI: detected monitor H22-1W at connection type HDMI
May 23 23:42:38 vaako kernel:[775213.788208] HDMI: available speakers: FL/FR
May 23 23:42:38 vaako kernel:[775213.788210] HDMI: supports coding type LPCM: channels = 2, rates = 32000 44100 48000, bits = 16 20 24
May 24 06:01:20 vaako kernel:imklog 5.8.10, log source = /proc/kmsg started.

Comment 27 Arne Woerner 2012-06-19 06:08:03 UTC

today it crashed again during thaw...
after the first hibernation after a fresh boot to kernel 3.4.2-4.fc17.x86_64...

Comment 28 Arne Woerner 2012-06-27 06:57:59 UTC

last night it failed to hibernate:
1. the left monitor went into standby,
2. but the right monitor presented a non-blinking cursor,
3. and the keyboard caps-lock lamp blinked... :-)

Comment 29 Arne Woerner 2012-07-07 06:48:16 UTC

3.4.4-3.fc17.x86_64 crashed and auto-rebooted on thaw today...

Comment 30 Arne Woerner 2012-07-20 22:51:44 UTC

3.4.5-2.fc17.x86_64 just failed to hibernate:
1. the left monitor went into standby,
2. but the right monitor presented a non-blinking cursor (last night it showed funny error messages, that were not logged in /var/log/messages or /var/log/pm-suspend.log, but it hibernated+thawed properly),
3. and the keyboard caps-lock lamp blinked, :-)
4. pm-hibernate start-up coincided with the start-up of my every-5min-incremental-backup-script (mostly: tar, find and bzip2)...

Comment 31 Arne Woerner 2012-07-27 16:58:37 UTC

today during hibernation it said on all ttys (pts/...)
"kernel:[313621.835438] do_IRQ: 0.108 No irq handler for vector (irq -1)"

but it thaw-ed properly...
i use that acpi=off on thaw (but on on boot) trick of aaronsloman, too, now... :-)

Comment 32 aaronsloman 2012-07-27 22:24:58 UTC

(In reply to comment #31)
> today during hibernation it said on all ttys (pts/...)
> "kernel:[313621.835438] do_IRQ: 0.108 No irq handler for vector (irq -1)"
> 
> but it thaw-ed properly...
> i use that acpi=off on thaw (but on on boot) trick of aaronsloman, too,
> now... :-)

Noticed a typo: that should be "(but NOT on boot)".

It has been working fine for me for some time, on both Dell Latitude E6410 and also Desktop PC, both running Fedora 16 (32 bit). However I recently discovered a problem, reported in Bug #842291, namely the sequence

 1. pm-hibernate
 2. power up and boot into Windows 7 
 3. in windows do restart
 4. resume linux from grub menu

prevents the next pm-hibernate from working. It just fails to hibernate and returns to the state before the hibernate command. Then the only option is to shutdown completely and reboot.

I eventually discovered that the failure to hibernate can be avoided by using shutdown instead of restart in windows in step 3. It took me a long time to work this out, so I thought I should pass on the warning!

Comment 33 Arne Woerner 2012-07-27 23:03:36 UTC

when i wrote "(but on on boot)" i meant "(but acpi=on on boot)"... :-)

is there anybody who uses Windows 7? *rotfl*

Comment 34 aaronsloman 2012-07-28 01:23:32 UTC

(In reply to comment #33)
> when i wrote "(but on on boot)" i meant "(but acpi=on on boot)"... :-)

Sorry. Brain slow late at night (UK).

> is there anybody who uses Windows 7? *rotfl*

Two or three times a year! This time I received a file in docx format, which Libreoffice could not read. Tried MS document reader on Win7, and it also  failed, but offered to download a converter, which successfully converted to odt, to my surprise.

Comment 35 Arne Woerner 2012-07-30 22:47:05 UTC

3.4.6-2.fc17.x86_64 just failed to hibernate...
1. the caps lock lamp blinked
2. the hard disc made funny noise
3. and just 1 of 2 monitors fell asleep
4. when i pressed the reset button it rebooted (it did not thaw)
in spite of that acpi=off-on-thaw trick...

Comment 36 Arne Woerner 2012-08-04 07:19:31 UTC

3.5.0-2.fc17.x86_64 doesnt hibernate at all... tried it 3 times...
after some time it had a kernel oops... e. g. about a LOCKUP of cpu7...

Comment 37 aaronsloman 2012-08-04 09:33:20 UTC

Sounds as if this may be a recurrence of, or related to, an old hibernate bug fixed a few months ago referenced in #789708 and #785384.

I am still using 32 bit fedora 16 kernel 3.4.6-1.fc16.i686 (the latest available to me). I have no problems with hibernate (unless I forget to unplug my dvb-tv usb dongle before hibernating, in which case it hibernates without shutting down completely - power remains on). Resume still requires acpi=off

Comment 38 aaronsloman 2012-08-07 22:49:30 UTC

(In reply to comment #35)
> 3.4.6-2.fc17.x86_64 just failed to hibernate...
> 1. the caps lock lamp blinked
> 2. the hard disc made funny noise
> 3. and just 1 of 2 monitors fell asleep
> 4. when i pressed the reset button it rebooted (it did not thaw)
> in spite of that acpi=off-on-thaw trick...

I now have kernel 3.4.7-1.fc16.i686 on my core i5 laptop (Dell E6410) and hibernate works fine. Also resume/thaw with acpi=off. Without that, resume still fails and leads to reboot.

Perhaps there has been some change in FC17 not included in FC16 which interferes with hibernate.

'uptime' shows that my core i5 desktop PC, still on kernel 3.3.7-1.fc16.i686, has been hibernating and resuming (with acpi=off) without problems since 22 May 2012, often hibernating and resuming several times in one day, except for a few days when I've been away. Because of the convenience of never rebooting I've disabled kernel upgrades on that machine.

Comment 39 Arne Woerner 2012-08-11 20:46:50 UTC

kernel 3.5.1-1.fc17.x86_64 cant hibernate here... *sob*
the kernel oops said something about the swapper process and the stack trace something about intel_idle() and then a lot of other functions about "idle"...

Comment 40 Arne Woerner 2012-08-17 15:40:13 UTC

kernel-3.5.2-1.fc17.x86_64 cant hibernate...

Comment 41 aaronsloman 2012-08-18 02:34:04 UTC

That's strange. I now have 3.5.2-1.fc17.i686 installed on my core i5 laptop (on which I've recently upgraded the bios and a few other things provided on the Dell website), and pm-hibernate works perfectly for me, as it has been doing for months. I am still having trouble resuming, however, unless I add acpi=off to the boot menu options (only for reboot).

Comment 42 Arne Woerner 2012-08-18 04:18:43 UTC

1. u run a 32bit kernel on ur i5? isnt it a 64bit CPU?
2. with "(only for reboot)" u mean "(only for thaw)"?

Comment 43 aaronsloman 2012-08-18 12:08:55 UTC

(In reply to comment #42)
> 1. u run a 32bit kernel on ur i5? isnt it a 64bit CPU?

It's a 64 bit cpu, but I don't need a 64 bit linux -- for my usage it would waste memory and add to compatibility problems, so I use 32 bit fedora which runs fine on 64 bit cpu with 32 bit support.

> 2. with "(only for reboot)" u mean "(only for thaw)"?
Sorry, I mistyped 'resume' (=thaw) as 'reboot'. Apologies for confusion.

Comment 44 Arne Woerner 2012-08-18 18:44:51 UTC

1. oki... maybe a 32bit kernel is easier... :-)
2. np

Comment 45 Arne Woerner 2012-08-22 03:57:11 UTC

i found that it sometimes fails to thaw some CPUs:

vaako kernel:[40619.352527] Disabling non-boot CPUs ...
vaako kernel:[40619.354515] CPU 1 is now offline
vaako kernel:[40619.356723] CPU 2 is now offline
vaako kernel:[40619.358708] CPU 3 is now offline
vaako kernel:[40619.361177] CPU 4 is now offline
vaako kernel:[40619.362804] CPU 5 is now offline
vaako kernel:[40619.363734] Broke affinity for irq 23
vaako kernel:[40619.363743] Broke affinity for irq 44
vaako kernel:[40619.364798] CPU 6 is now offline
vaako kernel:[40619.366331] CPU 7 is now offline
vaako kernel:[40619.366613] Extended CMOS year: 2000 
vaako kernel:[40619.366694] PM: Creating hibernation image:
vaako kernel:[40619.455782] PM: Need to copy 470555 pages
vaako kernel:[40619.367849] Extended CMOS year: 2000 
vaako kernel:[40619.368335] microcode: CPU0 updated to revision 0x28, date = 2012-04-24
vaako kernel:[40619.368367] Enabling non-boot CPUs ...
vaako kernel:[40619.368429] Booting Node 0 Processor 1 APIC 0x2
vaako kernel:[40619.381744] NMI watchdog: enabled, takes one hw-pmu counter. 
vaako kernel:[40619.382179] microcode: CPU1 updated to revision 0x28, date = 2012-04-24
vaako kernel:[40619.382182] CPU1 is up
vaako kernel:[40619.382267] Booting Node 0 Processor 2 APIC 0x4
vaako kernel:[40624.405679] CPU2: Not responding. 
vaako kernel:[40624.405913] Error taking CPU2 up: -5
vaako kernel:[40624.405994] Booting Node 0 Processor 3 APIC 0x6
vaako kernel:[40629.431685] CPU3: Not responding. 
vaako kernel:[40629.432149] Error taking CPU3 up: -5
vaako kernel:[40629.432320] Booting Node 0 Processor 4 APIC 0x1
vaako kernel:[40634.485766] CPU4: Not responding. 
vaako kernel:[40634.485909] Error taking CPU4 up: -5
vaako kernel:[40634.485980] Booting Node 0 Processor 5 APIC 0x3
vaako kernel:[40639.542151] CPU5: Not responding. 
vaako kernel:[40639.542270] Error taking CPU5 up: -5
vaako kernel:[40639.542347] Booting Node 0 Processor 6 APIC 0x5
vaako kernel:[40644.601686] CPU6: Not responding. 
vaako kernel:[40644.601804] Error taking CPU6 up: -5
vaako kernel:[40644.601888] Booting Node 0 Processor 7 APIC 0x7
vaako kernel:[40644.616176] NMI watchdog: enabled, takes one hw-pmu counter. 
vaako kernel:[40644.616673] microcode: CPU7 updated to revision 0x28, date = 2012-04-24
vaako kernel:[40644.616677] CPU7 is up
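
To pull the CPUs that failed to come back out of a longer log, the "Error taking CPUn up" lines can be filtered. A small sketch, with an invented function name `failed_cpus`; it assumes the message wording shown above, which may differ in other kernel versions.

```shell
#!/bin/sh
# Read kernel log lines on stdin and print the number of every CPU
# that failed to come back online, one per line.
failed_cpus() {
    # Matches lines such as "... Error taking CPU2 up: -5".
    sed -n 's/.*Error taking CPU\([0-9][0-9]*\) up: .*/\1/p'
}

# On a live system, for example:
#   dmesg | failed_cpus
#   cat /sys/devices/system/cpu/online    # CPUs the kernel currently has online
```

For the log above this would list CPUs 2 to 6, while CPU1 and CPU7 came up normally.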

Comment 46 Arne Woerner 2012-09-03 06:09:26 UTC

kernel 3.5.3-1.fc17.x86_64 just tried to kill the idle task during pm-hibernate (before it fell asleep)... that caused a kernel oops...

it doesnt do that when i try to hibernate without graphix from single user mode...

Comment 47 John Schmitt 2012-09-03 06:30:45 UTC

I thought that being able to insert "acpi=off" on thaw would allow my machine to hibernate but doing that uncovered new issues.  Should these issues perhaps be broken into other bugs?  I haven't been able to isolate all the symptoms, but these are some of the things I've seen:

1. machine generates ext4 error messages, sometimes requires fsck on boot
2. sometimes doesn't actually powerdown the computer
3. operates very slowly when the GUI apps are being thawed

Comment 48 Arne Woerner 2012-09-22 11:27:42 UTC

the 3.5.4-1.fc17.x86_64 kernel hates me 2...
does somebody know why that is?
intellinux refuses to say if they can reproduce it on their boxes... :-)

Comment 49 aaronsloman 2012-09-23 00:59:12 UTC

I am using 32-bit version of this kernel on Dell Latitude E6410. Hibernate/thaw had been working with acpi=off used in grub before thaw. Various people told me that was overkill. However, one of its effects seemed to be to allow only one cpu to be active during the resume process. So I've now changed grub for resume to include maxcpus=1 instead of acpi=off, and this seems to be successful, as described in bug #806315 comment 61

I have no idea whether this will generalise to Core i7 + 64 bits.

Comment 50 Arne Woerner 2012-09-23 06:03:03 UTC

um... but it doesnt even hibernate with 3.5...
should i try to hibernate with 1 cpu?

i will try that maxcpu trick with 3.4...

Comment 51 Arne Woerner 2012-09-23 08:15:08 UTC

it doesn't hibernate with 3.5, even when i turn off all but 1 cpus... :-)
then it complained about some exception in the lzo compression thingies...

Comment 52 aaronsloman 2012-09-23 11:24:05 UTC

(In reply to comment #50 and #51)
> um... but it doesnt even hibernate with 3.5...

Apologies: I should have checked what you meant by Comment #48

> should i try to hibernate with 1 cpu?

I don't know how to give pm-hibernate an instruction to use only 1 cpu.

'maxcpus=1' is used as a boot flag, so it can work only for boot/thaw, not hibernate, unless I've misunderstood something. I guess if you use that flag for full boot that will restrict your machine to only 1 cpu. I don't know what effects that could have.
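
To confirm which of the two resume workarounds a running kernel was actually booted with, one can inspect /proc/cmdline. The following is a minimal Python sketch, not something posted in this thread; the function names are hypothetical, and it assumes simple space-separated parameters without quoting.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}.

    Bare flags such as 'quiet' map to None. Quoted values containing
    spaces are not handled (an assumption of this sketch).
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def resume_workarounds(cmdline):
    """Return the workarounds discussed in this bug that are present."""
    params = parse_cmdline(cmdline)
    found = []
    for name, value in (("acpi", "off"), ("maxcpus", "1")):
        if params.get(name) == value:
            found.append(f"{name}={value}")
    return found


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the currently running kernel."""
    with open(path) as handle:
        return handle.read()
```

Calling resume_workarounds(read_cmdline()) after a thaw shows whether the boot entry that was used really carried the flag.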

Anyhow, I have had no trouble with pm-hibernate completing successfully with all cpus available, since 24th May 2012, using Fedora 16 (on desktop and laptop machines) and more recently Fedora 17 (only on laptop).

But I use 32 bit linux and have core i5, not i7. I don't have enough expertise to know whether the difference between 32-bit fedora and 64-bit fedora or between i5 and i7 could explain your problems with hibernate. It could be a motherboard problem, or might be fixable with a bios update, which others suggested to me when I was having trouble because hibernate failed.

There's extensive discussion of hibernate (not resume) problems in Bug #785384
reporting work done mainly by Bojan Smojver to make it work. The last complaint reported there was in July, but it turned out to be a different problem. See comment 125 in that bug report.

> i will try that maxcpu trick with 3.4...

Just in case that wasn't a typo: it's 'maxcpus' not 'maxcpu'

(In reply to comment #51)
> it doesnt hibernate with 3.5, even when i turn off all but 1 cpus... :-)

I suspect anyone working on this problem will need to know exactly what you did to 'turn off all but 1 cpus'. Otherwise nobody can replicate your test.
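
The thread does not say how the CPUs were turned off. One common way, shown here only as an assumption about what such a test could look like, is the kernel's CPU hotplug interface under /sys/devices/system/cpu. The Python sketch below uses hypothetical function names, must run as root to have any effect on a real system, and leaves cpu0 alone by choice.

```python
import glob
import os
import re

SYSFS_CPU_ROOT = "/sys/devices/system/cpu"


def secondary_cpu_switches(sysfs_root=SYSFS_CPU_ROOT):
    """Return the 'online' control files of every CPU except cpu0."""
    found = []
    for cpu_dir in glob.glob(os.path.join(sysfs_root, "cpu[0-9]*")):
        match = re.fullmatch(r"cpu(\d+)", os.path.basename(cpu_dir))
        switch = os.path.join(cpu_dir, "online")
        # Skip cpu0 (a choice of this sketch) and CPUs without a switch.
        if match and int(match.group(1)) != 0 and os.path.isfile(switch):
            found.append((int(match.group(1)), switch))
    # Sort numerically so cpu10 comes after cpu2.
    return [switch for _, switch in sorted(found)]


def set_secondary_cpus(online, sysfs_root=SYSFS_CPU_ROOT):
    """Write '1' (online) or '0' (offline) to every secondary CPU switch."""
    for switch in secondary_cpu_switches(sysfs_root):
        with open(switch, "w") as handle:
            handle.write("1" if online else "0")
```

A test run would call set_secondary_cpus(False) before pm-hibernate and set_secondary_cpus(True) afterwards; reporting exactly these steps is what makes the test repeatable by others.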

> then it complained about some exception in the lzo compression thingies...

Since hibernate bugs seem to have been fixed for other Fedora users around May 2012, it may be a good idea for you to start a new bugreport dealing with pm-hibernate only (resume still has bugs others are reporting, and I don't know about suspend). If you give full details of your hardware configuration, bios revision, the boot parameters you are using, hibernate commands you use, kernels you have tried, any error reports you get, it may be possible for someone to work out which difference accounts for problems you have and others do not.

Just a suggestion!
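
Some of the details asked for above (kernel, boot parameters, bios revision) can be collected mechanically. This is a rough Python sketch with hypothetical function names; it assumes a Linux system that exposes DMI data under /sys/class/dmi/id and falls back to a placeholder where a file cannot be read.

```python
import platform
from pathlib import Path


def read_or_placeholder(path):
    """Return the stripped file contents, or 'unavailable' on failure."""
    try:
        return Path(path).read_text().strip()
    except (OSError, ValueError):
        return "unavailable"


def collect_report():
    """Gather basic system details to paste into a new bug report."""
    dmi = "/sys/class/dmi/id"
    return {
        "kernel": platform.release(),
        "machine": platform.machine(),
        "cmdline": read_or_placeholder("/proc/cmdline"),
        "product_name": read_or_placeholder(f"{dmi}/product_name"),
        "bios_version": read_or_placeholder(f"{dmi}/bios_version"),
        "bios_date": read_or_placeholder(f"{dmi}/bios_date"),
    }
```

Printing each key and value of collect_report() gives a short block of text for the report; the hibernate commands used and any error messages still have to be added by hand.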

In my case pm-hibernate has worked flawlessly on PC and laptop. On PC running F16 with kernel 3.3.7-1.fc16.i686 #1 SMP Tue May 22 14:14:30 UTC 2012

'uptime' records 102 days without reboot, with a total of about 150 successful hibernates recorded in /var/log/messages*

During that time I have used acpi=off for resume except for the last few resumes, when I used maxcpus=1. Without one or other of those boot flags in grub.cfg when restarting from hibernate, resume sometimes succeeds and sometimes fails (always at the point where graphical screen should be restored). To me that suggests a synchronisation bug in resume. But pm-hibernate always reports using 3 threads to compress on both my machines, and seems to be flawless now.

Comment 53 Arne Woerner 2012-09-23 14:25:51 UTC

oki
so this bug report is about
"why does thaw need that acpi=off/maxcpus=1 trick?"?

and the new one will b about
"why doesnt 3.5 hibernate?"?

Comment 54 aaronsloman 2012-10-03 03:35:08 UTC

See Bug #862475 - Why do I need maxcpus=1 to resume from pm-hibernate in 32-bit Fedora 16 on Viglen Desktop PC, Fedora 17 on Dell E6410 laptop, both with intel core i5 cpu, intel graphics?

I have no idea whether there's some difference between Core i5 and Core i7 that produces different behaviours. So I phrased the new bugreport entirely in terms of i5.

Comment 55 Arne Woerner 2013-03-09 07:52:41 UTC

i dont have access to that box anymore... -arne

Comment 56 Fedora End Of Life 2013-07-04 05:59:26 UTC

This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter:  Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 17 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to change the 
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 57 John Schmitt 2013-07-05 05:48:21 UTC

I reproduced this today.  

$ uname -r
3.9.6-200.fc18.x86_64

Comment 58 aaronsloman 2013-07-05 09:08:56 UTC

Both pm-hibernate and suspend (usually invoked by shutting the lid) have worked consistently for me since mid June in F18 on a core i5 machine (Dell Latitude E6410)

3.9.6-200.fc18.x86_64 #1 SMP Thu Jun 13 18:56:55 UTC 2013

Perhaps there's a difference between core i5 and i7.

[This is my first 64bit linux and I've been very impressed at how smoothly it also supports 32 bit programs.]

On the same machine suspend was very unreliable in 32-bit F17 (i.e. usually resume failed) and resume from hibernate worked only with 'maxcpus=1' in grub.cfg

Comment 59 Mario 2013-07-05 09:35:31 UTC

I don't know if my issue is related to this bug. On a Toshiba Z930 laptop (i7) the second suspend fails and freezes the system, while the first suspend from a fresh boot goes fine.
After searching a bit, this seems related to an ACPICA bug:
https://github.com/acpica/acpica/commit/34f226fa2643f1d2e6527ea4edb24947cfe1fb6a
that was fixed in the 20130626 release. As far as I know, this release has been merged into neither the 3.9.9 nor the 3.10 kernel.
Maybe the patch could be applied in the next Fedora kernel?

Comment 60 Arne Woerner 2013-07-05 13:44:56 UTC

so this bug is still active?

should we
(a) bump the Fedora version of this bug to FC18?
or
(b) make a new bug report (because I do not have a Core i7 anymore)? :-)

-Arne

Comment 61 Mario 2013-07-05 14:57:31 UTC

In my experience the bug is alive across Fedora and kernel versions. I installed Fedora 18 a few months ago, upgraded regularly until the 19 release (via fedup) and it's still here. IMHO it does not seem to be tied to a particular processor, maybe to the chipset and/or the bios.
Maybe the assignee should decide about the classification of the bug.
Mario

Comment 62 aaronsloman 2013-07-17 19:22:10 UTC

(In reply to aaronsloman from comment #58 on 2013-07-05)
> Both pm-hibernate and suspend (usually invoked by shutting the lid) have
> worked consistently for me since mid June in F18 on a core i5 machine (Dell
> Latitude E6410)

This is still true. Now on kernel: 3.9.9-201.fc18.x86_64 

Still no problem with either pm-hibernate (sometimes used several times a day) or suspend triggered by shutting the lid.

When I was using Fedora 17 (32 bit) suspend usually failed to resume, and resume from pm-hibernate required maxcpus=1, which was a nuisance, but made it totally reliable for me.

But both just work normally now. Could the persistent bug(s) be hardware dependent?

(Upgrading to F18 caused serious NetworkManager problems for me with Enterprise wifi, because of altered security mechanisms and wicd was unusable, but I think NM works now after I found out, by chance, which files to edit -- only partially tested. Everything else seems to be fine.)

Comment 63 Fedora End Of Life 2013-08-01 17:19:55 UTC

Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.