Bug 781749 (kernel_hibernate) - Hibernation issue tracker bug
Summary: Hibernation issue tracker bug
Keywords:
Status: CLOSED UPSTREAM
Alias: kernel_hibernate
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On: 684452 451551 627980 677820 697544 713882 723499 726651 728689 731588 735625 742477 742708 unlock_new_inode 745634 753611 753836 755770 759536 759776 760364 769576 770222 770443 771334 771559 781789 785384 786312 787044 thaw 789708 791267 794692 795138 796109 796357 796516 796564 796597 797037 797076 797181 797559 799229 799575 800423 803605 804581 804903 805730 806072 808283 808909 810878 811043 823871 859723 hdmithaw
Blocks:
Reported: 2012-01-14 20:26 UTC by Adam Williamson
Modified: 2015-10-09 13:24 UTC
CC List: 29 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-24 14:45:14 UTC
Type: ---
Embargoed:


Attachments
Half the number of pages used by hibernation I/O (1.23 KB, patch)
2012-03-17 00:32 UTC, Bojan Smojver

Description Adam Williamson 2012-01-14 20:26:15 UTC
It was mentioned in the kernel talk at FUDCon that hibernation is dangerously broken - as in, can cause data corruption at resume time - and no-one has the time or inclination to look at fixing it, either at RH or upstream. Given this, and the fact that hibernation is rarely a useful option compared to power off or suspend, would it make sense simply to disable it in the kernel build so that we don't run the risk of corrupting people's data?

Comment 1 Dave Jones 2012-01-14 21:17:36 UTC
This is going to affect a number of packages. If we decide we're going this way, at the minimum, support will need removing from all the desktops, pm-hibernate from pmtools as well as the kernel.

Comment 2 Josh Boyer 2012-01-14 21:26:36 UTC
(In reply to comment #1)
> This is going to affect a number of packages. If we decide we're going this
> way, at the minimum, support will need removing from all the desktops,
> pm-hibernate from pmtools as well as the kernel.

Moving to the distribution component to cover the general nature.

Comment 3 Bill Nottingham 2012-01-17 16:38:19 UTC
CC'ing some desktop environment maintainers for comments as to how feasible this is.

Comment 4 Kevin Fenzi 2012-01-17 16:52:29 UTC
For Xfce, I am pretty sure if Upower stops saying the machine can hibernate, Xfce will no longer display a hibernate button to the user.

Comment 5 Rex Dieter 2012-01-17 16:57:16 UTC
Ditto for kde's use of upower as well.

Comment 6 Matthias Clasen 2012-01-18 00:01:52 UTC
Pretty sure gnome is going to be fine too. But I am asking Richard to confirm too.

Comment 7 Richard Hughes 2012-01-18 12:25:58 UTC
Yes, GNOME will be fine, as upower is reading the values out of /sys/power/state and only setting the can-hibernate property to TRUE if 'disk' is present. I'm pretty sure all of GNOME uses upower for getting this data.

Whilst I agree that hibernation is dangerous, I think you're going to get some pretty big pushback from users that actually use this. At the moment gnome-settings-daemon automatically hibernates if the computer can hibernate, and if the battery is critically low. If we suspend instead, and then the power runs out (suspend still needs power) then we risk corrupting documents.

The alternative is we force-shutdown the system on critical low power, although a lot of apps don't auto-save, and that will make us unpopular to say the least.

Given that suspend "can cause data corruption" also, I'm not sure that's enough justification for removing a feature that I know a large number of people use.

But I also admit, hibernation takes way too long and works on too few machines due to the large amount of swap space required. Given I'm not the one tasked with maintaining the kernel hibernate code, I'll be OK adapting to changes if required.
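
The upower check described above can be approximated from a shell. This is only a sketch of the idea, not upower's actual code: the function name can_hibernate is made up for illustration, and the optional file argument exists only so the logic can be tried against a sample file instead of the real /sys/power/state.

```shell
# Hypothetical helper: prints "yes" if the state file lists "disk", else "no".
# $1 is optional and defaults to /sys/power/state.
can_hibernate() {
    state_file="${1:-/sys/power/state}"
    # The file holds space-separated sleep states, e.g. "freeze mem disk".
    for mode in $(cat "$state_file" 2>/dev/null); do
        if [ "$mode" = "disk" ]; then
            echo yes
            return 0
        fi
    done
    echo no
    return 1
}
```

Called with no argument on a running system, it reads /sys/power/state directly; a kernel built without hibernation support would not list "disk" there.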

Comment 8 Bill Nottingham 2012-01-18 14:08:44 UTC
I agree - I suspect there will be pushback, but I don't want to be on the hook for fixing hibernate. Wrapping it in a "yes-i-want-to-eat-my-data" boot option probably isn't feasible?

Comment 9 Adam Williamson 2012-01-20 00:23:52 UTC
FWIW, I don't recall the last time I saw a question about hibernate on the forums or #fedora. My impression is it's just not a very widely used feature; people usually suspend or shut down.

I'd be happy to post a thread in the forums to informally test the waters on how many people really use hibernate functionality.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 10 GeoffLeach 2012-03-05 20:44:50 UTC
Consider this pushback (see 796516). I'm off the grid, so power down happens every pm. Having to do shutdown is a PITA. BTW, was there a thread started, if so where?

Comment 11 Arne Woerner 2012-03-05 22:50:36 UTC
i think i tamed bug #788433... :-)
10 thaws and no crash/reboot/panic/oops...
at least it is much more stable than before...

i think either the hdd (write cache) or the mainboard (platform mode) misbehave in my case...

r we hibernating "like the doctor orders"?
i mean: r there intel/amd application notes on hibernation?

i personally would prefer to use some hibernate procedure instead of "halt -p"...

-arne

Comment 12 Adam Williamson 2012-03-05 23:00:40 UTC
arne: we may well not be hibernating 'like the doctor orders', no. the broad underlying problem here is there really isn't anyone at the kernel developer level who cares a whole deal about maintaining hibernate functionality, which is why it's so often broken.




Comment 13 aaronsloman 2012-03-06 00:07:40 UTC
I am not a developer and don't know enough to be one, but I shall be
devastated without hibernate.

I have a desktop machine running Fedora 15, and a laptop (Dell Latitude E6410) running Fedora 16. On both I regularly use hibernate for the simple reason
that I have a lot of activities I am involved in and I have between
8 and 10 virtual desktops (running on CTWM, not GNOME or KDE; though
for a while I used Openbox, CTWM gives me more flexibility along with speed, a low footprint and great robustness).

I use several virtual desktops to keep enduring records of where I am on various tasks e.g. pdf files partly read and commented on, editor files
with partly finished papers, or comments, or email messages, and other
things. When I go out, or retire at night, or go to a conference I
hibernate the PC and when I next turn it on having all that
information in context makes it much easier for me to return to the
most urgent task.

The laptop is mostly used when I travel or have to give presentations on campus. Again, keeping context that I can retrieve in seconds is enormously useful.

E.g. when preparing a talk with demos I have one desktop with the
pdf or xdvi file to be presented, another in which I work on the
latex source in the editor, and often one or more additional
desktops in which I have videos or program demos ready to run. I can
work on this configuration over several days before the actual
presentation, using hibernate to preserve it all (e.g. when the
machine has to be switched off while travelling).

In fact pm-hibernate seems to work reliably (for me) in Fedora 16
using kernel 3.1.7-1.fc16.i686. The trouble started with kernel 3.2.

If there's a new version of 3.2 I'll be willing to test it. The last
version I tried was 3.2.7-1. It worked for a while, then started
freezing during hibernate, so I've abandoned it.

NOTE: I previously used tuxonice, because it was faster and gave
more information during hibernate and resume, but for some reason it
started freezing during resume in F15, after being rock-solid for
several years.

Tuxonice also gave me the option to boot windows (e.g. to test
hardware, very rarely needed) while hibernating. But recently
pm-hibernate has also shown a progress indicator during hibernate
and resume, and also allows me to see the grub menu after hibernate,
so I've abandoned tuxonice.

Apologies for length.

Aaron Sloman
http://www.cs.bham.ac.uk/~axs

See also: https://bugzilla.redhat.com/show_bug.cgi?id=785384 (from hibernate-users)

Comment 14 Adam Jackson 2012-03-06 15:05:38 UTC
Just a note: saying YES I WANT HIBERNATE isn't helpful.  We know you want hibernate.  We want hibernate too.  Wanting it doesn't make it magically work.

Comment 15 Justin M. Forbes 2012-03-06 15:11:25 UTC
I am not sure how hibernating right now is any safer than suspend and running the risk of running out the battery. It's a gamble either way, and data can be lost.

Comment 16 GeoffLeach 2012-03-06 16:30:02 UTC
(In reply to comment #14)
> Just a note: saying YES I WANT HIBERNATE isn't helpful.  We know you want
> hibernate.  We want hibernate too.  Wanting it doesn't make it magically work.

OK, how do we (who are not kernel hackers) contribute?

Comment 17 Paul Bolle 2012-03-06 19:06:39 UTC
(In reply to comment #8)
> Wrapping it in a "yes-i-want-to-eat-my-data" boot option
> probably isn't feasible?

Currently there's a "hibernate=" kernel parameter. Perhaps a (say) "disable"
value would be acceptable upstream.

This means, of course, that Fedora would default to "hibernate=disable" (or
something similar) to achieve what you suggest. Anyone manually removing a
kernel parameter like that should be (made) aware there's a downside to doing
that.
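
Whatever value ends up being used, the parameter shows up on the kernel command line, so a script can look for it. A minimal sketch, assuming the parameter appears as a space-separated hibernate=VALUE token in /proc/cmdline; hibernate_param is a hypothetical helper name, and "disable" is only the value proposed in this comment, not one the kernel is known to accept.

```shell
# Hypothetical helper: prints the value of the first hibernate= token,
# or nothing if the parameter is absent.
# $1 is optional and defaults to /proc/cmdline.
hibernate_param() {
    # Put one parameter per line, then strip the "hibernate=" prefix.
    tr ' ' '\n' < "${1:-/proc/cmdline}" | sed -n 's/^hibernate=//p' | head -n 1
}
```

A desktop component could use the result to decide whether to offer hibernation, in the same spirit as the /sys/power/state check mentioned earlier in this bug.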

Comment 18 Arne Woerner 2012-03-06 19:51:53 UTC
(In reply to comment #17)
> Currently there's a "hibernate=" kernel parameter. Perhaps a (say) "disable"
> value would be acceptable upstream.

i'm for that... :-) -arne (i like Scary Movie 4)

Comment 19 GeoffLeach 2012-03-06 22:19:46 UTC
(In reply to comment #17)

> Currently there's a "hibernate=" kernel parameter. Perhaps a (say) "disable"
> value would be acceptable upstream.
> 
> This means, of course, that Fedora would default to "hibernate=disable" (or
> something similar) to achieve what you suggest. Anyone manually removing a
> kernel parameter like that should be (made) aware there's a downside to doing
> that.

1. Is this parameter user-accessible, or would it require the user to build a custom kernel?

2. In either case, would a bug report for non-function hibernate be accepted? (In that case, we would be back where we are now, no?)

Comment 20 aaronsloman 2012-03-07 03:21:20 UTC
(In reply to comment #14)
> Just a note: saying YES I WANT HIBERNATE isn't helpful.  We know you want
> hibernate.  We want hibernate too.  Wanting it doesn't make it magically work.

I should have pointed out that I was responding to the following claimed "fact" at the top of this page: "the fact that hibernation is rarely a useful option compared to power off or suspend".

I assumed, wrongly perhaps, that anyone thinking that could not have understood the kind of use that I and others made of hibernate. So I offered a detailed description of why for me (and others I know) it's essential. Apologies if I missed the point.

That said, I am very grateful to all the people whose efforts enable me to use linux. Neither Windows nor Mac OS tempts me!

Thanks.

Comment 21 Paul Bolle 2012-03-07 09:10:05 UTC
(In reply to comment #19)
> 1. Is this parameter user-accessible, or would it require the user to build a
> custom kernel?

Kernel parameters can be set, edited, and removed via the bootloader. It was mentioned (in comment #8), I assume, as an easy way for people to still use hibernation (without having to recompile their kernels).
 
> 2. In either case, would a bug report for non-function hibernate be accepted?
> (In that case, we would be back where we are now, no?)

The gist of this bug report is that hibernation is broken but unlikely to be fixed (at least currently). So the chances of reports concerning hibernation getting much attention are low (as they apparently already are now).

Comment 22 Paul Bolle 2012-03-07 09:41:19 UTC
(In reply to comment #20)
> I should have pointed out that I was responding to the following claimed "fact"
> at the top of this page: "the fact that hibernation is rarely a useful option
> compared to power off or suspend".

0) I'll be going a bit off topic here.

1) If I understand your comment #13 correctly you'd like to preserve your desktop's state after you've stopped using your computer. Well, if desktops were able to remember what stuff was being done on shut down (or log out) and reconstruct that state on power on (and log in) this whole discussion would be moot. Currently hibernation seems to take longer than shut down. And if the net effect of a shut down and power on cycle would resemble a hibernation and thaw cycle there'd be little left to be gained by fixing hibernation.

2) I'm not sure whether desktops actually have tried (or even are trying) to do that and how successful they were (or are). It could very well be that in practice it would be rather hard to implement this idea correctly. E.g., it could require big changes to both the desktop environment and the programs currently in use.

Comment 23 aaronsloman 2012-03-07 11:07:41 UTC
"Well, if desktops were able to remember what stuff was being done on shut down (or log out) and reconstruct that state on power on (and log in) this whole discussion would be moot"

Correct, except possibly for timing issues (see below). I have thought about this but decided it was totally unrealistic, because all developers of interactive tools with state (editors, document readers, drawing packages, program development environments, etc.) would have to agree on a set of conventions for recording their state on receiving some signal, and provide mechanisms for restoring their state when run in an appropriate way. (Firefox attempts this, though the version I am now using always requires user interaction when restoring its state after shutdown, and displays unwanted advertising pages.)

"Currently hibernation seems to take longer than shut down."
For me that's not the case. Maybe it's different for people using large systems like gnome or kde. It will probably depend on age of machine, kind of hard drive in use, size of saved state, etc. I suspect that starting up the whole operating system and previously running applications and then restoring state will take longer than just restoring from hibernation.

Moreover, since kernel 3.2 both hibernation and resume have got very much faster, perhaps as a result of new compression/decompression algorithms?

As of last night I am running 3.2.9-1.fc16.i686 and so far have hibernated and resumed a few times without problems.

Comment 24 aaronsloman 2012-03-07 12:57:36 UTC
(In reply to comment #11)
> i think i tamed bug #788433... :-)
> 10 thaws and no crash/reboot/panic/oops...
> at least it is much more stable than before...

That's also my impression using kernel 3.2.9-1.fc16.i686

Moreover, I no longer need to insert "acpi=off" in boot command when resuming from hibernate. Previously, without that flag it would not complete the resume, and would instead reboot. I am very glad that's fixed.

Hibernate is now amazingly fast (apparently because it uses three threads for compression on my Dell E6410 with Intel Core i5 CPU).

Resume from grub boot menu takes about 18 seconds, which is much less than full boot, log in, start up applications, etc. (Wireless connection takes a bit longer to restore, of course.)

My only grumble now is that updating kernel always puts the wrong boot partition in grub.cfg (the partition I originally used for fedora 15, instead of the partition currently in use for F 16). I have to edit the file by hand before first boot. That looks like a serious bug in grub2. Any advice on where to report that? I know it has nothing to do with hibernate.

Thanks.

Comment 25 cam 2012-03-07 13:10:07 UTC
(In reply to comment #9)
> FWIW, I don't recall the last time I saw a question about hibernate on the
> forums or #fedora. My impression is it's just not a very widely used feature;
> people usually suspend or shut down.
> 
> I'd be happy to post a thread in the forums to informally test the waters on
> how many people really use hibernate functionality.

I prefer suspend, but it seems so frequently broken that hibernate is a good alternative. Currently I am using hibernate.

If I could choose only one, I would prefer suspend, but that is a great loss.

In some conditions I prefer hibernate, a use case being 'on the road and not always near power'. It's best to conserve power if you don't know how long it will be until you are using the machine again.

Comment 26 Mike Heller 2012-03-07 14:31:20 UTC
I'm sorry, but I can't believe that you are discussing disabling a default feature of an operating system because nobody cares about that feature.

On the other hand, I only use hibernation because of power consumption, on either a notebook or a desktop.

And I guess that disabling such a feature will scare off users...

Comment 27 GeoffLeach 2012-03-07 14:38:33 UTC
(In reply to comment #23)

> As of last night I am running 3.2.9-1.fc16.i686 and so far have hibernated and
> resumed a few times without problems.

It appears that hibernate failure is system dependent. Hardware? I have a home-brew box that hibernates 3.2.9-1.fc16.i686.PAE every night with no problems. It has 2GB memory. OTOH I have a laptop with 8GB that can't hibernate 3.2.9-1.fc16.i686.PAE at all.
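
One cheap thing to rule out when a large-memory machine refuses to hibernate is swap that is much smaller than RAM, since the image is written to swap. The sketch below compares the two figures from /proc/meminfo. It is a rough sanity check only: the image is normally smaller than RAM, so swap smaller than RAM does not by itself mean hibernation must fail. swap_vs_ram is a hypothetical helper name, and the optional file argument is there only for trying it on sample data.

```shell
# Hypothetical helper: prints "swap>=ram" or "swap<ram".
# $1 is optional and defaults to /proc/meminfo.
swap_vs_ram() {
    awk '
        /^MemTotal:/  { ram  = $2 + 0 }   # value in kB
        /^SwapTotal:/ { swap = $2 + 0 }   # value in kB
        END {
            if (swap >= ram) print "swap>=ram"
            else             print "swap<ram"
        }
    ' "${1:-/proc/meminfo}"
}
```

If this reports swap>=ram on the failing machine, swap size is unlikely to be the explanation and drivers become the more likely suspect.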

Comment 28 aaronsloman 2012-03-07 17:05:18 UTC
(In reply to comment #27)
> (In reply to comment #23)
> 
> > As of last night I am running 3.2.9-1.fc16.i686 and so far have hibernated and
> > resumed a few times without problems.
> 
> It appears that hibernate failure is system dependent. Hardware? I have a
> home-brew box that hibernates 3.2.9-1.fc16.i686.PAE every night with no
> problems. It has 2GB memory. OTOH I have a laptop with 8GB that can't hibernate
> 3.2.9-1.fc16.i686.PAE at all.

Sounds like an addressing problem?

Mine is Dell Latitude E6410 with 4GB (but 10GB swap area).

For a long time (starting with Fedora 13, June 2010) the Intel integrated graphics card (i915?) was a source of severe problems, but from Fedora 15 those problems disappeared.

However I have had to remove from my .xinitrc file the command to blank after an hour:
  
  xset dpms 3600 0 0
 
This seemed to (sometimes) prevent the screen working after hibernate+resume, though I was never sure that was the cause. It used to work fine in 2.6 kernels with tuxonice. I could go for weeks without rebooting.

Comment 29 Bojan Smojver 2012-03-07 23:46:22 UTC
(In reply to comment #0)
> It was mentioned in the kernel talk at FUDCon that hibernation is dangerously
> broken - as in, can cause data corruption at resume time - and no-one has the
> time or inclination to look at fixing it, either at RH or upstream. Given this,
> and the fact that hibernation is rarely a useful option compared to power off
> or suspend, would it make sense simply to disable it in the kernel build so
> that we don't run the risk of corrupting people's data?

For me (ThinkPad T510, Intel graphics), hibernation is broken only when kernel mode setting is used. Otherwise, it works OK. It's just that the box is useless like that. And, it got broken after 2.6.34 (see: https://bugs.freedesktop.org/show_bug.cgi?id=41705#c7), which is way before hibernation threading was introduced.

And yes, it causes memory corruption, which is most likely not caused by hibernation code itself. We know that, because new hibernation code calculates CRC32 of the image, so there is a very small possibility of a bad image being loaded back in. Also, other folks already verified that pages get corrupted after the image has been loaded back in (see kernel bug 37142).

You can track some hibernation bugs related to Intel graphics here:

https://bugs.freedesktop.org/show_bug.cgi?id=41705
https://bugs.freedesktop.org/show_bug.cgi?id=40241
https://bugzilla.kernel.org/show_bug.cgi?id=37142
https://bugzilla.kernel.org/show_bug.cgi?id=13811

Saying that hibernation is generally broken is untrue.

Hibernation is a very useful option and if it worked on my laptop, I would use it much more often than suspend. After all, once you hibernate, your battery will not get drained, unlike with suspend.

PS. My pet theory is that DRM code keeps stale page pointers around and "reinitialises" them after thaw. Of course, those pages by then belong to something else, causing corruption. This analysis is based on gut feeling and hairs on the back of my neck, of course. ;-)

Comment 30 Mike Heller 2012-03-08 04:51:27 UTC
(In reply to comment #27)
> (In reply to comment #23)
> 
> > As of last night I am running 3.2.9-1.fc16.i686 and so far have hibernated and
> > resumed a few times without problems.
> 
> It appears that hibernate failure is system dependent. Hardware? I have a
> home-brew box that hibernates 3.2.9-1.fc16.i686.PAE every night with no
> problems. It has 2GB memory. OTOH I have a laptop with 8GB that can't hibernate
> 3.2.9-1.fc16.i686.PAE at all.

It looks pretty similar in my case.
One desktop and one laptop, each with 2GB of RAM, are working fine.
On the other hand, my workstation at home with 8GB of RAM fails to hibernate at all.

Comment 31 Adam Williamson 2012-03-08 05:39:56 UTC
a few points.

If you use hibernate and it seems to mostly work - great. I'm not going to suggest that you are crazy and seeing hallucinations. However, that's not sufficient evidence to support the statement that 'hibernate is not generally broken'. The people who say that hibernate is inherently broken are the people who work on the kernel and know how the hibernate code actually works, and what's wrong with it.

It's perfectly possible for it to work (apparently) perfectly 999 times and then eat your baby on the 1000th time. That, to me, is code that is inherently broken. That is the current state of the code, and that's why I proposed disabling it.

Mike Heller, yes, that is precisely what I'm proposing. In my view it's better to disable code that is known to be actively dangerous than to leave it on. Of course it's not an ideal solution, and the ideal solution is to fix the damn code, but given that that does not look like happening and really isn't within the Fedora project's control, we're left with only bad choices and required to pick the least bad.

ajax, I think it is useful to know who uses hibernate and what for, to be honest - part of the FUDCon discussion involved us all sitting around scratching our heads wondering what the hell anyone used hibernate for anyway, so it probably doesn't hurt to know that there are at least a few people with reasonable use cases for it.




Comment 32 Bojan Smojver 2012-03-08 05:57:43 UTC
(In reply to comment #31)
> The people who say that hibernate is inherently broken are the people
> who work on the kernel and know how the hibernate code actually works, and
> what's wrong with it.

Can you please point to a thread where this is discussed. AFAIK, Rafael is the hibernation maintainer. Did he decide to drop that altogether or what?

I cannot speak for the others, but to me, hibernation problems boil down to KMS problems. I do not see how that is a hibernation issue.

PS. If they know what's wrong with it, why don't they fix it? Tongue in cheek, of course. ;-)

Comment 33 Bojan Smojver 2012-03-08 06:01:36 UTC
(In reply to comment #30)

> On the other hand my workstation at home with 8GB of RAM fails hibernation at
> all.

Have you tried blacklisting any of the modules? These sound like driver problems.

Comment 34 Bojan Smojver 2012-03-08 07:10:05 UTC
Just went over all of the bugs that depend on this. In almost all cases, the crashes and memory corruption are related to i915. The other serious problems are related to sandboxing.

The rest appears to be assorted driver problems (XYZ doesn't work after thaw), platform problems (need to use shutdown instead), modules that probably need to be blacklisted, modules that got rewritten and won't suspend properly any more etc.

I don't read this as: "It's perfectly possible for it to work (apparently) perfectly 999 times and then eat your baby on the 1000th time." If that's the case, this whole thing should be discussed on LKML instead of here.

Comment 35 Adam Williamson 2012-03-08 07:11:00 UTC
bojan: the source of the bug was an in-person discussion, so I don't have specific bug references. the kernel team probably would, if they feel like chipping in.




Comment 36 Bojan Smojver 2012-03-08 07:58:40 UTC
(In reply to comment #35)
> bojan: the source of the bug was an in-person discussion, so I don't have
> specific bug references. the kernel team probably would, if they feel like
> chipping in.

OK.

I'll bet 90% of the memory corruption on thaw will be handled by disabling hibernation (by default) on systems that use i915. If that's something you'd like to do, then sure. Disabling hibernation completely will prevent everyone from testing, unless they want to build their own kernel.

Comment 37 aaronsloman 2012-03-08 13:41:30 UTC
(In reply to comment #23)
I wrote:
> As of last night I am running 3.2.9-1.fc16.i686 and so far have hibernated and
> resumed a few times without problems.

Several more hibernate/resume sessions followed without problems. But hibernate has just stalled again, so had to shut down forcibly. This is on Dell E6410 laptop, with Intel graphics requiring i915.

My desktop PC also uses Intel graphics and also needs i915, but that's hibernating and resuming fine, using kernel 2.6.42.7-1.fc15.i686.

My laptop previously worked on Fedora 16 with 3.1.7-1.fc16.i686 (though did not hibernate and resume so fast) but it seems that one of the recent kernel upgrades has caused something else to change, so that 3.1.7-1 no longer works. If I knew how I would revert to that.

Comment 38 Bojan Smojver 2012-03-09 02:38:45 UTC
(In reply to comment #37)
 
> Several more hibernate/resume sessions followed without problems. But hibernate
> has just stalled again, so had to shut down forcibly. This is on Dell E6410
> laptop, with Intel graphics requiring i915.

If you want to automate your tests, this is how you can do it:

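# As root, in bash: "reboot" mode reboots after writing the image, so every cycle thaws unattended.
echo -n reboot > /sys/power/disk
for (( i=0; i<125; i++ )); do echo "$i"; pm-hibernate; chvt 2; sleep 1; chvt 1; sleep 2; done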

This will hibernate/thaw 125 times and it will change VTs every time the system comes back, exercising i915 a bit more. On my system, it usually lasts 20 to 25 times, at which points segfaults start.
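
If the machine dies partway through such a run, it helps to know which cycle it reached. A sketch of the same kind of loop with a log, assuming pm-hibernate is installed and the script runs as root; hibernate_cycles and its arguments are invented for illustration, and the third argument lets a harmless command stand in for pm-hibernate when trying the function out.

```shell
# Hypothetical wrapper around a hibernate/thaw loop.
# $1 = number of cycles, $2 = log file, $3 = command to run (default: pm-hibernate)
hibernate_cycles() {
    count="$1"
    log="$2"
    cmd="${3:-pm-hibernate}"
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "cycle $i" >> "$log"
        sync                 # flush the log to disk before the machine goes down
        $cmd || return 1     # stop at the first cycle that reports failure
        i=$((i + 1))
    done
}
```

For example, `hibernate_cycles 125 /root/hibernate.log` would leave the number of the last cycle that was started as the final line of the log, which is useful after a hang or a forced power-off.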

Comment 39 Mike Heller 2012-03-09 06:34:00 UTC
(In reply to comment #33)
> (In reply to comment #30)
> 
> > On the other hand my workstation at home with 8GB of RAM fails hibernation at
> > all.
> 
> Have you tried blacklisting any of the modules? These sound like driver
> problems.

Yes I did but without success. ( I also tried to unload all unnecessary modules )
Maybe you can give me a hint which one I should try or how I can find out the failing one?

Comment 40 aaronsloman 2012-03-10 21:23:54 UTC
(In reply to comment #37)
I wrote:
> My desktop PC, also uses intel graphics also needs i915, but that's hibernating
> and resuming fine, using kernel 2.6.42.7-1.fc15.i686

I was wrong. It just took longer before it froze on hibernate.

> My laptop previously worked on Fedora 16 with 3.1.7-1.fc16.i686 (though did not
> hibernate and resume so fast) but it seems that one of the recent kernel
> upgrades has caused something else to change, so that 3.1.7-1 no longer works.
> If I knew how I would revert to that.

I managed to revert by uninstalling kernel, kernel-headers, and gcc, then installing kernel-headers for 3.1.7-1, then reinstalling gcc. Hibernate/resume worked for a while, and then froze. So I was deluding myself in thinking the problems were restricted to my laptop and to kernels 3.2*
 
It's very worrying that this is all proving so difficult. Perhaps linux has reached the level of complexity that requires use of formal methods in development -- and massive resources. Or perhaps it now has to be restricted to a subset of available hardware, if the problem is caused by the recent intel graphic cards. I assumed when I bought my laptop (in 2010) that Intel was so helpful to linux that the new hardware would soon be fully supported. It seems I was wrong and should have chosen the nvidia option?

Comment 41 Bojan Smojver 2012-03-10 22:55:15 UTC
(In reply to comment #39)

> Maybe you can give me a hint which one I should try or how I can find out the
> failing one?

Really difficult to say, to be honest. Trial and error is the only way, I'm afraid. I had problems with a variety of modules in the past: USB, network, wireless, WWAN etc. So, I would systematically try all of those until I reached the core modules that really cannot be unloaded.
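
One way to order that trial and error is to start with modules that nothing else currently depends on, since those can be unloaded with modprobe -r without first removing their dependents. A minimal sketch, assuming the usual /proc/modules layout where the first field is the module name and the third field is the use count; unused_modules is a hypothetical helper name.

```shell
# Hypothetical helper: lists modules whose use count is 0.
# $1 is optional and defaults to /proc/modules.
unused_modules() {
    # Fields: name size use_count dependents state address
    awk '$3 == 0 { print $1 }' "${1:-/proc/modules}"
}
```

Unloading one candidate at a time before a hibernate attempt, and noting which removal makes the failure go away, narrows down the driver to blacklist.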

Comment 42 Bojan Smojver 2012-03-10 23:23:39 UTC
(In reply to comment #40)
> Or perhaps it now has to be restricted to
> a subset of available hardware, if the problem is caused by the recent intel
> graphic cards. I assumed when I bought my laptop (in 2010) that Intel was so
> helpful to linux that the new hardware would soon be fully supported. It seems
> I was wrong and should have chosen the nvidia option?

Yes, it probably is restricted to a subset of hardware.

I communicated with quite a few Intel engineers (a few of whom worked for Red Hat as well) about this and they strike me as guys that want to fix the problem (and like guys whose own laptops hibernate properly). It's just that:

- they don't necessarily have my hardware (I know, you'd think Intel guys would have access to everything, but in practice, I don't think that's the case)

- they primarily work on new platforms, the ones that are about to hit the market, not hardware that's already out there

I even contemplated putting a bounty on this, but I don't think the amount I would be able to offer would motivate highly paid engineers enough. If I could part with my ThinkPad T510 for a couple of months, I would even do that. But, it's a bread winner that one, so I cannot do that.

In any event, I would not panic. This is just a bug. And it is most likely in DRM code.

If you pass nomodeset to the kernel and do the hibernate/thaw loop from comment #38, do you still get trouble? I sure don't.

Comment 43 Bojan Smojver 2012-03-10 23:44:32 UTC
(In reply to comment #40)
> So I was deluding myself in thinking the
> problems were restricted to my laptop and to kernels 3.2*

Just one more comment here. There were 3 major changes in hibernation code of 3.2:

1. Compression/decompression now uses threads.

2. CRC32 of the image is calculated (in a separate thread) and checked when compression/decompression is used.

3. I/O has been made smarter for both hibernate and thaw (compression or not).

This caused 3.2 to have better hibernate/thaw performance (which is what you were seeing). Also, we are now in the position to be reasonably sure that what was read in on thaw is what was hibernated (i.e. because of CRC32 calculation/checking). So, 3.2 hibernation code alone should be more reliable than that of 3.1.

Compression/CRC32 calculation can be completely disabled by passing hibernate=nocompress to the kernel. In that case, old compression/CRC32 free code is used.

PS. Speaking as the author of the above improvements.
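
The idea behind point 2 can be illustrated in user space: record a checksum when data is written and compare it when the data is read back. This sketch uses the POSIX cksum utility on an ordinary file; it is not the kernel's code, and cksum's CRC variant is not claimed to match the one the kernel uses. checksum_of and verify_image are hypothetical names.

```shell
# Hypothetical helper: prints the CRC that cksum computes for a file.
checksum_of() {
    # cksum prints "<crc> <size>" when reading stdin; keep only the CRC.
    cksum < "$1" | cut -d ' ' -f 1
}

# Hypothetical helper: succeeds only if the file still matches the
# checksum recorded earlier. $1 = file, $2 = expected checksum.
verify_image() {
    [ "$(checksum_of "$1")" = "$2" ]
}
```

A mismatch means the data read back is not the data that was written, which is how a checksum lets the loader refuse a damaged image instead of resuming from it.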

Comment 44 aaronsloman 2012-03-13 07:17:08 UTC
(In reply to Bojan Smojver comment #42)
> If you pass nomodeset to the kernel and do the hibernate/thaw loop from comment
> #38, do you still get trouble? I sure don't.

I have not had an opportunity to try the looping experiment, but I have tried this:

I am now running 3.2.9-2.fc16.i686 and have two entries for it in grub.cfg, one for booting (e.g. after crash) and one for resuming from hibernate, the default.
The only difference is that in the 'linux' line the resume entry has two extra parameters at the end: "acpi=off nomodeset"

The effect of 'nomodeset' seems to be that the graphic mode does not change until after the decompression is complete (then the font size gets smaller).

I don't know what 'acpi=off' does but I previously found I had problems if I did not include it when resuming after hibernate. I found it as a tip on a website several months ago, but cannot recall where.

If anyone can advise as to whether acpi=off could make a useful difference, and why, or where I should look to find out (I have searched without success), I'll be grateful.

Also, is there a way to have flags that work only during resume, without having to have two grub menu entries?

Thanks.

Comment 45 Bojan Smojver 2012-03-13 07:50:20 UTC
(In reply to comment #44)
 
> If anyone can advise as whether the acpi=off could make a useful difference and
> why or where I should look to find out (I have searched without success) I'll
> be grateful.

If you are having i915 problems like me, it generally should not make a difference. In other words, you don't have to specify acpi=off. Unless you know that your system has ACPI problems, which doesn't seem to be the case from what you're saying.

> Also, is there a way to have flags that work only during resume, without having
> to have two grub menu entries?

Try adding this to your kernel command line: rd.driver.blacklist=i915. This will prevent i915 from being loaded by initramfs.

Comment 46 Bojan Smojver 2012-03-13 08:06:15 UTC
(In reply to comment #45)
 
> Try adding this to your kernel command line: rd.driver.blacklist=i915. This
> will prevent i915 from being loaded by initramfs.

BTW, this does not help on my machine. After about 50 or so hibernate cycles I get segfaults.

Comment 47 Bojan Smojver 2012-03-13 09:55:17 UTC
(In reply to comment #46)
> (In reply to comment #45)
> 
> > Try adding this to your kernel command line: rd.driver.blacklist=i915. This
> > will prevent i915 from being loaded by initramfs.
> 
> BTW, this does not help on my machine. After about 50 or so hibernate cycles I
> get segfaults.

And, of course, the machine eventually died.

In comparison, I just did over 130 hibernate/thaw cycles with nomodeset on the very same ThinkPad T510. I'm writing this from that session. The only segfaults were these:

Mar 13 19:24:54 shrek kernel: [  486.149939] modem-manager[3742]: segfault at 44 ip 000000000042edba sp 00007fff0389b740 error 4 in modem-manager[400000+55000]
Mar 13 20:38:30 shrek kernel: [ 1193.037601] modem-manager[7050]: segfault at 44 ip 000000000042edba sp 00007fffc607f7c0 error 4 in modem-manager[400000+55000]

Those are bugs in modem-manager itself. The rest of the stuff works just fine.
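For anyone running long hibernate/thaw loops, a quick summary of which programs segfaulted can be pulled out of the log. This is a rough sketch only: the function name segfaults_by_program is invented here, and it assumes kernel log lines in the same format as the two shown above (with the bracketed timestamp in front of the program name).

```shell
#!/bin/sh
# segfaults_by_program [LOGFILE]
# Hypothetical helper: prints "<program> <count>" for every program that has
# kernel "segfault at" lines in LOGFILE (default: /var/log/messages).
segfaults_by_program() {
    # Example line:
    #   ... kernel: [  486.149939] modem-manager[3742]: segfault at 44 ip ...
    # The sed expression keeps only the program name in front of "[pid]:".
    sed -n 's/.*\] \([^ []*\)\[[0-9]*\]: segfault at .*/\1/p' \
        "${1:-/var/log/messages}" | sort | uniq -c | awk '{ print $2, $1 }'
}

# Example use:
#   segfaults_by_program /var/log/messages
```

If many unrelated programs show up in the output after a number of cycles, that would point towards random memory corruption rather than a bug in one particular program.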

Comment 48 Arne Woerner 2012-03-13 13:20:17 UTC
is there a difference in swap usage?
on my box (it uses i915 and KMS) it seems to be important that no swap space is used, when the box begins to hibernate... but my sample size is just 14 successful thaws (with empty swap area)... and about 100 unsuccessful thaws (with some hundred MiBs in the swap area)...

Comment 49 Bojan Smojver 2012-03-13 22:29:34 UTC
(In reply to comment #48)
> is there a difference in swap usage?

Given we are talking about memory corruption problems, how much memory is available for hibernation image may make a difference in how readily the issue is reproduced. After all, the corruption appears random, so what and where will get corrupted will depend on what's in use.

In my experience on ThinkPad T510, I need at least 20 hibernate/thaw cycles to replicate the issue. I have 8 GB RAM, of which about 1 GB is in use when I do testing. My swap is 8 GB in size (and unused), so plenty of space there.

Comment 50 aaronsloman 2012-03-14 00:51:16 UTC
(I wrote in comment #44)
 
 > I am now running 3.2.9-2.fc16.i686 and have two entries for it in grub.cfg, one
 > for booting (e.g. after crash) and one for resuming from hibernate, the
 > default.
 > The only difference is that in the 'linux' line the resume entry has two extra
 > parameters at the end: "acpi=off nomodeset"
 >
 > The effect of 'nomodeset' seems to be that the graphic mode does not change
 > until after the decompression is complete (then the font size gets smaller).
 
 So far I have had no freezing during pm-hibernate. (after more cycles than would previously have produced freezing.)
 
 I have noticed that it no longer reports using three threads when
 hibernating. Has multi-threaded compression been turned off, or could I
 have inadvertently turned it off?
 
 I still find that hibernate and resume are both pretty fast. (On DELL
 Latitude E6410).

Comment 51 Bojan Smojver 2012-03-14 01:03:22 UTC
(In reply to comment #50)

>  I have noticed that it no longer reports using three threads when
>  hibernating. Has multi-threaded compression been turned off, or could I
>  have inadvertently turned it off?

Threading has been reduced to a single compression/decompression thread in 3.2.9-2. This was in an effort to see whether threading had anything to do with problems, which I don't believe is the case. But, I would say that, wouldn't I? ;-)

See:

http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=commitdiff;h=77c4837f0f440071d4df225afb164839af4537f8

>  I still find that hibernate and resume are both pretty fast. (On DELL
>  Latitude E6410).

Yeah, 2 things help:

1. Async I/O queue is deeper.

2. Compression is in its own thread (although a single one).

Comment 52 aaronsloman 2012-03-14 07:23:17 UTC
(In reply to comment #51)
> (In reply to comment #50)
> 
> >  I have noticed that it no longer reports using three threads when
> >  hibernating. Has multi-threaded compression been turned off, or could I
> >  have inadvertently turned it off?
> 
> Threading has been reduced to a single compression/decompression thread in
> 3.2.9-2. This was in an effort to see whether threading had anything to do with
> problems, which I don't believe is the case. But, I would say that, wouldn't I?

After hibernating and resuming several times, it eventually froze while running a single thread (Dell Latitude E6410), suggesting that your non-belief is justified!

I had been using "nomodeset" only for resume, because I had somehow got the impression that modeset at boot time was required for 'startx' to bring up X with full screen resolution.

But I find that is incorrect. I can use nomodeset for boot, and then run startx after logging in, and get X running. After that I use it for resume.

So I'll continue using this and report back later.

Previous comments suggest that the bug is either in the Intel graphics firmware/hardware (unlikely if MSWindows users don't have problems), or else in the i915 linux driver, in which case no kernel changes will fix it?? Is anyone who works on the graphics driver reading these reports?

Comment 53 Bojan Smojver 2012-03-14 08:15:25 UTC
(In reply to comment #52)

> Previous comments suggest that the bug is either in the Intel graphics
> firmware/hardware (unlikely if MSWindows users don't have problems), or else in
> the i915 linux driver, in which case no kernel changes will fix it?? Is anyone
> who works on the graphics driver reading these reports?

i915 is part of the kernel (if you run lsmod, you'll see it there). There is a user space portion of the Intel driver, but that is not the issue.

So, it is the kernel side that needs fixing.

Comment 54 Adam Williamson 2012-03-14 22:08:07 UTC
the effect of 'nomodeset' will be that you will be using the 'vesa' driver for X, not the 'intel' driver.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 55 aaronsloman 2012-03-16 01:53:43 UTC
(In reply to comment #54)
> the effect of 'nomodeset' will be that you will be using the 'vesa' driver for
> X, not the 'intel' driver.

But after I've logged in (in non-graphical mode) I run 'startx' and that launches the X window system with 1440x900 resolution, which I assume implies that it is using the intel driver, since the 'vesa' driver cannot achieve that resolution. Or have I misunderstood?

I am still getting hibernate freezing, but that's after running 'startx'.

Perhaps I should try switching to a text console before running pm-hibernate. I had not previously thought of doing that.

Comment 56 Bojan Smojver 2012-03-16 02:02:28 UTC
(In reply to comment #55)

> But after I've logged in (in non-graphical mode) I run 'xstart' and that
> launches the X window system with 1440x900 resolution, which I assume implies
> that it is using the intel driver, since the 'vesa' driver cannot achieve that
> resolution. Or have I misunderstood.
> 
> I am still getting hibernate freezing, but that's after running 'xstart'.
> 
> Perhaps I should try switching to a text console before running pm-hibernate. I
> had not previously thought of doing that.

You can verify what is being used in /var/log/Xorg.0.log file. Also, if you run lsmod | grep i915, you should get nothing back if that module was not loaded.
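The module check can also be scripted. The sketch below is an illustration under stated assumptions: the function name module_loaded is made up, and it reads /proc/modules (the file lsmod gets its listing from) so that only the module-name column is compared.

```shell
#!/bin/sh
# module_loaded NAME [MODULES_FILE]
# Hypothetical helper: succeeds if kernel module NAME is listed in
# MODULES_FILE (default: /proc/modules, the file lsmod reads).
module_loaded() {
    # Compare only the first column (the module name), so a module that is
    # merely mentioned in another module's "used by" list does not count.
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }' \
        "${2:-/proc/modules}"
}

# Example use:
#   if module_loaded i915; then echo "i915 is loaded"; else echo "i915 is not loaded"; fi
```

Which X driver is actually in use still has to be read from /var/log/Xorg.0.log; this helper only answers the kernel-module half of the question.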

Comment 57 Adam Williamson 2012-03-16 02:33:14 UTC
"which I assume implies that it is using the intel driver, since the 'vesa' driver cannot achieve that resolution. Or have I misunderstood."

Yes. The vesa driver is perfectly capable of reaching that resolution, and higher.

Comment 58 aaronsloman 2012-03-16 03:02:33 UTC
(In reply to comment #56)

> > Perhaps I should try switching to a text console before running pm-hibernate. I
> > had not previously thought of doing that.
> 
> You can verify what is being used in /var/log/Xorg.0.log file. Also, if you run
> lsmod | grep i915, you should get nothing back if that module was not loaded.

If I switch to a text console it goes back to the original low resolution, but lsmod shows that i915 is still in use.

In comment #57
> Yes. The vesa driver is perfectly capable of reaching that resolution, and
> higher.

So I wonder if I need i915 at all since I don't use 3-D graphics. Something must have caused it to be included when I did 'startx', after logging in with nomodeset.

I'll have to do more experimenting (including using rd.driver.blacklist=i915) after I've had some sleep.

Comment 59 aaronsloman 2012-03-16 23:48:22 UTC
I am using a Dell Latitude E6410 with intel graphic card and kernel  3.2.9-2.fc16.i686

I always boot in non-graphical mode (previously runlevel 3). I.e. 
   /etc/systemd/system/default.target

is a symbolic link to 
   /lib/systemd/system/runlevel3.target

So I can run tests, install updates, before using 'startx' to enter graphical mode.

I've tried the following in the kernel line in grub.cfg either when booting or when resuming or both, 

  (a) nomodeset
  (b) rd.driver.blacklist=i915

So far it does not seem that 'nomodeset' prevents pm-hibernate crashing, though I am not sure about the blacklist command. I'll continue to experiment with (b).

To my surprise, although nomodeset stops the machine entering high resolution mode when booting, it does not stop i915 module being loaded, as shown by output of
   lsmod | grep i915:

i915                  399773  5 
drm_kms_helper         30800  1 i915
drm                   179187  2 i915,drm_kms_helper
i2c_algo_bit           12980  1 i915
i2c_core               28123  6 videodev,i2c_i801,i915,drm_kms_helper,drm,i2c_algo_bit
video                  18500  1 i915

In contrast: rd.driver.blacklist=i915 does neither: i.e. it does not prevent high resolution mode being entered (so font size is reduced before booting finishes) and it does not prevent i915 being loaded giving the same evidence from lsmod as above.

In case it's relevant:
The command grep 915 /boot/config-3.2.9-2.fc16.i686 produces

   CONFIG_DRM_I915=m
   CONFIG_DRM_I915_KMS=y

At this stage I don't have clear evidence as to whether using (a) or (b) or both together affects crashing: I have not been experimenting long enough.

One thing that did surprise me was the effect on power consumption. If I set screen brightness to minimum, turn off backlighting of keys, turn off wireless, then measure battery drain using

   cat /sys/class/power_supply/BAT0/current_now

The values seem to be consistently higher if 'nomodeset' was included in the boot flag. For example with nomodeset the current does not drop below about 960000 and is more often over 1000000. However if I boot without 'nomodeset' then in the low load state the current can drop as low as 783000, though it is generally closer to 845000, depending on what I have been doing recently.

So it looks as if 'nomodeset' increases minimal power consumption by at least around 12%

In contrast rd.driver.blacklist=i915 seems not to affect the minimal load on battery. Should 'nomodeset' increase power consumption?
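The "around 12%" estimate can be checked with a little arithmetic on the readings quoted above. This is just a sketch: percent_increase is a made-up helper name, and it treats the current_now readings as plain numbers (as far as I know the kernel reports them in microamps).

```shell
#!/bin/sh
# percent_increase BASE NEW
# Prints how much larger NEW is than BASE, as a percentage with one decimal.
percent_increase() {
    awk -v old="$1" -v cur="$2" 'BEGIN { printf "%.1f\n", (cur - old) * 100 / old }'
}

# Readings quoted in the comment above:
percent_increase 845000 960000   # typical without nomodeset vs. lowest with it
percent_increase 783000 960000   # lowest without nomodeset vs. lowest with it
```

With those figures the two calls print 13.6 and 22.6, so measured against the readings taken without nomodeset the increase is at least in the region of the 12% mentioned, and possibly rather more.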

Comment 60 GeoffLeach 2012-03-17 00:07:37 UTC
(In reply to comment #51)
> (In reply to comment #50)
> 
> >  I have noticed that it no longer reports using three threads when
> >  hibernating. Has multi-threaded compression been turned off, or could I
> >  have inadvertently turned it off?
> 
> Threading has been reduced to a single compression/decompression thread in
> 3.2.9-2. This was in an effort to see whether threading had anything to do with
> problems, which I don't believe is the case. 

FWIW, with 3.2.9-2.fc16.i686.PAE (and no compression) the problem (reproducible hang with pm-hibernate) remains.

Comment 61 Bojan Smojver 2012-03-17 00:25:21 UTC
(In reply to comment #59)

>   (b) rd.driver.blacklist=i915

Just FYI, this will only prevent i915 from being loaded by initramfs. Once the thawed kernel is read in and resumed, the i915 kicks in, as normal.

I was only testing with this because someone mentioned that not having i915 in initramfs helps. That appears to be misinformation.

Comment 62 Bojan Smojver 2012-03-17 00:32:38 UTC
Created attachment 570741 [details]
Half the number of pages used by hibernation I/O

If anyone would like to play by compiling the kernel, this patch will halve the number of pages used by the hibernation code, so that hopefully other parts of the kernel have enough and do not hang waiting for free pages. Well, that's the flimsy theory at least.

Let me know whether it helps.

Comment 63 Bojan Smojver 2012-03-17 00:34:25 UTC
OOPS, posted this patch to the wrong bug. This was meant to go to bug #785384.

Comment 64 aaronsloman 2012-03-17 10:36:56 UTC
(In reply to comment #62)
> Created attachment 570741 [details]
> Half the number of pages used by hibernation I/O
> 
> If anyone would like to play by compiling the kernel, this patch will reduce
> the number of pages used by hibernation code in half, so that hopefully other
> parts of the kernel have enough and do not hang waiting for free pages. Well,
> that's this flimsy theory at least.

I don't think any of my problems could be due to number of pages, as I have had hibernate freezing with swap unused and lots of spare memory.

E.g. here's output of top at present:

top - 10:35:01 up 11:10,  9 users,  load average: 0.82, 0.61, 0.37
Tasks: 167 total,   2 running, 164 sleeping,   0 stopped,   1 zombie
Cpu(s):  0.7%us,  2.0%sy,  0.0%ni, 93.4%id,  3.7%wa,  0.2%hi,  0.0%si,  0.0%st
Mem:   3541948k total,   593320k used,  2948628k free,    49400k buffers
Swap: 10489064k total,        0k used, 10489064k free,   274684k cached

I've known pm-hibernate freeze in similar circumstances.

Comment 65 aaronsloman 2012-03-17 21:43:05 UTC
Some more observations that may (or may not) help to pin down the cause of hibernate problems.

1. Shortly after I wrote comment 64 showing much free memory and no used swap space, pm-hibernate froze again.

2. When pm-hibernate works I sometimes can't resume: instead of completing resuming it reboots. However if the resume boot parameters include "acpi=off" then resuming works reliably. So I use that for resuming after hibernating, but not for fresh boot.

3. It seems that using the "nomodeset" boot parameter both for a fresh boot and for resuming from hibernate, allows more hibernate+resume cycles without freezing. However, the price paid for that is higher battery drain. I have not noticed any other side effect.

In this situation adding "rd.driver.blacklist=i915" as suggested above does not seem to make a difference. In particular it does not prevent the i915 module being loaded before starting up graphic mode.

All the above apply to starting up in run-level 3 in non-graphical mode, then later running X after logging in, using 'startx' controlled by ~/.xinitrc

4. I have also noticed that during hibernation the wireless light on my laptop does not get turned off until very late, so that if hibernate freezes around 92% the light is still on. I wonder whether disabling wireless much earlier during the hibernate process could reduce the risk of unwanted interactions. (I use wicd -- maybe there is code to disable wireless that erroneously assumes NetworkManager is in use? I find NM unusable.)

Apologies if all this is irrelevant.
 
Query: I sometimes find after resume from hibernation that sshd is not running, which interferes with use of my local network. Is there a place to specify things to start up after hibernation?

Comment 66 aaronsloman 2012-03-18 00:09:26 UTC
(Clarification of comment #65)

> All the above apply to starting up in run-level 3 in non-graphical mode, then
> later running X after logging in, using 'startx' controlled by ~/.xinitrc

I should perhaps have explained, for people unfamiliar with non-graphical login on linux (probably none reading this?), that after booting in non-graphical mode, one can resume in graphical mode, which is what I normally do: it's the main point of using hibernate.

Comment 67 Bojan Smojver 2012-03-18 00:50:44 UTC
(In reply to comment #65)

> Apologies if all this is irrelevant.

Not irrelevant at all. Unfortunately, hibernation can be tripped by many different things, which makes it complicated.

From what you're saying here, I think it's most likely you are having driver (or other module) problems since 3.2. Try blacklisting modules systematically until you see which one is causing trouble.

Add a file in /etc/pm/config.d and set SUSPEND_MODULES variable like so:

SUSPEND_MODULES="abc xyz"

Where abc and xyz are names of modules you'd like to blacklist. If you are suspecting wireless, start with that. See whether it makes a difference.

> Query: I sometimes find after resume from hibernation that sshd is not running,
> which interferes with use of my local network. Is there a place to specify
> things to start up after hibernation?

You can put your own scripts in /etc/pm/sleep.d directory. Look in /usr/lib64/pm-utils/sleep.d (or /usr/lib/pm-utils/sleep.d on 32-bit archs) directory for examples on how to do it.
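As a rough illustration of such a script, here is a minimal sketch of a hook that starts sshd again after waking up. The file name (something like /etc/pm/sleep.d/99sshd), the function name handle_event and the START_SSHD variable are all assumptions made for this example; pm-utils passes the event name (hibernate, suspend, thaw or resume) as the first argument, and the file has to be executable.

```shell
#!/bin/sh
# Hypothetical pm-utils hook, e.g. saved as /etc/pm/sleep.d/99sshd.
# pm-utils calls each hook with the event name as $1.

# Command to run on wake-up. Assumes a systemd unit called sshd.service;
# it can be overridden from the environment for a dry run.
: "${START_SSHD:=systemctl start sshd.service}"

handle_event() {
    case "$1" in
        thaw|resume)
            # Coming back from hibernate (thaw) or suspend (resume).
            # Left unquoted on purpose so the command and its arguments split.
            $START_SSHD
            ;;
        *)
            # Going to sleep, or an unknown event: nothing to do.
            ;;
    esac
}

handle_event "${1:-}"
```

The existing scripts in the pm-utils sleep.d directory remain the better reference for anything beyond this, since they show the conventions the distribution actually uses.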

Comment 68 aaronsloman 2012-03-18 01:16:59 UTC
(In reply to comment #67)

> From what you're saying here, I think it's most likely you are having driver
> (or other module) problems since 3.2.

Also older kernels, in fedora 15 on desktop PC, with intel graphics, using i915.

> Try blacklisting modules systematically
> until you see which one is causing trouble.
> 
> Add a file in /etc/pm/config.d and set SUSPEND_MODULES variable like so:
> 
> SUSPEND_MODULES="abc xyz"

It looks as if experimenting will take a lot of time. I may try later.
 
> >  Is there a place to specify
> > things to start up after hibernation?
> 
> You can put your own scripts in /etc/pm/sleep.d directory. Look in
> /usr/lib64/pm-utils/sleep.d (or /usr/lib/pm-utils/sleep.d on 32-bit archs)
> directory for examples on how to do it.

Thanks that's very useful. I found that there are already scripts

   55NetworkManager
   91wicd

So I'll remove the first one and rename the second one 55wicd

There's already a lot of stuff about video in the scripts, including i915 mentioned briefly, and also intel. If the i915 modules are problematic, perhaps the i915 entries need to take more defensive action?

I wonder if anyone who has relevant expertise regarding i915 can comment?
(I am out of my depth!)

Comment 69 aaronsloman 2012-03-19 01:35:49 UTC
Now running kernel 3.2.10-3.fc16.i686 on Dell Latitude E6410

The strategy mentioned previously still seems to be robust:
I boot or resume from hibernate with "nomodeset" kernel parameter.

When resuming from hibernate add the extra parameter "acpi=off"
(The latter seems to be required to stop resume failing and triggering a full reboot.)

The only cost seems to be higher battery consumption with nomodeset, as reported in comment #59

On my desktop machine, running Fedora 15, also with intel graphics I am still using kernel 2.6.41.4-1.fc15.i686, using the same strategy (nomodeset + acpi=off). This appears to be quite robust, though hibernate and resume are a bit slower than with newer kernels.
However if I try using the desktop machine with kernel 2.6.42.9-2.fc15.i686 the pm-hibernate command mostly crashes. I have not investigated why.

Comment 70 Mike Heller 2012-03-19 09:31:10 UTC
Back from holidays; I read through the thread yesterday.
I spent a lot of time playing around with SUSPEND_MODULES, but with no success at all (I tried almost every module, also in combination with others).
My 8 GB desktop at home still hangs all the time during hibernation.
Any other ideas, or did I do something wrong?

Comment 71 aaronsloman 2012-03-19 21:30:39 UTC
(In reply to comment #69)

I wrote:

> Now running kernel 3.2.10-3.fc16.i686 on Dell Latitude E6410
> 
> The strategy mentioned previously still seems to be robust:
> I boot or resume from hibernate with "nomodeset" kernel parameter.
> 
> When resuming from hibernate add the extra parameter "acpi=off"
> (The latter seems to be required to stop resume failing and triggering a full
> reboot.)

No luck: after a few hibernate/resume cycles it froze, just after something like
  Firewire_core: rediscovered device fw0
  63%

I've gone back to 3.2.9-2.fc16.i686, which lasted longer in this mode.

Comment 72 Bojan Smojver 2012-03-20 03:11:52 UTC
I think we are dealing with at least two distinct problems here:

1. Memory corruption caused by i915/KMS. With kernel 3.3 my system will last from a couple to maybe 20 or so hibernate/thaw cycles, at which point segfaults start, which then ends in panic or something similar.

2. Hangs on hibernate with kernel 3.2. We already know that threading is not the cause, because people that are having this problem have been able to replicate it reliably with hibernate=nocompress. Of course, different buffering during hibernation may be the cause - this is still being investigated. This problem also affects a limited number of hardware combinations, but we don't know yet what the common factor is.

There are other problems too, some related to VirtualBox, plus the regular assortment of modules that are broken.

Comment 73 Mike Heller 2012-03-20 07:57:11 UTC
(In reply to comment #72)
> I think we are dealing with at least two distinct problems here:
> 
> 1. Memory corruption caused by i915/KMS. With kernel 3.3 my system will last
> from a couple to maybe 20 or so hibernate/thaw cycles, at which point segfaults
> start, which then ends in panic or something similar.
> 

does kernel 3.3 fix the "hangs on hibernate with kernel 3.2"???

Comment 74 Bojan Smojver 2012-03-20 08:57:19 UTC
(In reply to comment #73)

> does kernel 3.3 fix the "hangs on hibernate with kernel 3.2"???

Neither of them hang for me (in fact, today I did 250 hibernate/thaw cycles with 3.3 on my ThinkPad T510 with nomodeset), so someone that has hardware where 3.2 hangs will have to confirm that. Anyhow, 3.3.0 has been submitted for testing in F-16 and F-17, so we'll find out really soon.

Comment 75 Mike Heller 2012-03-20 18:55:08 UTC
(In reply to comment #74)

> Neither of them hang for me (in fact, today I did 250 hibernate/thaw cycles
> with 3.3 on my ThinkPad T510 with nomodeset), so someone that has hardware
> where 3.2 hangs will have to confirm that. Anyhow, 3.3.0 has been submitted for
> testing in F-16 and F-17, so we'll find out really soon.

tried today with nomodeset. same behaviour as with acpi=off -> hangs every second hibernate.
so I'll wait until 3.3 is available in repo testing...

Comment 76 Mike Heller 2012-03-20 19:57:07 UTC
(In reply to comment #74)
> Anyhow, 3.3.0 has been submitted for
> testing in F-16 and F-17, so we'll find out really soon.

couldn't wait. :-)
After installing kernel 3.3.0-2 from rawhide, hibernation succeeded several times (no hangs anymore). I'll keep tracking it.
But after some resumes I got a fresh boot instead of the image being loaded.

Comment 78 Bojan Smojver 2012-03-20 20:30:35 UTC
(In reply to comment #75)

> tried today with nomodeset. same behaviour as with acpi=off -> hangs every
> second hibernate.

Note that nomodeset option will generally only help with the first problem - i915/KMS memory corruption.

Comment 79 aaronsloman 2012-03-21 20:49:33 UTC
I used yumex and Fedora 16 updates testing to install new kernel 3.3.0-2.fc16.i686 (with stupid grub2 as usual putting the wrong partition UUID in the grub.cfg file).
Did not use 'nomodeset' or 'acpi=off'.

I tried pm-hibernate from console mode after login. For some reason got lots of segfault errors after resume, so tried again. It worked, but I've noticed a new line of output (also in /var/log/messages):

  ata2: link is slow to respond, please be patient 

Then started X and again tried pm-hibernate and resume. It worked. Then tried with firefox running. It worked. Tried again, and it worked, all without having to use either 'nomodeset' or 'acpi=off'

So far so good.

[Digression
Can anyone point me at the secret location where grub2 stores the information about the linux root partition, so that I can make it insert the F16 root partition uuid not the (no longer used) F15 root partition, so that I don't have to edit grub.cfg by hand after every kernel update? (I've done a lot of searching, in vain, though I have seen messages from other annoyed users -- including ubuntu  users -- complaining about this.)
]

Comment 80 Adam Williamson 2012-03-21 21:04:54 UTC
/boot/grub2/device.map , I think.

Comment 81 aaronsloman 2012-03-21 22:03:48 UTC
(In reply to comment #80)
> /boot/grub2/device.map , I think.

Alas no: at least on my machine that specifies only the /boot partition
   # this device map was generated by anaconda
   (hd0)      /dev/sda
   (hd0,6)      /dev/sda6

My F16 root partition is /dev/sda13 (identified by uuid in /boot/grub2/grub.cfg)

My old F15 root partition, kept for reference is /dev/sda12 -- that's the one that gets inserted (in uuid format) in the linux entry for every new kernel.

Maybe it's yum install, not grub2 that's buggy?

Comment 82 aaronsloman 2012-03-21 22:19:45 UTC
(In reply to comment #79)

> Then started X and again tried pm-hibernate and resume. It worked. Then tried
> with firefox running. It worked. Tried again, and it worked, all without having
> to use either 'nomodeset' or 'acpi=off'

But on the very next test, resume failed and it rebooted instead.

So I have re-inserted the entry with acpi=off for resume after hibernate
in grub.cfg

Comment 83 aaronsloman 2012-03-22 17:39:41 UTC
(In reply to comment #81)
> (In reply to comment #80)
> > /boot/grub2/device.map , I think.
> 
> Alas no: at least on my machine that specifies only the /boot partition
>    # this device map was generated by anaconda
>    (hd0)      /dev/sda
>    (hd0,6)      /dev/sda6
> 
> My F16 root partition is /dev/sda13 (identified by uuid in /boot/grub2/grub.cfg
> 
> My old F15 root partition, kept for reference is /dev/sda12 -- that's the one
> that gets inserted (in uuid format) in the linux entry for every new kernel.
> 
> Maybe it's yum install, not grub2 that's buggy?

Further investigation suggests that there's a very serious problem with grub2 and its use by yum. See bug #756559

I have inserted a report of the problem there (with a typo acknowledged in Comment 10).

Comment 84 aaronsloman 2012-03-25 18:06:28 UTC
(In reply to comment #79)
I wrote:

> I used yumex and Fedora 16 updates testing to install new kernel
> 3.3.0-2.fc16.i686 (with stupid grub2 as usual putting the wrong partition UUID
> in the grub.cfg file).

> .....It worked, but I've noticed a new
> line of output (also in /var/log/messages:
> 
>   ata2: link is slow to respond, please be patient 

Since kernel 3.3.0-2 that message, along with a delay of a couple of seconds before and after, has persisted whenever I boot, and sometimes when I resume from hibernate. It is still happening with kernel 3.3.0-4.fc16.i686.

'grep patient /var/log/messages' produces:

Mar 22 21:44:23 lape kernel: [    7.038582] ata2: link is slow to respond, please be patient (ready=0)
Mar 23 12:06:40 lape kernel: [    7.153556] ata2: link is slow to respond, please be patient (ready=0)
Mar 24 22:24:21 lape kernel: [    7.049138] ata2: link is slow to respond, please be patient (ready=0)
Mar 24 23:02:00 lape kernel: [    6.453185] ata2: link is slow to respond, please be patient (ready=0)
Mar 24 23:04:56 lape kernel: [    6.443441] ata2: link is slow to respond, please be patient (ready=0)
Mar 25 14:08:44 lape kernel: [    6.439209] ata2: link is slow to respond, please be patient (ready=0)
Mar 25 18:41:27 lape kernel: [    7.046079] ata2: link is slow to respond, please be patient (ready=0)

I've tried older /var/log/messages* files, and this started on 21st March, which is when I installed 3.3 kernel. I have a sample of messages files going back to January, with no occurrence of the word 'patient'.

Here's typical context in /var/log/messages:

Mar 25 18:41:27 lape kernel: [    2.822814] input: DualPoint Stick as /devices/platform/i8042/serio1/input/input5
Mar 25 18:41:27 lape kernel: [    2.836256] input: AlpsPS/2 ALPS DualPoint TouchPad as /devices/platform/i8042/serio1/input/input6
Mar 25 18:41:27 lape kernel: [    7.046079] ata2: link is slow to respond, please be patient (ready=0)
Mar 25 18:41:27 lape kernel: [   11.735489] ata2: COMRESET failed (errno=-16)
Mar 25 18:41:27 lape kernel: [   12.040326] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 25 18:41:27 lape kernel: [   12.042312] ata2.00: ATAPI: MATSHITA DVD+/-RW UJ892, 1.05, max UDMA/100
Mar 25 18:41:27 lape kernel: [   12.045232] ata2.00: configured for UDMA/100

I've tried running smartctl to check the hard drive and find no sign of anything wrong with it. This is on Dell Latitude E6410 with intel graphics, and intel core i5 cpu.

I expect people who boot in graphical mode will not notice this problem. If any are reading this message they could check their /var/log/messages file.

Internet searches show that various people have reported this warning message over the last few years, though not in connection with hibernate or resume, but I've not found any explanation, nor any indication of how to remove the delay and the messages.

Since the delay is not long (about 2 seconds before and after the message appears, which happens just before the screen switches from low resolution to high resolution) I can live with this, for the sake of a working version of pm-hibernate. But I suspect it indicates a problem new in kernel 3.3 that should be fixed.

Although pm-hibernate now works reliably on this machine, I still have to use acpi=off as boot flag for resume. I have two entries in grub.cfg, one for boot, without that flag, and one for resume, with it. If I don't include that flag, resume progresses to near the end then fails and the machine does a full reboot.

Should I start a separate bug report on that now that this persists after hibernate seems to be working reliably? It's presumably a problem with i915 and resume.

I don't have enough technical knowledge to work on kernel sources. If I can provide any other information, please let me know.

Comment 85 Bojan Smojver 2012-03-25 22:02:44 UTC
(In reply to comment #84)
 
> Should I start a separate bug report on that now that this persists after
> hibernate seems to be working reliably? It's presumably a problem with i915 and
> resume.

I think you should open a new bug report for this. The symptoms that you are describing here do not sound like the i915 corruption problem. I have not seen anyone avoid trouble when i915 is the cause by passing acpi=off. You are most likely having a problem with some other device.

Comment 86 Josh Boyer 2012-03-28 18:14:06 UTC
[Mass hibernate bug update]

Dave Airlie has found an issue causing some corruption in the i915 fbdev after
a resume from hibernate.  I have included his patch in this scratch build:

http://koji.fedoraproject.org/koji/taskinfo?taskID=3940545

This will probably not solve all of the issues being tracked at the moment, but
it is worth testing when the build completes.  If this seems to clear up the
issues you see with hibernate, please report your results in the bug.

Comment 87 aaronsloman 2012-03-28 21:11:03 UTC
(In reply to comment #85)
> (In reply to comment #84)
> 
> > Should I start a separate bug report on that now that this persists after
> > hibernate seems to be working reliably? It's presumably a problem with i915 and
> > resume.
> 
> I think you should open a new bug report for this. The symptoms that you are
> describing here do not sound like the i915 corruption problem. I have not seen
> anyone avoid trouble when i915 is the cause by passing acpi=off. You are most
> likely having a problem with some other device.

Done: now in bug #806315

Comment 88 aaronsloman 2012-03-28 21:23:55 UTC
(In reply to comment #84)
> I wrote:

> > .....It worked, but I've noticed a new
> > line of output (also in /var/log/messages:
> > 
> >   ata2: link is slow to respond, please be patient 
> 
> Since kernel 3.3.0-2 that message, along with a delay of a couple of seconds
> before and after, has persisted whenever I boot, and sometimes when I resume
> from hibernate. It is still happening with kernel 3.3.0-4.fc16.i686.

I have now summarised that behaviour in bug #807593

Comment 89 aaronsloman 2012-03-30 03:55:56 UTC
(In reply to comment #85)

> I think you should open a new bug report for this. The symptoms that you are
> describing here do not sound like the i915 corruption problem. I have not seen
> anyone avoid trouble when i915 is the cause by passing acpi=off. You are most
> likely having a problem with some other device.

I've now found web sites recommending acpi=noirq to deal with resume or booting problems. (E.g. bug #709115 Comment 36).

I've tried that using patch kernel 3.3.0-7.1.fc16.i686

Using acpi=noirq  instead of acpi=off, I've managed to get several hibernate-resume cycles without resume failing. This also has the advantage that I can use the same kernel flags for boot as for resume, whereas I could use acpi=off only for resume, not for boot. (Using it at boot time disabled too much functionality.) 

I'll report further progress on this solution in bug #806315 .

Comment 90 aaronsloman 2012-04-12 21:43:42 UTC
(In reply to comment #89) 2012-03-29 23:55:56 EDT 

>......

> Using acpi=noirq  instead of acpi=off, I've managed to get several
> hibernate-resume cycles without resume failing. This also has the advantage
> that I can use the same kernel flags for boot as for resume, whereas I could
> use acpi=off only for resume, not for boot. (Using it at boot time disabled too
> much functionality.) 

Success with acpi=noirq did not last. So I reverted to using acpi=off for resume.

> I'll report further progress on this solution in bug #806315 .

It seems that the resume failing problem has been fixed in 3.3.1-3.fc16.i686

I no longer need the acpi=off flag when resuming from hibernate.

Great progress: and thanks to all involved.

Comment 91 aaronsloman 2012-05-10 16:19:54 UTC
(In reply to comment #90 2012-04-12)
> It seems that the resume failing problem has been fixed in 3.3.1-3.fc16.i686
> 
> I no longer need the acpi=off flag when resuming from hibernate.
> Great progress: and thanks to all involved.

Alas that did not last. Congratulations were premature.
See Bug #806315 for more details.

Comment 92 Clemens Eisserer 2012-05-10 21:48:15 UTC
Unfortunately hibernation never worked on my Toshiba Tecra A8 because of its completely crippled BIOS. I miss that feature a lot :/

Comment 93 Timothy Murphy 2012-05-11 11:30:13 UTC
I use hibernate on several different ThinkPads,
and find it extremely useful.
Please don't remove it.

Comment 94 Adam Williamson 2012-05-14 21:41:32 UTC
So this bug appears to have morphed over time into a tracker for hibernation issues, on the basis that we'll fix them rather than disabling it. (And whew, looks like the press hyenas who just chewed Canonical up for disabling hibernation in 12.04 missed this bug). Updating summary.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 95 aaronsloman 2012-05-15 00:50:40 UTC
(In reply to comment #94)
> So this bug appears to have morphed over time into a tracker for hibernation
> issues, on the basis that we'll fix them rather than disabling it.

My impression, as a long term user of hibernation (originally using swsusp2, then its successor 'tuxonice' and more recently pm-hibernate) is that hibernate used to work reliably on linux but over the last year or so problems arose, some concerned with hibernation failing and some concerned with resuming failing.

The problems with the hibernation (not resume) process have been removed and the process significantly speeded up in the last few months.

But resume/thaw remains problematic. As explained in Bug #806315 I have found a set-up that seems to make resume work every time, but it is not a satisfactory long term solution. It requires me to have two versions of the grub2 menu item, one for a fresh boot and one for resume. The only difference is that the resume version includes 'acpi=off' (a tip found in a web page, but without any explanation as to why it is needed). As explained in https://bugzilla.redhat.com/show_bug.cgi?id=806315#c37 I have a script to produce an edited copy of /boot/grub2/grub.cfg for use when resuming, and I invoke hibernate with a script that temporarily installs that copy so that acpi=off is there for resuming.
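
To illustrate the kind of edit involved (this is not the script linked above, just a rough sketch), the following Python function appends a flag to every kernel line in the text of a grub.cfg. It assumes that kernel lines start with "linux" or a variant such as "linux16"; the name add_kernel_flag is made up for illustration.

```python
def add_kernel_flag(grub_cfg_text, flag="acpi=off"):
    """Return a copy of grub.cfg text with `flag` appended to each kernel line."""
    out = []
    for line in grub_cfg_text.splitlines():
        words = line.split()
        # Assumption: the kernel command line in a grub2 menu entry begins
        # with "linux" (or a variant such as "linux16" / "linuxefi").
        if words and words[0].startswith("linux") and flag not in words[1:]:
            line = line.rstrip() + " " + flag
        out.append(line)
    return "\n".join(out) + "\n"
```

The function only transforms text and never touches /boot, so the result can be inspected before anything is installed. Lines that already carry the flag are left alone, so running it twice gives the same output.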

I use this on both a laptop and a desktop machine running Fedora 16, both using intel integrated graphics, requiring the i915 module, which appears to be the cause of the problem.

The resume process runs until very near the end with the progress counter getting very close to 100%. Then, when the screen is supposed to reset to graphic mode it crashes and reboots -- not every time, but very often, especially on my laptop. Because that seems to be reliably fixed by acpi=off, someone who understands the acpi mechanism should be able to track down what difference acpi=off makes to the resume code, and perhaps find a way to prevent the crash+reboot even without the acpi boot parameter. It probably has something to do with restoring the screen state when the i915 module is used.

Unfortunately I am not a kernel programmer and would not be able to hunt this down.

It may turn out that there is some problem in the intel graphics hardware/firmware that means that it will always be necessary to use acpi=off when resuming with the i915 module in use. If so, that case should be tested for, and the acpi setting changed by the default grub2 code, without users having to mess around with boot menu options. It sounds like a very small change to the code run on resume after hibernate.

Not being a kernel programmer, I would not easily be able to find and fix the relevant portion of code. I hope this description helps.

Comment 96 admin 2012-05-24 05:45:57 UTC
I crashed my Fedora installation with this bug, and I have now reproduced it on another machine. After hibernating my system and then running yum update, the system became unbootable and I had to reinstall it. See details at https://bugzilla.redhat.com/show_bug.cgi?id=823871. I'm voting for this to be an F17 release blocker under the release criterion "All known bugs that can cause corruption of user data must be fixed or documented at Common F17 bugs".

After hibernating and running yum update, the system may become unbootable or unusable due to errors in package unpacking.

Comment 97 aaronsloman 2012-05-24 09:25:50 UTC
(In reply to comment #96)
> ....
> After hibernating/yum update system may become unbootable/unusable due to
> errors in packages unpacking.

When I last had this symptom (using Fedora 16) it turned out to be due to mounting of an old root file system caused by kernel update putting the wrong root partition UUID in grub.cfg.

Eventually this turned out to be the result of a bug in grubby (used by kernel updates) that made it copy the root partition from /etc/fstab, where I had inadvertently inserted the wrong root partition when editing fstab after upgrading from F15 to F16 installed on a new partition (with /boot on a separate partition).

Having the wrong root partition in grub.cfg is disastrous. Having it in /etc/fstab is not, because that file cannot be read until after root has been successfully mounted. For some time I had to edit grub.cfg by hand after every kernel update before shutting down.

(Hibernating instead of rebooting after a kernel update can also cause a disaster.)

I don't know if grubby has been fixed, as I've now corrected my fstab. If grubby has not been fixed, updating the kernel could produce the symptoms described.

As a precaution, grub2-mkconfig can be run after a kernel update. It seems to produce a correct grub.cfg. See also Bug #756559
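
A quick way to spot a wrong root entry before rebooting or hibernating might look like the Python sketch below. It is only an illustration: the function name unexpected_roots is made up, it assumes kernel lines in grub.cfg start with "linux", and it compares root= values as plain strings against a value you already know to be correct (for example the UUID reported for your real root partition).

```python
def unexpected_roots(grub_cfg_text, expected_root):
    """Return (line_number, root_value) for kernel lines whose root= differs."""
    problems = []
    for number, line in enumerate(grub_cfg_text.splitlines(), start=1):
        words = line.split()
        # Assumption: only lines starting with "linux" carry kernel arguments.
        if not words or not words[0].startswith("linux"):
            continue
        for word in words[1:]:
            if word.startswith("root="):
                value = word[len("root="):]
                if value != expected_root:
                    problems.append((number, value))
    return problems
```

Because the comparison is textual, a root given as a device path in one place and as UUID=... in another would be reported even if both name the same partition, so any hit needs checking by hand.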

Comment 98 admin 2012-05-24 09:59:04 UTC
No. It's not only the kernel. In my case it was related to broken dependencies of other packages.

Comment 99 aaronsloman 2012-05-24 10:41:14 UTC
(In reply to comment #98)
> No. It's not only kernel. In my case it was related with broken dependencies
> of other packages.

That symptom affected me also: it seemed to result from booting with the wrong root partition and having a mixture of F15 (from the root partition) and F16 (from the kernel file in /boot). This was possible because /boot had its own partition, separate from the / partition. Symptoms included unwriteable file systems and package dependencies failing.

If you don't have a separate boot partition and an old root partition, this cannot be the source of your problem. Apologies if I misunderstood your description of the problem.

Comment 100 admin 2012-05-24 13:16:39 UTC
No, that is not the source. I have no older root partition and no other OS installations. After hibernating and running yum update, my system dropped me to a recovery shell to repair the filesystem. After fsck the filesystem was repaired, with some duplicate inodes fixed. After the repair there are multiple errors in some files, and some services such as avahi-daemon are now unable to start. That is my new installation. On the old installation, which was completely destroyed during yum update, some packages were updated successfully, some dependent packages were not updated, and (I think) bash and some other packages were updated partially.

Comment 101 admin 2012-05-24 14:42:16 UTC
As I see it, we need to rebase the kernel package to 3.4 because of its many hibernation fixes.

Comment 102 Adam Williamson 2012-05-24 16:10:15 UTC
This bug is now being used as a tracker. Please don't discuss single specific bugs in the comments here; it'll only get confusing.

Comment 103 Timothy Murphy 2012-05-25 10:59:03 UTC
I don't understand what this last comment (102) means.
Reading through the many comments it is not clear to me if anyone has had memory corruption after hibernation who is _not_ using the i915 driver.

Comment 104 Michał Piotrowski 2012-07-05 07:01:14 UTC
I know that hibernation is broken on some systems and works fine on others. I understand that developers don't want to deal with bugs that are caused by hibernation. How about tainting the kernel when the user uses hibernation? Then you could completely ignore bugs on tainted kernels.

Comment 105 aaronsloman 2012-07-06 21:22:53 UTC
(In reply to comment #104)
> I know that hibernation is broken on some systems and works fine on other. I
> understand that developers don't want to deal with bugs that are caused by
> hibernation. How about tainting kernel when user uses hibernation?Then you
> could completely ignore bugs on tainted kernels.

Well, there has been significant development on hibernation since this thread started, and in particular, a bug that caused pm-hibernate to crash was fixed (in Fedora 16) a few months ago, in March, in kernel 3.3.

There is still a problem with resume from hibernation, apparently connected with the i915 driver, but I have found what appears to be a slightly messy but totally reliable workaround (on both my laptop and my desktop PC using F16) described here, involving boot flags:
http://www.cs.bham.ac.uk/~axs/laptop/hibernate-on-linux.html

I don't understand why it is needed or why it works! I suspect a small change to resume from hibernate would make my fix unnecessary, but I don't know enough to do anything about it.

Comment 106 Arne Woerner 2012-08-05 05:17:00 UTC
is there a reason why it got worse again?
is it the intel driver?
but why does intel make a driver that can't even hibernate?

Comment 107 Jaroslav Reznik 2013-03-22 12:14:19 UTC
Add Tracking keyword to avoid bug closure during Rawhide rebase process.

Comment 108 Mike Heller 2014-06-13 08:48:29 UTC
Since kernel-3.14.5-200 my HP EliteBook doesn't hibernate anymore.
kernel kernel-3.14.4-200 works fine

Comment 109 Paul Bolle 2014-06-13 09:48:05 UTC
(In reply to Mike Heller from comment #108)
> Since kernel-3.14.5-200 my HP EliteBook doesn't hibernate anymore.
> kernel kernel-3.14.4-200 works fine

This is a tracker bug. I think you're supposed to file a separate report for your issue (if nothing similar is already filed, that is). Someone else might then add your issue to this tracker bug at some later point.

Comment 110 Adam Williamson 2014-06-13 19:43:14 UTC
Indeed. These days, it's probably best to file a bug report upstream on bugzilla.kernel.org (ideally after verifying the issue exists with an upstream kernel) - there's more devs watching that than just the Fedora ones. There were some small Fedora-specific changes between 3.14.4-200 and 3.14.5-200 - see http://koji.fedoraproject.org/koji/buildinfo?buildID=520927 - but none of them leap out as possibly causing a hibernate regression to me, so I'd guess it's something that changed upstream between 3.14.4 and 3.14.5.

