Bug 862475

Summary: Why do I need maxcpus=1 to resume from pm-hibernate in 32-bit Fedora 16 on Viglen Desktop PC, Fedora 17 on Dell E6410 laptop, both with intel core i5 cpu, intel graphics
Product: [Fedora] Fedora Reporter: aaronsloman <a.sloman>
Component: kernel Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED WONTFIX QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 17 CC: bojan, eran.borovik, gansalmon, hughsient, itamar, jonathan, jskarvad, kernel-maint, madhu.chinakonda, nuonguy, pknirsch
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: Fedora 18 Doc Type: Bug Fix
Last Closed: 2013-07-31 22:18:02 EDT Type: Bug

Description aaronsloman 2012-10-02 23:20:04 EDT
Description of problem:
pm-hibernate now reliably shuts down for me, and has done since about March 2012. Resume/thaw, however, crashes unless I add an extra parameter to the grub boot menu after pm-hibernate. I used to do this successfully with acpi=off, but then I discovered that one effect of that was that only one CPU is used during resume. So I tried maxcpus=1 for resume from hibernate, and so far that has allowed resume to complete every time.

Without maxcpus=1 I find that resume works fine until just before the desktop is refreshed (100% decompression) and after that it just shuts down automatically and does a full reboot -- but with no record of what went wrong.

Previously I used acpi=off to enable successful resume, but since maxcpus=1 is more specific and seems to make resume totally reliable I thought I should start a new bugreport.

Version-Release number of selected component (if applicable):
On Dell laptop E6410: kernel 3.5.4-2.fc17.i686 pm-utils-1.4.1-18.fc17.i686

On Viglen PC: kernel 3.3.7-1.fc16.i686 pm-utils-1.4.1-13.fc16.i686

(I have had the same problem with all recent kernels.)

How reproducible:

Failing to resume after hibernate is random -- but more often than not it fails.
Resuming successfully with maxcpus=1 happens 100% of the time.

Steps to Reproduce:
1. Boot
2. use pm-hibernate
3. reboot
    sometimes resumes successfully sometimes crashes (just before final screen refresh), after indicating decompression using 3 threads.

1. Boot
2. use pm-hibernate
3. reboot with maxcpus=1
    always resumes successfully, after decompressing with 1 thread.

Actual results:
As above

Expected results:
resume should always work after pm-hibernate, without requiring maxcpus=1

Additional info:
The fact that maxcpus=1 prevents resume crashing seems to imply that the multi-threading code in resume has a bug. Perhaps something needs to be synchronised before the display is finally restored.

Note: I previously reported this as a problem that required acpi=off to resume in Bug #806315
I hope this new bugreport will make it easier for whoever wrote the multi-threaded decompression code to find the bug.
I have the impression that causing something to wait just before restoring the display will fix this.
(I don't know if this is connected with use of i915 graphics module.)

If it can't be fixed, is there a way to make the resume from hibernate code check whether i915 is in use, and if so use only 1 cpu for resume, so that users don't need different boot flags for full boot and for resume?
Comment 1 Jaroslav Škarvada 2012-10-03 03:28:10 EDT
(In reply to comment #0)
> Without maxcpus=1 I find that resume works fine until just before the
> desktop is refreshed (100% decompression) and after that it just shuts down
> automatically and does a full reboot -- but with no record of what went
> wrong.
> 
Does it work without maxcpus=1 in case you hibernate without X/desktop (i.e. from something previously called runlevel 3)? We need to sort out whether it is X driver or kernel issue.
Comment 2 aaronsloman 2012-10-03 05:58:54 EDT
(In reply to comment #1)
> Does it work without maxcpus=1 in case you hibernate without X/desktop (i.e.
> from something previously called runlevel 3)? We need to sort out whether it
> is X driver or kernel issue.

I normally boot into runlevel 3 (which enables me to do things like test for updates before going into graphic mode). I always do that first after installing a new kernel, and for several months I've been testing pm-hibernate in that mode with each new kernel, on my Dell Latitude E6410. I usually find that resume works randomly without any special boot flags, so the failure to complete the resume does not depend on whether X is running. I have the impression that when resume completes there is a change of display mode immediately after 100% is reached, whether it's running X or not. It's at that point that resume fails.

Just to check I've tried after getting your message, on my Dell laptop with kernel 3.5.4-2.fc17.i686 #1 SMP Wed Sep 26 22:32:49 UTC 2012

Resume failed three times in a row, without X running.

I would prefer not to repeat this test on the desktop Viglen PC (also intel core i5, but running 3.3.7-1.fc16.i686 #1 SMP Tue May 22 14:14:30 UTC 2012) because I have 10 desktops with lots of windows containing work in progress. That desktop has uptime 117 days with well over 150 successful hibernate-resume cycles, originally using acpi=off for resume, but for the last two weeks or so using maxcpus=1 instead. The PC used to sometimes resume OK and sometimes fail without either boot flag. (As far as I know I am using the latest F16 kernel, but I did not try to update the kernel as that would have required a reboot.)

If you really need me to test that the PC still sometimes fails with latest F16 kernel, I'll do it but I suspect the evidence from the Dell is sufficient.

I upgraded the Dell from F16 to F17 some time in August to see if the change would lead to resume working properly, but it did not.

I did not discover the existence of the maxcpus=1 option until I noticed that during resume with acpi=off it showed only one penguin and reported only one thread for decompression, whereas without acpi=off it showed 4 penguins and three threads for decompression. So that set me wondering whether the failure to resume was directly linked to the number of threads and some sort of timing or synchronisation bug. That led me to try maxcpus=1, which works as well as acpi=off, but presumably has fewer side-effects.
Comment 3 Jaroslav Škarvada 2012-10-03 06:06:33 EDT
Maybe some SMP issue in the i915 driver. Reassigning to kernel for further investigation.
Comment 4 aaronsloman 2012-10-06 06:29:11 EDT
Would it be possible to produce a version of the kernel with multi-thread decompression disabled, at least if i915 driver is present? If given instructions on how to install it I would be willing to try it out without the maxcpus flag. If that resumes from hibernate reliably it would help to isolate the bug in question.

This was done earlier in the year for hibernate failures related to multi-thread compression. See Bug #785384. The problem was eventually fixed by Bojan Smojver in April 2012 (see comment 114) after an earlier trial of a version of hibernate with single-thread compression.

In that case the problem was connected with calculation of free pages, though the resume/thaw problem may be very different.
Comment 5 aaronsloman 2012-10-10 18:58:39 EDT
In connection with Bug #859723 "(hibernate) 3.5.x kernel / Core i7 / pm-hibernate crashes before power down" Bojan Smojver wrote in comment 4

> If you'd like to eliminate compression/threading issues, you can boot with
> hibernate=nocompress.

So I thought I would try that instead of maxcpus=1 to see if it had any effect on this bug (resume not completing with decompression multi-threaded).

The nocompress option seems to prevent the resume crash (which occurs just at the end of decompression with multi-threading), but the cost is that it takes several seconds longer to restart. I tried about 5 times, using kernel 3.5.6-1.fc17.i686 installed today on Dell Latitude E6410. It always resumed with nocompress.

I then removed the 'nocompress' and it resumed once after hibernate, but the next time crashed. Trying again without nocompress it crashed again. Inserting 'maxcpus=1' into grub.cfg just before invoking pm-hibernate allowed several hibernate-resume cycles (as was previously also achieved with "acpi=off" used for resume only).

Decompressing with only one thread is still quite fast -- a lot better than not compressing at all.

Summary: 
1. with the standard grub.cfg pm-hibernate works but resume frequently fails right at the end of decompression, causing a full reboot.

2. booting with 'hibernate=nocompress' allows hibernate+resume to work OK but resume is particularly slow.

3. booting with standard grub.cfg then changing it before running pm-hibernate, by inserting either 'maxcpus=1' or 'acpi=off' allows hibernate and resume to work, reliably, though resume is a little slower with only one decompression thread (acpi=off and maxcpus=1 both prevent multi-thread decompression).
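
For anyone wanting to script the grub.cfg change described in item 3, here is a minimal sketch. The helper names has_kernel_param and add_kernel_param are made up for illustration, and the sketch assumes kernel lines of the form "linux <image> <params...>" as found in /boot/grub2/grub.cfg on these machines. Try it on a copy of grub.cfg first.

```shell
#!/bin/sh
# Hypothetical helpers; not part of pm-utils or grub.

# Succeeds if any "linux" line in the given config already carries the parameter.
has_kernel_param() {
    cfg=$1; param=$2
    grep -Eq "^[[:space:]]*linux[[:space:]].*[[:space:]]${param}([[:space:]]|\$)" "$cfg"
}

# Appends the parameter to every "linux" line, unless some line already has it.
# (A config with several menu entries where only one has the parameter is
# left alone by this simple check.)
add_kernel_param() {
    cfg=$1; param=$2
    if has_kernel_param "$cfg" "$param"; then
        return 0
    fi
    sed "/^[[:space:]]*linux[[:space:]]/ s|[[:space:]]*\$| ${param}|" "$cfg" > "$cfg.new" &&
        mv "$cfg.new" "$cfg"
}
```

Running something like add_kernel_param on a copy of grub.cfg with maxcpus=1 before pm-hibernate would stand in for the manual edit. The parameter stays in the file until it is removed again, so the next full boot would also be limited to one CPU unless the change is reverted after resume.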

I conclude that there's a bug in the multi-threaded decompression code (maybe specific to interaction with i915 driver), which is not in the single-threaded decompression code.

If the bug cannot easily be fixed, is there any chance of providing a boot flag, something like hibernate.decompress-cpus=1, so that users don't have to alter grub.cfg after normal boot, before running pm-hibernate, to allow a successful resume with compression?

I have similar symptoms on my Desktop PC running Fedora 16 with this kernel

  3.3.7-1.fc16.i686 #1 SMP Tue May 22 14:14:30 UTC 2012

On that machine (also a 4-CPU Core i5 processor), I have successfully been using the acpi=off flag, and more recently the maxcpus=1 flag, for resume for 124 days since the last full boot (as shown by 'uptime'), often hibernating more than once per day.

When I tried without either boot flag, resume failed more often than not. That's a machine with quite different motherboard, two very different hard drives, different (intel) graphic facilities. So I think this shows conclusively that there's a problem about resume using multiple threads for decompression: it always gets to 100% before crashing and rebooting, and it is not specific to my laptop.
Comment 6 Bojan Smojver 2012-10-10 20:02:19 EDT
Just before the crash, can you see "PM: Image successfully loaded" message? If yes, the decompression threads would have finished by then.

If you wish to play with kernel compilation, I can send you a small patch that will pause the kernel for a few seconds after each of the thaw stages is completed. At least it will give us a better indication where to look, although, if the problem really is of a multi-threaded nature, inserting pauses will change the experiment itself.
Comment 7 aaronsloman 2012-10-10 20:29:40 EDT
This is copied from comment 7 in Bug #859723 which is about hibernate failing.
I'll reply to your comment #6 above, immediately after this.

Bojan Smojver wrote (in Bug #859723):

> Being the author of both compression and threading hibernation code, I can
> tell what changed in 3.5 (as compared to 3.4):
> 
> http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commitdiff;h=5a21d489fd9541a4a66b9a500659abaca1b19a51
> 
> Not much to do with threading - mostly about being more careful with memory
> allocations. Anyhow, I see from your comment that you are having trouble
> even with earlier kernels on thaw, so it's probably not 3.5 specific anyway.

Apologies: I should have made that clear. The bug in resume became apparent only after you fixed the earlier bug that prevented hibernate completing (in May I think). After that I found hibernate totally reliable, but resume was unreliable, and tended to crash immediately after decompression -- screen going blank, and machine rebooting. The bug could not be manifested before that since hibernate did not shut down reliably.

Starting several months ago, I tried various cures for the resume problem, and for a while settled on using acpi=off for resume only.

That worked but seemed to be overkill, until someone mentioned maxcpus=1, which also works reliably, allowing resume to complete.

> You note in the other bug that this may be a bug related to decompression
> and threading (possibly even some kind of i915 interaction). This code does
> not interact with i915 directly. So, if there is a decompression/threading
> bug, it is in the hibernation decompression/threading code. If there is an
> i915 bug, that would be an entirely separate issue.

(I lack your expertise: I mentioned i915 only because it is used on both my machines with the resume bug, and i915 has often been referred to as a source of problems.)

> Without seeing some kind of screen dump or other kind of debugging info, it
> will be very difficult to say what could really be causing this.

I don't know how to get a screen dump, and I don't know if it would help: during the resume process everything proceeds normally and the decompression begins with the notification that 3 threads are being used. Everything seems to work perfectly with the percentage display increasing *very* quickly until decompression seems to be complete (I think it reaches 100%) and the screen goes blank. Then there's a pause and a full reboot starts. I presume there's nothing in any log file because logs can only be created after decompression and successful resume.

In contrast, if I use acpi=off or maxcpus=1 for resume after hibernate, the behaviour during resume is exactly the same except that only 1 thread is reported for decompression. When the 100% decompression is reached the screen still goes blank for an instant then almost immediately is restored to its previous state with an xterm window showing the pm-hibernate command.

So the crashing with multi-thread decompression seems to occur between decompression finishing and the system/screen being restored to its previous state.

This can happen even if I have not started graphic mode, i.e. boot to level 3, then login, run pm-hibernate, then restart, and resume gets to end of decompression and does not restore the previous screen but reboots. So it does not depend on whether X was running when hibernate started. (Resume sometimes succeeds without the special flag to limit the number of threads, but it's random. It crashes and reboots more often than it succeeds.)

I am not a system programmer, but I wonder whether it's possible to insert some instruction to ensure re-synchronisation of the cpus immediately after multi-threaded decompression is complete, and before the pre-hibernate state is reinstated?

Perhaps that question just displays my ignorance?

NOTE: because this discussion is about resume not hibernate I've copied this comment from Bug #859723 to here, where I think it is more appropriate, so that others looking at this bug will see it.
Comment 8 aaronsloman 2012-10-10 20:34:15 EDT
(In reply to comment #6)
> Just before the crash, can you see "PM: Image successfully loaded" message?
> If yes, the decompression threads would have finished by then.

When decompression happens it is so fast that I don't see anything before the display changes. Now that I know what to look for I'll try again, first with maxcpus=1 to ensure successful resume, to see if I can see the message. If I can, I'll repeat without maxcpus=1 to see whether anything shows up.

> If you wish to play with kernel compilation, I can send you a small patch
> that will pause the kernel for a few seconds after each of the thaw stages
> is completed. At least it will give us a better indication where to look,
> although, if the problem really is of a multi-threaded nature, inserting
> pauses will change the experiment itself.

I have not tried compiling a kernel since several years ago, and am not sure that I'll have time soon to re-learn everything required. Anyhow, I'll report back after looking for the successfully loaded image.
Comment 9 Bojan Smojver 2012-10-10 20:37:28 EDT
(In reply to comment #7)

> I am not a system programmer, but I wonder whether it's possible to insert
> some instruction to ensure re-synchronisation of the cpus immediately after
> multi-threaded decompression is complete, and before the pre-hibernate state
> is reinstated?

There is no need for any of that. Hibernation code uses standard kernel threading API to start and stop threads. So, the kernel itself already does everything that is necessary there.

If there is a problem with multi threading code on thaw itself, it is only because I screwed something up. We'll try to find out if that is the case.
Comment 10 Bojan Smojver 2012-10-10 20:39:58 EDT
(In reply to comment #8)
 
> I have not tried compiling a kernel since several years ago, and am not sure
> that I'll have time soon to re-learn everything required. Anyhow, I'll
> report back after looking for the successfully loaded image.

OK, let me know.
Comment 11 Bojan Smojver 2012-10-10 20:46:13 EDT
Ah, forgot to mention. When number of CPUs is 1, you will have hibernation/thaw code use at least 2 extra threads (apart from the main hibernation/thaw thread): one for compression/decompression and another one for calculation of the image checksum (CRC32). So, threading will still be well and truly present, just running on a single CPU.
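
To make the thread counts concrete, here is a rough sketch of the arithmetic. It assumes the decompression thread count is the number of online CPUs minus one, limited to the range 1..3. The cap of 3 matches the "3 threads" reported on the 4-CPU machines above, but the "minus one" starting point is an assumption about swap.c and is not stated in this bug. The function names are illustrative only.

```shell
#!/bin/sh
# Assumed cap on decompression threads (matches the 3 threads seen on 4 CPUs).
LZO_THREADS=3

# Decompression threads for a given number of online CPUs.
# ASSUMPTION: starts from (CPUs - 1), then is clamped to 1..LZO_THREADS.
decompress_threads() {
    n=$(( $1 - 1 ))
    if [ "$n" -lt 1 ]; then n=1; fi
    if [ "$n" -gt "$LZO_THREADS" ]; then n=$LZO_THREADS; fi
    echo "$n"
}

# Extra kernel threads besides the main thaw thread:
# the decompression threads plus one CRC32 thread.
extra_threads() {
    echo $(( $(decompress_threads "$1") + 1 ))
}
```

Under these assumptions, maxcpus=1 still gives 2 extra threads (1 decompression + 1 CRC32), while a 4-CPU boot gives 4, which is consistent with the point that threading does not go away on a single CPU.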
Comment 12 aaronsloman 2012-10-10 21:19:06 EDT
(In reply to comment #10)
> (In reply to comment #8)
>  
> > I have not tried compiling a kernel since several years ago, and am not sure
> > that I'll have time soon to re-learn everything required. Anyhow, I'll
> > report back after looking for the successfully loaded image.
> 
> OK, let me know.

I've done four cycles on the laptop (E6410), running kernel 3.5.6-1.fc17.i686

Twice I ran pm-hibernate with grub.cfg altered to include maxcpus=1 to see what it looks like when resume completes successfully.

It starts off with only one penguin displayed and a lot of text scrolling too fast to read. Then there's a screen blank and decompression starts, with text at the top of the screen showing progress, indicating 1 thread.

After decompression completes, everything is so quick that I don't have time to read any text displayed before the screen is restored, but I did notice that for about half a second (or less?) after the decompression completes, the screen goes blank and some text appears, displaced to the upper right of the screen -- maybe about 10 lines, but it disappears too fast for me to count the lines or read anything. Maybe that includes your "PM: Image successfully loaded". But, if so, I can't read it.

Immediately after that the text disappears and the screen is restored to the state when I gave the pm-hibernate command. 

After doing that twice, I then tried pm-hibernate twice with standard grub.cfg, i.e. no maxcpus flag included.

The first time it resumed perfectly, though a bit faster and showing four penguins on the screen instead of only one in the first phase. Decompression indicated 3 threads and again went very fast. It also very briefly displayed a few lines displaced to the top right of the screen before restoring the screen to the pre-hibernate state. I.e. it behaved exactly as with maxcpus=1, except for indicating 3 threads in use for decompression.

So I again gave the pm-hibernate command with standard grub.cfg. Everything was as before except that I did not see the very brief display of displaced text, and resume did not complete. Instead the machine rebooted.

So it looks as if some text is displayed *very* briefly when resume succeeds, though it's too brief (and in small print) for me to read. 

When resume fails, no text is displayed when screen goes blank after decompression, and the machine reboots immediately.

Perhaps the fact that it reboots rather than freezing is significant?

If the above information does not help you decide where to look in the code,
I am willing to try running a new kernel with debugging features added, though I'll need detailed instructions. It will probably have to wait till Friday (it's 2am Thursday here in UK and I'll shortly retire.)
Comment 13 Bojan Smojver 2012-10-10 21:30:14 EDT
(In reply to comment #12)

> Perhaps the fact that it reboots rather than freezing is significant?

Not sure, to be honest.

> If the above information does not help you decide where to look in the code,
> I am willing to try running a new kernel with debugging features added,
> though I'll need detailed instructions. It will probably have to wait till
> Friday (it's 2am Thursday here in UK and I'll shortly retire.)

The debugging features will be very simple. A short sleep, that will enable you to read messages, will be added.

The easiest thing to do is to compile a vanilla kernel (because hibernation code is the same there).

So, something like this:
------------------
mkdir vanilla
cd vanilla
git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git .
cp /boot/config-3.5.6-1.fc17.x86_64 .config
make oldconfig
make -j 4
sudo make modules_install
sudo make install
------------------

Then, edit grub.cfg so that vanilla is default.

Here is the patch that will stop the kernel for 10 seconds after image has been loaded successfully:
------------------
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 3c9d764..6be4c01 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -1413,6 +1413,7 @@ end:
                pr_debug("PM: Image successfully loaded\n");
        else
                pr_debug("PM: Error %d resuming\n", error);
+       ssleep(10);
        return error;
 }
 
------------------
Comment 14 aaronsloman 2012-10-10 21:57:22 EDT
(In reply to comment #13)

> The easiest thing to do is to compile a vanilla kernel (because hibernation
> code is the same there).
> 
> So, something like this:
> ------------------
> mkdir vanilla
> cd vanilla
> git clone
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git .

OK. That's running (though I had to install git first!)

> cp /boot/config-3.5.6-1.fc17.x86_64 .config

I have fc17.i686 not x86_64. I assume the difference in .config file will suffice to ensure it compiles for my environment.
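
A minimal sketch of one way to avoid the architecture mismatch, assuming (as on Fedora) that the distribution config is installed as /boot/config-<release>, where <release> is what 'uname -r' prints. The helper name config_for_release is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the config file for a kernel release, or fail
# if no such file exists. The second argument (default /boot) is the
# directory to look in.
config_for_release() {
    cfg="${2:-/boot}/config-$1"
    [ -f "$cfg" ] || return 1
    echo "$cfg"
}

# Typical use inside the kernel source tree (not executed here):
#   cp "$(config_for_release "$(uname -r)")" .config
```

Using the running kernel's release string means the copied config matches the installed architecture (i686 here) without having to type the file name by hand.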

> make oldconfig
> make -j 4
> sudo make modules_install
> sudo make install
> ------------------
> 
> Then, edit grub.cfg so that vanilla is default.
> 
> Here is the patch that will stop the kernel for 10 seconds after image has
> been loaded successfully:
> ------------------
> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
> index 3c9d764..6be4c01 100644
> --- a/kernel/power/swap.c
> +++ b/kernel/power/swap.c
> @@ -1413,6 +1413,7 @@ end:
>                 pr_debug("PM: Image successfully loaded\n");
>         else
>                 pr_debug("PM: Error %d resuming\n", error);
> +       ssleep(10);
>         return error;
>  }
>  
> ------------------
I can probably make that change by hand.

Will report back later.
Comment 15 aaronsloman 2012-10-10 22:19:44 EDT
(In reply to comment #14)
> (In reply to comment #13)

> > make oldconfig

This gave me a large number of options -- mostly incomprehensible to me. So I simply accepted all the defaults.

> > make -j 4

Now compiling, apparently happy.
Comment 16 aaronsloman 2012-10-11 05:18:19 EDT
I had a lot of trouble with 'make modules_install'

It repeatedly aborted with an error: WARNING: could not open xxxxx 'No such file or directory'

So I've repeatedly had to give commands of the form mkdir -p xxxxx and then redo 'make modules install'.

This is still continuing, several hours after I originally started the kernel build (I was asleep much of the time).
Comment 17 aaronsloman 2012-10-11 05:27:08 EDT
(In reply to comment #16)

> So I've repeatedly had to give commands of the form mkdir -p xxxxx and then
> redo 'make modules install'.
> 
> This is still continuing, several hours after I originally started the
> kernel build (I was asleep much of the time).

I've just realised I've been wasting my time because the installation is continually failing at the same place, and my creation of new directories does not help:

  Building modules, stage 2.
  MODPOST 1899 modules
sh /usr/local/src/kernel/vanilla/arch/x86/boot/install.sh 3.6.0+ arch/x86/boot/bzImage \
        System.map "/boot"
WARNING: could not open /var/tmp/initramfs.UQrQqb/lib/modules/3.6.0+/modules.order: No such file or directory
WARNING: could not open /var/tmp/initramfs.UQrQqb/lib/modules/3.6.0+/modules.builtin: No such file or directory

If I create the directory it just produces a new error message some time later about 'No such file or directory' except that part of the path name is different each time, e.g. in this case UQrQqb

I conclude that I am trying to create a kernel (3.6.0) that my current system (3.5.6-1.fc17.i686) is not able to create. 

I don't know any way to move beyond this point. Sorry.
Comment 18 aaronsloman 2012-10-11 06:08:05 EDT
I looked in /boot and /boot/grub2 and found, to my surprise that despite the error messages a kernel had been created (/boot/vmlinuz-3.6.0+) and grub.cfg had been modified. So I have tried booting hibernating and resuming.

I can't get wireless to work, so I am testing on one machine and typing on another.

One difference is that the progress messages reporting the percentage saved and restored are printed on successive lines, instead of all being written to the same part of the screen.

The first time resume failed. The second time it succeeded. The third and fourth times resume succeeded. Each time it showed the 'PM: ... kbytes read in ... seconds' (which seems to come from hibernate.c)

I did not get Image successfully loaded, but that may be because there is no sleep command after the print command and it may have disappeared too fast.

I also did not get Error ... resuming when it failed to resume.

So far I have had one failed resume and several successful resumes. I have had that pattern in the past -- but it isn't usable, because the failed resumes are unpredictable. I'll try a few more times and see if I can get any more information from the screen before the resume fails.
Comment 19 aaronsloman 2012-10-11 06:31:46 EDT
Just noticed an error in the log in the text screen after I run 'startx', namely FATAL: Module i915 not found.

This is presumably connected with the failure of 'make modules install'.

The 'du' command shows /lib/modules/3.6.0+ has no modules. I am surprised it works at all!

I've had another resume failure. As before it showed the 'PM .. kbytes read' message, paused for several seconds, but did not print either

   Image successfully loaded
or
   Error ... resuming.

It did seem to flash up something starting with 'Suspending' before crashing and rebooting. But I may be hallucinating. It happened very fast.

I've now gone back to 3.5.6-1 and use of maxcpus=1.

I hope there's something useful in that information. Perhaps I can do something more to get the missing modules working, but it won't be till tomorrow, or late tonight.

Thanks for your help.
Comment 20 aaronsloman 2012-10-11 08:45:09 EDT
I've just noticed that lots of modules were successfully compiled. So the problem seems to be only that they were not installed. E.g. 

  cd vanilla/drivers/gpu
  ls -l drm/*/*.ko
prints out:
-rw-r--r-- 1 axs axs  4985101 Oct 11 03:35 drm/gma500/gma500_gfx.ko
-rw-r--r-- 1 axs axs   378525 Oct 11 03:35 drm/i2c/ch7006.ko
-rw-r--r-- 1 axs axs   202134 Oct 11 03:35 drm/i2c/sil164.ko
-rw-r--r-- 1 axs axs  9402129 Oct 11 03:35 drm/i915/i915.ko
-rw-r--r-- 1 axs axs 31645453 Oct 11 03:35 drm/nouveau/nouveau.ko
-rw-r--r-- 1 axs axs 15460304 Oct 11 03:35 drm/radeon/radeon.ko
-rw-r--r-- 1 axs axs  1367285 Oct 11 03:35 drm/ttm/ttm.ko
-rw-r--r-- 1 axs axs  1260741 Oct 11 03:35 drm/via/via.ko
-rw-r--r-- 1 axs axs  3322152 Oct 11 03:35 drm/vmwgfx/vmwgfx.ko

So it looks as if there's only a bug in the code for installing the modules in /lib/modules

However, I can't tell if this could affect any of the evidence about resume failing.

Perhaps it confirms that the failure has nothing to do with i915, because i915.ko had not been installed and was not in use, and the symptom was as before, namely resume after hibernate sometimes succeeds and sometimes fails, immediately after decompression is complete.
Comment 21 aaronsloman 2012-10-11 16:06:22 EDT
Arrrgh... I have just noticed that I did not see the underscore in

   make modules_install

I typed the command with a space, instead of using copy and paste, which I should have done!
Comes of working half asleep, much too late.

I'll re-try and report back
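
A quick sanity check along these lines would have caught the problem earlier. It is only a sketch: it assumes that a successful 'make modules_install' leaves a modules.order file and at least one .ko file under /lib/modules/<version>, and the helper name modules_installed is made up.

```shell
#!/bin/sh
# Hypothetical check: succeeds only if the module tree for the given kernel
# version looks populated. The second argument (default /lib/modules) is the
# directory that holds the per-version module trees.
modules_installed() {
    dir="${2:-/lib/modules}/$1"
    # modules.order is expected to be copied in by 'make modules_install'.
    [ -f "$dir/modules.order" ] || return 1
    # At least one kernel module file must be present somewhere below.
    find "$dir" -name '*.ko' | grep -q .
}

# Example (not executed here):
#   modules_installed "3.6.0+" || echo "modules missing for 3.6.0+"
```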
Comment 22 aaronsloman 2012-10-11 17:55:06 EDT
Apologies for previous errors. I now have all the modules in place and can run 3.6.0+ with everything working normally, including wireless, and no more complaints from X.org about Module i915 not found. I did get one new problem, namely warning messages about my SD card which I leave in the drive and use for backup when travelling. The message was: "mmc0: Timeout waiting for hardware interrupt", which I don't get with earlier kernels. I removed the card to make sure it was not causing problems, and continued testing the new kernel (3.6.0+).

I learnt nothing new, unfortunately. As before it sometimes resumes successfully after pm-hibernate and sometimes fails to resume, and instead does an automatic reboot. This can happen both when I hibernate in non-graphic mode (runlevel 3, no X) and in graphic mode (after invoking 'startx', with the ctwm window manager).

One difference between 3.5.6-1 and 3.6.0+ is that when 3.6.0+ resumes I don't get text shifted right on screen just after decompression finishes.

Also in both cases I do not get "PM: Image successfully loaded", unless it is printed for such a short time that I don't detect it. However I do get 

"Image loading done" followed by something like "PM: read ... kbytes in ... seconds ... MB/s"
(usually about 1.3 seconds).

After that has been visible, I've noticed something else. Just before the display changes back to the pre-hibernate state, something else flashes up below the printout but it is very short. I *think* it's something like 'Suspending console..' but it disappears so quickly that I find it very hard to be sure. I suspect that's part of the saved screen contents, from just after pm-hibernate starts.

So the random failure to resume remains a complete mystery. I also can't understand why I don't see the 'Image successfully loaded' message after 'Image loading done', even when resume succeeds.

So for now I've gone back to kernel 3.5.6-1.fc17.i686 and using 'maxcpus=1' for resume from hibernate.

If you think there's more diagnostic code that could go into swap.c or one of the related files, I now know how to rebuild the kernel after the change.
Comment 23 Bojan Smojver 2012-10-11 21:59:15 EDT
(In reply to comment #22)
 
> Also in both cases I do not get "PM: Image successfully loaded", unless it
> is printed for such a short time that I don't detect it.

OK. Could you please insert this line, just before that ssleep(10) that I gave you in the patch:
-------------------
printk(KERN_ERR "PM: Out of load image.\n");
-------------------

This should be followed by that ssleep(10) - meaning - you will have 10 seconds to read this. (I'm surprised that kernel didn't stop for you after printing 100% for 10 seconds already - or did it?)

What this will do is print this message _every_ time the image load function completes. This should be the case every time.

Here is another thing you can try:

In line 1087 of swap.c, there is a line like this:
-------------------
nr_threads = clamp_val(nr_threads, 1, LZO_THREADS);
-------------------

Please change it to:
-------------------
nr_threads = 3;
-------------------

This will ensure that 3 threads are used for decompression, even if number of CPUs is 1. Then, run this with maxcpus=1. You should see 3 threads used for decompression.

If there is no crash, I'm guessing threading itself has nothing to do with the crash. Instead, executing multiple threads on multiple CPUs does.
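
The arithmetic behind that clamp_val line can be sketched in shell. This is only an illustration, under two assumptions that are not confirmed in this thread: that the thread count starts as the number of online CPUs minus one, and that LZO_THREADS is 3. The function name lzo_threads is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: mimic nr_threads = clamp_val(ncpus - 1, 1, LZO_THREADS).
# LZO_THREADS=3 and the "ncpus - 1" starting value are assumptions about swap.c.
LZO_THREADS=3

lzo_threads() {
    ncpus=$1
    n=$((ncpus - 1))
    # Clamp the result to the range [1, LZO_THREADS].
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    if [ "$n" -gt "$LZO_THREADS" ]; then
        n=$LZO_THREADS
    fi
    echo "$n"
}
```

Under those assumptions, booting with maxcpus=1 yields a single decompression thread, which is why the value has to be hard-coded to 3 in order to exercise several threads on one CPU.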

Oh, and most importantly, thank you for testing!

PS. If you would like to test against 3.6.1 (i.e. the real stable), you can do this: "git checkout -t -b linux-3.6.y origin/linux-3.6.y". And then recompile.
Comment 24 aaronsloman 2012-10-11 22:54:25 EDT
(In reply to comment #23)
> (In reply to comment #22)
>  
> > Also in both cases I do not get "PM: Image successfully loaded", unless it
> > is printed for such a short time that I don't detect it.
> 
> OK. Could you please insert this line, just before that ssleep(10) that I
> gave you in the patch:
> -------------------
> printk(KERN_ERR "PM: Out of load image.\n");
> -------------------
> 
> This should be followed by that ssleep(10) - meaning - you will have 10
> seconds to read this. (I'm surprised that kernel didn't stop for you after
> printing 100% for 10 seconds already - or did it?)

It did pause, but previously without the expected printing.

I thought I would first try your first new change on its own. I now find that printk works and I get a message printed, whereas pr_debug does not print anything.

Is pr_debug controlled by some global flag that needs to be turned on? I can't find its definition.

Anyhow, I inserted the printk command in two places, just before

  pr_debug("PM: Image successfully loaded \n");

and also where you suggested, just before ssleep(10). Both printed OK, but in both cases resume worked.

I thought I should let you know now.

I'll now experiment some more, and if resume crashes will see what is printed.

I'll later try your other change, probably after getting some sleep.
Comment 25 Bojan Smojver 2012-10-11 23:06:05 EDT
(In reply to comment #24)

> Is pr_debug controlled by some global flag that needs to be turned on? I
> can't find its definition.

You may want to put this in /etc/pm/sleep.d directory (you can name it 00verbosity if you like):
-------------------
#!/bin/sh
case "$1" in
	hibernate)
		read cur def min ker < /proc/sys/kernel/printk
		if [ $cur -lt $ker ]; then
			echo $ker > /proc/sys/kernel/printk
		fi
		;;
	thaw)
		read cur def min ker < /proc/sys/kernel/printk
		if [ $cur -gt $def ]; then
			echo $def > /proc/sys/kernel/printk
		fi
		;;
	*)
esac
-------------------

It should make sure your printk prints out debug messages.

BTW, if you saw a pause consistently, this means that decompression threads already finished and the image has been restored to memory.
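
For reference, the four numbers that the script above reads as cur, def, min and ker are, in order, the current console loglevel, the default level for messages without an explicit level, the minimum console loglevel, and the boot-time default console loglevel. A minimal sketch that prints them by name; show_printk_levels is a made-up helper, and its optional file argument exists only so that it can be tried on a copy of the file.

```shell
#!/bin/sh
# Hypothetical helper: print the four values of /proc/sys/kernel/printk
# together with their documented names.
show_printk_levels() {
    file=${1:-/proc/sys/kernel/printk}
    # The file holds four whitespace-separated integers on one line.
    read -r console default_msg min_console default_console < "$file"
    echo "console_loglevel=$console"
    echo "default_message_loglevel=$default_msg"
    echo "minimum_console_loglevel=$min_console"
    echo "default_console_loglevel=$default_console"
}
```

Running show_printk_levels with no argument reads the live values.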
Comment 26 aaronsloman 2012-10-11 23:17:37 EDT
(In reply to comment #24)

> Anyhow, I inserted the printk command in two places, just before
>
>   pr_debug("PM: Image successfully loaded \n");
>
> and also where you suggested, just before ssleep(10). Both printed OK, but
> in both cases resume worked.

I tried a few more times and eventually resume crashed after printing this twice:
PM: Out of load image

So that answers your question about where it gets to before crashing.

(In reply to comment #24)
> It should make sure your printk prints out debug messages.

Printk is working fine. It's pr_debug that failed to print.

> BTW, if you saw a pause consistently, this means that decompression threads
> already finished and the image has been restored to memory.

Well now it's confirmed that that happens before the resume crash.

I'll sleep now and later try changing the decompression, as suggested, though if the crash is happening after decompression and after a pause, then presumably changing the decompression should not make any difference.

Anyhow must sleep now.


Thanks.
Comment 27 Bojan Smojver 2012-10-11 23:25:07 EDT
Thank you once again.

Yes, there is probably no point testing multiple decompression threads with single CPU, now that we know that decompression runs fine. The crash is probably somewhere else.

PS. I think pr_debug() actually uses printk() to print stuff out. It's just a macro. And, if printk config says to not print debugging stuff to the console, you won't see it.
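
To illustrate the console part of this: a message reaches the console only when its level number is lower than the current console loglevel. pr_debug() messages are logged at KERN_DEBUG (level 7), while the printk(KERN_ERR ...) suggested earlier is level 3. A small sketch of that comparison; the function name reaches_console is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: would a message of the given level be shown on a
# console whose loglevel is the second argument?
reaches_console() {
    msg_level=$1
    console_level=$2
    # The kernel prints messages whose level is numerically below the
    # console loglevel.
    if [ "$msg_level" -lt "$console_level" ]; then
        echo yes
    else
        echo no
    fi
}
```

So with the console loglevel at 7, KERN_DEBUG output still stays off the console; it would need 8, for example via the loglevel=8 boot parameter. That only helps if the pr_debug() call was compiled in at all.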

Oh, and have a good sleep! :-)
Comment 28 aaronsloman 2012-10-12 07:00:55 EDT
(In reply to comment #27)
> Yes, there is probably no point testing multiple decompression threads with
> single CPU, now that we know that decompression runs fine. The crash is
> probably somewhere else.

I tried anyway, with nr_threads = 3; as you suggested.

and I reduced the delay after printk to 5 seconds to save time!

(printk works but pr_debug doesn't. There must be a global switch for pr_debug somewhere, but since I can use printk I'll just ignore pr_debug. I did not need to change anything to make printk work.)

Results: 

Tried pm-hibernate with grub set to maxcpus=1 for resume only, and it resumed successfully about 5 times while reporting the use of 3 threads for decompression.

Unfortunately that's not conclusive because crashing is random anyway.

Then tried again WITHOUT maxcpus=1 and it still resumed successfully several times, but crashed on the 5th attempt in the usual place -- immediately after decompression. (maxcpus=1 seems not to significantly affect the time for decompression.)

[DIGRESSION I wish pm-hibernate had a -r option like tuxonice hibernate to request immediate reboot after hibernate. That would save me having to use the power button to reboot each time when testing.]

All that is inconclusive but I've now decided to try the following:
Set nr_threads = 1;

and repeat the above experiment. If it still crashes without maxcpus=1 that will finally exonerate multi-threaded decompression.

I'll report later.

> PS. I think pr_debug() actually uses printk() to print stuff out. It's just
> a macro. And, if printk config says to not print debugging stuff to the
> console, you won't see it.

I suspect, as indicated above, that in addition to using printk pr_debug must be turned on or off by a global flag set somewhere, which is off by default.
But I don't need to find how to turn it on. (Perhaps in .config ?)
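
For what it's worth, pr_debug() is normally compiled to nothing unless the source file defines DEBUG or the kernel is built with CONFIG_DYNAMIC_DEBUG, so .config is a reasonable place to look. A sketch of that check, assuming the build configuration is installed as /boot/config-$(uname -r), as on Fedora; config_enabled is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: report whether a kernel config option is set to "y".
# The default path is an assumption about where the distribution installs
# the kernel build configuration.
config_enabled() {
    option=$1
    file=${2:-/boot/config-$(uname -r)}
    if grep -q "^${option}=y" "$file"; then
        echo yes
    else
        echo no
    fi
}
```

If config_enabled CONFIG_DYNAMIC_DEBUG prints yes, the individual pr_debug() call sites are still off by default and have to be enabled through the dynamic_debug/control file in debugfs.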

> Oh, and have a good sleep! :-)

Done!
Comment 29 aaronsloman 2012-10-12 08:51:45 EDT
(In reply to comment #28)
I wrote:
> ...
> All that is inconclusive but I've now decided to try the following:
> Set nr_threads = 1;
> 
> and repeat the above experiment. If it still crashes without maxcpus=1 that
> will finally exonerate multi-threaded decompression.

Done.
Recompiled kernel with nr_threads = 1; in swap.c
(in load_image_lzo line 1087)

First tested with use of 'maxcpus=1' for resume from hibernate.
Behaviour unchanged: resumed successfully a couple of times. 

Then tested without use of 'maxcpus=1' for resume from hibernate:
resumed successfully twice and crashed twice immediately after decompressing the saved image, while reporting the use of 1 thread.

This proves conclusively that, contrary to my earlier suspicion, the resume crash is not related to multi-threaded decompression. It must occur somewhere after completion of swsusp_read.

So I'll insert a collection of additional print commands in hibernate.c after call of swsusp_read and try again.
Comment 30 aaronsloman 2012-10-12 10:46:58 EDT
Extra print commands have led me to the call of suspend_console inside 
kernel_kexec in kernel/kexec.c

It goes into suspend_console (defined in printk.c around line 1857), which prints out "Suspending console(s) (use no_console_suspend to debug)"

The comment above states: 
    "This disables printk() while we go into suspend states"

Which explains why I got no further printing after that! I presume I should try 'no_console_suspend' in grub.cfg
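
Since the flag has to be present in the grub entry that is booted for resume, it can be worth checking which kernel lines in grub.cfg actually carry it before hibernating. A minimal sketch, assuming the grub2 layout with the file at /boot/grub2/grub.cfg and kernel lines starting with "linux"; grub_lines_with is a hypothetical helper, not part of grub or pm-utils.

```shell
#!/bin/sh
# Hypothetical helper: count the kernel lines in a grub.cfg that contain a
# given boot parameter as a whole word.
grub_lines_with() {
    param=$1
    file=${2:-/boot/grub2/grub.cfg}
    # Keep only kernel lines ("linux", "linux16" or "linuxefi"), then count
    # those that carry the parameter.
    grep -E '^[[:space:]]*linux(16|efi)?[[:space:]]' "$file" | grep -cwF -- "$param"
}
```

For example, grub_lines_with no_console_suspend prints how many boot entries include the flag, and grub_lines_with maxcpus=1 does the same for the resume workaround.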

But I thought I should report getting this far, especially in view of this comment after the call of suspend_console in kernel_kexec

                suspend_console();
 
                error = dpm_suspend_start(PMSG_FREEZE);
                 if (error)
                         goto Resume_console;
                 /* At this point, dpm_suspend_start() has been called,
                  * but *not* dpm_suspend_end(). We *must* call
                  * dpm_suspend_end() now.  Otherwise, drivers for
                  * some devices (e.g. interrupt controllers) become
                  * desynchronized with the actual state of the
                  * hardware at resume time, and evil weirdness ensues.
                  */
                 error = dpm_suspend_end(PMSG_FREEZE);

It looks as if the evil weirdness is what I have been chasing, and welcome any suggestions as to what to try next to pin it down.

Meanwhile I'll try to re-enable printing and see if that shows up any more clues.
Comment 31 aaronsloman 2012-10-12 11:10:20 EDT
OK some progress with no_console_suspend

Newly enabled printing showed exit from 

     suspend_console
     dpm_suspend_start
     pm_restrict_gfp_mask
       (invoked in hibernation_restore in kernel/power/hibernate.c)

After that it crashes. I'll see if I can add more print commands to pin it down more closely.

Suggestions welcome!
Comment 32 aaronsloman 2012-10-12 20:07:47 EDT
yum update has now installed kernel 3.6.1-1.fc17.i686. I have left "no_console_suspend" in the grub entry, and have run pm-hibernate several times in non-graphical mode and in graphical mode, without maxcpus=1 and resume has not yet crashed. But it was always random.
Comment 33 aaronsloman 2012-10-12 20:31:06 EDT
The next attempt at hibernate/resume produced a resume crash and automatic reboot.

So I've reverted to using maxcpus=1 for resume.

I wonder if it is possible to make this unnecessary by changing the code invoked at the last stage of resume, after decompression, during image restore, to use something like the use of this in swap.c
  nr_threads = 1;

described in comment #29. In that location it did not prevent resume crashing after decompression. But there must be some other portion of code where a restriction to 1 thread would prevent failure because resuming with 'maxcpus=1' prevents it. If that restriction were available in the last stage of resume code, then users would not have to edit grub.conf before using pm-hibernate, as explained in the bug description above.

I don't know enough to work out which portion of resume code to edit. The evidence reported in comment #31 suggests that it is code invoked some time after pm_restrict_gfp_mask

If someone can tell me which bits of code to change, and how to change them, I'll try that using the kernel source for 3.6.0+ which I used in tests suggested by  Bojan Smojver above.
Comment 34 Bojan Smojver 2012-10-12 20:59:35 EDT
On the subject of testing automation, if you echo reboot to /sys/power/disk before running pm-hibernate, the system will reboot instead of shutting down.
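
A small sketch of how one might check that the reboot mode is offered before relying on it. It assumes the usual format of /sys/power/disk, a single line of mode names with the active one in square brackets; hibernate_mode_supported is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: report whether a hibernation mode is listed in
# /sys/power/disk (or in a copy of it passed as the second argument).
hibernate_mode_supported() {
    mode=$1
    file=${2:-/sys/power/disk}
    # Strip the brackets around the active mode, put one mode per line,
    # and look for an exact match.
    if sed 's/[][]//g' "$file" | tr -s ' ' '\n' | grep -qxF -- "$mode"; then
        echo yes
    else
        echo no
    fi
}
```

If it prints yes for reboot, then as root: echo reboot > /sys/power/disk, followed by pm-hibernate.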
Comment 35 Bojan Smojver 2012-10-12 21:06:09 EDT
(In reply to comment #33)
 
> But there must be some other portion of code
> where a restriction to 1 thread would prevent failure because resuming with
> 'maxcpus=1' prevents it.

Setting maxcpus to 1 does not generally make things single threaded in the kernel (it is just compression/decompression hibernation code that uses number of CPUs as an indicator of how many threads should be used). This setting affects how many CPUs will be used, so I would be looking at bugs that may be affected by that. Locking and driver issues on thaw come to mind first.
Comment 36 aaronsloman 2012-10-12 21:33:45 EDT
(In reply to comment #35)

> Setting maxcpus to 1 does not generally make things single threaded in the
> kernel (it is just compression/decompression hibernation code that uses
> number of CPUs as an indicator of how many threads should be used). This
> setting affects how many CPUs will be used, so I would be looking at bugs
> that may be affected by that. Locking and driver issues on thaw come to mind
> first.

Thanks for the clarification.

Setting maxcpus=1 in grub.cfg (just for resume) appears to make resume totally reliable on the two different machines I have been using (Dell latitude E6410, and Viglen genie desktop cpu, both with intel core i5 cpus and intel graphics, but different motherboards, etc.)

So if there's a way for hibernate/resume code after decompression is complete to do whatever maxcpus=1 does, it should stop the last part of resume from crashing -- at least as a stop-gap that's more convenient than using two copies of grub.cfg? But maybe I don't understand how boot flags work.

The effect of maxcpus=1 seems to be temporary: i.e. after the resume has completed the number of cpus used seems to revert to the original boot value.
E.g. the next call of pm-hibernate apparently goes back to using 3 cpus for compression on my machines. (It is blindingly fast!)

===
Re comment #34: thanks for the suggestion to echo reboot to /sys/power/disk before running pm-hibernate. I'll try that when I next need to do repeated tests. (It won't be fully automatic as I have a boot password set.)
Comment 37 Bojan Smojver 2012-10-13 00:51:01 EDT
(In reply to comment #36)

> The effect of maxcpus=1 seems to be temporary: i.e. after the resume has
> completed the number of cpus used seems to revert to the original boot value.
> E.g. the next call of pm-hibernate apparently goes back to using 3 cpus for
> compression on my machines. (It is blindingly fast!)

That is because the kernel that is thawing the image actually gets replaced by the kernel from the image. And that is the kernel that had all CPUs enabled.
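
One way to see this after a resume is to count the CPUs the restored kernel reports as online. A minimal sketch, assuming the sysfs file /sys/devices/system/cpu/online, which holds a comma-separated list of ranges such as "0-3"; count_online_cpus is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count the CPUs listed in the sysfs "online" file
# (or in a copy of it passed as the first argument).
count_online_cpus() {
    file=${1:-/sys/devices/system/cpu/online}
    total=0
    # Each entry is either a single CPU number ("0") or a range ("0-3").
    for range in $(tr ',' ' ' < "$file"); do
        first=${range%-*}
        last=${range#*-}
        total=$((total + last - first + 1))
    done
    echo "$total"
}
```

If the explanation above holds, this should print the full CPU count after a resume that was booted with maxcpus=1.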
Comment 38 aaronsloman 2012-10-13 06:20:39 EDT
(In reply to comment #37)

> That is because the kernel that is thawing the image actually gets replaced
> by the kernel from the image. And that is the kernel that had all CPUs
> enabled.

Thanks for confirming what I suspected. So if someone (you??) can suggest a change to the resume code that will switch to using only one cpu after decompression has completed then perhaps the later crash will be prevented, but without affecting the multiprocessing capability of the restored system.

Is there some system call that could be invoked just after reading in the saved image that will impose a limit of 1 cpu on subsequent processing?

I have not tried to work out whether your use of nr_threads in the decompression code in swap.c could be transplanted.

Perhaps that's a nonsensical idea  based on my not understanding the system?

If it can be implemented, that would be much better than requiring users to mess around with maxcpus in grub.cfg as I have been doing -- with over 140 successful resumes from hibernate -- on a PC that last had a full boot on 8th June, using kernel 3.3.7-1.fc16.i68. Previously resume regularly failed randomly, producing a full reboot.
Comment 39 Bojan Smojver 2012-10-13 07:03:07 EDT
(In reply to comment #38)
 
> So if someone (you??) can suggest a
> change to the resume code that will switch to using only one cpu after
> decompression has completed then perhaps the later crash will be prevented,
> but without affecting the multiprocessing capability of the restored
> system.

Based on my previous communication with kernel developers, I doubt that any such workaround would be accepted. Linux kernel is a fully SMP capable system and any bugs that cause the system to crash when multiple CPUs are used will be fixed only the proper way. Which means, by finding which part of the code is not playing well when more than one CPU is in action and by fixing that code.

Now, which code that is, I cannot tell you. It could be part of hibernation code (including my own, of course), it could be some driver, it could be something else.

One way to troubleshoot is to bisect the kernel (this is a very, very long and involved process) until a commit that broke it is found. Of course, this commit may be a major change in some subsystem, containing many lines, which would then make isolation again difficult.
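
To give a rough sense of "very long": each bisection step means building, installing and testing a kernel, and the number of steps grows roughly as the base-2 logarithm of the number of commits between the good and the bad version. A sketch of that estimate; bisect_steps is a made-up helper and the commit count used below is purely illustrative.

```shell
#!/bin/sh
# Hypothetical helper: estimate how many halving steps are needed to narrow
# a range of N commits down to a single one.
bisect_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        # Each step discards about half of the remaining candidates.
        n=$(( (n + 1) / 2 ))
        steps=$((steps + 1))
    done
    echo "$steps"
}
```

For a hypothetical range of 10000 commits, bisect_steps 10000 gives 14 kernel builds, and with a crash that only shows up randomly each of those builds would need several hibernate/resume cycles before it could be called good.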

I know - it is a difficult problem and no silver bullet. :-(
Comment 40 aaronsloman 2012-10-13 07:26:11 EDT
(In reply to comment #39)

> Based on my previous communication with kernel developers, I doubt that any
> such workaround would be accepted. Linux kernel is a fully SMP capable
> system and any bugs that cause the system to crash when multiple CPUs are
> used will be fixed only the proper way. Which means, by finding which part
> of the code is not playing well when more than one CPU is in action and by
> fixing that code.

I understand. I should have made clear that I was thinking of a temporary fix to help the people who now can't use hibernate because resume crashes. It could be controlled by a boot flag which has no effect during normal boot (compare no_console_suspend?).
 
Of course, if kernel developers have some principled opposition to temporary fixes (e.g. in case bad features are built on them) then I'll just have to go on switching between two versions of grub.cfg, one for boot and one for resume, which has worked OK for me for several months. I could go on using that indefinitely. But most people having resume problems surmountable in this way won't know about that method and won't find out, unless they are expert google users!

Anyhow, thanks for your help in refuting my conjecture about multi-threading during decompression being the cause. I hope narrowing the options will help someone produce a proper fix one day.
Comment 41 Bojan Smojver 2012-10-14 20:43:35 EDT
You may want to have a look at these two bugs:

https://bugzilla.kernel.org/show_bug.cgi?id=37142
https://bugs.freedesktop.org/show_bug.cgi?id=28813

As you'll see from the first one, some folks (like myself) did not encounter any problems after the commit mentioned there. However, for others, problems persisted even after that.
Comment 42 aaronsloman 2012-10-14 22:41:44 EDT
(In reply to comment #41)
> You may want to have a look at these two bugs:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=37142
> https://bugs.freedesktop.org/show_bug.cgi?id=28813

Interesting. Several people report file corruption. I did have file corruption for a while, caused by grubby copying erroneous information from /etc/fstab when a new kernel was installed. But after I removed the error that went away.
So although I've had resume crashing in many tests no files have been found to be corrupted.

I also see that some people find that tuxonice works. I used to use tuxonice for several years as it was faster and more convenient to use than the standard pm-hibernate. But it stopped working for me on the Dell E6410 after Fedora 13: it would not hibernate.

I'll try tuxonice again, and see whether I can resume reliably without maxcpus=1
Comment 43 aaronsloman 2012-10-15 17:42:11 EDT
Tried tuxonice and had many problems, including not recognising the UUID label for /dev/sda3 swap partition, very slow boot, and never getting resume to complete. So I've given up and gone back to maxcpus=1 for resume and this kernel:
   3.6.1-1.fc17.i686 #1 SMP Wed Oct 10 12:56:16 UTC 2012

If anyone reading this either has tuxonice with F16 or F17 working on Dell Latitude E6410 (with intel graphics) or has pm-hibernate/resume working perfectly without maxcpus=1 for resume, I'd like to hear from you.

I'd also like to hear if you have wicd working: it stopped for me on F17 (will not remember settings for wireless services) so I am stuck with NetworkManager and its terrible, terrible user interface. (Must find out where to file a bug report for Wicd, which is so much better -- when it works.)

Thanks
a.sloman AT cs.bham.ac.uk
Comment 44 aaronsloman 2012-10-18 17:16:11 EDT
Continued trying tuxonice, with latest kernel available from Matthias Hensler's web site http://mhensler.de/swsusp/download_en.php, namely
3.6.1-1_1.cubbi_tuxonice.fc17.i686 #1 SMP Thu Oct 18 12:19:53 CEST 2012
 
I could not get it to boot until I changed the grub entry to use root=/dev/sda14 instead of the UUID format. I also had to change /etc/fstab to use /dev/sda3 to identify my swap area. After that it booted OK, but seemed to have trouble connecting to my SD card, cured (temporarily) by removing and reinserting the card -- used only for backup when travelling, and only mounted when wanted.

Having got it to boot I tried using the tuxonice hibernate utility

    hibernate-tuxonice-2.0-7.cubbi1.noarch

Hibernate worked fine, but resume did not complete. It froze before decompression, requiring use of power button. So I tried setting
  UseTuxOnIce no

in
  /etc/hibernate/tuxonice.conf

It hibernated OK but without the nice tuxonice user interface (text-based progress bar). Likewise resume worked without the progress bar, but to my surprise completed successfully, without requiring maxcpus=1 in grub.cfg.

It has now resumed successfully in that mode without maxcpus=1 about 10 times. Just as a sanity check I tried the standard kernel (3.6.1-1.fc17.i686) and instead of pm-hibernate ran the tuxonice hibernate with "UseTuxOnIce no". Resume crashed first time (after 100% decompression), forcing a reboot.

Summary

    standard kernel
        hibernates OK
        crashes on normal resume (after 100% decompression)
        resumes OK if grub.cfg includes maxcpus=1 for resume
        crashes on resume if I run hibernate with "UseTuxOnIce no"
            and no maxcpus flag

    tuxonice kernel
        hibernates OK, using hibernate
        crashes on resume (freezes requiring use of power button)
        resumes OK if I run hibernate with "UseTuxOnIce no"
            and no maxcpus flag

A strange mixture. I hope someone can make sense of this.

The other (pleasant) surprise is that if I use the tuxonice kernel and
hibernate (with "UseTuxOnIce no"), ssh connections to other machines (via
wireless connection to router) remain open for use after resume.

I have no idea whether that involves a security risk.
Comment 45 aaronsloman 2012-10-18 17:17:57 EDT
Forgot to say that I had reported all the above to Matthias Hensler, who will pass on relevant information to tuxonice developer.
Comment 46 aaronsloman 2012-10-20 13:09:46 EDT
(In reply to comment #45)
> Forgot to say that I had reported all the above to Matthias Hensler, who
> will pass on relevant information to tuxonice developer.
That should have been "developers"

There has been some very useful progress. After a few rounds of testing, Matthias has now produced:

(1) a new kernel  3.6.2-4_1.cubbi_tuxonice.fc17.i686, available on his download site:
http://mhensler.de/swsusp/download_en.php (email  him if you need the 64 bit version),
and more importantly

(2) a new version of dracut-tuxonice, which modifies the standard dracut for use with tuxonice kernels:

 http://mhensler.de/swsusp/download/dracut-tuxonice-002-17.fc17.cubbi1.noarch.rpm

Tuxonice kernels installed after that version of dracut-tuxonice will not exhibit the bug that prevented me booting with the UUID format for root device in grub.cfg. I no longer have to edit grub.cfg to use root=/dev/sda14 
For an already installed kernel use 'dracut --force' to fix the booting problem.

The resume-freeze bug remains, but is avoided (for me), as explained in Comment #44, by using neither 'pm-hibernate' (from pm-utils) nor the default configuration for 'hibernate' (the tuxonice utility): instead, I invoke 'hibernate' but with 'UseTuxOnIce no' in
  /etc/hibernate/tuxonice.conf

Exactly why that allows resume/thaw from hibernate to succeed, while all the other options fail, on my Dell E6410, remains unexplained, but perhaps it will give someone a clue about how to fix the resume problem more generally.

As far as I can tell, both hibernate and resume invoked in that manner use mostly the same code as 'pm-hibernate' would, including using multiple threads for both image compression and inflation (the printout is identical) but without crashing after expansion, and without requiring 'maxcpus=1' for resume/thaw to succeed. Presumably the tuxonice 'hibernate' command interacts with something in the tuxonice kernel which is also in standard kernels, which pm-hibernate does not do.

So an apparently totally reliable version of hibernate+resume, without requiring grub.cfg to be edited at all is available using a strange mixture of tuxonice and standard resume code. Revised summary:

    standard F17 kernel
        hibernates OK
        crashes on normal resume (after 100% decompression completed)
        BUT resumes OK if grub.cfg includes maxcpus=1 for resume

    tuxonice kernel
        now boots OK even with root=UUID=... in grub.cfg, and hibernates OK
        resumes OK if 'hibernate' is run with "UseTuxOnIce no"
            (no maxcpus flag required)
        BUT crashes on resume with default settings for hibernate.

With this configuration, after each upgrade of tuxonice kernel, no editing of files is needed, with above setting in /etc/hibernate/tuxonice. This is far superior to my previous solution, which required two versions of grub.cfg to be created every time the kernel is updated, one for full boot and one for resume (setting maxcpus=1).

I shall update: http://www.cs.bham.ac.uk/~axs/laptop/hibernate-on-linux.html
Comment 47 aaronsloman 2012-10-20 13:35:29 EDT
The updated files are now also in Matthias Hensler's yum repository:

http://mhensler.de/swsusp/repository_en.php
Comment 48 aaronsloman 2012-10-21 13:23:30 EDT
Using 3.6.2-4_1.cubbi_tuxonice.fc17.i686, with hibernate (tuxonice) used as described above, still resuming without fail, using standard grub.cfg (over 30 successful hibernate/resume cycles and not one failure, with hibernate configured "UseTuxOnIce no").

Side issues (which may need separate bug reports):

I have noticed that I still get the warning message relating to SD card (Sandisk 8GB class 10 - formatted as ext4), previously mentioned in comment #22 when I tried kernel 3.6.0+ (compiled from source),  namely, 

   mmc0: Timeout waiting for hardware interrupt

after resume, whether the card was mounted or not, later followed by automatic detection of the card.

This did not happen with 3.5.* kernels, so (for me) it's new with 3.6, though google shows that other linux users have had similar problems.

After resume it happens every ten seconds or so, for about 2 minutes, until eventually the SD card is recognised. E.g. last few lines of /var/log/messages after resume

Oct 21 16:44:31 lape kernel: [12267.596877] mmc0: Timeout waiting for hardware interrupt.
Oct 21 16:44:32 lape kernel: [12267.674856] mmc0: new SDHC card at address 0001
Oct 21 16:44:32 lape kernel: [12267.675083] mmcblk0: mmc0:0001 00000 7.46 GiB 
Oct 21 16:44:32 lape kernel: [12267.676487]  mmcblk0: p1

However, it is not automatically remounted after being recognized after resume if it was previously mounted.

I only use the card occasionally so this doesn't affect me much.
I'll try to find time to experiment with non-tuxonice kernel 3.6.4 to see if it is specific to tuxonice.

Another old message has reappeared, apparently triggered (intermittently) by use of function keys -- e.g. using Fn+Up to alter screen brightness:

  dell_wmi: Received unknown WMI event (0x11)

But it doesn't seem to affect performance in any way: the keys behave as expected. When I get time, I'll have to experiment to see if this occurs only with tuxonice kernel, or whether it is a 3.6.* issue.
Comment 49 aaronsloman 2012-10-23 16:24:29 EDT
I have now installed kernel 3.6.2-4.fc17.i686 #1 SMP Wed Oct 17 03:22:23 UTC 2012

As with the tuxonice kernel I repeatedly get this message when using the Fn key with Up or Down keys:

    dell_wmi: Received unknown WMI event (0x11)

No evidence of problems reconnecting to SD card (mmc0:) with 3.6.2-4 non-tuxonice.

To my surprise this has resumed from pm-hibernate four times in a row, without requiring maxcpus=1 for resume.

I'll test a few more times -- this could just be the old random success/failure.
Comment 50 aaronsloman 2012-10-23 18:39:01 EDT
(In reply to comment #49)
> I have now installed kernel 3.6.2-4.fc17.i686 #1 SMP Wed Oct 17 03:22:23 UTC
> 2012
> .....
> To my surprise this has resumed from pm-hibernate four times in a row,
> without requiring maxcpus=1 for resume.
> 
> I'll test a few more times -- this could just be the old random
> success/failure.

 ... and that's what happened!
One more try succeeded and the next one failed to resume: rebooted immediately after completing decompression.

So for 3.6.2-4 i686 I had two failures and about six successful resumes.

With the corresponding tuxonice kernel, I've had about ten successful resumes and no failures.
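
How much weight a run of successes carries can be estimated under the simplifying assumption that each resume fails independently with a fixed probability. Taking the two failures in about eight attempts above as a rough failure probability of 0.25, the chance of a run of successes happening by luck is sketched below; streak_probability is a made-up helper and the independence assumption is mine.

```shell
#!/bin/sh
# Hypothetical helper: probability of RUNS consecutive successful resumes
# when each resume fails independently with probability P_FAIL.
streak_probability() {
    p_fail=$1
    runs=$2
    awk -v p="$p_fail" -v k="$runs" 'BEGIN { printf "%.4f\n", (1 - p) ^ k }'
}
```

With these figures, four successes in a row would happen by chance about a third of the time, while ten in a row would happen only about 5.6% of the time, so the tuxonice result is suggestive but still short of proof.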

So I have returned to the strange mixture of kernel
   3.6.2-4_1.cubbi_tuxonice.fc17.i686

but using the 'hibernate' (tuxonice) command with 'UseTuxOnIce no' set.
So far that combination seems totally reliable, though it is slow to recognize sd card after boot or resume.
Comment 51 aaronsloman 2012-10-25 09:05:41 EDT
Discussion continuing on TuxOnIce forum in this thread:

http://lists.tuxonice.net/pipermail/tuxonice-users/2012-October/001193.html

Click on [Thread] to see list of messages so far.
Comment 52 aaronsloman 2012-11-01 05:53:18 EDT
(In reply to comment #50)
> ...
> So I have returned to the strange mixture of kernel
>    3.6.2-4_1.cubbi_tuxonice.fc17.i686
>
> but using the 'hibernate' (tuxonice) command with 'UseTuxOnIce no' set.
> So far that combination seems totally reliable, though it is slow to
> recognize sd card after boot or resume.

Now using kernel 3.6.3-1_1.cubbi_tuxonice.fc17.i686 in the same way, i.e. use the tuxonice kernel, and the tuxonice hibernate command, but with
'UseTuxOnIce no' in /etc/hibernate/tuxonice.conf

Resume from hibernate seems to be totally reliable in this configuration, even though it is not using the tuxonice hibernate/suspend code.

In contrast 'vanilla' kernel 3.6.3-1.fc17.i686 regularly fails to resume: it nearly completes then reboots apparently after completing image
expansion.

So there is something the tuxonice kernel does right that standard kernel does wrong. I asked the tuxonice developer (Nigel Cunningham) about this and
he replied

> "TuxOnIce uses all the standard driver calls and doesn't modify the
> suspend-to-ram code. The only reason there's this difference in what
> gets frozen is that upstream haven't yet seen that it's an advantage to
> freeze those kernel threads too (and I haven't worked to make the case
> well enough). Apart from those extra threads being frozen, everything is
> vanilla."

http://lists.tuxonice.net/pipermail/tuxonice-users/2012-October/001208.html

I wonder if it is possible for someone in fedora development to look at those kernel differences (which I don't understand). From my experience (since
20th October) it makes a dramatic difference to usability of hibernate (also demonstrated earlier with 3.6.2-4_1.cubbi_tuxonice.fc17.i686).

I have not had a single resume failure in about 10 days, with the strange combination of tuxonice kernel and hibernate command configured not to use
tuxonice kernel code.
Comment 53 aaronsloman 2012-11-09 13:13:15 EST
(In reply to comment #52 -- 8 days ago)
> I have not had a single resume failure in about 10 days, with the strange
> combination of tuxonice kernel and hibernate command configured not to use
> tuxonice kernel code.

Since then I have installed and tried two new kernels
  3.6.5-1.fc17.i686
  3.6.5-1_1.cubbi_tuxonice.fc17.i686

As before resume from hibernate failed in the non-tuxonice kernel. It almost completes and then reboots.

As before resume from hibernate using the tuxonice kernel fails -- at a much earlier state, as reported on the tuxonice forum.

However, if I edit /etc/hibernate/tuxonice.conf, to set "UseTuxOnIce no", and give the hibernate command while running the tuxonice kernel, resume works perfectly, presumably because of the small change mentioned in the quote in comment #52 about freezing kernel threads.

Strange.
Comment 54 aaronsloman 2012-11-23 11:38:55 EST
This information may be useful to others having problems resuming from hibernate, though it may be specific to Dell users. My machine is a Dell Latitude E6410, and I recently discovered this Dell forum regarding suspend and hibernate problems:
http://en.community.dell.com/support-forums/laptop/f/3518/t/19351240.aspx?PageIndex=4

One of the posters wrote: "Disabling SpeedStep is a poor (but the only) workaround at the moment". Others suggested that upgrading to bios rev A12 solved the problem. However I had recently upgraded the bios and still had problems with resume from hibernate. So I tried using the bios to disable 'Intel Speed Step'.

Since then, for the last few days, using kernel 3.6.7-4.fc17.i686, resume from hibernate has always succeeded, without requiring "maxcpus=1". That's about 10 resumes, not enough to prove the problem has gone away completely but a much longer run than I had in other tests, before resume failed to complete.

Does anyone know whether there is a known intel bios bug that prevents resume from hibernate working?

Another puzzle:

I expected that disabling speed step would prevent power saving when running on battery and using low screen brightness and low cpu load. (That's what the description in the bios menu implies.) But in my tests power saving has apparently worked, with a significant reduction in battery discharge rate, as shown by

   'cat /sys/class/power_supply/BAT0/current_now'

Does anyone know what the effects of disabling speed step should be?
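
To make such comparisons easier to repeat, the reading can be converted to milliamps. This is a rough sketch: the function name battery_current_ma is hypothetical, and it relies on the kernel power_supply class reporting current_now in microamps. Some batteries expose power_now instead of current_now, in which case this will not work as written.

```shell
#!/bin/sh
# Print the battery current in mA from a power_supply sysfs file.
# The value in current_now is in microamps, so divide by 1000.
# The default path assumes the battery is named BAT0, as in the command above.
battery_current_ma() {
    file="${1:-/sys/class/power_supply/BAT0/current_now}"
    ua=$(cat "$file") || return 1
    echo $(( ua / 1000 ))
}
```

Running battery_current_ma at intervals with SpeedStep enabled and then disabled in the BIOS would give comparable figures.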


Another puzzle:

I have also noticed this in /var/log/messages, and in the output of dmesg:
 "[Firmware Bug]: ACPI: BIOS _OSI(Linux) query ignored"

I think that is new since Bios Rev A12, and I have not been able to find anything informative about it or whether it is related to the hibernate/resume bug. Does anyone know?
Comment 55 aaronsloman 2012-11-24 19:38:34 EST
(In reply to comment #54)
> .....
> Since then, for the last few days, using kernel 3.6.7-4.fc17.i686, resume
> from hibernate has always succeeded, without requiring "maxcpus=1". 

Unfortunately that did not last. Eventually the old behaviour returned: resume from hibernate fails, and machine reboots.

So I am back to setting maxcpus=1 for resume.
Comment 56 aaronsloman 2012-12-22 18:19:24 EST
Now using kernel: 3.6.10-2.fc17.i686 #1 SMP Tue Dec 11 18:33:15 UTC 2012

I still need maxcpus=1 for resume from pm-hibernate to work reliably.

Without that, it sometimes resumes OK, and sometimes crashes and reboots, always immediately after decompression has completed using 3 threads. 

It would save a lot of hassle if there were a grub.cfg setting that made maxcpus=1 active only for resume from hibernate. Perhaps there is, but I find grub2 so complex, so ill structured, and so badly documented that I have not found out how to configure it.
Comment 57 aaronsloman 2012-12-23 12:42:44 EST
Reading various bug reports I have the impression most other people now install 64 bit linux by default, whereas I don't as I have no need for it, and programs I use mostly run with smaller memory requirements in 32 bit linux.

But I wonder whether there is something that has been fixed for resume in 64 bit kernel but not 32 bit. So I thought I would ask whether anyone reading this who uses 32 bit fedora 17 plus multi-core cpu + intel graphics can successfully run hibernate/resume without requiring maxcpus=1 for resume.

In my case it seems partly random. I can get as many as 10 successful resumes in a row after pm-hibernate, and then the next one crashes and reboots, just before the display is restored, after decompression. I get that symptom with and without running X. However using maxcpus=1 for resume seems totally reliable. I've never had resume crash with that setting in grub.cfg.

If I am the only 32 bit user and the only one with the resume problem I'll have to try installing 64-bit fedora with 32 bit libraries. 

Using Dell Latitude E6410 with core i5 cpu (4 core), intel graphics, 4GB RAM. 10GB swap partition.
Comment 58 John Schmitt 2012-12-27 21:44:10 EST
aaronsloman, I believe this is _not_ fixed for x86_64.  I tried it last with 3.6.9 and I saw the same symptoms you describe.  
aaronsloman, thank you for reporting in such detail.  The insight you provided helped me.
Comment 59 aaronsloman 2013-01-05 16:20:15 EST
(In reply to comment #58)
John Schmitt
Thanks for the feedback! Do you also get resume from hibernate working perfectly if you use "maxcpus=1" as a boot parameter when resuming?

If so, that would help to confirm that the problem has something to do with use of multiple threads during the final restore stage, i.e. just after image expansion, in both 32 bit and 64 bit linux.

I am now using kernel 3.6.10-2.fc17.i686 on my Dell E6410, and it still does not resume reliably after pm-hibernate, unless I use the maxcpus=1 option for resume --  as described in http://www.cs.bham.ac.uk/~axs/laptop/hibernate-on-linux.html#workaround
Comment 60 aaronsloman 2013-01-12 08:50:30 EST
I have just updated to kernel 3.6.11-1.fc17.i686
Resume from pm-hibernate still works only intermittently, so I still have to use maxcpus=1 for resume.
Comment 61 aaronsloman 2013-01-26 11:35:40 EST
Updated to fedora-release-17-2 and kernel-3.7.3-101.fc17.i686

The problem remains: pm-hibernate + resume worked twice then on the third resume it crashed and rebooted.

So I have gone back to setting maxcpus=1 for resume.
Comment 62 aaronsloman 2013-02-16 18:11:52 EST
Updated to kernel-3.7.6-102.fc17.i686 #1 SMP Mon Feb 4 17:52:09 UTC 2013


The problem remains: pm-hibernate + resume worked once then on the second resume it crashed and rebooted, just after expanding image file.

I have gone back to setting maxcpus=1 for resume.
Comment 63 aaronsloman 2013-03-24 21:07:53 EDT
Updated to kernel 3.8.3-103.fc17.i686 #1 SMP Mon Mar 18 15:57:42 UTC 2013

Problem as before. Can't resume from pm-hibernate unless I use maxcpus=1 for resume in grub.cfg.

The setting that works for resume (not needed for initial boot) does not noticeably slow down resume. It only seems to affect the decompression of the saved image, which is very fast anyway, and is a small proportion of total resume time from power on.
Comment 64 aaronsloman 2013-05-28 13:50:43 EDT
Updated to kernel 3.8.13-100.fc17.i686 #1 SMP Mon May 13 13:51:09 UTC 2013

Problem as before. Tried to resume from pm-hibernate. It worked three times, then resume failed to complete and rebooted.

So back to using edited grub.cfg file with maxcpus=1 for resume. This has worked perfectly for hundreds of resumes, for many months, without a crash (both on my dell laptop running fedora17 and also on a desktop PC still using Fedora 16, both with Intel Core i5 cpus with integrated graphics). 

It works, but it is a nuisance having to change grub.cfg before hibernating to add the extra flag for resume. (I do this using a shell script).
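
The kind of edit involved can be sketched as follows. This is a minimal illustration, not the actual script used here: the function name add_maxcpus is made up, and it assumes a BIOS-style grub.cfg in which kernel lines begin with "linux" followed by whitespace (EFI configs that use "linuxefi" would need a different pattern). It writes the modified text to standard output and does not touch the original file.

```shell
#!/bin/sh
# Append maxcpus=1 to every kernel ("linux ...") line of a grub.cfg-style
# file, skipping lines that already carry a maxcpus= setting.
# The result goes to stdout; the input file is left unchanged.
add_maxcpus() {
    sed '/^[[:space:]]*linux[[:space:]]/{
/maxcpus=/!s/$/ maxcpus=1/
}' "$1"
}
```

For example, add_maxcpus /boot/grub2/grub.cfg > /tmp/grub.cfg.resume produces a candidate file that can be inspected before it is copied into place ahead of pm-hibernate. The original grub.cfg should be kept so it can be restored after a successful resume.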

It seems that the bug that causes resume to fail (randomly) just after expanding saved image using multi-threading and just before restoring graphic state can't be fixed. So it would be nice to have a boot flag to turn off multi-threading only when resuming from pm-hibernate, i.e. a version of 'maxcpus=1' that works only for resume. 

This has worked perfectly for me, and it does not seem to slow down resume noticeably.

If anyone wants to investigate this, Comment #35 by Bojan Smojver may be a good place to start.

I am using a 32 bit kernel. Comment #58 by John Schmitt indicates that the problem is also in x86_64.
 
Sorry I can't do it: I am not a kernel programmer.
Comment 65 aaronsloman 2013-06-21 05:26:45 EDT
I decided to try moving from Fedora 17 (32 bit) to Fedora 18 (64 bit), using the LXDE spin, now on kernel 3.9.5-201.fc18.x86_64

This is on a Dell Latitude E6410 with intel core i5 (4cpu).

Once again grub2 screwed me up at first and I had to fix an unbootable machine by running a live CD and editing grub.cfg to reinstate my old version, with the new menu entry added by hand. But after that everything worked, and 'yum update kernel' left my working format when I brought the system up to date.

Moreover, it seems that pm-hibernate is fixed at last, without any need for use of the 'maxcpus=' flag when resuming.

Further, the new system installed itself with suspend set to be triggered by shutting the lid. In the past I had to disable that because suspend did not work properly. So far it has worked without problems, though I have to be careful not to shut the lid if I want to access the laptop from my desktop machine.

I now have problems with wireless: neither NetworkManager nor Wicd connects -- something to do with dbus I think, but as a temporary fix 'ifup em1' gives me cable access.

Apart from that F18 seems to be a great improvement.
Comment 66 Fedora End Of Life 2013-07-03 19:31:55 EDT
This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 17 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to change the 
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.
Comment 67 aaronsloman 2013-07-03 20:05:44 EDT
Having now used Fedora 18 since about June 18th I can confirm that this bug has not reappeared, despite many resumes from pm-hibernate. It has also resumed successfully from unintended suspend episodes, when I inadvertently shut the lid.
Many thanks to the system developers responsible.
Comment 68 John Schmitt 2013-07-05 01:49:16 EDT
I reproduced this today.  

$ uname -r
3.9.6-200.fc18.x86_64

My machine hibernated and woke up successfully once and the subsequent attempt at hibernation hung.
Comment 69 Fedora End Of Life 2013-07-31 22:18:09 EDT
Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.