Bug 822071

Summary:	hibernation/resume cycle causes file system corruption: EXT4-fs error (device dm-1) in ext4_new_inode:895: IO failure
Product:	[Fedora] Fedora	Reporter:	Jacek Pawlyta <cunio>
Component:	dracut	Assignee:	dracut-maint
Status:	CLOSED ERRATA	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	17	CC:	accounts+redhat, admin, alex, awilliam, cgrim, chapelhilllaptopshop, david.moore, dracut-maint, fedora, gansalmon, gpadgett, harald, info, itamar, james, jonathan, jwulf, kernel-maint, madhu.chinakonda, mails.bugzilla.redhat.com, maurizio.antillon, mishu, ncrubel, nphilipp, pasqual.milvaques, paul.lipps, pekkas, public.oss, ralf, santiago, swt, theo148, thomas.lindroth, tilmann
Target Milestone:	---	Keywords:	CommonBugs
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:	https://fedoraproject.org/wiki/Common_F17_bugs#hibernate-corruption
Fixed In Version:		Doc Type:	Bug Fix
Last Closed:	2012-07-26 22:35:08 UTC	Type:	Bug
Bug Depends On:	830447
Bug Blocks:

Description Jacek Pawlyta 2012-05-16 09:15:40 UTC

Description of problem:
A hibernation (suspend-to-disk)/resume cycle causes EXT4 file system corruption, probably in the /var/tmp directory.

Version-Release number of selected component (if applicable):
3.4.0-0.rc7.git1.1.fc16.x86_64 (build --with release --with baseonly)

How reproducible:
often

Steps to Reproduce:
1. start laptop
2. hibernate (i.e. pm-hibernate)
3. resume from hibernation
  
Actual results:
corrupted EXT4 file system 

Expected results:
clean file system

Additional info:
[ 1240.417075] PM: restore of devices complete after 865.818 msecs
[ 1240.417260] PM: Image restored successfully.
[ 1240.417261] Restarting tasks ... done.
[ 1240.420743] PM: Basic memory bitmaps freed
[ 1240.421288] video LNXVIDEO:01: Restoring backlight state
[ 1242.479817] EXT4-fs error (device dm-1) in ext4_new_inode:895: IO failure
[ 1245.545873] usbcore: registered new interface driver btusb
[ 1246.556039] Bluetooth: hci0 command tx timeout
[ 1250.084835] tg3 0000:07:00.0: irq 48 for MSI/MSI-X
[ 1250.118167] ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 1250.323497] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:741: group 231, 12836 clusters in bitmap, 12815 in gd
[ 1250.323513] JBD2: Spotted dirty metadata buffer (dev = dm-1, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[ 1250.413262] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:741: group 264, 17629 clusters in bitmap, 17620 in gd
[ 1250.413283] JBD2: Spotted dirty metadata buffer (dev = dm-1, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[ 1270.152552] generic-bluetooth XXXXXXXXX: unknown main item tag 0x0
[ 1270.152656] input: Logitech Bluetooth Mouse M555b as /devices/pci0000:00/0000:00:1d.1/usb7/7-2/7-2:1.0/bluetooth/hci0/hci0:12/input11
[ 1270.154051] generic-bluetooth XXXXXXX: input,hidraw0: BLUETOOTH HID v4.16 Mouse [Logitech Bluetooth Mouse M555b] on XXXXXX
[ 1302.643773] EXT4-fs error (device dm-1) in ext4_new_inode:895: IO failure
[ 1302.644345] EXT4-fs error (device dm-1) in ext4_new_inode:895: IO failure
[ 1302.644806] EXT4-fs error (device dm-1) in ext4_new_inode:895: IO failure
[ 1302.645148] EXT4-fs error (device dm-1) in ext4_new_inode:895: IO failure
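The ext4_mb_generate_buddy lines above report two counts per block group: the free clusters counted from the block bitmap and, as far as I understand the message, the free count stored in the group descriptor ("gd"). A minimal Python sketch, assuming the message format stays exactly as in this excerpt, that extracts the mismatch for each group (the function and pattern names are made up for illustration):

```python
import re

# Pattern for the ext4_mb_generate_buddy message as it appears in this report.
BUDDY_RE = re.compile(
    r"EXT4-fs error \(device (?P<dev>[\w-]+)\): ext4_mb_generate_buddy:\d+: "
    r"group (?P<group>\d+), (?P<bitmap>\d+) clusters in bitmap, (?P<gd>\d+) in gd"
)


def bitmap_mismatches(log_text):
    """Return (device, group, bitmap_count, gd_count, difference) per matching line."""
    results = []
    for m in BUDDY_RE.finditer(log_text):
        bitmap, gd = int(m["bitmap"]), int(m["gd"])
        results.append((m["dev"], int(m["group"]), bitmap, gd, bitmap - gd))
    return results


# Two lines copied from the dmesg excerpt in this report.
SAMPLE = """\
[ 1250.323497] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:741: group 231, 12836 clusters in bitmap, 12815 in gd
[ 1250.413262] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:741: group 264, 17629 clusters in bitmap, 17620 in gd
"""

if __name__ == "__main__":
    for dev, group, bitmap, gd, diff in bitmap_mismatches(SAMPLE):
        print(f"{dev} group {group}: bitmap={bitmap} gd={gd} diff={diff}")
```

Feeding it a full dmesg dump instead of SAMPLE would list every affected group, which may help when comparing reports from different machines.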

Comment 1 Pekka Savola 2012-05-16 10:21:25 UTC

On i686 I get this:

EXT4-fs error (device dm-0): mb_free_blocks:1348: group 447, block 14657679:
freeing already freed block (bit 10383)

I suppose this is the same thing, but if not, I'll file a different bug.

Comment 2 Nils Philippsen 2012-05-21 10:33:48 UTC

I've seen the same on kernel-3.3.4-5.fc17.x86_64, only that I successfully suspended to and resumed from disk, then suspended to RAM:

[47250.418747] EXT4-fs error (device dm-1) in ext4_new_inode:941: IO failure
[47250.451622] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:739: group 78, 7178 clusters in bitmap, 7083 in gd
[47250.451737] JBD2: Spotted dirty metadata buffer (dev = dm-1, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[47250.799362] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:739: group 75, 7595 clusters in bitmap, 7559 in gd
[47250.799375] JBD2: Spotted dirty metadata buffer (dev = dm-1, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

Let me know if you need the full dmesg log.

Comment 3 Jacek Pawlyta 2012-05-25 17:39:52 UTC

Kernel 3.4.0-1.fc18.x86_64 on Fedora-16 - the same problem

Comment 4 Thomas Lindroth 2012-06-04 23:14:53 UTC

Googling for this error led me here. Let's try to figure out what we have in common.

Looks like you all run ext4 on device mapper like me. I use kernel 3.4 and you are using a recent kernel as well. I'm experiencing this problem on a Gentoo desktop without any suspending involved but there is a reason why I suspect this is the same problem. What graphics drivers are you using?

https://bugzilla.redhat.com/show_bug.cgi?id=723499#c2
This other report about a similar problem mentions graphics drivers corrupting memory as a possible cause.

The first time I got a warning about file corruption was two days ago when I was playing a game using the radeon open source drivers. I quickly rebooted and did an fsck. All was fine until tonight when I started up the game again. I got corruption messages again but this time I ignored them. Soon after I got a general protection fault and after reboot my root fs is pretty badly corrupted.

If you are using the open source mesa based drivers, which version are you using? I was running the latest version from git.

Comment 5 Nils Philippsen 2012-06-05 14:43:09 UTC

(In reply to comment #4)
> Let's try to figure out what we have in common.

I don't think it's related to the graphics hardware, I've seen this on machines with (both open source) Radeon and Intel graphics. I can confirm that I use ext4 on device-mapper/LVM. If you don't suspend to disk I think you're experiencing something else -- I only see the problem after suspending to disk, then resuming and then some time passing (sometimes even only after a few suspend/resume cycles). I used mesa in whatever version we shipped in Fedora 16 and 17 at the time this happened.

Comment 6 Matthias Hensler 2012-06-06 10:40:14 UTC

I also do not think that the problem is related to any graphic hardware.

My setup: Lenovo Thinkpad T420 with an Nvidia card. The Intel card is deactivated in the BIOS and only the Nvidia card is active. I am using the binary Nvidia driver.

The hard disk is an Intel SSD 520, with a separate /boot partition and everything else on logical LVM volumes. There is a standard DOS partition table (no GPT). The bootloader is grub2 and I believe it is using UEFI (not sure, have to check in detail if relevant).

Also I am not using the in-kernel hibernation, but TuxOnIce (my own kernel spins).

This setup has never caused any problems. The last working scenario was Fedora 16 with 3.3.7-1_1.cubbi_tuxonice.fc16.x86_64 (that is the standard Fedora 3.3.7-1.fc16 kernel, but with TuxOnIce patched). Never had any trouble with that kernel or any previous kernel.

The trouble started with upgrading to Fedora 17 (using PreUpgrade in this case). Installed is now 3.3.7-1_1.cubbi_tuxonice.fc17.x86_64 (the standard Fedora 3.3.7-1.fc17 kernel with TuxOnIce). Notice that it is the exact same base version as the F16 kernel (although from what I remember Fedora has backported some more features to the F17 version in contrast to the F16 version). Using that kernel to hibernate and resume I immediately see the symptoms described in this bug (occurring on the / filesystem, which is located on LVM and using ext4).

So, from the things described here I conclude that the problem occurs either in some of the additional patches that were introduced in the current Fedora 17 kernel, or in dracut (which seems to work a bit differently in Fedora 17). It occurs with standard swsusp as well as with TuxOnIce, so if it is kernel related the problem has to be in the commonly shared hibernation infrastructure.

When looking at comment 3 (reporting that the problem occurs with a 3.4 kernel on Fedora 16) I would suspect that the cause is not dracut related (at least if dracut is still the standard Fedora 16 version in that case, which I believe to work without any problems).

Furthermore: reporter from comment 4 sees the problem with a 3.4 kernel on Gentoo. I do not know which method Gentoo uses for creating its initrd, but it seems more than likely that the initrd (and therefore dracut) is fine here, and the kernel is to blame. The regression was most likely introduced on the way from 3.3 to 3.4 and at least backported by Fedora in 3.3.4 (see comment 2).

Comment 7 Matthias Hensler 2012-06-09 19:12:27 UTC

Maybe my conclusions about dracut were too early.

When I carefully inspected my dmesg I noticed that my rootfs gets mounted before the resume process kicks in. That problem exists at least in my setup with tuxonice and the tuxonice-dracut module.

That could be the case for the in kernel hibernation as well. The problem is within dracut (either with dracut itself, or the LVM module from dracut).

If the swap device is on LVM and no resume= parameter is specified on the command line (the problem might also exist if resume= is specified, that has to be checked), there is a high chance that dracut has already started to mount the rootfs before the udev rule that starts the resume kicks in.

I described the full problem with dracut in bug 822071. So far I have installed a little dracut-module in the pre-mount hook, which makes sure that udev is fully settled. That seems to fix the problem for me (at least with kernel 3.3.7 and 3.4.0, both TuxOnIce patched).
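Since the scenario above depends on whether a resume= parameter is present on the kernel command line, here is a minimal Python sketch for checking that on a running system. It only reads /proc/cmdline and does not detect the mount/resume race itself; the function names are hypothetical.

```python
import os


def resume_device(cmdline):
    """Return the value of the last resume= parameter, or None if absent."""
    value = None
    for token in cmdline.split():
        if token.startswith("resume="):
            value = token[len("resume="):]
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the kernel command line of the running system."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    if os.path.exists("/proc/cmdline"):
        dev = resume_device(read_cmdline())
        if dev is None:
            print("no resume= parameter on the kernel command line")
        else:
            print(f"resume device: {dev}")
```

If this prints no resume device while swap lives on LVM, the system matches the configuration suspected above.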

Comment 8 Nils Philippsen 2012-06-11 08:48:51 UTC

(In reply to comment #7)
> I described the full problem with dracut in bug 822071.

I think you mean bug #830447. CCing dracut maintainers here.

Comment 9 Joshua Wulf 2012-06-26 06:14:19 UTC

Lenovo T510 with  3.4.3-1.fc17.x86_64 

Suspended, resumed, tried to install a package, and got:

Jun 26 16:06:48 dhcp-1-77 kernel: [176960.495973] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 161, block 5280139:freeing already freed block (bit 4491)
Jun 26 16:06:48 dhcp-1-77 kernel: [176960.495979] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 161, block 5280140:freeing already freed block (bit 4492)
Jun 26 16:07:08 dhcp-1-77 kernel: [176980.644324] EXT4-fs error (device dm-1) in ext4_new_inode:897: IO failure

Comment 10 Joshua Wulf 2012-06-26 06:15:25 UTC

Sorry, to clarify - it had a hibernation event two days ago due to mains power failure.

Comment 11 info 2012-06-28 07:03:19 UTC

I've got the same problem: installed F17 yesterday, hibernated, ran into problems this morning. But in my case I noticed the problem after failing to write to /usr/bin (I even filed a bug about being unable to install gvim). Same thing in the log:

Jun 28 07:49:51 localhost kernel: [24182.491209] EXT4-fs error (device dm-1) in ext4_new_inode:897: IO failure



/etc/fstab:

/dev/mapper/vg_marbleface-lv_root /                       ext4    defaults        1 1
UUID=2c82c6f5-32a7-474b-b0c0-833b4d5a0fce /boot                   ext4    defaults        1 2
/dev/mapper/vg_marbleface-lv_home /home                   ext4    defaults        1 2
/dev/mapper/vg_marbleface-lv_tmp /tmp                    ext2    defaults        1 2
/dev/mapper/vg_marbleface-lv_var /var                    ext4    defaults        1 2
/dev/mapper/vg_marbleface-lv_swap swap                    swap    defaults        0 0


Note: Installed F17 instead of upgrade from F15. Kept disk LVM layout from F15 installation, but formatted every lv except /home which I just reused.

Comment 12 Derek Linz 2012-07-01 16:25:12 UTC

Same problem, no LVM here:

UUID=3a3a1bb2-29c1-4a62-9520-d81b3bdd4e41 /           ext4    defaults        1 1                                                                                                                                                      
UUID=f519db7a-35d5-48eb-b3e4-187d1a4a1792 /boot       ext4    defaults        1 2                                                                                                                                                      
UUID=b840bade-b62d-40e9-9f27-82ffd85f881f swap        swap    defaults        0 0 


Linux eurocom 3.3.4-5.fc17.x86_64 #1 SMP Mon May 7 17:29:34 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux       


The frustrating part is I reverted to this kernel to avoid an atheros wifi bug that makes the machine unusable. Sigh.

Comment 13 Derek Linz 2012-07-01 16:27:20 UTC

I'm using an SSD, and I noticed one other person mentioned it; I don't suppose we all are? I'm going to try mounting with the discard flag for kicks.

Comment 14 info 2012-07-01 16:42:23 UTC

I'm not using an SSD, but I am using LVM. smartctl shows no prefail or old age thresholds breached. I haven't hibernated since my previous post and I'm not seeing any problems. I'll try hibernating today and see what happens.


=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F3
Device Model:     SAMSUNG HD502HJ
Serial Number:    S20BJ1BZ101250
LU WWN Device Id: 5 0024e9 002c48b54
Firmware Version: 1AJ100E4
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 6
Local Time is:    Sun Jul  1 18:37:01 2012 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Comment 15 info 2012-07-02 20:27:50 UTC

No, I can't reproduce the problem. There was a kernel update a few days ago; could that have fixed it?

> uname -a
Linux marbleface.lan 3.4.4-3.fc17.x86_64 #1 SMP Tue Jun 26 20:54:56 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Comment 16 Josh Boyer 2012-07-03 14:33:51 UTC

*** Bug 823871 has been marked as a duplicate of this bug. ***

Comment 17 Jacek Pawlyta 2012-07-09 08:57:55 UTC

I'm using kernel 3.4.4-3.fc17.x86_64 and I see the problem :(

Comment 18 James 2012-07-13 19:11:56 UTC

(In reply to comment #17)
> I'm using kernel 3.4.4-3.fc17.x86_64 and I see the problem :(

You're not the only one, just ran into this with the same kernel. My notebook hibernated yesterday with critically low battery. Came home, turned it back on and it came back as if nothing were wrong -- didn't notice the ext4 error messages.

A few suspend/resume cycles later, it crashed this evening in the ext4 code (X died too), rebooted to an unmountable filesystem and had to manually fsck around to get it back.

Hardware is Intel i7 2760QM, HD 3000 graphics. SMART reports HDD in excellent health.

Comment 19 info 2012-07-16 18:49:06 UTC

Seems I'm after all still having these problems. Not directly visible except in error logs. For example, this one immediately after resuming from hibernation:

Jul 16 09:07:45 marbleface kernel: [168599.755011] EXT4-fs error (device dm-1) in ext4_new_inode:897: IO failure

And there are more from previous days:

Jul 13 14:38:43 marbleface kernel: [178470.512783] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:741: group 7, 5083 clusters in bitmap, 4161 in gd
Jul 13 14:38:43 marbleface kernel: [178470.512991] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:741: group 3, 6537 clusters in bitmap, 4489 in gd


Obviously they don't happen every day though I do hibernate every day since my last post (Jul 2nd).

Comment 20 Fedora Update System 2012-07-19 13:32:18 UTC

dracut-018-93.git20120719.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/dracut-018-93.git20120719.fc17

Comment 21 Fedora Update System 2012-07-20 01:52:49 UTC

Package dracut-018-93.git20120719.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing dracut-018-93.git20120719.fc17'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-10867/dracut-018-93.git20120719.fc17
then log in and leave karma (feedback).

Comment 22 Jacek Pawlyta 2012-07-20 07:12:43 UTC

Three hibernations so far, and it seems that the bug is fixed.

Comment 23 Fedora Update System 2012-07-20 13:45:08 UTC

dracut-018-95.git20120720.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/dracut-018-95.git20120720.fc17

Comment 24 Jacek Pawlyta 2012-07-23 11:23:56 UTC

Sorry to inform you, but after the fourth hibernation (an automatic one, because of low battery) on:

kernel-3.5.0-0.rc7.git4.1.fc17.x86_64 (debugging turned off)
dracut-018-95.git20120720.fc17

I got this:

[71901.369923] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:741: group 228, 24401 clusters in bitmap, 24368 in gd
[71901.369946] JBD2: Spotted dirty metadata buffer (dev = dm-1, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[71921.875323] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:741: group 308, 5983 clusters in bitmap, 5561 in gd
[71922.498433] EXT4-fs error (device dm-1): ext4_free_inode:320: comm klauncher: bit already cleared for inode 2492136
[71926.236161] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10095008:freeing already freed block (bit 2464)
[71926.236170] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10095009:freeing already freed block (bit 2465)
[71926.236173] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10095010:freeing already freed block (bit 2466)
[71926.236177] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10095011:freeing already freed block (bit 2467)
[71926.236180] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10095012:freeing already freed block (bit 2468)
[71926.236184] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10107799:freeing already freed block (bit 15255)
[71926.236188] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10107800:freeing already freed block (bit 15256)
[71926.236191] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10107801:freeing already freed block (bit 15257)
[71926.236194] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10107802:freeing already freed block (bit 15258)
[71926.236198] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10107803:freeing already freed block (bit 15259)
[71926.236201] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10107804:freeing already freed block (bit 15260)
[71926.236204] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10107805:freeing already freed block (bit 15261)
[71926.236208] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10107806:freeing already freed block (bit 15262)
[71926.236211] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10107807:freeing already freed block (bit 15263)
[71926.236214] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10107808:freeing already freed block (bit 15264)
[71926.236218] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10107809:freeing already freed block (bit 15265)
[71926.236221] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10107810:freeing already freed block (bit 15266)
[71926.236224] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10107811:freeing already freed block (bit 15267)
[71926.236228] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10107812:freeing already freed block (bit 15268)
[71926.236231] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10107813:freeing already freed block (bit 15269)
[71926.236234] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10107814:freeing already freed block (bit 15270)
[71926.236237] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10107815:freeing already freed block (bit 15271)
[71926.310944] EXT4-fs error (device dm-1) in ext4_new_inode:938: IO failure
[71926.351454] EXT4-fs error (device dm-1) in ext4_new_inode:938: IO failure
[71928.528245] EXT4-fs error (device dm-1) in ext4_new_inode:938: IO failure
[71950.993230] EXT4-fs error (device dm-1) in ext4_new_inode:938: IO failure
[71955.560164] EXT4-fs error (device dm-1) in ext4_new_inode:938: IO failure
[71968.925144] EXT4-fs error (device dm-1) in ext4_new_inode:938: IO failure
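The long run of mb_free_blocks messages above covers only a couple of contiguous block ranges. A small Python sketch, assuming the message format shown in this log, that collapses such lines into ranges so the extent of the damage is easier to read (the helper name is made up):

```python
import re

# Matches the block number in "block N:freeing already freed block" messages.
FREED_RE = re.compile(r"block (\d+):\s*freeing already freed block")


def freed_block_ranges(log_text):
    """Collapse the reported block numbers into sorted (first, last) ranges."""
    blocks = sorted({int(n) for n in FREED_RE.findall(log_text)})
    ranges = []
    for b in blocks:
        if ranges and b == ranges[-1][1] + 1:
            ranges[-1][1] = b  # extend the current contiguous range
        else:
            ranges.append([b, b])  # start a new range
    return [(first, last) for first, last in ranges]


# A few lines copied from the log in this comment.
SAMPLE = """\
[71926.236161] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10095008:freeing already freed block (bit 2464)
[71926.236170] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10095009:freeing already freed block (bit 2465)
[71926.236173] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10095010:freeing already freed block (bit 2466)
[71926.236184] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10107799:freeing already freed block (bit 15255)
[71926.236188] EXT4-fs error (device dm-1): mb_free_blocks:1301: group 308, block 10107800:freeing already freed block (bit 15256)
"""

if __name__ == "__main__":
    for first, last in freed_block_ranges(SAMPLE):
        print(f"blocks {first}-{last} ({last - first + 1} blocks)")
```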

Comment 25 Fedora Update System 2012-07-24 10:40:21 UTC

dracut-018-96.git20120724.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/dracut-018-96.git20120724.fc17

Comment 26 Jacek Pawlyta 2012-07-25 08:40:13 UTC

last dracut update didn't help me, on:

fedora 17
kernel 3.5.0-1.fc18.x86_64
dracut-018-96.git20120724.fc17 

after resume from the first hibernation I got the following messages in the log: 

[ 2473.777587] EXT4-fs error (device dm-1) in ext4_new_inode:938: IO failure
[ 2476.657455] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:741: group 265, 9673 clusters in bitmap, 9661 in gd
[ 2476.657472] JBD2: Spotted dirty metadata buffer (dev = dm-1, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[ 2478.840867] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:741: group 3, 4339 clusters in bitmap, 4310 in gd
[ 2478.840899] JBD2: Spotted dirty metadata buffer (dev = dm-1, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

Comment 27 Harald Hoyer 2012-07-25 09:59:42 UTC

(In reply to comment #26)
> last dracut update didn't help me, on:
> 
> fedora 17
> kernel 3.5.0-1.fc18.x86_64
> dracut-018-96.git20120724.fc17 
> 
> after resume from the first hibernation I got the following messages in the
> log: 
> 
> [ 2473.777587] EXT4-fs error (device dm-1) in ext4_new_inode:938: IO failure
> [ 2476.657455] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:741:
> group 265, 9673 clusters in bitmap, 9661 in gd
> [ 2476.657472] JBD2: Spotted dirty metadata buffer (dev = dm-1, blocknr =
> 0). There's a risk of filesystem corruption in case of system crash.
> [ 2478.840867] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:741:
> group 3, 4339 clusters in bitmap, 4310 in gd
> [ 2478.840899] JBD2: Spotted dirty metadata buffer (dev = dm-1, blocknr =
> 0). There's a risk of filesystem corruption in case of system crash.

can you please resume with "rd.debug" on the kernel command line and attach the output of "dmesg" in this bugzilla.

Comment 28 Jacek Pawlyta 2012-07-25 13:38:16 UTC

> can you please resume with "rd.debug" on the kernel command line and attach
> the output of "dmesg" in this bugzilla.

I did that, but there is no dracut output in dmesg.
In the meantime I found that I had some leftovers of the infamous dracut-20-51; I removed all of them following your post:
http://permalink.gmane.org/gmane.linux.redhat.fedora.testers/99509
I reinstalled dracut-18-96 and so far, after two hibernation/resume cycles, so good.

Comment 29 Harald Hoyer 2012-07-25 13:42:36 UTC

(In reply to comment #28)
> > can you please resume with "rd.debug" on the kernel command line and attach
> > the output of "dmesg" in this bugzilla.
> 
> I made it, but there is no dracut output in dmesg,
> but in the meantime I found I have some famous dracut-20-51 residuals, I
> removed all of them using your post:  
> http://permalink.gmane.org/gmane.linux.redhat.fedora.testers/99509
> I reinstalled dracut-18-96 and so far as two hibernations/resumes so good

Oh, you tried the bad rawhide dracut? My bad... please also check /etc/lvm/lvm.conf and probably revert it as well.

Comment 30 Jacek Pawlyta 2012-07-26 10:59:44 UTC

this is just to inform you that after cleaning the system and rewriting /etc/lvm/lvm.conf no signs of previous hibernation/resume problems after 7 cycles!
The bug seems to be removed :)

fedora 17
kernel 3.5.0-1.fc18.x86_64
dracut-018-96.git20120724.fc17

Comment 31 Harald Hoyer 2012-07-26 11:16:16 UTC

thanks! don't forget to give +1 karma

Comment 32 Fedora Update System 2012-07-26 22:35:08 UTC

dracut-018-96.git20120724.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 33 Ralf Baechle 2012-07-31 09:16:24 UTC

I've just been hit by this bug for the 2nd time after having updated to dracut-018-96.git20120724.fc17 two days ago.

I wonder, does triggering this bug depend on what dracut version I was running by the time of the initial boot (which was 8 days ago, so with an older dracut version) or by the time of the hibernate/resume cycle?

Comment 34 Harald Hoyer 2012-07-31 11:33:28 UTC

(In reply to comment #33)
> I've just been hit by this bug for the 2nd time after having updated to
> dracut-018-96.git20120724.fc17 two days ago.
> 
> I wonder, does triggering this bug depend on what dracut version I was
> running by the time of the initial boot (which was 8 days ago, so with an
> older dracut version) or by the time of the hibernate/resume cycle?

I hope you resume with the same kernel version as you had while hibernating.

Comment 35 Ralf Baechle 2012-07-31 12:26:56 UTC

(In reply to comment #34)
> (In reply to comment #33)
> > I've just been hit by this bug for the 2nd time after having updated to
> > dracut-018-96.git20120724.fc17 two days ago.
> > 
> > I wonder, does triggering this bug depend on what dracut version I was
> > running by the time of the initial boot (which was 8 days ago, so with an
> > older dracut version) or by the time of the hibernate/resume cycle?
> 
> I hope you resume with the same kernel version as you had while hibernating.

In case of corruption case #1 it did indeed resume into a different kernel, that is 3.4.4-5.fc17.x86_64 -> 3.4.6-2.fc17.x86_64.  I tried to reproduce this on my other systems but those always booted into the right kernel even if I picked an incorrect kernel at the grub prompt.

In corruption case #2 the kernel for both suspend and resume was kernel-3.4.6-2.fc17.x86_64.

Btw, on all architectures but x86 the kernel checks the version in the header of the hibernation image in swap space. x86 is the only architecture that uses CONFIG_ARCH_HIBERNATION_HEADER, which (see kernel/power/snapshot.c and kernel/power/power.h) disables the check in check_image_kernel(); the arch function arch_hibernation_header_restore() that is called instead does not do the version check. So if the bootloader boots the wrong kernel, things go splat.
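Because the kernel does not perform this check on x86, one userspace precaution is to verify, before hibernating, that the bootloader's default entry is the kernel that is currently running. The following Python sketch assumes grubby is installed and that "grubby --default-kernel" prints a path of the form /boot/vmlinuz-<release>; the helper names are hypothetical and this is not something dracut or pm-utils is known to do.

```python
import os
import platform
import shutil
import subprocess


def release_from_vmlinuz(path):
    """Extract the release string from a path such as /boot/vmlinuz-<release>."""
    name = os.path.basename(path.strip())
    prefix = "vmlinuz-"
    return name[len(prefix):] if name.startswith(prefix) else None


def boots_running_kernel(default_kernel_path, running_release):
    """True if the default boot entry matches the running kernel release."""
    return release_from_vmlinuz(default_kernel_path) == running_release


if __name__ == "__main__":
    if shutil.which("grubby") is None:
        print("grubby not found; cannot determine the default boot entry")
    else:
        proc = subprocess.run(
            ["grubby", "--default-kernel"], capture_output=True, text=True
        )
        if proc.returncode != 0:
            print("grubby failed; cannot determine the default boot entry")
        elif boots_running_kernel(proc.stdout, platform.release()):
            print("default boot entry matches the running kernel")
        else:
            print("WARNING: default boot entry differs from the running kernel")
```

Note that checking "uname -r" after a resume would not help here, since a successfully restored image runs the kernel that was hibernated, not the one the bootloader started.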

Comment 36 Ralf Baechle 2012-07-31 18:57:40 UTC

(In reply to comment #35)

> In corruption case #2 the kernel for both suspend and resume was
> kernel-3.4.6-2.fc17.x86_64.
> 
> Btw, on all architectures but x86 the kernel will check the version in the
> header of the suspend state in the swap space.  X86 as the only version uses
> CONFIG_ARCH_HIBERNATION_HEADER which (see kernel/power/snapshot.c and
> kernel/power/power.h) disables the check in check_image_kernel() and the
> arch code function arch_hibernation_header_restore() that is being called
> instead does not do the version check.  So if the bootloader boots the wrong
> kernel things go splat.

And I can reproduce this trivially on a fully up to date F17 install (kernel-3.4.6-2.fc17.x86_64 and dracut-018-96.git20120724.fc17.noarch).

  Ralf

Comment 37 Dr. Tilmann Bubeck 2012-08-01 06:47:32 UTC

> In case of corruption case #1 it did indeed resume into a different kernel,
> that is 3.4.4-5.fc17.x86_64 -> 3.4.6-2.fc17.x86_64.

You have to check carefully which kernel is booting, because grubby is currently buggy (https://bugzilla.redhat.com/show_bug.cgi?id=732654) and inserts a wrong "Loading message..." into grub.cfg, so that it prints the version number of the previously installed kernel and makes the user think it boots a different version. But only the output message is wrong; it boots the right kernel. If you can still reproduce, you can use "uname -r" to check which kernel was loaded.

IMHO it is not enough to just _update_ dracut. You also have to use the new version during boot, which is triggered by a subsequent kernel update or an explicit "dracut --force".
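One rough way to tell whether the initramfs in use predates the dracut update is to compare its modification time with the install time of the dracut package. This Python sketch assumes the Fedora naming convention /boot/initramfs-<release>.img and uses "rpm -q --qf '%{INSTALLTIME}' dracut"; the function names are made up, and an older timestamp is only a hint, not proof that the image lacks the fix.

```python
import os
import platform
import shutil
import subprocess


def initramfs_is_stale(initramfs_mtime, dracut_install_time):
    """True if the initramfs was generated before dracut was installed/updated."""
    return initramfs_mtime < dracut_install_time


def dracut_install_time():
    """Return the dracut package install time (epoch seconds), or None."""
    proc = subprocess.run(
        ["rpm", "-q", "--qf", "%{INSTALLTIME}", "dracut"],
        capture_output=True,
        text=True,
    )
    out = proc.stdout.strip()
    return int(out) if proc.returncode == 0 and out.isdigit() else None


if __name__ == "__main__":
    # Assumed Fedora naming convention for the initramfs image.
    image = f"/boot/initramfs-{platform.release()}.img"
    if shutil.which("rpm") and os.path.exists(image):
        installed = dracut_install_time()
        if installed is None:
            print("could not query the dracut package")
        elif initramfs_is_stale(os.path.getmtime(image), installed):
            print(f"{image} is older than the installed dracut; consider 'dracut --force'")
        else:
            print(f"{image} was generated after dracut was installed")
```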

Comment 38 Harald Hoyer 2012-08-01 09:26:04 UTC

Yes, but dracut cannot prevent you from booting a different kernel. So for dracut this is NOTABUG.

Comment 39 Ralf Baechle 2012-08-02 07:30:55 UTC

(In reply to comment #37)
> > In case of corruption case #1 it did indeed resume into a different kernel,
> > that is 3.4.4-5.fc17.x86_64 -> 3.4.6-2.fc17.x86_64.
> 
> You have to be very careful to check, which kernel is booting, because
> currently grubby is buggy https://bugzilla.redhat.com/show_bug.cgi?id=732654
> and inserts a wrong "Loading message..." into grub.cfg so that it prints the
> version number of the previously installed kernel and makes the user think,
> it boots a different number. But only the output message is wrong, it boots
> the right kernel. If you can still reproduce, you can use "uname -r" to
> check which kernel was loaded.
> 
> IMHO it is not enough to just _update_ dracut. You also have to use the new
> version during boot, which is triggered by a subsequent kernel update or an
> explicit "dracut --force".

Ah, of course.  However reinstalling the kernel package in order to get the initrd to be regenerated results in a broken grub.cfg probably because I temporarily uninstalled the only kernel package that was installed and the scripts are not prepared to deal with that kind of situation.

Comment 40 Ralf Baechle 2012-08-02 07:32:46 UTC

(In reply to comment #37)
> > In case of corruption case #1 it did indeed resume into a different kernel,
> > that is 3.4.4-5.fc17.x86_64 -> 3.4.6-2.fc17.x86_64.
> 
> You have to be very careful to check, which kernel is booting, because
> currently grubby is buggy https://bugzilla.redhat.com/show_bug.cgi?id=732654
> and inserts a wrong "Loading message..." into grub.cfg so that it prints the
> version number of the previously installed kernel and makes the user think,
> it boots a different number. But only the output message is wrong, it boots
> the right kernel. If you can still reproduce, you can use "uname -r" to
> check which kernel was loaded.
> 
> IMHO it is not enough to just _update_ dracut. You also have to use the new
> version during boot, which is triggered by a subsequent kernel update or an
> explicit "dracut --force".

Ah, of course.  However reinstalling the kernel package in order to get the initrd to be regenerated results in a broken grub.cfg probably because I temporarily uninstalled the only kernel package that was installed and the scripts are not prepared to deal with that kind of situation.  So that fixed the issue I think although I'm now getting some unhappy but apparently harmless messages from grub.  Thanks for the help!

Comment 41 Xavier Hourcade 2012-08-06 16:06:47 UTC

Hi, this bug still applies to F16 too (same story as in comment #18 here).
And I've got lots of <3 and karma to give :)

  dracut-018-55.git20120606.fc16.noarch
  kernel-3.4.6-1.fc16.x86_64

Disabling hibernation for now, was a little stressful ^^