Bug 730811

Summary: hibernation often fails to resume and forces fsck
Product: Red Hat Enterprise Linux 6
Reporter: Matthew Mosesohn <mmosesoh>
Component: kernel
Assignee: John Feeney <jfeeney>
Status: CLOSED WORKSFORME
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 6.3
CC: chellwig, dchinner, esandeen, jfeeney, jmoyer, lczerner, mbroz, msanders, msnitzer, rwheeler, vgoyal
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-11-01 14:42:38 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 637248

Description Matthew Mosesohn 2011-08-15 19:58:10 UTC
Description of problem:
Hibernating on RHEL 6.1 and 6.2 pre-beta intermittently causes a crash when resuming. On the next boot, the system prompts for the root password to run a manual fsck.


Version-Release number of selected component (if applicable):
kernel-2.6.32-131.0.15.el6.x86_64
kernel-2.6.32-167.el6.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1. Boot system and log in
2. Start some applications, such as OpenOffice, Firefox, Thunderbird, Evince, Rhythmbox
3. Hibernate system
4. Resume from hibernation
  
Actual results:
About 30% of the time, the kernel panics when resuming.

Expected results:
Normal resume from hibernate

Additional info:

Comment 2 Ric Wheeler 2011-08-16 07:21:11 UTC
Can you please add details on the file system used (ext4, I guess?), the I/O stack, and the type of storage?

Thanks!

Comment 3 Matthew Mosesohn 2011-08-16 12:44:55 UTC
Ric,

ext4 on LVM logical volumes with full-disk LUKS encryption, on a local 500 GB SATA disk (in a ThinkPad T520 laptop).

Comment 4 Ric Wheeler 2011-08-16 13:33:16 UTC
Sounds like LUKS might be losing the write barrier/flush requests?

Comment 5 Jeff Moyer 2011-08-16 14:52:07 UTC
Tough to say without more debugging.  I think I'd start by assigning this to a device-mapper developer.

Comment 6 Eric Sandeen 2011-08-16 14:54:50 UTC
If you're up for more testing and have some hardware to do it on, I'd start with a very simple storage stack and then add layers to it one at a time, testing along the way, until you can see which layer or component causes the problem.

If it's ext4 on a plain partition, I'll perk up.  :)

Comment 7 Milan Broz 2011-08-16 16:24:21 UTC
It could be that a FLUSH is lost somewhere, that the ordering is wrong, or that a flush is missing for the workqueue (dmcrypt uses internal threads, but DM core should send a flush only when there is no IO in flight).

This seems to need more debugging. If flush is correctly backported, I do not think the problem is in dmcrypt. (It simply forwards the flush to the underlying device, the same as the linear target. DM core should wait for previous IOs, so the flush is sent only when dmcrypt has empty encryption queues.)

Is the hibernation code properly fixed to send a flush when saving the memory image to encrypted swap?

What is corrupted first: the memory image loaded from swap during resume, or the filesystem?
(I would try to hibernate and then, instead of resuming, run fsck from a live CD. If the filesystem is not corrupted, then the memory image in swap is corrupted and the filesystem corruption is just a consequence.)
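
Editorial sketch (not part of the original comment): one way to script the check suggested above, assuming the encrypted volume has already been unlocked from the live CD and the device path passed in is the ext4 logical volume. The function names and the device path are hypothetical. The exit-status bits are the ones documented in the e2fsck(8) man page. The check uses -n so that nothing is written to the disk. A filesystem that was never unmounted may still need journal replay, so a read-only result should be interpreted with care.

```python
import subprocess

# Exit-status bits as documented in the e2fsck(8) man page.
E2FSCK_EXIT_BITS = {
    1: "file system errors corrected",
    2: "system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "canceled by user request",
    128: "shared library error",
}


def describe_e2fsck_exit(code):
    """Translate an e2fsck exit status into a list of readable conditions."""
    if code == 0:
        return ["no errors"]
    return [text for bit, text in sorted(E2FSCK_EXIT_BITS.items()) if code & bit]


def likely_first_corruption(code):
    """Apply the reasoning from the comment: a clean filesystem after
    hibernation points at the memory image in swap instead."""
    if code == 0:
        return "swap image"
    if code & (1 | 4):
        return "filesystem"
    return "inconclusive"


def check_read_only(device):
    """Run a forced, read-only check (hypothetical helper).

    -n opens the filesystem read-only and answers "no" to every prompt;
    -f forces a check even if the filesystem is marked clean.
    """
    result = subprocess.run(["e2fsck", "-n", "-f", device])
    return result.returncode
```

For example, likely_first_corruption(check_read_only("/dev/mapper/vg-root")) would be called from the live environment, where /dev/mapper/vg-root stands in for whatever the real logical volume is named.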

Comment 8 Matthew Mosesohn 2011-08-16 16:31:40 UTC
Milan,

Are you requesting I try to reproduce that?

Comment 10 Matthew Mosesohn 2011-08-29 16:40:04 UTC
Is there any update on this request?

Comment 12 Matthew Garrett 2011-08-30 14:54:11 UTC
Matthew,

Can you attach the backtrace you get on resume?

Comment 18 Matthew Garrett 2011-10-04 15:42:16 UTC
Which kernel are you testing 6.2 with? Make sure that it's -199 or later.
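
Editorial sketch (not part of the original comment): a small helper for checking a kernel release string against the -199 threshold mentioned above. It assumes "-199" refers to the build number that follows "2.6.32-" in RHEL 6 kernel names such as the ones listed in the description. The function names are hypothetical.

```python
import re


def el6_build_number(release):
    """Return the build number from a string like '2.6.32-167.el6.x86_64'.

    An optional 'kernel-' prefix (as in the RPM name) is accepted.
    """
    match = re.match(r"^(?:kernel-)?2\.6\.32-(\d+)", release)
    if match is None:
        raise ValueError("unrecognized RHEL 6 kernel release: %r" % release)
    return int(match.group(1))


def is_at_least(release, minimum_build=199):
    """True if the kernel build number is minimum_build or later."""
    return el6_build_number(release) >= minimum_build


# The two kernels from the description are both older than -199.
for name in ("kernel-2.6.32-131.0.15.el6.x86_64", "kernel-2.6.32-167.el6.x86_64"):
    print(name, is_at_least(name))
```

On a running system the release string would come from uname -r (or platform.release() in Python).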