730811 – hibernation often fails to resume and forces fsck



Summary: hibernation often fails to resume and forces fsck

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	6.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	John Feeney
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	637248

Reported:	2011-08-15 19:58 UTC by Matthew Mosesohn
Modified:	2013-01-10 13:06 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-11-01 14:42:38 UTC
Target Upstream Version:
Embargoed:


Description Matthew Mosesohn 2011-08-15 19:58:10 UTC

Description of problem:
Hibernating on RHEL 6.1 and 6.2 pre-beta seems to cause crashes when resuming. On the next boot, the system then prompts for the root password to run a manual fsck.


Version-Release number of selected component (if applicable):
kernel-2.6.32-131.0.15.el6.x86_64
kernel-2.6.32-167.el6.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1. Boot system and log in
2. Start some applications, such as OpenOffice, Firefox, Thunderbird, Evince, Rhythmbox
3. Hibernate system
4. Resume from hibernation
  
Actual results:
About 30% of the time, the system hits a kernel panic when resuming.

Expected results:
Normal resume from hibernate

Additional info:

Comment 2 Ric Wheeler 2011-08-16 07:21:11 UTC

Can you please add in details on the file system used (ext4 I guess?), the IO stack and type of storage?

Thanks!

Comment 3 Matthew Mosesohn 2011-08-16 12:44:55 UTC

Ric,

ext4 on LVM logical volumes with full-disk LUKS encryption, on a local 500 GB SATA disk (in a ThinkPad T520 laptop).

Comment 4 Ric Wheeler 2011-08-16 13:33:16 UTC

Sounds like LUKS might be losing the write barrier/flush requests?

Comment 5 Jeff Moyer 2011-08-16 14:52:07 UTC

Tough to say without more debugging.  I think I'd start by assigning this to a device-mapper developer.

Comment 6 Eric Sandeen 2011-08-16 14:54:50 UTC

If you're in for more testing and have some hardware to do it, I'd start with a very simple storage stack, and then add things to it, testing along the way, until you can see which layer/component seems to cause the problem.

If it's ext4 on a plain partition, I'll perk up.  :)
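
A rough sketch of that layer-by-layer approach, in Python. Since the report describes the failure as intermittent (about 30% of resumes), a single clean cycle proves little, so the sketch also estimates how many clean cycles are worth running per layer. Everything here is an assumption for illustration: the layer names are taken from this report, `cycle_ok` is a hypothetical callback standing in for one manual hibernate/resume test, and the estimate assumes each cycle fails independently at the reported rate.

```python
import math
from typing import Callable, Optional, Sequence

# Storage stacks from simplest to the reporter's full setup (names are illustrative).
LAYERS = ["ext4 on plain partition", "ext4 on LVM", "ext4 on LVM on LUKS"]


def cycles_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Clean cycles to run before calling a configuration good.

    Assumes a bad configuration fails each cycle independently with
    probability failure_rate, so n clean cycles in a row happen by
    chance with probability (1 - failure_rate) ** n.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


def first_bad_layer(
    layers: Sequence[str],
    cycle_ok: Callable[[str], bool],
    cycles: int,
) -> Optional[str]:
    """Return the first layer that fails a hibernate/resume cycle, or None.

    cycle_ok is a hypothetical hook: it performs one hibernate/resume
    cycle on the given stack and reports whether the resume succeeded.
    """
    for layer in layers:
        for _ in range(cycles):
            if not cycle_ok(layer):
                return layer
    return None
```

With the 30% figure from the description, `cycles_needed(0.3)` gives 9, because 0.7^9 is about 0.04 while 0.7^8 is still about 0.058.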

Comment 7 Milan Broz 2011-08-16 16:24:21 UTC

It could be that a FLUSH is lost somewhere, that the ordering is wrong, or that a flush of the workqueue is missing (dmcrypt uses internal threads, but the DM core should send a flush only when there is no IO in flight).

This seems to need more debugging. If flush is correctly backported, I do not think the problem is in dmcrypt. (It simply forwards the flush to the underlying device, the same as the linear target. The DM core should wait for previous IOs, so the flush is sent when dmcrypt has empty encryption queues.)

Is the hibernation code properly fixed to send flush when saving memory image to encrypted swap?

What is corrupted first: the memory image loaded from swap during resume, or the filesystem?
(I would try to hibernate and, instead of resuming, run fsck from a live CD. If the fs is not corrupted, then the memory image in swap is corrupted and the fs corruption is just a consequence.)
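
The reasoning in that last test can be written down as a small decision helper. This is only a sketch in Python: it does not run fsck itself, it takes the exit status from a check done on the live CD and maps it to the two cases described above. The bit values are the exit codes documented in fsck(8); the function name and the wording of the results are made up for illustration.

```python
# Exit status bits as documented in fsck(8); the status is their bitwise OR.
FSCK_ERRORS_CORRECTED = 1
FSCK_REBOOT_NEEDED = 2
FSCK_ERRORS_UNCORRECTED = 4
FSCK_OPERATIONAL_ERROR = 8
FSCK_USAGE_ERROR = 16
FSCK_CANCELLED = 32
FSCK_LIBRARY_ERROR = 128


def diagnose(fsck_status: int) -> str:
    """Map an fsck exit status, taken after hibernating but before resuming,
    to the likely origin of the corruption (hypothetical helper)."""
    did_not_finish = (
        FSCK_OPERATIONAL_ERROR | FSCK_USAGE_ERROR | FSCK_CANCELLED | FSCK_LIBRARY_ERROR
    )
    found_damage = FSCK_ERRORS_CORRECTED | FSCK_REBOOT_NEEDED | FSCK_ERRORS_UNCORRECTED

    if fsck_status & did_not_finish:
        return "inconclusive: fsck did not complete a check"
    if fsck_status & found_damage:
        return "filesystem already corrupted before resume"
    return "filesystem clean: suspect the memory image in swap"
```

A status of 0 points at the hibernation image, while a status with bit 1, 2 or 4 set points at the filesystem being damaged on the way down.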

Comment 8 Matthew Mosesohn 2011-08-16 16:31:40 UTC

Milan,

Are you requesting I try to reproduce that?

Comment 10 Matthew Mosesohn 2011-08-29 16:40:04 UTC

Is there any update on this request?

Comment 12 Matthew Garrett 2011-08-30 14:54:11 UTC

Matthew,

Can you attach the backtrace you get on resume?

Comment 18 Matthew Garrett 2011-10-04 15:42:16 UTC

Which kernel are you testing 6.2 with? Make sure that it's -199 or later.
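
A quick way to check a release string against that threshold, sketched in Python. It assumes "-199" refers to the build number that follows `2.6.32-` in RHEL 6 kernel release strings such as the two listed in the description; the function names are hypothetical.

```python
import re


def el6_build(release: str) -> int:
    """Extract the build number from a RHEL 6 kernel release string.

    Accepts both the package form ("kernel-2.6.32-167.el6.x86_64") and
    the form printed by `uname -r` ("2.6.32-167.el6.x86_64").
    """
    match = re.match(r"^(?:kernel-)?2\.6\.32-(\d+)", release)
    if match is None:
        raise ValueError(f"not a 2.6.32 el6 kernel release: {release!r}")
    return int(match.group(1))


def meets_minimum(release: str, minimum: int = 199) -> bool:
    """True if the kernel build is at least `minimum` (199 per the comment above)."""
    return el6_build(release) >= minimum
```

Both kernels from the original description (builds 131 and 167) fall below the threshold. On a running system, `platform.release()` returns the string to pass in.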
