Bug 166357

Summary: kernel places disks to sleep on swsusp, then fails to write pages to swap on lvm on raid1
Product: Fedora
Reporter: Alexandre Oliva <oliva>
Component: kernel
Assignee: Dave Jones <davej>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 5
CC: agk, katzj, ncunning, pfrields, wtogami
Hardware: i686
OS: Linux
Whiteboard: NeedsRetesting
Last Closed: 2007-12-10 22:54:01 UTC

Description Alexandre Oliva 2005-08-19 18:57:54 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8b3) Gecko/20050818 Fedora/1.1-0.2.7.deerpark.alpha2.1 Firefox/1.0+

Description of problem:
I tried to get an Athlon box to suspend with swsusp, using the 1499_FC5 kernel.  Unlike a notebook that doesn't use raid at all, which succeeded in suspending to swap on lvm and came back up successfully (although at least once with filesystem corruption, after a failed suspend with an earlier kernel), this box wouldn't even store pages on disk.

It first put both disks to sleep (I could hear the clicks as they were turned off, like a regular shutdown), then it printed that it was going to save pages to disk and that it was 0% done; the disks sounded like they were being turned back on for a moment, but nothing happened for several more minutes.  Then, all of a sudden, someone hit Alt-SysRq-B and the box rebooted.  One of the raid devices (the one holding the root device) needed resyncing.

If I tried to suspend again while it resynced, it barfed trying to stop the resync, and the box refused to suspend the first time.  When I insisted, it actually checkpointed the resync, but reported that it couldn't be stopped, and then that something was odd because the task wasn't stopped.  And then it froze again.

Version-Release number of selected component (if applicable):
kernel-2.6.12-1.1499_FC5

How reproducible:
Always

Steps to Reproduce:
1. Try to swsusp to swap on lvm on raid1 (a minimal setup sketch follows)
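
A minimal sketch of the kind of setup involved, with illustrative device and
volume names (and resume=/dev/vg0/swap on the kernel command line):

  # RAID1 across two partitions, LVM on top, swap on the logical volume.
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
  pvcreate /dev/md0
  vgcreate vg0 /dev/md0
  lvcreate -L 1G -n swap vg0
  mkswap /dev/vg0/swap
  swapon /dev/vg0/swap
  # Then attempt the suspend:
  echo disk > /sys/power/state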

Actual Results:  Problems abound

Expected Results:  I wish it would work like swsusp to swap on lvm on raw disk partitions

Additional info:

Comment 1 Alexandre Oliva 2005-08-19 19:46:58 UTC
I suspect placing the disks to sleep might be the right thing to do, such that
the saved state has them turned off as expected (?), but I've narrowed the
problem down to saving to raid.  If I switch to a raw disk device, the box
suspends and resumes fine most of the time.  It fails if I attempt to suspend
in the middle of a raid resync, though.  In fact, it fails in such a way that
it comes back to life after the attempted suspend, but with the raid subsystem
completely hosed, such that any further attempt to access /proc/mdstat or any
filesystem mounted out of the raid devices will hang.  In particular, further
attempts to suspend will hang as well.  This is what happened in the `when I
insisted' case above.
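
For what it's worth, a minimal guard one could use to avoid suspending in the
middle of a resync (just a sketch, assuming /proc/mdstat reports "resync" or
"recovery" while an array is rebuilding):

  #!/bin/sh
  # Refuse to suspend while any md array is resyncing or rebuilding.
  if grep -Eq 'resync|recovery' /proc/mdstat; then
      echo "md resync in progress; refusing to suspend" >&2
      exit 1
  fi
  # Otherwise trigger swsusp the usual way.
  echo disk > /sys/power/state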

Comment 2 Alexandre Oliva 2005-08-20 17:49:07 UTC
There's still an oddity in coming back up after a suspend to a raw partition
in 1.1502_FC5: when the box boots up, it says one of the raid devices (not one
holding swap partitions) needs a resync.  When stopping all tasks for
resuming, it stops and checkpoints the resync, but when the resume completes,
there's no ongoing resync.

I imagine it might be the case that resuming completes whatever I/O was ongoing
in the raid 1 devices that left one of the devices in need of a resync, but I'd
feel much safer if swsusp actually got all raid members stable before completing
the suspend, such that it wouldn't depend on resuming to avoid a complete resync.
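
To illustrate what I mean by getting the members stable first, here's a
sketch that waits for an array to go idle before suspending (assuming the md
sysfs sync_action attribute is available on this kernel; the device name is
illustrative):

  # Wait for /dev/md0 to finish any resync/recovery before suspending.
  # sync_action reads "idle" when no rebuild is in progress.
  while [ "$(cat /sys/block/md0/md/sync_action)" != "idle" ]; do
      sleep 5
  done
  echo disk > /sys/power/state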

Comment 3 Dave Jones 2005-08-20 18:10:04 UTC
Hmm, if the disks need a resync, we should probably not resume until that has
completed.  Starting a background rebuild, and then resuming (to a system that
won't be rebuilding) sounds terrifying.

Jeremy, any ideas ?


Comment 4 Alexandre Oliva 2005-08-20 21:27:02 UTC
Erhm...  My theory is that the system didn't *really* need a resync; it just
had some pending raid I/O that the raid subsystem would complete, bringing the
array back in sync right after resume.  I may be totally off, though; if the
system doesn't actually complete the I/O after resuming, that's big trouble,
but ideally it shouldn't leave any such pending I/O at all, such that a resync
wouldn't have to start and stop before the resume.

Comment 5 Alexandre Oliva 2005-08-21 20:26:11 UTC
I just filed a separate bug for the raid-needs-resync problem, bug 166453.
Let's keep this one focused on saving swap on lvm on raid, although the
problems might actually be related.

Comment 6 Dave Jones 2006-10-16 18:42:53 UTC
A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks' time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem,
please check that you have only one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 7 Nigel Cunningham 2007-12-10 22:54:01 UTC
Closing as per previous comment.