Red Hat Bugzilla – Bug 166453
swsusp doesn't wait until raid devices are stable
Last modified: 2007-11-30 17:11:12 EST
Booting up after swsusp on a system with root on LVM on raid 1 almost always
claims that the raid 1 device holding the root filesystem needs resyncing. It
starts the resync, but resume stops it and never restarts it. It's not clear
whether the saved state contains information about the incomplete I/O and
proceeds to complete it, or whether it's lost entirely. In any case, raid
devices ought to be stable at the time of the swsusp; otherwise the saved state
in a raided swap device (yeah, bad idea, but still...) may not even be readable
correctly (consider degraded raid 5, for one). swsusp should place all raid
devices in immediate safe mode, like shutdown does shortly before powering off.
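For what "safe mode" means in practice: in md it roughly corresponds to the
array's superblock being marked clean. A userspace sketch of forcing that from a
suspend script might look like this (the helper name `md_quiesce` is
hypothetical; it assumes mdadm's `--wait-clean` misc option and the md
`array_state` sysfs attribute, both of which appeared in kernels/mdadm newer
than the one this bug was filed against):

```shell
# Hypothetical helper: ask md to mark an array's superblock clean
# before taking a snapshot. Assumes mdadm --wait-clean and
# /sys/block/*/md/array_state are available.
md_quiesce() {
    local md="$1"    # array name, e.g. "md0"
    # Preferred: block until the superblock is marked clean...
    mdadm --wait-clean "/dev/$md" 2>/dev/null ||
        # ...or poke the sysfs array state directly as a fallback.
        echo clean > "/sys/block/$md/md/array_state"
}
```

Either path leaves the array in normal operation mode; only the on-disk
"dirty" flag is cleared, which is what a resync check looks at.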
We can't do that. How would we reactivate them when we resume?
We suspend with dirty state; we just need to handle that better on resume.
The only thing we can do here is detect at boot time that we have a suspend
partition *before* we activate raid sets, and do the resume before we even load
the raid modules.
This is probably going to be a really messy problem to fix up.
We have to start raid before LVM, since PVs may be on raid, and swap devices may
be on lvm or on raid.
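The resume-before-raid ordering above could be sketched as an initramfs
fragment like the following (device paths are examples; `resume_majmin` is a
hypothetical helper, and `/sys/power/resume` is the mainline interface for
triggering resume from userspace):

```shell
# Sketch of the initramfs ordering described above: attempt resume
# *before* any raid assembly or LVM activation. Device paths are
# examples only.
resume_majmin() {
    # Translate a device node into the decimal "major:minor" form
    # the kernel expects (stat reports st_rdev in hex).
    printf '%d:%d\n' "0x$(stat -c '%t' "$1")" "0x$(stat -c '%T' "$1")"
}

try_resume_then_raid() {
    # 1. Attempt resume first; if a valid image exists, the kernel
    #    never returns from this write.
    [ -w /sys/power/resume ] &&
        resume_majmin /dev/sda2 > /sys/power/resume 2>/dev/null
    # 2. Only if no image was found is it safe to assemble raid sets
    #    and activate LVM.
    mdadm --assemble --scan
    vgchange -ay
}
```

This only works when the resume partition itself is a plain partition, which is
exactly the constraint discussed below.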
Couldn't raid devices move to immediate safe mode at suspend time, and back to
normal operation mode on resume? This sounds like the right thing to do to me.
Suspending with dirty state means you can't rely on any data that's on the raid
devices, meaning you could even be loading broken kernel or initrd images (if
they're on raid), never mind the suspend image itself. Flushing the raid devices
to make them stable is just as important as flushing the hard disk caches, IMHO.
If the resume partition is on raid, you're hosed. That's just not a
configuration we can support. It's a chicken-and-egg problem. We *can not* touch
raid partitions that were active at the time of suspend.
They can't "go back to normal operation mode on resume". We don't run any magic
'fix things up' code when we resume; we load the state of the memory, restore
the registers, and jmp back to where the EIP was before we suspended.
You *can* rely on that data on resume, because it's in exactly the same state it
was before you suspended (as long as you haven't raidstart'd it before you do
the resume).
This isn't a flushing issue, it's a state issue. Resuming from swap restores the
exact state things were in before we suspended.
How about switching to immediate-safe mode after saving the state, before
shutting the system down?
Hmm... This could still corrupt raid 5, ugh.
It has to be even more complex, but I think it can be done. Just as we put
disks to sleep after we stop all tasks and before we take the memory snapshot,
we could switch raid to immediate safe mode, flush everything, switch back to
normal mode (which won't cause I/O, since all tasks are stopped), then snapshot
the memory and save it to disk, then switch back to immediate safe mode before
shutting down. This appears to mirror exactly what happens in the IDE disk
subsystem, which shuts disks down, brings them back up to save state, then shuts
them down again before the host powers off.
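The proposed ordering can be sketched in shell-style pseudocode (every function
name here is hypothetical shorthand for kernel-side steps, not a real
interface):

```shell
# Hypothetical sketch of the suspend ordering proposed above; none of
# these function names correspond to real kernel interfaces.
proposed_suspend_sequence() {
    freeze_all_tasks        # no new I/O can be issued from here on
    raid_safe_mode on       # flush; superblocks marked clean
    raid_safe_mode off      # back to normal mode, still no I/O pending
    snapshot_memory         # snapshot sees raid in normal but stable state
    write_image_to_swap
    raid_safe_mode on       # stable again before power-off
    power_off
}
```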
If we do this, then the snapshot is taken with raid in regular operation mode,
but stable, and the system state saved on raid will be stable as well, and
nothing special has to happen when the system is brought back up.
Of course I'm speaking as someone not sufficiently familiar with the
implementation of the raid subsystem to tell whether what I propose is easily
doable, but I think it's probably the most reasonable path to go, since it
mirrors the working behavior we have for block devices in disk partitions.
I've run out of ways to say "We don't run any magic 'fix things up' code when we
resume".
Cleaning the state before we suspend won't do any good at all. We'd then resume
with clean state and an active raidset. How well is that going to work?
Can you explain which part of my last posting gave you the impression that we'd
have to fix anything up after resume? I must not have been clear, because I
explicitly addressed this point.
As for resuming with the active raidset, it would be active but without any
pending I/O, so it should hopefully be fine, no?