Bug 981841
Summary: | Unable to hibernate properly with multiple encrypted volumes due to ordering changes | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Robert Hancock <hancockrwd> |
Component: | kernel | Assignee: | dracut-maint |
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 20 | CC: | collura, dracut-maint, gansalmon, harald, itamar, jonathan, kernel-maint, madhu.chinakonda, marcelo.barbosa |
Target Milestone: | --- | Keywords: | Regression |
Target Release: | --- | Flags: | kernel-team: needinfo? |
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2015-04-28 18:23:30 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Robert Hancock
2013-07-06 08:23:18 UTC
From what I can tell, the kernel is storing the resume= device from the kernel command line on boot. Later on, when the kernel gets told to attempt resuming, the device name is converted into a major/minor pair. It appears that this device is the one the kernel will later attempt to hibernate to when required.

The resume part with dracut seems reasonable - the encrypted volume gets activated and a resume is attempted from the device matching the one passed on the command line. At this point the swap encrypted volume is on dm-3 and the home volume (if it's listed on the boot command line) is on dm-4.

The problem comes later, as for some reason those cryptsetup mappings get torn down and recreated when the "main" boot process starts. The lines in /etc/fstab for the /home partition and swap partition were in the opposite order from the order in the dracut command line. This means that now the swap encrypted volume is dm-4 and the home volume is dm-3. But the kernel still remembers that the resume device was minor number 3 and so when you hibernate it tries to write the image to the home partition, which fails as it's not a swap partition.

Switching the order of lines in /etc/fstab so that the swap partition comes before the home partition fixed the problem in this case. But this all seems very fragile. Arguably it would be better if the kernel used the stored device filename when hibernating, rather than the major/minor numbers it remembered from the attempted resume, as that seems more likely to have a stable mapping to the actual partition/device. But I'm also not sure why the cryptsetup mappings are being torn down and recreated. I would think if that didn't happen, it would avoid this problem, and speed up the boot as well.
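One way to check for the mismatch described above is to compare the major:minor pair the kernel reports in /sys/power/resume with the device numbers of the currently active swap areas. The following is only a minimal Python sketch of that comparison, not something taken from this report: the function names are hypothetical, and it assumes that /proc/swaps lists block-device swap areas with the type "partition".

```python
import os


def parse_devnum(text):
    """Parse a 'major:minor' string such as '253:3' into a tuple of ints."""
    major, minor = text.strip().split(":")
    return int(major), int(minor)


def resume_matches_swap(resume_text, swap_devnums):
    """Return True if the stored resume device is one of the active swap devices."""
    return parse_devnum(resume_text) in set(swap_devnums)


def active_swap_devnums(proc_swaps="/proc/swaps"):
    """Collect (major, minor) for each block-device swap area in /proc/swaps."""
    devnums = []
    with open(proc_swaps) as f:
        next(f, None)  # skip the header line
        for line in f:
            fields = line.split()
            # Assumption: block-device swap areas are listed with type "partition".
            if len(fields) >= 2 and fields[1] == "partition":
                st = os.stat(fields[0])
                devnums.append((os.major(st.st_rdev), os.minor(st.st_rdev)))
    return devnums


def check_system():
    """Compare the kernel's stored resume device with the active swap devices."""
    with open("/sys/power/resume") as f:
        stored = f.read()
    return resume_matches_swap(stored, active_swap_devnums())
```

On an affected system, `check_system()` would be expected to return False after boot, since the stored pair would then refer to the home mapping rather than to swap.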
(In reply to Robert Hancock from comment #0)
> Description of problem:
> dracut appears to be setting an incorrect device in /sys/power/resume on
> boot, seemingly resulting in hibernate not working - I get a "PM: Cannot
> find swap device, try swapon -a" error from the kernel.
>

dracut does not set /sys/power/resume, if you are not resuming. Writing to /sys/power/resume does result in resuming.

But if dracut encounters a partition with the ID_FS_TYPE=suspend or ID_FS_TYPE=swsuspend, it will echo the major:minor to /sys/power/resume.

(In reply to Robert Hancock from comment #1)
> From what I can tell, the kernel is storing the resume= device from the
> kernel command line on boot. Later on, when the kernel gets told to attempt
> resuming, the device name is converted into a major/minor pair. It appears
> that this device is the one the kernel will later attempt to hibernate to
> when required.

So, you mean:

1. cold boot
2. hibernate -> works
3. resume
4. hibernate -> fails, because the kernel wants to reuse the major/minor?

I can't believe that.

(In reply to Harald Hoyer from comment #2)
> (In reply to Robert Hancock from comment #0)
> > Description of problem:
> > dracut appears to be setting an incorrect device in /sys/power/resume on
> > boot, seemingly resulting in hibernate not working - I get a "PM: Cannot
> > find swap device, try swapon -a" error from the kernel.
> >
>
> dracut does not set /sys/power/resume, if you are not resuming. Writing to
> /sys/power/resume does result in resuming.
>
> But if dracut encounters a partition with the ID_FS_TYPE=suspend or
> ID_FS_TYPE=swsuspend, it will echo the major:minor to /sys/power/resume.

I don't think I have any partition labelled as suspend or swsuspend (I'm not sure what would get the swap partition into a state so that it would be so labelled). But I do have an explicit resume= entry on the kernel command line, so it does try to resume during the dracut boot sequence.
(In reply to Harald Hoyer from comment #3)
> (In reply to Robert Hancock from comment #1)
> > From what I can tell, the kernel is storing the resume= device from the
> > kernel command line on boot. Later on, when the kernel gets told to attempt
> > resuming, the device name is converted into a major/minor pair. It appears
> > that this device is the one the kernel will later attempt to hibernate to
> > when required.
>
> So, you mean:
>
> 1. cold boot
> 2. hibernate -> works
> 3. resume
> 4. hibernate -> fails, because the kernel wants to reuse the major/minor?
>
> I can't believe that.

No, not that complicated. Only one boot is involved.

1. cold boot
2. dm-crypt mappings get set up in order as listed on command line
3. dracut sets /sys/power/resume attempting resume - fails, no hibernate image in swap, but major-minor pair of resume device is stored by kernel
4. dm-crypt mappings torn down
5. dm-crypt mappings set up in order as listed in /etc/fstab
6. user hibernate - fails, stored major/minor is no longer a swap partition

If, after booting up, I echoed the proper major/minor numbers for the resume partition into /sys/power/resume and hibernated, then on powering up it resumed fine. So it's the hibernate portion that's broken.

(In reply to Robert Hancock from comment #4)
> If, after booting up, I echoed the proper major/minor numbers for the resume
> partition into /sys/power/resume and hibernated, then on powering up it
> resumed fine. So it's the hibernate portion that's broken.

Yeah, but dracut does not hibernate. It only resumes.

(In reply to Harald Hoyer from comment #5)
> Yeah, but dracut does not hibernate. It only resumes.

It's true, the hibernate portion is done (and gets broken) after dracut is all finished. Not sure what component this should be assigned to then - it seems like we shouldn't be tearing down the encrypted volumes just to create them, and we shouldn't be storing the resume device in the kernel in as fragile of a fashion.
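The six-step sequence described in this exchange can be illustrated with a toy model. The Python sketch below is not kernel or dracut code. It assumes that device-mapper minors are handed out consecutively in activation order, starting at 3 as in the report, and that the device-mapper major is 253 (the major is assigned dynamically, so this value is only illustrative). The names `activate` and `simulate_boot` are hypothetical.

```python
# Assumption: the device-mapper major is dynamic; 253 is only a common value.
DM_MAJOR = 253


def activate(names, first_free_minor):
    """Assign consecutive minors to mappings in the order they are activated."""
    return {name: (DM_MAJOR, first_free_minor + i) for i, name in enumerate(names)}


def simulate_boot(initramfs_order, fstab_order, first_free_minor=3):
    """Return the stored resume devnum and the mapping it points at after boot."""
    # Steps 2-3: mappings are set up in the initramfs, and the failed resume
    # attempt leaves the swap mapping's major/minor stored in the kernel.
    early = activate(initramfs_order, first_free_minor)
    stored_resume = early["swap"]
    # Steps 4-5: mappings are torn down and recreated in /etc/fstab order.
    late = activate(fstab_order, first_free_minor)
    # Step 6: find which mapping the stored devnum refers to at hibernate time.
    target = next(
        (name for name, devnum in late.items() if devnum == stored_resume), None
    )
    return stored_resume, target
```

With the orders from the report, `simulate_boot(["swap", "home"], ["home", "swap"])` ends up pointing the stored pair at `home`, while using the same order in both places keeps it on `swap`, which matches the /etc/fstab workaround.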
OK, after installing some updates this problem is happening again. Apparently the order of the rd.luks.uuid lines on the kernel command line isn't reliably controlling the order in which the devices are activated in dracut, which means they don't reliably match the order in which they're listed in fstab. Switching the lines in fstab avoids the problem for now but who knows if some future change will switch them around again. Since there seems to be no reliable workaround, something needs to be rethought here I think. I can only presume this doesn't commonly happen since default setups generally encrypt a single drive either entirely or not at all.

No, the lines on the kernel command line do not specify the order of activation. Activation happens with hotplug, as soon as the underlying device is discovered.

Am going to reassign this to kernel at this point, as this doesn't really seem like a dracut problem. It seems like it all kind of comes back to that major:minor number pair that the kernel holds onto from bootup. It seems a bit naive to expect that device ID to still be valid by the time the user finally triggers a hibernate. In setups like this with multiple encrypted volumes, it frequently is not.

*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.12.6-200.fc19. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 20, and are still experiencing this issue, please change the version to Fedora 20.

If you experience different issues, please open a new bug report for those.

Changing version to F20 since as far as I can tell, nothing has really changed in this regard.
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs.

Fedora 20 has now been rebased to 3.13.4-200.fc20. Please test this kernel update and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

As far as I can tell there has been no change in this behavior. Reported to kernel mailing list here:

https://lkml.org/lkml/2014/2/24/762

*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs.

Fedora 20 has now been rebased to 3.14.4-200.fc20. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Still no change as far as I'm aware.

*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs.

Fedora 20 has now been rebased to 3.17.2-200.fc20. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 21, and are still experiencing this issue, please change the version to Fedora 21.

If you experience different issues, please open a new bug report for those.

Not aware of any change in this behavior.
got this in fc21 install:
dracut-initqueue ln failed to create symbolic link '/dev/resume' file exists

so still seems to be there as of
kernel-3.17.7-300.fc21.86_64
dracut-038-32.git20141216.fc21.x86_64

*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs.

Fedora 20 has now been rebased to 3.18.7-100.fc20. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 21, and are still experiencing this issue, please change the version to Fedora 21.

If you experience different issues, please open a new bug report for those.

*********** MASS BUG UPDATE **************

This bug is being closed with INSUFFICIENT_DATA as there has not been a response in over 4 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.