Description of problem: Systemd fails to activate swap on encrypted LVM volumes. ! This is not bugzilla 710839 ! Version-Release number of selected component (if applicable): systemd-26-2.fc15.i686 How reproducible: Always Steps to Reproduce: 1. Set up an encrypted swap partition by editing /etc/crypttab as in the attachment. 2. Reboot Actual results: The encrypted volume is created correctly and the swap headers are written (when 710839 is circumvented), but the actual 'swapon' does not happen. A later swapon somewhere in rc.local can be used as a workaround. Expected results: The swap to be available. Additional info: See attachments
Created attachment 503458 [details] dmesg output with systemd debug flags
Created attachment 503462 [details] crypttab and fstab
It's interesting that 'mkswap /dev/mapper/swap' succeeded, but systemd never sees dev-mapper-swap.device and times out. What does udev know about the device?: udevadm info -q all --name=/dev/mapper/swap
Created attachment 503476 [details] Output of 'udevadm info -q all --name=/dev/mapper/swap'.
Just in case, would the udevadm output be any different if you don't use the workaround (swapon in /etc/rc.local)?
Just tried it: no difference.
Two issues: 1) there is an ordering cycle, the same as in bug 711150. 2) the swap device has SYSTEMD_READY=0 set in the udev database. That's why systemd fails to act on it. I don't know if 2) can be caused by 1).
(In reply to comment #7) > I don't know if 2) can be caused by 1). I came to the conclusion that it cannot. The device has no ID_FS_* variables defined. Is blkid able to identify the device after boot?: blkid -p -o udev /dev/mapper/swap
Before my 'fix' in /etc/rc.local: [root@waliwanda ~]# blkid -p -o udev /dev/mapper/swap ID_FS_UUID=0a93c435-0c38-417f-809a-cd7f2cc07d97 ID_FS_UUID_ENC=0a93c435-0c38-417f-809a-cd7f2cc07d97 ID_FS_VERSION=2 ID_FS_TYPE=swap ID_FS_USAGE=other After the 'fix', the result is identical. So blkid does identify it...
Do you have udisks installed? If not, see if installing it fixes it.
Finally good news! *) I had NOT installed udisks (was removed from the install, because I blacklist ntfsutils and xfsutils in the kickstart) *) Adding udisks makes it boot correctly. So you will make systemd depend on udisk? Thanks for the efforts!
> So you will make systemd depend on udisk? No, it needs to work without it. SYSTEMD_READY=0 comes from 99-systemd.rules, which has: # Ignore encrypted devices with no identified superblock on it, since # we are probably still calling mke2fs or mkswap on it. SUBSYSTEM=="block", KERNEL!="ram*|loop*", ENV{DM_UUID}=="CRYPT-*", ENV{ID_PART_TABLE_TYPE}=="", ENV{ID_FS_USAGE}=="", ENV{SYSTEMD_READY}="0" The rule assumes that something will set ID_FS_USAGE when mkswap finishes. That something is usually 80-udisks.rules. Either that important piece of the rules needs to be moved from 80-udisks.rules to a more generic rules file, or 99-systemd.rules needs to take care of it by copying the piece from 80-udisks.rules: ... ENV{ID_FS_USAGE}!="", GOTO="..." IMPORT{program}="/sbin/blkid -o udev -p $tempnode" ... Harald, what would you suggest?
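For readers following along, the match condition of the 99-systemd.rules line quoted above can be checked by hand against a property dump. What follows is only a sketch, not anything shipped by systemd or udev: the function name would_mark_not_ready is made up for illustration, it approximates KERNEL by the basename of DEVNAME, and it covers only the match keys quoted in this comment.

```sh
# Hypothetical helper (not part of systemd or udev).
# Reads "udevadm info -q all" output on stdin and prints "not-ready" when the
# match keys of the quoted 99-systemd.rules line would all match, else "ready".
would_mark_not_ready() {
    awk '
        /^E: / {
            # Split "E: KEY=VALUE" at the first "=".
            kv = substr($0, 4)
            eq = index(kv, "=")
            if (eq > 0) env[substr(kv, 1, eq - 1)] = substr(kv, eq + 1)
        }
        END {
            # Assumption: the kernel name is the basename of DEVNAME.
            name = env["DEVNAME"]
            sub(/^.*\//, "", name)
            if (env["SUBSYSTEM"] == "block" &&
                name !~ /^(ram|loop)/ &&
                env["DM_UUID"] ~ /^CRYPT-/ &&
                env["ID_PART_TABLE_TYPE"] == "" &&
                env["ID_FS_USAGE"] == "")
                print "not-ready"
            else
                print "ready"
        }'
}
```

Possible use: `udevadm info -q all --name=/dev/mapper/swap | would_mark_not_ready`. A dump that carries a DM_UUID starting with CRYPT- but no ID_FS_USAGE and no ID_PART_TABLE_TYPE prints not-ready, which is the state systemd refuses to act on.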
Situation with systemd-26-3.fc15 * 'swap,' in /etc/crypttab correctly detected * mkswap is run correctly, without delay, apparently even when udisks is NOT installed. * swapon is NOT yet run, still needs 'fix' in /etc/rc.local This snippet of /var/log/messages might be interesting (udisks NOT installed here, no rc.local stuff) Jun 16 09:44:33 zaranj kernel: [ 28.917059] systemd[1]: systemd-readahead-collect.service changed running -> exited Jun 16 09:44:33 zaranj kernel: [ 28.917656] systemd[1]: Accepted connection on private bus. Jun 16 09:44:33 zaranj kernel: [ 28.918604] systemd[1]: Got D-Bus request: org.freedesktop.systemd1.Agent.Released() on /org/freedesktop/systemd1/agent Jun 16 09:44:33 zaranj kernel: [ 28.919273] systemd[1]: systemd-readahead-collect.service: cgroup is empty Jun 16 09:44:33 zaranj kernel: [ 28.919703] systemd[1]: Got D-Bus request: org.freedesktop.DBus.Local.Disconnected() on /org/freedesktop/DBus/Local Jun 16 09:45:43 zaranj kernel: [ 99.451931] systemd[1]: Job dev-mapper-swap.device/start timed out. Jun 16 09:45:43 zaranj kernel: [ 99.452409] systemd[1]: Job dev-mapper-swap.device/start finished, result=timeout Jun 16 09:45:43 zaranj kernel: [ 99.452897] systemd[1]: Job dev-mapper-swap.device/start failed with result 'timeout'. Jun 16 09:45:43 zaranj kernel: [ 99.453554] systemd[1]: Startup finished in 1s 669ms 388us (kernel) + 7s 163ms 586us (initrd) + 1min 30s 620ms 422us (userspace) = 1min 39s 453ms 396us Jun 16 09:45:43 zaranj kernel: [ 99.454049] systemd[1]: Running GC... Jun 16 09:59:04 zaranj kernel: [ 900.000088] systemd[1]: Timer elapsed on systemd-tmpfiles-clean.timer Jun 16 09:59:04 zaranj kernel: [ 900.000267] systemd[1]: Trying to enqueue job systemd-tmpfiles-clean.service/start/replace Jun 16 09:59:04 zaranj kernel: [ 900.000689] systemd[1]: Installed new job systemd-tmpfiles-clean.service/start as 236 After about 30 seconds, a prompt is produced, and the machine seems 'ready'. Much later, this timeout message appears.
# pick up device-mapper data; this REALLY should be done by rules installed # by the device-mapper package # KERNEL!="dm-*", GOTO="device_mapper_end" ACTION!="change", GOTO="device_mapper_end" ENV{UDISKS_DM_TARGET_TYPES}=="|*error*", GOTO="device_mapper_end" # avoid probing if it has already been done earlier # ENV{ID_FS_USAGE}!="", GOTO="device_mapper_end" IMPORT{program}="/sbin/blkid -o udev -p $tempnode" LABEL="device_mapper_end" ... comment says it all... this really should be in the device-mapper rules, which it is. $ rpm -qf /lib/udev/rules.d/13-dm-disk.rules device-mapper-1.02.63-1.fc15.x86_64
What is the output of: # dev=$(readlink -f /dev/mapper/swap) # udevadm test /block/${dev#/dev/}
Created attachment 505271 [details] Output of 'udevadm test' on the swap
... udev_rules_apply_to_event: IMPORT '/sbin/blkid -o udev -p /dev/dm-3' /lib/udev/rules.d/13-dm-disk.rules:22 util_run_program: '/sbin/blkid -o udev -p /dev/dm-3' started util_run_program: '/sbin/blkid' (stdout) 'ID_FS_UUID=35d07279-4e7d-4d1d-b800-4f69ac3faaa5' util_run_program: '/sbin/blkid' (stdout) 'ID_FS_UUID_ENC=35d07279-4e7d-4d1d-b800-4f69ac3faaa5' util_run_program: '/sbin/blkid' (stdout) 'ID_FS_VERSION=2' util_run_program: '/sbin/blkid' (stdout) 'ID_FS_TYPE=swap' util_run_program: '/sbin/blkid' (stdout) 'ID_FS_USAGE=other' util_run_program: '/sbin/blkid -o udev -p /dev/dm-3' returned with exitcode 0 ... udevadm_test: TAGS=:systemd: ... Looks good in my eyes... What is the output of: # udevadm info --query=all --name=/dev/mapper/swap If you boot without udisks?
[root@zaranj ~]# udevadm info --query=all --name=/dev/mapper/swap P: /devices/virtual/block/dm-3 N: dm-3 S: mapper/swap S: disk/by-id/dm-name-swap S: disk/by-id/dm-uuid-CRYPT-PLAIN-swap E: UDEV_LOG=3 E: DEVPATH=/devices/virtual/block/dm-3 E: MAJOR=253 E: MINOR=3 E: DEVNAME=/dev/dm-3 E: DEVTYPE=disk E: SUBSYSTEM=block E: DM_SBIN_PATH=/sbin E: DM_UDEV_PRIMARY_SOURCE_FLAG=1 E: DM_NAME=swap E: DM_UUID=CRYPT-PLAIN-swap E: DM_SUSPENDED=0 E: DM_UDEV_RULES_VSN=2 E: SYSTEMD_READY=0 E: DEVLINKS=/dev/mapper/swap /dev/disk/by-id/dm-name-swap /dev/disk/by-id/dm-uuid-CRYPT-PLAIN-swap E: TAGS=:systemd:
*** Bug 714467 has been marked as a duplicate of this bug. ***
The encrypted swap device is apparently created and mkswap is run (the signature is there), but activation fails. The problem is that no change event is sent after mkswap runs => ID_FS_TYPE is not set (and SYSTEMD_READY=0 stays). The reason is that 60-persistent-storage.rules contains this rule: # skip rules for inappropriate block devices KERNEL=="fd*|mtd*|nbd*|gnbd*|btibm*|dm-*|md*", GOTO="persistent_storage_end" so there is no watch for dm devices -> mkswap will not generate a change event. Removing "dm-*" from the rule causes a change event to be sent, but that has other consequences (mainly for LVM), so it is not a real solution. The same problem exists for all the other subsystems named in that rule. Perhaps systemd should generate a change event itself after it runs mkswap or mkfs on crypttab devices?
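To make the last suggestion concrete: the kernel emits a synthetic uevent when an action name is written to a device's uevent file in sysfs. The sketch below uses that interface; it is an illustration only and not necessarily how systemd would or does implement it. The function name emit_change_uevent and its optional second argument (an alternative sysfs root, there only so the function can be exercised without root) are assumptions.

```sh
# Hypothetical helper: request a synthetic "change" uevent for a block device.
#   $1  device node or symlink, e.g. /dev/mapper/swap
#   $2  sysfs root (optional, defaults to /sys; only useful for testing)
emit_change_uevent() {
    dev=$1
    sysroot=${2:-/sys}
    # Resolve /dev/mapper/<name> to the real node, e.g. /dev/dm-3.
    real=$(readlink -f "$dev") || return 1
    uevent="$sysroot/class/block/${real##*/}/uevent"
    # Fail instead of creating a stray file when the device is unknown.
    [ -w "$uevent" ] || return 1
    echo change > "$uevent"
}
```

Run as root after formatting, for example `mkswap /dev/mapper/swap && emit_change_uevent /dev/mapper/swap`, this should make udev re-run its rules for the device, including the blkid import in 13-dm-disk.rules.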
I think the right place to fix this is dm, not systemd. dm must be able to deal with change events generated at any time by udev or other code, and we should not have to add subsystem specific hacks to systemd just because LVM is broken.
There is no event generated by udev. This rule is not owned by LVM. And please stop repeating that LVM is broken.
What is broken here is the way the system reacts to the event generated in reply to "watch" (the change event generated via inotify on close after write). The change event itself is not the problem, but it induces various scans (usually blkid), and these open the device. This is racy, even for scripts (so you have to add udevadm settle after every command that induces an inotify event, and even that is not always enough). I have nothing against such an asynchronous notification system, but you have to differentiate between opening a device from userspace and opening it from udev rules for scanning purposes (which is required to detect the FS type), and this is impossible today. Otherwise even the script "mkswap <dev>; swapon <dev>" is racy, because if blkid (running from a udev rule) still has the device open when swapon tries to open it in exclusive mode, swapon fails (I know this is not a perfect example, but it is simple and not LVM related :-). So yes, let's remove all the device exceptions from the default udev rules and run a scan on every inotify event, but programs opening a device should have higher priority than programs scanning it, so that they "win the race" when they call open() on the device. Then everything will work. Moreover, there are apparently more devices with that exception in the udev rules, so all of them are broken (I have no idea why there are so many exceptions). Removing dm from the list just moves the problem to another subsystem... I think some generic way to solve the asynchronous device scan is needed here.
"mkswap <dev>; swapon <dev>" is not what systemd is doing. We just wait for the udev events before we invoke swapon, and at that time the rules are known to have finished, since udev only notifies other processes after it processed all rules. LVM should do the same if it wants to synchronize on rules to be finished running. LVM needs to subscribe to device events and then process them as soon as they show up and not use anything else for device synchronization. Not "udev settle", not anything else that isn't waiting for udev events. (Kay is likely to drop udev settle eventually to get people to stop misusing it) Rules will execute arbitrary things, and that's something clients need to be able to deal with. It is not an option to just disable all rules because your app is so broken it cannot handle them. All of this isn't really news. This hasn't been fixed in LVM over years. And I see little point in hacking a work-around for this into systemd if the right thing is to fix LVM.
I would not put the blame solely on LVM. The exclusion from persistent storage naming rules is not limited to LVM devices: # skip rules for inappropriate block devices KERNEL=="fd*|mtd*|nbd*|gnbd*|btibm*|dm-*|md*", GOTO="persistent_storage_end" For instance, MD RAID will have the same problem. 60-persistent-storage.rules is not shipped by lvm, but by udev itself. I do not quite understand what exactly makes the listed classes of devices 'inappropriate' for persistent naming rules. The list was updated several times in udev history with kinds of devices being added and removed without explanation in the commit messages. 'dm-*' did have an explanation when it was added to the list in May 2006: commit cecd7f9a758fd498ae267dcb64e65e167ffef810 skip device mapper devices for persistent links It conflicts with snapshot creation. It will move to its own rule file after kernel provides needed additional events. Is this reason for exclusion still valid today? If I understand Milan correctly, today LVM would have no problem being handled by the persistent storage rules were it not for the udev's "watch" feature. The inotify-based udev watches were added to udev in 2009 (commit f24036d63b0aee735c3098d09b9e0ed450e93177 and its parent commit). The purpose of the watches is to allow udev to have up-to-date information about the contents (filesystem type, ...) of block devices. They are triggered whenever a process closes a device it had open for writing. The watches solved some problems, but they also added some, because: - They run asynchronously behind your back even when you think you're the only user of the block device you're interested in. - Even when you are aware of their existence, there is no easy way to tell if at any given moment udev is executing things as a result of a watch on the device you're interested in and you should wait for it to finish before you can open the device. 
The example "mkswap <dev>; swapon <dev>" was given to demonstrate that the problem is not specific to LVM and that even a seemingly trivial and traditionally reliable sequence of commands can be disturbed by a udev watch. It occurs to me that the goal to have up-to-date information in udev could be achieved in several ways. The watches are just one of the alternatives and maybe they are not the best one. In the sequence "mkswap <dev>; swapon <dev>", mkswap _knows_ it is changing an important property (mkswap is a metadata writer) of the device. swapon is just a user of the device. Perhaps instead of having an asynchronous inotify watch, we could teach mkswap (and other metadata-writing tools, like mkfs.*, pvcreate, mdadm --create, ...) to _tell_ udev when they're done and wait synchronously for udev to finish its processing before they themselves quit. This way the swapon in the example would never be disturbed. Assuming the udev watches are here to stay, how else could the example be fixed? Either we: 1) make mkswap (and other metadata writers) wait for the inotify-based udev watch to finish processing before it itself quits - but this differs very little from the already proposed fix above and is less clean; or 2) make swapon (and all other users of block devices) detect that udev is running stuff on the device and wait for it to finish; or 3) introduce a new command: "mkswap <dev>; udev-wait-for-dev-processed <dev>; swapon <dev>"; or 4) implement a kernel-based priority hack like the one Milan hypothesised in comment #23 ('priority' was not meant in the scheduling sense, but in the meaning of 'priority of access to the device') I have a feeling that implementing 4) would necessitate solving the same kind of problems that adding the revoke() syscall would. I don't like 3) at all. The traditional two-command sequence should be reliable. 
When deciding between 1) and 2), I feel that fixing all 'metadata writers' should be easier than fixing all 'users of block devices', so 1) wins. And since it differs so little from the proposal to drop the watches altogether, I'd say drop the watches. It's cleaner.
Cc'ing Kay Sievers.. Kay welcome to Red Hat! Please see comment#25, it would be wonderful if udev watch rules could be removed.
These devices are excluded because, they don't have media change events like floppy, are network block devices, or the class of devices are managed with their own rules and udev should not carry them in the default set. Dm and md ship their own rules, and need their own logic to exclude some devices from the standard udev behaviour. At this moment, I don't see why removing "watch" from the default would help anything here. Just because DM can't really handle the events "watch" causes, we should remove it for all other devices too? As far as it looks, "watch" works without any real problems for all other devices and solves a couple of real problems at the same time. I'm not really convinced that "fixing all 'metadata writers'" is a realistic approach.
If there is a rule that anyone can open device anytime, then all metadata handling programs should be patched to wait until it can open the devices instead of failing... Whatever, I was just curious why the same problem disappeared from MD... (On my older system you can see the same race / busy error problem using MD - e.g. while :; do mdadm -A /dev/md0 /dev/sd[eg]; mdadm --stop /dev/md0 ; done ) Well, the approach is ... just hide the race (in my example with the udev scanning). /* As we have an O_EXCL open, any use of the device * which blocks STOP_ARRAY is probably a transient use, * so it is reasonable to retry for a while - 5 seconds. */ count = 25; err = 0; while (count && fd >= 0 && (err = ioctl(fd, STOP_ARRAY, NULL)) < 0 && errno == EBUSY) { usleep(200000); count --; } (TBH, I used something similar loop in cryptsetup for temp devices :-) So... on the one side we have request to support thousands of LVs and optimise speed for activation/deactivation, on the other side we are adding sleep() ... *shrug*
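The same bounded-retry pattern can be expressed at the script level for the shell loop shown above. This is a rough sketch only: retry_while_busy is a made-up name, and unlike the C snippet, which retries only on EBUSY, it retries on any non-zero exit status because a shell caller cannot see errno.

```sh
# Hypothetical helper: retry_while_busy TRIES DELAY CMD [ARGS...]
# Runs CMD until it succeeds or TRIES attempts have failed, sleeping DELAY
# seconds between attempts. Returns 0 on success and 1 when giving up.
retry_while_busy() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        tries=$((tries - 1))
        # No point in sleeping after the final failed attempt.
        [ "$tries" -gt 0 ] && sleep "$delay"
    done
    return 1
}
```

For instance, `retry_while_busy 25 0.2 mdadm --stop /dev/md0` would mirror the 25 attempts at 200 ms from the snippet, assuming a sleep that accepts fractional seconds (GNU coreutils and busybox do). It hides the race in the same way and has the same cost: time spent sleeping.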
(In reply to comment #27) > These devices are excluded because, they don't have media change > events like floppy, are network block devices, or the class of devices > are managed with their own rules and udev should not carry them in the > default set. Dm and md ship their own rules, and need their own logic > to exclude some devices from the standard udev behaviour. You need to add loop devices to that list as well. I just spent a wonderful time trying to work out why umount was failing to destroy loop devices (giving EBUSY errors) when running xfstests 216, which does: mount -o loop <image> /mntpt echo "blah" > /mntpt/foo umount /mntpt That randomly fails to destroy the loop device association with the underlying file, because udev runs blkid on the device during mount due to the "metadata writers" catchall rule. blkid is still running when umount tries to destroy the loop device, and as a result it leaves a mess that xfstests can't clean up, causing all subsequent tests to fail.... Adding "loop*" to the avoid list fixes the problem and xfstests runs reliably again.... As it is, I question the "list every type of device we want to avoid" approach versus the "list only the devices we want to have persistent naming" approach taken here. The current approach means that udev rules need to be updated for every new type of virtual/non-persistent block device instead of just ignoring them by default....
(In reply to comment #28) > If there is a rule that anyone can open device anytime, then all metadata > handling programs should be patched to wait until it can open the devices > instead of failing... > > Whatever, I was just curious why the same problem disappeared from MD... > > (On my older system you can see the same race / busy error problem using MD - > e.g. > while :; do mdadm -A /dev/md0 /dev/sd[eg]; mdadm --stop /dev/md0 ; done ) > > Well, the approach is ... just hide the race (in my example with the udev > scanning). > > /* As we have an O_EXCL open, any use of the device > * which blocks STOP_ARRAY is probably a transient use, > * so it is reasonable to retry for a while - 5 seconds. > */ > count = 25; err = 0; > while (count && fd >= 0 > && (err = ioctl(fd, STOP_ARRAY, NULL)) < 0 > && errno == EBUSY) { > usleep(200000); > count --; > } > > (TBH, I used something similar loop in cryptsetup for temp devices :-) Peter Rajnoha has now proposed such a workaround for lvm, see: http://www.redhat.com/archives/lvm-devel/2011-September/msg00052.html LVM is clearly not alone in needing a workaround for udev WATCH rules. Most test suites suffer from intermittent failures due to udev rules (lvm, xfstests, thinp test suite, etc). Test suites aside, forcing tools to add these workarounds seems to go against the goal of udev/systemd (fast boot, etc). But the inertia behind WATCH rules is so strong that we're forced to implement unpleasant workarounds. > So... on the one side we have request to support thousands of LVs and optimise > speed for activation/deactivation, on the other side we are adding sleep() ... > > *shrug* Right, it is a serious concern to have so many uevents and async scans (from udev rules) accessing devices. The slowdown associated with these scans is a massive concern for enterprise deployments with large numbers of multipath devices. 
Seems the WATCH rule is purely to enable a slick UI: https://bugzilla.redhat.com/show_bug.cgi?id=561424#c9 Server admins don't care about such GUI tools if they impose extra overhead when the GUI isn't running/used. They care about dependable tools and overall performance of the solutions they deploy. We really need to get serious about eliminating: 1) the hacks sprinkled around our toolchains to cope with udev WATCH rules. 2) the slowdown such excessive events/scans have on large LUN count systems.
This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component.
ping to everybody to rethink
*** Bug 759402 has been marked as a duplicate of this bug. ***
I was directed here because systemd is not getting notified about the /dev/mapper/swap* devices properly. [You can check with "udevadm info -q all -n /dev/mapper/swapa" that they are probably marked with SYSTEMD_READY=0. As a workaround, installing the udisks package still works.] Is there any fix I can test?
*** Bug 773089 has been marked as a duplicate of this bug. ***
Encrypted swap is still not enabled at boot on Fedora 17 Alpha.
Does this imply F17 will have this same bug?
# free total used free shared buffers cached Mem: 4053584 3869076 184508 0 32740 1666724 -/+ buffers/cache: 2169612 1883972 Swap: 2072376 38192 2034184 # swapon -a # free total used free shared buffers cached Mem: 4053584 3870420 183164 0 32780 1667572 -/+ buffers/cache: 2170068 1883516 Swap: 4144752 38192 4106560 So yes it works partly, some of the time. (I have 4 swaps) # udevadm info -q all -n /dev/mapper/swap*|grep SYST E: SUBSYSTEM=block E: SYSTEMD_READY=0 # rpm -q udisks udisks-1.0.4-3.fc16.x86_64 So what is the workaround here? When will we see a solution to this very elementary issue? (for the user it is we just need swap, please make it behave like it did before systemd was 'introduced')
BTW: my swap is NOT swap on encrypted LVM volumes. My swap is on a regular partition. (even more basic than just that!) So why was my bug taken as a duplicate?
To be clear: regular partition <- /etc/crypttab encryption <- swap
This is fixed in F17 since device-mapper-2.02.92-1. The package now enables a watch rule in 13-dm-disk.rules. This allows systemd to receive the necessary notification from udev and do the swapon. F16 does not have watches on dm devices, so I added an F16-specific workaround to systemd to produce a uevent explicitly after running mkswap: http://pkgs.fedoraproject.org/gitweb/?p=systemd.git;a=blob;f=0170-F16-cryptsetup-workaround-missing-watch-rules-for-dm.patch;h=3a842aba07371879e27baab5487f894a17de5fb7;hb=refs/heads/f16 An update has been submitted: https://admin.fedoraproject.org/updates/systemd-37-19.fc16
What about F16?
(In reply to comment #45) > What about F16? Come on. I wrote also about F16 in comment #44. Give the systemd update a try.
Well, why not wait for comments on the fix before closing the bug here? The fix does not work for my system: encrypted swap was not fixed; times out. `swapon -a` fixes it. (yes had to reboot)
I've un-duplicated your bug 759402.