Red Hat Bugzilla – Bug 1475570
Rescue mode fails while trying to access LVM volumes from existing install
Last modified: 2017-09-25 14:07:48 EDT
The openQA rescue mode test started failing between Fedora-Rawhide-20170712.n.1 and Fedora-Rawhide-20170714.n.0 . The test uses an encrypted install. When booting to rescue mode, after entering the passphrase, instead of things proceeding normally, the system just starts printing lines full of equals signs over and over.
'syslog' from these failures contains a bunch of LVM messages clearly indicating some kind of problem, but I'm not quite sure what; I'll attach one.
I don't *immediately* see what changes in the 0714.n.0 compose may be relevant; I don't see changes to any anaconda-related components or to LVM. There *was* a new kernel, though (went from kernel-4.13.0-0.rc0.git5.1.fc27 to kernel-4.13.0-0.rc0.git6.1.fc27).
Created attachment 1305080 [details]
syslog from a failure
This doesn't look like an Anaconda issue. I'm changing component to LVM, maybe they will know what could be the root cause here.
From the trace it looks like PV with Fedora VG is found on a multiple leg device, then it's dropped and lvmetad is confused ?
2 things to try out:
a) try to run the same case with lvm.conf 'use_lvmetad=0' (also in regenerated Anaconda ramdisk)
b) is multipath device properly configured ? is lvm2 seeing only resulting multipath device and not it's leg - is udev properly knowing which device is multipath component device ?
Would it be possible to add "debug" to kernel command line?
There is no multipath device of any kind involved in the test (at least intentionally; if one is showing up, there is some kind of misdetection going on).
I can try with 'debug' on the cmdline easily today, editing lvm.conf will be more complicated.
Created attachment 1308868 [details]
syslog with debug enabled
Here's the syslog output after reproducing the bug with 'debug' added to the params. Also, the bug *is* specifically related to encryption: rescue mode works fine on a non-encrypted install.
Created attachment 1308869 [details]
storage.log with debug enabled
Created attachment 1308870 [details]
program.log with debug enabled
This bug appears to have been reported against 'rawhide' during the Fedora 27 development cycle.
Changing version to '27'.
This issue also exists in current server ISO of FC26, the particular system has no LVM running at all (simple partition layout) and also no encryption is used. It happens short after starting anaconda and select 1) for Continue.
strace short before the write (loop) appears is
sendto "Entered spoke: RescueMountSpoke\0"
then the loop starts:
write of 80x "=" \n 80x "="
write of one "\n"
Interesting is that FC25 has the same issue, FC24+23 can't be tested, system is too new, kernel is not booting.
-> have to recover manually now...
Peter: I suspect your case may actually be a different issue that has the same symptom (this is one of those codepaths where almost any failure that happens within a certain block of code will result in the same obvious symptom, I think). It would be good if you could file a new bug, with the same info that the devs asked for here.
So in the attached logs - we can observe this:
before time 21:54:24
there was 'fedora' VG with root & swap LV active on luks device.
Then there was total 'shutdown' of all LVs on VG and luks device has been unmapped:
21:54:22,565 DEBUG systemd:lvm2-pvscan@253:2.service: Control process exited, code=exited status=0
Then follows new luks unlocking, however pvscan service is not being initialized:
21:54:24,816 DEBUG systemd-udevd:LINK 'disk/by-id/lvm-pv-uuid-0loTAy-CPXS-tS26-BzJa-DaZP-X2zH-7u9iK3' /usr/lib/udev/rules.d/69-dm-lvm-metad.rules:38
ATM it's unclear where has been this missing pvscan service reinitialization lost.
It looks there are at least 2 seconds between new luks reactivation so it shouldn't be a race case.
I'll leave to Peter to add more advices about how to enable udev monitoring during this situation - so we can get better info what's going on.
(As a 'quick-hotfix' add direct call 'pvscan --cache' after unlocking luks will likely fix it - or disabling lvmetad - which IMHO would be preferable in case, auto-activation is not needed by Anaconda, which would prefer to activate devices on its own)
Just an update here: this still seems to be happening with all recent anaconda versions. The way anaconda *reacts* to it seems to be changing, though. 27.20-1 just shows the 'Continue / Read only mount / Skip to shell / Quit' choices again, and if you hit 1 (Continue), immediately displays them again. 27.20.1 and 28.2 show a traceback and then 'Pane is dead'. But if you examine program.log for any of these different failure modes, you see the same basic error as in this bug:
14:47:06,848 INF program:  Got error: Failed to call the 'Activate' method on the '/com/redhat/lvmdbus1/Lv/0' object: GDBus.Error:org.freedesktop.DBus.Python.dbus.exceptions.DBusException: ('com.redhat.lvmdbus1.Lv', 'Exit code 5, stderr = Volume group "fedora" not found\n Cannot process volume group fedora\n')
Proposing as a Beta blocker, per Beta criterion "The rescue mode of the installer must start successfully and be able to detect and mount any installation performed according to the applicable criteria, and provide a shell with access to utilities capable of performing typical recovery operations."
Discussed during blocker review :
AcceptedBlocker (Beta) - seems a clear violation of "The rescue mode of the installer must start successfully and be able to detect and mount any installation performed according to the applicable criteria, and provide a shell with access to utilities capable of performing typical recovery operations"
Please, try booting with this added to kernel command line:
debug systemd.log_target=kmsg log_buf_len=8M inst.updates=http://people.redhat.com/prajnoha/bz1475570/updates.img
This update image contains extra systemd unit to monitor udev events (systemd-udev-monitor.service) and "getlvmdebug" script to gather lvm debug information.
When the problem occurs, switch to anaconda's command line and, please, run:
Then copy the /run/lvmdebug.tar.xz file that the script has created and attach it here for us to inspect more. Thanks.
Created attachment 1328401 [details]
I trivially reproduced the issue in a VM using https://kojipkgs.fedoraproject.org/compose/branched/Fedora-27-20170919.n.0/compose/Server/x86_64/iso/Fedora-Server-dvd-x86_64-27-20170919.n.0.iso . I installed a default encrypted system over the whole disk, booted to make sure it works, and then rebooted into rescue mode and tried to unlock it.
Here's the file requested.
OK, this looks like an issue with systemd where the lvm2-pvscan@.service is not instantiated as expected due to an issue with SYSTEMD_WANTS and SYSTEMD_ALIAS directives in udev 69-dm-lvm-metad.rules in a situation where the underlying PV device is deactivated and then reactivated. This technique is supposed to execute lvm2-pvscan@<major>:<minor>.service to execute the LVM autoactivation.
(Here, anaconda is deactivating the crypted device and then reactivating it again.)
Michal Sekletar from systemd team is currently investigating the issue further.
I debugged the issue (thanks to Peter for minimal reproducer) and eventually I've figured out that issue is already fixed upstream . A bit embarrassing is that it was actually me who caused the bug in the first place (while fixing another LVM related bug in the process ).
I went ahead and commit fix to dist-git,
Build with backport is here (ARM builds are still in progress but rest is ready to be tested).
I tested the fix and it works! Great job. Now we need an update submitted to Bodhi ASAP. Thanks.
systemd-234-7.fc27 has been submitted as an update to Fedora 27. https://bodhi.fedoraproject.org/updates/FEDORA-2017-1ec480964f
Works with Beta RC2.
systemd-234-7.fc27 has been pushed to the Fedora 27 stable repository. If problems still persist, please make note of it in this bug report.