Bug 1475570 - Rescue mode fails while trying to access LVM volumes from existing install
Rescue mode fails while trying to access LVM volumes from existing install
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: systemd (Show other bugs)
27
All Linux
unspecified Severity high
: ---
: ---
Assigned To: systemd-maint
Fedora Extras Quality Assurance
AcceptedBlocker
:
Depends On:
Blocks: F27BetaBlocker
  Show dependency treegraph
 
Reported: 2017-07-26 18:32 EDT by Adam Williamson
Modified: 2017-09-25 14:07 EDT (History)
34 users (show)

See Also:
Fixed In Version: systemd-234-7.fc27
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-25 14:07:48 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
syslog from a failure (995.00 KB, text/plain)
2017-07-26 18:34 EDT, Adam Williamson
no flags Details
syslog with debug enabled (1.67 MB, text/plain)
2017-08-03 18:00 EDT, Adam Williamson
no flags Details
storage.log with debug enabled (1.07 MB, text/plain)
2017-08-03 18:00 EDT, Adam Williamson
no flags Details
program.log with debug enabled (1.46 MB, text/plain)
2017-08-03 18:01 EDT, Adam Williamson
no flags Details
lvmdebug.tar.xz (119.11 KB, application/x-xz)
2017-09-20 07:42 EDT, Kamil Páral
no flags Details

  None (edit)
Description Adam Williamson 2017-07-26 18:32:18 EDT
The openQA rescue mode test started failing between Fedora-Rawhide-20170712.n.1 and Fedora-Rawhide-20170714.n.0 . The test uses an encrypted install. When booting to rescue mode, after entering the passphrase, instead of things proceeding normally, the system just starts printing lines full of equals signs over and over.

'syslog' from these failures contains a bunch of LVM messages clearly indicating some kind of problem, but I'm not quite sure what; I'll attach one.

I don't *immediately* see what changes in the 0714.n.0 compose may be relevant; I don't see changes to any anaconda-related components or to LVM. There *was* a new kernel, though (went from kernel-4.13.0-0.rc0.git5.1.fc27 to kernel-4.13.0-0.rc0.git6.1.fc27).
Comment 1 Adam Williamson 2017-07-26 18:34 EDT
Created attachment 1305080 [details]
syslog from a failure
Comment 2 Jiri Konecny 2017-07-27 04:38:11 EDT
This doesn't look like an Anaconda issue. I'm changing component to LVM, maybe they will know what could be the root cause here.
Comment 3 Zdenek Kabelac 2017-07-27 05:29:11 EDT
From the trace it looks like   PV with Fedora VG is found on a multiple leg device, then it's dropped  and lvmetad is confused ?


2 things to try out:

a) try to run the same case with lvm.conf 'use_lvmetad=0'  (also in regenerated Anaconda ramdisk)


b) is multipath device properly configured ? is lvm2 seeing only resulting multipath device and not it's leg  - is udev properly knowing which device is multipath component device ?
Comment 4 Peter Rajnoha 2017-07-27 08:10:25 EDT
Would it be possible to add "debug" to kernel command line?
Comment 5 Adam Williamson 2017-07-27 12:11:19 EDT
There is no multipath device of any kind involved in the test (at least intentionally; if one is showing up, there is some kind of misdetection going on).

I can try with 'debug' on the cmdline easily today, editing lvm.conf will be more complicated.
Comment 6 Adam Williamson 2017-08-03 18:00 EDT
Created attachment 1308868 [details]
syslog with debug enabled

Here's the syslog output after reproducing the bug with 'debug' added to the params. Also, the bug *is* specifically related to encryption: rescue mode works fine on a non-encrypted install.
Comment 7 Adam Williamson 2017-08-03 18:00 EDT
Created attachment 1308869 [details]
storage.log with debug enabled
Comment 8 Adam Williamson 2017-08-03 18:01 EDT
Created attachment 1308870 [details]
program.log with debug enabled
Comment 9 Jan Kurik 2017-08-15 04:37:06 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 27 development cycle.
Changing version to '27'.
Comment 10 Peter Bieringer 2017-08-19 07:49:04 EDT
This issue also exists in current server ISO of FC26, the particular system has no LVM running at all (simple partition layout) and also no encryption is used. It happens short after starting anaconda and select 1) for Continue.

strace short before the write (loop) appears is

sendto "Entered spoke: RescueMountSpoke\0"

then the loop starts:

write of 80x "=" \n 80x "="
write of one "\n"

Interesting is that FC25 has the same issue, FC24+23 can't be tested, system is too new, kernel is not booting.

-> have to recover manually now...
Comment 11 Adam Williamson 2017-08-30 16:45:19 EDT
Peter: I suspect your case may actually be a different issue that has the same symptom (this is one of those codepaths where almost any failure that happens within a certain block of code will result in the same obvious symptom, I think). It would be good if you could file a new bug, with the same info that the devs asked for here.
Comment 12 Zdenek Kabelac 2017-08-31 05:46:16 EDT
So in the attached logs - we can observe this:


before time  21:54:24


there was  'fedora'  VG with root & swap LV active on luks device.

Then there was total 'shutdown' of  all LVs on VG and  luks device has been unmapped:

21:54:22,565 DEBUG systemd:lvm2-pvscan@253:2.service: Control process exited, code=exited status=0


Then follows new  luks unlocking, however  pvscan service is not being initialized:

21:54:24,816 DEBUG systemd-udevd:LINK 'disk/by-id/lvm-pv-uuid-0loTAy-CPXS-tS26-BzJa-DaZP-X2zH-7u9iK3' /usr/lib/udev/rules.d/69-dm-lvm-metad.rules:38


ATM it's unclear where has been this missing pvscan service reinitialization lost.

It looks there are at least 2 seconds between new luks reactivation so it shouldn't be a race case.


I'll leave to Peter to add more advices about how to enable udev monitoring during this situation - so we can get better info what's going on.
Comment 13 Zdenek Kabelac 2017-08-31 05:48:49 EDT
(As a 'quick-hotfix'   add direct call  'pvscan --cache' after unlocking luks will likely fix it  -  or disabling lvmetad - which IMHO would be preferable in case, auto-activation is not needed by Anaconda, which would prefer to activate devices on its own)
Comment 14 Adam Williamson 2017-09-08 19:43:48 EDT
Just an update here: this still seems to be happening with all recent anaconda versions. The way anaconda *reacts* to it seems to be changing, though. 27.20-1 just shows the 'Continue / Read only mount / Skip to shell / Quit' choices again, and if you hit 1 (Continue), immediately displays them again. 27.20.1 and 28.2 show a traceback and then 'Pane is dead'. But if you examine program.log for any of these different failure modes, you see the same basic error as in this bug:

14:47:06,848 INF program: [11] Got error: Failed to call the 'Activate' method on the '/com/redhat/lvmdbus1/Lv/0' object: GDBus.Error:org.freedesktop.DBus.Python.dbus.exceptions.DBusException: ('com.redhat.lvmdbus1.Lv', 'Exit code 5, stderr =   Volume group "fedora" not found\n  Cannot process volume group fedora\n')
Comment 15 Adam Williamson 2017-09-08 19:45:45 EDT
Proposing as a Beta blocker, per Beta criterion "The rescue mode of the installer must start successfully and be able to detect and mount any installation performed according to the applicable criteria, and provide a shell with access to utilities capable of performing typical recovery operations."
Comment 16 Kamil Páral 2017-09-11 13:20:42 EDT
Discussed during blocker review [1]:

AcceptedBlocker (Beta) - seems a clear violation of "The rescue mode of the installer must start successfully and be able to detect and mount any installation performed according to the applicable criteria, and provide a shell with access to utilities capable of performing typical recovery operations"

[1] https://meetbot-raw.fedoraproject.org/fedora-blocker-review/2017-09-11/
Comment 17 Peter Rajnoha 2017-09-20 05:10:25 EDT
Please, try booting with this added to kernel command line:

  debug systemd.log_target=kmsg log_buf_len=8M inst.updates=http://people.redhat.com/prajnoha/bz1475570/updates.img

This update image contains extra systemd unit to monitor udev events (systemd-udev-monitor.service) and "getlvmdebug" script to gather lvm debug information.

When the problem occurs, switch to anaconda's command line and, please, run:

  getlvmdebug

Then copy the /run/lvmdebug.tar.xz file that the script has created and attach it here for us to inspect more. Thanks.
Comment 18 Kamil Páral 2017-09-20 07:42 EDT
Created attachment 1328401 [details]
lvmdebug.tar.xz

I trivially reproduced the issue in a VM using https://kojipkgs.fedoraproject.org/compose/branched/Fedora-27-20170919.n.0/compose/Server/x86_64/iso/Fedora-Server-dvd-x86_64-27-20170919.n.0.iso . I installed a default encrypted system over the whole disk, booted to make sure it works, and then rebooted into rescue mode and tried to unlock it.

Here's the file requested.
Comment 19 Peter Rajnoha 2017-09-20 11:30:40 EDT
OK, this looks like an issue with systemd where the lvm2-pvscan@.service is not instantiated as expected due to an issue with SYSTEMD_WANTS and SYSTEMD_ALIAS directives in udev 69-dm-lvm-metad.rules in a situation where the underlying PV device is deactivated and then reactivated. This technique is supposed to execute lvm2-pvscan@<major>:<minor>.service to execute the LVM autoactivation.

(Here, anaconda is deactivating the crypted device and then reactivating it again.)

Michal Sekletar from systemd team is currently investigating the issue further.
Comment 20 Michal Sekletar 2017-09-20 19:17:39 EDT
I debugged the issue (thanks to Peter for minimal reproducer) and eventually I've figured out that issue is already fixed upstream [1]. A bit embarrassing is that it was actually me who caused the bug in the first place (while fixing another LVM related bug in the process [2]).

I went ahead and commit fix to dist-git,
https://src.fedoraproject.org/rpms/systemd/c/1d5166b9dfc98c9598c20c4871358da42059e53c?branch=f27

Build with backport is here (ARM builds are still in progress but rest is ready to be tested).
https://koji.fedoraproject.org/koji/taskinfo?taskID=21990970

[1] https://github.com/systemd/systemd/commit/cc0df6cc35339976c367977dc292278a1939db0c
[2] https://github.com/systemd/systemd/commit/05e33aa1d5c074df863d9de4bf236dd676347a4c
Comment 21 Kamil Páral 2017-09-21 03:27:31 EDT
I tested the fix and it works! Great job. Now we need an update submitted to Bodhi ASAP. Thanks.
Comment 22 Fedora Update System 2017-09-21 05:28:11 EDT
systemd-234-7.fc27 has been submitted as an update to Fedora 27. https://bodhi.fedoraproject.org/updates/FEDORA-2017-1ec480964f
Comment 23 Kamil Páral 2017-09-25 06:26:41 EDT
Works with Beta RC2.
Comment 24 Fedora Update System 2017-09-25 14:07:48 EDT
systemd-234-7.fc27 has been pushed to the Fedora 27 stable repository. If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.