Description of problem: Since applying the latest clevis update, unattended reboots hang (at least some of the time) at unlocking the root device, until the user hits Enter. Version-Release number of selected component (if applicable): clevis-14-1.fc32.x86_64 How reproducible: About 30% of the time, perhaps? When applying kernel-5.8.7-200.fc32.x86_64 in combination with clevis-14-1.fc32.x86_64, four of my ~12 or so desktops failed in this fashion. (The others booted normally.) I don't know at this point whether the issue is specific to these machines (seems unlikely; all machines are similar although not identical Dell desktops) or just occurs randomly (e.g. race condition). Steps to Reproduce: 1. Set up clevis/dracut to unlock the encrypted root device on boot: dnf install clevis clevis-luks clevis-dracut cfg='{"t":1,"pins":{"tang":[{"url":"http://a.b.c.d"},{"url":"http://e.f.g.h"}]}}' clevis luks bind -d /dev/md127 sss $cfg dracut -f 2. Reboot the machine. Actual results: 1. "Please enter passphrase for disk luks-xxx" prompt pops up. 2. Network is brought up (I boot without rhgb quiet so see the module output here). 3. Following prompts: Starting Forward Password Requests to Clevis... [ OK ] Finished Forward Password Requests to Clevis. 4. Another "Please enter passphrase for disk luks-xxx" prompt, which hangs (I left this for over 8 hours). If I hit Enter (empty passphrase) at this point boot proceeds normally, so the device is clearly being unlocked via clevis. But obviously this isn't ideal as it breaks an unattended reboot. Expected results: The "Please enter passphrase for disk luks-xxx" prompt pops up, then the disk is unlocked via clevis, then boot proceeds normally without user intervention. This happened with the previous clevis release, i.e. before applying the latest clevis-14-1.fc32.x86_64 package.
(In reply to Ben Webb from comment #0) > Description of problem: > Since applying the latest clevis update, unattended reboots hang (at least > some of the time) at unlocking the root device, until the user hits Enter. > > Version-Release number of selected component (if applicable): > clevis-14-1.fc32.x86_64 Would you be able to check if clevis-14-4.fc32 changes anything? It's currently in [testing]: https://koji.fedoraproject.org/koji/buildinfo?buildID=1607396 > > > How reproducible: > About 30% of the time, perhaps? When applying kernel-5.8.7-200.fc32.x86_64 > in combination with clevis-14-1.fc32.x86_64, four of my ~12 or so desktops > failed in this fashion. (The others booted normally.) I don't know at this > point whether the issue is specific to these machines (seems unlikely; all > machines are similar although not identical Dell desktops) or just occurs > randomly (e.g. race condition). > > Steps to Reproduce: > 1. Set up clevis/dracut to unlock the encrypted root device on boot: > dnf install clevis clevis-luks clevis-dracut > cfg='{"t":1,"pins":{"tang":[{"url":"http://a.b.c.d"},{"url":"http://e.f.g. > h"}]}}' > clevis luks bind -d /dev/md127 sss $cfg > dracut -f > Are you unlocking a single device? In this case the root device? > 2. Reboot the machine. > > Actual results: > 1. "Please enter passphrase for disk luks-xxx" prompt pops up. > 2. Network is brought up (I boot without rhgb quiet so see the module output > here). > 3. Following prompts: > Starting Forward Password Requests to Clevis... > [ OK ] Finished Forward Password Requests to Clevis. Could you provide the log from journalctl -u clevis-luks-askpass?
Created attachment 1714849 [details] log from journalctl -u clevis-luks-askpass
(In reply to Sergio Correia from comment #1) > Would you be able to check if clevis-14-4.fc32 changes anything? Sure, I can try that with the next kernel update... but it won't be until next week as we're not on site that often due to covid/work from home. > Are you unlocking a single device? In this case the root device? Yes to both. > Could you provide the log from journalctl -u clevis-luks-askpass? Sure, I added that as an attachment. I rebooted the machine at 21:31 and it got stuck until I came in the next day at 10:44 to hit Enter.
(In reply to Ben Webb from comment #3) > (In reply to Sergio Correia from comment #1) > > Would you be able to check if clevis-14-4.fc32 changes anything? > > Sure, I can try that with the next kernel update... but it won't be until > next week as we're not on site that often due to covid/work from home. > > > Are you unlocking a single device? In this case the root device? > > Yes to both. > > > Could you provide the log from journalctl -u clevis-luks-askpass? > > Sure, I added that as an attachment. I rebooted the machine at 21:31 and it > got stuck until I came in the next day at 10:44 to hit Enter. Thanks. I believe I reproduced the issue here and I am testing a tentative fix. If you could test this scratch build and report back the results I would appreciate: https://koji.fedoraproject.org/koji/taskinfo?taskID=51609430 In case you try this package, please rebuild your initramfs with dracut -f before rebooting.
(In reply to Sergio Correia from comment #4) > Thanks. I believe I reproduced the issue here and I am testing a tentative > fix. > > If you could test this scratch build and report back the results I would > appreciate Will do!
It seems to fix the issue, but I can see the Passphrase prompt remains there forever (until GDM starts).
(In reply to Renaud Métrich from comment #6) > It seems to fix the issue, Thanks for testing! > but I can see the Passphrase prompt remains there > forever (until GDM starts). Do you mean the plymouth prompt? If so, there is this BZ to track this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1672369 Reported upstream here as well: https://gitlab.freedesktop.org/plymouth/plymouth/-/issues/126
Plymouth prompt but also "Please enter passphrase for unlocking ..." in text mode. I need to dig more if it's related, since after switching to "network-legacy" dracut module (due to a NetworkManager issue), the prompt was gone on next reboot.
Sorry, also happens with "network-legacy", so may be related to the fix.
(In reply to Sergio Correia from comment #4) > If you could test this scratch build and report back the results I would > appreciate: https://koji.fedoraproject.org/koji/taskinfo?taskID=51609430 Unfortunately that build doesn't work either - same issue, some of my machines get stuck during boot but will continue to boot normally if someone physically hits "Enter". Applied via: $ rpm -q clevis clevis-14-4.fc32.bz1878892.x86_64 $ rpm -q kernel kernel-5.8.7-200.fc32.x86_64 kernel-5.8.8-200.fc32.x86_64 kernel-5.8.9-200.fc32.x86_64 $ sudo dracut -f 5.8.9-200.fc32.x86_64 $ sudo reboot
(In reply to Ben Webb from comment #10) > (In reply to Sergio Correia from comment #4) > > If you could test this scratch build and report back the results I would > > appreciate: https://koji.fedoraproject.org/koji/taskinfo?taskID=51609430 > > Unfortunately that build doesn't work either - same issue, some of my > machines get stuck during boot but will continue to boot normally if someone > physically hits "Enter". Applied via: > > $ rpm -q clevis > clevis-14-4.fc32.bz1878892.x86_64 Did you update all clevis packages? $ rpm -qa | grep clevis
(In reply to Sergio Correia from comment #11) > Did you update all clevis packages? > $ rpm -qa | grep clevis Yes. $ rpm -qa|grep clevis clevis-systemd-14-4.fc32.bz1878892.x86_64 clevis-dracut-14-4.fc32.bz1878892.x86_64 clevis-14-4.fc32.bz1878892.x86_64 clevis-luks-14-4.fc32.bz1878892.x86_64
(In reply to Ben Webb from comment #12) > (In reply to Sergio Correia from comment #11) > > Did you update all clevis packages? > > $ rpm -qa | grep clevis > > Yes. > > $ rpm -qa|grep clevis > clevis-systemd-14-4.fc32.bz1878892.x86_64 > clevis-dracut-14-4.fc32.bz1878892.x86_64 > clevis-14-4.fc32.bz1878892.x86_64 > clevis-luks-14-4.fc32.bz1878892.x86_64 Alright. Could you please share the output of "journalctl -b0 | grep 'Password Requests'" from one of the machines that presented this issue?
Created attachment 1715605 [details] Output of "journalctl -b0 | grep 'Password Requests'"
This message is a reminder that Fedora 32 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '32'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 32 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged change the 'version' to a later Fedora version prior this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 32 changed to end-of-life (EOL) status on 2021-05-25. Fedora 32 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.
FWIW, the issue (or a very similar one) is still present in F33 and F34 (clevis-18-1.fc34.x86_64) - about 20-30% of my machines do not boot. Everything works very reliably though if I pin clevis to 13-1 (although BTW recently I've had to pin jose to 10-9 too; older clevis does not work with jose-11-1.fc34.x86_64, failing at boot with "JWE missing required 'clevis.tang.adv' header parameter!" suggesting that newer jose is missing a Conflicts: or similar with older clevis). Happy to provide more information if you need it to help track down this issue - although I am rarely on site.
Thanks for reopening this, Ben. Unfortunately I have been unable to reproduce it, but I have a suspicion on what could have caused the issue, so I will prepare a scratch build and it would be much appreciated if you could test it out. Good point about jose and "Conflicts", also.
@Ben: It would be great if you could test this build and report back whether it helps with the issue at hand: https://koji.fedoraproject.org/koji/taskinfo?taskID=68750126 -- please rebuild the initramfs after installing the updated packages.
Fortuitous timing! I was on site today. Was able to apply this scratch build to all but one of my machines (the one holdout is still on F33), rebuild the initrd, and every machine now boots normally. So I think you got it. Thanks! For reference here's the package load on one machine: $ rpm -qa|grep 'clevis\|jose' clevis-pin-tpm2-0.3.0-1.fc34.x86_64 libjose-11-1.fc34.x86_64 jose-11-1.fc34.x86_64 clevis-18-1.fc34.bz1878892.01.x86_64 clevis-luks-18-1.fc34.bz1878892.01.x86_64 clevis-systemd-18-1.fc34.bz1878892.01.x86_64 clevis-dracut-18-1.fc34.bz1878892.01.x86_64 $ uname -r 5.12.6-300.fc34.x86_64
Good news, thanks for testing! For reference, in this scratch build I reverted upstream commit cbb64c4 ("dracut: favour systemd units over dracut hooks")[1]. @Jonathan, would you have any insights here? [1] https://github.com/latchset/clevis/commit/cbb64c4efb66b57a07fe1afbee10b012523ab71c
Just to make sure I understand correctly, is the latest issue you're hitting still that the systemd clevis service *does* unlock the device, but for whatever reason, you still have to press Enter before it goes forward? Can you provide the full journal logs of the initrd when that happens? Can you also provide the versions of dracut and systemd? Can you verify that your initrd contains all of: `cryptsetup.target`, `cryptsetup-pre.target`, and `remote-cryptsetup.target`? (You can do e.g. `lsinitrd | grep cryptsetup`).
(In reply to Jonathan Lebon from comment #22) > Just to make sure I understand correctly, is the latest issue you're hitting > still that the systemd clevis service *does* unlock the device, but for > whatever reason, you still have to press Enter before it goes forward? That was certainly the issue with F32, which was never resolved. I can confirm that the system still doesn't boot with F34, but can't be 100% sure it was the same issue. I can check next time I'm on site and get the logs/versions you're asking for (the machines all boot right now because they are using the scratch build Sergio provided; I will need to 'break' one first). > Can you provide the full journal logs of the initrd when that happens? journalctl --system ? Or something else?
(In reply to Ben Webb from comment #23) > (In reply to Jonathan Lebon from comment #22) > > Just to make sure I understand correctly, is the latest issue you're hitting > > still that the systemd clevis service *does* unlock the device, but for > > whatever reason, you still have to press Enter before it goes forward? > > That was certainly the issue with F32, which was never resolved. I can > confirm that the system still doesn't boot with F34, but can't be 100% sure > it was the same issue. I can check next time I'm on site and get the > logs/versions you're asking for (the machines all boot right now because > they are using the scratch build Sergio provided; I will need to 'break' one > first). Gotcha. Yeah, would be good to make sure it's the same failure or something else happening here. > > Can you provide the full journal logs of the initrd when that happens? > > journalctl --system ? Or something else? Yes, something like `journalctl --system -b 0`. If you're in the initramfs emergency prompt, it might be hard for you to get the logs off unless you can connect to the machine from another machine via serial, or since you should have networking in the initramfs (since you're using Tang pinning), you might be able to POST it someplace if those machines have public access (see example curl invocations in https://paste.centos.org/api).
(In reply to Jonathan Lebon from comment #24) > (see example curl invocations in https://paste.centos.org/api). Ahh too bad, that actually requires an API key and at least this CentOS instance doesn't let you request your own key AFAICT. There's probably other pastebins out there which do if you really need to do this.
Created attachment 1789276 [details] Full initrd journal
(In reply to Jonathan Lebon from comment #22) > Can you provide the full journal logs of the initrd when that happens? See attachment https://bugzilla.redhat.com/attachment.cgi?id=1789276 > Can you also provide the versions of dracut and systemd? Can you verify that > your initrd contains all of: `cryptsetup.target`, `cryptsetup-pre.target`, > and `remote-cryptsetup.target`? (You can do e.g. `lsinitrd | grep > cryptsetup`). This is what I ran on one of my affected systems: # dnf distro-sync --refresh # rpm -q dracut dracut-054-12.git20210521.fc34.x86_64 # rpm -q clevis clevis-18-1.fc34.x86_64 # rpm -q systemd systemd-248.3-1.fc34.x86_64 # uname -r 5.12.8-300.fc34.x86_64 # dracut -f # lsinitrd /boot/initramfs-5.12.8-300.fc34.x86_64.img | grep cryptsetup drwxr-xr-x 2 root root 0 May 21 05:25 etc/systemd/system/cryptsetup.target.wants lrwxrwxrwx 1 root root 48 May 21 05:25 etc/systemd/system/cryptsetup.target.wants/clevis-luks-askpass.path -> /usr/lib/systemd/system/clevis-luks-askpass.path -rwxr-xr-x 1 root root 491256 May 21 05:25 usr/lib64/libcryptsetup.so.12.6.0 lrwxrwxrwx 1 root root 35 May 21 05:25 usr/lib64/libcryptsetup.so.12 -> ../../lib64/libcryptsetup.so.12.6.0 -rw-r--r-- 1 root root 473 May 15 09:33 usr/lib/systemd/system/cryptsetup-pre.target -rw-r--r-- 1 root root 420 May 15 09:33 usr/lib/systemd/system/cryptsetup.target -rwxr-xr-x 1 root root 70168 May 15 10:14 usr/lib/systemd/systemd-cryptsetup -rwxr-xr-x 1 root root 41280 May 15 10:14 usr/lib/systemd/system-generators/systemd-cryptsetup-generator lrwxrwxrwx 1 root root 27 May 21 05:25 usr/lib/systemd/system/initrd-root-device.target.wants/remote-cryptsetup.target -> ../remote-cryptsetup.target -rw-r--r-- 1 root root 557 May 15 09:33 usr/lib/systemd/system/remote-cryptsetup.target lrwxrwxrwx 1 root root 20 May 21 05:25 usr/lib/systemd/system/sysinit.target.wants/cryptsetup.target -> ../cryptsetup.target -rw-r--r-- 1 root root 35 May 21 05:25 usr/lib/tmpfiles.d/cryptsetup.conf -rwxr-xr-x 1 root root 142104 May 21 05:25 usr/sbin/cryptsetup # reboot The boot hangs at the prompt for the password for the luks volume. On this occasion, hitting Enter doesn't do anything (I just get another identical prompt). Eventually it times out, and after entering the root password I get the emergency shell. I ran journalctl --system -b 0 in that shell and uploaded the result (with mild redaction of hostname/IP) here. I can't claim to know what's going on here, but it certainly looks to me as if NetworkManager is being called too early (or with too short a timeout) and it gives up trying to configure the network before the NIC gets a link. I've previously hacked around that (see https://bugzilla.redhat.com/show_bug.cgi?id=1702524#c23) but that hack only worked with older clevis (perhaps because it only "fixed" the dracut hook, not the systemd service?)
Hmm, I think you're missing `rd.luks.options=_netdev`. This ensures that networking is up before systemd tries to unlock the LUKS device. Can you try adding that to the kernel command-line?
(In reply to Jonathan Lebon from comment #28) > Hmm, I think you're missing `rd.luks.options=_netdev`. I believe that I tried adding `_netdev` to the root filesystem in `/etc/fstab` in the past with no effect - is this doing something different? (You no doubt saw that I also have `rd.neednet=1`.) But I'll try this next time I'm on site and report back, thanks!
(In reply to Ben Webb from comment #29) > (In reply to Jonathan Lebon from comment #28) > > Hmm, I think you're missing `rd.luks.options=_netdev`. > > I believe that I tried adding `_netdev` to the root filesystem in > `/etc/fstab` in the past with no effect - is this doing something different? Yes, they're related but different. You'll want to put it on the block device itself (it can be added as an option in /etc/crypttab, but you can also use rd.luks.options). Otherwise, systemd won't know that unlocking requires networking, even if the subsequent filesystem mount itself is marked as such.
(In reply to Jonathan Lebon from comment #28) > Hmm, I think you're missing `rd.luks.options=_netdev`. This ensures that > networking is up before systemd tries to unlock the LUKS device. Can you try > adding that to the kernel command-line? Unfortunately while that made a difference, it made things worse, not better! With that option, I don't see any output related to clevis unlocking, and no LUKS password prompt; it just gets stuck after bringing up the network (which incidentally has never been an issue - I can always ping these machines when they get stuck). I left the machine for an hour or so and it didn't fail over to the emergency shell, so no journalctl output, sorry (although I do have a photo of the screen if that helps!) FWIW, the last output I see on boot is NetworkManager starting up (looks OK, ends up in CONNECTED_GLOBAL state) followed by "Finished nm-wait-online-initrd.service" and "Starting dracut initqueue hook..." This is with the same clevis/dracut/systemd as comment #27, although now with kernel 5.12.9-300.fc34.x86_64.
(In reply to Ben Webb from comment #31) > (In reply to Jonathan Lebon from comment #28) > > Hmm, I think you're missing `rd.luks.options=_netdev`. This ensures that > > networking is up before systemd tries to unlock the LUKS device. Can you try > > adding that to the kernel command-line? > > Unfortunately while that made a difference, it made things worse, not > better! With that option, I don't see any output related to clevis > unlocking, and no LUKS password prompt; it just gets stuck after bringing up > the network (which incidentally has never been an issue - I can always ping > these machines when they get stuck). I left the machine for an hour or so > and it didn't fail over to the emergency shell, so no journalctl output, > sorry (although I do have a photo of the screen if that helps!) Yeah, picture would help if you still have it! > FWIW, the last output I see on boot is NetworkManager starting up (looks OK, > ends up in CONNECTED_GLOBAL state) followed by "Finished > nm-wait-online-initrd.service" and "Starting dracut initqueue hook..." Hmm, can you try booting with rd.debug so we can try to have a peak into what dracut is waiting on?
Created attachment 1796022 [details] Image at the point where the boot gets stuck
Created attachment 1796023 [details] Image at the point where the boot gets stuck, with rd.debug
(In reply to Jonathan Lebon from comment #32) > Yeah, picture would help if you still have it! See attachment 1796022 [details]. > Hmm, can you try booting with rd.debug so we can try to have a peak into > what dracut is waiting on? I added `rd.shell rd.debug log_buf_len=1M rd.luks.options=_netdev` to the command line and it still refused to give me an emergency shell (although I only left it for ~10 minutes this time) so you get a low-tech photo again. See attachment 1796023 [details]. This is with dracut-055-2.fc34.x86_64, clevis-18-1.fc34.x86_64, systemd-248.3-1.fc34.x86_64, kernel 5.12.11-300.fc34.x86_64.
Hmm OK yes, I think I see what's going on. I think the 90crypt dracut module is conflicting with systemd-cryptsetup-generator. As a test, can you try using rd_NO_LUKS? 90crypt respects this but not systemd-cryptsetup-generator. Are you using `rd.luks.uuid=<UUID>` by change? As a second independent test, can you instead try using `rd.luks.name=$UUID=$NAME`? 90crypt doesn't handle that karg, but systemd-cryptsetup-generator does.
Created attachment 1797373 [details] initrd journal with rd_NO_LUKS set
Created attachment 1797374 [details] initrd journal using rd.luks.name
(In reply to Jonathan Lebon from comment #36) > As a test, can you try using rd_NO_LUKS? 90crypt respects this but not > systemd-cryptsetup-generator. With that it looks like dracut can't find the root device. Eventually it gives up and gives me an emergency shell. Contents of `journalctl --system -b 0` as attachment 1797373 [details]. > Are you using `rd.luks.uuid=<UUID>` by change? As a second independent test, > can you instead try using `rd.luks.name=$UUID=$NAME`? 90crypt doesn't handle > that karg, but systemd-cryptsetup-generator does. I am, so I tried rd.luks.name as you suggest. This also results in not being able to find the root device, attachment 1797374 [details].
Can you describe your setup? If I understand correctly from your logs, it seems like you're doing: root and swap on LVM on LUKS on RAID. Is that right? I think the problem now is that we're hitting a dependency cycle very similar to https://github.com/dracutdevs/dracut/pull/931: - systemd-cryptsetup is waiting for remote-fs-pre.target - remote-fs-pre.target is waiting for dracut-initqueue.service - dracut-initqueue.service is waiting for /dev/vg0/root to appear - /dev/vg0/root is on LUKS, so won't appear until systemd-cryptsetup runs As another test, on top of rd_NO_LUKS, can you also add rd.auto and remove rd.lvm.lv=vg0/root and rd.lvm.lv=vg0/swap?
Created attachment 1799438 [details] Image at the point where the boot gets stuck, with rd.auto
(In reply to Jonathan Lebon from comment #40) > Can you describe your setup? If I understand correctly from your logs, it > seems like you're doing: root and swap on LVM on LUKS on RAID. Is that right? Yes. Two physical disks each with three partitions. /boot and /boot/efi are each one partition on each disk mirrored with mdraid; the remaining partition is also a RAID 1, with as you said, LVM on LUKS on it. The LUKS device is unlocked with clevis talking to one of a pair of tang servers. > As another test, on top of rd_NO_LUKS, can you also add rd.auto and remove > rd.lvm.lv=vg0/root and rd.lvm.lv=vg0/swap? This also gets stuck at "Remote Encrypted Volumes" but it does look like it's at least trying to unlock the root device. I didn't get a shell after 15 minutes or so, so attachment #1799438 [details] has a photo.
Created attachment 1801000 [details] Kickstart for basic F34 Server environment to reproduce
(In reply to Jonathan Lebon from comment #40) > Can you describe your setup? If I understand correctly from your logs, it > seems like you're doing: root and swap on LVM on LUKS on RAID. Is that right? I was able to reproduce something pretty similar to my setup in a VM, so I can easily attach a serial console or reboot it without bothering users of my actual hardware (or perhaps you can reproduce at your end). See attachment 1801000 [details]. This is a kickstart for Fedora 34 Server, which I installed via VirtualBox. No RAID in this setup, but I added an encrypted volume group via the installer's custom partitioning. With updates applied this boots with the 5.12.14 kernel, dracut 055-3, clevis-18, and prompts for the encryption password as you would expect. Next I set up clevis/tang to talk to my two tang servers (at MYIP1 and MYIP2) with dnf install clevis clevis-luks clevis-dracut cfg='{"t":1,"pins":{"tang":[{"url":"http://MYIP1"},{"url":"http://MYIP2"}]}}' clevis luks bind -d /dev/sda2 sss $cfg dracut -f Then added `ip=dhcp rd.neednet=1` to the kernel command line and rebooted. This does not work, same as on my physical F34 desktops. Looks like the network doesn't come up, and I see no output from clevis. (By "does not work", I mean I can still type in the encryption password and boot normally, but clevis doesn't unlock the device automatically.) It *does* work if I use the clevis scratch build provided in comment #19 (clevis bz1878892.01) and downgrade dracut to the original F34 version (`dnf distro-sync --disablerepo=updates dracut\*` yielding dracut 053-4). I also tried previous F34 updates dracut versions (054-6, 054-12, and 055-2) and these all fall into the "does not work" category.
Can you try with the latest packages of everything and this dracut build: https://jlebon.fedorapeople.org/dracut-rhbz1878892/ ? Make sure also to use _netdev in your /etc/crypttab and regenerate the initramfs. I think I've narrowed down the root issue here, and I've got a somewhat hacky approach which seems to fix it for me locally, though would be good to confirm with you before submitting it upstream.
Going to write things here before I forget. So the main issue here is indeed essentially what I've described in https://bugzilla.redhat.com/show_bug.cgi?id=1878892#c40 (except it's not just LVM -- the same applies for 90crypt, RAID, and the rootfs generator, the latter of which was hacked around with https://github.com/dracutdevs/dracut/pull/931). Essentially, the problem is basically that we can't unlock until we exit the initqueue (dracut-initqueue.service has Before=remote-fs-re.target), but we can't exit the initqueue until we unlock (because a bunch of modules add dracut hooks to ensure devices they expect to exist exist before exiting). What confounds things further (and took me a while to separate out) is the recent change in dracut to have NetworkManager run as a proper systemd unit instead of as a one-shot service. That change essentially now requires people who use Tang to use _netdev (which they should anyway, but it did work without it before) because otherwise there's a deadlock between NM waiting for D-Bus and D-Bus waiting for sysinit.target, which is after cryptsetup.target. The patch I've added in https://jlebon.fedorapeople.org/dracut-rhbz1878892/ is simply: diff --git a/modules.d/98dracut-systemd/dracut-initqueue.service b/modules.d/98dracut-systemd/dracut-initqueue.service index 3a8679a5..e8154ced 100644 --- a/modules.d/98dracut-systemd/dracut-initqueue.service +++ b/modules.d/98dracut-systemd/dracut-initqueue.service @@ -6,8 +6,8 @@ Description=dracut initqueue hook Documentation=man:dracut-initqueue.service(8) DefaultDependencies=no -Before=remote-fs-pre.target -Wants=remote-fs-pre.target +Before=remote-fs.target +Wants=remote-fs.target After=systemd-udev-trigger.service Wants=systemd-udev-trigger.service ConditionPathExists=/usr/lib/initrd-release What this does is it allows systemd-cryptsetup to run in parallel to the initqueue, because the service has `After=remote-fs-pre`. So then systemd as usual asks for the password, and clevis-luks-askpass.service as expected kicks in, unlocks the device, and all the initqueue/finished hooks are happy. But that patch undermines this really old dracut commit: https://github.com/dracutdevs/dracut/commit/b31250e7e6e6e104674dc304ba74965bb56074d6. So we need to make sure we understand it and we're not introducing another regression there. We should make sure to get input from dracut and systemd devs for a clean way to fix this. Ideally IMO long-term I think all of these dracut modules (crypt, lvm, raid) should, instead of outputting dracut hooks, if systemd is present ship systemd generators which create systemd units that can sanely be ordered with respect to e.g. udev, networking, remote-fs.target, cryptsetup.target, etc...
(In reply to Jonathan Lebon from comment #45) > Can you try with the latest packages of everything and this dracut build: > https://jlebon.fedorapeople.org/dracut-rhbz1878892/ ? Make sure also to use > _netdev in your /etc/crypttab and regenerate the initramfs. Unfortunately this doesn't work for me in the VM I set up (myrepo is a local repository with dracut 53 and the clevis scratch build from comment #19): # dnf distro-sync --refresh --disablerepo=myrepo # rpm -Uvh dracut*rhbz1878892*.rpm # cat /etc/crypttab luks-12f7e3ad-3039-4de8-bedb-ac565aebca06 UUID=12f7e3ad-3039-4de8-bedb-ac565aebca06 none discard,_netdev # dracut /boot/initramfs-5.13.4-200.fc34.x86_64.test.img 5.13.4-200.fc34.x86_64 # rpm -q dracut clevis dracut-055-4.rhbz1878892.fc34.x86_64 clevis-18-1.fc34.x86_64 Then rebooted, hit E at the GRUB screen and changed the initrd from initramfs-5.13.4-200.fc34.x86_64.img (generated with dracut 53, which works) to initramfs-5.13.4-200.fc34.x86_64.test.img Here's the tail of the serial console after waiting a few minutes: [ 2.897760] RPC: Registered tcp NFSv4.1 backchannel transport module. [^[[0;32m OK ^[[0m] Finished ^[[0;1;39mdracut pre-udev hook^[[0m. [ 2.977734] audit: type=1130 audit(1627436056.550:10): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=dracut-pre-udev comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' Starting ^[[0;1;39mRule-based Manage…for Device Events and Files^[[0m... [^[[0;32m OK ^[[0m] Started ^[[0;1;39mRule-based Manager for Device Events and Files^[[0m. Starting ^[[0;1;39mdracut pre-trigger hook^[[0m... [^[[0;32m OK ^[[0m] Finished ^[[0;1;39mdracut pre-trigger hook^[[0m. Starting ^[[0;1;39mColdplug All udev Devices^[[0m... [^[[0;32m OK ^[[0m] Finished ^[[0;1;39mColdplug All udev Devices^[[0m. Starting ^[[0;1;39mnm-initrd.service^[[0m... Starting ^[[0;1;39mShow Plymouth Boot Screen^[[0m... ^[[2J^[[3J^[[-1;-1f Starting ^[[0;1;39mWait for udev To …plete Device Initialization^[[0m...^M [^[[0;32m OK ^[[0m] Started ^[[0;1;39mShow Plymouth Boot Screen^[[0m.^M [^[[0;32m OK ^[[0m] Started ^[[0;1;39mForward Password R…s to Plymouth Directory Watch^[[0m.^M [^[[0;32m OK ^[[0m] Reached target ^[[0;1;39mPath Units^[[0m.^M <info> [1627436056.8312] NetworkManager (version 1.30.6-1.fc34) is starting... (for the first time)^M <info> [1627436056.8344] Read config: /etc/NetworkManager/NetworkManager.conf (lib: initrd-no-auto-default.conf)^M [ 3.299118] ACPI: video: Video Device [GFX0] (multi-head: yes rom: no post: no) [ 3.330326] ACPI: \_SB_.LNKD: Enabled at IRQ 11 [ 3.380469] input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/LNXVIDEO:00/input/input6 [ 3.407886] e1000: Intel(R) PRO/1000 Network Driver [ 3.408742] e1000: Copyright (c) 1999-2006 Intel Corporation. [ 3.435047] vboxguest: host-version: 6.1.18r142142 0x8000000f [ 3.452679] vbg_heartbeat_init: Setting up heartbeat to trigger every 2000 milliseconds [ 3.514601] ACPI: \_SB_.LNKC: Enabled at IRQ 9 [ 3.544952] input: VirtualBox mouse integration as /devices/pci0000:00/0000:00:04.0/input/input7 [ 3.606281] vboxguest: misc device minor 124, IRQ 11, I/O port d020, MMIO at 0x00000000f0400000 (size 0x0000000000400000) [^[[0;32m OK ^[[0m] Found device ^[[0;1;39mVBOX_HARDDISK 2^[[0m.^M Starting ^[[0;1;39mCryptography Setu…3039-4de8-bedb-ac565aebca06^[[0m...^M Please enter passphrase for disk VBOX_HARDDISK (luks-12f7e3ad-3039-4de8-bedb-ac565aebca06)::[ 3.889077] vboxvideo 0000:00:02.0: vgaarb: deactivate vga console [ 3.930922] Console: switching to colour dummy device 80x25 [ 3.993548] [drm] VRAM 00c00000 [ 4.052226] [drm] Initialized vboxvideo 1.0.0 20130823 for 0000:00:02.0 on minor 0 [ 4.114858] fbcon: vboxvideodrmfb (fb0) is primary device [ 4.135906] Console: switching to colour frame buffer device 112x39 [ 4.189578] vboxvideo 0000:00:02.0: [drm] fb0: vboxvideodrmfb frame buffer device [ 4.594098] e1000 0000:00:03.0 eth0: (PCI:33MHz:32-bit) 08:00:27:e4:cd:b7 [ 4.599280] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection [ 4.608944] e1000 0000:00:03.0 enp0s3: renamed from eth0 [ 183.676390] kauditd_printk_skb: 11 callbacks suppressed
This is impacting me as well. I wanted to automatically unlock my F34 laptop, but a basic LUKS+Clevis/Tang is failing to automatically unlock the device.
(In reply to John Call from comment #48) > This is impacting me as well. I wanted to automatically unlock my F34 > laptop, but a basic LUKS+Clevis/Tang is failing to automatically unlock the > device. Is it this same issue? Could you elaborate on your setup, please?
(In reply to Sergio Correia from comment #49) > Is it this same issue? Could you elaborate on your setup, please? Yes I believe it is the same issue of systemd unit ordering/dependencies. I tried using my procedure which has worked for RHEL7 and RHEL8 on Fedora34, and it didn't work. I noticed on RHEL8 that I can ping the server during the "Please enter passphrase..." prompt, but not on Fedora 34. It seems to me that the networking details given on the kernel_cmdline are either being ignored, or not applied early enough. Here is my basic steps for binding the root disk to tang... sudo dnf -y install clevis-dracut echo 'kernel_cmdline="ip=dhcp"' | sudo tee /etc/dracut.conf.d/clevis.conf sudo dracut -f sudo reboot I'm attempting a very basic deployment, that looks like this: [jcall@fedora ~]$ lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sr0 11:0 1 1024M 0 rom zram0 251:0 0 1.9G 0 disk [SWAP] vda 252:0 0 20G 0 disk ├─vda1 252:1 0 1G 0 part /boot └─vda2 252:2 0 19G 0 part └─luks-3cef56d0-1221-4eb3-87e9-f7aa811e2036 253:0 0 19G 0 crypt /home [jcall@fedora ~]$ findmnt / TARGET SOURCE FSTYPE OPTIONS / /dev/mapper/luks-3cef56d0-1221-4eb3-87e9-f7aa811e2036[/root] btrfs rw,relatime,seclabel,compress=zstd:1,space_cache,subvolid=258,subvol=/root Wow, off topic, BTRFS looks weird to my XFS-trained eyes...
@Ben Webb: I finally managed to reproduce the issue and we may have a potential fix. Would you be able to test it and report back, please? Here's a scratch build: https://koji.fedoraproject.org/koji/taskinfo?taskID=77924657
FEDORA-2021-f5be677d41 has been submitted as an update to Fedora 36. https://bodhi.fedoraproject.org/updates/FEDORA-2021-f5be677d41
FEDORA-2021-f5be677d41 has been pushed to the Fedora 36 stable repository. If problem still persists, please make note of it in this bug report.
FEDORA-2021-320c541101 has been submitted as an update to Fedora 35. https://bodhi.fedoraproject.org/updates/FEDORA-2021-320c541101
FEDORA-2021-d6afa567fe has been submitted as an update to Fedora 34. https://bodhi.fedoraproject.org/updates/FEDORA-2021-d6afa567fe
FEDORA-2021-6faffb265a has been submitted as an update to Fedora 33. https://bodhi.fedoraproject.org/updates/FEDORA-2021-6faffb265a
FEDORA-2021-320c541101 has been pushed to the Fedora 35 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-320c541101` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-320c541101 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
FEDORA-2021-d6afa567fe has been pushed to the Fedora 34 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-d6afa567fe` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-d6afa567fe See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
FEDORA-2021-6faffb265a has been pushed to the Fedora 33 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-6faffb265a` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-6faffb265a See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
(In reply to Sergio Correia from comment #51) > @Ben Webb: I finally managed to reproduce the issue and we may have a > potential fix. Would you be able to test it and report back, please? Here's > a scratch build: https://koji.fedoraproject.org/koji/taskinfo?taskID=77924657 It works in my test VM - great! I'll test it on my real hardware next time I'm on site.
(In reply to Ben Webb from comment #60) > (In reply to Sergio Correia from comment #51) > > @Ben Webb: I finally managed to reproduce the issue and we may have a > > potential fix. Would you be able to test it and report back, please? Here's > > a scratch build: https://koji.fedoraproject.org/koji/taskinfo?taskID=77924657 > > It works in my test VM - great! I'll test it on my real hardware next time > I'm on site. I tested it on about half of my F34 workstations and it worked fine on all of them, so looks good. Thanks!
Thanks for testing Ben, much appreciated. @John Call: could you retest with either the builds under testing? From your description, it looks more like an issue of dracut not bringing up the network.
(In reply to Sergio Correia from comment #62) > @John Call: could you retest with either the builds under testing? From your > description, it looks more like an issue of dracut not bringing up the > network. This is working for me now too, thank you! [jcall@jcall-laptop ~]$ uname -a Linux jcall-laptop 5.14.15-100.fc33.x86_64 #1 SMP Wed Oct 27 16:46:06 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux [jcall@jcall-laptop ~]$ sudo dnf list installed '*clevis*' Installed Packages clevis.x86_64 18-2.fc33 @updates-testing clevis-dracut.x86_64 18-2.fc33 @updates-testing clevis-luks.x86_64 18-2.fc33 @updates-testing clevis-pin-tpm2.x86_64 0.2.0-1.fc33 @updates clevis-systemd.x86_64 18-2.fc33 @updates-testing [jcall@jcall-laptop ~]$ cat /etc/dracut.conf.d/clevis.conf #This doesn't work, moving it to /etc/default/grub instead kernel_cmdline="rd.neednet=1 ip=enp0s31f6:dhcp" omit_dracutmodules+=" ifcfg " [jcall@jcall-laptop ~]$ grep CMDLINE /etc/default/grub GRUB_CMDLINE_LINUX="rd.lvm.lv=f33/root rd.luks.uuid=luks-e65ff550-8d23-473f-ba69-1c00d774fcfd rhgb quiet rd.neednet=1 ip=enp0s31f6:dhcp"
FEDORA-2021-320c541101 has been pushed to the Fedora 35 stable repository. If problem still persists, please make note of it in this bug report.
FEDORA-2021-6faffb265a has been pushed to the Fedora 33 stable repository. If problem still persists, please make note of it in this bug report.
FEDORA-2021-d6afa567fe has been pushed to the Fedora 34 stable repository. If problem still persists, please make note of it in this bug report.