1878892 – Unattended reboots broken with latest clevis/dracut

Bug 1878892 - Unattended reboots broken with latest clevis/dracut

Summary: Unattended reboots broken with latest clevis/dracut

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	clevis
Sub Component:
Version:	34
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Sergio Correia
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2020-09-14 19:41 UTC by Ben Webb
Modified:	2021-11-07 01:11 UTC (History)
CC List:	8 users (show)
Fixed In Version:	clevis-18-5.fc36 clevis-18-4.fc35 clevis-18-2.fc33 clevis-18-2.fc34
Clone Of:
Environment:
Last Closed:	2021-10-29 15:32:03 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
log from journalctl -u clevis-luks-askpass (1.63 KB, text/plain) 2020-09-14 20:08 UTC, Ben Webb	no flags	Details
Output of "journalctl -b0 \| grep 'Password Requests'" (1.97 KB, text/plain) 2020-09-21 22:11 UTC, Ben Webb	no flags	Details
Full initrd journal (369.65 KB, text/plain) 2021-06-07 19:18 UTC, Ben Webb	no flags	Details
Image at the point where the boot gets stuck (564.38 KB, image/jpeg) 2021-06-29 20:54 UTC, Ben Webb	no flags	Details
Image at the point where the boot gets stuck, with rd.debug (853.00 KB, image/jpeg) 2021-06-29 20:54 UTC, Ben Webb	no flags	Details
initrd journal with rd_NO_LUKS set (9.14 MB, text/plain) 2021-07-03 00:00 UTC, Ben Webb	no flags	Details
initrd journal using rd.luks.name (8.37 MB, text/plain) 2021-07-03 00:01 UTC, Ben Webb	no flags	Details
Image at the point where the boot gets stuck, with rd.auto (850.99 KB, image/jpeg) 2021-07-07 21:24 UTC, Ben Webb	no flags	Details
Kickstart for basic F34 Server environment to reproduce (1021 bytes, text/plain) 2021-07-12 23:25 UTC, Ben Webb	no flags	Details
View All

Description Ben Webb 2020-09-14 19:41:28 UTC

Description of problem:
Since applying the latest clevis update, unattended reboots hang (at least some of the time) at unlocking the root device, until the user hits Enter.

Version-Release number of selected component (if applicable):
clevis-14-1.fc32.x86_64


How reproducible:
About 30% of the time, perhaps? When applying kernel-5.8.7-200.fc32.x86_64 in combination with clevis-14-1.fc32.x86_64, four of my ~12 or so desktops failed in this fashion. (The others booted normally.) I don't know at this point whether the issue is specific to these machines (seems unlikely; all machines are similar although not identical Dell desktops) or just occurs randomly (e.g. race condition).

Steps to Reproduce:
1. Set up clevis/dracut to unlock the encrypted root device on boot:
dnf install clevis clevis-luks clevis-dracut
cfg='{"t":1,"pins":{"tang":[{"url":"http://a.b.c.d"},{"url":"http://e.f.g.h"}]}}'
clevis luks bind -d /dev/md127 sss $cfg
dracut -f

2. Reboot the machine.

Actual results:
1. "Please enter passphrase for disk luks-xxx" prompt pops up.
2. Network is brought up (I boot without rhgb quiet so see the module output here).
3. Following prompts:
       Starting Forward Password Requests to Clevis...
[ OK ] Finished Forward Password Requests to Clevis.
4. Another "Please enter passphrase for disk luks-xxx" prompt, which hangs (I left this for over 8 hours).

If I hit Enter (empty passphrase) at this point boot proceeds normally, so the device is clearly being unlocked via clevis. But obviously this isn't ideal as it breaks an unattended reboot.

Expected results:
The "Please enter passphrase for disk luks-xxx" prompt pops up, then the disk is unlocked via clevis, then boot proceeds normally without user intervention. This happened with the previous clevis release, i.e. before applying the latest clevis-14-1.fc32.x86_64 package.

Comment 1 Sergio Correia 2020-09-14 19:58:46 UTC

(In reply to Ben Webb from comment #0)
> Description of problem:
> Since applying the latest clevis update, unattended reboots hang (at least
> some of the time) at unlocking the root device, until the user hits Enter.
> 
> Version-Release number of selected component (if applicable):
> clevis-14-1.fc32.x86_64

Would you be able to check if clevis-14-4.fc32 changes anything? It's currently in [testing]: https://koji.fedoraproject.org/koji/buildinfo?buildID=1607396

> 
> 
> How reproducible:
> About 30% of the time, perhaps? When applying kernel-5.8.7-200.fc32.x86_64
> in combination with clevis-14-1.fc32.x86_64, four of my ~12 or so desktops
> failed in this fashion. (The others booted normally.) I don't know at this
> point whether the issue is specific to these machines (seems unlikely; all
> machines are similar although not identical Dell desktops) or just occurs
> randomly (e.g. race condition).
> 
> Steps to Reproduce:
> 1. Set up clevis/dracut to unlock the encrypted root device on boot:
> dnf install clevis clevis-luks clevis-dracut
> cfg='{"t":1,"pins":{"tang":[{"url":"http://a.b.c.d"},{"url":"http://e.f.g.
> h"}]}}'
> clevis luks bind -d /dev/md127 sss $cfg
> dracut -f
> 

Are you unlocking a single device? In this case the root device?


> 2. Reboot the machine.
> 
> Actual results:
> 1. "Please enter passphrase for disk luks-xxx" prompt pops up.
> 2. Network is brought up (I boot without rhgb quiet so see the module output
> here).
> 3. Following prompts:
>        Starting Forward Password Requests to Clevis...
> [ OK ] Finished Forward Password Requests to Clevis.

Could you provide the log from journalctl -u clevis-luks-askpass?

Comment 2 Ben Webb 2020-09-14 20:08:42 UTC

Created attachment 1714849 [details]
log from journalctl -u clevis-luks-askpass

Comment 3 Ben Webb 2020-09-14 20:12:37 UTC

(In reply to Sergio Correia from comment #1)
> Would you be able to check if clevis-14-4.fc32 changes anything?

Sure, I can try that with the next kernel update... but it won't be until next week as we're not on site that often due to covid/work from home.

> Are you unlocking a single device? In this case the root device?

Yes to both.

> Could you provide the log from journalctl -u clevis-luks-askpass?

Sure, I added that as an attachment. I rebooted the machine at 21:31 and it got stuck until I came in the next day at 10:44 to hit Enter.

Comment 4 Sergio Correia 2020-09-17 12:52:41 UTC

(In reply to Ben Webb from comment #3)
> (In reply to Sergio Correia from comment #1)
> > Would you be able to check if clevis-14-4.fc32 changes anything?
> 
> Sure, I can try that with the next kernel update... but it won't be until
> next week as we're not on site that often due to covid/work from home.
> 
> > Are you unlocking a single device? In this case the root device?
> 
> Yes to both.
> 
> > Could you provide the log from journalctl -u clevis-luks-askpass?
> 
> Sure, I added that as an attachment. I rebooted the machine at 21:31 and it
> got stuck until I came in the next day at 10:44 to hit Enter.


Thanks. I believe I reproduced the issue here and I am testing a tentative fix.

If you could test this scratch build and report back the results I would appreciate: https://koji.fedoraproject.org/koji/taskinfo?taskID=51609430
In case you try this package, please rebuild your initramfs with dracut -f before rebooting.

Comment 5 Ben Webb 2020-09-17 17:57:24 UTC

(In reply to Sergio Correia from comment #4)
> Thanks. I believe I reproduced the issue here and I am testing a tentative
> fix.
> 
> If you could test this scratch build and report back the results I would
> appreciate

Will do!

Comment 6 Renaud Métrich 2020-09-17 20:00:43 UTC

It seems to fix the issue, but I can see the Passphrase prompt remains there forever (until GDM starts).

Comment 7 Sergio Correia 2020-09-17 20:10:51 UTC

(In reply to Renaud Métrich from comment #6)
> It seems to fix the issue, 

Thanks for testing!

> but I can see the Passphrase prompt remains there
> forever (until GDM starts).

Do you mean the plymouth prompt? If so, there is this BZ to track this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1672369
Reported upstream here as well: https://gitlab.freedesktop.org/plymouth/plymouth/-/issues/126

Comment 8 Renaud Métrich 2020-09-17 20:21:27 UTC

Plymouth prompt but also "Please enter passphrase for unlocking ..." in text mode.
I need to dig more if it's related, since after switching to "network-legacy" dracut module (due to a NetworkManager issue), the prompt was gone on next reboot.

Comment 9 Renaud Métrich 2020-09-17 20:24:01 UTC

Sorry, also happens with "network-legacy", so may be related to the fix.

Comment 10 Ben Webb 2020-09-21 21:00:59 UTC

(In reply to Sergio Correia from comment #4)
> If you could test this scratch build and report back the results I would
> appreciate: https://koji.fedoraproject.org/koji/taskinfo?taskID=51609430

Unfortunately that build doesn't work either - same issue, some of my machines get stuck during boot but will continue to boot normally if someone physically hits "Enter". Applied via:

$ rpm -q clevis
clevis-14-4.fc32.bz1878892.x86_64
$ rpm -q kernel
kernel-5.8.7-200.fc32.x86_64
kernel-5.8.8-200.fc32.x86_64
kernel-5.8.9-200.fc32.x86_64
$ sudo dracut -f 5.8.9-200.fc32.x86_64
$ sudo reboot

Comment 11 Sergio Correia 2020-09-21 21:26:39 UTC

(In reply to Ben Webb from comment #10)
> (In reply to Sergio Correia from comment #4)
> > If you could test this scratch build and report back the results I would
> > appreciate: https://koji.fedoraproject.org/koji/taskinfo?taskID=51609430
> 
> Unfortunately that build doesn't work either - same issue, some of my
> machines get stuck during boot but will continue to boot normally if someone
> physically hits "Enter". Applied via:
> 
> $ rpm -q clevis
> clevis-14-4.fc32.bz1878892.x86_64

Did you update all clevis packages?
$ rpm -qa | grep clevis

Comment 12 Ben Webb 2020-09-21 21:28:33 UTC

(In reply to Sergio Correia from comment #11)
> Did you update all clevis packages?
> $ rpm -qa | grep clevis

Yes.

$ rpm -qa|grep clevis
clevis-systemd-14-4.fc32.bz1878892.x86_64
clevis-dracut-14-4.fc32.bz1878892.x86_64
clevis-14-4.fc32.bz1878892.x86_64
clevis-luks-14-4.fc32.bz1878892.x86_64

Comment 13 Sergio Correia 2020-09-21 21:35:57 UTC

(In reply to Ben Webb from comment #12)
> (In reply to Sergio Correia from comment #11)
> > Did you update all clevis packages?
> > $ rpm -qa | grep clevis
> 
> Yes.
> 
> $ rpm -qa|grep clevis
> clevis-systemd-14-4.fc32.bz1878892.x86_64
> clevis-dracut-14-4.fc32.bz1878892.x86_64
> clevis-14-4.fc32.bz1878892.x86_64
> clevis-luks-14-4.fc32.bz1878892.x86_64

Alright. Could you please share the output of "journalctl -b0 | grep 'Password Requests'" from one of the machines that presented this issue?

Comment 14 Ben Webb 2020-09-21 22:11:19 UTC

Created attachment 1715605 [details]
Output of "journalctl -b0 | grep 'Password Requests'"

Comment 15 Fedora Program Management 2021-04-29 16:37:44 UTC

This message is a reminder that Fedora 32 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 32 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged  change the 'version' to a later Fedora 
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 16 Ben Cotton 2021-05-25 16:44:28 UTC

Fedora 32 changed to end-of-life (EOL) status on 2021-05-25. Fedora 32 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 17 Ben Webb 2021-05-25 17:14:11 UTC

FWIW, the issue (or a very similar one) is still present in F33 and F34 (clevis-18-1.fc34.x86_64) - about 20-30% of my machines do not boot.  Everything works very reliably though if I pin clevis to 13-1 (although BTW recently I've had to pin jose to 10-9 too; older clevis does not work with jose-11-1.fc34.x86_64, failing at boot with "JWE missing required 'clevis.tang.adv' header parameter!" suggesting that newer jose is missing a Conflicts: or similar with older clevis).

Happy to provide more information if you need it to help track down this issue - although I am rarely on site.

Comment 18 Sergio Correia 2021-05-25 19:11:35 UTC

Thanks for reopening this, Ben. Unfortunately I have been unable to reproduce it, but I have a suspicion on what could have caused the issue, so I will prepare a scratch build and it would be much appreciated if you could test it out. Good point about jose and "Conflicts", also.

Comment 19 Sergio Correia 2021-05-26 02:30:56 UTC

@Ben: It would be great if you could test this build and report back whether it helps with the issue at hand: https://koji.fedoraproject.org/koji/taskinfo?taskID=68750126 -- please rebuild the initramfs after installing the updated packages.

Comment 20 Ben Webb 2021-05-26 20:06:15 UTC

Fortuitous timing! I was on site today. Was able to apply this scratch build to all but one of my machines (the one holdout is still on F33), rebuild the initrd, and every machine now boots normally. So I think you got it. Thanks!

For reference here's the package load on one machine:

$ rpm -qa|grep 'clevis\|jose'
clevis-pin-tpm2-0.3.0-1.fc34.x86_64
libjose-11-1.fc34.x86_64
jose-11-1.fc34.x86_64
clevis-18-1.fc34.bz1878892.01.x86_64
clevis-luks-18-1.fc34.bz1878892.01.x86_64
clevis-systemd-18-1.fc34.bz1878892.01.x86_64
clevis-dracut-18-1.fc34.bz1878892.01.x86_64
$ uname -r
5.12.6-300.fc34.x86_64

Comment 21 Sergio Correia 2021-05-27 17:34:45 UTC

Good news, thanks for testing!

For reference, in this scratch build I reverted upstream commit cbb64c4 ("dracut: favour systemd units over dracut hooks")[1].
@Jonathan, would you have any insights here?

[1] https://github.com/latchset/clevis/commit/cbb64c4efb66b57a07fe1afbee10b012523ab71c

Comment 22 Jonathan Lebon 2021-06-02 15:52:14 UTC

Just to make sure I understand correctly, is the latest issue you're hitting still that the systemd clevis service *does* unlock the device, but for whatever reason, you still have to press Enter before it goes forward?

Can you provide the full journal logs of the initrd when that happens?

Can you also provide the versions of dracut and systemd? Can you verify that your initrd contains all of: `cryptsetup.target`, `cryptsetup-pre.target`, and `remote-cryptsetup.target`? (You can do e.g. `lsinitrd | grep cryptsetup`).

Comment 23 Ben Webb 2021-06-04 19:13:20 UTC

(In reply to Jonathan Lebon from comment #22)
> Just to make sure I understand correctly, is the latest issue you're hitting
> still that the systemd clevis service *does* unlock the device, but for
> whatever reason, you still have to press Enter before it goes forward?

That was certainly the issue with F32, which was never resolved. I can confirm that the system still doesn't boot with F34, but can't be 100% sure it was the same issue. I can check next time I'm on site and get the logs/versions you're asking for (the machines all boot right now because they are using the scratch build Sergio provided; I will need to 'break' one first).

> Can you provide the full journal logs of the initrd when that happens?

journalctl --system ? Or something else?

Comment 24 Jonathan Lebon 2021-06-04 19:21:59 UTC

(In reply to Ben Webb from comment #23)
> (In reply to Jonathan Lebon from comment #22)
> > Just to make sure I understand correctly, is the latest issue you're hitting
> > still that the systemd clevis service *does* unlock the device, but for
> > whatever reason, you still have to press Enter before it goes forward?
> 
> That was certainly the issue with F32, which was never resolved. I can
> confirm that the system still doesn't boot with F34, but can't be 100% sure
> it was the same issue. I can check next time I'm on site and get the
> logs/versions you're asking for (the machines all boot right now because
> they are using the scratch build Sergio provided; I will need to 'break' one
> first).

Gotcha. Yeah, would be good to make sure it's the same failure or something else happening here.

> > Can you provide the full journal logs of the initrd when that happens?
> 
> journalctl --system ? Or something else?

Yes, something like `journalctl --system -b 0`. If you're in the initramfs emergency prompt, it might be hard for you to get the logs off unless you can connect to the machine from another machine via serial, or since you should have networking in the initramfs (since you're using Tang pinning), you might be able to POST it someplace if those machines have public access (see example curl invocations in https://paste.centos.org/api).

Comment 25 Jonathan Lebon 2021-06-04 19:29:14 UTC

(In reply to Jonathan Lebon from comment #24)
> (see example curl invocations in https://paste.centos.org/api).

Ahh too bad, that actually requires an API key and at least this CentOS instance doesn't let you request your own key AFAICT. There's probably other pastebins out there which do if you really need to do this.

Comment 26 Ben Webb 2021-06-07 19:18:03 UTC

Created attachment 1789276 [details]
Full initrd journal

Comment 27 Ben Webb 2021-06-07 19:30:07 UTC

(In reply to Jonathan Lebon from comment #22)
> Can you provide the full journal logs of the initrd when that happens?

See attachment https://bugzilla.redhat.com/attachment.cgi?id=1789276

> Can you also provide the versions of dracut and systemd? Can you verify that
> your initrd contains all of: `cryptsetup.target`, `cryptsetup-pre.target`,
> and `remote-cryptsetup.target`? (You can do e.g. `lsinitrd | grep
> cryptsetup`).

This is what I ran on one of my affected systems:

# dnf distro-sync --refresh
# rpm -q dracut
dracut-054-12.git20210521.fc34.x86_64
# rpm -q clevis
clevis-18-1.fc34.x86_64
# rpm -q systemd
systemd-248.3-1.fc34.x86_64
# uname -r
5.12.8-300.fc34.x86_64
# dracut -f
# lsinitrd /boot/initramfs-5.12.8-300.fc34.x86_64.img | grep cryptsetup
drwxr-xr-x   2 root     root            0 May 21 05:25 etc/systemd/system/cryptsetup.target.wants
lrwxrwxrwx   1 root     root           48 May 21 05:25 etc/systemd/system/cryptsetup.target.wants/clevis-luks-askpass.path -> /usr/lib/systemd/system/clevis-luks-askpass.path
-rwxr-xr-x   1 root     root       491256 May 21 05:25 usr/lib64/libcryptsetup.so.12.6.0
lrwxrwxrwx   1 root     root           35 May 21 05:25 usr/lib64/libcryptsetup.so.12 -> ../../lib64/libcryptsetup.so.12.6.0
-rw-r--r--   1 root     root          473 May 15 09:33 usr/lib/systemd/system/cryptsetup-pre.target
-rw-r--r--   1 root     root          420 May 15 09:33 usr/lib/systemd/system/cryptsetup.target
-rwxr-xr-x   1 root     root        70168 May 15 10:14 usr/lib/systemd/systemd-cryptsetup
-rwxr-xr-x   1 root     root        41280 May 15 10:14 usr/lib/systemd/system-generators/systemd-cryptsetup-generator
lrwxrwxrwx   1 root     root           27 May 21 05:25 usr/lib/systemd/system/initrd-root-device.target.wants/remote-cryptsetup.target -> ../remote-cryptsetup.target
-rw-r--r--   1 root     root          557 May 15 09:33 usr/lib/systemd/system/remote-cryptsetup.target
lrwxrwxrwx   1 root     root           20 May 21 05:25 usr/lib/systemd/system/sysinit.target.wants/cryptsetup.target -> ../cryptsetup.target
-rw-r--r--   1 root     root           35 May 21 05:25 usr/lib/tmpfiles.d/cryptsetup.conf
-rwxr-xr-x   1 root     root       142104 May 21 05:25 usr/sbin/cryptsetup
# reboot

The boot hangs at the prompt for the password for the luks volume. On this occasion, hitting Enter doesn't do anything (I just get another identical prompt). Eventually it times out, and after entering the root password I get the emergency shell. I ran journalctl --system -b 0 in that shell and uploaded the result (with mild redaction of hostname/IP) here.

I can't claim to know what's going on here, but it certainly looks to me as if NetworkManager is being called too early (or with too short a timeout) and it gives up trying to configure the network before the NIC gets a link. I've previously hacked around that (see https://bugzilla.redhat.com/show_bug.cgi?id=1702524#c23) but that hack only worked with older clevis (perhaps because it only "fixed" the dracut hook, not the systemd service?)

Comment 28 Jonathan Lebon 2021-06-08 20:27:39 UTC

Hmm, I think you're missing `rd.luks.options=_netdev`. This ensures that networking is up before systemd tries to unlock the LUKS device. Can you try adding that to the kernel command-line?

Comment 29 Ben Webb 2021-06-08 21:29:23 UTC

(In reply to Jonathan Lebon from comment #28)
> Hmm, I think you're missing `rd.luks.options=_netdev`.

I believe that I tried adding `_netdev` to the root filesystem in `/etc/fstab` in the past with no effect - is this doing something different? (You no doubt saw that I also have `rd.neednet=1`.) But I'll try this next time I'm on site and report back, thanks!

Comment 30 Jonathan Lebon 2021-06-09 15:13:16 UTC

(In reply to Ben Webb from comment #29)
> (In reply to Jonathan Lebon from comment #28)
> > Hmm, I think you're missing `rd.luks.options=_netdev`.
> 
> I believe that I tried adding `_netdev` to the root filesystem in
> `/etc/fstab` in the past with no effect - is this doing something different?

Yes, they're related but different. You'll want to put it on the block device itself (it can be added as an option in /etc/crypttab, but you can also use rd.luks.options). Otherwise, systemd won't know that unlocking requires networking, even if the subsequent filesystem mount itself is marked as such.

Comment 31 Ben Webb 2021-06-11 23:55:05 UTC

(In reply to Jonathan Lebon from comment #28)
> Hmm, I think you're missing `rd.luks.options=_netdev`. This ensures that
> networking is up before systemd tries to unlock the LUKS device. Can you try
> adding that to the kernel command-line?

Unfortunately while that made a difference, it made things worse, not better! With that option, I don't see any output related to clevis unlocking, and no LUKS password prompt; it just gets stuck after bringing up the network (which incidentally has never been an issue - I can always ping these machines when they get stuck). I left the machine for an hour or so and it didn't fail over to the emergency shell, so no journalctl output, sorry (although I do have a photo of the screen if that helps!)

FWIW, the last output I see on boot is NetworkManager starting up (looks OK, ends up in CONNECTED_GLOBAL state) followed by "Finished nm-wait-online-initrd.service" and "Starting dracut initqueue hook..."

This is with the same clevis/dracut/systemd as comment #27, although now with kernel 5.12.9-300.fc34.x86_64.

Comment 32 Jonathan Lebon 2021-06-28 18:13:30 UTC

(In reply to Ben Webb from comment #31)
> (In reply to Jonathan Lebon from comment #28)
> > Hmm, I think you're missing `rd.luks.options=_netdev`. This ensures that
> > networking is up before systemd tries to unlock the LUKS device. Can you try
> > adding that to the kernel command-line?
> 
> Unfortunately while that made a difference, it made things worse, not
> better! With that option, I don't see any output related to clevis
> unlocking, and no LUKS password prompt; it just gets stuck after bringing up
> the network (which incidentally has never been an issue - I can always ping
> these machines when they get stuck). I left the machine for an hour or so
> and it didn't fail over to the emergency shell, so no journalctl output,
> sorry (although I do have a photo of the screen if that helps!)

Yeah, picture would help if you still have it!

> FWIW, the last output I see on boot is NetworkManager starting up (looks OK,
> ends up in CONNECTED_GLOBAL state) followed by "Finished
> nm-wait-online-initrd.service" and "Starting dracut initqueue hook..."

Hmm, can you try booting with rd.debug so we can try to have a peak into what dracut is waiting on?

Comment 33 Ben Webb 2021-06-29 20:54:13 UTC

Created attachment 1796022 [details]
Image at the point where the boot gets stuck

Comment 34 Ben Webb 2021-06-29 20:54:41 UTC

Created attachment 1796023 [details]
Image at the point where the boot gets stuck, with rd.debug

Comment 35 Ben Webb 2021-06-29 21:00:23 UTC

(In reply to Jonathan Lebon from comment #32)
> Yeah, picture would help if you still have it!

See attachment 1796022 [details].

> Hmm, can you try booting with rd.debug so we can try to have a peak into
> what dracut is waiting on?

I added `rd.shell rd.debug log_buf_len=1M rd.luks.options=_netdev` to the command line and it still refused to give me an emergency shell (although I only left it for ~10 minutes this time) so you get a low-tech photo again. See attachment 1796023 [details]. This is with dracut-055-2.fc34.x86_64, clevis-18-1.fc34.x86_64, systemd-248.3-1.fc34.x86_64, kernel 5.12.11-300.fc34.x86_64.

Comment 36 Jonathan Lebon 2021-06-30 14:27:45 UTC

Hmm OK yes, I think I see what's going on. I think the 90crypt dracut module is conflicting with systemd-cryptsetup-generator.

As a test, can you try using rd_NO_LUKS? 90crypt respects this but not systemd-cryptsetup-generator.

Are you using `rd.luks.uuid=<UUID>` by change? As a second independent test, can you instead try using `rd.luks.name=$UUID=$NAME`? 90crypt doesn't handle that karg, but systemd-cryptsetup-generator does.

Comment 37 Ben Webb 2021-07-03 00:00:39 UTC

Created attachment 1797373 [details]
initrd journal with rd_NO_LUKS set

Comment 38 Ben Webb 2021-07-03 00:01:26 UTC

Created attachment 1797374 [details]
initrd journal using rd.luks.name

Comment 39 Ben Webb 2021-07-03 00:05:39 UTC

(In reply to Jonathan Lebon from comment #36)
> As a test, can you try using rd_NO_LUKS? 90crypt respects this but not
> systemd-cryptsetup-generator.

With that it looks like dracut can't find the root device. Eventually it gives up and gives me an emergency shell. Contents of `journalctl --system -b 0` as attachment 1797373 [details].

> Are you using `rd.luks.uuid=<UUID>` by change? As a second independent test,
> can you instead try using `rd.luks.name=$UUID=$NAME`? 90crypt doesn't handle
> that karg, but systemd-cryptsetup-generator does.

I am, so I tried rd.luks.name as you suggest. This also results in not being able to find the root device, attachment 1797374 [details].

Comment 40 Jonathan Lebon 2021-07-05 19:12:24 UTC

Can you describe your setup? If I understand correctly from your logs, it seems like you're doing: root and swap on LVM on LUKS on RAID. Is that right?

I think the problem now is that we're hitting a dependency cycle very similar to https://github.com/dracutdevs/dracut/pull/931:
- systemd-cryptsetup is waiting for remote-fs-pre.target
- remote-fs-pre.target is waiting for dracut-initqueue.service
- dracut-initqueue.service is waiting for /dev/vg0/root to appear
- /dev/vg0/root is on LUKS, so won't appear until systemd-cryptsetup runs

As another test, on top of rd_NO_LUKS, can you also add rd.auto and remove rd.lvm.lv=vg0/root and rd.lvm.lv=vg0/swap?

Comment 41 Ben Webb 2021-07-07 21:24:24 UTC

Created attachment 1799438 [details]
Image at the point where the boot gets stuck, with rd.auto

Comment 42 Ben Webb 2021-07-07 21:30:21 UTC

(In reply to Jonathan Lebon from comment #40)
> Can you describe your setup? If I understand correctly from your logs, it
> seems like you're doing: root and swap on LVM on LUKS on RAID. Is that right?

Yes. Two physical disks each with three partitions. /boot and /boot/efi are each one partition on each disk mirrored with mdraid; the remaining partition is also a RAID 1, with as you said, LVM on LUKS on it. The LUKS device is unlocked with clevis talking to one of a pair of tang servers.

> As another test, on top of rd_NO_LUKS, can you also add rd.auto and remove
> rd.lvm.lv=vg0/root and rd.lvm.lv=vg0/swap?

This also gets stuck at "Remote Encrypted Volumes" but it does look like it's at least trying to unlock the root device. I didn't get a shell after 15 minutes or so, so attachment #1799438 [details] has a photo.

Comment 43 Ben Webb 2021-07-12 23:25:37 UTC

Created attachment 1801000 [details]
Kickstart for basic F34 Server environment to reproduce

Comment 44 Ben Webb 2021-07-12 23:52:29 UTC

(In reply to Jonathan Lebon from comment #40)
> Can you describe your setup? If I understand correctly from your logs, it
> seems like you're doing: root and swap on LVM on LUKS on RAID. Is that right?

I was able to reproduce something pretty similar to my setup in a VM, so I can easily attach a serial console or reboot it without bothering users of my actual hardware (or perhaps you can reproduce at your end). See attachment 1801000 [details]. This is a kickstart for Fedora 34 Server, which I installed via VirtualBox. No RAID in this setup, but I added an encrypted volume group via the installer's custom partitioning. With updates applied this boots with the 5.12.14 kernel, dracut 055-3, clevis-18, and prompts for the encryption password as you would expect.

Next I set up clevis/tang to talk to my two tang servers (at MYIP1 and MYIP2) with

dnf install clevis clevis-luks clevis-dracut
cfg='{"t":1,"pins":{"tang":[{"url":"http://MYIP1"},{"url":"http://MYIP2"}]}}'
clevis luks bind -d /dev/sda2 sss $cfg
dracut -f

Then added `ip=dhcp rd.neednet=1` to the kernel command line and rebooted.

This does not work, same as on my physical F34 desktops. Looks like the network doesn't come up, and I see no output from clevis. (By "does not work", I mean I can still type in the encryption password and boot normally, but clevis doesn't unlock the device automatically.)

It *does* work if I use the clevis scratch build provided in comment #19 (clevis bz1878892.01) and downgrade dracut to the original F34 version (`dnf distro-sync --disablerepo=updates dracut\*` yielding dracut 053-4).

I also tried previous F34 updates dracut versions (054-6, 054-12, and 055-2) and these all fall into the "does not work" category.

Comment 45 Jonathan Lebon 2021-07-21 20:17:41 UTC

Can you try with the latest packages of everything and this dracut build: https://jlebon.fedorapeople.org/dracut-rhbz1878892/ ? Make sure also to use _netdev in your /etc/crypttab and regenerate the initramfs.

I think I've narrowed down the root issue here, and I've got a somewhat hacky approach which seems to fix it for me locally, though would be good to confirm with you before submitting it upstream.

Comment 46 Jonathan Lebon 2021-07-21 21:03:21 UTC

Going to write things here before I forget.

So the main issue here is indeed essentially what I've described in https://bugzilla.redhat.com/show_bug.cgi?id=1878892#c40 (except it's not just LVM -- the same applies for 90crypt, RAID, and the rootfs generator, the latter of which was hacked around with https://github.com/dracutdevs/dracut/pull/931). Essentially, the problem is basically that we can't unlock until we exit the initqueue (dracut-initqueue.service has Before=remote-fs-re.target), but we can't exit the initqueue until we unlock (because a bunch of modules add dracut hooks to ensure devices they expect to exist exist before exiting).

What confounds things further (and took me a while to separate out) is the recent change in dracut to have NetworkManager run as a proper systemd unit instead of as a one-shot service. That change essentially now requires people who use Tang to use _netdev (which they should anyway, but it did work without it before) because otherwise there's a deadlock between NM waiting for D-Bus and D-Bus waiting for sysinit.target, which is after cryptsetup.target.

The patch I've added in https://jlebon.fedorapeople.org/dracut-rhbz1878892/ is simply:

diff --git a/modules.d/98dracut-systemd/dracut-initqueue.service b/modules.d/98dracut-systemd/dracut-initqueue.service
index 3a8679a5..e8154ced 100644
--- a/modules.d/98dracut-systemd/dracut-initqueue.service
+++ b/modules.d/98dracut-systemd/dracut-initqueue.service
@@ -6,8 +6,8 @@
 Description=dracut initqueue hook
 Documentation=man:dracut-initqueue.service(8)
 DefaultDependencies=no
-Before=remote-fs-pre.target
-Wants=remote-fs-pre.target
+Before=remote-fs.target
+Wants=remote-fs.target
 After=systemd-udev-trigger.service
 Wants=systemd-udev-trigger.service
 ConditionPathExists=/usr/lib/initrd-release

What this does is it allows systemd-cryptsetup to run in parallel to the initqueue, because the service has `After=remote-fs-pre`. So then systemd as usual asks for the password, and clevis-luks-askpass.service as expected kicks in, unlocks the device, and all the initqueue/finished hooks are happy. But that patch undermines this really old dracut commit: https://github.com/dracutdevs/dracut/commit/b31250e7e6e6e104674dc304ba74965bb56074d6. So we need to make sure we understand it and we're not introducing another regression there.

We should make sure to get input from dracut and systemd devs for a clean way to fix this. Ideally IMO long-term I think all of these dracut modules (crypt, lvm, raid) should, instead of outputting dracut hooks, if systemd is present ship systemd generators which create systemd units that can sanely be ordered with respect to e.g. udev, networking, remote-fs.target, cryptsetup.target, etc...

Comment 47 Ben Webb 2021-07-28 01:52:02 UTC

(In reply to Jonathan Lebon from comment #45)
> Can you try with the latest packages of everything and this dracut build:
> https://jlebon.fedorapeople.org/dracut-rhbz1878892/ ? Make sure also to use
> _netdev in your /etc/crypttab and regenerate the initramfs.

Unfortunately this doesn't work for me in the VM I set up (myrepo is a local repository with dracut 53 and the clevis scratch build from comment #19):

# dnf distro-sync --refresh --disablerepo=myrepo
# rpm -Uvh dracut*rhbz1878892*.rpm
# cat /etc/crypttab
luks-12f7e3ad-3039-4de8-bedb-ac565aebca06 UUID=12f7e3ad-3039-4de8-bedb-ac565aebca06 none discard,_netdev
# dracut /boot/initramfs-5.13.4-200.fc34.x86_64.test.img 5.13.4-200.fc34.x86_64
# rpm -q dracut clevis
dracut-055-4.rhbz1878892.fc34.x86_64
clevis-18-1.fc34.x86_64

Then rebooted, hit E at the GRUB screen and changed the initrd from initramfs-5.13.4-200.fc34.x86_64.img (generated with dracut 53, which works) to initramfs-5.13.4-200.fc34.x86_64.test.img

Here's the tail of the serial console after waiting a few minutes:

[    2.897760] RPC: Registered tcp NFSv4.1 backchannel transport module.
[^[[0;32m  OK  ^[[0m] Finished ^[[0;1;39mdracut pre-udev hook^[[0m.
[    2.977734] audit: type=1130 audit(1627436056.550:10): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=dracut-pre-udev comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
         Starting ^[[0;1;39mRule-based Manage…for Device Events and Files^[[0m...
[^[[0;32m  OK  ^[[0m] Started ^[[0;1;39mRule-based Manager for Device Events and Files^[[0m.
         Starting ^[[0;1;39mdracut pre-trigger hook^[[0m...
[^[[0;32m  OK  ^[[0m] Finished ^[[0;1;39mdracut pre-trigger hook^[[0m.
         Starting ^[[0;1;39mColdplug All udev Devices^[[0m...
[^[[0;32m  OK  ^[[0m] Finished ^[[0;1;39mColdplug All udev Devices^[[0m.
         Starting ^[[0;1;39mnm-initrd.service^[[0m...
         Starting ^[[0;1;39mShow Plymouth Boot Screen^[[0m...
^[[2J^[[3J^[[-1;-1f         Starting ^[[0;1;39mWait for udev To …plete Device Initialization^[[0m...^M
[^[[0;32m  OK  ^[[0m] Started ^[[0;1;39mShow Plymouth Boot Screen^[[0m.^M
[^[[0;32m  OK  ^[[0m] Started ^[[0;1;39mForward Password R…s to Plymouth Directory Watch^[[0m.^M
[^[[0;32m  OK  ^[[0m] Reached target ^[[0;1;39mPath Units^[[0m.^M
<info>  [1627436056.8312] NetworkManager (version 1.30.6-1.fc34) is starting... (for the first time)^M
<info>  [1627436056.8344] Read config: /etc/NetworkManager/NetworkManager.conf (lib: initrd-no-auto-default.conf)^M
[    3.299118] ACPI: video: Video Device [GFX0] (multi-head: yes  rom: no  post: no)
[    3.330326] ACPI: \_SB_.LNKD: Enabled at IRQ 11
[    3.380469] input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/LNXVIDEO:00/input/input6
[    3.407886] e1000: Intel(R) PRO/1000 Network Driver
[    3.408742] e1000: Copyright (c) 1999-2006 Intel Corporation.
[    3.435047] vboxguest: host-version: 6.1.18r142142 0x8000000f
[    3.452679] vbg_heartbeat_init: Setting up heartbeat to trigger every 2000 milliseconds
[    3.514601] ACPI: \_SB_.LNKC: Enabled at IRQ 9
[    3.544952] input: VirtualBox mouse integration as /devices/pci0000:00/0000:00:04.0/input/input7
[    3.606281] vboxguest: misc device minor 124, IRQ 11, I/O port d020, MMIO at 0x00000000f0400000 (size 0x0000000000400000)
[^[[0;32m  OK  ^[[0m] Found device ^[[0;1;39mVBOX_HARDDISK 2^[[0m.^M
         Starting ^[[0;1;39mCryptography Setu…3039-4de8-bedb-ac565aebca06^[[0m...^M

Please enter passphrase for disk VBOX_HARDDISK (luks-12f7e3ad-3039-4de8-bedb-ac565aebca06)::[    3.889077] vboxvideo 0000:00:02.0: vgaarb: deactivate vga console
[    3.930922] Console: switching to colour dummy device 80x25
[    3.993548] [drm] VRAM 00c00000
[    4.052226] [drm] Initialized vboxvideo 1.0.0 20130823 for 0000:00:02.0 on minor 0
[    4.114858] fbcon: vboxvideodrmfb (fb0) is primary device
[    4.135906] Console: switching to colour frame buffer device 112x39
[    4.189578] vboxvideo 0000:00:02.0: [drm] fb0: vboxvideodrmfb frame buffer device
[    4.594098] e1000 0000:00:03.0 eth0: (PCI:33MHz:32-bit) 08:00:27:e4:cd:b7
[    4.599280] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection
[    4.608944] e1000 0000:00:03.0 enp0s3: renamed from eth0
[  183.676390] kauditd_printk_skb: 11 callbacks suppressed

Comment 48 John Call 2021-09-02 07:57:29 UTC

This is impacting me as well. I wanted to automatically unlock my F34 laptop, but a basic LUKS+Clevis/Tang is failing to automatically unlock the device.

Comment 49 Sergio Correia 2021-09-02 11:45:15 UTC

(In reply to John Call from comment #48)
> This is impacting me as well. I wanted to automatically unlock my F34
> laptop, but a basic LUKS+Clevis/Tang is failing to automatically unlock the
> device.

Is it this same issue? Could you elaborate on your setup, please?

Comment 50 John Call 2021-09-03 17:25:24 UTC

(In reply to Sergio Correia from comment #49)
> Is it this same issue? Could you elaborate on your setup, please?

Yes I believe it is the same issue of systemd unit ordering/dependencies. I tried using my procedure which has worked for RHEL7 and RHEL8 on Fedora34, and it didn't work. I noticed on RHEL8 that I can ping the server during the "Please enter passphrase..." prompt, but not on Fedora 34. It seems to me that the networking details given on the kernel_cmdline are either being ignored, or not applied early enough.

Here is my basic steps for binding the root disk to tang...
  sudo dnf -y install clevis-dracut
  echo 'kernel_cmdline="ip=dhcp"' | sudo tee /etc/dracut.conf.d/clevis.conf
  sudo dracut -f
  sudo reboot

I'm attempting a very basic deployment, that looks like this:
[jcall@fedora ~]$ lsblk
NAME                                          MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sr0                                            11:0    1 1024M  0 rom   
zram0                                         251:0    0  1.9G  0 disk  [SWAP]
vda                                           252:0    0   20G  0 disk  
├─vda1                                        252:1    0    1G  0 part  /boot
└─vda2                                        252:2    0   19G  0 part  
  └─luks-3cef56d0-1221-4eb3-87e9-f7aa811e2036 253:0    0   19G  0 crypt /home

[jcall@fedora ~]$ findmnt /
TARGET SOURCE                                                       FSTYPE OPTIONS
/      /dev/mapper/luks-3cef56d0-1221-4eb3-87e9-f7aa811e2036[/root] btrfs  rw,relatime,seclabel,compress=zstd:1,space_cache,subvolid=258,subvol=/root


Wow, off topic, BTRFS looks weird to my XFS-trained eyes...

Comment 51 Sergio Correia 2021-10-27 19:19:16 UTC

@Ben Webb: I finally managed to reproduce the issue and we may have a potential fix. Would you be able to test it and report back, please? Here's a scratch build: https://koji.fedoraproject.org/koji/taskinfo?taskID=77924657

Comment 52 Fedora Update System 2021-10-29 15:29:19 UTC

FEDORA-2021-f5be677d41 has been submitted as an update to Fedora 36. https://bodhi.fedoraproject.org/updates/FEDORA-2021-f5be677d41

Comment 53 Fedora Update System 2021-10-29 15:32:03 UTC

FEDORA-2021-f5be677d41 has been pushed to the Fedora 36 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 54 Fedora Update System 2021-10-29 15:53:09 UTC

FEDORA-2021-320c541101 has been submitted as an update to Fedora 35. https://bodhi.fedoraproject.org/updates/FEDORA-2021-320c541101

Comment 55 Fedora Update System 2021-10-29 15:53:26 UTC

FEDORA-2021-d6afa567fe has been submitted as an update to Fedora 34. https://bodhi.fedoraproject.org/updates/FEDORA-2021-d6afa567fe

Comment 56 Fedora Update System 2021-10-29 15:53:38 UTC

FEDORA-2021-6faffb265a has been submitted as an update to Fedora 33. https://bodhi.fedoraproject.org/updates/FEDORA-2021-6faffb265a

Comment 57 Fedora Update System 2021-10-29 21:17:38 UTC

FEDORA-2021-320c541101 has been pushed to the Fedora 35 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-320c541101`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-320c541101

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 58 Fedora Update System 2021-10-30 01:00:53 UTC

FEDORA-2021-d6afa567fe has been pushed to the Fedora 34 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-d6afa567fe`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-d6afa567fe

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 59 Fedora Update System 2021-10-30 01:16:32 UTC

FEDORA-2021-6faffb265a has been pushed to the Fedora 33 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-6faffb265a`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-6faffb265a

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 60 Ben Webb 2021-11-01 18:42:05 UTC

(In reply to Sergio Correia from comment #51)
> @Ben Webb: I finally managed to reproduce the issue and we may have a
> potential fix. Would you be able to test it and report back, please? Here's
> a scratch build: https://koji.fedoraproject.org/koji/taskinfo?taskID=77924657

It works in my test VM - great! I'll test it on my real hardware next time I'm on site.

Comment 61 Ben Webb 2021-11-02 22:05:51 UTC

(In reply to Ben Webb from comment #60)
> (In reply to Sergio Correia from comment #51)
> > @Ben Webb: I finally managed to reproduce the issue and we may have a
> > potential fix. Would you be able to test it and report back, please? Here's
> > a scratch build: https://koji.fedoraproject.org/koji/taskinfo?taskID=77924657
> 
> It works in my test VM - great! I'll test it on my real hardware next time
> I'm on site.

I tested it on about half of my F34 workstations and it worked fine on all of them, so looks good. Thanks!

Comment 62 Sergio Correia 2021-11-03 12:51:01 UTC

Thanks for testing Ben, much appreciated.

@John Call: could you retest with either the builds under testing? From your description, it looks more like an issue of dracut not bringing up the network.

Comment 63 John Call 2021-11-03 21:55:06 UTC

(In reply to Sergio Correia from comment #62)
> @John Call: could you retest with either the builds under testing? From your
> description, it looks more like an issue of dracut not bringing up the
> network.

This is working for me now too, thank you!

[jcall@jcall-laptop ~]$ uname -a
Linux jcall-laptop 5.14.15-100.fc33.x86_64 #1 SMP Wed Oct 27 16:46:06 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

[jcall@jcall-laptop ~]$ sudo dnf list installed '*clevis*'
Installed Packages
clevis.x86_64                        18-2.fc33                  @updates-testing
clevis-dracut.x86_64                 18-2.fc33                  @updates-testing
clevis-luks.x86_64                   18-2.fc33                  @updates-testing
clevis-pin-tpm2.x86_64               0.2.0-1.fc33               @updates        
clevis-systemd.x86_64                18-2.fc33                  @updates-testing

[jcall@jcall-laptop ~]$ cat /etc/dracut.conf.d/clevis.conf 
#This doesn't work, moving it to /etc/default/grub instead
kernel_cmdline="rd.neednet=1 ip=enp0s31f6:dhcp"
omit_dracutmodules+=" ifcfg "

[jcall@jcall-laptop ~]$ grep CMDLINE /etc/default/grub 
GRUB_CMDLINE_LINUX="rd.lvm.lv=f33/root rd.luks.uuid=luks-e65ff550-8d23-473f-ba69-1c00d774fcfd rhgb quiet rd.neednet=1 ip=enp0s31f6:dhcp"

Comment 64 Fedora Update System 2021-11-06 01:26:44 UTC

FEDORA-2021-320c541101 has been pushed to the Fedora 35 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 65 Fedora Update System 2021-11-07 01:10:23 UTC

FEDORA-2021-6faffb265a has been pushed to the Fedora 33 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 66 Fedora Update System 2021-11-07 01:11:06 UTC

FEDORA-2021-d6afa567fe has been pushed to the Fedora 34 stable repository.
If problem still persists, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.