1702524 – clevis not decrypting on boot for root fs

Summary: clevis not decrypting on boot for root fs

Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	clevis
Version:	31
Hardware:	x86_64
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Sergio Correia
QA Contact:	Fedora Extras Quality Assurance

Reported:	2019-04-24 03:00 UTC by David Luong
Modified:	2020-09-25 16:40 UTC
CC List:	13 users
Fixed In Version:	clevis-14-1.fc32
Last Closed:	2020-09-07 17:13:47 UTC
Type:	Bug

Description David Luong 2019-04-24 03:00:09 UTC

Description of problem:
After installing clevis and configuring it to use tang, it neither decrypts the device nor prompts for a password when the volume is mounted on /

Version-Release number of selected component (if applicable):
clevis-11-4.fc29.src.rpm

How reproducible:
Always

Steps to Reproduce:
1.  Install clevis clevis-dracut clevis-luks
2.  clevis luks bind -d /dev/nvme0n1p3 sss "$cfg"  OR
    clevis luks bind -d /dev/mapper/fedora-root sss "$cfg"  
3.  systemctl enable clevis-luks-askpass.path
4.  dracut -f

Actual results:
Does not boot; drops me to the dracut emergency shell

Expected results:
Unlocks because it can connect to the tang servers, or at least lets me enter a password

Additional info:
1st example, simple luks encrypted lvm lv (both don't work):

/etc/crypttab:

luks-12456 UUID=1234567890 none _netdev

/etc/fstab:
/dev/mapper/luks-123456 /ext4  defaults,x-systemd.device-timeout=0,_netdev 1 1

2nd example (LVM on LUKS):

/etc/crypttab:
luks-9b910fb0-d2c0-40d4-8b90-2490a9cf2a40 UUID=9b910fb0-d2c0-40d4-8b90-2490a9cf2a40 none _netdev


/etc/fstab:
/dev/mapper/fedora-root /                       ext4    defaults,x-systemd.device-timeout=0,_netdev 1 1
UUID=59273060-3bcd-4228-8a0b-0306549525ab /boot                   ext4    defaults        1 2
UUID=222A-CFE8          /boot/efi               vfat    umask=0077,shortname=winnt 0 2
/dev/mapper/fedora-home /home                   ext4    _netdev,defaults,x-systemd.device-timeout=0,_netdev 1 2
/dev/mapper/fedora-swap swap                    swap    _netdev,defaults,x-systemd.device-timeout=0,_netdev 0 0

Comment 2 Cedric Buissart 2019-08-20 07:41:02 UTC

Hello,
I think I am in a similar situation as described in '2nd example' (encrypted PV containing the root partition). [although I am using legacy BIOS instead of UEFI, but that should not matter, I guess].

Without tang available [I haven't yet tried what happens with tang available], my boot gets stuck and eventually times out because swap is not found. I am not asked for a password.
After falling to the emergency shell (you have to wait a _looong_ time to get into that shell), I can see that :
* the PV hasn't been decrypted.
* /run/systemd/ask-password/ is empty
* /initramfs/usr/lib/dracut/hooks/initqueue/finished/ contains tests for decrypted PV and swap & root LVs


=> Note: binding with clevis-luks-bind appears to work fine. The issue seems to be related to the initramfs image.
Workaround: remove `clevis-dracut` or rebuild the initramfs omitting the clevis module so that you are asked for a password.

I tried to replace /usr/libexec/clevis-luks-askpass with the latest upstream git code : it is marginally better -> I am asked for a password, but pressing <enter> does not work.

Comment 3 Sergio Correia 2019-08-22 00:47:38 UTC

Hi David and Cedric: 

The root partition /, /usr and swap will be mounted by dracut during early boot unlocking; please remove the _netdev option for these partitions (the ones that exist) in both /etc/crypttab and /etc/fstab and give it a try.
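
The suggestion above amounts to dropping `_netdev` from the options field of the affected entries. A minimal sketch of that edit applied to a single options string follows; `strip_netdev` is a hypothetical helper name (not part of clevis or dracut), and it only rewrites the string it is given rather than editing /etc/fstab or /etc/crypttab in place.

```shell
#!/bin/sh
# strip_netdev OPTIONS
# Print a comma-separated options string with every "_netdev" entry removed.
# Hypothetical helper for illustration only.
strip_netdev() {
    printf '%s\n' "$1" | tr ',' '\n' | grep -vx '_netdev' | paste -sd, -
}

# Example: the options of the root entry from the fstab in the original report.
strip_netdev 'defaults,x-systemd.device-timeout=0,_netdev'
```

After changing the real files, the initramfs still has to be rebuilt (`dracut -f`) for the change to take effect during early boot.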

Comment 4 Cedric Buissart 2019-08-23 14:49:53 UTC

(In reply to Sergio Correia from comment #3)
> Hi David and Cedric: 
> 
> The root partition /, /usr and swap will be mounted by dracut during early
> boot unlocking; please remove the _netdev option for these partitions (the
> ones that exist) in both /etc/crypttab and /etc/fstab and give it a try.

I removed every _netdev option (in both crypttab and fstab) and rebuilt the initramfs, but no luck: still the same behavior

Comment 5 Brian Daniels 2019-09-09 18:55:36 UTC

I am seeing a similar issue with Fedora release 30.  The VM was installed with root encryption enabled and a passphrase set.  Clevis was configured with:

sudo yum install clevis clevis-luks clevis-dracut
sudo clevis luks bind -d /dev/sda2 sss '{"t":1, "pins": { "tang": [{"url": "http://nbdetest1.test.com"},{"url": "http://nbdetest2.test.com"}]}}'
sudo dracut -f

No changes were made to /etc/fstab or /etc/crypttab.  On reboot, I am prompted for the decrypt password and no network activity is observed.  Hitting esc to view the startup process reveals a flood of dracut errors, "dracut-initqueue[457]: Error communicating with the server!"

It looks like the network is not up yet at the point clevis-dracut is trying to contact the tang servers?  

If I enter the password, the boot proceeds as expected.  The same setup process works correctly on RHEL7 workstation and auto-unlocks.
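
The bind command in this comment passes a JSON policy to the sss pin: a threshold `t` and a list of tang server URLs. A minimal sketch of assembling that string from a list of URLs is shown here; `make_sss_cfg` is a hypothetical helper name, and it does no JSON escaping, so it assumes plain URLs without quotes or backslashes.

```shell
#!/bin/sh
# make_sss_cfg THRESHOLD URL [URL...]
# Print an sss policy in the shape used in this comment:
#   {"t":N,"pins":{"tang":[{"url":"..."}, ...]}}
# Hypothetical helper; URLs are inserted verbatim (no JSON escaping).
make_sss_cfg() {
    threshold=$1
    shift
    pins=""
    for url in "$@"; do
        # Prefix a comma only when an entry has already been added.
        pins="${pins:+$pins,}{\"url\":\"$url\"}"
    done
    printf '{"t":%s,"pins":{"tang":[%s]}}\n' "$threshold" "$pins"
}

make_sss_cfg 1 http://nbdetest1.test.com http://nbdetest2.test.com
```

The result could then be passed as the last argument, for example `sudo clevis luks bind -d /dev/sda2 sss "$(make_sss_cfg 1 http://nbdetest1.test.com http://nbdetest2.test.com)"`.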

Comment 6 Sergio Correia 2019-09-16 11:51:40 UTC

(In reply to Brian Daniels from comment #5)

[snip]

> It looks like the network is not up yet at the point clevis-dracut is trying
> to contact the tang servers?  
> 

This is probably due to this patch, which was applied as a workaround for a regression caused by dracut: https://src.fedoraproject.org/rpms/clevis/blob/f30/f/0001-Drop-rd.neednet-1-for-the-time-being-so-tpm2-unlock-.patch
You may try to revert it (i.e., edit /usr/lib/dracut/modules.d/60clevis/module-setup.sh and update both depends() and cmdline() functions) and recreate your initramfs.

Comment 7 Brian Daniels 2019-10-16 17:21:59 UTC

(In reply to Sergio Correia from comment #6)

> This is probably due to this patch, which was applied as a workaround for a
> regression caused by dracut:
> https://src.fedoraproject.org/rpms/clevis/blob/f30/f/0001-Drop-rd.neednet-1-for-the-time-being-so-tpm2-unlock-.patch

Unfortunately that did not resolve the issue.  Adding back the network lines removed in that patch does cause the network to start up, but it still hangs at the flood of "Error communicating with the server!" messages.

Comment 8 Sergio Correia 2019-10-18 14:11:26 UTC

(In reply to Brian Daniels from comment #7)
> (In reply to Sergio Correia from comment #6)
> 
> > This is probably due to this patch, which was applied as a workaround for a
> > regression caused by dracut:
> > https://src.fedoraproject.org/rpms/clevis/blob/f30/f/0001-Drop-rd.neednet-1-for-the-time-being-so-tpm2-unlock-.patch
> 
> Unfortunately that did not resolve the issue.  Adding back the network lines
> removed in that patch does cause the network to start up, but it still hangs
> at the flood of "Error communicating with the server!" messages.

Could you check the output of the following command, please?
# lsinitrd /boot/initramfs-$(uname -r).img etc/cmdline.d/99clevis.conf

You should see "rd.neednet=1".
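
A small sketch of scripting that check is given below. `has_neednet` is a hypothetical helper name; it only inspects the text passed to it, so the `lsinitrd` call from the comment above has to supply the file contents.

```shell
#!/bin/sh
# has_neednet TEXT
# Succeed if TEXT contains the whitespace-delimited word "rd.neednet=1".
# Hypothetical helper for illustration only.
has_neednet() {
    printf '%s\n' "$1" | tr -s '[:space:]' '\n' | grep -qx 'rd.neednet=1'
}

# Intended use (requires the initramfs image to be readable):
#   if has_neednet "$(lsinitrd /boot/initramfs-$(uname -r).img etc/cmdline.d/99clevis.conf)"; then
#       echo "rd.neednet=1 is present"
#   fi
```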

Comment 9 Ben Cotton 2019-10-31 18:45:35 UTC

This message is a reminder that Fedora 29 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '29'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 29 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 10 Brian Daniels 2019-11-01 19:47:53 UTC

The issue of not decrypting automatically at boot still exists in Fedora 31.  After the boot, the login process just repeats the "Error communicating with the server!" messages.  Manually entering the LUKS password does work to unlock the volume.

Comment 11 Sergio Correia 2019-11-08 00:46:51 UTC

Felix: It's better to discuss here (regarding https://bugzilla.redhat.com/show_bug.cgi?id=1687753#c8)

(In reply to Felix Schwarz from comment https://bugzilla.redhat.com/show_bug.cgi?id=1687753#c8)
[...]
> Unfortunately it works only partially:
>
>  - The VM activates the network interface and gets a new IP via DHCP (at least in the default install, did not bother to setup systemd-networkd in the VM at first). Also I think the VM makes a successful request to my tang server:
>
>     tangd[29971]: 192.168.122.235 POST /rec/sBFzW0CiL0E8nEjDQxu7HIdeeQM => 200 (src/tangd.c:165)
>
>  - *Afterwards* the password dialog comes up (maybe it could not process the tang response?). I can enter characters there (at least I see the "*" after typing) BUT when I press enter the password dialog does not change (as if "enter" was not processed). So unfortunately I can not boot the VM.
> 

What happens if you wait without entering the password manually? Does it eventually complete the startup process? I am wondering whether you are hitting this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1726617


> - I reset the VM and stopped the tang server. Next boot I can enter the password (including proper reaction for "enter") BUT boot only 
> progresses until systemd says "reached target basic system" :-(
>
>   On the screen I see many lines with the same error message:
> 
>     /lib/dracut/hooks/initqueue/settled/99-nm-run.sh: line 13: basename: command not found

Yeah, this behavior of progressing only up to "reached target basic system" is strange. I see these "basename: command not found" messages from 99-nm-run.sh as well, however the system still boots properly, including the scenario of stopping the tang server, in which case I need to type the password manually. The regular case with tang running works as well, with the automatic unlock of my root partition.

Could you please provide the contents of your /etc/fstab and /etc/crypttab?

Comment 12 Marcel D'Avis 2019-11-10 14:26:20 UTC

I'm having the same issue as Brian. The issue might be dracut itself. After updating to 31, neither the kernel modules for the NIC nor anything else related to networking was included in the initrd. Additionally, I got the basename error. I created a custom dracut.conf to add the kernel modules, basename, and so on, and also changed the grub cmdline to include rd.neednet=1 and ip=${mycustomconfig}.
This way I get networking in the initrd and can successfully ping the machine while it is sitting at the password prompt. The basename error is gone, but I still get "Error communicating with the server!"
The initrd does not seem to make any connection at all to the tang server. I have not yet tried this: https://github.com/dracutdevs/dracut/commit/adee5b97bc5418b6e357342bb3be20568668aa55
I added random.trust_cpu=on and rd.debug=1 to the cmdline, but after 10 minutes still nothing happens.
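
A custom dracut configuration of the kind described in this comment might look roughly like the sketch below. The option names `add_drivers`, `install_items` and `kernel_cmdline` come from dracut.conf(5); everything else is an assumption: the drop-in file name, the `e1000e` driver, and `ip=dhcp` standing in for the unspecified `${mycustomconfig}`. The comment changed the grub command line rather than using `kernel_cmdline`, and reports that networking came up but the tang error remained, so this is not presented as a fix.

```shell
#!/bin/sh
# Write a dracut drop-in to a scratch directory by default, so the sketch can
# be tried without touching /etc/dracut.conf.d. CONF_DIR and the file name are
# hypothetical.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
CONF_FILE="$CONF_DIR/90-clevis-net.conf"

cat > "$CONF_FILE" <<'EOF'
# Assumed NIC driver; substitute the module your hardware needs.
add_drivers+=" e1000e "
# Ship basename in the initrd to silence the 99-nm-run.sh error.
install_items+=" /usr/bin/basename "
# Ask dracut to bring the network up early (ip=dhcp is an assumption).
kernel_cmdline+=" rd.neednet=1 ip=dhcp "
EOF

echo "wrote $CONF_FILE"
```

To use it for real, the file would be placed under /etc/dracut.conf.d and the initramfs rebuilt with `dracut -f`.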

Comment 13 Ben Webb 2019-11-19 05:56:50 UTC

I think this must be a bit more involved than just "rd.neednet is missing" - hopefully my experience can yield some more clues:

I have a bunch of Dell Precision workstations, all with the root filesystem a LV on an encrypted PV, using clevis to unlock at boot by contacting a Tang server. I have always explicitly had rd.neednet=1 set. These all worked fine with the 5.2 kernel in F30, but most stopped booting when the 5.3 kernel was released (except for some older workstations which continued to work fine). Symptoms were as described by others here, the network would come up but the Tang server would not be contacted and so boot could not proceed.

I've started rolling out F31, and the exact same setup *mostly* works fine again. Network comes up, Tang server is contacted, the PV is unlocked and boot proceeds normally. But one workstation still fails to boot - this time the network does not come up at all, but if I manually put in the encryption password everything proceeds normally. I added rd.debug to the kernel command line on this machine to see if I could see what was going on and... now it boots normally (albeit with a ton of debug output). This leads me to suspect a race condition or deadlock between network and clevis. I'm happy to try any suggestions from dracut experts or provide diagnostics.

Comment 14 Sergio Correia 2019-11-19 09:30:11 UTC

(In reply to Ben Webb from comment #13)
[...] 
> I've started rolling out F31, and the exact same setup *mostly* works fine
> again. Network comes up, Tang server is contacted, the PV is unlocked and
> boot proceeds normally. But one workstation still fails to boot - this time
> the network does not come up at all, but if I manually put in the encryption
> password everything proceeds normally. I added rd.debug to the kernel
> command line on this machine to see if I could see what was going on and...
> now it boots normally (albeit with a ton of debug output). This leads me to
> suspect a race condition or deadlock between network and clevis. I'm happy
> to try any suggestions from dracut experts or provide diagnostics.

This sounds a bit like the issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=1726617.

Would you please try out this dracut scratch build on the machine that fails to boot and report back the results? Please remove rd.debug for the test:
https://koji.fedoraproject.org/koji/taskinfo?taskID=39095650

Comment 15 Ben Webb 2019-11-19 17:07:36 UTC

(In reply to Sergio Correia from comment #14)
> (In reply to Ben Webb from comment #13)
> [...] 
> > I've started rolling out F31, and the exact same setup *mostly* works fine
> > again. Network comes up, Tang server is contacted, the PV is unlocked and
> > boot proceeds normally. But one workstation still fails to boot
> This sounds a bit like the issue reported in
> https://bugzilla.redhat.com/show_bug.cgi?id=1726617.

If I'm reading that correctly, this makes the boot slow due to a lack of entropy. My systems aren't slow - they don't boot at all. One sat at the LUKS password prompt for 24 hours before I got to it.

> Would you please try out this dracut scratch build in this machine that
> fails to boot and report back the results? Please remove rd.debug for the
> test:
> https://koji.fedoraproject.org/koji/taskinfo?taskID=39095650

I tried it anyway (rpm -Uvh dracut*.rpm; dracut -f; grub2-mkconfig) and it did not fix the issue. Network does not come up, machine is stuck at the password prompt (I let it sit there for just over an hour). But with this scratch build now I can't reboot with ctrl-alt-del either ;) It gets stuck at "systemd-shutdown[1]: Waiting for process: rngd".

Comment 16 David Luong 2019-12-03 17:14:50 UTC

Worked for me just now:

Steps:

1. Have a wired connection (does not work on wifi)
2. Install clevis clevis-dracut clevis-luks
3. sudo clevis luks bind -d /dev/nvme0n1p3 sss '{"t":1, "pins": { "tang": [{"url": "http://url"}]}}'
4. dracut -f

Packages:
clevis-dracut-11-8.fc31.x86_64
clevis-systemd-11-8.fc31.x86_64
clevis-luks-11-8.fc31.x86_64
clevis-11-8.fc31.x86_64

Comment 17 Ben Webb 2019-12-03 17:33:17 UTC

> Worked for me just now:

Also works for me "some", but not all, of the time. Of 8 desktop machines running F31, 6 boot fine with a very similar configuration to yours; one boots only if rd.debug is added to the kernel command line; one still refuses to boot (I have to put in the LUKS password by hand). (These are all Dell Precision workstations with the same kernel/dracut/clevis/LVM setup, although they are of varying vintages.) A VM with similar configuration also boots fine.

Comment 18 Sergio Correia 2019-12-03 18:02:18 UTC

This may or may not be related, but someone reported earlier on Github that the dracut unlocker + tang pin cannot perform DNS lookups in F31 [1].

He also mentioned that resolving the "basename: command not found" issue from "99-nm-run.sh" allowed his tang setup to work (when using IPs in the url field). I submitted a PR to the dracut package to backport an upstream fix for the "basename not found" issue in [2].

[1] https://github.com/latchset/clevis/issues/148
[2] https://src.fedoraproject.org/rpms/dracut/pull-request/8
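
Because the GitHub report concerns DNS lookups in the initramfs, a quick check is whether a binding's tang URL uses a hostname or a literal IPv4 address. The sketch below assumes URLs of the form scheme://host[:port]/path and does not handle IPv6 literals; `url_host` and `is_ipv4` are hypothetical helper names, and `is_ipv4` checks only the dotted-quad shape, not the numeric range.

```shell
#!/bin/sh
# url_host URL
# Print the host part of scheme://host[:port]/path (no IPv6 literal support).
url_host() {
    host=${1#*://}     # drop the scheme, if any
    host=${host%%/*}   # drop the path
    host=${host%%:*}   # drop the port
    printf '%s\n' "$host"
}

# is_ipv4 HOST
# Succeed if HOST looks like a dotted-quad IPv4 address (shape check only).
is_ipv4() {
    printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Example: warn when a tang URL would need name resolution during early boot.
tang_url='http://nbdetest1.test.com'
if ! is_ipv4 "$(url_host "$tang_url")"; then
    echo "$tang_url relies on DNS"
fi
```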

Comment 19 Ben Webb 2019-12-03 18:13:58 UTC

> This may or may not be related, but someone reported earlier on Github that the dracut unlocker + tang pin cannot perform DNS lookups in F31 [1].

It's certainly not the cause of the issues I'm seeing since all my systems use IP addresses for the Tang servers (and on the systems that don't boot, the network doesn't come up anyway so it hasn't got to the point of DNS being an issue).

Comment 20 Chuck Liggett 2019-12-03 18:51:41 UTC

Switching to an IP-address based tang URL resolved the issue for me.

I would like to add that I still get the basename errors out of 99-nm-run.sh, but that didn't prevent it from working.

Thank you Sergio!

Comment 21 Chuck Liggett 2019-12-03 19:32:55 UTC

Interesting...  now on my F31 Workstation which is also a KVM Hypervisor, after it boots up using NBDE with a Tang server, Network Manager is getting a new "Wired Connection" automatically created, and it's preventing my normal br0 interface from starting.

On my test KVM VM of F31 Server, minimal installation, nmcli con show only shows one interface, eth0, but when I reboot it with NBDE, it boots up and shows an activated (green) "Wired Connection" and my normal eth0 interface not-active (white).

Comment 22 Paul K 2019-12-12 19:11:16 UTC

I think this may also be related to https://github.com/dracutdevs/dracut/issues/537. Pull request #578 (https://github.com/dracutdevs/dracut/pull/578) removes the use of basename entirely. Unfortunately that patch didn't make it into release 049.

I've manually applied that patch and it stopped the error, but didn't help booting with clevis. The odd part is that I thought this was working for me on F30 and F31 about a month ago. I haven't changed the boot ISO or packages (I'm configuring this with a kickstart file).

Comment 23 Ben Webb 2019-12-13 22:07:01 UTC

FWIW, I have a solution for my systems (summary of comment 13 and comment 17; I have a number of F31 machines with essentially identical configuration; ~10 systems boot with clevis+tang OK, one only boots if rd.debug is set, three won't boot at all - the network doesn't come up - and need a password to be manually input; all machines worked fine with F30 and the 5.2 kernel).

The problem is that NetworkManager runs as expected (/usr/lib/dracut/modules.d/35network-manager/nm-run.sh) but fails to bring up the network. It parses the ip= parameter in the kernel command line correctly (/run/NetworkManager/system-connections/enp0s25.nmconnection is created with correct contents). I also see on the console that the NIC has a link:
[   15.336457] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[   15.338352] IPv6: ADDRCONF(NETDEV_CHANGE): enp0s25: link becomes ready
... but NetworkManager does not bring up the device (/run/NetworkManager/devices/2 contains only "managed=true", no "connection-uuid", thus nm-run.sh does not proceed to "source_hook initqueue/online" which would trigger the clevis unlock). After this NetworkManager is never called again so the network will never come up.

I don't know if the root cause is that NetworkManager needs a longer timeout somewhere (I didn't see anywhere where I could configure that in dracut-network) or that it's being called too early, before the network device is ready. But it seems clear to me that it "sometimes" works because on some systems the device is ready in time, and that on machines where it doesn't work setting rd.debug might slow down NetworkManager just enough. But the workaround that works for me (obviously a hack though) is to edit nm-run.sh and simply duplicate the line:
    /usr/sbin/NetworkManager --configure-and-quit=initrd --no-daemon

While NetworkManager will fail to configure the device on the first call, it works the second time around. Alternatively, I found that adding "sleep 20" before NetworkManager is called works too.
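
The "call it twice" workaround can be generalized into a bounded retry that stops as soon as a device state file records a connection. This is only a sketch of the idea, not the actual nm-run.sh code: `device_configured` and `run_until_configured` are hypothetical names, and the check assumes the state files contain a `connection-uuid=` line once the device is configured, as described above.

```shell
#!/bin/sh
# device_configured STATE_DIR
# Succeed if any file in STATE_DIR contains "connection-uuid=".
device_configured() {
    grep -qs 'connection-uuid=' "$1"/*
}

# run_until_configured STATE_DIR ATTEMPTS COMMAND [ARGS...]
# Run COMMAND up to ATTEMPTS times, stopping once a device is configured.
run_until_configured() {
    state_dir=$1
    attempts=$2
    shift 2
    tries=0
    while [ "$tries" -lt "$attempts" ]; do
        "$@" || :   # ignore the command's own exit status; the state files decide
        if device_configured "$state_dir"; then
            return 0
        fi
        tries=$((tries + 1))
    done
    return 1
}

# Intended use inside the initramfs (not run here):
#   run_until_configured /run/NetworkManager/devices 2 \
#       /usr/sbin/NetworkManager --configure-and-quit=initrd --no-daemon
```

Compared with a fixed `sleep 20`, a retry of this kind only costs extra time on machines where the first call does not configure the device.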

Comment 24 Paul K 2019-12-16 19:46:10 UTC

I finally found the problem I encountered. I'd changed the partition type from MBR to GPT without changing the device in the clevis command in %post. GPT adds a partition that MBR doesn't have which moved the LUKS PV from sda2 to sda3. 

Moral of the story - never believe anyone that says "nothing changed," something probably has. :)
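
One way to avoid hard-coding a partition number in a kickstart %post is to look the LUKS device up by its filesystem type instead. The sketch below assumes the output of `lsblk -rpno NAME,FSTYPE` (util-linux) is piped in; `luks_devices` is a hypothetical helper name.

```shell
#!/bin/sh
# luks_devices
# Read "NAME FSTYPE" lines on stdin and print the names whose type is
# crypto_LUKS. Hypothetical helper for illustration only.
luks_devices() {
    awk '$2 == "crypto_LUKS" { print $1 }'
}

# Intended use (prints every LUKS container found on the system):
#   lsblk -rpno NAME,FSTYPE | luks_devices
```

With a single encrypted partition, the result could be passed to `clevis luks bind -d` in place of a fixed /dev/sdaN path.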

(In reply to Paul K from comment #22)
> I think this may also be related to
> https://github.com/dracutdevs/dracut/issues/537. Pull request #578
> (https://github.com/dracutdevs/dracut/pull/578) removes the use of basename
> entirely. Unfortunately that patch didn't make it into release 049.
> 
> I've manually applied that patch and it stopped the error, but didn't help
> booting with clevis. The odd part is that I thought this was working for me
> on F30 and F31 about a month ago. I haven't changed the boot ISO or packages
> (I'm configuring this with a kickstart file).

Comment 25 Fedora Update System 2020-08-31 12:39:44 UTC

FEDORA-2020-b35219394f has been submitted as an update to Fedora 33. https://bodhi.fedoraproject.org/updates/FEDORA-2020-b35219394f

Comment 26 Fedora Update System 2020-08-31 13:18:32 UTC

FEDORA-2020-d42f4e90f9 has been submitted as an update to Fedora 32. https://bodhi.fedoraproject.org/updates/FEDORA-2020-d42f4e90f9

Comment 27 Fedora Update System 2020-08-31 15:55:50 UTC

FEDORA-2020-d42f4e90f9 has been pushed to the Fedora 32 testing repository.
In a short time you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2020-d42f4e90f9`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2020-d42f4e90f9

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 28 Fedora Update System 2020-08-31 18:58:03 UTC

FEDORA-2020-b35219394f has been pushed to the Fedora 33 testing repository.
In a short time you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2020-b35219394f`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2020-b35219394f

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 29 Fedora Update System 2020-09-07 17:13:47 UTC

FEDORA-2020-d42f4e90f9 has been pushed to the Fedora 32 stable repository.
If the problem still persists, please make note of it in this bug report.

Comment 30 Fedora Update System 2020-09-25 16:40:14 UTC

FEDORA-2020-b35219394f has been pushed to the Fedora 33 stable repository.
If the problem still persists, please make note of it in this bug report.
