Bug 1877995 - [4.6] [vmware] Fail to boot with Secure Boot enabled, kernel lockdown denies iopl access to afterburn
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Luca BRUNO
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-11 02:46 UTC by jima
Modified: 2021-01-04 16:37 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1880417
Environment:
Last Closed: 2020-10-27 16:40:07 UTC
Target Upstream Version:
Embargoed:


Attachments
The screenshot of vm console (323.50 KB, image/png)
2020-09-11 02:46 UTC, jima
new error from master/worker nodes. (457.20 KB, image/png)
2020-09-22 07:33 UTC, jima


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | lucab vmw_backdoor-rs issues 6 | 0 | None | open | kernel_lockdown denies iopl calls | 2021-02-05 08:50:36 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:40:23 UTC

Description jima 2020-09-11 02:46:49 UTC
Created attachment 1714486
The screenshot of vm console

Description of problem:
While verifying Bug 1862851, I enabled Secure Boot when installing OCP with rhcos-46.82.202009091306-0 and hit a new issue, shown on the VM console:
Failed at step STDIN spawning /bin/dracut-emergency: Inappropriate ioctl for device

Please see attached screenshot.

Version-Release number of selected component (if applicable):


How reproducible:
Always, once Secure Boot is enabled

Steps to Reproduce:
1. Install OCP on vSphere with Secure Boot enabled, using the RHCOS template rhcos-46.82.202009091306-0.

Actual results:
VMs could not be started correctly.

Expected results:
VMs should come up successfully.

Additional info:

Comment 1 Luca BRUNO 2020-09-11 09:39:02 UTC
From the attached console screenshot: kernel_lockdown (which is automatically enabled under Secure Boot) is denying `iopl` access. Afterburn uses `iopl` to unlock I/O access to the hypervisor, in order to read guestinfo properties.

The underlying root-cause is https://github.com/lucab/vmw_backdoor-rs/issues/6.
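
As a rough way to confirm the lockdown state on an affected node, the sketch below reads the active mode from securityfs. It assumes the running kernel exposes `/sys/kernel/security/lockdown` (not confirmed for the RHCOS kernel discussed here); `active_lockdown_mode` is a hypothetical helper name.

```shell
#!/bin/sh
# The kernel lists all lockdown modes and puts the active one in brackets,
# e.g. "none [integrity] confidentiality".
# active_lockdown_mode is a hypothetical helper, not part of any tool.
active_lockdown_mode() {
    # $1: contents of /sys/kernel/security/lockdown
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Assumption: securityfs is mounted and the kernel exposes this file.
lockdown_file=/sys/kernel/security/lockdown
if [ -r "$lockdown_file" ]; then
    mode=$(active_lockdown_mode "$(cat "$lockdown_file")")
    echo "kernel lockdown mode: ${mode:-unknown}"
else
    echo "kernel lockdown state not exposed on this system"
fi
```

Any mode other than `none` would be consistent with `iopl` being denied as described above.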

Comment 2 Luca BRUNO 2020-09-11 09:50:53 UTC
I'm trying to reproduce/investigate this with the current 4.6 pre-release OVA (4.6.0-0.nightly-2020-09-09-130911 [0]), but it fails to even boot the kernel due to an invalid signature (normal EFI boot is fine though).

Jinyun, does that image boot up to the kernel for you?

[0] https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/latest-4.6/rhcos-4.6.0-0.nightly-2020-09-09-130911-x86_64-vmware.x86_64.ova

Comment 3 Colin Walters 2020-09-11 11:40:17 UTC
See https://gitlab.cee.redhat.com/coreos/redhat-coreos/-/merge_requests/1112 - TL;DR 4.6 is back to an 8.2.z kernel which fixes this.  You need to use the build browser to get the latest.

Comment 4 Luca BRUNO 2020-09-11 12:49:33 UTC
Indeed, the latest nightly internally available at this time (rhcos-46.82.202009101640-0) has a good signature and can reach kernel-space.

I can fully reproduce the failure there, caused by `iopl` returning EPERM. As a (very bad) quick workaround, I've verified that this can be bypassed by catching the GRUB prompt and manually adding `ip=dhcp,dhcp6` as a kernel argument.

Comment 5 Luca BRUNO 2020-09-11 13:07:39 UTC
I had a bad feeling about Ignition, and indeed, bypassing the Afterburn failure with the hackish workaround above just moves the problem: the same `iopl` denial then causes failures in the Ignition fetch stages.

For reference, that's tracked separately at https://github.com/coreos/ignition/issues/1092.

Comment 6 Micah Abbott 2020-09-11 13:31:15 UTC
Not sure how many customers are using SecureBoot on VMware, so setting medium priority for now.  I wonder if this should really be considered a TestBlocker because of that.

Comment 7 Micah Abbott 2020-09-11 17:53:26 UTC
Created a doc BZ to advise users not to use SecureBoot on VMware - https://bugzilla.redhat.com/show_bug.cgi?id=1878262

Comment 10 Benjamin Gilbert 2020-09-14 17:04:08 UTC
@jima, about the Regression keyword: can you confirm that Secure Boot worked on VMware on previous versions of RHCOS?  Due to the nature of the bug, I suspect that it did not.

Comment 12 Benjamin Gilbert 2020-09-15 00:00:34 UTC
On 4.5 with Secure Boot enabled, how did you install the nodes?  Did you use the RHCOS installer PXE or ISO image to install to the VM disk, passing the Ignition config via coreos.inst.ignition_url?  Did you start from the OVA and pass the Ignition config via guestinfo ignition.config.data?  Or something else?

Comment 13 jima 2020-09-15 00:22:42 UTC
(In reply to Benjamin Gilbert from comment #12)
> On 4.5 with Secure Boot enabled, how did you install the nodes?  Did you use
> the RHCOS installer PXE or ISO image to install to the VM disk, passing the
> Ignition config via coreos.inst.ignition_url?  Did you start from the OVA
> and pass the Ignition config via guestinfo ignition.config.data?  Or
> something else?

On 4.5 with secure boot enabled, VMs are created from ova template and set ignition config via guestinfo ignition.config.data.

Comment 14 JP Jung 2020-09-15 00:34:27 UTC
And I did a 4.5 "bare metal" installation on pre-created VMs, booting from the CoreOS ISO & passing the ignition config. On VMware, when I create a VM of type RHEL 8 it defaults to EFI boot and SecureBoot is enabled; it is the option I chose to create my VMs to do the bare metal install. If I pick CoreOS for a new VM, it defaults to BIOS, no secure boot.

Comment 15 Benjamin Gilbert 2020-09-15 03:20:57 UTC
JP: the "CoreOS" OS type in VMware probably refers to CoreOS Container Linux.

Amending my statement in comment 10: I'd expect that both Ignition and Afterburn will fail when accessing VMware guestinfo variables if Secure Boot is enabled.  Afterburn only gained this functionality in RHCOS 4.6, and Ignition does not access guestinfo when a machine is installed via the bare-metal installer.  So the success report in comment 14 makes sense, but I'm surprised by the report in comment 13.

At least as to Afterburn in the bare-metal install case, I agree that this is a regression.

Comment 16 Luca BRUNO 2020-09-15 08:21:38 UTC
> So the success report in comment 14 makes sense, but I'm surprised by the report in comment 13.

I think I can explain that: the older library in Ignition 0.x did not perform an `iopl` and thus it won't fail this way (but that in turn means it is prone to other non-deterministic failures).
I have noted more details and references about this at https://github.com/coreos/ignition/issues/1092#issuecomment-692549607.

Comment 24 Luca BRUNO 2020-09-18 15:44:17 UTC
Pushed `rust-afterburn-4.5.0-2.rhaos4.6.el8` with a patch to skip the `iopl` call (essentially matching previous Ignition 0.x behavior) as a quickfix for 4.6.

Comment 25 Luca BRUNO 2020-09-21 12:10:07 UTC
The other half of this fix has been pushed to `ignition-2.6.0-4.rhaos4.6.git947598e.el8`.

Both sides landed in RHCOS 4.6 nightlies. I have manually tested RHCOS `46.82.202009182140-0` on VMware with Secure Boot enabled:
```
$ head -2 /etc/os-release 
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="46.82.202009182140-0"

$ grep -o ignition.platform.id='[[:alnum:]]*' /proc/cmdline
ignition.platform.id=vmware

$ mokutil --sb-state
SecureBoot enabled
```
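
The platform check above can be wrapped in a small helper for scripted verification. This is only a sketch: `platform_id` is a hypothetical function name, and it assumes the kernel command line separates arguments with single spaces.

```shell
#!/bin/sh
# Extract the value of ignition.platform.id from a kernel command line.
# platform_id is a hypothetical helper name used for illustration.
platform_id() {
    # $1: kernel command line, e.g. the contents of /proc/cmdline
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^ignition\.platform\.id=//p' | head -n 1
}

if [ -r /proc/cmdline ]; then
    id=$(platform_id "$(cat /proc/cmdline)")
    echo "ignition platform: ${id:-not set}"
fi
```

On the node tested above, this would be expected to report `vmware`.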

Comment 27 jima 2020-09-22 07:32:37 UTC
I installed OCP 4.6.0-0.nightly-2020-09-21-182309 with the RHCOS template 46.82.202009182140-0 on vSphere. The Secure Boot error is no longer reproduced, but I got a new error on the master/worker node console when fetching the Ignition file:
x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0

Please see attached screenshot.

Comment 28 jima 2020-09-22 07:33:40 UTC
Created attachment 1715647
new error from master/worker nodes.

Comment 29 Luca BRUNO 2020-09-22 08:26:16 UTC
Thanks for confirming the SecureBoot fixes worked.
The additional error you are seeing depends on TLS certificates set up in your environment.

If you are providing this custom certificate, you need to adjust the Subject Alternative Name as suggested in the error.

If it was the OpenShift Installer that auto-generated it, please open a BZ against that component so that the certificate generation logic is tweaked to introduce hostname-matching SAN entries.
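
One way to check whether a certificate carries SAN entries at all is to look for the extension in the text dump produced by `openssl x509`. A minimal sketch follows; `has_san` is a hypothetical helper, and `server.crt` in the usage comment is a placeholder path, not a file from this report.

```shell
#!/bin/sh
# Succeed if the `openssl x509 -text` output read from stdin contains a
# Subject Alternative Name extension. has_san is a hypothetical helper.
has_san() {
    grep -q 'X509v3 Subject Alternative Name'
}

# Usage (server.crt is a placeholder for the certificate under test):
#   openssl x509 -in server.crt -noout -text | has_san \
#       && echo "SAN present" || echo "no SAN: certificate relies on Common Name"
```

This only detects that the extension exists; it does not verify that the SAN entries match the hostname being contacted.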

Comment 30 Micah Abbott 2020-09-22 13:22:50 UTC
Marking verified with 4.6.0-0.nightly-2020-09-21-182309 based on comment #27

Comment 33 errata-xmlrpc 2020-10-27 16:40:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 34 Matan Carmeli 2020-12-22 15:31:51 UTC
I am using OpenShift 4.6.8 and booting my CoreOS VMs in BIOS mode, and I get "Inappropriate ioctl for device" errors. The problem looked like it only happened in UEFI mode. Is there something else that I can do?
Thanks

Comment 35 Micah Abbott 2021-01-04 16:37:32 UTC
(In reply to Matan Carmeli from comment #34)
> I am using OpenShift 4.6.8 and booting my CoreOS VMs in BIOS mode, and I get
> "Inappropriate ioctl for device" errors. The problem looked like it only
> happened in UEFI mode. Is there something else that I can do?
> Thanks

Please open a new BZ with the problem you are facing.

