Bug 2011634
| Summary: | Boot fails when disk encryption is enabled | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Alexander Chuzhoy <sasha> |
| Component: | Documentation | Assignee: | Aidan Reilly <aireilly> |
| Status: | CLOSED DEFERRED | QA Contact: | Omri Hochman <ohochman> |
| Severity: | urgent | Docs Contact: | Tomas 'Sheldon' Radej <tradej> |
| Priority: | urgent | | |
| Version: | 4.9 | CC: | atraeger, fpercoco, lmurthy, nmagnezi, shashank, trwest, wlewis, ybettan, ykashtan |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-03-09 01:07:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Alexander Chuzhoy
2021-10-06 23:11:38 UTC
Seems like you have a wrong boot order - from the UI:
```
This host is pending user action
Host failed to reboot within timeout, please boot the host from the the OpenShift installation disk (sda, /dev/disk/by-id/wwn-0x62cea7f04ab5ee00269768ff2e107728). The installation will resume once the host reboot.
```

So the host is pending user action because it never comes back from the reboot - it gets stuck.

More observations:
Disabled TPM in AI and Enabled TPM in BIOS => The issue reproduced.
Disabled TPM in AI and Disabled TPM in BIOS => The issue DOES NOT reproduce. Successfully installed.
Enabled TPM in BIOS and rebooted the server => Booted fine.

Enabled TPM in BIOS and Disabled TPM in AI => The issue DOES NOT reproduce. Successfully installed.

Unable to proceed, started to hit https://bugzilla.redhat.com/show_bug.cgi?id=2012206
Will resume tests once https://bugzilla.redhat.com/show_bug.cgi?id=2012206 is fixed.

Enabled TPM in BIOS and Enabled TPM in AI => The issue reproduces.

Just finished running the flow using BM SNO:

TPM enabled in BIOS, TPM disabled in assisted --> installation succeeded
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/166e1946-b1b3-439a-ad76-9389f50caed4 (need admin access for the link)

My guess is that you gave up too early: on SNO the controller runs at a later stage, so there is no one to update the host status (configuring, join, done), and the next update is the operator being installed (a long time with no update in the UI).

Trying now to run the second configuration: TPM enabled in BIOS, TPM enabled in assisted

I have managed to reproduce the issue on the BM SNO host when activating disk-encryption in the service.

Is there a way to capture the console output and scroll up to find the error (besides trying to get multiple screenshots)?
@sasha

(In reply to Yonathan Bettan from comment #7)
> Just finished running the flow using BM SNO:
>
> TPM enabled in BIOS, TPM disabled in assisted --> installation succeeded
> https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/166e1946-b1b3-439a-ad76-9389f50caed4 (need admin access for the link)
>
> My guess is that you gave up too early: on SNO the controller runs at a later
> stage, so there is no one to update the host status (configuring, join,
> done), and the next update is the operator being installed (a long time
> with no update in the UI).
>
> Trying now to run the second configuration: TPM enabled in BIOS, TPM enabled
> in assisted

In comment #5 I reported:
Enabled TPM in BIOS and Disabled TPM in AI => The issue DOES NOT reproduce. Successfully installed.

(In reply to Yonathan Bettan from comment #8)
> I have managed to reproduce the issue on the BM SNO host when activating
> disk-encryption in the service.
>
> Is there a way to capture the console output and scroll up to find the error
> (besides trying to get multiple screenshots)?
>
> @sasha

From the iDRAC console: Maintenance -> Troubleshooting
Select a boot capture to view and click on Save. This will download an mpg video file to your machine, where you can view it.

This may be the root cause, not sure yet how to solve it.

    ignition[2262]: disks: createLuks: op(5): [failed] Clevis bind: exit status 1: Cmd: "clevis" "luks" "bind" "-f" {}}, \"t\":1\" Stdout: "" stderr: "WARNING: esys: src\tss2-esys/api/Esys_Create.c:375:Esys_Create_Finish() Received TPM Error \nError: essays: src"tss2-esys/api/Esys_Create.c:120: Esys_Create() Esys Finish ErrorCode (0x00000921) \nError: Esys_create(0x921) - tpm:warn(2.0): authorisations for objects subject to DA protection are not allowed at this time because the TPM is in DA lockout mode\nERROR: Unable to run tpm2_create\nCreating TPM2 object for jwk failed!\n"

Any chance you can give it a try @sasha

Just tried a normal SNO deployment without enabling TPM (adding attachments), it's working fine!

Here's what I'm seeing on the machine with tpm module (deployed with :

    [root@master-0 ~]# tpm2_clear -c platform
    WARNING:esys:src/tss2-esys/api/Esys_Clear.c:286:Esys_Clear_Finish() Received TPM Error
    ERROR:esys:src/tss2-esys/api/Esys_Clear.c:97:Esys_Clear() Esys Finish ErrorCode (0x000009a2)
    ERROR: Esys_Clear(0x9A2) - tpm:session(1):authorization failure without DA implications
    ERROR: Unable to run tpm2_clear
    [root@master-0 ~]# tpm2_clear
    WARNING:esys:src/tss2-esys/api/Esys_Clear.c:286:Esys_Clear_Finish() Received TPM Error
    ERROR:esys:src/tss2-esys/api/Esys_Clear.c:97:Esys_Clear() Esys Finish ErrorCode (0x00000921)
    ERROR: Esys_Clear(0x921) - tpm:warn(2.0): authorizations for objects subject to DA protection are not allowed at this time because the TPM is in DA lockout mode
    ERROR: Unable to run tpm2_clear

I was able to select the option in BIOS to run "clear" action on tpm.
After rebooting the machine, the console showed that the system configuration had changed, and after the OS booted:

    [root@master-0 ~]# tpm2_clear
    (returned the prompt with no error)
    [root@master-0 ~]# tpm2_clear -c platform
    WARNING:esys:src/tss2-esys/api/Esys_Clear.c:286:Esys_Clear_Finish() Received TPM Error
    ERROR:esys:src/tss2-esys/api/Esys_Clear.c:97:Esys_Clear() Esys Finish ErrorCode (0x000009a2)
    ERROR: Esys_Clear(0x9A2) - tpm:session(1):authorization failure without DA implications
    ERROR: Unable to run tpm2_clear

When we deploy SNO with disk encryption enabled, we observe the following (note the crypt next to /sysroot):

    [root@master-0 ~]# lsblk
    NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
    sr0          11:0    1   104M  0 rom
    nvme0n1     259:0    0   477G  0 disk
    ├─nvme0n1p1 259:1    0     1M  0 part
    ├─nvme0n1p2 259:2    0   127M  0 part
    ├─nvme0n1p3 259:3    0   384M  0 part  /boot
    └─nvme0n1p4 259:4    0 476.4G  0 part
      └─root    253:0    0 476.4G  0 crypt /sysroot
    nvme1n1     259:5    0   477G  0 disk

When we deploy without enabling disk encryption, we see the following:

    lsblk
    NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
    sr0          11:0    1   104M  0 rom
    nvme0n1     259:0    0   477G  0 disk
    ├─nvme0n1p1 259:1    0     1M  0 part
    ├─nvme0n1p2 259:2    0   127M  0 part
    ├─nvme0n1p3 259:3    0   384M  0 part  /boot
    └─nvme0n1p4 259:4    0 476.4G  0 part  /sysroot
    nvme1n1     259:5    0   477G  0 disk

Based on comment #16, removed the testblocker flag.

This needs to be documented.
What would be the best place to document such information, in your opinion? @atraeger

*** Bug 2019582 has been marked as a duplicate of this bug. ***

"I was able to select the option in BIOS to run "clear" action on tpm."

What was this option called? Can you add a screenshot? @sasha

(In reply to Yonathan Bettan from comment #20)
> "I was able to select the option in BIOS to run "clear" action on tpm."
>
> What was this option called? Can you add a screenshot? @sasha

This really depends on the machine type/model.
On HPE Proliant e910:
Go into System Utilities -> System Configuration -> Bios/Platform Configuration -> Server security -> Trusted Platform Module options -> TPM 2.0 Operations - and select Clear (like in the attached image)

My suggestion is:
1. Research TPM a bit more
2. See how we can understand the state of the TPM from the agent
3. If the TPM needs to be cleared:
   a. If it can be cleared, then the agent should just do it
   b. If it can't be cleared, a validation should inform the user to update the BIOS and boot the discovery image again

Adding a follow-up task to handle it and closing the ticket.
https://issues.redhat.com/browse/MGMT-7888

The cleanup in general seems like a dangerous procedure:
It seems that clearing erases information stored on the TPM. We will lose all created keys and access to data encrypted by these keys.

(In reply to Yonathan Bettan from comment #11)
> This may be the root cause, not sure yet how to solve it.
>
> ignition[2262]: disks: createLuks: op(5): [failed] Clevis bind: exit status
> 1: Cmd: "clevis" "luks" "bind" "-f" {}}, \"t\":1\" Stdout: "" stderr:
> "WARNING: esys: src\tss2-esys/api/Esys_Create.c:375:Esys_Create_Finish()
> Received TPM Error \nError: essays: src"tss2-esys/api/Esys_Create.c:120:
> Esys_Create() Esys Finish ErrorCode (0x00000921) \nError: Esys_create(0x921)
> - tpm:warn(2.0): authorisations for objects subject to DA protection are not
> allowed at this time because the TPM is in DA lockout mode\nERROR: Unable to
> run tpm2_create\nCreating TPM2 object for jwk failed!\n"

See https://superuser.com/questions/1404738/tpm-2-0-hardware-error-da-lockout-mode
Can you provide the information suggested there, and/or the solution ...

The solution was to "clean" previous keys of the TPM for machines that were already using TPM before the installation.
@shashank should be able to supply the exact commands.
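
For triage, the two TPM response codes quoted in this thread (0x921, DA lockout, and 0x9A2, authorization failure without DA implications) can be told apart mechanically in a captured log. The following is a minimal Python sketch, not part of the assisted installer or tpm2-tools. The names `KNOWN_TPM_ERRORS`, `classify_tpm_errors` and `is_da_lockout` are hypothetical, and the table covers only the codes reported here.

```python
import re

# Response codes quoted in this bug, with the descriptions printed next to
# them in the tpm2-tools output above. This is not an exhaustive TPM 2.0 table.
KNOWN_TPM_ERRORS = {
    0x921: "TPM is in DA lockout mode",
    0x9A2: "authorization failure without DA implications",
}

# Matches the "ErrorCode (0x00000921)" fragments seen in the ESYS messages.
ERROR_CODE_RE = re.compile(r"ErrorCode \((0x[0-9a-fA-F]+)\)")


def classify_tpm_errors(log_text):
    """Return (code, description) pairs for every ErrorCode in log_text, in order."""
    results = []
    for match in ERROR_CODE_RE.finditer(log_text):
        code = int(match.group(1), 16)
        results.append((code, KNOWN_TPM_ERRORS.get(code, "unknown")))
    return results


def is_da_lockout(log_text):
    """True if any ErrorCode in log_text is the DA lockout code 0x921."""
    return any(code == 0x921 for code, _ in classify_tpm_errors(log_text))
```

Run against the ignition message from comment #11, `is_da_lockout` would report the lockout case, while the `tpm2_clear -c platform` failure (0x9A2) would not be flagged as a lockout.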
"clean" is a brute force solution - it will delete ALL keys on the TPM including ones that in use by other components (data disk is my main use case with Telco) I'm suggesting to see if simply removing the lock solves the issue. BTW: if so, we might try to see what causing it to lock, if it's some reboot race condition, we might be able to improve and solve it entirely FYI @nmagnezi We have documented this limitation of stuck cluster deployments when there are left-over TPM encryption keys from a previous installation on the host. See the note here: https://access.redhat.com/documentation/en-us/assisted_installer_for_openshift_container_platform/2022/html/assisted_installer_for_openshift_container_platform/assembly_enabling-disk-encryption Is this update sufficient to close this docs bug? There is currently no work around for this issue. OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-8998 The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days |