Bug 2011634

Summary: Boot fails when disk encryption is enabled
Product: OpenShift Container Platform
Reporter: Alexander Chuzhoy <sasha>
Component: Documentation
Assignee: Aidan Reilly <aireilly>
Status: CLOSED DEFERRED
QA Contact: Omri Hochman <ohochman>
Severity: urgent
Docs Contact: Tomas 'Sheldon' Radej <tradej>
Priority: urgent
Version: 4.9
CC: atraeger, fpercoco, lmurthy, nmagnezi, shashank, trwest, wlewis, ybettan, ykashtan
Target Milestone: ---
Keywords: Reopened
Target Release: 4.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-03-09 01:07:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Alexander Chuzhoy 2021-10-06 23:11:38 UTC
Version:
Assisted-ui-lib version:  1.5.37

Hardware:
Model	PowerEdge R740
BIOS Version	2.5.4
iDRAC Firmware Version	5.00.00.00


Enabled disk encryption for the cluster and generated ISO.
Booted the server with this ISO and proceeded with deploying SNO.

The deployment got stuck at "Installing 7/10" (see the attached image).


Checking the server console, the boot does not complete; the host drops into the emergency target and constantly reboots (see the attached image).




curl -s  -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json'  https://api.stage.openshift.com/api/assisted-install/v1/clusters/91441aee-3139-49a9-9ccd-163d050fa798 |jq .disk_encryption
{
  "enable_on": "all",
  "mode": "tpmv2"
}

Comment 4 Yoni Bettan 2021-10-07 14:44:57 UTC
Seems like you have a wrong boot order - from the UI:

```
This host is pending user action
Host failed to reboot within timeout, please boot the host from the OpenShift installation disk (sda, /dev/disk/by-id/wwn-0x62cea7f04ab5ee00269768ff2e107728). The installation will resume once the host reboots.
```

Comment 5 Alexander Chuzhoy 2021-10-08 16:58:58 UTC
So the host is pending user action, because it never comes back from reboot - getting stuck.


More observations:



Disabled TPM in AI and Enabled TPM in BIOS =>  The issue reproduced.

Disabled TPM in AI and Disabled TPM in BIOS => The issue DOES NOT reproduce. Successfully installed.



Enabled TPM in BIOS and rebooted the server => Booted fine.

Enabled TPM in BIOS and Disabled TPM in AI => The issue DOES NOT reproduce. Successfully installed.


Unable to proceed, started to hit https://bugzilla.redhat.com/show_bug.cgi?id=2012206

Will resume tests once https://bugzilla.redhat.com/show_bug.cgi?id=2012206 is fixed.

Comment 6 Alexander Chuzhoy 2021-10-12 14:40:29 UTC
Enabled TPM in BIOS and Enabled TPM in AI => The issue reproduces.

Comment 7 Yoni Bettan 2021-10-19 10:40:49 UTC
Just finished running the flow using BM SNO:

TPM enabled in BIOS, TPM disabled in assisted --> installation succeeded https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/166e1946-b1b3-439a-ad76-9389f50caed4 (need admin access for the link)

My guess is that you gave up too early; on SNO the controller runs at a later stage, so there is nothing to update the host status (configuring, joined, done), and the next update is the operators being installed (a long time with no update in the UI).

Trying now to run the second configuration: TPM enabled in BIOS, TPM enabled in assisted

Comment 8 Yoni Bettan 2021-10-19 12:16:01 UTC
I have managed to reproduce the issue on the BM SNO host when activating disk-encryption in the service.

Is there a way to capture the console output and scroll up to find the error (besides trying to get multiple screenshots)?
@sasha

Comment 9 Alexander Chuzhoy 2021-10-19 14:01:02 UTC
(In reply to Yonathan Bettan from comment #7)
> Just finished running the flow using BM SNO:
> 
> TPM enabled in BIOS, TPM disabled in assisted --> installation succeeded
> https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/
> 166e1946-b1b3-439a-ad76-9389f50caed4 (need admin access for the link)
> 
> My guess is that you gave up too early; on SNO the controller runs at a
> later stage, so there is nothing to update the host status (configuring,
> joined, done), and the next update is the operators being installed (a
> long time with no update in the UI).
> 
> Trying now to run the second configuration: TPM enabled in BIOS, TPM enabled
> in assisted

In comment #5 I reported:
Enabled TPM in BIOS and Disabled TPM in AI => The issue DOES NOT reproduce. Successfully installed.

Comment 10 Alexander Chuzhoy 2021-10-19 18:50:56 UTC
(In reply to Yonathan Bettan from comment #8)
> I have managed to reproduce the issue on the BM SNO host when activating
> disk-encryption in the service.
> 
> Is there a way to capture the console output and scroll up to find the error
> (besides trying to get multiple screenshots)?
> @sasha

From the iDRAC console:
   Maintenance -> Troubleshooting
   Select a boot capture to view and click Save.

This will download an .mpg video file to your machine, where you can view it.

Comment 11 Yoni Bettan 2021-10-20 08:09:37 UTC
This may be the root cause, not sure yet how to solve it.

ignition[2262]: disks: createLuks: op(5): [failed] Clevis bind: exit status 1: Cmd: "clevis" "luks" "bind" "-f" {}}, \"t\":1\" Stdout: "" Stderr: "WARNING:esys:src/tss2-esys/api/Esys_Create.c:375:Esys_Create_Finish() Received TPM Error \nERROR:esys:src/tss2-esys/api/Esys_Create.c:120:Esys_Create() Esys Finish ErrorCode (0x00000921) \nERROR: Esys_Create(0x921) - tpm:warn(2.0): authorizations for objects subject to DA protection are not allowed at this time because the TPM is in DA lockout mode\nERROR: Unable to run tpm2_create\nCreating TPM2 object for jwk failed!\n"
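For anyone hitting this state, a minimal check to confirm the DA lockout condition named in the error (a sketch, assuming tpm2-tools is available on the host) would be:

```shell
# Inspect the TPM's variable properties; under TPM2_PT_PERMANENT,
# "inLockout: 1" confirms the DA lockout mode reported by clevis/tpm2_create.
tpm2_getcap properties-variable | grep -A 10 'TPM2_PT_PERMANENT'

# The lockout counter and recovery thresholds are reported alongside:
tpm2_getcap properties-variable | grep -i 'lockout'
```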

Comment 13 Yoni Bettan 2021-11-03 10:40:58 UTC
Any chance you can give it a try? @sasha

Comment 14 DirectedSoul 2021-11-04 16:11:24 UTC
Just tried a normal SNO deployment without enabling TPM (attachments added); it's working fine!

Comment 16 Alexander Chuzhoy 2021-11-04 22:00:35 UTC
Here's what I'm seeing on the machine with the TPM module:

[root@master-0 ~]# tpm2_clear -c platform
WARNING:esys:src/tss2-esys/api/Esys_Clear.c:286:Esys_Clear_Finish() Received TPM Error 
ERROR:esys:src/tss2-esys/api/Esys_Clear.c:97:Esys_Clear() Esys Finish ErrorCode (0x000009a2) 
ERROR: Esys_Clear(0x9A2) - tpm:session(1):authorization failure without DA implications
ERROR: Unable to run tpm2_clear


[root@master-0 ~]# tpm2_clear
WARNING:esys:src/tss2-esys/api/Esys_Clear.c:286:Esys_Clear_Finish() Received TPM Error 
ERROR:esys:src/tss2-esys/api/Esys_Clear.c:97:Esys_Clear() Esys Finish ErrorCode (0x00000921) 
ERROR: Esys_Clear(0x921) - tpm:warn(2.0): authorizations for objects subject to DA protection are not allowed at this time because the TPM is in DA lockout mode
ERROR: Unable to run tpm2_clear


I was able to select the option in BIOS to run the "clear" action on the TPM.
After rebooting the machine, the console showed that the system configuration had changed, and after the OS booted:

[root@master-0 ~]# tpm2_clear  (returned the prompt with no error)

[root@master-0 ~]# tpm2_clear -c platform
WARNING:esys:src/tss2-esys/api/Esys_Clear.c:286:Esys_Clear_Finish() Received TPM Error 
ERROR:esys:src/tss2-esys/api/Esys_Clear.c:97:Esys_Clear() Esys Finish ErrorCode (0x000009a2) 
ERROR: Esys_Clear(0x9A2) - tpm:session(1):authorization failure without DA implications
ERROR: Unable to run tpm2_clear



When we deploy SNO with disk encryption enabled, we observe the following (note the crypt next to /sysroot):
[root@master-0 ~]# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sr0          11:0    1   104M  0 rom   
nvme0n1     259:0    0   477G  0 disk  
├─nvme0n1p1 259:1    0     1M  0 part  
├─nvme0n1p2 259:2    0   127M  0 part  
├─nvme0n1p3 259:3    0   384M  0 part  /boot
└─nvme0n1p4 259:4    0 476.4G  0 part  
  └─root    253:0    0 476.4G  0 crypt /sysroot
nvme1n1     259:5    0   477G  0 disk 


When we deploy without enabling disk encryption, we see the following:

[root@master-0 ~]# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sr0          11:0    1   104M  0 rom  
nvme0n1     259:0    0   477G  0 disk 
├─nvme0n1p1 259:1    0     1M  0 part 
├─nvme0n1p2 259:2    0   127M  0 part 
├─nvme0n1p3 259:3    0   384M  0 part /boot
└─nvme0n1p4 259:4    0 476.4G  0 part /sysroot
nvme1n1     259:5    0   477G  0 disk
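As a side note, the TPM binding on the encrypted deployment can be verified from the booted host. A sketch, assuming a tpm2-based Clevis binding and the device names from the lsblk output above:

```shell
# Confirm the root partition is a LUKS container.
cryptsetup isLuks /dev/nvme0n1p4 && echo "nvme0n1p4 is LUKS"

# List Clevis bindings on the device; a TPM-bound slot shows a "tpm2" pin.
clevis luks list -d /dev/nvme0n1p4
```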

Comment 17 Alexander Chuzhoy 2021-11-04 22:02:02 UTC
Based on comment #16, removed the testblocker flag.

This needs to be documented.

Comment 18 Yoni Bettan 2021-11-06 20:34:27 UTC
What would be the best place to document this information, in your opinion?
@atraeger

Comment 19 Yoni Bettan 2021-11-07 08:50:44 UTC
*** Bug 2019582 has been marked as a duplicate of this bug. ***

Comment 20 Yoni Bettan 2021-11-08 10:09:26 UTC
"I was able to select the option in BIOS to run "clear" action on tpm."

How this option was called? Can you add a screenshot? @sasha

Comment 21 Alexander Chuzhoy 2021-11-08 16:44:40 UTC
(In reply to Yonathan Bettan from comment #20)
> "I was able to select the option in BIOS to run the "clear" action on the TPM."
> 
> What was this option called? Can you add a screenshot? @sasha

This really depends on the machine type/model.

On an HPE ProLiant e910:
Go into System Utilities ->
        System Configuration ->
            BIOS/Platform Configuration ->
                Server Security ->
                    Trusted Platform Module options ->
                        TPM 2.0 Operations - and select Clear (as in the attached image)

Comment 23 Avishay Traeger 2021-11-09 05:09:22 UTC
My suggestion is:
1. Research TPM a bit more
2. See how we can understand the state of the TPM from the agent
3. If the TPM needs to be cleared:
   a. If it can be cleared, then the agent should just do it
   b. If it can't be cleared, a validation should inform the user to update the BIOS and boot the discovery image again
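As a rough illustration of step 2, the agent could run `tpm2_getcap properties-variable` and parse the `inLockout` bit. The function below is a hypothetical sketch, not actual agent code; the sample text stands in for real output from a locked-out TPM:

```shell
# Hypothetical validation the agent could run: parse `tpm2_getcap
# properties-variable` output and flag the DA lockout bit.
sample='TPM2_PT_PERMANENT:
  ownerAuthSet:              0
  lockoutAuthSet:            0
  inLockout:                 1'

check_da_lockout() {
    # Exits 0 (lockout detected) when inLockout is 1 in the given output.
    printf '%s\n' "$1" | grep -Eq 'inLockout:[[:space:]]*1'
}

if check_da_lockout "$sample"; then
    echo "TPM in DA lockout - ask the user to clear it in BIOS"
fi
```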

Comment 24 Yoni Bettan 2021-11-09 08:48:36 UTC
Adding a follow-up task to handle it and closing the ticket.
https://issues.redhat.com/browse/MGMT-7888

Comment 26 Alexander Chuzhoy 2021-11-10 18:05:52 UTC
The cleanup in general seems like a dangerous procedure:

Clearing erases information stored in the TPM, so we lose all created keys and access to any data encrypted with them.

Comment 28 Yuval Kashtan 2021-12-08 09:08:49 UTC
(In reply to Yonathan Bettan from comment #11)
> This may be the root cause, not sure yet how to solve it.
> 
> ignition[2262]: disks: createLuks: op(5): [failed] Clevis bind: exit status
> 1: Cmd: "clevis" "luks" "bind" "-f" {}}, \"t\":1\" Stdout: "" Stderr:
> "WARNING:esys:src/tss2-esys/api/Esys_Create.c:375:Esys_Create_Finish()
> Received TPM Error \nERROR:esys:src/tss2-esys/api/Esys_Create.c:120:
> Esys_Create() Esys Finish ErrorCode (0x00000921) \nERROR: Esys_Create(0x921)
> - tpm:warn(2.0): authorizations for objects subject to DA protection are not
> allowed at this time because the TPM is in DA lockout mode\nERROR: Unable to
> run tpm2_create\nCreating TPM2 object for jwk failed!\n"

See https://superuser.com/questions/1404738/tpm-2-0-hardware-error-da-lockout-mode
Can you provide the information suggested there, and/or the solution?

Comment 29 Yoni Bettan 2021-12-09 08:13:46 UTC
The solution was to "clean" the previous keys of the TPM for machines that were already using the TPM before the installation.

@shashank should be able to supply the exact commands.

Comment 30 Yuval Kashtan 2021-12-09 08:41:26 UTC
"clean" is a brute-force solution - it will delete ALL keys on the TPM, including ones that are in use by other components (the data disk is my main use case with Telco).

I'm suggesting to see if simply removing the lock solves the issue.
BTW: if so, we might try to see what is causing it to lock; if it's some reboot race condition, we might be able to improve and solve it entirely.
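A gentler alternative to a full clear, assuming the lockout authorization is still the default (empty), would be to reset only the DA lockout counter with tpm2-tools:

```shell
# Reset the dictionary-attack lockout counter without erasing any keys
# (fails if a non-default lockout authorization value is set).
tpm2_dictionarylockout --clear-lockout

# Optionally relax the thresholds so transient auth failures are less
# likely to re-trigger lockout:
tpm2_dictionarylockout --setup-parameters \
    --max-tries=32 --recovery-time=60 --lockout-recovery-time=120
```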

Comment 31 Yoni Bettan 2021-12-13 07:29:41 UTC
FYI @nmagnezi

Comment 40 Aidan Reilly 2023-01-23 13:29:16 UTC
We have documented this limitation - cluster deployments becoming stuck when there are left-over TPM encryption keys from a previous installation on the host. See the note here: https://access.redhat.com/documentation/en-us/assisted_installer_for_openshift_container_platform/2022/html/assisted_installer_for_openshift_container_platform/assembly_enabling-disk-encryption

Is this update sufficient to close this docs bug? There is currently no workaround for this issue.

Comment 41 Shiftzilla 2023-03-09 01:07:53 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-8998

Comment 42 Red Hat Bugzilla 2023-09-18 04:26:50 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days