Bug 2086759

Summary: [IPI] OCP-4.11 baremetal - boot partition is not mounted on temporary directory
Product: OpenShift Container Platform Reporter: Derek Higgins <derekh>
Component: Bare Metal Hardware Provisioning    Assignee: Derek Higgins <derekh>
Sub component: ironic    QA Contact: Lubov <lshilin>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: lshilin, rpittau
Version: 4.11    Keywords: Triaged
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-10 11:12:25 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2053752    

Description Derek Higgins 2022-05-16 14:39:59 UTC
This bug was initially created as a copy of Bug #2061278

I am copying this bug because: 
I need to first fix the problem in 4.11


+++ This bug was initially created as a clone of Bug #2053752 +++

Description of problem:

While attempting to install OCP 4.10.0 RC 2022-02-04 with baremetal IPI, we have seen random failures in ironic_python_agent/extensions/image.py when attempting to mount the boot partition (/dev/sda2) onto a temporary directory; this prevents the installation from completing on all of the master nodes.
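
For context, the failing deploy step boils down to creating a temporary directory and mounting the EFI system partition into it before checking its contents. A minimal sketch of that pattern (illustrative only, not the exact code from ironic_python_agent/extensions/image.py; device paths are taken from the failure below):

import os
import subprocess
import tempfile

def verify_uefi_sketch(efi_device="/dev/sda2"):
    # Mount the EFI system partition under a fresh temporary directory,
    # mirroring the /tmp/tmpXXXXXXXX/boot/efi paths seen in the logs.
    with tempfile.TemporaryDirectory() as tmp:
        mount_point = os.path.join(tmp, "boot", "efi")
        os.makedirs(mount_point)
        # This is the call that fails with exit code 32 when the kernel
        # does not (yet) expose /dev/sda2.
        subprocess.run(["mount", efi_device, mount_point], check=True)
        try:
            print(os.listdir(mount_point))  # expect to see EFI/
        finally:
            subprocess.run(["umount", mount_point], check=True)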

Version-Release number of selected component (if applicable):


OpenShift Installer 4.10.0-rc.1
Built from commit 4fc9fa88c22221b6cede2456b1c33847943b75c9


How reproducible:

Frequently, but not all the time


Steps to Reproduce:
1. Install OCP using dci-openshift-agent (which uses IPI)
2. Sometimes a master node won't complete deployment

Actual results:

Excerpt of the install log:

time="2022-02-11T13:06:19-05:00" level=debug msg="ironic_deployment.openshift-master-deployment[1]: Creation complete after 1m1s [id=2ec51de8-ddbf-4c6e-94d6-d9da581ea8d3]"
time="2022-02-11T13:06:19-05:00" level=error
time="2022-02-11T13:06:19-05:00" level=error msg="Error: cannot go from state 'deploy failed' to state 'manageable' , last error was 'Deploy step deploy.install_coreos failed on node a7985248-1cf3-494a-9771-de48e4500a62. Could not verify uefi on device /dev/sda, failed with Unexpected error while running command."
time="2022-02-11T13:06:19-05:00" level=error msg="Command: mount /dev/sda2 /tmp/tmp0wftrg8k/boot/efi"
time="2022-02-11T13:06:19-05:00" level=error msg="Exit code: 32"
time="2022-02-11T13:06:19-05:00" level=error msg="Stdout: ''"
time="2022-02-11T13:06:19-05:00" level=error msg="Stderr: 'mount: /tmp/tmp0wftrg8k/boot/efi: special device /dev/sda2 does not exist.\\n'.'"
time="2022-02-11T13:06:19-05:00" level=error
time="2022-02-11T13:06:19-05:00" level=error msg="  on ../../tmp/openshift-install-masters-1714874582/main.tf line 43, in resource \"ironic_deployment\" \"openshift-master-deployment\":"
time="2022-02-11T13:06:19-05:00" level=error msg="  43: resource \"ironic_deployment\" \"openshift-master-deployment\" {"
time="2022-02-11T13:06:19-05:00" level=error
time="2022-02-11T13:06:19-05:00" level=error
time="2022-02-11T13:06:19-05:00" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change

master-0 node:

$ df -PhT /boot
Filesystem     Type      Size  Used Avail Use% Mounted on
/dev/loop1     squashfs  883M  883M     0 100% /boot


$ ls -ltr /tmp
total 4
drwx------ 3 root root   60 Feb 11 18:00 systemd-private-b818b9aaddd64104ac98e75ebcf63b8d-chronyd.service-l0sYIf                                                                                                  
-rw-r--r-- 1 root root 1708 Feb 11 18:05 ironic.ign

$ last
reboot   system boot  4.18.0-305.34.2. Fri Feb 11 18:00   still running

wtmp begins Fri Feb 11 18:00:44 2022


$ mkdir /tmp/test
$ sudo mount /dev/sda2 /tmp/test/
$ ls /tmp/test/
EFI


$ sudo umount /dev/sda2
$ sudo mount /dev/sda2 /non-existing-dir
mount: /non-existing-dir: mount point does not exist.
$ echo $?
32

$ sudo fdisk -l /dev/sda
GPT PMBR size mismatch (7884799 != 936640511) will be corrected by write.
The backup GPT table is not on the end of the device. This problem will be corrected by write.
Disk /dev/sda: 446.6 GiB, 479559942144 bytes, 936640512 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 262144 bytes
Disklabel type: gpt
Disk identifier: 00000000-0000-4000-A000-000000000001

Device       Start     End Sectors  Size Type
/dev/sda1     2048    4095    2048    1M BIOS boot
/dev/sda2     4096  264191  260096  127M EFI System
/dev/sda3   264192 1050623  786432  384M Linux filesystem
/dev/sda4  1050624 7884766 6834143  3.3G Linux filesystem


Expected results:

The installation should complete on all the master nodes.

Additional info:

Through dci-openshift-agent we upload logs and details of the installation; these are the last 3 times we've seen this issue:

- https://www.distributed-ci.io/jobs/6ae69cec-ee6f-4de0-a69e-2a86bed35c9c/files
- https://www.distributed-ci.io/jobs/e7305c1e-e0a7-4ee6-afac-9da05e89e297/files


Including openshift_install.log, but we can provide more logs.

--- Additional comment from Dmitry Tantsur on 2022-02-14 10:59:41 GMT ---

a7985248-1cf3-494a-9771-de48e4500a62_master-0.cluster10.core.dfwt5g.lab_2022-02-11-18-06-10.tar.gz: 2022-02-11 18:06:04.093 1 DEBUG oslo_concurrency.processutils [-] CMD "partx -a /dev/sda" returned: 1 in 0.003s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:423
a7985248-1cf3-494a-9771-de48e4500a62_master-0.cluster10.core.dfwt5g.lab_2022-02-11-18-06-10.tar.gz: 2022-02-11 18:06:04.094 1 DEBUG oslo_concurrency.processutils [-] 'partx -a /dev/sda' failed. Not Retrying. execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:474
a7985248-1cf3-494a-9771-de48e4500a62_master-0.cluster10.core.dfwt5g.lab_2022-02-11-18-06-10.tar.gz: 2022-02-11 18:06:04.094 1 DEBUG ironic_lib.utils [-] Command stdout is: "" _log /usr/lib/python3.6/site-packages/ironic_lib/utils.py:99
a7985248-1cf3-494a-9771-de48e4500a62_master-0.cluster10.core.dfwt5g.lab_2022-02-11-18-06-10.tar.gz: 2022-02-11 18:06:04.094 1 DEBUG ironic_lib.utils [-] Command stderr is: "partx: /dev/sda: error adding partitions 1-4
a7985248-1cf3-494a-9771-de48e4500a62_master-0.cluster10.core.dfwt5g.lab_2022-02-11-18-06-10.tar.gz: " _log /usr/lib/python3.6/site-packages/ironic_lib/utils.py:100


Okay, we're unable to tell the kernel about new partitions. Unfortunately, we run without the -v flag, so it's hard to tell why.
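
As a hedged sketch of how that rescan could be made more debuggable, the same partx call could be run with -v (the standard verbose flag) and a few retries via oslo_concurrency.processutils; the function name and exit-code handling here are assumptions, not the actual ironic-lib/IPA call site:

from oslo_concurrency import processutils

def rescan_partitions(device="/dev/sda"):
    # Same operation as in the log above, but verbose so partx reports
    # why adding partitions 1-4 failed, and with a few retries for
    # transient failures.
    out, err = processutils.execute(
        "partx", "-a", "-v", device,
        attempts=3, delay_on_retry=True,
        # Exit code 1 is tolerated here because, as noted below, the same
        # partx error also shows up on successful deployments.
        check_exit_code=[0, 1],
    )
    return out, err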

--- Additional comment from Dmitry Tantsur on 2022-02-25 16:20:55 GMT ---



--- Additional comment from Dmitry Tantsur on 2022-02-25 16:22:02 GMT ---

Unmarking this as triaged. We don't know the root cause; the errors in comment 1 also happen on successful deployments. We can observe sda2 existing both before and after the failure.

--- Additional comment from Dmitry Tantsur on 2022-02-25 16:23:37 GMT ---

I think Derek is looking into adding retries.

--- Additional comment from Derek Higgins on 2022-03-01 16:40:36 GMT ---

We haven't been able to isolate/reproduce this to get to the root of the problem; in the meantime
I've pushed a retry for the failing operation.
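
For reference, a minimal sketch of the kind of retry wrapper meant here, using tenacity-style retries; the attempt count, wait time, and function name are illustrative and not taken from the actual patch:

import subprocess
import tenacity

@tenacity.retry(
    retry=tenacity.retry_if_exception_type(subprocess.CalledProcessError),
    stop=tenacity.stop_after_attempt(3),  # illustrative; see the patch for the real number
    wait=tenacity.wait_fixed(1),
    reraise=True,
)
def mount_with_retries(efi_device, mount_point):
    # Retry the mount a few times in case the kernel has not yet
    # re-read the partition table and /dev/sda2 is briefly missing.
    subprocess.run(["mount", efi_device, mount_point], check=True)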

--- Additional comment from Marius Cornea on 2022-03-03 13:34:39 GMT ---

I was also seeing this issue on ProLiant DL380 Gen10 machines, and it was reproducing pretty consistently with 4.10.0-rc.6.

--- Additional comment from Lubov on 2022-03-07 08:59:54 GMT ---

@mcornea could you please try to deploy the latest 4.11 nightly on your setup?
On our CI setup the deployment passed last night.

Comment 1 Derek Higgins 2022-05-16 14:41:16 UTC
Testing in the CI environment where this bug reproduces proves that the current number of retries isn't adequate.

Bumping the number.

Comment 3 Lubov 2022-05-24 18:50:57 UTC
Verified on 4.11.0-0.nightly-2022-05-20-213928.

Comment 5 errata-xmlrpc 2022-08-10 11:12:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069