Bug 2061278 - [IPI] OCP-4.10 baremetal - boot partition is not mounted on temporary directory
Summary: [IPI] OCP-4.10 baremetal - boot partition is not mounted on temporary directory
Keywords:
Status: CLOSED DUPLICATE of bug 2053752
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.z
Assignee: Derek Higgins
QA Contact: Lubov
URL:
Whiteboard:
Depends On: 2053752
Blocks:
 
Reported: 2022-03-07 09:13 UTC by Derek Higgins
Modified: 2022-05-25 14:43 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2053752
Environment:
Last Closed: 2022-05-25 08:13:20 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/ironic-agent-image pull 43 (open): Bug 2061278: [release-4.10] Retry UEFI boot Configuration (last updated 2022-03-07 09:16:49 UTC)

Description Derek Higgins 2022-03-07 09:13:57 UTC
+++ This bug was initially created as a clone of Bug #2053752 +++

Description of problem:

While attempting to install OCP 4.10.0 RC 2022-02-04 with baremetal IPI, we have found random failures in ironic_python_agent/extensions/image.py when it attempts to mount the boot partition (/dev/sda2) onto a temporary directory. This prevents the installation from completing on all of the master nodes.
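
For reference, the failing operation is essentially "create a temporary directory, then mount the EFI partition into it". The following is only an illustrative Python sketch of that pattern, not the actual ironic-python-agent code; the function name, device path, and directory layout are assumptions for the example.

# Illustrative sketch of the failing pattern -- not the actual
# ironic-python-agent code; device path and directory layout are assumed.
import os
import subprocess
import tempfile

def mount_efi_partition(device="/dev/sda2"):
    efi_mount = os.path.join(tempfile.mkdtemp(), "boot", "efi")
    os.makedirs(efi_mount)
    # If the kernel has not (yet) registered the partition, mount exits
    # with code 32 ("special device ... does not exist"), which is the
    # failure shown in the install log below.
    subprocess.run(["mount", device, efi_mount], check=True)
    return efi_mount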

Version-Release number of selected component (if applicable):


OpenShift Installer 4.10.0-rc.1
Built from commit 4fc9fa88c22221b6cede2456b1c33847943b75c9


How reproducible:

Frequently, but not all the time


Steps to Reproduce:
1. Installing OCP using dci-openshift-agent (that uses IPI)
2. Sometimes a master node won't complete

Actual results:

Excerpt of the install log:

time="2022-02-11T13:06:19-05:00" level=debug msg="ironic_deployment.openshift-master-deployment[1]: Creation complete after 1m1s [id=2ec51de8-ddbf-4c6e-94d6-d9da581ea8d3]"
time="2022-02-11T13:06:19-05:00" level=error
time="2022-02-11T13:06:19-05:00" level=error msg="Error: cannot go from state 'deploy failed' to state 'manageable' , last error was 'Deploy step deploy.install_coreos failed on node a7985248-1cf3-494a-9771-de48e4500a62. Could not verify uefi on device /dev/sda, failed with Unexpected error while running command."
time="2022-02-11T13:06:19-05:00" level=error msg="Command: mount /dev/sda2 /tmp/tmp0wftrg8k/boot/efi"
time="2022-02-11T13:06:19-05:00" level=error msg="Exit code: 32"
time="2022-02-11T13:06:19-05:00" level=error msg="Stdout: ''"
time="2022-02-11T13:06:19-05:00" level=error msg="Stderr: 'mount: /tmp/tmp0wftrg8k/boot/efi: special device /dev/sda2 does not exist.\\n'.'"
time="2022-02-11T13:06:19-05:00" level=error
time="2022-02-11T13:06:19-05:00" level=error msg="  on ../../tmp/openshift-install-masters-1714874582/main.tf line 43, in resource \"ironic_deployment\" \"openshift-master-deployment\":"
time="2022-02-11T13:06:19-05:00" level=error msg="  43: resource \"ironic_deployment\" \"openshift-master-deployment\" {"
time="2022-02-11T13:06:19-05:00" level=error
time="2022-02-11T13:06:19-05:00" level=error
time="2022-02-11T13:06:19-05:00" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change

master-0 node:

$ df -PhT /boot
Filesystem     Type      Size  Used Avail Use% Mounted on
/dev/loop1     squashfs  883M  883M     0 100% /boot


$ ls -ltr /tmp
total 4
drwx------ 3 root root   60 Feb 11 18:00 systemd-private-b818b9aaddd64104ac98e75ebcf63b8d-chronyd.service-l0sYIf                                                                                                  
-rw-r--r-- 1 root root 1708 Feb 11 18:05 ironic.ign

$ last
reboot   system boot  4.18.0-305.34.2. Fri Feb 11 18:00   still running

wtmp begins Fri Feb 11 18:00:44 2022


$ mkdir /tmp/test
$ sudo mount /dev/sda2 /tmp/test/
$ ls /tmp/test/
EFI


$ sudo umount /dev/sda2
$ sudo mount /dev/sda2 /non-existing-dir
mount: /non-existing-dir: mount point does not exist.
$ echo $?
32
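
The same exit-code behaviour can be reproduced programmatically. This is a hypothetical snippet with placeholder paths, only meant to show that mount reports exit code 32 whether the device node or the mount point is missing; it needs root privileges, like the sudo commands above.

# Hypothetical reproduction of the exit-code check; paths are placeholders
# and the command must run as root (or via sudo), as above.
import subprocess

result = subprocess.run(["mount", "/dev/sda2", "/non-existing-dir"],
                        capture_output=True, text=True)
print(result.returncode)  # 32, matching the error in the install log
print(result.stderr)      # "mount: /non-existing-dir: mount point does not exist."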

$ sudo fdisk -l /dev/sda
GPT PMBR size mismatch (7884799 != 936640511) will be corrected by write.
The backup GPT table is not on the end of the device. This problem will be corrected by write.
Disk /dev/sda: 446.6 GiB, 479559942144 bytes, 936640512 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 262144 bytes
Disklabel type: gpt
Disk identifier: 00000000-0000-4000-A000-000000000001

Device       Start     End Sectors  Size Type
/dev/sda1     2048    4095    2048    1M BIOS boot
/dev/sda2     4096  264191  260096  127M EFI System
/dev/sda3   264192 1050623  786432  384M Linux filesystem
/dev/sda4  1050624 7884766 6834143  3.3G Linux filesystem


Expected results:

The installation should complete on all the master nodes.

Additional info:

Through dci-openshift-agent we upload logs and details of the installation; these are the last 3 times we've seen this issue:

- https://www.distributed-ci.io/jobs/6ae69cec-ee6f-4de0-a69e-2a86bed35c9c/files
- https://www.distributed-ci.io/jobs/e7305c1e-e0a7-4ee6-afac-9da05e89e297/files


We're including openshift_install.log but can provide more logs if needed.

--- Additional comment from Dmitry Tantsur on 2022-02-14 10:59:41 GMT ---

a7985248-1cf3-494a-9771-de48e4500a62_master-0.cluster10.core.dfwt5g.lab_2022-02-11-18-06-10.tar.gz: 2022-02-11 18:06:04.093 1 DEBUG oslo_concurrency.processutils [-] CMD "partx -a /dev/sda" returned: 1 in 0.003s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:423
a7985248-1cf3-494a-9771-de48e4500a62_master-0.cluster10.core.dfwt5g.lab_2022-02-11-18-06-10.tar.gz: 2022-02-11 18:06:04.094 1 DEBUG oslo_concurrency.processutils [-] 'partx -a /dev/sda' failed. Not Retrying. execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:474
a7985248-1cf3-494a-9771-de48e4500a62_master-0.cluster10.core.dfwt5g.lab_2022-02-11-18-06-10.tar.gz: 2022-02-11 18:06:04.094 1 DEBUG ironic_lib.utils [-] Command stdout is: "" _log /usr/lib/python3.6/site-packages/ironic_lib/utils.py:99
a7985248-1cf3-494a-9771-de48e4500a62_master-0.cluster10.core.dfwt5g.lab_2022-02-11-18-06-10.tar.gz: 2022-02-11 18:06:04.094 1 DEBUG ironic_lib.utils [-] Command stderr is: "partx: /dev/sda: error adding partitions 1-4
a7985248-1cf3-494a-9771-de48e4500a62_master-0.cluster10.core.dfwt5g.lab_2022-02-11-18-06-10.tar.gz: " _log /usr/lib/python3.6/site-packages/ironic_lib/utils.py:100


Okay, we're unable to tell the kernel about new partitions. Unfortunately, we run without the -v flag, so it's hard to tell why.
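
For what it's worth, capturing the verbose partx output would look something like the snippet below. This is a hypothetical debugging aid using plain subprocess rather than the agent's own process utilities, and it assumes root access to /dev/sda.

# Hypothetical debugging snippet: run partx with -v so stderr explains
# why adding the partitions failed. Requires root access to the device.
import subprocess

proc = subprocess.run(["partx", "-v", "-a", "/dev/sda"],
                      capture_output=True, text=True)
print("exit code:", proc.returncode)
print("stderr:", proc.stderr)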

--- Additional comment from Dmitry Tantsur on 2022-02-25 16:22:02 GMT ---

Unmarking this as triaged. We don't know the root cause; the errors in comment 1 also happen on successful deployments. We can observe sda2 existing both before and after the failure.

--- Additional comment from Dmitry Tantsur on 2022-02-25 16:23:37 GMT ---

I think Derek is looking into adding retries.

--- Additional comment from Derek Higgins on 2022-03-01 16:40:36 GMT ---

We haven't been able to isolate/reproduce this to get to the root of the problem; in the meantime
I've pushed a retry for the failing operation.

--- Additional comment from Marius Cornea on 2022-03-03 13:34:39 GMT ---

I was also seeing this issue on ProLiant DL380 Gen10 machines and it was reproducing pretty consistently with 4.10.0-rc.6

--- Additional comment from Lubov on 2022-03-07 08:59:54 GMT ---

@mcornea could you please try to deploy the latest 4.11 nightly on your setup?
On our CI setup the deployment passed last night.

Comment 10 Lubov 2022-05-01 13:22:19 UTC
verification failed on 4.10.0-0.nightly-2022-04-30-165345

msg="Error: cannot go from state 'deploy failed' to state 'manageable' , last error was 'Deploy step deploy.install_coreos failed on node 0e28efd2-6885-4c01-af13-aa3bee3d6031. Could not verify uefi on device /dev/sda, failed with Unexpected error while running command."
time="2022-05-01T14:44:22+03:00" level=error msg="Command: mount /dev/sda2 /tmp/tmph8ib8cp0/boot/efi"

here is must-gather https://s3.upshift.redhat.com/DH-PROD-OCP-EDGE-QE-CI/Infra/must-gather/1042/index.html

bootstrap log bundle: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ2061278_log-bundle-bootstrap.tar.gz

Comment 11 Derek Higgins 2022-05-03 09:13:36 UTC
Looking at the must-gather I can see that the patch was used and failed to fix the problem
(although it did seem to work on 4.11, maybe due to different hardware),
so we need to do more.

I had another PR pushed upstream here:
https://github.com/metal3-io/ironic-agent-image/pull/19
This adds more retries with some sleep time between them.

So we need to merge this, sync it downstream, test on the same hardware and, assuming it fixes the problem, backport it to 4.10.
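
For context, the retry-with-sleep approach described above amounts to something like the sketch below. This is only an illustration of the idea, not the actual patch in the linked PR; the attempt count and delay are made-up values.

# Sketch of the retry-with-sleep idea -- not the actual patch from the
# linked PR. Attempt count and delay are arbitrary example values.
import subprocess
import time

def mount_with_retries(device, mount_point, attempts=5, delay=10):
    result = None
    for _ in range(attempts):
        result = subprocess.run(["mount", device, mount_point],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return
        # The partition device node sometimes shows up only after a short
        # delay, so wait before trying again.
        time.sleep(delay)
    raise RuntimeError("mount %s failed after %d attempts: %s"
                       % (device, attempts, result.stderr))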

Comment 12 Lubov 2022-05-04 08:10:07 UTC
(In reply to Derek Higgins from comment #11)
> Looking at the must-gather I can see that the patch was used and failed to
> fix the problem
> (although it did seem to work on 4.11, maybe due to different hardware),
> so we need to do more.
> 
> I had another PR pushed upstream here:
> https://github.com/metal3-io/ironic-agent-image/pull/19
> This adds more retries with some sleep time between them.
> 
> So we need to merge this, sync it downstream, test on the same hardware and,
> assuming it fixes the problem, backport it to 4.10.

We have only one setup where this problem happens. I tested the fix for 4.11 on that same setup. I hope this failure isn't just happening from time to time now, after the first fix.

Comment 19 Derek Higgins 2022-05-25 08:13:20 UTC
Both bz 2053752 and 2061278 refer to the same bug in 4.10;
closing this one.

*** This bug has been marked as a duplicate of bug 2053752 ***

