2053752 – [IPI] OCP-4.10 baremetal - boot partition is not mounted on temporary directory

Bug 2053752 - [IPI] OCP-4.10 baremetal - boot partition is not mounted on temporary directory

Summary: [IPI] OCP-4.10 baremetal - boot partition is not mounted on temporary directory

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Bare Metal Hardware Provisioning
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.10.z
Assignee:	Derek Higgins
QA Contact:	Lubov
Docs Contact:
URL:
Whiteboard:
Duplicates (3):	2057668 2061278 2090631 (view as bug list)
Depends On:	2086759
Blocks:	2061278
TreeView+	depends on / blocked

Reported:	2022-02-11 22:56 UTC by tonyg
Modified:	2022-07-13 10:10 UTC (History)
CC List:	19 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	2061278 (view as bug list)
Environment:
Last Closed:	2022-06-28 11:50:26 UTC
Target Upstream Version:
Embargoed:
Flags:	tsedovic: needinfo-

Attachments	(Terms of Use)
OpenShift install log from latest deployment (173.32 KB, text/plain) 2022-02-11 22:56 UTC, tonyg	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift ironic-agent-image pull 42	None	Merged	Bug 2053752: Retry UEFI boot Configuration	2022-06-10 09:34:21 UTC
Github	openshift ironic-agent-image pull 48	None	Merged	Bug 2053752: Bump retries for UEFI boot configuration	2022-06-10 09:34:23 UTC
Github	openshift ironic-agent-image pull 49	None	Merged	Bug 2053752: import time	2022-06-10 09:34:24 UTC
Red Hat Product Errata	RHBA-2022:5172	None	None	None	2022-06-28 11:50:54 UTC

Description tonyg 2022-02-11 22:56:59 UTC

Created attachment 1860700 [details]
OpenShift install log from latest deployment

Description of problem:

While attempting to install OCP 4.10.0 RC 2022-02-04 with baremetal IPI we have found random failures on ironic_python_agent/extensions/image.py when attempting to mount the boot partition (/dev/sda2) into a temporary file, this is causing to not complete the installation on all the master nodes.

Version-Release number of selected component (if applicable):


OpenShift Installer 4.10.0-rc.1
Built from commit 4fc9fa88c22221b6cede2456b1c33847943b75c9


How reproducible:

Frequently, but not all the time


Steps to Reproduce:
1. Installing OCP using dci-openshift-agent (that uses IPI)
2. Sometimes a master node won't complete

Actual results:

Excerpt of the install log:

time="2022-02-11T13:06:19-05:00" level=debug msg="ironic_deployment.openshift-master-deployment[1]: Creation complete after 1m1s [id=2ec51de8-ddbf-4c6e-94d6-d9da581ea8d3]"
time="2022-02-11T13:06:19-05:00" level=error
time="2022-02-11T13:06:19-05:00" level=error msg="Error: cannot go from state 'deploy failed' to state 'manageable' , last error was 'Deploy step deploy.install_coreos failed on node a7985248-1cf3-494a-9771-de48e4500a62. Could not verify uefi on device /dev/sda, failed with Unexpected error while running command."
time="2022-02-11T13:06:19-05:00" level=error msg="Command: mount /dev/sda2 /tmp/tmp0wftrg8k/boot/efi"
time="2022-02-11T13:06:19-05:00" level=error msg="Exit code: 32"
time="2022-02-11T13:06:19-05:00" level=error msg="Stdout: ''"
time="2022-02-11T13:06:19-05:00" level=error msg="Stderr: 'mount: /tmp/tmp0wftrg8k/boot/efi: special device /dev/sda2 does not exist.\\n'.'"
time="2022-02-11T13:06:19-05:00" level=error
time="2022-02-11T13:06:19-05:00" level=error msg="  on ../../tmp/openshift-install-masters-1714874582/main.tf line 43, in resource \"ironic_deployment\" \"openshift-master-deployment\":"
time="2022-02-11T13:06:19-05:00" level=error msg="  43: resource \"ironic_deployment\" \"openshift-master-deployment\" {"
time="2022-02-11T13:06:19-05:00" level=error
time="2022-02-11T13:06:19-05:00" level=error
time="2022-02-11T13:06:19-05:00" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change

master-0 node:

$ df -PhT /boot
Filesystem     Type      Size  Used Avail Use% Mounted on
/dev/loop1     squashfs  883M  883M     0 100% /boot


$ ls -ltr /tmp
total 4
drwx------ 3 root root   60 Feb 11 18:00 systemd-private-b818b9aaddd64104ac98e75ebcf63b8d-chronyd.service-l0sYIf                                                                                                  
-rw-r--r-- 1 root root 1708 Feb 11 18:05 ironic.ign

$ last
reboot   system boot  4.18.0-305.34.2. Fri Feb 11 18:00   still running

wtmp begins Fri Feb 11 18:00:44 2022


$ mkdir /tmp/test
$ sudo mount /dev/sda2 /tmp/test/
$ ls /tmp/test/
EFI


$ sudo umount /dev/sda2
$ sudo mount /dev/sda2 /non-existing-dir
mount: /non-existing-dir: mount point does not exist.
$ echo $?
32

$ sudo fdisk -l /dev/sda
GPT PMBR size mismatch (7884799 != 936640511) will be corrected by write.
The backup GPT table is not on the end of the device. This problem will be corrected by write.
Disk /dev/sda: 446.6 GiB, 479559942144 bytes, 936640512 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 262144 bytes
Disklabel type: gpt
Disk identifier: 00000000-0000-4000-A000-000000000001

Device       Start     End Sectors  Size Type
/dev/sda1     2048    4095    2048    1M BIOS boot
/dev/sda2     4096  264191  260096  127M EFI System
/dev/sda3   264192 1050623  786432  384M Linux filesystem
/dev/sda4  1050624 7884766 6834143  3.3G Linux filesystem


Expected results:

The installation should complete on all the master nodes.

Additional info:

Through dci-openshift-agent we upload logs and details of the installation, these are the last 3 times we've seen this issue.

- https://www.distributed-ci.io/jobs/6ae69cec-ee6f-4de0-a69e-2a86bed35c9c/files
- https://www.distributed-ci.io/jobs/e7305c1e-e0a7-4ee6-afac-9da05e89e297/files


Including openshift_install.log but can provide more logs

Comment 1 Dmitry Tantsur 2022-02-14 10:59:41 UTC

a7985248-1cf3-494a-9771-de48e4500a62_master-0.cluster10.core.dfwt5g.lab_2022-02-11-18-06-10.tar.gz: 2022-02-11 18:06:04.093 1 DEBUG oslo_concurrency.processutils [-] CMD "partx -a /dev/sda" returned: 1 in 0.003s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:423
a7985248-1cf3-494a-9771-de48e4500a62_master-0.cluster10.core.dfwt5g.lab_2022-02-11-18-06-10.tar.gz: 2022-02-11 18:06:04.094 1 DEBUG oslo_concurrency.processutils [-] 'partx -a /dev/sda' failed. Not Retrying. execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:474
a7985248-1cf3-494a-9771-de48e4500a62_master-0.cluster10.core.dfwt5g.lab_2022-02-11-18-06-10.tar.gz: 2022-02-11 18:06:04.094 1 DEBUG ironic_lib.utils [-] Command stdout is: "" _log /usr/lib/python3.6/site-packages/ironic_lib/utils.py:99
a7985248-1cf3-494a-9771-de48e4500a62_master-0.cluster10.core.dfwt5g.lab_2022-02-11-18-06-10.tar.gz: 2022-02-11 18:06:04.094 1 DEBUG ironic_lib.utils [-] Command stderr is: "partx: /dev/sda: error adding partitions 1-4
a7985248-1cf3-494a-9771-de48e4500a62_master-0.cluster10.core.dfwt5g.lab_2022-02-11-18-06-10.tar.gz: " _log /usr/lib/python3.6/site-packages/ironic_lib/utils.py:100


Okay, we're unable to tell the kernel about new partitions. Unfortunately, we run without the -v flag, so hard to tell why.

Comment 2 Dmitry Tantsur 2022-02-25 16:20:55 UTC

*** Bug 2057668 has been marked as a duplicate of this bug. ***

Comment 3 Dmitry Tantsur 2022-02-25 16:22:02 UTC

Unmarking this as triaged. We don't know the root cause, the errors in comment 1 also happen on successful deployments. We can observe sda2 existing both before and after the failure.

Comment 4 Dmitry Tantsur 2022-02-25 16:23:37 UTC

I think Derek is looking into adding retries.

Comment 5 Derek Higgins 2022-03-01 16:40:36 UTC

We havn't been able to isolate/reproduce this to get to the root of the problem, in the mean time
I've pushed a retry for the failing operation

Comment 6 Marius Cornea 2022-03-03 13:34:39 UTC

I was also seeing this issue on ProLiant DL380 Gen10 machines and it was reproducing pretty consistently with 4.10.0-rc.6

Comment 7 Lubov 2022-03-07 08:59:54 UTC

@mcornea could you, please try to deploy on your setup the last 4.11 nightly, please
On out CI setup the deployment passed last night

Comment 8 Marius Cornea 2022-03-07 09:50:22 UTC

(In reply to Lubov from comment #7)
> @mcornea could you, please try to deploy on your setup the last
> 4.11 nightly, please
> On out CI setup the deployment passed last night

I can confirm the nodes were deployed as well on my environment with 4.11.0-0.nightly-2022-03-06-020555.

Comment 9 shreepat 2022-03-17 20:11:21 UTC

Encountered  this issue today during OCP-4.10 mgmt cluster installation.


time="2022-03-17T11:12:15-05:00" level=error msg="Error: cannot go from state 'deploy failed' to state 'manageable' , last error was 'Deploy step deploy.install_coreos failed on node 920b03ad-9311-492c-899d-dcf56b181d2c. Could not verify uefi on device /dev/sda, failed with Unexpected error while running command."
time="2022-03-17T11:12:15-05:00" level=error msg="Command: mount /dev/sda2 /tmp/tmpd66e7u6t/boot/efi"
time="2022-03-17T11:12:15-05:00" level=error msg="Exit code: 32"
time="2022-03-17T11:12:15-05:00" level=error msg="Stdout: ''"
time="2022-03-17T11:12:15-05:00" level=error msg="Stderr: 'mount: /tmp/tmpd66e7u6t/boot/efi: special device /dev/sda2 does not exist.\\n'.' 

2022-03-17 16:11:09.089 1 ERROR ironic.conductor.utils [-] Deploy step deploy.install_coreos failed on node 920b03ad-9311-492c-899d-dcf56b181d2c. Could not verify uefi on device /dev/sda, failed with Unexpected error while running command.

DCI job : 

https://www.distributed-ci.io/jobs/4480dd4c-a4f8-41d3-9717-99a11393af61/jobStates

Comment 15 Lubov 2022-05-24 14:09:18 UTC

verified on 4.11.0-0.nightly-2022-05-20-213928

Comment 16 Derek Higgins 2022-05-25 08:13:20 UTC

*** Bug 2061278 has been marked as a duplicate of this bug. ***

Comment 20 Derek Higgins 2022-05-26 08:26:41 UTC

cherry-pick was missing a import (it wasn't needed in origin/main)
https://bugzilla.redhat.com/show_bug.cgi?id=2090631

Comment 21 Derek Higgins 2022-05-26 08:28:46 UTC

*** Bug 2090631 has been marked as a duplicate of this bug. ***

Comment 41 Lubov 2022-06-14 17:00:46 UTC

verified on 4.10.18. run twice

Comment 54 errata-xmlrpc 2022-06-28 11:50:26 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.20 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5172

Note You need to log in before you can comment on or make changes to this bug.