Bug 2053752
| Field | Value |
|---|---|
| Summary | [IPI] OCP-4.10 baremetal - boot partition is not mounted on temporary directory |
| Product | OpenShift Container Platform |
| Component | Bare Metal Hardware Provisioning |
| Sub component | ironic |
| Reporter | tonyg |
| Assignee | Derek Higgins <derekh> |
| QA Contact | Lubov <lshilin> |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Version | 4.10 |
| Target Release | 4.10.z |
| Keywords | AutomationBlocker, Regression, Triaged |
| Flags | tsedovic: needinfo- |
| CC | asalvati, bfournie, bmuchiny, derekh, eglottma, fbaudin, fsoppels, gvillani, josearod, lshilin, manrodri, mcornea, openshift-bugs-escalate, rugouvei, shreepat, skrenger, snetting, tsedovic, yprokule |
| Hardware | Unspecified |
| OS | Unspecified |
| Type | Bug |
| Clones | 2061278 (view as bug list) |
| Bug Depends On | 2086759 |
| Bug Blocks | 2061278 |
| Last Closed | 2022-06-28 11:50:26 UTC |
Description (tonyg, 2022-02-11 22:56:59 UTC)
From the collected logs (a7985248-1cf3-494a-9771-de48e4500a62_master-0.cluster10.core.dfwt5g.lab_2022-02-11-18-06-10.tar.gz):

2022-02-11 18:06:04.093 1 DEBUG oslo_concurrency.processutils [-] CMD "partx -a /dev/sda" returned: 1 in 0.003s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:423
2022-02-11 18:06:04.094 1 DEBUG oslo_concurrency.processutils [-] 'partx -a /dev/sda' failed. Not Retrying. execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:474
2022-02-11 18:06:04.094 1 DEBUG ironic_lib.utils [-] Command stdout is: "" _log /usr/lib/python3.6/site-packages/ironic_lib/utils.py:99
2022-02-11 18:06:04.094 1 DEBUG ironic_lib.utils [-] Command stderr is: "partx: /dev/sda: error adding partitions 1-4" _log /usr/lib/python3.6/site-packages/ironic_lib/utils.py:100

Okay, we're unable to tell the kernel about the new partitions. Unfortunately, we run partx without the -v flag, so it is hard to tell why.

*** Bug 2057668 has been marked as a duplicate of this bug. ***

Unmarking this as triaged. We don't know the root cause; the errors in comment 1 also happen on successful deployments. We can observe sda2 existing both before and after the failure. I think Derek is looking into adding retries.

We haven't been able to isolate/reproduce this to get to the root of the problem; in the meantime I've pushed a retry for the failing operation.
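As a rough illustration of that retry approach (a hypothetical helper for this report, not the actual ironic / ironic-python-agent patch; the function name and the attempts/delay parameters are made up), the failing rescan could be wrapped like this, with -v added so the reason for a failure shows up in the logs:

```python
import subprocess
import time


def rescan_partitions(device, attempts=5, delay=2.0):
    """Ask the kernel to re-read the partition table, retrying on failure.

    Hypothetical sketch of the retry described above; not the shipped fix.
    Note that "error adding partitions" can also be benign when the
    partitions are already registered with the kernel.
    """
    last_err = None
    for attempt in range(1, attempts + 1):
        result = subprocess.run(
            ["partx", "-a", "-v", device],  # -v so stderr explains a failure
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
        if result.returncode == 0:
            return
        last_err = result.stderr.strip()
        time.sleep(delay)
    raise RuntimeError("partx -a %s kept failing: %s" % (device, last_err))
```

A caller would invoke something like rescan_partitions("/dev/sda") before relying on the partition device nodes being present.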
I was also seeing this issue on ProLiant DL380 Gen10 machines, and it was reproducing pretty consistently with 4.10.0-rc.6.

@mcornea could you please try to deploy the latest 4.11 nightly on your setup? On our CI setup the deployment passed last night.

(In reply to Lubov from comment #7)
> @mcornea could you please try to deploy the latest 4.11 nightly on your setup?
> On our CI setup the deployment passed last night.

I can confirm the nodes were deployed as well on my environment with 4.11.0-0.nightly-2022-03-06-020555.

Encountered this issue today during an OCP-4.10 mgmt cluster installation:

time="2022-03-17T11:12:15-05:00" level=error msg="Error: cannot go from state 'deploy failed' to state 'manageable' , last error was 'Deploy step deploy.install_coreos failed on node 920b03ad-9311-492c-899d-dcf56b181d2c. Could not verify uefi on device /dev/sda, failed with Unexpected error while running command."
time="2022-03-17T11:12:15-05:00" level=error msg="Command: mount /dev/sda2 /tmp/tmpd66e7u6t/boot/efi"
time="2022-03-17T11:12:15-05:00" level=error msg="Exit code: 32"
time="2022-03-17T11:12:15-05:00" level=error msg="Stdout: ''"
time="2022-03-17T11:12:15-05:00" level=error msg="Stderr: 'mount: /tmp/tmpd66e7u6t/boot/efi: special device /dev/sda2 does not exist.\\n'.'

2022-03-17 16:11:09.089 1 ERROR ironic.conductor.utils [-] Deploy step deploy.install_coreos failed on node 920b03ad-9311-492c-899d-dcf56b181d2c. Could not verify uefi on device /dev/sda, failed with Unexpected error while running command.

DCI job: https://www.distributed-ci.io/jobs/4480dd4c-a4f8-41d3-9717-99a11393af61/jobStates

Verified on 4.11.0-0.nightly-2022-05-20-213928.

*** Bug 2061278 has been marked as a duplicate of this bug. ***

The cherry-pick was missing an import (it wasn't needed in origin/main): https://bugzilla.redhat.com/show_bug.cgi?id=2090631

*** Bug 2090631 has been marked as a duplicate of this bug. ***

Verified on 4.10.18; ran twice.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.10.20 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5172
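For reference, the "special device /dev/sda2 does not exist" failure quoted in the installation comment above is the same race seen from the mount side. A minimal, hypothetical guard (not the fix that shipped in the errata; the helper name and timeout values are illustrative only) would be to wait for the partition device node to appear before mounting it:

```python
import os
import subprocess
import time


def mount_when_ready(partition, mountpoint, timeout=30.0, poll=0.5):
    """Mount a partition once its device node exists.

    Hypothetical guard against the "special device /dev/sda2 does not
    exist" race quoted in the comments above; not the actual fix.
    """
    deadline = time.time() + timeout
    while not os.path.exists(partition):
        if time.time() >= deadline:
            raise RuntimeError("%s did not appear within %ss" % (partition, timeout))
        time.sleep(poll)
    # Equivalent to the failing call in the logs: mount /dev/sda2 <mountpoint>
    subprocess.run(["mount", partition, mountpoint], check=True)
```

For example, mount_when_ready("/dev/sda2", "/tmp/tmpd66e7u6t/boot/efi") in place of the bare mount call would tolerate the device node showing up a moment after the partition rescan.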