Created attachment 1860700 [details]
OpenShift install log from latest deployment

Description of problem:
While attempting to install OCP 4.10.0 RC 2022-02-04 with baremetal IPI, we have seen random failures in ironic_python_agent/extensions/image.py when mounting the boot partition (/dev/sda2) onto a temporary mount point, which prevents the installation from completing on all the master nodes.

Version-Release number of selected component (if applicable):
OpenShift Installer 4.10.0-rc.1
Built from commit 4fc9fa88c22221b6cede2456b1c33847943b75c9

How reproducible:
Frequently, but not all the time

Steps to Reproduce:
1. Install OCP using dci-openshift-agent (which uses IPI)
2. Sometimes a master node won't complete deployment

Actual results:
Excerpt of the install log:

time="2022-02-11T13:06:19-05:00" level=debug msg="ironic_deployment.openshift-master-deployment[1]: Creation complete after 1m1s [id=2ec51de8-ddbf-4c6e-94d6-d9da581ea8d3]"
time="2022-02-11T13:06:19-05:00" level=error
time="2022-02-11T13:06:19-05:00" level=error msg="Error: cannot go from state 'deploy failed' to state 'manageable' , last error was 'Deploy step deploy.install_coreos failed on node a7985248-1cf3-494a-9771-de48e4500a62. Could not verify uefi on device /dev/sda, failed with Unexpected error while running command."
time="2022-02-11T13:06:19-05:00" level=error msg="Command: mount /dev/sda2 /tmp/tmp0wftrg8k/boot/efi"
time="2022-02-11T13:06:19-05:00" level=error msg="Exit code: 32"
time="2022-02-11T13:06:19-05:00" level=error msg="Stdout: ''"
time="2022-02-11T13:06:19-05:00" level=error msg="Stderr: 'mount: /tmp/tmp0wftrg8k/boot/efi: special device /dev/sda2 does not exist.\\n'.'"
time="2022-02-11T13:06:19-05:00" level=error
time="2022-02-11T13:06:19-05:00" level=error msg="  on ../../tmp/openshift-install-masters-1714874582/main.tf line 43, in resource \"ironic_deployment\" \"openshift-master-deployment\":"
time="2022-02-11T13:06:19-05:00" level=error msg="  43: resource \"ironic_deployment\" \"openshift-master-deployment\" {"
time="2022-02-11T13:06:19-05:00" level=error
time="2022-02-11T13:06:19-05:00" level=error
time="2022-02-11T13:06:19-05:00" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"

Manual checks on the master-0 node:

$ df -PhT /boot
Filesystem Type     Size Used Avail Use% Mounted on
/dev/loop1 squashfs 883M 883M    0 100% /boot

$ ls -ltr /tmp
total 4
drwx------ 3 root root   60 Feb 11 18:00 systemd-private-b818b9aaddd64104ac98e75ebcf63b8d-chronyd.service-l0sYIf
-rw-r--r-- 1 root root 1708 Feb 11 18:05 ironic.ign

$ last
reboot   system boot  4.18.0-305.34.2. Fri Feb 11 18:00   still running
wtmp begins Fri Feb 11 18:00:44 2022

$ mkdir /tmp/test
$ sudo mount /dev/sda2 /tmp/test/
$ ls /tmp/test/
EFI
$ sudo umount /dev/sda2
$ sudo mount /dev/sda2 /non-existing-dir
mount: /non-existing-dir: mount point does not exist.
$ echo $?
32

$ sudo fdisk -l /dev/sda
GPT PMBR size mismatch (7884799 != 936640511) will be corrected by write.
The backup GPT table is not on the end of the device. This problem will be corrected by write.
Disk /dev/sda: 446.6 GiB, 479559942144 bytes, 936640512 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 262144 bytes
Disklabel type: gpt
Disk identifier: 00000000-0000-4000-A000-000000000001

Device       Start     End Sectors  Size Type
/dev/sda1     2048    4095    2048    1M BIOS boot
/dev/sda2     4096  264191  260096  127M EFI System
/dev/sda3   264192 1050623  786432  384M Linux filesystem
/dev/sda4  1050624 7884766 6834143  3.3G Linux filesystem

Expected results:
The installation should complete on all the master nodes.

Additional info:
Through dci-openshift-agent we upload logs and details of the installation; these are the last 3 times we've seen this issue:
- https://www.distributed-ci.io/jobs/6ae69cec-ee6f-4de0-a69e-2a86bed35c9c/files
- https://www.distributed-ci.io/jobs/e7305c1e-e0a7-4ee6-afac-9da05e89e297/files

These include openshift_install.log, but we can provide more logs if needed.
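As a side note, the "GPT PMBR size mismatch (7884799 != 936640511)" and backup-GPT warnings above are the expected result of writing a smaller raw image to a larger disk (the backup GPT header sits at the end of the image, not the device), so they are probably not the root cause. A quick sanity check of the arithmetic, using only the sector counts from the fdisk output above:

```python
SECTOR = 512  # logical sector size from the fdisk output

image_last_lba = 7884799   # from "GPT PMBR size mismatch (7884799 != 936640511)"
disk_sectors = 936640512   # total sectors fdisk reports for /dev/sda

image_bytes = (image_last_lba + 1) * SECTOR
disk_bytes = disk_sectors * SECTOR

# The written image spans only ~3.76 GiB of a ~446.6 GiB disk, which is
# exactly why fdisk says the backup GPT "is not on the end of the device".
print(f"image span: {image_bytes / 2**30:.2f} GiB")
print(f"disk size:  {disk_bytes / 2**30:.2f} GiB")
```

The computed disk size (479559942144 bytes) matches the fdisk header line, confirming the two numbers in the mismatch message are just image-end vs. disk-end LBAs.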
From a7985248-1cf3-494a-9771-de48e4500a62_master-0.cluster10.core.dfwt5g.lab_2022-02-11-18-06-10.tar.gz:

2022-02-11 18:06:04.093 1 DEBUG oslo_concurrency.processutils [-] CMD "partx -a /dev/sda" returned: 1 in 0.003s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:423
2022-02-11 18:06:04.094 1 DEBUG oslo_concurrency.processutils [-] 'partx -a /dev/sda' failed. Not Retrying. execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:474
2022-02-11 18:06:04.094 1 DEBUG ironic_lib.utils [-] Command stdout is: "" _log /usr/lib/python3.6/site-packages/ironic_lib/utils.py:99
2022-02-11 18:06:04.094 1 DEBUG ironic_lib.utils [-] Command stderr is: "partx: /dev/sda: error adding partitions 1-4
" _log /usr/lib/python3.6/site-packages/ironic_lib/utils.py:100

So we're unable to tell the kernel about new partitions. Unfortunately, we run partx without the -v flag, so it's hard to tell why.
*** Bug 2057668 has been marked as a duplicate of this bug. ***
Unmarking this as triaged. We don't know the root cause; the errors in comment 1 also happen on successful deployments. We can observe sda2 existing both before and after the failure.
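Since sda2 exists both before and after the failure, the symptom looks like a transient window where the device node is briefly missing (e.g. udev recreating /dev/sdaN after a partition-table rescan). A minimal diagnostic sketch of the idea, polling for the node to (re)appear; `wait_for_device` is purely illustrative and not part of ironic-python-agent:

```python
import os
import time

def wait_for_device(path, timeout=10.0, interval=0.5):
    """Poll until a device node exists.

    After a partition-table rescan, udev recreates /dev/sdaN
    asynchronously, so there can be a short window where the node is
    missing even though the partition itself is intact on disk.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(interval)
    return os.path.exists(path)
```

Instrumenting the failing step with something like this (or a `udevadm settle` call) before the mount would tell us whether the node is genuinely absent or just late.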
I think Derek is looking into adding retries.
We haven't been able to isolate/reproduce this to get to the root of the problem; in the meantime I've pushed a retry for the failing operation.
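The shape of that retry can be sketched as follows. This is a hedged illustration of the approach, not the actual patch; `run_with_retry` and its parameters are hypothetical:

```python
import subprocess
import time

def run_with_retry(cmd, attempts=5, delay=2.0):
    """Run a command, retrying on a non-zero exit.

    The idea: a transient race (device node appearing late) resolves on
    a later attempt, while a persistent failure still raises an error
    carrying the final exit code and stderr.
    """
    last = None
    for attempt in range(1, attempts + 1):
        last = subprocess.run(cmd, capture_output=True, text=True)
        if last.returncode == 0:
            return last
        if attempt < attempts:
            time.sleep(delay)
    raise RuntimeError(
        f"{cmd[0]} failed after {attempts} attempts "
        f"(exit {last.returncode}): {last.stderr.strip()}"
    )

# e.g. run_with_retry(["mount", "/dev/sda2", mountpoint]) for the failing step
```

The trade-off is that a retry masks the race rather than fixing it, which is why we're still trying to reproduce the underlying problem.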
I was also seeing this issue on ProLiant DL380 Gen10 machines, and it was reproducing pretty consistently with 4.10.0-rc.6.
@mcornea could you please try to deploy the latest 4.11 nightly on your setup?
On our CI setup the deployment passed last night.
(In reply to Lubov from comment #7)
> @mcornea could you please try to deploy the latest 4.11 nightly on your setup?
> On our CI setup the deployment passed last night.

I can confirm the nodes were deployed successfully on my environment as well with 4.11.0-0.nightly-2022-03-06-020555.
Encountered this issue today during OCP-4.10 mgmt cluster installation.

time="2022-03-17T11:12:15-05:00" level=error msg="Error: cannot go from state 'deploy failed' to state 'manageable' , last error was 'Deploy step deploy.install_coreos failed on node 920b03ad-9311-492c-899d-dcf56b181d2c. Could not verify uefi on device /dev/sda, failed with Unexpected error while running command."
time="2022-03-17T11:12:15-05:00" level=error msg="Command: mount /dev/sda2 /tmp/tmpd66e7u6t/boot/efi"
time="2022-03-17T11:12:15-05:00" level=error msg="Exit code: 32"
time="2022-03-17T11:12:15-05:00" level=error msg="Stdout: ''"
time="2022-03-17T11:12:15-05:00" level=error msg="Stderr: 'mount: /tmp/tmpd66e7u6t/boot/efi: special device /dev/sda2 does not exist.\\n'.'"

2022-03-17 16:11:09.089 1 ERROR ironic.conductor.utils [-] Deploy step deploy.install_coreos failed on node 920b03ad-9311-492c-899d-dcf56b181d2c. Could not verify uefi on device /dev/sda, failed with Unexpected error while running command.

DCI job: https://www.distributed-ci.io/jobs/4480dd4c-a4f8-41d3-9717-99a11393af61/jobStates
Verified on 4.11.0-0.nightly-2022-05-20-213928.
*** Bug 2061278 has been marked as a duplicate of this bug. ***
The cherry-pick was missing an import (it wasn't needed in origin/main): https://bugzilla.redhat.com/show_bug.cgi?id=2090631
*** Bug 2090631 has been marked as a duplicate of this bug. ***
Verified on 4.10.18, ran twice.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.10.20 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:5172