This bug was initially created as a copy of Bug #2061278.

I am copying this bug because: I need to first fix the problem in 4.11.

+++ This bug was initially created as a clone of Bug #2053752 +++

Description of problem:

While attempting to install OCP 4.10.0 RC 2022-02-04 with baremetal IPI, we have seen random failures in ironic_python_agent/extensions/image.py when it attempts to mount the boot partition (/dev/sda2) into a temporary directory. This prevents the installation from completing on all the master nodes.

Version-Release number of selected component (if applicable):

OpenShift Installer 4.10.0-rc.1
Built from commit 4fc9fa88c22221b6cede2456b1c33847943b75c9

How reproducible:

Frequently, but not all the time.

Steps to Reproduce:
1. Install OCP using dci-openshift-agent (which uses IPI).
2. Sometimes a master node won't complete.

Actual results:

Excerpt of the install log:

time="2022-02-11T13:06:19-05:00" level=debug msg="ironic_deployment.openshift-master-deployment[1]: Creation complete after 1m1s [id=2ec51de8-ddbf-4c6e-94d6-d9da581ea8d3]"
time="2022-02-11T13:06:19-05:00" level=error
time="2022-02-11T13:06:19-05:00" level=error msg="Error: cannot go from state 'deploy failed' to state 'manageable' , last error was 'Deploy step deploy.install_coreos failed on node a7985248-1cf3-494a-9771-de48e4500a62. Could not verify uefi on device /dev/sda, failed with Unexpected error while running command."
time="2022-02-11T13:06:19-05:00" level=error msg="Command: mount /dev/sda2 /tmp/tmp0wftrg8k/boot/efi"
time="2022-02-11T13:06:19-05:00" level=error msg="Exit code: 32"
time="2022-02-11T13:06:19-05:00" level=error msg="Stdout: ''"
time="2022-02-11T13:06:19-05:00" level=error msg="Stderr: 'mount: /tmp/tmp0wftrg8k/boot/efi: special device /dev/sda2 does not exist.\\n'.'"
time="2022-02-11T13:06:19-05:00" level=error
time="2022-02-11T13:06:19-05:00" level=error msg=" on ../../tmp/openshift-install-masters-1714874582/main.tf line 43, in resource \"ironic_deployment\" \"openshift-master-deployment\":"
time="2022-02-11T13:06:19-05:00" level=error msg=" 43: resource \"ironic_deployment\" \"openshift-master-deployment\" {"
time="2022-02-11T13:06:19-05:00" level=error
time="2022-02-11T13:06:19-05:00" level=error
time="2022-02-11T13:06:19-05:00" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change

master-0 node:

$ df -PhT /boot
Filesystem Type     Size Used Avail Use% Mounted on
/dev/loop1 squashfs 883M 883M    0 100% /boot

$ ls -ltr /tmp
total 4
drwx------ 3 root root   60 Feb 11 18:00 systemd-private-b818b9aaddd64104ac98e75ebcf63b8d-chronyd.service-l0sYIf
-rw-r--r-- 1 root root 1708 Feb 11 18:05 ironic.ign

$ last
reboot   system boot  4.18.0-305.34.2. Fri Feb 11 18:00   still running
wtmp begins Fri Feb 11 18:00:44 2022

$ mkdir /tmp/test
$ sudo mount /dev/sda2 /tmp/test/
$ ls /tmp/test/
EFI
$ sudo umount /dev/sda2

$ sudo mount /dev/sda2 /non-existing-dir
mount: /non-existing-dir: mount point does not exist.
$ echo $?
32

$ sudo fdisk -l /dev/sda
GPT PMBR size mismatch (7884799 != 936640511) will be corrected by write.
The backup GPT table is not on the end of the device. This problem will be corrected by write.
Disk /dev/sda: 446.6 GiB, 479559942144 bytes, 936640512 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 262144 bytes
Disklabel type: gpt
Disk identifier: 00000000-0000-4000-A000-000000000001

Device       Start     End Sectors  Size Type
/dev/sda1     2048    4095    2048    1M BIOS boot
/dev/sda2     4096  264191  260096  127M EFI System
/dev/sda3   264192 1050623  786432  384M Linux filesystem
/dev/sda4  1050624 7884766 6834143  3.3G Linux filesystem

Expected results:

The installation should complete on all the master nodes.

Additional info:

Through dci-openshift-agent we upload logs and details of the installation; these are the last 3 times we've seen this issue:
- https://www.distributed-ci.io/jobs/6ae69cec-ee6f-4de0-a69e-2a86bed35c9c/files
- https://www.distributed-ci.io/jobs/e7305c1e-e0a7-4ee6-afac-9da05e89e297/files

The uploads include openshift_install.log, but we can provide more logs.

--- Additional comment from Dmitry Tantsur on 2022-02-14 10:59:41 GMT ---

From a7985248-1cf3-494a-9771-de48e4500a62_master-0.cluster10.core.dfwt5g.lab_2022-02-11-18-06-10.tar.gz:

2022-02-11 18:06:04.093 1 DEBUG oslo_concurrency.processutils [-] CMD "partx -a /dev/sda" returned: 1 in 0.003s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:423
2022-02-11 18:06:04.094 1 DEBUG oslo_concurrency.processutils [-] 'partx -a /dev/sda' failed. Not Retrying. execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:474
2022-02-11 18:06:04.094 1 DEBUG ironic_lib.utils [-] Command stdout is: "" _log /usr/lib/python3.6/site-packages/ironic_lib/utils.py:99
2022-02-11 18:06:04.094 1 DEBUG ironic_lib.utils [-] Command stderr is: "partx: /dev/sda: error adding partitions 1-4" _log /usr/lib/python3.6/site-packages/ironic_lib/utils.py:100

Okay, we're unable to tell the kernel about new partitions. Unfortunately, we run without the -v flag, so it's hard to tell why.

--- Additional comment from Dmitry Tantsur on 2022-02-25 16:20:55 GMT ---

--- Additional comment from Dmitry Tantsur on 2022-02-25 16:22:02 GMT ---

Unmarking this as triaged. We don't know the root cause; the errors in comment 1 also happen on successful deployments. We can observe sda2 existing both before and after the failure.

--- Additional comment from Dmitry Tantsur on 2022-02-25 16:23:37 GMT ---

I think Derek is looking into adding retries.

--- Additional comment from Derek Higgins on 2022-03-01 16:40:36 GMT ---

We haven't been able to isolate/reproduce this to get to the root of the problem; in the meantime I've pushed a retry for the failing operation.

--- Additional comment from Marius Cornea on 2022-03-03 13:34:39 GMT ---

I was also seeing this issue on ProLiant DL380 Gen10 machines, and it was reproducing pretty consistently with 4.10.0-rc.6.

--- Additional comment from Lubov on 2022-03-07 08:59:54 GMT ---

@mcornea could you please try to deploy the latest 4.11 nightly on your setup? On our CI setup the deployment passed last night.
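To make the failure mode above concrete, here is a minimal Python sketch of the kind of operation that fails (this is not ironic-python-agent's actual code; the helper names, timeouts, and structure are illustrative only): re-read the partition table, wait for the ESP device node to show up, then mount it into a temporary directory to verify UEFI. When the node never appears, mount exits with code 32, matching "special device /dev/sda2 does not exist" in the install log.

import os
import subprocess
import tempfile
import time

def wait_for_partition(device, timeout=10.0, interval=0.5):
    """Poll until the partition device node appears, or give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(device):
            return True
        time.sleep(interval)
    return os.path.exists(device)

def verify_uefi(disk="/dev/sda", esp="/dev/sda2"):
    """Mount the EFI System Partition in a temp dir and look for an EFI/ tree."""
    # Ask the kernel to pick up the partition table; this is the step that
    # logged "error adding partitions 1-4" in the failing runs.
    subprocess.run(["partx", "-a", disk], check=False)

    if not wait_for_partition(esp):
        raise RuntimeError("%s never appeared" % esp)

    with tempfile.TemporaryDirectory() as tmp:
        mountpoint = os.path.join(tmp, "boot", "efi")
        os.makedirs(mountpoint)
        # Without the wait above, this mount can fail with exit code 32.
        subprocess.run(["mount", esp, mountpoint], check=True)
        try:
            return os.path.isdir(os.path.join(mountpoint, "EFI"))
        finally:
            subprocess.run(["umount", mountpoint], check=False)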
Testing in the CI environment where this bug reproduces shows that the current number of retries isn't adequate; bumping the number of retries.
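For context, a minimal sketch of what "bumping the number of retries" amounts to, assuming a simple retry loop around the mount; the attempt count, delay, and names below are illustrative, not the actual patch:

import subprocess
import time

# Illustrative values only; the real fix chooses its own count and delay.
MOUNT_ATTEMPTS = 10
RETRY_DELAY = 2.0

def mount_with_retries(device, mountpoint):
    """Retry the mount, since the partition device node can appear late."""
    for attempt in range(1, MOUNT_ATTEMPTS + 1):
        result = subprocess.run(["mount", device, mountpoint])
        if result.returncode == 0:
            return
        time.sleep(RETRY_DELAY)
    raise RuntimeError("mount of %s failed after %d attempts"
                       % (device, MOUNT_ATTEMPTS))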
Verified on 4.11.0-0.nightly-2022-05-20-213928.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069