In some deployments we're seeing worker nodes fail to provision and then restart. In the ironic logs, the most relevant error appears right after the image is written to disk:

2021-05-17 06:09:53.078 1 ERROR ironic.conductor.utils [-] Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node 33c5edf0-569e-4895-9105-137bb676b177 : Error performing deploy_step write_image: Command execution failed: Unable to find a valid partition table on the disk after writing the image. Error Unexpected error while running command.
Command: parted -s -m /dev/sda unit MiB print
Exit code: 1
Stdout: 'BYT;\n/dev/sda:915715MiB:scsi:512:4096:unknown:ATA INTEL SSDSC2KB96:;\n'
Stderr: 'Error: /dev/sda: unrecognised disk label\n'.
Created attachment 1785113: IPA log
It looks like this is caused by the image-cache container trying to download the image through a proxy. The short service name metal3-state.openshift-machine-api matches none of the NO_PROXY entries (it does not end in .svc or .cluster.local), so the request is sent to the proxy, which cannot resolve the name.

The downloaded file rhcos-48.84.202104271417-0-openstack.x86_64.qcow2 contains a proxy error message instead of a qcow image:

<p>The following error was encountered while trying to retrieve the URL: <a href="http://metal3-state.openshift-machine-api:6180/images/rhcos-48.84.202104271417-0-openstack.x86_64.qcow2/rhcos-48.84.202104271417-0-openstack.x86_64.qcow2">http://metal3-state.openshift-machine-api:6180/images/rhcos-48.84.202104271417-0-openstack.x86_64.qcow2/rhcos-48.84.202104271417-0-openstack.x86_64.qcow2</a></p>
<pre>Name Error: The domain name does not exist.</pre>

From the squid logs:

1621432495.708 29 fd00:1101::6ef0:c42d:33f4:c2f TCP_MISS/503 4587 GET http://metal3-state.openshift-machine-api:6180/images/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2 - HIER_NONE/- text/html

And the environment of the metal3-machine-os-downloader container in the image-cache pod:

    env:
    - name: RHCOS_IMAGE_URL
      value: http://metal3-state.openshift-machine-api:6180/images/rhcos-48.84.202104271417-0-openstack.x86_64.qcow2/rhcos-48.84.202104271417-0-openstack.x86_64.qcow2
    - name: HTTP_PROXY
      value: http://[fd00:1101::1]:3128
    - name: HTTPS_PROXY
      value: http://[fd00:1101::1]:3128
    - name: NO_PROXY
      value: .cluster.local,.svc,127.0.0.1,9999,api-int.ostest.test.metalkube.org,fd00:1101::/64,fd01::/48,fd02::/112,fd2e:6f44:5dd8:c956::/120,localhost
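For anyone who wants to confirm that a cached image is really a qcow2 file and not a saved proxy error page, a minimal sketch follows. It relies only on the qcow2 magic bytes ("QFI" followed by 0xfb) at the start of the file; the function names are hypothetical and not part of any metal3 or ironic tooling.

```python
# Hypothetical helper names; only the qcow2 magic value itself is a known fact.
QCOW2_MAGIC = b"QFI\xfb"  # first four bytes of every qcow2 image


def is_qcow2_header(data: bytes) -> bool:
    """Return True if the given bytes start with the qcow2 magic."""
    return data[:4] == QCOW2_MAGIC


def looks_like_qcow2(path: str) -> bool:
    """Read the first four bytes of a file and compare them to the magic."""
    with open(path, "rb") as f:
        return is_qcow2_header(f.read(4))
```

Calling looks_like_qcow2() on the cached file would return False for the HTML error page quoted above, since that file starts with markup and not with the magic bytes.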
To work around this, add "metal3-state.openshift-machine-api" to the noProxy setting in install-config.yaml.
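To illustrate why the workaround helps, here is a rough sketch of NO_PROXY matching applied to the values from the pod environment. It assumes simple domain-suffix matching for hostnames and ignores CIDR entries; real HTTP clients differ in the details, so treat this as an approximation and not the logic of any particular downloader. The function name is hypothetical.

```python
def host_bypasses_proxy(host: str, no_proxy: str) -> bool:
    """Simplified suffix match (an assumption; real clients vary)."""
    host = host.lower().rstrip(".")
    for entry in no_proxy.split(","):
        entry = entry.strip().lower()
        if not entry or "/" in entry:
            # Skip blanks and CIDR ranges; this sketch only handles hostnames.
            continue
        suffix = entry.lstrip(".")
        if host == suffix or host.endswith("." + suffix):
            return True
    return False


# Values copied from the container environment shown in this report.
NO_PROXY = (
    ".cluster.local,.svc,127.0.0.1,9999,api-int.ostest.test.metalkube.org,"
    "fd00:1101::/64,fd01::/48,fd02::/112,fd2e:6f44:5dd8:c956::/120,localhost"
)
HOST = "metal3-state.openshift-machine-api"

print(host_bypasses_proxy(HOST, NO_PROXY))               # False: goes via the proxy
print(host_bypasses_proxy(HOST, NO_PROXY + "," + HOST))  # True: with the workaround
```

Under these assumptions the short service name is not covered by the existing list, and adding it explicitly makes the download bypass the proxy.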
Verified with 4.8.0-0.nightly-2021-06-02-025513:

[kni@provisionhost-0-0 ~]$ oc get pod/metal3-image-cache-c4pft -o yaml | grep qcow2
      value: http://metal3-state.openshift-machine-api.svc.cluster.local:6180/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438