We are deploying OCP BM IPI clusters using our provisioning host to mirror RHCOS images. To do so, we define the `clusterOSImage` field of the `install-config.yaml` to point to our provisioning host using its IP on the provisioning network:

  kind: InstallConfig
  apiVersion: v1
  [...]
  platform:
    baremetal:
      clusterOSImage: http://172.22.0.1/rhcos/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz?sha256=37a156f9f2b0efded45cb3cd5688aa2d42c26873a534951484e96f546a6b2c84
  [...]

Unfortunately, the initContainer `metal3-machine-os-downloader` of the metal3 pod now fails to download the image:

  curl -g --compressed -L --connect-timeout 120 -o rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz http://172.22.0.1/rhcos/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz
  curl: (28) Connection timed out after 120000 milliseconds

This is because the control plane node no longer has an IP on the provisioning network at that point, and the initContainer `metal3-machine-os-downloader` is started before the initContainer `metal-static-ip-set`. This behaviour started between OCP 4.8.0-fc.7 (good) and OCP 4.8.0-rc.0 (bad). It was most likely introduced by https://github.com/openshift/installer/pull/4900.
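For context, init containers run strictly in order, so the ordering described above means the download is attempted before the node has any address on the provisioning network. A simplified sketch of the relevant part of the metal3 pod spec in the affected releases (container names are from this report; all other fields are omitted and the actual manifest may differ):

```yaml
# Hedged sketch of the metal3 pod's initContainers order in the affected
# releases; only the names are taken from this report.
initContainers:
  - name: metal3-machine-os-downloader   # runs first: curls clusterOSImage
  - name: metal-static-ip-set            # runs second: only now is the
                                         # provisioning-network IP configured
```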
Given that we use the "Recreate" deployment strategy, I don't see any reason that we couldn't acquire the VIP before doing the OS download.
(In reply to Zane Bitter from comment #2)
> Given that we use the "Recreate" deployment strategy, I don't see any reason
> that we couldn't acquire the VIP before doing the OS download.

I agree; the fix is probably to reorder the initContainers so that static-ip-set runs prior to the image download. That was previously discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1847142#c2

As I mention there, just switching the order may not be enough, because we set the connection lifetime to 300s in the initContainer: https://github.com/openshift/ironic-static-ip-manager/blob/master/set-static-ip#L37

The expectation is that refresh-static-ip later refreshes that, but if the RHCOS download takes more than 300s, the connection could be interrupted.

That said, given that the default is to download from an external URL via the control plane network, switching the order is probably reasonable: in the cases where clusterOSImage points at the provisioning network, it is very likely referencing a locally cached image, so the download shouldn't take more than 300s.
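The reordering discussed above amounts to swapping the two initContainers so the provisioning IP is in place before the download starts. A hedged sketch (names only; other fields of the actual cluster-baremetal-operator manifest are omitted):

```yaml
# Sketch of the proposed fix: metal-static-ip-set runs first, so the node
# holds an IP on the provisioning network before the image download begins.
# Caveat from the comment above: the address lifetime is 300s, so the
# download must finish before refresh-static-ip takes over the renewal.
initContainers:
  - name: metal-static-ip-set            # acquire the provisioning IP first
  - name: metal3-machine-os-downloader   # then fetch clusterOSImage
```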
https://github.com/openshift/cluster-baremetal-operator/pull/169
Verified on 4.9.0-0.nightly-2021-07-04-140102. From the metal3-machine-os-downloader container log:

  + curl -g --compressed -L --connect-timeout 120 -o rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz http://172.22.0.1/rhcos/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
  100  963M  100  963M    0     0   910M      0  0:00:01  0:00:01 --:--:--  910M
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759