We are deploying OCP BM IPI clusters using our provisioning host to mirror RHCOS images.
To do so, we set the `clusterOSImage` field in `install-config.yaml` to point to our provisioning host, using its IP on the provisioning network:
> kind: InstallConfig
> apiVersion: v1
> clusterOSImage: http://172.22.0.1/rhcos/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz?sha256=37a156f9f2b0efded45cb3cd5688aa2d42c26873a534951484e96f546a6b2c84
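For readers unfamiliar with the format: the `clusterOSImage` value above carries both the download location and the expected checksum, with the checksum passed as a `sha256` query parameter. The following is only an illustrative sketch of how such a value splits apart; the helper name `split_cluster_os_image` is hypothetical and this is not the installer's actual parsing code.

```python
from urllib.parse import urlsplit, parse_qs


def split_cluster_os_image(value):
    """Split a clusterOSImage-style value into (download_url, sha256).

    Hypothetical helper for illustration only; sha256 is None when the
    value has no sha256 query parameter.
    """
    parts = urlsplit(value)
    query = parse_qs(parts.query)
    sha256 = query.get("sha256", [None])[0]
    # Drop the query string to get the plain download URL.
    download_url = parts._replace(query="").geturl()
    return download_url, sha256


if __name__ == "__main__":
    value = (
        "http://172.22.0.1/rhcos/images/"
        "rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz"
        "?sha256=37a156f9f2b0efded45cb3cd5688aa2d42c26873a534951484e96f546a6b2c84"
    )
    url, digest = split_cluster_os_image(value)
    print(url)
    print(digest)
```

The URL part is what the downloader ends up requesting, which is why it needs a route to 172.22.0.1 on the provisioning network.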
Unfortunately, the InitContainer `metal3-machine-os-downloader` of the metal3 Pod is now failing to download the image:
> curl -g --compressed -L --connect-timeout 120 -o rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz http://172.22.0.1/rhcos/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz
> curl: (28) Connection timed out after 120000 milliseconds
This happens because the control plane node no longer has an IP on the provisioning network, and the InitContainer `metal3-machine-os-downloader` starts before the InitContainer `metal-static-ip-set`.
This behaviour started between OCP 4.8.0-fc.7 (good) and OCP 4.8.0-rc.0 (bad). It was most likely introduced by https://github.com/openshift/installer/pull/4900.
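Since Kubernetes runs init containers sequentially in the order they are listed, the ordering problem can be checked mechanically. The sketch below is a minimal illustration; the function name `runs_before` and the two-element container lists are assumptions made for the example, not the actual metal3 Pod spec, which contains other containers as well.

```python
def runs_before(init_containers, first, second):
    """Return True if `first` is listed before `second`.

    `init_containers` is a list of dicts with a "name" key, shaped like
    the `spec.initContainers` list of a Pod. Raises ValueError if either
    name is missing.
    """
    names = [c["name"] for c in init_containers]
    return names.index(first) < names.index(second)


if __name__ == "__main__":
    # Illustrative orderings only, not copied from a real cluster.
    failing_order = [
        {"name": "metal3-machine-os-downloader"},
        {"name": "metal-static-ip-set"},
    ]
    working_order = list(reversed(failing_order))

    for spec in (failing_order, working_order):
        ok = runs_before(spec, "metal-static-ip-set", "metal3-machine-os-downloader")
        print([c["name"] for c in spec], "static IP set first:", ok)
```

On a live cluster, the same check could be applied to the `spec.initContainers` list taken from the Pod's JSON output.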
Given that we use the "Recreate" deployment strategy, I don't see any reason that we couldn't acquire the VIP before doing the OS download.
(In reply to Zane Bitter from comment #2)
> Given that we use the "Recreate" deployment strategy, I don't see any reason
> that we couldn't acquire the VIP before doing the OS download.
I agree; the fix is probably to reorder the initContainers so that static-ip-set runs before the image download.
That was previously discussed on https://bugzilla.redhat.com/show_bug.cgi?id=1847142#c2
As I mentioned there, just switching the order may not be enough, because we set the connection lifetime to 300s in the initContainer.
The expectation is that refresh-static-ip later refreshes it, but if the RHCOS download takes more than 300s, the connection could be interrupted.
That said, given that the default is to download from an external URL via the controlplane network, switching the order is probably reasonable - in the cases where this is set to the provisioning network it's very likely to be referencing a locally cached image, thus the download shouldn't take more than 300s.
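To make the 300s concern concrete, here is a rough back-of-the-envelope sketch. It assumes a constant transfer rate, and the function name `download_fits_lifetime`, the image size, and the rates are all hypothetical values chosen for illustration rather than measurements from this bug.

```python
def download_fits_lifetime(size_bytes, rate_bytes_per_s, lifetime_s=300):
    """Return True if a download at a constant rate finishes before the
    address lifetime expires. Assumes no retries or stalls."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / rate_bytes_per_s < lifetime_s


if __name__ == "__main__":
    MIB = 1024 ** 2
    image_size = 1024 * MIB  # hypothetical image of about 1 GiB

    # Locally cached image over a fast provisioning network.
    print(download_fits_lifetime(image_size, 100 * MIB))
    # Slow link: about 512 seconds, which exceeds the 300s lifetime.
    print(download_fits_lifetime(image_size, 2 * MIB))
```

Under these assumptions, a locally cached image finishes well inside the lifetime, while a slow link would not, which matches the reasoning for why reordering is likely sufficient in the provisioning-network case.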
Verified on 4.9.0-0.nightly-2021-07-04-140102.
From the metal3-machine-os-downloader container log:
+ curl -g --compressed -L --connect-timeout 120 -o rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz http://172.22.0.1/rhcos/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  963M  100  963M    0     0   910M      0  0:00:01  0:00:01 --:--:--  910M
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.