Bug 1973724

Summary: metal3 Pod cannot download RHCOS images using the provisioning network anymore
Product: OpenShift Container Platform
Reporter: Denis Ollier <dollierp>
Component: Bare Metal Hardware Provisioning
Assignee: Angus Salkeld <asalkeld>
Bare Metal Hardware Provisioning sub component: cluster-baremetal-operator
QA Contact: Lubov <lshilin>
Status: CLOSED ERRATA
Docs Contact: Padraig O'Grady <pogrady>
Severity: medium
Priority: medium
CC: aos-bugs, asalkeld, beth.white, fdeutsch, pogrady, rbartal, zbitter
Version: 4.8
Keywords: Triaged
Target Milestone: ---
Target Release: 4.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A change was made to stop provisioning services once the control plane was deployed.
Consequence: This caused the InitContainer `metal3-machine-os-downloader` of the metal3 Pod to fail to download the image.
Fix: The order of the InitContainers has been changed so that static-ip-set runs before the image download.
Result: The image download now happens as expected.
Last Closed: 2021-10-18 17:35:50 UTC
Type: Bug

Description Denis Ollier 2021-06-18 14:59:11 UTC
We are deploying OCP BM IPI clusters using our provisioning host to mirror RHCOS images.

To do so, we define the `clusterOSImage` field of the `install-config.yaml` to point to our provisioning host using its IP on the provisioning network:

> kind: InstallConfig
> apiVersion: v1
> [...]
> platform:
>   baremetal:
>     clusterOSImage: http://172.22.0.1/rhcos/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz?sha256=37a156f9f2b0efded45cb3cd5688aa2d42c26873a534951484e96f546a6b2c84
> [...]
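
For illustration, the value above can be split into the URL that is actually fetched and the expected checksum. This is a minimal sketch, assuming the `sha256` query parameter carries the checksum and is not part of the download path (the curl command in this report fetches the URL without it). The function name `split_cluster_os_image` is hypothetical and is not the downloader's actual code.

```python
from urllib.parse import parse_qs, urlsplit


def split_cluster_os_image(value):
    """Split a clusterOSImage value into (download_url, sha256_or_None).

    Hypothetical helper for illustration only; it assumes the checksum is
    passed as a `sha256` query parameter, as in the example above.
    """
    parts = urlsplit(value)
    # parse_qs returns a dict of lists; take the first sha256 value if present.
    checksum = parse_qs(parts.query).get("sha256", [None])[0]
    # Drop the query string to obtain the plain download URL.
    download_url = parts._replace(query="").geturl()
    return download_url, checksum


if __name__ == "__main__":
    url, checksum = split_cluster_os_image(
        "http://172.22.0.1/rhcos/images/"
        "rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz"
        "?sha256=37a156f9f2b0efded45cb3cd5688aa2d42c26873a534951484e96f546a6b2c84"
    )
    print(url)
    print(checksum)
```

The host part of the resulting URL (172.22.0.1 here) is on the provisioning network, which is why the download depends on the node having an address on that network.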

Unfortunately, the InitContainer `metal3-machine-os-downloader` of the metal3 Pod is now failing to download the image:

> curl -g --compressed -L --connect-timeout 120 -o rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz http://172.22.0.1/rhcos/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz
> curl: (28) Connection timed out after 120000 milliseconds

This happens because the control plane node no longer has an IP on the provisioning network, and the InitContainer `metal3-machine-os-downloader` starts before the InitContainer `metal-static-ip-set`.

This behaviour started between OCP 4.8.0-fc.7 (good) and OCP 4.8.0-rc.0 (bad). It was most likely introduced by https://github.com/openshift/installer/pull/4900.

Comment 2 Zane Bitter 2021-06-18 16:17:11 UTC
Given that we use the "Recreate" deployment strategy, I don't see any reason that we couldn't acquire the VIP before doing the OS download.

Comment 6 Steven Hardy 2021-06-22 12:23:07 UTC
(In reply to Zane Bitter from comment #2)
> Given that we use the "Recreate" deployment strategy, I don't see any reason
> that we couldn't acquire the VIP before doing the OS download.

I agree, the fix is probably to reorder the initContainers so that static-ip-set happens prior to the image download.

That was previously discussed on https://bugzilla.redhat.com/show_bug.cgi?id=1847142#c2

As I mention there, just switching the order may not be enough, because we set the connection lifetime to 300s in the initContainer:

https://github.com/openshift/ironic-static-ip-manager/blob/master/set-static-ip#L37

The expectation is that refresh-static-ip later renews that lifetime, but if the RHCOS download takes more than 300s, the connection could be interrupted.
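
As a rough way to reason about that risk, the download time can be estimated from the image size and the available throughput and compared against the 300s lifetime. This is only a back-of-the-envelope sketch: the function names, the image size, and the throughput values are hypothetical and not taken from any real deployment.

```python
# Assumption: 300 seconds is the address lifetime quoted in this comment.
ADDRESS_LIFETIME_S = 300


def estimated_download_seconds(size_bytes, bytes_per_second):
    """Estimate the transfer time, ignoring connection setup and retries."""
    if bytes_per_second <= 0:
        raise ValueError("bytes_per_second must be positive")
    return size_bytes / bytes_per_second


def fits_in_lifetime(size_bytes, bytes_per_second, lifetime_s=ADDRESS_LIFETIME_S):
    """Return True if the estimated download finishes within the lifetime."""
    return estimated_download_seconds(size_bytes, bytes_per_second) <= lifetime_s


if __name__ == "__main__":
    image_size = 1024 ** 3  # hypothetical 1 GiB image
    for rate_mib in (1, 10, 100):  # hypothetical throughputs in MiB/s
        rate = rate_mib * 1024 ** 2
        seconds = estimated_download_seconds(image_size, rate)
        print(rate_mib, "MiB/s:", round(seconds, 1), "s, fits:",
              fits_in_lifetime(image_size, rate))
```

With these made-up numbers, only the slowest link exceeds the lifetime, which matches the intuition that a locally cached image on a fast network is unlikely to be affected.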

That said, given that the default is to download from an external URL via the control plane network, switching the order is probably reasonable. In the cases where this is set to the provisioning network, it's very likely to be referencing a locally cached image, so the download shouldn't take more than 300s.
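
The reordering discussed here can be sketched on a plain list of init container names. This is a minimal illustration, not the operator's actual code: the helper `move_before` is hypothetical, and only the two container names mentioned in this bug are used.

```python
def move_before(names, first, second):
    """Return a copy of `names` in which `first` comes before `second`.

    Hypothetical helper: the list is left unchanged if either name is
    missing or if the order is already correct.
    """
    ordered = list(names)
    if first in ordered and second in ordered:
        if ordered.index(first) > ordered.index(second):
            ordered.remove(first)
            # Re-insert `first` directly in front of `second`.
            ordered.insert(ordered.index(second), first)
    return ordered


if __name__ == "__main__":
    # Order described in this bug: the downloader starts first.
    before = ["metal3-machine-os-downloader", "metal-static-ip-set"]
    after = move_before(before, "metal-static-ip-set", "metal3-machine-os-downloader")
    print(after)
```

Because init containers in a Pod run sequentially in the order they are listed, putting the static IP container first means the provisioning address is configured before the download starts.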

Comment 9 Lubov 2021-07-06 09:04:59 UTC
Verified on 4.9.0-0.nightly-2021-07-04-140102.

From the metal3-machine-os-downloader container log:

+ curl -g --compressed -L --connect-timeout 120 -o rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz http://172.22.0.1/rhcos/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  963M  100  963M    0     0   910M      0  0:00:01  0:00:01 --:--:--  910M

Comment 14 errata-xmlrpc 2021-10-18 17:35:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759