1973724 – metal3 Pod cannot download RHCOS images using the provisioning network anymore

Summary: metal3 Pod cannot download RHCOS images using the provisioning network anymore

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Bare Metal Hardware Provisioning
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.9.0
Assignee:	Angus Salkeld
QA Contact:	Lubov
Docs Contact:	Padraig O'Grady
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-06-18 14:59 UTC by Denis Ollier
Modified:	2021-10-18 17:36 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: A change was made to stop provisioning services once the control plane was deployed. Consequence: This caused the InitContainer `metal3-machine-os-downloader` of the metal3 Pod to fail to download the image. Fix: The order of the InitContainers has been changed so that static-ip-set runs before the image download. Result: The image download happens as expected.
Clone Of:
Environment:
Last Closed:	2021-10-18 17:35:50 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-baremetal-operator pull 169	0	None	open	Bug 1973724: reorder the initContainers, so that static-ip-set happens prior to the image download	2021-06-29 01:02:08 UTC
Red Hat Product Errata	RHSA-2021:3759	0	None	None	None	2021-10-18 17:36:09 UTC

Description Denis Ollier 2021-06-18 14:59:11 UTC

We are deploying OCP BM IPI clusters using our provisioning host to mirror RHCOS images.

To do so, we define the `clusterOSImage` field of the `install-config.yaml` to point to our provisioning host using its IP on the provisioning network:

> kind: InstallConfig
> apiVersion: v1
> [...]
> platform:
>   baremetal:
>     clusterOSImage: http://172.22.0.1/rhcos/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz?sha256=37a156f9f2b0efded45cb3cd5688aa2d42c26873a534951484e96f546a6b2c84
> [...]
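
Whether a given `clusterOSImage` URL depends on the provisioning network can be checked from its host part alone. A minimal sketch, assuming the provisioning network is 172.22.0.0/24 (that CIDR is an assumption for illustration; only the host address 172.22.0.1 appears in this report) and using a hypothetical helper name:

```python
import ipaddress
from urllib.parse import urlsplit

# Assumed provisioning network CIDR; replace with the cluster's actual value.
PROVISIONING_CIDR = ipaddress.ip_network("172.22.0.0/24")


def image_on_provisioning_network(url, cidr=PROVISIONING_CIDR):
    """Return True if the URL's host is a literal IP address inside cidr."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        return ipaddress.ip_address(host) in cidr
    except ValueError:
        # The host is a DNS name, not a literal IP; this sketch does not resolve it.
        return False
```

For the URL above the helper returns True, which is the case where the download can only succeed once the node holds an address on that network.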

Unfortunately, the InitContainer `metal3-machine-os-downloader` of the metal3 Pod is now failing to download the image:

> curl -g --compressed -L --connect-timeout 120 -o rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz http://172.22.0.1/rhcos/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz
> curl: (28) Connection timed out after 120000 milliseconds

This is because the control plane node no longer has an IP address on the provisioning network, and the InitContainer `metal3-machine-os-downloader` is started before the InitContainer `metal-static-ip-set`.
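
Kubernetes runs init containers one at a time, in the order they are listed in the Pod spec, so the ordering constraint can be expressed on the list of container names. The following is only an illustrative sketch with a hypothetical helper; it is not the operator's code, and the container names are the ones quoted in this report:

```python
def order_init_containers(names,
                          first="metal-static-ip-set",
                          later="metal3-machine-os-downloader"):
    """Return a copy of names in which `first` comes before `later`.

    All other entries keep their relative order. If either name is
    missing, the list is returned unchanged.
    """
    ordered = list(names)
    if first in ordered and later in ordered:
        i_first = ordered.index(first)
        i_later = ordered.index(later)
        if i_first > i_later:
            # Move `first` to the slot currently held by `later`,
            # which pushes `later` one position back.
            ordered.insert(i_later, ordered.pop(i_first))
    return ordered
```

With the failing order (downloader first, static IP second) the helper swaps the two entries; an already correct list is returned as is.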

This behaviour started between OCP 4.8.0-fc.7 (good) and OCP 4.8.0-rc.0 (bad). It was most likely introduced by https://github.com/openshift/installer/pull/4900.

Comment 2 Zane Bitter 2021-06-18 16:17:11 UTC

Given that we use the "Recreate" deployment strategy, I don't see any reason that we couldn't acquire the VIP before doing the OS download.

Comment 6 Steven Hardy 2021-06-22 12:23:07 UTC

(In reply to Zane Bitter from comment #2)
> Given that we use the "Recreate" deployment strategy, I don't see any reason
> that we couldn't acquire the VIP before doing the OS download.

I agree, the fix is probably to reorder the initContainers so that static-ip-set happens prior to the image download.

That was previously discussed on https://bugzilla.redhat.com/show_bug.cgi?id=1847142#c2

As I mention there, just switching the order may not be enough, because we set the connection lifetime to 300s in the initContainer:

https://github.com/openshift/ironic-static-ip-manager/blob/master/set-static-ip#L37

The expectation is that the refresh-static-ip later refreshes that, but if the RHCOS download takes more than 300s it's possible the connection could be interrupted.

That said, given that the default is to download from an external URL via the controlplane network, switching the order is probably reasonable - in the cases where this is set to the provisioning network it's very likely to be referencing a locally cached image, thus the download shouldn't take more than 300s.
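
The 300s concern can be made concrete with simple arithmetic. A rough sketch, assuming a constant transfer rate and ignoring connection setup; the function names and the example image size are hypothetical:

```python
ADDRESS_LIFETIME_S = 300  # lifetime set by the static-ip-set initContainer, per the comment above


def download_fits_lifetime(size_bytes, rate_bytes_per_s, lifetime_s=ADDRESS_LIFETIME_S):
    """Return True if a transfer at a constant rate finishes before the lifetime expires."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / rate_bytes_per_s < lifetime_s


def min_rate_for_lifetime(size_bytes, lifetime_s=ADDRESS_LIFETIME_S):
    """Return the rate in bytes/s at which the transfer takes exactly the lifetime."""
    return size_bytes / lifetime_s
```

For an image of about 1 GiB, the download has to sustain more than roughly 3.4 MiB/s (1024 MiB / 300 s) to finish before the address lifetime runs out, which a locally cached image on the provisioning network should comfortably exceed.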

Comment 7 Angus Salkeld 2021-06-29 01:00:17 UTC

https://github.com/openshift/cluster-baremetal-operator/pull/169

Comment 9 Lubov 2021-07-06 09:04:59 UTC

Verified on 4.9.0-0.nightly-2021-07-04-140102.

From the metal3-machine-os-downloader container log:

+ curl -g --compressed -L --connect-timeout 120 -o rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz http://172.22.0.1/rhcos/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  963M  100  963M    0     0   910M      0  0:00:01  0:00:01 --:--:--  910M

Comment 14 errata-xmlrpc 2021-10-18 17:35:50 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
