Customer Contact Name: Yasuhiro Futakawa

Description of Problem:
This problem was introduced by the recent merge of the Day1 networking feature into OCP.

Summary of Day1 networking:
https://github.com/openshift/enhancements/blob/master/enhancements/network/baremetal-ipi-network-configuration.md#baremetal-ipi-network-configuration

Change related to this issue:
https://github.com/openshift/enhancements/blob/master/enhancements/baremetal/coreos-image-in-release.md#include-the-coreos-image-in-the-release-for-baremetal

In previous OCP releases, bare metal servers were PXE-booted with a kernel and initrd. The Day1 networking feature changed this: the server now boots a CoreOS image, and the IPA (Ironic Python Agent) runs on it as a container under podman. The proxy settings written in install-config.yaml were not propagated to podman, so in environments behind a proxy the IPA image pull failed and inspection failed with it. This problem does not occur in previous OCP releases, because CoreOS (and therefore podman) was not used at that stage of deployment.

Version-Release number of selected component:
This issue was detected in a pre-GA version.
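To illustrate the failure mode: podman only honors proxy settings that are present in the environment of the process that starts the container, so the service launching the IPA container on the CoreOS ramdisk needs them exported explicitly. The following is a hypothetical systemd drop-in sketching that mechanism; the unit name, file path, and proxy values are assumptions for illustration, not the actual fix from the PRs below.

```ini
# Hypothetical drop-in: /etc/systemd/system/ironic-agent.service.d/10-proxy.conf
# (unit name and values are illustrative, not taken from the real fix)
[Service]
# podman inherits these variables and uses them when pulling the IPA image
Environment="HTTP_PROXY=http://proxy.example.com:3128"
Environment="HTTPS_PROXY=http://proxy.example.com:3128"
Environment="NO_PROXY=localhost,127.0.0.1"
```

Without such environment propagation, the `podman pull` of the IPA image goes direct to the registry and times out behind a proxy-only network.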
Red Hat OpenShift Container Platform Version Number: 4.10
Release Number: 4.10.0-0.nightly-2021-12-20-231053
Kubernetes Version: 1.22.1
CRI-O Version: 1.23.0
Related Component: NONE
Related Middleware/Application: iRMC
Underlying RHCOS Release Number: 4.10
Underlying RHCOS Architecture: x86_64
Underlying RHCOS Kernel Version: 4.18.0
Drivers or hardware or architecture dependency: None

How reproducible: Every time

Steps to Reproduce:
$ openshift-install --dir ~/clusterconfigs create manifests
$ openshift-install --dir ~/clusterconfigs --log-level debug create cluster

Actual Results: IPA image pull failed
Expected Results: IPA image can be pulled successfully

Summary of actions taken to resolve issue:
Fujitsu opened an issue: https://github.com/openshift/installer/issues/5552
Fujitsu sent PRs:
https://github.com/openshift/image-customization-controller/pull/33
https://github.com/openshift/installer/pull/5569
https://github.com/openshift/cluster-baremetal-operator/pull/240

Location of diagnostic data: None
Hardware configuration: Model: RX2540 M4
Upon discussion with the Metal Platform team, we decided this qualifies as a blocker because it is a regression for use cases that require a proxy.
Our colleagues from Fujitsu who originally identified this issue have proposed fixes which are currently under review.
In addition to the PRs aimed at resolving the proxy issue, the Metal Team is working on a validation/CI job to ensure the proposed fixes work as expected (tracked in https://github.com/openshift-metal3/dev-scripts/pull/1341).
The team has made good progress with this BZ. Current status of the fixes:
https://github.com/openshift/image-customization-controller/pull/33 MERGED
https://github.com/openshift/cluster-baremetal-operator/pull/240 MERGED
https://github.com/openshift/installer/pull/5569 OPEN

PR 5569 has passed review and has not merged only because of perma-failing tests; it is now waiting for a Staff Engineer to review and override CI so it can merge. PR 1341 (https://github.com/openshift-metal3/dev-scripts/pull/1341), which adds test coverage, is still WIP, but it is not part of the fix and can be finished as a follow-up change after the 4.10 Code Freeze.
https://github.com/openshift/installer/pull/5569 has just MERGED. I removed the explicit linkage to https://github.com/openshift-metal3/dev-scripts/pull/1341 and am setting the BZ to MODIFIED.
Verified that the fix introduced no regression and deployment succeeded on an IPv6 control-plane network. (Note: reproducing the issue itself was not possible in the QE environment at that moment.)

[kni@provisionhost-0-0 ~]$ more install-config.yaml
apiVersion: v1
baseDomain: qe.lab.redhat.com
proxy:
  httpProxy: http://[fd2e:6f44:5dd8::7c]:3128
  httpsProxy: http://[fd2e:6f44:5dd8::7c]:3128
  noProxy: registry.ocp-edge-cluster-0.qe.lab.redhat.com,fd00:1101:0:1::/64,fd2e:6f44:5dd8::/64,9999
networking:
  networkType: OVNKubernetes
  machineCIDR: fd2e:6f44:5dd8::/64

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-27-104747   True        False         5h3m    Cluster version is 4.10.0-0.nightly-2022-01-27-104747

With all that said, full verification still waits on QE implementing iptables rules that allow outside connections only via the bastion host and the proxy, or on the fix being verified in the customer environment.
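The restriction QE intends to implement could look roughly like the sketch below, assuming the proxy at [fd2e:6f44:5dd8::7c]:3128 and the cluster subnet fd2e:6f44:5dd8::/64 from the install-config above, with rules applied on the gateway. These exact ip6tables rules are an assumption for illustration, not QE's actual implementation.

```
# Hypothetical ip6tables sketch (illustrative only): permit traffic to the
# proxy port on the bastion and to the cluster subnet, reject all other
# forwarded egress so any image pull bypassing the proxy fails fast.
ip6tables -A FORWARD -d fd2e:6f44:5dd8::7c -p tcp --dport 3128 -j ACCEPT
ip6tables -A FORWARD -d fd2e:6f44:5dd8::/64 -j ACCEPT
ip6tables -A FORWARD -j REJECT --reject-with icmp6-adm-prohibited
```

With rules like these in place, a successful deployment would demonstrate that the IPA pull really went through the configured proxy rather than a direct route.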
Hi, could you please report whether the fix is working? Our test environment for reproducing the original issue is still WIP. We are also interested in understanding the topology of your environment, where the proxy is the only gateway, and the restrictions you apply to your nodes. Thanks
Hi Victor, Thank you for your reply. > could you please report if the fix was working? Yes, Fujitsu verified that this fix was working correctly. Best Regards, Yasuhiro Futakawa
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056