Customer Contact Name: Yasuhiro Futakawa

Description of Problem:
This problem was introduced by the recent merge of the Day1 networking feature into OCP.

Summary of Day1 networking:
https://github.com/openshift/enhancements/blob/master/enhancements/network/baremetal-ipi-network-configuration.md#baremetal-ipi-network-configuration

Change related to this issue:
https://github.com/openshift/enhancements/blob/master/enhancements/baremetal/coreos-image-in-release.md#include-the-coreos-image-in-the-release-for-baremetal

In previous OCP releases, bare metal servers were PXE-booted with a kernel and initrd. The Day1 networking feature changed this: the server now boots a CoreOS image, and the IPA (Ironic Python Agent) runs on it as a container under podman. The proxy settings written in install-config.yaml were not propagated to podman, so in environments behind a proxy the IPA image pull failed and inspection failed with it. This problem does not occur in previous OCP releases, because CoreOS (and therefore podman) was not used at that stage of deployment.

Version-Release number of selected component:
This issue was detected in a pre-GA version.
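To illustrate the failure mode: podman only honors proxy settings that are present in the environment of the process that starts the container, so the service launching the IPA container on the CoreOS ramdisk needs them exported explicitly. The following is a hypothetical systemd drop-in sketching that mechanism; the unit name, file path, and proxy values are assumptions for illustration, not the actual fix from the PRs below.

```ini
# Hypothetical drop-in: /etc/systemd/system/ironic-agent.service.d/10-proxy.conf
# (unit name and values are illustrative, not taken from the real fix)
[Service]
# podman inherits these variables and uses them when pulling the IPA image
Environment="HTTP_PROXY=http://proxy.example.com:3128"
Environment="HTTPS_PROXY=http://proxy.example.com:3128"
Environment="NO_PROXY=localhost,127.0.0.1"
```

Without such environment propagation, the `podman pull` of the IPA image goes direct to the registry and times out behind a proxy-only network.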
Red Hat OpenShift Container Platform Version Number: 4.10
Release Number: 4.10.0-0.nightly-2021-12-20-231053
Kubernetes Version: 1.22.1
CRI-O Version: 1.23.0
Related Component: NONE
Related Middleware/Application: iRMC
Underlying RHCOS Release Number: 4.10
Underlying RHCOS Architecture: x86_64
Underlying RHCOS Kernel Version: 4.18.0
Drivers or hardware or architecture dependency: None

How reproducible: Every time

Steps to Reproduce:
$ openshift-install --dir ~/clusterconfigs create manifests
$ openshift-install --dir ~/clusterconfigs --log-level debug create cluster

Actual Results: IPA image pull failed
Expected Results: IPA image can be pulled successfully

Summary of actions taken to resolve issue:
Fujitsu opened an issue: https://github.com/openshift/installer/issues/5552
Fujitsu sent PRs:
https://github.com/openshift/image-customization-controller/pull/33
https://github.com/openshift/installer/pull/5569
https://github.com/openshift/cluster-baremetal-operator/pull/240

Location of diagnostic data: None
Hardware configuration: Model: RX2540 M4
Upon discussion with the Metal Platform team, we decided this qualifies as a blocker because it is a regression for use cases that require a proxy.
Our colleagues from Fujitsu who originally identified this issue have proposed fixes which are currently under review.
In addition to the PRs aimed at resolving the proxy issue, the Metal Team is working on a validation/CI job to ensure the proposed fixes work as expected (tracked in https://github.com/openshift-metal3/dev-scripts/pull/1341).
The team has made good progress with this BZ. Current status of the fixes:
https://github.com/openshift/image-customization-controller/pull/33 MERGED
https://github.com/openshift/cluster-baremetal-operator/pull/240 MERGED
https://github.com/openshift/installer/pull/5569 OPEN

PR 5569 has passed review and has not merged only because of perma-failing tests; it is now waiting for a Staff Engineer to review and override CI so it can merge. PR 1341 (https://github.com/openshift-metal3/dev-scripts/pull/1341), which adds test coverage, is still WIP, but it is not part of the fix and can be finished as a follow-up change after the 4.10 Code Freeze.
https://github.com/openshift/installer/pull/5569 has just MERGED. I removed the explicit linkage to https://github.com/openshift-metal3/dev-scripts/pull/1341 and am setting the BZ to MODIFIED.
Verified that the fix introduced no regression and deployment succeeded on an IPv6 control-plane network. (Note: reproducing the issue itself was not possible in the QE environment at that moment.)

[kni@provisionhost-0-0 ~]$ more install-config.yaml
apiVersion: v1
baseDomain: qe.lab.redhat.com
proxy:
  httpProxy: http://[fd2e:6f44:5dd8::7c]:3128
  httpsProxy: http://[fd2e:6f44:5dd8::7c]:3128
  noProxy: registry.ocp-edge-cluster-0.qe.lab.redhat.com,fd00:1101:0:1::/64,fd2e:6f44:5dd8::/64,9999
networking:
  networkType: OVNKubernetes
  machineCIDR: fd2e:6f44:5dd8::/64

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-27-104747   True        False         5h3m    Cluster version is 4.10.0-0.nightly-2022-01-27-104747

With all that said, full verification still waits on QE implementing iptables rules that allow outside connections only via the bastion host and the proxy, or on the fix being verified in the customer environment.
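The restriction QE intends to implement could look roughly like the sketch below, assuming the proxy at [fd2e:6f44:5dd8::7c]:3128 and the cluster subnet fd2e:6f44:5dd8::/64 from the install-config above, with rules applied on the gateway. These exact ip6tables rules are an assumption for illustration, not QE's actual implementation.

```
# Hypothetical ip6tables sketch (illustrative only): permit traffic to the
# proxy port on the bastion and to the cluster subnet, reject all other
# forwarded egress so any image pull bypassing the proxy fails fast.
ip6tables -A FORWARD -d fd2e:6f44:5dd8::7c -p tcp --dport 3128 -j ACCEPT
ip6tables -A FORWARD -d fd2e:6f44:5dd8::/64 -j ACCEPT
ip6tables -A FORWARD -j REJECT --reject-with icmp6-adm-prohibited
```

With rules like these in place, a successful deployment would demonstrate that the IPA pull really went through the configured proxy rather than a direct route.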
Hi, could you please report whether the fix is working? Our test environment for reproducing the original issue is still WIP. We are also interested in understanding the topology of your environment, where the proxy is the only gateway, and the restrictions you apply to your nodes. Thanks
Hi Victor, Thank you for your reply. > could you please report if the fix was working? Yes, Fujitsu verified that this fix was working correctly. Best Regards, Yasuhiro Futakawa
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056