Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1928581

Summary: proxy/cluster settings not updating due to missing local CNO image
Product: OpenShift Container Platform Reporter: Vincent Lours <vlours>
Component: Node    Assignee: Qi Wang <qiwan>
Node sub component: Kubelet QA Contact: MinLi <minmli>
Status: CLOSED DEFERRED Docs Contact:
Severity: high    
Priority: high CC: amcdermo, aos-bugs, nagrawal, palonsor, rphillips, wking
Version: 4.6    Keywords: Reopened
Target Milestone: ---    Flags: vlours: needinfo-
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-10-31 17:52:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Vincent Lours 2021-02-15 00:45:29 UTC
Description of problem:

Deploying a wrong proxy setting can create a chicken-and-egg situation.

When a wrong proxy setting is deployed, the new Machine Config (MC) is applied and the Cluster Network Operator (CNO) pod is evicted and rescheduled to a new node. Unfortunately, if the new node has never downloaded the CNO image, it will fail to pull the image because the new proxy setting is wrong.
So the CNO fails to start, and as a result it is now impossible to fix the proxy setting, because the CNO is required to apply the new configuration.

The only fix is to relocate the CNO pod to a node where the image has already been pulled.


Version-Release number of selected component (if applicable):
This issue was triggered in OCP 4.5.16, but it may well occur in other versions.
Component: Networking/cluster proxy


How reproducible:


Steps to Reproduce:
1. Wrong proxy settings are configured.
2. The MCO rolls out an upgrade to apply those settings to the container engine.
3. During that rollout, the CNO pod is evicted and scheduled onto an already-upgraded node.
4. The upgraded node does not have the CNO image and fails to pull it due to the wrong proxy settings, so the new CNO pod never starts.
5. With no CNO pod there is no proxy.status update, which means the cluster cannot recover from the proxy failure.
6. Forcing the CNO pod to start on a node that already had the image cached allowed it to start, making it possible to fix the proxy config.
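The recovery in step 6 can be sketched with standard `oc` commands. This is an illustrative sketch only, not the official procedure (which is in KCS 5795521); node names are placeholders, and the CNO pod label/namespace are assumed from typical 4.x deployments:

```shell
# Find which nodes still have the CNO image cached.
for node in $(oc get nodes -o name); do
  oc debug "$node" -- chroot /host crictl images 2>/dev/null \
    | grep -q cluster-network-operator && echo "image cached on $node"
done

# Cordon the nodes that lack the image so the rescheduled pod cannot land there.
oc adm cordon <node-without-image>

# Delete the stuck CNO pod; the scheduler recreates it on an uncordoned node.
oc delete pod -n openshift-network-operator -l name=network-operator

# Once the CNO is running again, fix the proxy setting and uncordon.
oc edit proxy/cluster
oc adm uncordon <node-without-image>
```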


Actual results:

In some cases, the CNO will fail to start, and rolling back the proxy configuration is impossible in this state.

Expected results:

- The CNO should always be relocated to a node where the image is already pulled.
- Alternatively, the CNO image could be pulled on all master nodes as part of the process, before the MC deployment is triggered.
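The second suggestion, pre-pulling the image on all masters, could be sketched as follows. This is a hypothetical approach, not something the product does; it assumes `oc adm release info --image-for` resolves the CNO image from the current release payload and that `crictl` is available on the hosts:

```shell
# Resolve the CNO image reference from the cluster's release payload.
CNO_IMAGE=$(oc adm release info --image-for=cluster-network-operator)

# Pre-pull it on every master node before the MC rollout is triggered,
# so an evicted CNO pod can always start from the local image cache.
for node in $(oc get nodes -l node-role.kubernetes.io/master -o name); do
  oc debug "$node" -- chroot /host crictl pull "$CNO_IMAGE"
done
```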

Additional info:

The KCS 5795521 (https://access.redhat.com/solutions/5795521) has been created to provide a workaround for this issue.

Comment 4 Andrew McDermott 2021-02-16 17:27:10 UTC
Dropping the severity as there is a documented workaround (see the bugzilla description).

Comment 13 Qi Wang 2021-04-01 19:31:14 UTC
Assign to MCO since they manage the controller config.

Comment 14 Yu Qi Zhang 2021-04-13 04:33:18 UTC
Sorry for the delay in response. What is the current status of this bug? What is the MCO's role? To make sure a wrongly configured proxy doesn't get written to the nodes?

Comment 23 Neelesh Agrawal 2021-11-12 15:36:20 UTC
In the PR
https://github.com/openshift/machine-config-operator/pull/2539
Qi attempted to test the validity of the proxy by testing an image pull using podman. However, because the MCO image doesn't contain podman, the test failed CI. Besides, there is debate about what would be a more appropriate place to do such validation; there could be other similar configurations that, if wrong, can cause cluster disruption.
Closing this BZ as we don't have a clear path forward for a fix. If needed, please reopen and suggest ways/places to do this testing.

Comment 24 Pablo Alonso Rodriguez 2021-11-12 15:43:16 UTC
You can do it with skopeo inspect. If skopeo is not available in the MCO image, I don't understand why it cannot be added.

Other similar configurations that can cause something like this may require separate BZs, let's focus on this one.
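A validity check along these lines would only need `skopeo inspect`, which honors the standard proxy environment variables. This is a sketch of the idea, not the PR's implementation; the proxy URL is a placeholder, and the image reference is an illustrative public release image:

```shell
# Install skopeo if the base image doesn't ship it (as suggested below).
dnf install -y skopeo

# Try to reach the registry through the candidate proxy. A non-zero exit
# code means this proxy setting would break image pulls on the nodes.
HTTPS_PROXY=http://candidate-proxy.example.com:3128 \
  skopeo inspect docker://quay.io/openshift-release-dev/ocp-release:4.6.1-x86_64 \
  > /dev/null && echo "proxy OK" || echo "proxy would break pulls"
```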

Comment 26 Qi Wang 2021-11-12 17:06:20 UTC
Skopeo can be used inside a container by pulling the skopeo image[1], so a tool to pull that image is still needed. In the PR we tried to use podman directly. I followed the instructions from https://www.redhat.com/sysadmin/podman-inside-container to install podman inside the registry.ci.openshift.org/ocp/4.9:base image, but many errors occurred; it seems the podman package was not available in the OCP image at that time.

I can get back to this and retry to see if it's available in the current version of the OCP image.

[1] https://www.redhat.com/sysadmin/how-run-skopeo-container

Comment 27 Pablo Alonso Rodriguez 2021-11-12 17:12:14 UTC
Skopeo is a binary that can be installed, either on your workstation or in a container, e.g. with "dnf install skopeo".

Comment 28 Qi Wang 2021-11-12 19:21:54 UTC
Thanks. I just tried: both podman and skopeo are installed successfully in the OCP image.
I converted the PR to use skopeo to see if it can pass the CI.

Comment 34 Qi Wang 2022-10-31 17:52:31 UTC
Closing this bug per the discussion (https://github.com/openshift/machine-config-operator/pull/2539#issuecomment-1292410131): the MCO is not responsible for proxy validation, and the original customer case is closed.
If further discussion is needed, this can be tracked by a new feature request.

Comment 35 Pablo Alonso Rodriguez 2022-11-01 08:39:24 UTC
Changing to CLOSED DEFERRED. This is not a problem we should never solve, but a problem we are not solving right now.

Just to make this clear: I agree that the MCO is not the right team to address this, but I strongly disagree with this not being solved at all. It should have been tested and solved at the CNO level. However, I see it as practical to open a new bug against the CNO component once we have other occurrences.