Bug 2005694
Summary: | Removing proxy object takes up to 10 minutes for the changes to propagate to the MCO | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Alex Kalenyuk <akalenyu>
Component: | Machine Config Operator | Assignee: | John Kyros <jkyros>
Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Rio Liu <rioliu>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | medium | |
Priority: | medium | CC: | aos-bugs, dollierp, eduen, jcaamano, jerzhang, jkyros, mkrejci, sregidor, wking
Version: | 4.9 | |
Target Milestone: | --- | |
Target Release: | 4.11.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-08-10 10:37:25 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description Alex Kalenyuk 2021-09-19 15:36:58 UTC
I suspect that this behaviour is not platform-specific. Maybe the machine-config-operator team can sort this out more easily?

There are a few possibilities here:

1. The proxy change had not yet rendered through the MCO when you reverted it. When you modify the proxy object, the MCController needs to read the updated proxy, update the corresponding base machineconfig, render from that updated config, and finally select nodes from a pool to apply it to. The pool could still have the stale rendered config from the previous apply and not yet the revert.
2. Some other change is being rendered in, and the update is not due to your proxy change (unlikely but possible).
3. The proxy object actually has a minor diff between the original and the now-reverted version (maybe the networking operator is parsing it differently), and that is causing the diff.
4. There is a bug somewhere in the pool logic that doesn't immediately update back to the old config.

Please attach a must-gather of the cluster right after you do this, or once it has settled. At a minimum we would need to see the rendered machineconfigs that it is updating to, to see what the diff in contents is.

Attaching must-gather after reproducing the scenario + MCPs settled:
https://drive.google.com/file/d/1PMus4KMKwnYq-_NKTXWqNmWnGzML049e/view?usp=sharing

Regarding 1 - how long can this propagation potentially take? (paused is still true at this point)

(In reply to Yu Qi Zhang from comment #2)
> There's a few possibilities here:
>
> 4. There is a bug somewhere in the pool logic that doesn't immediately
> update back to the old config

I suspect this bug is causing it: https://bugzilla.redhat.com/show_bug.cgi?id=1981549

We found a variation of this BZ where proxy1 cannot be reconfigured to proxy2. The reconfiguration is not delayed; it never happens at all (16 hours waiting). When verifying this BZ we need to make sure that we can do this too (with paused pools):

1. configure proxy1
2. check that the MCDs have the right values for proxy1
3. edit the proxy resource and reconfigure proxy1 -> proxy2
4. check that the MCDs have the right values for proxy2
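One way to check whether a proxy change has actually propagated (possibility 1 above): compare the pool's desired rendered config against its current one, then diff the two rendered machineconfigs. A minimal sketch using standard oc commands; rendered-worker-OLD and rendered-worker-NEW are placeholders for the names the first command prints, and the diff line assumes a bash shell:

    # Does the pool want a different rendered config than it currently has?
    # A mismatch means the change is still propagating (or the pool is paused).
    oc get mcp worker -o jsonpath='desired: {.spec.configuration.name}{"\n"}current: {.status.configuration.name}{"\n"}'

    # Diff the two rendered MachineConfigs to see what actually changed.
    diff <(oc get mc rendered-worker-OLD -o yaml) <(oc get mc rendered-worker-NEW -o yaml)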
Verified using IPI on AWS.

    $ oc get clusterversion
    NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.11.0-0.nightly-2022-06-15-095020   True        False         76m     Cluster version is 4.11.0-0.nightly-2022-06-15-095020

Verification steps:

1. Pause the worker and master MachineConfigPools.

2. Edit the proxy resource to add a proxy:

    oc edit proxy
    ....
    spec:
      httpProxy: http://user:pass@proxy-fake:1111
      httpsProxy: http://user:pass@proxy-fake:1111
      noProxy: test.no-proxy.com
      trustedCA:
        name: ""

3. Check that the proxy info is displayed in the daemonset:

    $ oc get ds machine-config-daemon -o yaml | grep -i proxy
    - name: HTTP_PROXY
      value: http://user:pass@proxy-fake:1111
    - name: HTTPS_PROXY
      value: http://user:pass@proxy-fake:1111
    - name: NO_PROXY
      value: .cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.sregidor-bz1.qe.devcluster.openshift.com,localhost,test.no-proxy.com
    name: oauth-proxy
    name: proxy-tls
    - name: proxy-tls
      secretName: proxy-tls
    name: oauth-proxy

4. Check that the operator has been marked as degraded:

    $ oc get co machine-config
    NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    machine-config   4.11.0-0.nightly-2022-06-15-095020   True        False         True       79m     Failed to resync 4.11.0-0.nightly-2022-06-15-095020 because: Required MachineConfigPool 'master' is paused and can not sync until it is unpaused

5. Remove the proxy config from the proxy object:

    oc edit proxy
    ....
    spec:
      trustedCA:
        name: ""

6. Check that the proxy is no longer configured in the daemonset (in less than 10 minutes):

    $ oc get ds machine-config-daemon -o yaml | grep -i proxy
    name: oauth-proxy
    name: proxy-tls
    - name: proxy-tls
      secretName: proxy-tls

7. Check that the operator is no longer marked as degraded:

    $ oc get co machine-config
    NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    machine-config   4.11.0-0.nightly-2022-06-15-095020   True        False         False      87m

We move the status to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
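For anyone re-running the verification above, the same steps can be driven non-interactively with oc patch instead of oc edit. A minimal sketch, assuming the default master/worker pool names and the same placeholder proxy values as step 2:

    # Step 1: pause both MachineConfigPools.
    oc patch mcp master --type=merge -p '{"spec":{"paused":true}}'
    oc patch mcp worker --type=merge -p '{"spec":{"paused":true}}'

    # Step 2: configure the proxy on the cluster proxy object.
    oc patch proxy cluster --type=merge -p '{"spec":{"httpProxy":"http://user:pass@proxy-fake:1111","httpsProxy":"http://user:pass@proxy-fake:1111","noProxy":"test.no-proxy.com"}}'

    # Steps 3/6: check the MCD daemonset env for the proxy values.
    oc get ds -n openshift-machine-config-operator machine-config-daemon -o yaml | grep -i proxy

    # Step 5: remove the proxy config again (the remove ops fail if the
    # fields are not currently set).
    oc patch proxy cluster --type=json -p '[{"op":"remove","path":"/spec/httpProxy"},{"op":"remove","path":"/spec/httpsProxy"},{"op":"remove","path":"/spec/noProxy"}]'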