Description of problem:
Resuming paused MCPs (.spec.paused=false) still performs maintenance despite the resource changes having been reverted while the pools were paused.

Version-Release number of selected component (if applicable):
4.9.0-rc.0

How reproducible:
100%

Steps to Reproduce:
1. Pause the MCPs to prevent automatic reboots
2. Make a change (cluster-wide proxy in this case)
3. Revert the change
4. Resume the MCPs

Actual results:
Maintenance is performed

Expected results:
No maintenance is performed

Additional info:

Scenario - initial state:

[cnv-qe-jenkins@alex48-451-xzjsk-executor containerized-data-importer]$ oc get proxy cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  creationTimestamp: "2021-09-13T11:49:11Z"
  generation: 37
  name: cluster
  resourceVersion: "7623281"
  uid: 5101277d-a23e-4a0f-a31a-93e2693dca35
spec:
  trustedCA:
    name: ""
status: {}

1. Pause:

[cnv-qe-jenkins@alex48-451-xzjsk-executor containerized-data-importer]$ oc get mcp worker --template '{{.spec.paused}}'
true
[cnv-qe-jenkins@alex48-451-xzjsk-executor containerized-data-importer]$ oc get mcp master --template '{{.spec.paused}}'
true

2. Make changes:

[cnv-qe-jenkins@alex48-451-xzjsk-executor containerized-data-importer]$ oc get proxy cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  creationTimestamp: "2021-09-13T11:49:11Z"
  generation: 38
  name: cluster
  resourceVersion: "7755941"
  uid: 5101277d-a23e-4a0f-a31a-93e2693dca35
spec:
  httpsProxy: http://cdi-test-proxy.openshift-cnv:8080
  trustedCA:
    name: ""
status:
  httpsProxy: http://cdi-test-proxy.openshift-cnv:8080
  noProxy: .cluster.local,.svc,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,192.168.0.0/18,api-int.alex490-194.cnv-qe.rhcloud.com,localhost

3. Change reverted:

[cnv-qe-jenkins@alex48-451-xzjsk-executor containerized-data-importer]$ oc get proxy cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  creationTimestamp: "2021-09-13T11:49:11Z"
  generation: 39
  name: cluster
  resourceVersion: "7757378"
  uid: 5101277d-a23e-4a0f-a31a-93e2693dca35
spec:
  trustedCA:
    name: ""
status: {}

4. Resume:

[cnv-qe-jenkins@alex48-451-xzjsk-executor containerized-data-importer]$ oc get mcp master --template '{{.spec.paused}}'
false
[cnv-qe-jenkins@alex48-451-xzjsk-executor containerized-data-importer]$ oc get mcp worker --template '{{.spec.paused}}'
false

As you can see, maintenance starts:

[cnv-qe-jenkins@alex48-451-xzjsk-executor containerized-data-importer]$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-a88383bb8c26039eee6ee16b19ca0638   False     True       False      3              0                   0                     0                      6d3h
worker   rendered-worker-8623d681ece6edadd6c55ac90ec7328f   False     True       False      3              0                   0                     0                      6d3h

- Feel free to tell me if I'm getting this wrong and this is desired behaviour.
- Our goal is to test that our controller picks up the updated values from the proxy resource; this is why we wanted to disable the reboots.
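The expectation above can be phrased as a quick check: a pool should only perform maintenance on resume if its currently applied rendered config differs from the desired one. A minimal sketch, assuming the standard MCP fields; the oc queries are shown as comments because they need a live cluster, and the hard-coded names are stand-ins copied from the transcript:

```shell
# Sketch: decide whether resuming a pool should trigger maintenance.
# On a live cluster the two names would come from:
#   oc get mcp worker -o jsonpath='{.status.configuration.name}'
#   oc get mcp worker -o jsonpath='{.spec.configuration.name}'
# Stand-in values for illustration (taken from the oc get mcp output above):
current="rendered-worker-8623d681ece6edadd6c55ac90ec7328f"
desired="rendered-worker-8623d681ece6edadd6c55ac90ec7328f"
if [ "$current" = "$desired" ]; then
  echo "rendered configs match: no maintenance expected on resume"
else
  echo "rendered configs differ: nodes will be updated on resume"
fi
```

If the two names differ after the revert, the pool will roll nodes on resume even though the proxy object looks unchanged.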
I suspect that this behaviour is not platform-specific. Maybe the machine-config-operator team can sort this out more easily?
There are a few possibilities here:

1. The change to the proxy had not yet been rendered by the MCO when you reverted it. When you modify the proxy object, the MCController needs to read the updated proxy, update the corresponding base machineconfig, render from that updated config, and then finally select nodes from a pool to apply it to. The pool could still be holding the stale rendered config from the previous apply and not yet the revert.
2. Some other change is being rendered in, and the update is not due to your proxy change (unlikely but possible).
3. The proxy object actually has a minor diff between the original and the now-reverted version (maybe the networking operator is parsing it differently) that is causing the diff.
4. There is a bug somewhere in the pool logic that doesn't immediately update back to the old config.

Please attach a must-gather of the cluster right after you do this, or once it has settled. At minimum we would need to see the rendered machineconfigs it is updating to, so we can see what the diff in contents is.
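The rendered-config comparison requested here can be sketched as follows. The file contents are made-up stand-ins; on a real cluster the two YAMLs would be fetched with `oc get mc <rendered-name> -o yaml` using the names from `oc get mcp`:

```shell
# Hypothetical sketch: diff the old and new rendered MachineConfigs to
# see whether the revert actually produced an identical config.
# On a cluster:
#   oc get mc rendered-worker-OLD -o yaml > old.yaml
#   oc get mc rendered-worker-NEW -o yaml > new.yaml
# Stand-in contents for illustration only:
cat > old.yaml <<'EOF'
spec:
  config:
    proxy: {}
EOF
cp old.yaml new.yaml
if diff -u old.yaml new.yaml >/dev/null; then
  verdict="identical"
else
  verdict="differs"
fi
echo "rendered configs: $verdict"
rm -f old.yaml new.yaml
```

A non-empty diff would point directly at scenario 3 (a stray field difference) or scenario 2 (an unrelated change rendered in).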
Attaching a must-gather taken after reproducing the scenario, once the MCPs had settled: https://drive.google.com/file/d/1PMus4KMKwnYq-_NKTXWqNmWnGzML049e/view?usp=sharing

Regarding 1 - how long can this propagation potentially take? (paused is still true at this point)
(In reply to Yu Qi Zhang from comment #2)
> There's a few possibilities here:
>
> 4. There is a bug somewhere in the pool logic that doesn't immediately
> update back to the old config

I suspect this bug is causing it: https://bugzilla.redhat.com/show_bug.cgi?id=1981549
We found a variation of this BZ where proxy1 cannot be reconfigured to proxy2. The reconfiguration is not merely delayed; it never happens at all (16 hours waiting). When verifying this BZ we need to make sure that we can do this too (with paused pools):

1. configure proxy1
2. check that the MCDs have the right values for proxy1
2. edit the proxy resource and reconfigure proxy1 -> proxy
3. check that the MCDs have the right values for proxy2
I made a typo in my previous comment, sorry. The steps should be (with paused pools):

1. configure proxy1
2. check that the MCDs have the right values for proxy1
3. edit the proxy resource and reconfigure proxy1 -> proxy2
4. check that the MCDs have the right values for proxy2
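Step 4 above can be sketched as a simple value comparison. The proxy2 URL and the observed value are hypothetical stand-ins; on a live cluster the observed value would come from the MCD daemonset environment, e.g. the query shown in the comment:

```shell
# Hypothetical sketch of step 4: confirm the MCD environment carries the
# proxy2 endpoint. On a cluster the observed value would come from:
#   oc -n openshift-machine-config-operator get ds machine-config-daemon \
#     -o yaml | grep -A1 HTTPS_PROXY
expected="http://proxy2.example.com:3128"   # hypothetical proxy2 URL
observed="http://proxy2.example.com:3128"   # stand-in for the query result
if [ "$observed" = "$expected" ]; then
  echo "MCD picked up proxy2"
else
  echo "MCD still has the old proxy value"
fi
```

In the failing variation described above, the observed value would stay at the proxy1 endpoint indefinitely.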
Verified using IPI on AWS.

Version:

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-15-095020   True        False         76m     Cluster version is 4.11.0-0.nightly-2022-06-15-095020

Verification steps:

1. Pause the worker and master MachineConfigPools.

2. Edit the proxy resource to add a proxy:

oc edit proxy
....
spec:
  httpProxy: http://user:pass@proxy-fake:1111
  httpsProxy: http://user:pass@proxy-fake:1111
  noProxy: test.no-proxy.com
  trustedCA:
    name: ""

3. Check that the proxy info is displayed in the daemonset:

$ oc get ds machine-config-daemon -o yaml | grep -i proxy
        - name: HTTP_PROXY
          value: http://user:pass@proxy-fake:1111
        - name: HTTPS_PROXY
          value: http://user:pass@proxy-fake:1111
        - name: NO_PROXY
          value: .cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.sregidor-bz1.qe.devcluster.openshift.com,localhost,test.no-proxy.com
        name: oauth-proxy
        name: proxy-tls
      - name: proxy-tls
          secretName: proxy-tls
        name: oauth-proxy

4. Check that the operator has been marked as degraded:

$ oc get co machine-config
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.11.0-0.nightly-2022-06-15-095020   True        False         True       79m     Failed to resync 4.11.0-0.nightly-2022-06-15-095020 because: Required MachineConfigPool 'master' is paused and can not sync until it is unpaused

5. Remove the proxy config from the proxy object:

oc edit proxy
....
spec:
  trustedCA:
    name: ""

6. Check that the proxy is no longer configured in the daemonset (in less than 10 minutes):

$ oc get ds machine-config-daemon -o yaml | grep -i proxy
        name: oauth-proxy
        name: proxy-tls
      - name: proxy-tls
          secretName: proxy-tls

7. Check that the operator is no longer marked as degraded:

$ oc get co machine-config
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.11.0-0.nightly-2022-06-15-095020   True        False         False      87m

Moving the status to VERIFIED.
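The degraded checks in steps 4 and 7 boil down to reading one condition of the machine-config ClusterOperator. A sketch, with the oc query in a comment (it needs a live cluster) and a stand-in value representing the state after step 7:

```shell
# Sketch of the degraded check: read the Degraded condition status of the
# machine-config ClusterOperator. On a cluster:
#   oc get co machine-config \
#     -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}'
degraded="False"   # stand-in for the jsonpath result after step 7
if [ "$degraded" = "False" ]; then
  echo "operator no longer degraded"
else
  echo "operator still degraded (expected while pools are paused with pending changes)"
fi
```

While pools are paused with unapplied changes the condition reads True, with the "Required MachineConfigPool 'master' is paused and can not sync until it is unpaused" message shown in step 4.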
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069