In a cluster updating from 4.7.9 to 4.7.11, the machine-config operator sticks with [1]:

  - lastTransitionTime: "2021-07-12T16:02:10Z"
    message: 'Unable to apply 4.7.11: timed out waiting for the condition during
      syncRequiredMachineConfigPools: error pool master is not ready, retrying.
      Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)'
    reason: RequiredPoolsFailed
    status: "True"
    type: Degraded
  ...
  extension:
    master: 'pool is degraded because nodes fail with "1 nodes are reporting degraded
      status on sync": "Node ip-10-0-198-230.ec2.internal is reporting: \"failed to run
      command nice (6 tries): timed out waiting for the condition: running nice --
      ionice -c 3 podman pull -q --authfile /var/lib/kubelet/config.json
      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e850d731de8eb871c5ec632e3e750e7e82e03e61c82b056f5c2b8af200e15a4d
      failed: Error: Error initializing source
      docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e850d731de8eb871c5ec632e3e750e7e82e03e61c82b056f5c2b8af200e15a4d:
      error pinging docker registry quay.io: Get \\\"https://quay.io/v2/\\\":
      proxyconnect tcp: dial tcp: lookup clusterhttpsproxy.example.com on 10.0.128.2:53:
      no such host\\n: exit status 125\""'

Seems like the Proxy config object had the bogus clusterhttpsproxy.example.com inserted, and subsequently removed, but the machine-config daemon was unable to recover after the bogus insertion. Machine-config daemon pod YAML [2] and logs [3]. The machine-config DaemonSet still has the broken proxy environment variables [4]. The machine-config ControllerConfig proxy entry seems clear:

$ yaml2json <cluster-scoped-resources/machineconfiguration.openshift.io/controllerconfigs/machine-config-controller.yaml | jq -r .spec.proxy
null

So I would expect the machine-config operator to be clearing the proxy environment variables off the DaemonSet [5], but that does not seem to be happening. This is similar to bug 1928581, except in this case, the update wedged before the network operator pod went down, so it should be easier to recover from.

[1]: https://gist.github.com/abutcher/ce403a9ab355fe51dce8f549a8d2ae5d
[2]: https://gist.github.com/abutcher/9928a877af6d97d5e624501143a1d6ea
[3]: https://gist.github.com/abutcher/34cb0d5e4998ff27215290391b20b21f
[4]: https://gist.github.com/abutcher/92dea431695b483c4ed08911445a1a60
[5]: https://github.com/openshift/machine-config-operator/blob/e3863b02b7403342cdf0f981889e8c3cfc2d86bb/manifests/machineconfigdaemon/daemonset.yaml#L40-L53
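For anyone checking a cluster for this state, the stale variables can be read straight off the DaemonSet. A quick sketch (namespace, DaemonSet, and container names as in the manifest linked from [5]):

$ oc -n openshift-machine-config-operator get daemonset machine-config-daemon -o jsonpath='{.spec.template.spec.containers[?(@.name=="machine-config-daemon")].env}'

In the broken state this still lists HTTP_PROXY/HTTPS_PROXY pointing at the removed clusterhttpsproxy.example.com value, as in [4].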
Attempting to repair, we manually removed the HTTP*_PROXY environment variables from the MCD DaemonSet. The MCD update rolled out, but then the two impacted nodes got stuck with logs like:

I0712 20:16:36.854515 317114 daemon.go:1088] Validating against current config rendered-master-8b288a4d95457df6a503d6516d2c6971
E0712 20:16:36.854563 317114 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-master-8b288a4d95457df6a503d6516d2c6971: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e850d731de8eb871c5ec632e3e750e7e82e03e61c82b056f5c2b8af200e15a4d", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d94c4dfe32d6b9321b61e221c35559e22cf6e96f75bef0345fe978530fff3e02"

To see if it helped, we touched /run/machine-config-daemon-force on the compute node following [1], but it did not help:

E0712 20:36:16.210890 716066 writer.go:135] Marking Degraded due to: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e850d731de8eb871c5ec632e3e750e7e82e03e61c82b056f5c2b8af200e15a4d : with stdout output: : error running rpm-ostree rebase --experimental /srv/repo:f15456562794ed50b10e59645d97aae9fbeccad57ed23e3ccaa0fb1a021369cd --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e850d731de8eb871c5ec632e3e750e7e82e03e61c82b056f5c2b8af200e15a4d --custom-origin-description Managed by machine-config-operator: error: opendir(/srv/repo): No such file or directory : exit status 1

Full logs for that MCD are in [2].

[1]: https://github.com/openshift/machine-config-operator/pull/2265/files
[2]: https://gist.github.com/abutcher/31e478f47f7961f99ddd024153e08d38
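For the record, the manual removal above was a by-hand edit of the DaemonSet spec; something like the following should be equivalent and less error-prone (a sketch, assuming the usual MCO namespace and the variable names shown in the earlier gist; the trailing dash in oc set env removes a variable):

$ oc -n openshift-machine-config-operator set env daemonset/machine-config-daemon HTTP_PROXY- HTTPS_PROXY- NO_PROXY-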
Bug 1981146 sounds very similar to this one, possibly one bug should be closed as a dup of the other.
The bug in the MCO code is that DaemonSet management depends on ensureContainer, which currently ensures that the in-cluster resource contains at least the env vars the MCO wants, but does not remove any additional env vars that may already be in the in-cluster resource [1]. So you set the Proxy properties, and the MCO injects the new env vars into the DaemonSet. Then you clear the Proxy properties; the MCO no longer requires the DaemonSet to contain the proxy env vars, but instead of removing them it just says "huh, dunno what those are about; I'll let them stay". We fixed something similar, but not quite the same, for the CVO in bug 1951339, and have [2] open about unifying the code-bases.

[1]: https://github.com/openshift/machine-config-operator/blob/a5421965cf4e440b94bd1a4774072c517291fda0/lib/resourcemerge/core.go#L102-L117
[2]: https://issues.redhat.com/browse/GRPA-3832
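The merge gap is easy to see from the cluster side. A repro sketch on a throwaway cluster (the fake proxy endpoint mirrors the one used in the verification steps later in this bug; nothing needs to listen there):

$ oc patch proxy cluster --type merge -p '{"spec":{"httpProxy":"http://user:pass@proxy-fake:1111","httpsProxy":"http://user:pass@proxy-fake:1111"}}'
# wait for the MCO to inject HTTP_PROXY/HTTPS_PROXY into the DaemonSet, then clear the proxy:
$ oc patch proxy cluster --type merge -p '{"spec":{"httpProxy":"","httpsProxy":""}}'
# with the additive ensureContainer merge, the stale variables survive:
$ oc -n openshift-machine-config-operator get daemonset machine-config-daemon -o jsonpath='{.spec.template.spec.containers[?(@.name=="machine-config-daemon")].env[*].name}'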
*** Bug 1981146 has been marked as a duplicate of this bug. ***
Verified using an IPI AWS cluster with image 4.10.0-0.nightly-2021-10-30-025206.

When we configure a proxy "proxy1" and later remove the proxy settings, a new MC is rendered without the proxy and the proxy variables are removed correctly from the MCD pods. Then we configure "proxy2", and "proxy2" is properly configured too. So we can consider the BZ verified.

But we have observed that when we configure "proxy1" and then reconfigure it to "proxy2" instead of removing it, the reconfiguration never happens, even if we wait for 16 hours. It could be the result of being impacted by https://bugzilla.redhat.com/show_bug.cgi?id=2005694, but it is not a performance problem (a delayed reconfiguration): to go from "proxy1" to "proxy2", we need to configure "proxy1" -> remove "proxy1" -> configure "proxy2".

Nevertheless, the behavior described in this BZ is fixed, so we move the status to VERIFIED.
Verification steps:

1. Configure a proxy in the cluster:

$ oc edit proxy cluster
...
spec:
  httpProxy: http://user:pass@proxy-fake:1111
  httpsProxy: http://user:pass@proxy-fake:1111
  noProxy: test.no-proxy.com
  trustedCA:
    name: ""

2. Verify that the proxy has been added to the MCD pods' environment variables:

$ oc get pods -o yaml machine-config-daemon-6snvk | grep env -A 9
    env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: HTTP_PROXY
      value: http://user:pass@proxy-fake:1111
    - name: HTTPS_PROXY
      value: http://user:pass@proxy-fake:1111

3. Remove the proxy from the cluster:

$ oc edit proxy cluster
...
spec:
  trustedCA:
    name: ""

4. Verify that the proxy has been removed from the MCD pods' environment variables:

$ oc get pods machine-config-daemon-g72rv -o yaml | grep env -A 9
...
    env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    image: quay.io/openshift-release-d.....
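Between steps 3 and 4 the pools have to roll out the change, so before judging step 4 it may help to wait for the pools to converge and confirm a new rendered config appeared (a sketch, assuming the default master/worker pools):

$ oc wait --for=condition=Updated mcp/master mcp/worker --timeout=30m
$ oc get mc | grep rendered-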
The proxy1 -> proxy2 reconfiguration issue is a variation of https://bugzilla.redhat.com/show_bug.cgi?id=1981549 and will be tracked and fixed in that BZ.
I'm sorry, I made a mistake in my previous comment and put the wrong link. The proxy1 -> proxy2 reconfiguration issue is actually tracked in https://bugzilla.redhat.com/show_bug.cgi?id=2005694.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
*** Bug 2070930 has been marked as a duplicate of this bug. ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days