Description of problem:

Witnessed in http://registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-01-070854

MCD is occasionally going into crash loop backoff due to a recent patch that propagates an error: https://github.com/openshift/machine-config-operator/commit/dd7154131a868ec950e87cdfc74d1b89b3919792#diff-a53b7b593d3d778e62eaeeafa40088656f9212bfa2c2b7991df15fa78e60b0f0R649

```
W0301 16:31:13.766926 30144 daemon.go:634] Got an error from auxiliary tools: error: cannot apply annotation for SSH access due to: unable to update node "nil": node "test1-m6nq2-master-0" not found
I0301 16:31:13.766940 30144 daemon.go:635] Shutting down MachineConfigDaemon
F0301 16:31:13.766993 30144 helpers.go:147] error: cannot apply annotation for SSH access due to: unable to update node "nil": node "test1-m6nq2-master-0" not found
```

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
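To confirm whether a node is hitting this, check the MCD pod on that node for the fatal log line above. A minimal sketch (the pod name is only a placeholder; adjust for your cluster):

```
# List MCD pods and look for any in CrashLoopBackOff.
oc -n openshift-machine-config-operator get pods -o wide | grep -i crashloop

# Inspect the previous (crashed) container's logs for the SSH-annotation error.
# Replace machine-config-daemon-XXXXX with a pod name from the previous command.
oc -n openshift-machine-config-operator logs machine-config-daemon-XXXXX \
  -c machine-config-daemon --previous | grep 'cannot apply annotation for SSH access'
```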
*** Bug 1933155 has been marked as a duplicate of this bug. ***
*** Bug 1938192 has been marked as a duplicate of this bug. ***
The status says POST, but it appears development may be rethinking the fix/PR?
Per the latest update on the PR, it is waiting to be unblocked by CI.
*** Bug 1941932 has been marked as a duplicate of this bug. ***
*** Bug 1942763 has been marked as a duplicate of this bug. ***
Bumping this bug, and mentioning the test-case for Sippy, because origin PRs keep failing on this.

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=alert+KubePodCrashLooping+fired.*machine-config-daemon' | grep 'failures match' | sort
pull-ci-openshift-origin-master-e2e-aws-disruptive (all) - 25 runs, 96% failed, 17% of failures match = 16% impact
pull-ci-openshift-origin-master-e2e-gcp-disruptive (all) - 6 runs, 83% failed, 60% of failures match = 50% impact
Hey folks, I'm also hitting this in nightly version 4.8.0-0.nightly-2021-04-05-174735. Is there any workaround to make this work properly?
Here is a workaround (kudos to Jerry Zhang). It should be applied on the nodes affected by this error; you can tell which nodes are affected by checking which node the CrashLoopBackOff pod is scheduled on. Then log in to the node and run:

- journalctl --flush
- rm -rf /var/log/journal/*
- systemctl restart systemd-journald

Then restart the affected pods (the ones in CrashLoopBackOff) and check that they come up correctly.

Regards
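If you would rather not SSH to the node, here is a sketch of running the same steps through a debug pod (assuming cluster-admin access; the node name is only a placeholder taken from the log in the description):

```
# Run the journal cleanup on the affected node via oc debug.
# Replace the node name with the node hosting the crashlooping MCD pod.
oc debug node/test1-m6nq2-master-0 -- chroot /host /bin/bash -c '
  journalctl --flush &&
  rm -rf /var/log/journal/* &&
  systemctl restart systemd-journald
'
```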
A quick note on Juan's method above: the MCO also uses the journal to determine pending configs during updates. We'll try to go through with the revert ASAP, but you may see a node update restart because of it.
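If you do clear the journal, one way to spot that restart is to watch the pools and compare a node's current vs. desired rendered config afterwards. A sketch, assuming the standard machineconfiguration.openshift.io currentConfig/desiredConfig node annotations (node name is a placeholder):

```
# Watch for a pool unexpectedly going into UPDATING after clearing the journal.
oc get mcp

# Compare a node's current vs. desired rendered MachineConfig.
oc get node test1-m6nq2-master-0 -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\n"}{.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}{"\n"}'
```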
*** Bug 1946713 has been marked as a duplicate of this bug. ***
*** Bug 1946853 has been marked as a duplicate of this bug. ***
(In reply to Juan Manuel Parrilla Madrid from comment #10)
> Then restart the affected pods (the ones in CrashLoopBackOff) and check that
> they come up correctly.

DQA (dumb question amnesty): How do you restart the pods? What is the series of commands?
Try:

$ oc -n openshift-machine-config-operator delete pod $NAME_OF_THE_MCD_POD_THAT_WAS_CRASHLOOPING

The DaemonSet controller should create a replacement for the one you delete.
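If you are unsure which MCD pod that is, a short sketch for finding it by node (the node name is a placeholder; the --field-selector approach is the same one used in the verification below):

```
# Find the MCD pod on the affected node, then delete it with the command above
# so the DaemonSet controller recreates it.
NODE=test1-m6nq2-master-0
oc -n openshift-machine-config-operator get pods --field-selector spec.nodeName=$NODE
```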
Trevor is correct. Also, since it's crashlooping, it may eventually succeed by itself (the loop is effectively restarting the pod on its own); it just might take a while, since the crashloop backoff is exponential, I believe.
Verified on 4.8.0-0.nightly-2021-04-22-061234. No crashloop backoff on MCD.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-22-061234   True        False         49m     Cluster version is 4.8.0-0.nightly-2021-04-22-061234

[mnguyen@pet32 4.8]$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-92bm8r2-f76d1-gs5fk-master-0         Ready    master   72m   v1.21.0-rc.0+3ced7a9
ci-ln-92bm8r2-f76d1-gs5fk-master-1         Ready    master   71m   v1.21.0-rc.0+3ced7a9
ci-ln-92bm8r2-f76d1-gs5fk-master-2         Ready    master   72m   v1.21.0-rc.0+3ced7a9
ci-ln-92bm8r2-f76d1-gs5fk-worker-b-mwkll   Ready    worker   65m   v1.21.0-rc.0+3ced7a9
ci-ln-92bm8r2-f76d1-gs5fk-worker-c-m56bz   Ready    worker   65m   v1.21.0-rc.0+3ced7a9
ci-ln-92bm8r2-f76d1-gs5fk-worker-d-r6vzg   Ready    worker   65m   v1.21.0-rc.0+3ced7a9

$ oc get pods -A --field-selector spec.nodeName=ci-ln-92bm8r2-f76d1-gs5fk-worker-b-mwkll
NAMESPACE                                NAME                             READY   STATUS    RESTARTS   AGE
openshift-cluster-csi-drivers            gcp-pd-csi-driver-node-hqrwn     3/3     Running   0          65m
openshift-cluster-node-tuning-operator   tuned-skvhp                      1/1     Running   0          65m
openshift-dns                            dns-default-2fllg                2/2     Running   0          65m
openshift-dns                            node-resolver-597ss              1/1     Running   0          65m
openshift-image-registry                 node-ca-74nw7                    1/1     Running   0          65m
openshift-ingress-canary                 ingress-canary-dw87p             1/1     Running   0          65m
openshift-ingress                        router-default-84474bb94-6g5t6   1/1     Running   0          66m
openshift-machine-config-operator        machine-config-daemon-9cnp2      2/2     Running   0          65m
openshift-monitoring                     alertmanager-main-1              5/5     Running   0          64m
openshift-monitoring                     node-exporter-btxsh              2/2     Running   0          65m
openshift-monitoring                     prometheus-k8s-0                 7/7     Running   1          64m
openshift-monitoring                     thanos-querier-9cf4fd6b7-mz7tz   5/5     Running   0          64m
openshift-multus                         multus-d77w9                     1/1     Running   0          65m
openshift-multus                         network-metrics-daemon-9ct6r     2/2     Running   0          65m
openshift-network-diagnostics            network-check-target-5bkx8       1/1     Running   0          65m
openshift-sdn                            ovs-hqj8q                        1/1     Running   0          65m
openshift-sdn                            sdn-n8kpb                        2/2     Running   0          65m

$ oc -n openshift-machine-config-operator logs machine-config-daemon-9cnp2 -c machine-config-daemon | grep 'unable to update node'
$ oc -n openshift-machine-config-operator logs machine-config-daemon-9cnp2 -c machine-config-daemon | grep 'cannot apply annotation for SSH access due'
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438