Bug 1933772
Summary: | MCD Crash Loop Backoff | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Ryan Phillips <rphillips>
Component: | Machine Config Operator | Assignee: | Ben Howard <behoward>
Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.8 | CC: | behoward, danili, dollierp, huirwang, jerzhang, jparrill, kboumedh, lmcfadde, mgugino, mhamzy, mkrejci, nelluri, rioliu, rteague, sasha, sbatsche, tkapoor, wking, zzhao
Target Milestone: | --- | |
Target Release: | 4.8.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: | [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel]
Last Closed: | 2021-07-27 22:48:44 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Ryan Phillips
2021-03-01 16:55:44 UTC
*** Bug 1933155 has been marked as a duplicate of this bug. ***

*** Bug 1938192 has been marked as a duplicate of this bug. ***

The status says POST, but it appears development may be rethinking the fix/PR. Per the latest update on the PR, it is waiting to be unblocked by CI.

*** Bug 1941932 has been marked as a duplicate of this bug. ***

*** Bug 1942763 has been marked as a duplicate of this bug. ***

Bumping this bug, and mentioning the test case for Sippy, because origin PRs keep failing on this:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=alert+KubePodCrashLooping+fired.*machine-config-daemon' | grep 'failures match' | sort
pull-ci-openshift-origin-master-e2e-aws-disruptive (all) - 25 runs, 96% failed, 17% of failures match = 16% impact
pull-ci-openshift-origin-master-e2e-gcp-disruptive (all) - 6 runs, 83% failed, 60% of failures match = 50% impact

Hey folks, I'm also hitting this on nightly version "4.8.0-0.nightly-2021-04-05-174735". Is there any workaround to make this work properly?

Here is a workaround (kudos to Jerry Zhang). Apply it on the nodes affected by this error; you can identify them by checking which node hosts the pod in CrashLoopBackOff state. Then join the node and execute:

- journalctl --flush
- rm -rf /var/log/journal/*
- systemctl restart systemd-journald

Then restart the affected pods (the ones in CrashLoopBackOff) and check that they come up correctly.

Regards

A quick note on Juan's method above: the MCO also uses the journal to determine pending configs during updates. We'll try to go through with the revert ASAP, but you may see a node update restart because of it.

*** Bug 1946713 has been marked as a duplicate of this bug. ***

*** Bug 1946853 has been marked as a duplicate of this bug. ***

(In reply to Juan Manuel Parrilla Madrid from comment #10)
> Then restart the affected pods (in crashloopbackoff) and check if they goes
> up correctly.
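The journal-cleanup workaround above can be wrapped in a small helper so it does not require an interactive session on the node. This is a sketch only, not part of the fix: `flush_node_journal` is a name invented here, it assumes an authenticated `oc` with permission to debug nodes, and it uses `oc debug node/... -- chroot /host` as one common way to run host commands.

```shell
# Sketch: run the journal-cleanup workaround on one node via `oc debug`.
# Assumes `oc` is logged in with sufficient privileges (assumption, not from the bug).
flush_node_journal() {
  node="$1"
  oc debug "node/${node}" -- chroot /host sh -c '
    journalctl --flush &&
    rm -rf /var/log/journal/* &&
    systemctl restart systemd-journald
  '
}
```

Usage would be, for example, `flush_node_journal ci-ln-92bm8r2-f76d1-gs5fk-worker-b-mwkll`, followed by deleting the crashlooping MCD pod so the DaemonSet controller recreates it.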
DQA (dumb question amnesty): How do you restart the pods? What is the series of commands?

Try:

$ oc -n openshift-machine-config-operator delete pod $NAME_OF_THE_MCD_POD_THAT_WAS_CRASHLOOPING

The DaemonSet controller should create a replacement for the one you delete.

Trevor is correct. Also, since it's crashlooping, it may eventually succeed by itself (because the loop keeps retrying the pod); it just might take a while, since the crashloop back-off is exponential, I believe.

Verified on 4.8.0-0.nightly-2021-04-22-061234. No crashloop backoff on MCD.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-22-061234   True        False         49m     Cluster version is 4.8.0-0.nightly-2021-04-22-061234

[mnguyen@pet32 4.8]$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-92bm8r2-f76d1-gs5fk-master-0         Ready    master   72m   v1.21.0-rc.0+3ced7a9
ci-ln-92bm8r2-f76d1-gs5fk-master-1         Ready    master   71m   v1.21.0-rc.0+3ced7a9
ci-ln-92bm8r2-f76d1-gs5fk-master-2         Ready    master   72m   v1.21.0-rc.0+3ced7a9
ci-ln-92bm8r2-f76d1-gs5fk-worker-b-mwkll   Ready    worker   65m   v1.21.0-rc.0+3ced7a9
ci-ln-92bm8r2-f76d1-gs5fk-worker-c-m56bz   Ready    worker   65m   v1.21.0-rc.0+3ced7a9
ci-ln-92bm8r2-f76d1-gs5fk-worker-d-r6vzg   Ready    worker   65m   v1.21.0-rc.0+3ced7a9

$ oc get pods -A --field-selector spec.nodeName=ci-ln-92bm8r2-f76d1-gs5fk-worker-b-mwkll
NAMESPACE                                NAME                             READY   STATUS    RESTARTS   AGE
openshift-cluster-csi-drivers            gcp-pd-csi-driver-node-hqrwn     3/3     Running   0          65m
openshift-cluster-node-tuning-operator   tuned-skvhp                      1/1     Running   0          65m
openshift-dns                            dns-default-2fllg                2/2     Running   0          65m
openshift-dns                            node-resolver-597ss              1/1     Running   0          65m
openshift-image-registry                 node-ca-74nw7                    1/1     Running   0          65m
openshift-ingress-canary                 ingress-canary-dw87p             1/1     Running   0          65m
openshift-ingress                        router-default-84474bb94-6g5t6   1/1     Running   0          66m
openshift-machine-config-operator        machine-config-daemon-9cnp2      2/2     Running   0          65m
openshift-monitoring                     alertmanager-main-1              5/5     Running   0          64m
openshift-monitoring                     node-exporter-btxsh              2/2     Running   0          65m
openshift-monitoring                     prometheus-k8s-0                 7/7     Running   1          64m
openshift-monitoring                     thanos-querier-9cf4fd6b7-mz7tz   5/5     Running   0          64m
openshift-multus                         multus-d77w9                     1/1     Running   0          65m
openshift-multus                         network-metrics-daemon-9ct6r     2/2     Running   0          65m
openshift-network-diagnostics            network-check-target-5bkx8       1/1     Running   0          65m
openshift-sdn                            ovs-hqj8q                        1/1     Running   0          65m
openshift-sdn                            sdn-n8kpb                        2/2     Running   0          65m

$ oc -n openshift-machine-config-operator logs machine-config-daemon-9cnp2 -c machine-config-daemon | grep 'unable to update node'
$ oc -n openshift-machine-config-operator logs machine-config-daemon-9cnp2 -c machine-config-daemon | grep 'cannot apply annotation for SSH access due'

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
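On the "the crashloop is exponential" point in the comments above: Kubernetes documents the kubelet's restart back-off as starting at 10 seconds and doubling per restart, capped at five minutes. A minimal sketch of that schedule, assuming those documented values (`crashloop_delays` is a name invented here for illustration, not a Kubernetes API):

```python
def crashloop_delays(restarts, base=10.0, cap=300.0):
    """Back-off delay in seconds before each of the first `restarts` restarts.

    Assumes the documented kubelet defaults: 10s base, doubling, 5-minute cap.
    """
    return [min(base * 2 ** i, cap) for i in range(restarts)]

# The sixth restart already hits the 5-minute cap, which is why a
# crashlooping MCD pod can look stuck for long stretches before retrying.
print(crashloop_delays(7))  # [10.0, 20.0, 40.0, 80.0, 160.0, 300.0, 300.0]
```

This is why deleting the pod (as suggested above) is faster than waiting: a fresh pod starts immediately instead of sitting out the capped back-off interval.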