Bug 1933772
Summary: | MCD Crash Loop Backoff | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Ryan Phillips <rphillips>
Component: | Machine Config Operator | Assignee: | Ben Howard <behoward>
Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.8 | CC: | behoward, danili, dollierp, huirwang, jerzhang, jparrill, kboumedh, lmcfadde, mgugino, mhamzy, mkrejci, nelluri, rioliu, rteague, sasha, sbatsche, tkapoor, wking, zzhao
Target Milestone: | --- | |
Target Release: | 4.8.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: | [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel]
Last Closed: | 2021-07-27 22:48:44 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Ryan Phillips
2021-03-01 16:55:44 UTC
*** Bug 1933155 has been marked as a duplicate of this bug. ***

*** Bug 1938192 has been marked as a duplicate of this bug. ***

The status says POST, but it appears development may be rethinking the fix/PR. Per the latest update on the PR, it is waiting to be unblocked by CI.

*** Bug 1941932 has been marked as a duplicate of this bug. ***

*** Bug 1942763 has been marked as a duplicate of this bug. ***

Bumping this bug, and mentioning the test case for Sippy, because origin PRs keep failing on this:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=alert+KubePodCrashLooping+fired.*machine-config-daemon' | grep 'failures match' | sort
pull-ci-openshift-origin-master-e2e-aws-disruptive (all) - 25 runs, 96% failed, 17% of failures match = 16% impact
pull-ci-openshift-origin-master-e2e-gcp-disruptive (all) - 6 runs, 83% failed, 60% of failures match = 50% impact

Hey folks, I'm also hitting this on nightly version "4.8.0-0.nightly-2021-04-05-174735". Is there any workaround to make this work properly?

Here is a workaround (kudos to Jerry Zhang). Apply it on the nodes affected by this error; you can identify them by checking which node hosts the pod in CrashLoopBackOff state. Then join the node and execute:

- journalctl --flush
- rm -rf /var/log/journal/*
- systemctl restart systemd-journald

Then restart the affected pods (the ones in CrashLoopBackOff) and check that they come up correctly.

Regards

A quick note on Juan's method above: the MCO also uses the journal to determine pending configs during updates. We'll try to go through with the revert ASAP, but you may see a node update restart because of it.

*** Bug 1946713 has been marked as a duplicate of this bug. ***

*** Bug 1946853 has been marked as a duplicate of this bug. ***

(In reply to Juan Manuel Parrilla Madrid from comment #10)
> Then restart the affected pods (in crashloopbackoff) and check if they goes
> up correctly.
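The journal-cleanup workaround above can be wrapped in a small helper so it does not require an interactive session on the node. This is a sketch only, not part of the fix: `flush_node_journal` is a name invented here, it assumes an authenticated `oc` with permission to debug nodes, and it uses `oc debug node/... -- chroot /host` as one common way to run host commands.

```shell
# Sketch: run the journal-cleanup workaround on one node via `oc debug`.
# Assumes `oc` is logged in with sufficient privileges (assumption, not from the bug).
flush_node_journal() {
  node="$1"
  oc debug "node/${node}" -- chroot /host sh -c '
    journalctl --flush &&
    rm -rf /var/log/journal/* &&
    systemctl restart systemd-journald
  '
}
```

Usage would be, for example, `flush_node_journal ci-ln-92bm8r2-f76d1-gs5fk-worker-b-mwkll`, followed by deleting the crashlooping MCD pod so the DaemonSet controller recreates it.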
DQA (dumb question amnesty): How do you restart the pods? What is the series of commands?

Try:

$ oc -n openshift-machine-config-operator delete pod $NAME_OF_THE_MCD_POD_THAT_WAS_CRASHLOOPING

The DaemonSet controller should create a replacement for the one you delete.

Trevor is correct. Also, since it's crashlooping, it may eventually succeed by itself (because the loop keeps retrying the pod); it just might take a while, since the crashloop back-off is exponential, I believe.

Verified on 4.8.0-0.nightly-2021-04-22-061234. No crashloop backoff on MCD.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-22-061234   True        False         49m     Cluster version is 4.8.0-0.nightly-2021-04-22-061234

[mnguyen@pet32 4.8]$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-92bm8r2-f76d1-gs5fk-master-0         Ready    master   72m   v1.21.0-rc.0+3ced7a9
ci-ln-92bm8r2-f76d1-gs5fk-master-1         Ready    master   71m   v1.21.0-rc.0+3ced7a9
ci-ln-92bm8r2-f76d1-gs5fk-master-2         Ready    master   72m   v1.21.0-rc.0+3ced7a9
ci-ln-92bm8r2-f76d1-gs5fk-worker-b-mwkll   Ready    worker   65m   v1.21.0-rc.0+3ced7a9
ci-ln-92bm8r2-f76d1-gs5fk-worker-c-m56bz   Ready    worker   65m   v1.21.0-rc.0+3ced7a9
ci-ln-92bm8r2-f76d1-gs5fk-worker-d-r6vzg   Ready    worker   65m   v1.21.0-rc.0+3ced7a9

$ oc get pods -A --field-selector spec.nodeName=ci-ln-92bm8r2-f76d1-gs5fk-worker-b-mwkll
NAMESPACE                                NAME                             READY   STATUS    RESTARTS   AGE
openshift-cluster-csi-drivers            gcp-pd-csi-driver-node-hqrwn     3/3     Running   0          65m
openshift-cluster-node-tuning-operator   tuned-skvhp                      1/1     Running   0          65m
openshift-dns                            dns-default-2fllg                2/2     Running   0          65m
openshift-dns                            node-resolver-597ss              1/1     Running   0          65m
openshift-image-registry                 node-ca-74nw7                    1/1     Running   0          65m
openshift-ingress-canary                 ingress-canary-dw87p             1/1     Running   0          65m
openshift-ingress                        router-default-84474bb94-6g5t6   1/1     Running   0          66m
openshift-machine-config-operator        machine-config-daemon-9cnp2      2/2     Running   0          65m
openshift-monitoring                     alertmanager-main-1              5/5     Running   0          64m
openshift-monitoring                     node-exporter-btxsh              2/2     Running   0          65m
openshift-monitoring                     prometheus-k8s-0                 7/7     Running   1          64m
openshift-monitoring                     thanos-querier-9cf4fd6b7-mz7tz   5/5     Running   0          64m
openshift-multus                         multus-d77w9                     1/1     Running   0          65m
openshift-multus                         network-metrics-daemon-9ct6r     2/2     Running   0          65m
openshift-network-diagnostics            network-check-target-5bkx8       1/1     Running   0          65m
openshift-sdn                            ovs-hqj8q                        1/1     Running   0          65m
openshift-sdn                            sdn-n8kpb                        2/2     Running   0          65m

$ oc -n openshift-machine-config-operator logs machine-config-daemon-9cnp2 -c machine-config-daemon | grep 'unable to update node'
$ oc -n openshift-machine-config-operator logs machine-config-daemon-9cnp2 -c machine-config-daemon | grep 'cannot apply annotation for SSH access due'

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
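On the "the crashloop is exponential" point in the comments above: Kubernetes documents the kubelet's restart back-off as starting at 10 seconds and doubling per restart, capped at five minutes. A minimal sketch of that schedule, assuming those documented values (`crashloop_delays` is a name invented here for illustration, not a Kubernetes API):

```python
def crashloop_delays(restarts, base=10.0, cap=300.0):
    """Back-off delay in seconds before each of the first `restarts` restarts.

    Assumes the documented kubelet defaults: 10s base, doubling, 5-minute cap.
    """
    return [min(base * 2 ** i, cap) for i in range(restarts)]

# The sixth restart already hits the 5-minute cap, which is why a
# crashlooping MCD pod can look stuck for long stretches before retrying.
print(crashloop_delays(7))  # [10.0, 20.0, 40.0, 80.0, 160.0, 300.0, 300.0]
```

This is why deleting the pod (as suggested above) is faster than waiting: a fresh pod starts immediately instead of sitting out the capped back-off interval.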