Bug 1989547
| Summary: | sdn container is restarting randomly on some nodes | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Pamela Escorza <pescorza> |
| Component: | Networking | Assignee: | Dan Winship <danw> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | aconstan, danw |
| Version: | 3.11.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 3.11.z | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-09-15 19:21:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Pamela Escorza
2021-08-03 12:53:04 UTC
Can you get `oc logs --previous ...` for the sdn pod? I.e., to see the logs of the sdn pod that crashed, rather than the logs of the new sdn pod that was started after the old one crashed. If that doesn't work, `oc get pod ... -o yaml` might show the last few lines of the previous pod's logs in its conditions...

@danw: the log from the last status is empty:

~~~
$ ls -ltrh *sdn-p928f*
-rw-------. 1 pescorza pescorza 14M Jul 30 20:20 pod-sdn-p928f_sdn.log
-rw-------. 1 pescorza pescorza  97 Jul 30 20:20 pod-sdn-p928f_sdn.previous.log
[pescorza@pescorza openshift-sdn]$ cat pod-sdn-p928f_sdn.previous.log
Error from server (BadRequest): previous terminated container "sdn" in pod "sdn-p928f" not found
~~~

~~~
"status": {
    "conditions": [
        {
            "lastProbeTime": null,
            "lastTransitionTime": "2020-08-10T00:19:20Z",
            "status": "True",
            "type": "Initialized"
        },
        {
            "lastProbeTime": null,
            "lastTransitionTime": "2021-07-25T05:10:49Z",
            "status": "True",
            "type": "Ready"
        },
        {
            "lastProbeTime": null,
            "lastTransitionTime": null,
            "status": "True",
            "type": "ContainersReady"
        },
        {
            "lastProbeTime": null,
            "lastTransitionTime": "2020-08-10T00:19:20Z",
            "status": "True",
            "type": "PodScheduled"
        }
    ],
    "containerStatuses": [
        {
            "containerID": "docker://7b9e2ee59fa4751476a4a7c5176bce1be82e9a02d1608739a3815e6b5501a605",
            "image": "registry.redhat.io/openshift3/ose-node:v3.11.216",
            "imageID": "docker-pullable://registry.redhat.io/openshift3/ose-node@sha256:3a64f6b31e31695ac58ec7533c70b96b3a2f9858dbb288b3d55c2ac31384c00a",
            "lastState": {},
            "name": "sdn",
            "ready": true,
            "restartCount": 28,
            "state": {
                "running": {
                    "startedAt": "2021-07-25T05:10:49Z"
...
~~~

The sdn logs show that the sdn and ovs pods are periodically, intentionally killed by kubelet. If this doesn't correspond to updates/maintenance/etc., then the most likely answer is that they are failing their liveness probes, and so kubelet restarts them. The sosreport shows that this did happen for the ovs pod (but not the sdn pod) on the sosreport-ed node.
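The empty `lastState` in the status above is why `oc logs --previous` reports the previous container "not found": the pod has a high restart count, but no details of the last termination were retained. A minimal sketch (plain Python, no cluster needed; the sample JSON is trimmed from this bug's pod status) of pulling those two fields out of a pod status:

```python
import json

# Sample trimmed from the `oc get pod sdn-p928f -o json` status shown in this bug.
status = json.loads("""
{
  "containerStatuses": [
    {
      "name": "sdn",
      "ready": true,
      "restartCount": 28,
      "lastState": {},
      "state": {"running": {"startedAt": "2021-07-25T05:10:49Z"}}
    }
  ]
}
""")

# A restartCount of 28 combined with an empty lastState means the container
# has been restarted repeatedly, but the previous termination details (exit
# code, reason) were not retained.
for cs in status["containerStatuses"]:
    term = cs["lastState"].get("terminated", {})
    print(cs["name"], cs["restartCount"], term.get("reason", "<no lastState recorded>"))
# -> sdn 28 <no lastState recorded>
```

When `lastState.terminated` is present, its `reason` and `exitCode` fields distinguish a crash from a kill (e.g. exit code 137 after a failed liveness probe).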
There's no previous log for that ovs pod in the sdn/ovs logs, but looking at the ovs ".previous.log" files that do exist for other nodes, all of them were killed (presumably by kubelet) as opposed to crashing, and none of them show any obvious problems immediately before being killed.

Hi @danw: could you please suggest any kind of debugging that we can perform to get further information?

As an administrator, they can run:

~~~
oc patch -n openshift-sdn ds ovs --type=json \
    -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'
~~~

This should cause OVS to be redeployed without its liveness probe, which should ensure that kubelet does not end up mistakenly killing it when the machine gets overloaded. Note that if they do any OCP updates or perform other operations with the OCP ansible scripts, this change might get reverted and they will have to do it again.

This change will be included in an upcoming 3.11.z release. (Not the very next one, which I think will be coming out this week, but the one after that.)

Upgraded the cluster from 3.11.501 to 3.11.521 using openshift-ansible-playbooks-3.11.521-1.git.0.8ca76fd.el7.noarch; this looks good:

~~~
[root@qe-smoke311-master-etcd-1 ~]# oc get ds -n openshift-sdn ovs -o yaml | grep livenessPro
[root@qe-smoke311-master-etcd-1 ~]#
~~~

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 3.11.521 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3424
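For reference, the JSON patch in the workaround above is a single RFC 6902 `remove` operation. A small self-contained sketch of what it does to the DaemonSet spec (the container name and probe contents here are placeholders, not the real DaemonSet definition; `oc patch` applies the equivalent operation server-side):

```python
# Placeholder DaemonSet fragment -- not the actual openshift-sdn ovs spec.
spec = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "openvswitch", "livenessProbe": {"exec": {}}, "image": "..."}
    ]}}}
}

patch = [{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]

# Hand-rolled RFC 6902 "remove": walk the path, then delete the leaf key.
for op in patch:
    if op["op"] == "remove":
        *parents, leaf = [p for p in op["path"].split("/") if p]
        target = spec
        for part in parents:
            target = target[int(part)] if isinstance(target, list) else target[part]
        del target[int(leaf) if isinstance(target, list) else leaf]

print(spec["spec"]["template"]["spec"]["containers"][0])
# -> {'name': 'openvswitch', 'image': '...'}
```

After the patch, the container entry no longer carries a `livenessProbe`, so kubelet stops probing (and restarting) the ovs pod; this matches the empty `grep livenessPro` output in the verification above.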