Description of problem:
Customer is reporting sdn container restarts on most of the nodes of their production cluster; business workloads are affected when the node holds the egress IP.

Version-Release number of selected component (if applicable):
- Red Hat OpenShift 3.11
- RHEL Atomic
- Bare metal
- Docker version docker-1.13.1-208.git7d71120.el7_9.x86_64

How reproducible:
Not able to reproduce, as it occurs randomly.

Actual results:
High number of sdn pod restarts.

Expected results:
No restarts.

Additional info:
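A quick way to survey which nodes are affected is to list the sdn pods with their restart counts (a minimal sketch; the `app=sdn` label is assumed to match the default openshift-sdn DaemonSet in 3.11, adjust if the cluster labels differ):

~~~
# List sdn pods with their restart counts and the nodes they run on;
# the RESTARTS column identifies the affected nodes.
oc get pods -n openshift-sdn -l app=sdn -o wide

# Sort recent events by timestamp to spot container kills/restarts.
oc get events -n openshift-sdn --sort-by='.lastTimestamp'
~~~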
Can you get `oc logs --previous ...` for the sdn pod? I.e., to see the logs of the sdn pod that crashed, rather than the logs of the new sdn pod that was started after the old one crashed. If that doesn't work, `oc get pod ... -o yaml` might show the last few lines of the previous pod's logs in its conditions...
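For reference, the two commands spelled out (a sketch; `sdn-p928f` stands in for whichever pod restarted):

~~~
# Logs of the previously terminated sdn container in the pod.
oc logs --previous sdn-p928f -c sdn -n openshift-sdn

# Full pod object; lastState and conditions may carry the
# last termination details of the previous container.
oc get pod sdn-p928f -n openshift-sdn -o yaml
~~~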
@danw: the log from the last status is empty:

~~~
$ ls -ltrh *sdn-p928f*
-rw-------. 1 pescorza pescorza 14M Jul 30 20:20 pod-sdn-p928f_sdn.log
-rw-------. 1 pescorza pescorza  97 Jul 30 20:20 pod-sdn-p928f_sdn.previous.log
[pescorza@pescorza openshift-sdn]$ cat pod-sdn-p928f_sdn.previous.log
Error from server (BadRequest): previous terminated container "sdn" in pod "sdn-p928f" not found
~~~

~~~
"status": {
    "conditions": [
        {
            "lastProbeTime": null,
            "lastTransitionTime": "2020-08-10T00:19:20Z",
            "status": "True",
            "type": "Initialized"
        },
        {
            "lastProbeTime": null,
            "lastTransitionTime": "2021-07-25T05:10:49Z",
            "status": "True",
            "type": "Ready"
        },
        {
            "lastProbeTime": null,
            "lastTransitionTime": null,
            "status": "True",
            "type": "ContainersReady"
        },
        {
            "lastProbeTime": null,
            "lastTransitionTime": "2020-08-10T00:19:20Z",
            "status": "True",
            "type": "PodScheduled"
        }
    ],
    "containerStatuses": [
        {
            "containerID": "docker://7b9e2ee59fa4751476a4a7c5176bce1be82e9a02d1608739a3815e6b5501a605",
            "image": "registry.redhat.io/openshift3/ose-node:v3.11.216",
            "imageID": "docker-pullable://registry.redhat.io/openshift3/ose-node@sha256:3a64f6b31e31695ac58ec7533c70b96b3a2f9858dbb288b3d55c2ac31384c00a",
            "lastState": {},
            "name": "sdn",
            "ready": true,
            "restartCount": 28,
            "state": {
                "running": {
                    "startedAt": "2021-07-25T05:10:49Z"
...
~~~
The sdn logs show that the sdn and ovs pods are periodically and intentionally killed by the kubelet. If this doesn't correspond to updates/maintenance/etc., then the most likely explanation is that they are failing their liveness probes, and so the kubelet restarts them. The sosreport shows that this did happen for the ovs pod (but not the sdn pod) on the sosreport-ed node. There's no previous log for that ovs pod in the sdn/ovs logs, but looking at the ovs ".previous.log" files that do exist for other nodes, all of them were killed (presumably by the kubelet) rather than crashing, and none of them show any obvious problems immediately before being killed.
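One way to confirm that the kubelet is killing the pods for failed liveness probes, rather than the containers crashing, is to check the pod events while the problem is occurring (a sketch; `ovs-xxxxx` is a placeholder pod name, and note that events expire after a short retention window by default):

~~~
# "Liveness probe failed" / "Killing container" events in the pod's
# Events section confirm a kubelet-initiated restart, not a crash.
oc describe pod ovs-xxxxx -n openshift-sdn

# Namespace-wide view of probe-related events.
oc get events -n openshift-sdn | grep -i 'liveness\|unhealthy\|killing'
~~~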
Hi @danw: could you please suggest any debugging steps we can perform to gather further information?
As an administrator, they can run:

~~~
oc patch -n openshift-sdn ds ovs --type=json \
    -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'
~~~

This should cause OVS to be redeployed without its liveness probe, which should ensure that the kubelet does not end up mistakenly killing it when the machine gets overloaded.

Note that if they do any OCP updates or perform other operations with the OCP ansible scripts, this change might get reverted and they will have to apply it again.

This change will be included in an upcoming 3.11.z release. (Not the very next one, which I think will be coming out this week, but the one after that.)
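To watch the redeployment and confirm the probe is gone afterwards (a minimal sketch, not part of the original workaround):

~~~
# Watch the ovs pods get recreated as the DaemonSet rolls out.
oc get pods -n openshift-sdn -w

# Should print nothing once the livenessProbe has been removed.
oc get ds ovs -n openshift-sdn -o yaml | grep livenessProbe
~~~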
Upgraded cluster from 3.11.501 to 3.11.521 using openshift-ansible-playbooks-3.11.521-1.git.0.8ca76fd.el7.noarch. This looks good:

~~~
[root@qe-smoke311-master-etcd-1 ~]# oc get ds -n openshift-sdn ovs -o yaml | grep livenessPro
[root@qe-smoke311-master-etcd-1 ~]#
~~~
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 3.11.521 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3424