Bug 1989547

Summary: sdn container is restarting randomly on some nodes
Product: OpenShift Container Platform
Reporter: Pamela Escorza <pescorza>
Component: Networking
Assignee: Dan Winship <danw>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: aconstan, danw
Version: 3.11.0
Target Milestone: ---
Target Release: 3.11.z
Hardware: x86_64
OS: Linux
Last Closed: 2021-09-15 19:21:16 UTC
Type: Bug

Description Pamela Escorza 2021-08-03 12:53:04 UTC
Description of problem:
The customer is reporting sdn container restarts on most of the nodes of their production cluster; business workloads are affected when the restarting node is the one hosting the egress IP.

Version-Release number of selected component (if applicable):
- Red Hat OpenShift 3.11 - RHEL Atomic - Baremetal
- Docker version: docker-1.13.1-208.git7d71120.el7_9.x86_64

How reproducible:
Not able to reproduce; the restarts appear to be random.


Actual results:
High number of sdn pod restarts.

Expected results:
No restarts.

Additional info:

Comment 3 Dan Winship 2021-08-10 12:31:20 UTC
Can you get `oc logs --previous ...` for the sdn pod? I.e., to see the logs of the sdn pod that crashed, rather than the logs of the new sdn pod that was started after the old one crashed.

If that doesn't work, `oc get pod ... -o yaml` might show the last few lines of the previous pod's logs in its conditions...
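
For example, something like the following (namespace and pod name are illustrative; substitute the sdn pod on the affected node):

~~~
# logs of the previously-crashed sdn container (pod name is just an example)
oc logs --previous -n openshift-sdn sdn-p928f -c sdn

# full pod object, including containerStatuses/lastState, which may carry the
# termination reason and message of the previous container
oc get pod -n openshift-sdn sdn-p928f -o yaml
~~~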

Comment 4 Pamela Escorza 2021-08-10 14:34:37 UTC
@danw: the log from the previous container is not available; the `--previous` log only contains this error:
~~~
$ ls -ltrh *sdn-p928f*
-rw-------. 1 pescorza pescorza 14M Jul 30 20:20 pod-sdn-p928f_sdn.log
-rw-------. 1 pescorza pescorza  97 Jul 30 20:20 pod-sdn-p928f_sdn.previous.log
[pescorza@pescorza openshift-sdn]$ cat pod-sdn-p928f_sdn.previous.log
Error from server (BadRequest): previous terminated container "sdn" in pod "sdn-p928f" not found

~~~

~~~
 "status": {
    "conditions": [
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2020-08-10T00:19:20Z",
        "status": "True",
        "type": "Initialized"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2021-07-25T05:10:49Z",
        "status": "True",
        "type": "Ready"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": null,
        "status": "True",
        "type": "ContainersReady"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2020-08-10T00:19:20Z",
        "status": "True",
        "type": "PodScheduled"
      }
    ],
    "containerStatuses": [
      {
        "containerID": "docker://7b9e2ee59fa4751476a4a7c5176bce1be82e9a02d1608739a3815e6b5501a605",
        "image": "registry.redhat.io/openshift3/ose-node:v3.11.216",
        "imageID": "docker-pullable://registry.redhat.io/openshift3/ose-node@sha256:3a64f6b31e31695ac58ec7533c70b96b3a2f9858dbb288b3d55c2ac31384c00a",
        "lastState": {},
        "name": "sdn",
        "ready": true,
        "restartCount": 28,
        "state": {
          "running": {
            "startedAt": "2021-07-25T05:10:49Z"
...
~~~

Comment 6 Dan Winship 2021-08-17 13:28:00 UTC
The sdn logs show that the sdn and ovs pods are periodically, and intentionally, killed by the kubelet. If this doesn't correspond to updates/maintenance/etc., then the most likely explanation is that they are failing their liveness probes and the kubelet is restarting them.

The sosreport shows that this did happen for the ovs pod (but not the sdn pod) on the sosreport-ed node. There is no previous log for that ovs pod in the sdn/ovs logs, but looking at the ovs ".previous.log" files that do exist for other nodes, all of them were killed (presumably by the kubelet) rather than crashing, and none of them show any obvious problems immediately before being killed.
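
If the probe theory is right, the kubelet should have recorded "Unhealthy" events against the ovs pods around the restart times. Something along these lines could confirm it while the events are still recent (events expire after roughly an hour by default; the pod name below is illustrative):

~~~
# kubelet records liveness failures as "Unhealthy" events on the pod
oc describe pod -n openshift-sdn ovs-xxxxx | grep -i -A2 unhealthy

# or scan recent events in the namespace for liveness probe failures
oc get events -n openshift-sdn | grep -i "liveness probe failed"
~~~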

Comment 7 Pamela Escorza 2021-08-17 15:53:29 UTC
Hi @danw: could you please suggest any debugging we can perform to get further information?

Comment 8 Dan Winship 2021-08-24 14:15:16 UTC
As an administrator, they can run:

    oc patch -n openshift-sdn ds ovs --type=json \
      -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'

This should cause OVS to be redeployed without its liveness probe, which should ensure that kubelet does not end up mistakenly killing it when the machine gets overloaded.
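
To confirm the change took effect, something like the following should show the probe gone and the ovs pods re-created (the `app=ovs` label selector is an assumption about how the daemonset labels its pods):

~~~
# should return nothing once the probe has been removed
oc get ds -n openshift-sdn ovs -o yaml | grep livenessProbe

# the ovs pods should show a recent AGE after being re-created by the rollout
oc get pods -n openshift-sdn -l app=ovs -o wide
~~~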

Note that if they do any OCP updates or perform other operations with the OCP ansible scripts, this change might get reverted and they will have to do it again.

This change will be included in an upcoming 3.11.z release. (Not the very next one, which I think will be coming out this week, but the one after that.)

Comment 13 zhaozhanqi 2021-09-10 10:25:06 UTC
Upgraded cluster from 3.11.501 to 3.11.521 using openshift-ansible-playbooks-3.11.521-1.git.0.8ca76fd.el7.noarch.

This looks good; the livenessProbe is no longer present in the ovs daemonset:

[root@qe-smoke311-master-etcd-1 ~]# oc get ds -n openshift-sdn ovs -o yaml  | grep livenessPro
[root@qe-smoke311-master-etcd-1 ~]#

Comment 15 errata-xmlrpc 2021-09-15 19:21:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 3.11.521 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3424