Bug 1989547 - sdn container is restarting randomly on some nodes
Summary: sdn container is restarting randomly on some nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.11.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Dan Winship
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-03 12:53 UTC by Pamela Escorza
Modified: 2021-09-15 19:21 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-15 19:21:16 UTC
Target Upstream Version:
Embargoed:




Links
System ID                                        Last Updated
Github openshift openshift-ansible pull 12343    2021-08-24 13:53:21 UTC
Red Hat Product Errata RHBA-2021:3424            2021-09-15 19:21:18 UTC

Description Pamela Escorza 2021-08-03 12:53:04 UTC
Description of problem:
Customer is reporting sdn container restarts on most of the nodes of their production cluster; business workloads are affected when the restarting node holds the egress IP.

Version-Release number of selected component (if applicable):
- Red Hat OpenShift 3.11 - RHEL Atomic - Baremetal
- Docker version  docker-1.13.1-208.git7d71120.el7_9.x86_64  

How reproducible:
Not able to reproduce, as the restarts occur randomly.


Actual results:
High number of sdn pod restarts.

Expected results:
No restarts.

Additional info:
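For context, one way to check which node currently hosts an egress IP in OCP 3.11 (a sketch, assuming the standard OpenShift SDN egress IP configuration via hostsubnets/netnamespaces):

~~~
# egress IPs currently assigned to each node's hostsubnet
oc get hostsubnets

# egress IPs requested by each project
oc get netnamespaces
~~~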

Comment 3 Dan Winship 2021-08-10 12:31:20 UTC
Can you get `oc logs --previous ...` for the sdn pod? I.e., to see the logs of the sdn pod that crashed, rather than the logs of the new sdn pod that was started after the old one crashed.

If that doesn't work, `oc get pod ... -o yaml` might show the last few lines of the previous pod's logs in its conditions...
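
A minimal sketch of the commands being asked for, assuming the default openshift-sdn namespace and using a placeholder pod name:

~~~
# logs of the previously terminated "sdn" container, if kubelet still has them
oc logs --previous -n openshift-sdn sdn-xxxxx -c sdn

# full pod object; containerStatuses/lastState may record the previous termination
oc get pod -n openshift-sdn sdn-xxxxx -o yaml
~~~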

Comment 4 Pamela Escorza 2021-08-10 14:34:37 UTC
@danw: the log from the previous container is effectively empty; it contains only this error:
~~~
$ ls -ltrh *sdn-p928f*
-rw-------. 1 pescorza pescorza 14M Jul 30 20:20 pod-sdn-p928f_sdn.log
-rw-------. 1 pescorza pescorza  97 Jul 30 20:20 pod-sdn-p928f_sdn.previous.log
[pescorza@pescorza openshift-sdn]$ cat pod-sdn-p928f_sdn.previous.log
Error from server (BadRequest): previous terminated container "sdn" in pod "sdn-p928f" not found

~~~

~~~
 "status": {
    "conditions": [
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2020-08-10T00:19:20Z",
        "status": "True",
        "type": "Initialized"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2021-07-25T05:10:49Z",
        "status": "True",
        "type": "Ready"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": null,
        "status": "True",
        "type": "ContainersReady"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2020-08-10T00:19:20Z",
        "status": "True",
        "type": "PodScheduled"
      }
    ],
    "containerStatuses": [
      {
        "containerID": "docker://7b9e2ee59fa4751476a4a7c5176bce1be82e9a02d1608739a3815e6b5501a605",
        "image": "registry.redhat.io/openshift3/ose-node:v3.11.216",
        "imageID": "docker-pullable://registry.redhat.io/openshift3/ose-node@sha256:3a64f6b31e31695ac58ec7533c70b96b3a2f9858dbb288b3d55c2ac31384c00a",
        "lastState": {},
        "name": "sdn",
        "ready": true,
        "restartCount": 28,
        "state": {
          "running": {
            "startedAt": "2021-07-25T05:10:49Z"
...
~~~
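
Note that restartCount above is 28 but lastState is empty, so nothing was captured from the previous run. A quick way to survey restart counts across the SDN pods (a sketch, assuming the default openshift-sdn namespace):

~~~
# the RESTARTS column shows how often kubelet has restarted each pod's containers
oc get pods -n openshift-sdn -o wide | grep -E 'sdn|ovs'
~~~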

Comment 6 Dan Winship 2021-08-17 13:28:00 UTC
The sdn logs show that the sdn and ovs pods are periodically being intentionally killed by kubelet. If this doesn't correspond to updates/maintenance/etc., then the most likely explanation is that they are failing their liveness probes, and so kubelet restarts them.

The sosreport shows that this did happen for the ovs pod (but not the sdn pod) on the sosreport-ed node. There's no previous log for that ovs pod in the sdn/ovs logs, but looking at the ovs ".previous.log" files that do exist for other nodes, all of them were killed (presumably by kubelet) rather than crashing, and none of them show any obvious problems immediately before being killed.
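
One possible way to confirm probe-driven kills, sketched under the assumption that the relevant events are still retained (event retention is short, and the pod name below is a placeholder):

~~~
# "Liveness probe failed" / "Killing" events from kubelet point at probe-driven restarts
oc describe pod -n openshift-sdn ovs-xxxxx | tail -n 20

# all warning events in the namespace
oc get events -n openshift-sdn --field-selector type=Warning
~~~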

Comment 7 Pamela Escorza 2021-08-17 15:53:29 UTC
Hi @danw: could you please suggest any kind of debug that we can perform to get further information?

Comment 8 Dan Winship 2021-08-24 14:15:16 UTC
As an administrator, they can run

    oc patch -n openshift-sdn ds ovs --type=json \
      -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'

This should cause OVS to be redeployed without its liveness probe, which should ensure that kubelet does not end up mistakenly killing it when the machine gets overloaded.
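
To verify the change took effect, something like the following should work (a sketch; the rollout check assumes the daemonset uses a RollingUpdate strategy):

~~~
# no output means the livenessProbe was removed from the daemonset spec
oc get ds -n openshift-sdn ovs -o yaml | grep livenessProbe

# wait for the updated ovs pods to roll out
oc rollout status ds/ovs -n openshift-sdn
~~~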

Note that if they do any OCP updates or perform other operations with the OCP ansible scripts, this change might get reverted and they will have to do it again.

This change will be included in an upcoming 3.11.z release. (Not the very next one, which I think will be coming out this week, but the one after that.)

Comment 13 zhaozhanqi 2021-09-10 10:25:06 UTC
Upgraded cluster from 3.11.501 to 3.11.521 using openshift-ansible-playbooks-3.11.521-1.git.0.8ca76fd.el7.noarch.

This looks good; the livenessProbe is no longer present in the ovs daemonset:

[root@qe-smoke311-master-etcd-1 ~]# oc get ds -n openshift-sdn ovs -o yaml  | grep livenessPro
[root@qe-smoke311-master-etcd-1 ~]#

Comment 15 errata-xmlrpc 2021-09-15 19:21:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 3.11.521 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3424

