Bug 1893362

Summary:

The ovs-xxxxx_openshift-sdn container does not terminate gracefully, slowing down reboots

Product:

OpenShift Container Platform

Reporter:

Yu Qi Zhang <jerzhang>

Component:

Networking

Assignee:

Colin Walters <walters>

Networking sub component:

openshift-sdn

QA Contact:

zhaozhanqi <zzhao>

Status:

CLOSED ERRATA

Severity:

medium

Priority:

high

CC:

aconstan, avishnoi, bbennett, kgarriso, lmohanty, sdodson, skumari, walters, wking

Version:

4.7

Keywords:

Upgrades

Target Milestone:

---

Target Release:

4.7.0

Hardware:

Unspecified

OS:

Unspecified

Last Closed:

2021-02-24 15:29:18 UTC

Type:

Bug

Attachments:

Description	Flags
(reversed) shutdown logs	none

Description Yu Qi Zhang 2020-10-30 20:52:08 UTC

Created attachment 1725424
(reversed) shutdown logs

Description of problem:
While debugging MCO reboot slowdowns in 4.7, we realized that two containers (ovs-xxxxx_openshift-sdn and dns) did not terminate gracefully, adding ~1 min to the reboot process. Also see https://bugzilla.redhat.com/show_bug.cgi?id=1893360 for the DNS side.

For ovs specifically, we see:

Oct 30 15:10:38 jerzhang-201029-01-nq26v-worker-a-ptkvd systemd[1]: crio-bb121d29a9b04c15724c0af67be2ab0f3990a17f6748d9b147d743cc791162ed.scope changed stop-sigterm -> stop-sigkill
Oct 30 15:10:38 jerzhang-201029-01-nq26v-worker-a-ptkvd systemd[1]: crio-bb121d29a9b04c15724c0af67be2ab0f3990a17f6748d9b147d743cc791162ed.scope: Killing process 2306 (tail) with signal SIGKILL.
Oct 30 15:10:38 jerzhang-201029-01-nq26v-worker-a-ptkvd systemd[1]: crio-bb121d29a9b04c15724c0af67be2ab0f3990a17f6748d9b147d743cc791162ed.scope: Stopping timed out. Killing.

Which I guess is coming from https://github.com/openshift/cluster-network-operator/blob/dfea1dc53ed82f5e4a4e85b7e8e863af3e4bd54a/bindata/network/openshift-sdn/sdn-ovs.yaml#L144 ?

I've attached (reversed) reboot logs that reference this container. The shutdown appears to add ~20 seconds while waiting for this process to respond to SIGTERM. So far this has reproduced 100% of the time in 4.7 and 4.6.1, on both AWS and GCP.
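
For anyone who wants to pull the same pattern out of their own shutdown logs, here is a rough sketch that extracts the scope, PID, and command name from lines like the ones quoted above. The function name `list_sigkilled` is made up for illustration, and the pattern only assumes the systemd message format shown in this report.

```shell
#!/bin/sh
# Hypothetical helper: list processes that systemd had to SIGKILL while
# stopping CRI-O scopes. Reads journal text on stdin and prints
# "<scope> <pid> <command>" for each matching line.
list_sigkilled() {
    # Matches lines of the form:
    #   ... systemd[1]: crio-<id>.scope: Killing process 2306 (tail) with signal SIGKILL.
    sed -n 's/.*systemd\[1\]: \(crio-[0-9a-f]*\.scope\): Killing process \([0-9]*\) (\([^)]*\)) with signal SIGKILL\..*/\1 \2 \3/p'
}

# Example (assumes the journal is readable on the node):
# journalctl | list_sigkilled
```

Lines that do not match the pattern (for example the "Stopping timed out. Killing." message) are dropped, so the output contains only the processes that were actually killed.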

Version-Release number of selected component (if applicable):
4.7, 4.6

How reproducible:
100%

Steps to Reproduce:
1. spin up 4.6.1 or a 4.7 nightly
2. trigger a reboot by adding e.g. a machineconfig
3. observe shutdown logs

Actual results:
ovs does not terminate gracefully and has to be killed

Expected results:
ovs should terminate gracefully
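
A minimal sketch of what graceful termination could look like, assuming the container's entrypoint is a shell script that runs `tail` in the background. That is a guess based on the `(tail)` process in the logs above, not something confirmed from sdn-ovs.yaml; `run_log_tailer` and the example path are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: follow a log file, but exit promptly on SIGTERM
# instead of lingering until systemd escalates to SIGKILL.
run_log_tailer() {
    log_file="$1"
    tail -F "$log_file" &
    tail_pid=$!
    # On SIGTERM, stop the background tail and exit with success.
    trap 'kill -TERM "$tail_pid" 2>/dev/null; exit 0' TERM
    # The wait builtin is interrupted by trapped signals, so the handler
    # runs as soon as SIGTERM arrives rather than after tail exits.
    wait "$tail_pid"
}

# Example invocation (the path is an assumption):
# run_log_tailer /var/log/openvswitch/ovs-vswitchd.log
```

The point of the sketch is only that the script installs a SIGTERM handler and blocks in `wait` rather than in a foreground command, so the stop request is honored well within the timeout.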

Additional info:

Comment 1 Kirsten Garrison 2020-10-30 21:14:42 UTC

As a note: these slowdowns may not seem like a lot, but when combined with the DNS bz (https://bugzilla.redhat.com/show_bug.cgi?id=1893360) they result in longer reboots per node. This means that they break MCO CI (e2e-gcp-op), which already has a built-in cushion, and will cause visible perf hits for customers rolling out updates to large clusters.

Comment 4 zhaozhanqi 2020-11-10 05:56:27 UTC

Verified this bug on 4.7.0-0.nightly-2020-11-09-235738

1. oc debug node/node1
 chroot /host
 reboot
2. Check the node logs:
 oc debug node/node1
 chroot /host
 journalctl | grep -i "tail) with signal SIGKILL"   ### returns nothing
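
The same check could be scripted as a pass/fail test along these lines. This is only a sketch: `no_tail_sigkill` is a hypothetical helper name, and it reuses the grep pattern from the steps above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only when the journal text on stdin contains
# no record of a "tail" process being killed with SIGKILL.
no_tail_sigkill() {
    if grep -qi 'tail) with signal SIGKILL'; then
        echo "FAIL: tail was killed with SIGKILL" >&2
        return 1
    fi
    echo "PASS: no SIGKILL of tail found"
}

# Example (node name as used in the steps above):
# oc debug node/node1 -- chroot /host journalctl | no_tail_sigkill
```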

Comment 7 errata-xmlrpc 2021-02-24 15:29:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 8 W. Trevor King 2021-04-05 17:36:48 UTC

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

Comment 9 Red Hat Bugzilla 2023-09-15 00:50:35 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days