Bug 1893362

Summary: The ovs-xxxxx_openshift-sdn container does not terminate gracefully, slowing down reboots
Product: OpenShift Container Platform Reporter: Yu Qi Zhang <jerzhang>
Component: Networking Assignee: Colin Walters <walters>
Networking sub component: openshift-sdn QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: high CC: aconstan, avishnoi, bbennett, kgarriso, lmohanty, sdodson, skumari, walters, wking
Version: 4.7 Keywords: Upgrades
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-24 15:29:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
(reversed) shutdown logs none

Description Yu Qi Zhang 2020-10-30 20:52:08 UTC
Created attachment 1725424
(reversed) shutdown logs

Description of problem:
While debugging MCO reboot slowdowns in 4.7, we realized that there were 2 containers (ovs-xxxxx_openshift-sdn and dns) that did not terminate gracefully, adding ~1 min to the reboot process. Also see https://bugzilla.redhat.com/show_bug.cgi?id=1893360 for the DNS side.

For ovs specifically, we see:

Oct 30 15:10:38 jerzhang-201029-01-nq26v-worker-a-ptkvd systemd[1]: crio-bb121d29a9b04c15724c0af67be2ab0f3990a17f6748d9b147d743cc791162ed.scope changed stop-sigterm -> stop-sigkill
Oct 30 15:10:38 jerzhang-201029-01-nq26v-worker-a-ptkvd systemd[1]: crio-bb121d29a9b04c15724c0af67be2ab0f3990a17f6748d9b147d743cc791162ed.scope: Killing process 2306 (tail) with signal SIGKILL.
Oct 30 15:10:38 jerzhang-201029-01-nq26v-worker-a-ptkvd systemd[1]: crio-bb121d29a9b04c15724c0af67be2ab0f3990a17f6748d9b147d743cc791162ed.scope: Stopping timed out. Killing.

I suspect this comes from https://github.com/openshift/cluster-network-operator/blob/dfea1dc53ed82f5e4a4e85b7e8e863af3e4bd54a/bindata/network/openshift-sdn/sdn-ovs.yaml#L144, though I have not confirmed it.

I've attached the (reversed) reboot logs that reference this container. The shutdown appears to take ~20 seconds longer while systemd waits for this process to respond to SIGTERM. So far this has reproduced 100% of the time in my tests on 4.7 and 4.6.1, across both AWS and GCP.
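The log lines above show systemd escalating from SIGTERM to SIGKILL for a `tail` process. One common way this pattern arises is a shell entrypoint that blocks on a long-running child and never forwards SIGTERM to it. That is an assumption about the cause, not something this report confirms. A minimal sketch of forwarding the signal follows; `run_with_term_forwarding` is a hypothetical helper and is not taken from sdn-ovs.yaml.

```shell
#!/bin/sh
# Hypothetical sketch; NOT the actual sdn-ovs.yaml entrypoint.
# Starts a long-running command in the background and forwards SIGTERM
# to it, so the shell can exit promptly instead of waiting to be killed.
run_with_term_forwarding() {
    "$@" &
    child=$!
    # On SIGTERM, pass the signal on to the child process.
    trap 'kill -TERM "$child" 2>/dev/null' TERM
    status=0
    # wait returns early (non-zero) when a trapped signal arrives.
    wait "$child" || status=$?
    # Reap the child in case the first wait was interrupted by the signal.
    wait "$child" 2>/dev/null || true
    trap - TERM
    return "$status"
}
```

In a container entrypoint this would be used as, for example, `run_with_term_forwarding tail -F /path/to/some.log`, where the log path is a placeholder. Running the child in the background and blocking in `wait` matters because a shell only runs its trap handler once the foreground command returns.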

Version-Release number of selected component (if applicable):
4.7, 4.6

How reproducible:
100%

Steps to Reproduce:
1. spin up 4.6.1 or a 4.7 nightly
2. trigger a reboot by adding e.g. a machineconfig
3. observe shutdown logs

Actual results:
ovs does not terminate gracefully and has to be killed

Expected results:
ovs should terminate gracefully

Additional info:

Comment 1 Kirsten Garrison 2020-10-30 21:14:42 UTC
As a note: these slowdowns may not seem like a lot, but combined with the DNS bug (https://bugzilla.redhat.com/show_bug.cgi?id=1893360) they result in longer reboots per node. This breaks MCO CI (e2e-gcp-op), which already has a built-in cushion, and will cause visible performance hits for customers rolling out updates to large clusters.

Comment 4 zhaozhanqi 2020-11-10 05:56:27 UTC
Verified this bug on 4.7.0-0.nightly-2020-11-09-235738

1. Reboot the node:
 oc debug node/node1
 chroot /host
 reboot
2. Check the node logs:
 oc debug node/node1
 chroot /host
 journalctl | grep -i "tail) with signal SIGKILL"   ### returns nothing
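The check in step 2 can be wrapped in a small helper that reads journal text on stdin and fails if the SIGKILL line is present. This is only a sketch: the function name `no_tail_sigkill` is hypothetical, and the pattern is the one used in the grep above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if no line on stdin reports the
# `tail` process being killed with SIGKILL.
no_tail_sigkill() {
    # Matching lines are echoed to stderr for inspection.
    if grep -i -- 'tail) with signal SIGKILL' >&2; then
        return 1
    fi
    return 0
}
```

On the node, after `chroot /host`, it would be run as `journalctl | no_tail_sigkill && echo "graceful"`.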

Comment 7 errata-xmlrpc 2021-02-24 15:29:18 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 8 W. Trevor King 2021-04-05 17:36:48 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

Comment 9 Red Hat Bugzilla 2023-09-15 00:50:35 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.