Bug 1893362 - The ovs-xxxxx_openshift-sdn container does not terminate gracefully, slowing down reboots [NEEDINFO]
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.7.0
Assignee: Colin Walters
QA Contact: zhaozhanqi
Depends On:
Reported: 2020-10-30 20:52 UTC by Yu Qi Zhang
Modified: 2021-04-05 17:36 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2021-02-24 15:29:18 UTC
Target Upstream Version:
lmohanty: needinfo? (bbennett)

Attachments
(reversed) shutdown logs (7.64 KB, text/plain)
2020-10-30 20:52 UTC, Yu Qi Zhang

Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-network-operator pull 859 | 0 | None | closed | Bug 1893362: Ensure tail processes exit with parent | 2021-02-09 13:43:12 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:29:47 UTC

Description Yu Qi Zhang 2020-10-30 20:52:08 UTC
Created attachment 1725424
(reversed) shutdown logs

Description of problem:
While debugging MCO reboot slowdowns in 4.7, we realized that there were 2 containers (ovs-xxxxx_openshift-sdn and dns) that did not terminate gracefully, adding ~1 min to the reboot process. Also see https://bugzilla.redhat.com/show_bug.cgi?id=1893360 for the DNS side.

For ovs specifically, we see:

Oct 30 15:10:38 jerzhang-201029-01-nq26v-worker-a-ptkvd systemd[1]: crio-bb121d29a9b04c15724c0af67be2ab0f3990a17f6748d9b147d743cc791162ed.scope changed stop-sigterm -> stop-sigkill
Oct 30 15:10:38 jerzhang-201029-01-nq26v-worker-a-ptkvd systemd[1]: crio-bb121d29a9b04c15724c0af67be2ab0f3990a17f6748d9b147d743cc791162ed.scope: Killing process 2306 (tail) with signal SIGKILL.
Oct 30 15:10:38 jerzhang-201029-01-nq26v-worker-a-ptkvd systemd[1]: crio-bb121d29a9b04c15724c0af67be2ab0f3990a17f6748d9b147d743cc791162ed.scope: Stopping timed out. Killing.

Which I guess is coming from https://github.com/openshift/cluster-network-operator/blob/dfea1dc53ed82f5e4a4e85b7e8e863af3e4bd54a/bindata/network/openshift-sdn/sdn-ovs.yaml#L144 ?

I've attached (reversed) reboot logs that reference this container. The shutdown appears to take ~20 extra seconds while attempting to SIGTERM this process. So far this has reproduced 100% of the time in my tests on 4.7 and 4.6.1, across both AWS and GCP.
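The linked pull request is titled "Ensure tail processes exit with parent". One way to get that behavior, shown here only as a sketch and not confirmed to be what the PR does, is GNU tail's `--pid` option, which makes a following `tail` exit once the given process is gone. The function name `run_with_log_tail` and the stand-in command are hypothetical; the sketch assumes bash and GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch: tie a background `tail -F` to the lifetime of a "main" process
# via GNU tail's --pid option, so tail exits on its own instead of
# lingering until it is SIGKILLed. All names here are illustrative.

# run_with_log_tail LOGFILE CMD [ARGS...]
# Runs CMD in the background, follows LOGFILE until CMD exits,
# then returns CMD's exit status.
run_with_log_tail() {
  local logfile="$1"; shift
  "$@" &                      # stand-in for the long-running daemon
  local main_pid=$!
  # --pid makes tail stop following once main_pid no longer exists.
  tail -F --pid="$main_pid" "$logfile" &
  local tail_pid=$!
  wait "$main_pid"
  local status=$?
  wait "$tail_pid"            # returns shortly after main_pid is reaped
  return "$status"
}
```

For example, `run_with_log_tail /tmp/demo.log sleep 5` follows the file for about five seconds and then returns. In a container entrypoint the PID passed to `--pid` would more likely be the script's own (`$$`); that variant is likewise an assumption.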

Version-Release number of selected component (if applicable):
4.7, 4.6

How reproducible:

Steps to Reproduce:
1. Spin up a 4.6.1 or 4.7 nightly cluster
2. Trigger a reboot, e.g. by adding a machineconfig
3. Observe the shutdown logs

Actual results:
ovs does not terminate gracefully and has to be killed

Expected results:
ovs should terminate gracefully

Additional info:

Comment 1 Kirsten Garrison 2020-10-30 21:14:42 UTC
As a note: these slowdowns may not seem like much, but combined with the DNS bug (https://bugzilla.redhat.com/show_bug.cgi?id=1893360) they result in longer reboots per node. This breaks MCO CI (e2e-gcp-op), which already has a built-in cushion, and will cause visible performance hits for customers rolling out updates to large clusters.

Comment 4 zhaozhanqi 2020-11-10 05:56:27 UTC
Verified this bug on 4.7.0-0.nightly-2020-11-09-235738

1. oc debug node/node1
 chroot /host
2. Check the node logs:
 oc debug node/node1 chroot /host
 journalctl | grep -i "tail) with signal SIGKILL"   ### returns nothing.
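The check above can be wrapped in a small helper that counts matching journal lines instead of eyeballing grep output. This is only a sketch: the function name `count_tail_sigkills` is hypothetical, and the pattern is taken from the systemd messages quoted in the description.

```shell
#!/usr/bin/env bash
# Sketch: count journal lines where systemd had to SIGKILL a `tail` process.
# Reads journal text on stdin and prints the number of matching lines.
count_tail_sigkills() {
  # grep -c prints the match count; it exits non-zero on zero matches,
  # so mask that status to keep the function usable under `set -e`.
  grep -c -i 'Killing process [0-9]* (tail) with signal SIGKILL' || true
}
```

On a node, something like `journalctl -b -1 | count_tail_sigkills` would inspect the previous boot's shutdown, assuming the journal is persistent; a result of 0 corresponds to the "returns nothing" outcome above.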

Comment 7 errata-xmlrpc 2021-02-24 15:29:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


Comment 8 W. Trevor King 2021-04-05 17:36:48 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
