Bug 1893362 - The ovs-xxxxx_openshift-sdn container does not terminate gracefully, slowing down reboots
Summary: The ovs-xxxxx_openshift-sdn container does not terminate gracefully, slowing ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Colin Walters
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-30 20:52 UTC by Yu Qi Zhang
Modified: 2023-09-15 00:50 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:29:18 UTC
Target Upstream Version:
Embargoed:


Attachments
(reversed) shutdown logs (7.64 KB, text/plain)
2020-10-30 20:52 UTC, Yu Qi Zhang


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 859 0 None closed Bug 1893362: Ensure tail processes exit with parent 2021-02-09 13:43:12 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:29:47 UTC

Description Yu Qi Zhang 2020-10-30 20:52:08 UTC
Created attachment 1725424 [details]
(reversed) shutdown logs

Description of problem:
While debugging MCO reboot slowdowns in 4.7, we realized that there were 2 containers (ovs-xxxxx_openshift-sdn and dns) that did not terminate gracefully, adding ~1 min to the reboot process. Also see https://bugzilla.redhat.com/show_bug.cgi?id=1893360 for the DNS side.

For ovs specifically, we see:

Oct 30 15:10:38 jerzhang-201029-01-nq26v-worker-a-ptkvd systemd[1]: crio-bb121d29a9b04c15724c0af67be2ab0f3990a17f6748d9b147d743cc791162ed.scope changed stop-sigterm -> stop-sigkill
Oct 30 15:10:38 jerzhang-201029-01-nq26v-worker-a-ptkvd systemd[1]: crio-bb121d29a9b04c15724c0af67be2ab0f3990a17f6748d9b147d743cc791162ed.scope: Killing process 2306 (tail) with signal SIGKILL.
Oct 30 15:10:38 jerzhang-201029-01-nq26v-worker-a-ptkvd systemd[1]: crio-bb121d29a9b04c15724c0af67be2ab0f3990a17f6748d9b147d743cc791162ed.scope: Stopping timed out. Killing.

Which I guess is coming from https://github.com/openshift/cluster-network-operator/blob/dfea1dc53ed82f5e4a4e85b7e8e863af3e4bd54a/bindata/network/openshift-sdn/sdn-ovs.yaml#L144 ?

I've attached (reversed) reboot logs that reference this container. The shutdown appears to add ~20 seconds while attempting to SIGTERM this process. So far this has reproduced 100% of the time in 4.7 and 4.6.1, across both AWS and GCP.
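
For illustration only, here is a minimal sketch of one way a container script can keep a background tail from outliving its parent shell. It is not the script from sdn-ovs.yaml and not necessarily what the linked pull request does; the function and variable names are hypothetical, and it assumes GNU coreutils tail, whose --pid option makes a following tail exit once the given PID dies.

```shell
#!/usr/bin/env bash
# Sketch only: function and variable names are hypothetical, and this is
# not the script shipped in sdn-ovs.yaml. Assumes GNU coreutils tail.

# Follow a log file in the background without outliving this shell.
start_log_tail() {
    local logfile=$1
    # With -f/-F, GNU tail's --pid option makes it exit once that PID dies.
    tail -F --pid=$$ "$logfile" &
    TAIL_PID=$!
}

# Stop the follower explicitly and reap it.
stop_log_tail() {
    kill "$TAIL_PID" 2>/dev/null || true
    wait "$TAIL_PID" 2>/dev/null || true
}

# Stop the follower on SIGTERM, then exit so shutdown need not time out.
trap 'stop_log_tail; exit 0' TERM
```

A script would call start_log_tail with the log path before starting its main process. The trap covers the case where the shell itself receives SIGTERM, and --pid covers the case where the shell exits for any other reason.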

Version-Release number of selected component (if applicable):
4.7, 4.6

How reproducible:
100%

Steps to Reproduce:
1. spin up 4.6.1 or a 4.7 nightly
2. trigger a reboot by adding e.g. a machineconfig
3. observe shutdown logs

Actual results:
ovs does not terminate gracefully and has to be killed

Expected results:
ovs should terminate gracefully

Additional info:

Comment 1 Kirsten Garrison 2020-10-30 21:14:42 UTC
As a note: these slowdowns may not seem like much, but combined with the DNS BZ (https://bugzilla.redhat.com/show_bug.cgi?id=1893360) they result in longer reboots per node. This means that they break MCO CI (e2e-gcp-op), which already has a built-in cushion, and will cause visible perf hits for customers rolling out updates to large clusters.

Comment 4 zhaozhanqi 2020-11-10 05:56:27 UTC
Verified this bug on 4.7.0-0.nightly-2020-11-09-235738

1. oc debug node/node1 
 chroot /host
 reboot
2. check the node logs
 oc debug node/node1 chroot /host
journalctl | grep -i "tail) with signal SIGKILL"   ### return nothing.
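
As a rough sketch of a slightly more general version of this check, the helper below lists every process name that systemd had to SIGKILL, based on the log line format quoted in the description. The function name is hypothetical; it only reads journal text on stdin.

```shell
# Sketch only: sigkilled_procs is a hypothetical helper name.
# Reads journal text on stdin and prints the unique names of processes
# that systemd reported killing with SIGKILL.
sigkilled_procs() {
    sed -n 's/.*Killing process [0-9]* (\([^)]*\)) with signal SIGKILL.*/\1/p' | sort -u
}
```

For example, `journalctl -b -1 | sigkilled_procs` would inspect the previous boot's shutdown; empty output means nothing matched.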

Comment 7 errata-xmlrpc 2021-02-24 15:29:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 8 W. Trevor King 2021-04-05 17:36:48 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

Comment 9 Red Hat Bugzilla 2023-09-15 00:50:35 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.

