1893362 – The ovs-xxxxx_openshift-sdn container does not terminate gracefully, slowing down reboots

Bug 1893362 - The ovs-xxxxx_openshift-sdn container does not terminate gracefully, slowing down reboots

Summary: The ovs-xxxxx_openshift-sdn container does not terminate gracefully, slowing ...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Colin Walters
QA Contact:	zhaozhanqi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2020-10-30 20:52 UTC by Yu Qi Zhang
Modified:	2023-09-15 00:50 UTC (History)
CC List:	9 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-02-24 15:29:18 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
(reversed) shutdown logs (7.64 KB, text/plain) 2020-10-30 20:52 UTC, Yu Qi Zhang	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-network-operator pull 859	0	None	closed	Bug 1893362: Ensure tail processes exit with parent	2021-02-09 13:43:12 UTC
Red Hat Product Errata	RHSA-2020:5633	0	None	None	None	2021-02-24 15:29:47 UTC

Description Yu Qi Zhang 2020-10-30 20:52:08 UTC

Created attachment 1725424 [details]
(reversed) shutdown logs

Description of problem:
While debugging MCO reboot slowdowns in 4.7, we realized that there were 2 containers (ovs-xxxxx_openshift-sdn and dns) that did not terminate gracefully, adding ~1 min to the reboot process. Also see https://bugzilla.redhat.com/show_bug.cgi?id=1893360 for the DNS side.

For ovs specifically, we see:

Oct 30 15:10:38 jerzhang-201029-01-nq26v-worker-a-ptkvd systemd[1]: crio-bb121d29a9b04c15724c0af67be2ab0f3990a17f6748d9b147d743cc791162ed.scope changed stop-sigterm -> stop-sigkill
Oct 30 15:10:38 jerzhang-201029-01-nq26v-worker-a-ptkvd systemd[1]: crio-bb121d29a9b04c15724c0af67be2ab0f3990a17f6748d9b147d743cc791162ed.scope: Killing process 2306 (tail) with signal SIGKILL.
Oct 30 15:10:38 jerzhang-201029-01-nq26v-worker-a-ptkvd systemd[1]: crio-bb121d29a9b04c15724c0af67be2ab0f3990a17f6748d9b147d743cc791162ed.scope: Stopping timed out. Killing.

Which I guess is coming from https://github.com/openshift/cluster-network-operator/blob/dfea1dc53ed82f5e4a4e85b7e8e863af3e4bd54a/bindata/network/openshift-sdn/sdn-ovs.yaml#L144 ?

I've attached a (reversed) reboot logs that reference this container. The shutdown appears to add ~20 seconds while attempting to sigterm this process. So far I've tested this to reproduce 100% in 4.7 and 4.6.1, across both AWS and GCP.

Version-Release number of selected component (if applicable):
4.7, 4.6

How reproducible:
100%

Steps to Reproduce:
1. spin up 4.6.1 or a 4.7 nightly
2. trigger a reboot by adding e.g. a machineconfig
3. observe shutdown logs

Actual results:
ovs does not terminate gracefully and has to be killed

Expected results:
ovs should terminate gracefully

Additional info:

Comment 1 Kirsten Garrison 2020-10-30 21:14:42 UTC

As a note: these slowdowns may not seem like a lot but when combined with the dns bz (https://bugzilla.redhat.com/show_bug.cgi?id=1893360) they result in longer reboots per node. This means that they break MCO CI(e2e-gcp-op) which already has a built-in cushion and will cause visible perf hits for customers rolling out updates to large clusters.

Comment 4 zhaozhanqi 2020-11-10 05:56:27 UTC

Verified this bug on 4.7.0-0.nightly-2020-11-09-235738

1. oc debug node/node1 
 chroot /host
 reboot
2.check the node logs
 oc debug node/nod1 chroot /host
journalctl | grep -i "tail) with signal SIGKILL"   ### return nothing.

Comment 7 errata-xmlrpc 2021-02-24 15:29:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 8 W. Trevor King 2021-04-05 17:36:48 UTC

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel like this bug still needs to be a suspect, please add keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

Comment 9 Red Hat Bugzilla 2023-09-15 00:50:35 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

Note You need to log in before you can comment on or make changes to this bug.