Created attachment 1725424 [details] (reversed) shutdown logs Description of problem: While debugging MCO reboot slowdowns in 4.7, we realized that there were 2 containers (ovs-xxxxx_openshift-sdn and dns) that did not terminate gracefully, adding ~1 min to the reboot process. Also see https://bugzilla.redhat.com/show_bug.cgi?id=1893360 for the DNS side. For ovs specifically, we see: Oct 30 15:10:38 jerzhang-201029-01-nq26v-worker-a-ptkvd systemd[1]: crio-bb121d29a9b04c15724c0af67be2ab0f3990a17f6748d9b147d743cc791162ed.scope changed stop-sigterm -> stop-sigkill Oct 30 15:10:38 jerzhang-201029-01-nq26v-worker-a-ptkvd systemd[1]: crio-bb121d29a9b04c15724c0af67be2ab0f3990a17f6748d9b147d743cc791162ed.scope: Killing process 2306 (tail) with signal SIGKILL. Oct 30 15:10:38 jerzhang-201029-01-nq26v-worker-a-ptkvd systemd[1]: crio-bb121d29a9b04c15724c0af67be2ab0f3990a17f6748d9b147d743cc791162ed.scope: Stopping timed out. Killing. Which I guess is coming from https://github.com/openshift/cluster-network-operator/blob/dfea1dc53ed82f5e4a4e85b7e8e863af3e4bd54a/bindata/network/openshift-sdn/sdn-ovs.yaml#L144 ? I've attached a (reversed) reboot logs that reference this container. The shutdown appears to add ~20 seconds while attempting to sigterm this process. So far I've tested this to reproduce 100% in 4.7 and 4.6.1, across both AWS and GCP. Version-Release number of selected component (if applicable): 4.7, 4.6 How reproducible: 100% Steps to Reproduce: 1. spin up 4.6.1 or a 4.7 nightly 2. trigger a reboot by adding e.g. a machineconfig 3. observe shutdown logs Actual results: ovs does not terminate gracefully and has to be killed Expected results: ovs should terminate gracefully Additional info:
As a note: these slowdowns may not seem like a lot but when combined with the dns bz (https://bugzilla.redhat.com/show_bug.cgi?id=1893360) they result in longer reboots per node. This means that they break MCO CI(e2e-gcp-op) which already has a built-in cushion and will cause visible perf hits for customers rolling out updates to large clusters.
Verified this bug on 4.7.0-0.nightly-2020-11-09-235738 1. oc debug node/node1 chroot /host reboot 2.check the node logs oc debug node/nod1 chroot /host journalctl | grep -i "tail) with signal SIGKILL" ### return nothing.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add keyword again. [1]: https://github.com/openshift/enhancements/pull/475
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days