Bug 1893360

Summary: 4.7 node reboots are slower due to DNS containers not terminating cleanly
Product: OpenShift Container Platform    Reporter: Yu Qi Zhang <jerzhang>
Component: Networking                    Assignee: Miciah Dashiel Butler Masters <mmasters>
Networking sub component: DNS            QA Contact: Hongan Li <hongli>
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified                    CC: aos-bugs, kgarriso, sdodson, skumari, walters, wking
Version: 4.7                             Keywords: Upgrades
Last Closed: 2020-11-04 21:55:18 UTC Type: Bug
Attachments:
  journal snippets (journal debug output - reversed) around the reboot slowdowns
  dns logs in crio

Description Yu Qi Zhang 2020-10-30 20:39:05 UTC
Created attachment 1725420 [details]
journal snippets (journal debug output - reversed) around the reboot slowdowns

The MCO team discovered this when we found that our e2e-gcp-op job hits timeouts in 4.7. We noticed that nodes were taking 30s-1min longer to reboot, which, accumulated across multiple reboots, makes the 4.7 MCO tests slower than the 4.6 tests (70+ minutes vs. 60 minutes) and triggers timeouts.

Upon further investigation we found that this is 100% reproducible in GCP. Upon triggering a reboot (after the node has been drained), everything progresses smoothly until cri-o reports issues terminating 2 scopes. See the attached journal logs for the relevant timings:

1. ovs appears to time out on a tail process, adding ~20 seconds
2. dns appears to be waiting on a child process, adding ~30 seconds

before the node continues its reboot.

In 4.6, we ONLY see the ovs slowdown (to be more precise, ovs-xxxxx_openshift-sdn); reproduced with 4.6.1. The DNS slowdown is new.
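
For anyone wanting to confirm the timings themselves, a rough way to pull the previous boot's shutdown journal off a node after it comes back up (the node name is a placeholder and the grep pattern is only a starting point, not the exact log format):

# Dump the previous boot's journal, reversed like the attached snippets.
oc debug node/<node-name> -- chroot /host \
    journalctl -b -1 -r --no-pager > node-shutdown-journal.txt

# Look around the cri-o stop/termination messages for the ovs and dns
# scopes that add the extra ~20s and ~30s before the reboot continues.
grep -iE 'crio|stopping|scope' node-shutdown-journal.txt | less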


Environment: GCP, AWS
Version: 4.7 (used a nightly from 2 days ago to reproduce)

example recent 4.7 run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2189/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1322086555187154944

example recent 4.6 run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2187/pull-ci-openshift-machine-config-operator-release-4.6-e2e-gcp-op/1321399618835058688


How reproducible:
100%

Steps to Reproduce:
1. spin up a 4.7 nightly
2. trigger a reboot by adding e.g. a MachineConfig (a sample is sketched after these steps)
3. observe shutdown logs
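
As a sketch of step 2: any small MachineConfig targeting the worker pool will drain and reboot each worker in turn. The name, file path, and contents below are arbitrary placeholders, and the Ignition version may need adjusting for your cluster:

# reboot-test-machineconfig.yaml (placeholder name/path/contents)
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-reboot-test
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - path: /etc/reboot-test
        mode: 420                # decimal for 0644
        contents:
          source: data:,reboot-test

# Apply it and watch the worker pool roll out; each node is drained and
# rebooted in turn, which is where the shutdown slowdown shows up (step 3).
oc apply -f reboot-test-machineconfig.yaml
oc get machineconfigpool worker -w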

Comment 1 Yu Qi Zhang 2020-10-30 20:39:44 UTC
Created attachment 1725421 [details]
dns logs in crio

Comment 2 Kirsten Garrison 2020-10-30 21:13:01 UTC
As a note: these slowdowns may not seem like a lot, but when combined with the sdn bz (https://bugzilla.redhat.com/show_bug.cgi?id=1893362) they result in 50% longer reboots per node. This means they break MCO CI (e2e-gcp-op), which already has a built-in cushion, and will cause visible performance hits for customers rolling out updates to large clusters.

Comment 3 Colin Walters 2020-11-02 18:35:15 UTC
Following git blame here leads to

https://github.com/openshift/cluster-dns-operator/issues/65

Relevant commits:
https://github.com/openshift/cluster-dns-operator/commit/52362da4821655981eabd64c111c71698e30e3d4

Annoyingly, this problem would mostly go away if Kubernetes supported a proper init system (like podman run --init).
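
For context, a rough sketch of the contrast being drawn here (the image name and daemon path are placeholders, not the actual DNS image or entrypoint):

# podman can inject a minimal init (catatonit) as PID 1; it forwards
# signals to the entrypoint and reaps orphaned children, so a SIGTERM at
# node shutdown is not left waiting on an un-reaped child process.
podman run --init registry.example.com/some-image:tag

# Kubernetes/cri-o doesn't expose an equivalent per-pod knob, so a
# shell-wrapper entrypoint has to cope on its own, e.g. by exec'ing the
# real daemon so it becomes PID 1 and receives SIGTERM directly:
#   exec /path/to/daemon --flags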

Comment 4 Colin Walters 2020-11-02 18:43:48 UTC
Ah, it's probably not the bash code; it's likely fallout from
https://github.com/openshift/cluster-dns-operator/commit/f094ddf7edc95dad8398179482687bc2a7a0c15b

Comment 5 Colin Walters 2020-11-04 21:55:18 UTC

*** This bug has been marked as a duplicate of bug 1884053 ***