Bug 1893360 - 4.7 node reboots are slower due to DNS containers not terminating cleanly
Summary: 4.7 node reboots are slower due to DNS containers not terminating cleanly
Keywords:
Status: CLOSED DUPLICATE of bug 1884053
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Miciah Dashiel Butler Masters
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-30 20:39 UTC by Yu Qi Zhang
Modified: 2022-08-04 22:39 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-04 21:55:18 UTC
Target Upstream Version:
Embargoed:


Attachments
journal snippets (journal debug output - reversed) around the reboot slowdowns (13.88 KB, text/plain)
2020-10-30 20:39 UTC, Yu Qi Zhang
dns logs in crio (7.90 KB, text/plain)
2020-10-30 20:39 UTC, Yu Qi Zhang

Description Yu Qi Zhang 2020-10-30 20:39:05 UTC
Created attachment 1725420 [details]
journal snippets (journal debug output - reversed) around the reboot slowdowns

The MCO team discovered this when our e2e-gcp-op job started hitting timeouts in 4.7. We noticed that nodes were taking longer to reboot (30s-1min), which, accumulated across multiple reboots, made the 4.7 MCO tests slower than the 4.6 tests (70+ minutes vs. 60 minutes) and triggered timeouts.

Upon further investigation we noticed that this was 100% reproducible in GCP. Upon triggering a reboot (after the node has been drained), everything progresses smoothly until CRI-O reports issues terminating two scopes. See the attached journal logs for the relevant timings:

1. ovs seems to time out on a tail process, adding ~20 seconds
2. dns seems to be waiting on a child process, adding ~30 seconds

before the node continues its reboot.

In 4.6, we ONLY have the ovs slowdown (to be more precise, ovs-xxxxx_openshift-sdn). Reproduced with 4.6.1. The DNS slowdown is new.
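
For reference, one way to pull the same shutdown timing view from a node after it comes back up is through a debug pod (the node name below is a placeholder):

# read the previous boot's journal, newest entries first, to see which
# scopes crio/systemd waited on during shutdown
oc debug node/<node-name> -- chroot /host journalctl --boot=-1 --reverse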


Environment: GCP, AWS
Version: 4.7 (used a nightly from 2 days ago to reproduce)

example recent 4.7 run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2189/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1322086555187154944

example recent 4.6 run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2187/pull-ci-openshift-machine-config-operator-release-4.6-e2e-gcp-op/1321399618835058688


How reproducible:
100%

Steps to Reproduce:
1. spin up a 4.7 nightly
2. trigger a reboot, e.g. by applying a MachineConfig (see the example after these steps)
3. observe shutdown logs
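
For step 2, a throwaway MachineConfig along these lines (the name and file path are arbitrary, chosen only to force a reboot) rolls a reboot through the worker pool once created with oc create -f:

# writes a trivial file so the MCO renders a new config and reboots each worker
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-test-reboot
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - path: /etc/test-reboot
        mode: 0644
        contents:
          source: data:,test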

Comment 1 Yu Qi Zhang 2020-10-30 20:39:44 UTC
Created attachment 1725421 [details]
dns logs in crio

Comment 2 Kirsten Garrison 2020-10-30 21:13:01 UTC
As a note: these slowdowns may not seem like much, but combined with the SDN BZ (https://bugzilla.redhat.com/show_bug.cgi?id=1893362) they result in 50% longer reboots per node. This breaks MCO CI (e2e-gcp-op), which already has a built-in cushion, and will cause visible performance hits for customers rolling out updates to large clusters.

Comment 3 Colin Walters 2020-11-02 18:35:15 UTC
Following git blame here leads to

https://github.com/openshift/cluster-dns-operator/issues/65

Relevant commits:
https://github.com/openshift/cluster-dns-operator/commit/52362da4821655981eabd64c111c71698e30e3d4

Annoyingly, this problem would mostly go away if Kubernetes supported a proper init system (like podman run --init).
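
To illustrate the signal-handling point generically (this is not the actual cluster-dns-operator wrapper, just the pattern): a shell wrapper that neither exec's its child nor traps SIGTERM leaves the child running until the runtime gives up and SIGKILLs the scope, whereas forwarding the signal lets the container stop promptly.

#!/bin/bash
# Hypothetical wrapper; "some-child-process" stands in for the real binary.
some-child-process &
child=$!

# Forward TERM to the child so the container stops without waiting for
# the runtime's kill timeout.
trap 'kill -TERM "$child" 2>/dev/null' TERM INT

# wait returns early when a trapped signal arrives; wait a second time
# so the script does not exit before the child has actually terminated.
wait "$child"
wait "$child"

# Alternatively, `exec some-child-process` lets the child receive signals directly.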

Comment 4 Colin Walters 2020-11-02 18:43:48 UTC
Ah, it's probably not the bash code; it's likely fallout from
https://github.com/openshift/cluster-dns-operator/commit/f094ddf7edc95dad8398179482687bc2a7a0c15b

Comment 5 Colin Walters 2020-11-04 21:55:18 UTC

*** This bug has been marked as a duplicate of bug 1884053 ***

