Created attachment 1725420 [details]
journal snippets (journal debug output - reversed) around the reboot slowdowns

The MCO team discovered this when we found that our e2e-gcp-op job hits timeouts in 4.7. We noticed that nodes were taking longer to reboot (30s-1min), which, accumulated across multiple reboots, makes the 4.7 MCO tests slower than 4.6 (70+ mins vs 60 mins) and triggers timeouts. Upon further investigation we found this to be 100% reproducible in GCP.

Upon triggering a reboot (after the node has been drained), everything progresses smoothly until cri-o reports issues terminating 2 scopes. See the attached journal logs for the relevant timings:

1. ovs appears to time out on a tail process, adding ~20 seconds
2. dns appears to be waiting on a child process, adding ~30 seconds before the node continues its reboot

In 4.6, we ONLY see the ovs slowdown (more precisely, ovs-xxxxx_openshift-sdn); reproduced with 4.6.1. The DNS slowdown is new.

Environment: GCP, AWS

Version: 4.7 (used a nightly from 2 days ago to reproduce)

Example recent 4.7 run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2189/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1322086555187154944
Example recent 4.6 run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2187/pull-ci-openshift-machine-config-operator-release-4.6-e2e-gcp-op/1321399618835058688

How reproducible: 100%

Steps to Reproduce:
1. Spin up a 4.7 nightly.
2. Trigger a reboot by adding e.g. a machineconfig (a hedged sketch follows below).
3. Observe the shutdown logs.
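For steps 2 and 3, here is a minimal sketch of one way to do this; the MachineConfig name, file path/contents, ignition version, and <node> placeholder are illustrative assumptions, not values from the original report:

# Apply a trivial MachineConfig to the worker pool to force a rolling reboot.
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-test-reboot
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - path: /etc/test-reboot-marker
        mode: 420
        contents:
          source: data:,test
EOF

# After an affected node has rebooted, read the previous boot's journal in
# reverse to inspect the shutdown sequence (replace <node> accordingly).
oc debug node/<node> -- chroot /host journalctl -b -1 -r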
Created attachment 1725421 [details]
dns logs in crio
As a note: these slowdowns may not seem like a lot on their own, but combined with the sdn bz (https://bugzilla.redhat.com/show_bug.cgi?id=1893362) they result in 50% longer reboots per node. This means they break MCO CI (e2e-gcp-op), which already has a built-in cushion, and they will cause visible performance hits for customers rolling out updates to large clusters.
Following git blame here leads to https://github.com/openshift/cluster-dns-operator/issues/65

Relevant commits: https://github.com/openshift/cluster-dns-operator/commit/52362da4821655981eabd64c111c71698e30e3d4

Annoyingly, this problem would mostly go away if Kubernetes supported a proper init system (like podman run --init).
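To illustrate the init/signal-handling point, here is a generic sketch (not the actual cluster-dns-operator script, and the process names are placeholders): a wrapper that forks a child and blocks without a trap never delivers SIGTERM to that child, so the scope only exits once cri-o escalates to SIGKILL after the termination grace period, which would line up with the ~30s wait described above. Forwarding the signal (or exec'ing the child) avoids that:

#!/bin/bash
# Generic sketch of terminating a forked child promptly on SIGTERM.
#
# Problematic pattern, for comparison: forking a child and blocking in `wait`
# with no trap means SIGTERM sent to the container's PID 1 never reaches the
# child, and the scope only exits when cri-o escalates to SIGKILL:
#     long-running-child &
#     wait

child_pid=""

term_handler() {
  # Forward the signal to the child and reap it before exiting.
  [ -n "$child_pid" ] && kill -TERM "$child_pid" 2>/dev/null
  wait "$child_pid"
  exit 0
}
trap term_handler TERM INT

# Stand-in for the real child process (e.g. a tail/monitor loop).
sleep infinity &
child_pid=$!
wait "$child_pid"

Alternatively, exec'ing the child so it becomes the container's PID 1, or running under a minimal init (e.g. tini, or podman run --init as noted above), achieves the same prompt termination.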
Ah, it's probably not the bash code; it's more likely fallout from https://github.com/openshift/cluster-dns-operator/commit/f094ddf7edc95dad8398179482687bc2a7a0c15b
*** This bug has been marked as a duplicate of bug 1884053 ***