Created attachment 1725420 [details]
journal snippets (journal debug output - reversed) around the reboot slowdowns

The MCO team discovered this when we found that our e2e-gcp-op job hits timeouts in 4.7. We noticed that nodes were taking longer to reboot (30s-1min), which, accumulated across multiple reboots, makes the 4.7 MCO tests slower than 4.6 (70+ mins vs 60 mins) and triggers timeouts. Upon further investigation we found this to be 100% reproducible in GCP.

Upon triggering a reboot (after the node has been drained), everything progresses smoothly until cri-o reports issues terminating 2 scopes. See the attached journal logs for the relevant timings:

1. ovs appears to time out on a tail process, adding ~20 seconds
2. dns appears to be waiting on a child process, adding ~30 seconds before the node continues its reboot

In 4.6, we ONLY see the ovs slowdown (more precisely, ovs-xxxxx_openshift-sdn); reproduced with 4.6.1. The DNS slowdown is new.

Environment: GCP, AWS

Version: 4.7 (used a nightly from 2 days ago to reproduce)

Example recent 4.7 run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2189/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1322086555187154944
Example recent 4.6 run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2187/pull-ci-openshift-machine-config-operator-release-4.6-e2e-gcp-op/1321399618835058688

How reproducible: 100%

Steps to Reproduce:
1. Spin up a 4.7 nightly.
2. Trigger a reboot by adding e.g. a machineconfig (a hedged sketch follows below).
3. Observe the shutdown logs.
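For steps 2 and 3, here is a minimal sketch of one way to do this; the MachineConfig name, file path/contents, ignition version, and <node> placeholder are illustrative assumptions, not values from the original report:

# Apply a trivial MachineConfig to the worker pool to force a rolling reboot.
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-test-reboot
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - path: /etc/test-reboot-marker
        mode: 420
        contents:
          source: data:,test
EOF

# After an affected node has rebooted, read the previous boot's journal in
# reverse to inspect the shutdown sequence (replace <node> accordingly).
oc debug node/<node> -- chroot /host journalctl -b -1 -r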
Created attachment 1725421 [details]
dns logs in crio
As a note: these slowdowns may not seem like a lot on their own, but combined with the sdn bz (https://bugzilla.redhat.com/show_bug.cgi?id=1893362) they result in 50% longer reboots per node. This means they break MCO CI (e2e-gcp-op), which already has a built-in cushion, and they will cause visible performance hits for customers rolling out updates to large clusters.
Following git blame here leads to https://github.com/openshift/cluster-dns-operator/issues/65

Relevant commits: https://github.com/openshift/cluster-dns-operator/commit/52362da4821655981eabd64c111c71698e30e3d4

Annoyingly, this problem would mostly go away if Kubernetes supported a proper init system (like podman run --init).
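To illustrate the init/signal-handling point, here is a generic sketch (not the actual cluster-dns-operator script, and the process names are placeholders): a wrapper that forks a child and blocks without a trap never delivers SIGTERM to that child, so the scope only exits once cri-o escalates to SIGKILL after the termination grace period, which would line up with the ~30s wait described above. Forwarding the signal (or exec'ing the child) avoids that:

#!/bin/bash
# Generic sketch of terminating a forked child promptly on SIGTERM.
#
# Problematic pattern, for comparison: forking a child and blocking in `wait`
# with no trap means SIGTERM sent to the container's PID 1 never reaches the
# child, and the scope only exits when cri-o escalates to SIGKILL:
#     long-running-child &
#     wait

child_pid=""

term_handler() {
  # Forward the signal to the child and reap it before exiting.
  [ -n "$child_pid" ] && kill -TERM "$child_pid" 2>/dev/null
  wait "$child_pid"
  exit 0
}
trap term_handler TERM INT

# Stand-in for the real child process (e.g. a tail/monitor loop).
sleep infinity &
child_pid=$!
wait "$child_pid"

Alternatively, exec'ing the child so it becomes the container's PID 1, or running under a minimal init (e.g. tini, or podman run --init as noted above), achieves the same prompt termination.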
Ah, it's probably not the bash code; it's more likely fallout from https://github.com/openshift/cluster-dns-operator/commit/f094ddf7edc95dad8398179482687bc2a7a0c15b
*** This bug has been marked as a duplicate of bug 1884053 ***