Description of problem:
A cluster upgrading from 4.8.0-fc.9 to 4.8.0-rc.0 hung while trying to progress the upgrade of the DNS clusteroperator. "oc get co" reported the clusteroperator as Progressing. No obvious errors were observed in the openshift-dns-operator pod log or the cluster-version-operator pod log, and the cluster was firing no alerts to indicate any other potential problems. The cluster remained in this state for over ten hours. In an attempt to progress the upgrade, SRE deleted the existing dns-operator pod. This caused the operator to finally upgrade to 4.8.0-rc.0 and the upgrade to continue.

OpenShift release version: 4.8.0-rc.0

Cluster Platform: OpenShift Dedicated
On the CVO side, this update was waiting for the DNS operator to bump its ClusterOperator versions:

$ grep 'Running sync.*in state\|Result of work' namespaces/openshift-cluster-version/pods/cluster-version-operator-78f84b6f64-9hbfd/cluster-version-operator/cluster-version-operator/logs/current.log | tail -n4
2021-06-17T21:54:10.425504776Z I0617 21:54:10.425452 1 task_graph.go:555] Result of work: [Cluster operator dns is updating versions]
2021-06-17T21:57:24.919369917Z I0617 21:57:24.919361 1 sync_worker.go:541] Running sync 4.8.0-rc.0 (force=false) on generation 36 in state Updating at attempt 33
2021-06-17T22:03:06.834691848Z I0617 22:03:06.834631 1 task_graph.go:555] Result of work: [Cluster operator dns is updating versions]
2021-06-17T22:06:13.936367223Z I0617 22:06:13.936354 1 sync_worker.go:541] Running sync 4.8.0-rc.0 (force=false) on generation 36 in state Updating at attempt 34

Unfortunately, we didn't capture the ClusterOperators to pin that down. DNS operator logs appear extremely limited:

$ cat namespaces/openshift-dns-operator/pods/dns-operator-d8699d55f-mrcwx/dns-operator/dns-operator/logs/current.log
2021-06-17T17:51:37.162054353Z I0617 17:51:37.161847 1 request.go:668] Waited for 1.032916683s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/policy/v1beta1?timeout=32s
2021-06-17T17:51:39.724792717Z time="2021-06-17T17:51:39Z" level=info msg="reconciling request: /default"
2021-06-17T17:51:39.785010526Z time="2021-06-17T17:51:39Z" level=info msg="updated node resolver daemonset: openshift-dns/node-resolver"
2021-06-17T17:51:39.789786347Z time="2021-06-17T17:51:39Z" level=info msg="reconciling request: /default"

Compare an fc.0 -> rc.0 CI run [1], which opens with:

I0615 19:48:57.996217 1 request.go:668] Waited for 1.012002054s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/network.openshift.io/v1?timeout=32s
time="2021-06-15T19:49:00Z" level=info msg="reconciling request: /default"
time="2021-06-15T19:49:00Z" level=info msg="reconciling request: /default"
time="2021-06-15T19:49:12Z" level=info msg="reconciling request: /default"
time="2021-06-15T19:49:13Z" level=info msg="updated DNS default status: old: v1.DNSStatus...

So... I dunno. Maybe a SIGQUIT stack trace for the DNS operator would have identified a hang (see [2] for steps, although in this case it would have been the DNS operator, not [2]'s CVO pod). Or maybe the DNS maintainers will be able to guess at the hang or suggest alternative theories based on the resources we did gather here.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1404846426944442368
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1873900#c5
Setting blocker+ until we can determine what the cause of the problem is and whether it will block upgrades in general or only in some specific configurations.
(In reply to W. Trevor King from comment #2)
> [...] Or maybe the DNS maintainers will be able to guess at the hang or
> suggest alternative theories based on the resources we did gather here.

I'm pretty sure the problem is that the status controller watches dnses, reconciles by checking (among other things) whether the operand pods have been ready for minReadySeconds, and isn't getting triggered by any events when the pods become ready.
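For illustration, here is a minimal controller-runtime sketch of that trigger gap. The names and wiring are hypothetical, not the operator's actual code; it just shows a status controller whose only event source is the DNS object:

package main

import (
	"context"

	operatorv1 "github.com/openshift/api/operator/v1"
	ctrl "sigs.k8s.io/controller-runtime"
)

// statusReconciler is a stand-in for the operator's status controller.
type statusReconciler struct{}

func (r *statusReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ...check (among other things) whether the operand pods have been
	// ready for minReadySeconds before bumping clusteroperator versions.
	// Returning ctrl.Result{} with no RequeueAfter means nothing re-runs
	// this check unless another watched event arrives.
	return ctrl.Result{}, nil
}

func setupStatusController(mgr ctrl.Manager) error {
	// Only dnses.operator.openshift.io events trigger Reconcile here. A
	// pod becoming ready changes the operand DaemonSet's status, not the
	// DNS object, so a reconcile that ran before minReadySeconds had
	// elapsed is never retried, and the version bump never happens.
	return ctrl.NewControllerManagedBy(mgr).
		For(&operatorv1.DNS{}).
		Complete(&statusReconciler{})
}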
I wasn't able to reproduce the original problem, and we don't have enough information to determine its cause with certainty. However, the proposed change causes the operator to re-check, and possibly update, the clusteroperator status whenever the daemonset is updated, which should reduce the likelihood that the operator fails to update the status when it should.
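As a sketch of the shape of that change (again hypothetical wiring, reusing statusReconciler from the sketch above, and assuming the DNS object owns the operand daemonset via an owner reference, which Owns() relies on):

import (
	operatorv1 "github.com/openshift/api/operator/v1"
	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
)

func setupStatusControllerFixed(mgr ctrl.Manager) error {
	// Also watch the operand DaemonSet: when a pod becomes ready, the
	// DaemonSet status changes, an event fires, and the minReadySeconds
	// check is re-evaluated instead of being skipped until some unrelated
	// DNS event happens along.
	return ctrl.NewControllerManagedBy(mgr).
		For(&operatorv1.DNS{}).
		Owns(&appsv1.DaemonSet{}).
		Complete(&statusReconciler{})
}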
Upgraded from 4.8.0-fc.9 to 4.9.0-0.nightly-2021-06-28-221420 and didn't see the issue.

$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-fc.9   True        True          103m    Working towards 4.9.0-0.nightly-2021-06-28-221420: 568 of 676 done (84% complete)

<......>

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-06-28-221420   True        False         14m     Cluster version is 4.9.0-0.nightly-2021-06-28-221420

$ oc get co/dns
NAME   VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
dns    4.9.0-0.nightly-2021-06-28-221420   True        False         False      3h19m

$ oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-128-113.us-east-2.compute.internal   Ready    worker   3h13m   v1.21.0-rc.0+120883f
ip-10-0-140-105.us-east-2.compute.internal   Ready    master   3h18m   v1.21.0-rc.0+120883f
ip-10-0-178-209.us-east-2.compute.internal   Ready    worker   3h13m   v1.21.0-rc.0+120883f
ip-10-0-183-190.us-east-2.compute.internal   Ready    master   3h19m   v1.21.0-rc.0+120883f
ip-10-0-194-69.us-east-2.compute.internal    Ready    worker   3h13m   v1.21.0-rc.0+120883f
ip-10-0-218-106.us-east-2.compute.internal   Ready    master   3h18m   v1.21.0-rc.0+120883f
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759