Description of problem:
A cluster upgrading from 4.8.0-fc.9 to 4.8.0-rc.0 hung while trying to progress the upgrade of the DNS clusteroperator. "oc get co" reported the clusteroperator as Progressing. No obvious errors were observed in the openshift-dns-operator pod log or the cluster-version-operator pod log, and the cluster was firing no alerts to indicate any other potential problems. The cluster remained in this state for over ten hours. In an attempt to progress the upgrade, SRE deleted the existing dns-operator pod. This caused the operator to finally upgrade to 4.8.0-rc.0 and the upgrade to continue.

OpenShift release version: 4.8.0-rc.0

Cluster Platform: OpenShift Dedicated
On the CVO side, this update was waiting for the DNS operator to bump its ClusterOperator versions:

$ grep 'Running sync.*in state\|Result of work' namespaces/openshift-cluster-version/pods/cluster-version-operator-78f84b6f64-9hbfd/cluster-version-operator/cluster-version-operator/logs/current.log | tail -n4
2021-06-17T21:54:10.425504776Z I0617 21:54:10.425452 1 task_graph.go:555] Result of work: [Cluster operator dns is updating versions]
2021-06-17T21:57:24.919369917Z I0617 21:57:24.919361 1 sync_worker.go:541] Running sync 4.8.0-rc.0 (force=false) on generation 36 in state Updating at attempt 33
2021-06-17T22:03:06.834691848Z I0617 22:03:06.834631 1 task_graph.go:555] Result of work: [Cluster operator dns is updating versions]
2021-06-17T22:06:13.936367223Z I0617 22:06:13.936354 1 sync_worker.go:541] Running sync 4.8.0-rc.0 (force=false) on generation 36 in state Updating at attempt 34

Unfortunately, we didn't capture the ClusterOperators to pin that down. DNS operator logs appear extremely limited:

$ cat namespaces/openshift-dns-operator/pods/dns-operator-d8699d55f-mrcwx/dns-operator/dns-operator/logs/current.log
2021-06-17T17:51:37.162054353Z I0617 17:51:37.161847 1 request.go:668] Waited for 1.032916683s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/policy/v1beta1?timeout=32s
2021-06-17T17:51:39.724792717Z time="2021-06-17T17:51:39Z" level=info msg="reconciling request: /default"
2021-06-17T17:51:39.785010526Z time="2021-06-17T17:51:39Z" level=info msg="updated node resolver daemonset: openshift-dns/node-resolver"
2021-06-17T17:51:39.789786347Z time="2021-06-17T17:51:39Z" level=info msg="reconciling request: /default"

Compare an fc.0 -> rc.0 CI run [1], which opens with:

I0615 19:48:57.996217 1 request.go:668] Waited for 1.012002054s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/network.openshift.io/v1?timeout=32s
time="2021-06-15T19:49:00Z" level=info msg="reconciling request: /default"
time="2021-06-15T19:49:00Z" level=info msg="reconciling request: /default"
time="2021-06-15T19:49:12Z" level=info msg="reconciling request: /default"
time="2021-06-15T19:49:13Z" level=info msg="updated DNS default status: old: v1.DNSStatus...

So... I dunno. Maybe a SIGQUIT stack trace for the DNS operator would have identified a hang (see [2] for steps, although in this case it would have been the DNS operator, not [2]'s CVO pod). Or maybe the DNS maintainers will be able to guess at the hang or suggest alternative theories based on the resources we did gather here.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1404846426944442368
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1873900#c5
Setting blocker+ until we can determine what the cause of the problem is and whether it will block upgrades in general or only in some specific configurations.
(In reply to W. Trevor King from comment #2)
> [...] Or maybe the DNS maintainers will be able to guess at the hang or
> suggest alternative theories based on the resources we did gather here.

I'm pretty sure the problem is that the status controller watches dnses, reconciles by checking (among other things) whether the operand pods have been ready for minReadySeconds, and isn't getting triggered by any events when the pods become ready.
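For illustration, here is a minimal controller-runtime sketch of that trigger gap. The names and wiring are hypothetical, not the operator's actual code; it just shows a status controller whose only event source is the DNS object:

package main

import (
	"context"

	operatorv1 "github.com/openshift/api/operator/v1"
	ctrl "sigs.k8s.io/controller-runtime"
)

// statusReconciler is a stand-in for the operator's status controller.
type statusReconciler struct{}

func (r *statusReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ...check (among other things) whether the operand pods have been
	// ready for minReadySeconds before bumping clusteroperator versions.
	// Returning ctrl.Result{} with no RequeueAfter means nothing re-runs
	// this check unless another watched event arrives.
	return ctrl.Result{}, nil
}

func setupStatusController(mgr ctrl.Manager) error {
	// Only dnses.operator.openshift.io events trigger Reconcile here. A
	// pod becoming ready changes the operand DaemonSet's status, not the
	// DNS object, so a reconcile that ran before minReadySeconds had
	// elapsed is never retried, and the version bump never happens.
	return ctrl.NewControllerManagedBy(mgr).
		For(&operatorv1.DNS{}).
		Complete(&statusReconciler{})
}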
I wasn't able to reproduce the original problem, and we don't have enough information to determine its cause with certainty. However, the proposed change causes the operator to re-check, and possibly update, the clusteroperator status whenever the daemonset is updated, which should reduce the likelihood that the operator fails to update the status when it should.
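As a sketch of the shape of that change (again hypothetical wiring, reusing statusReconciler from the sketch above, and assuming the DNS object owns the operand daemonset via an owner reference, which Owns() relies on):

import (
	operatorv1 "github.com/openshift/api/operator/v1"
	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
)

func setupStatusControllerFixed(mgr ctrl.Manager) error {
	// Also watch the operand DaemonSet: when a pod becomes ready, the
	// DaemonSet status changes, an event fires, and the minReadySeconds
	// check is re-evaluated instead of being skipped until some unrelated
	// DNS event happens along.
	return ctrl.NewControllerManagedBy(mgr).
		For(&operatorv1.DNS{}).
		Owns(&appsv1.DaemonSet{}).
		Complete(&statusReconciler{})
}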
Upgraded from 4.8.0-fc.9 to 4.9.0-0.nightly-2021-06-28-221420 and didn't see the issue.

$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-fc.9   True        True          103m    Working towards 4.9.0-0.nightly-2021-06-28-221420: 568 of 676 done (84% complete)

<......>

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-06-28-221420   True        False         14m     Cluster version is 4.9.0-0.nightly-2021-06-28-221420

$ oc get co/dns
NAME   VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
dns    4.9.0-0.nightly-2021-06-28-221420   True        False         False      3h19m

$ oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-128-113.us-east-2.compute.internal   Ready    worker   3h13m   v1.21.0-rc.0+120883f
ip-10-0-140-105.us-east-2.compute.internal   Ready    master   3h18m   v1.21.0-rc.0+120883f
ip-10-0-178-209.us-east-2.compute.internal   Ready    worker   3h13m   v1.21.0-rc.0+120883f
ip-10-0-183-190.us-east-2.compute.internal   Ready    master   3h19m   v1.21.0-rc.0+120883f
ip-10-0-194-69.us-east-2.compute.internal    Ready    worker   3h13m   v1.21.0-rc.0+120883f
ip-10-0-218-106.us-east-2.compute.internal   Ready    master   3h18m   v1.21.0-rc.0+120883f
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759