It appears that https://github.com/openshift/cluster-dns-operator/pull/122 broke 4.2 to 4.3 upgrades last week:

Dec 01 00:41:10.988 W clusteroperator/network changed Progressing to False
Dec 01 00:41:10.988 I clusteroperator/network versions: operator 4.2.9 -> 4.3.0-0.ci-2019-11-30-234318
Dec 01 00:41:11.880 I ns/openshift-dns-operator deployment/dns-operator Scaled up replica set dns-operator-5ff9db6dc5 to 1
Dec 01 00:41:11.896 I ns/openshift-dns-operator pod/dns-operator-5ff9db6dc5-57m95 node/ created
Dec 01 00:41:11.911 I ns/openshift-dns-operator replicaset/dns-operator-5ff9db6dc5 Created pod: dns-operator-5ff9db6dc5-57m95
Dec 01 00:41:11.921 W ns/openshift-marketplace pod/redhat-operators-6567d7b4c8-nr2nn network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network (13 times)
Dec 01 00:41:11.937 I ns/openshift-dns-operator pod/dns-operator-5ff9db6dc5-57m95 Successfully assigned openshift-dns-operator/dns-operator-5ff9db6dc5-57m95 to ip-10-0-136-246.ec2.internal
Dec 01 00:41:12.124 W ns/openshift-dns-operator pod/dns-operator-5ff9db6dc5-57m95 MountVolume.SetUp failed for volume "metrics-tls" : secret "metrics-tls" not found
Dec 01 00:41:15.922 I node/ip-10-0-135-144.ec2.internal Node ip-10-0-135-144.ec2.internal status is now: NodeReady (4 times)

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11893

Last passed Nov 22nd.
The only problem I see here is that two CoreDNS pods were scheduled to nodes with a Ready=Unknown condition, causing DNS to report Degraded:

ip-10-0-136-246.ec2.internal  dns-default-p8zxt
ip-10-0-142-83.ec2.internal   dns-default-42x8h

https://github.com/openshift/cluster-dns-operator/pull/140 was supposed to fix the scheduling issue, but the fix was incomplete: the operator never actually rolled out the new tolerations to the daemonset. Miciah has addressed that in https://github.com/openshift/cluster-dns-operator/pull/144, which I believe is the fix.
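For context, here is a minimal Go sketch of the two pieces at play: tolerations that let DNS pods land on tainted not-ready nodes, and a desired-vs-current comparison that has to cover tolerations for a toleration-only change to ever roll out. This is illustrative only, not the operator's actual code; the taint keys and the shape of the comparison are my assumptions, not the contents of #140 or #144.

// Sketch only: NOT cluster-dns-operator code. Taint keys and the
// comparison shape below are illustrative assumptions.
package main

import (
	"fmt"
	"reflect"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// desiredTolerations lets DNS pods schedule onto nodes tainted as
// not-ready or unreachable, so CoreDNS can cover every node.
func desiredTolerations() []corev1.Toleration {
	return []corev1.Toleration{
		{Key: "node.kubernetes.io/not-ready", Operator: corev1.TolerationOpExists, Effect: corev1.TaintEffectNoExecute},
		{Key: "node.kubernetes.io/unreachable", Operator: corev1.TolerationOpExists, Effect: corev1.TaintEffectNoExecute},
	}
}

// needsUpdate shows how a toleration-only change can silently fail to roll
// out: if the operator compares only the image (first clause), a daemonset
// that differs only in tolerations is never updated. The second clause is
// the kind of comparison the rollout fix would need to add.
func needsUpdate(current, desired *appsv1.DaemonSet) bool {
	return current.Spec.Template.Spec.Containers[0].Image != desired.Spec.Template.Spec.Containers[0].Image ||
		!reflect.DeepEqual(current.Spec.Template.Spec.Tolerations, desired.Spec.Template.Spec.Tolerations)
}

func main() {
	for _, t := range desiredTolerations() {
		fmt.Printf("tolerate %s (%s)\n", t.Key, t.Effect)
	}
}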
[1] shows NotAllDNSesAvailable clearing up in the wake of #144. We're still failing upgrades on NodeControllerDegraded, but that's bug 1778904.

[1]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?name=%5erelease-.*upgrade&search=NotAllDNSesAvailable:%20Not%20all%20desired%20DNS%20DaemonSets%20available&search=NodeControllerDegradedMasterNodesReady:%20NodeControllerDegraded:%20The%20master%20node.*not%20ready
Didn't see the issue in recent 4.4 upgrade testing; moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581