+++ This bug was initially created as a clone of Bug #1778529 +++

It appears that https://github.com/openshift/cluster-dns-operator/pull/122 broke 4.2 to 4.3 upgrades last week:

Dec 01 00:41:10.988 W clusteroperator/network changed Progressing to False
Dec 01 00:41:10.988 I clusteroperator/network versions: operator 4.2.9 -> 4.3.0-0.ci-2019-11-30-234318
Dec 01 00:41:11.880 I ns/openshift-dns-operator deployment/dns-operator Scaled up replica set dns-operator-5ff9db6dc5 to 1
Dec 01 00:41:11.896 I ns/openshift-dns-operator pod/dns-operator-5ff9db6dc5-57m95 node/ created
Dec 01 00:41:11.911 I ns/openshift-dns-operator replicaset/dns-operator-5ff9db6dc5 Created pod: dns-operator-5ff9db6dc5-57m95
Dec 01 00:41:11.921 W ns/openshift-marketplace pod/redhat-operators-6567d7b4c8-nr2nn network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network (13 times)
Dec 01 00:41:11.937 I ns/openshift-dns-operator pod/dns-operator-5ff9db6dc5-57m95 Successfully assigned openshift-dns-operator/dns-operator-5ff9db6dc5-57m95 to ip-10-0-136-246.ec2.internal
Dec 01 00:41:12.124 W ns/openshift-dns-operator pod/dns-operator-5ff9db6dc5-57m95 MountVolume.SetUp failed for volume "metrics-tls" : secret "metrics-tls" not found
Dec 01 00:41:15.922 I node/ip-10-0-135-144.ec2.internal Node ip-10-0-135-144.ec2.internal status is now: NodeReady (4 times)

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11893

Last passed Nov 22nd.

--- Additional comment from Dan Mace on 2019-12-02 21:52:16 UTC ---

The only problem I see here is that two CoreDNS pods were scheduled to nodes with a Ready=Unknown condition, causing DNS to report degraded:

ip-10-0-136-246.ec2.internal  dns-default-p8zxt
ip-10-0-142-83.ec2.internal   dns-default-42x8h

https://github.com/openshift/cluster-dns-operator/pull/140 was supposed to fix the scheduling issue, but the fix was incomplete because the operator wasn't actually rolling out the new toleration changes. Miciah has fixed that in https://github.com/openshift/cluster-dns-operator/pull/144. I believe https://github.com/openshift/cluster-dns-operator/pull/144 is the fix.
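For context, the scheduling fix works by having the DNS pods tolerate the taints that unready nodes carry, so DNS no longer reports degraded just because a node's Ready condition flaps during an upgrade. A minimal sketch of that kind of toleration is below; it assumes the standard node.kubernetes.io taint keys, and the exact tolerations shipped by cluster-dns-operator may differ (see PR #140/#144 for the real change):

```yaml
# Hypothetical excerpt of a DaemonSet pod spec, for illustration only;
# this is not copied from cluster-dns-operator.
spec:
  template:
    spec:
      tolerations:
      # Allow DNS pods to remain scheduled on nodes that report
      # NotReady or Unreachable while node conditions flap.
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
```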
PR #145 was merged in https://openshift-release.svc.ci.openshift.org/releasestream/4.3.0-0.nightly/release/4.3.0-0.nightly-2019-12-03-211441. When checking https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12062, which upgraded to 4.3.0-0.nightly-2019-12-04-214544, I still found the message: secret "metrics-tls" not found

Dec 04 22:50:01.382 I ns/openshift-dns-operator deployment/dns-operator Scaled up replica set dns-operator-6746ff4575 to 1
Dec 04 22:50:01.391 I ns/openshift-dns-operator pod/dns-operator-6746ff4575-dl5kq node/ created
Dec 04 22:50:01.410 I ns/openshift-dns-operator replicaset/dns-operator-6746ff4575 Created pod: dns-operator-6746ff4575-dl5kq
Dec 04 22:50:01.438 I ns/openshift-dns-operator pod/dns-operator-6746ff4575-dl5kq Successfully assigned openshift-dns-operator/dns-operator-6746ff4575-dl5kq to ip-10-0-140-114.ec2.internal
Dec 04 22:50:01.438 W ns/openshift-monitoring pod/kube-state-metrics-544fbcbfbb-qtmbk network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network (10 times)
Dec 04 22:50:01.649 W ns/openshift-dns-operator pod/dns-operator-6746ff4575-dl5kq MountVolume.SetUp failed for volume "metrics-tls" : secret "metrics-tls" not found
Since we didn't backport the scheduling fix to 4.2, it's expected that the DNS pods will continue to be scheduled to unready nodes prior to the 4.3 upgrade. The root problem is that the nodes aren't ready; DNS isn't causing the problems and would eventually report success once the nodes recover. Maybe we should go ahead and backport the scheduling fix to 4.2.
Opened https://bugzilla.redhat.com/show_bug.cgi?id=1780213 to backport the scheduling fix to 4.2.z.
Let's re-test once the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1780213 is released.
This is still waiting for https://github.com/openshift/cluster-dns-operator/pull/150 to be approved for release.
https://github.com/openshift/cluster-dns-operator/pull/150 merged, so once that's released, let's try the upgrade from 4.2.z containing https://github.com/openshift/cluster-dns-operator/pull/150 to the latest 4.3 release.
Verified with an upgrade from 4.2.0-0.nightly-2019-12-20-124216 to 4.3.0-0.nightly-2019-12-22-223447.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062