Bug 1778954

Summary: [4.3] 4.2 to 4.3 upgrades broken when cluster-dns-operator attempts to upgrade due to missing metrics-tls secret
Product: OpenShift Container Platform
Component: Networking (sub component: DNS)
Reporter: Miciah Dashiel Butler Masters <mmasters>
Assignee: Miciah Dashiel Butler Masters <mmasters>
QA Contact: Hongan Li <hongli>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: aos-bugs, bbennett, ccoleman, dmace, hongli, lmohanty
Version: 4.3.0
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Clone Of: 1778529
Bug Depends On: 1778529
Last Closed: 2020-01-23 11:14:59 UTC

Description Miciah Dashiel Butler Masters 2019-12-02 22:13:04 UTC
+++ This bug was initially created as a clone of Bug #1778529 +++

It appears that https://github.com/openshift/cluster-dns-operator/pull/122 broke 4.2 to 4.3 upgrades last week:

Dec 01 00:41:10.988 W clusteroperator/network changed Progressing to False
Dec 01 00:41:10.988 I clusteroperator/network versions: operator 4.2.9 -> 4.3.0-0.ci-2019-11-30-234318
Dec 01 00:41:11.880 I ns/openshift-dns-operator deployment/dns-operator Scaled up replica set dns-operator-5ff9db6dc5 to 1
Dec 01 00:41:11.896 I ns/openshift-dns-operator pod/dns-operator-5ff9db6dc5-57m95 node/ created
Dec 01 00:41:11.911 I ns/openshift-dns-operator replicaset/dns-operator-5ff9db6dc5 Created pod: dns-operator-5ff9db6dc5-57m95
Dec 01 00:41:11.921 W ns/openshift-marketplace pod/redhat-operators-6567d7b4c8-nr2nn network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network (13 times)
Dec 01 00:41:11.937 I ns/openshift-dns-operator pod/dns-operator-5ff9db6dc5-57m95 Successfully assigned openshift-dns-operator/dns-operator-5ff9db6dc5-57m95 to ip-10-0-136-246.ec2.internal
Dec 01 00:41:12.124 W ns/openshift-dns-operator pod/dns-operator-5ff9db6dc5-57m95 MountVolume.SetUp failed for volume "metrics-tls" : secret "metrics-tls" not found
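The MountVolume.SetUp failure above occurs because the 4.3 dns-operator Deployment mounts a secret named "metrics-tls" that does not yet exist in the openshift-dns-operator namespace when the new pod is created; the kubelet retries the mount and emits the "secret not found" event until the secret appears. A minimal sketch of the kind of volume wiring involved (field values here are illustrative, not the exact operator manifest):

```yaml
# Sketch of a pod template that mounts a serving-cert secret.
# Until the secret exists, the pod stays in ContainerCreating and
# the kubelet repeats the MountVolume.SetUp "secret not found" event.
spec:
  containers:
  - name: dns-operator
    volumeMounts:
    - name: metrics-tls
      mountPath: /etc/tls/private   # illustrative path
      readOnly: true
  volumes:
  - name: metrics-tls
    secret:
      secretName: metrics-tls       # assumed to be generated for the
                                    # operator's metrics service
```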
Dec 01 00:41:15.922 I node/ip-10-0-135-144.ec2.internal Node ip-10-0-135-144.ec2.internal status is now: NodeReady (4 times)


https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11893

Last passed Nov 22nd

--- Additional comment from Dan Mace on 2019-12-02 21:52:16 UTC ---

The only problem I see here is that two CoreDNS pods were scheduled to nodes with a Ready=Unknown condition, causing DNS to report degraded:

ip-10-0-136-246.ec2.internal
  dns-default-p8zxt
ip-10-0-142-83.ec2.internal
  dns-default-42x8h

https://github.com/openshift/cluster-dns-operator/pull/140 was supposed to fix the scheduling issue, but the fix was incomplete because the operator wasn't actually rolling out the new toleration changes. Miciah has fixed that in https://github.com/openshift/cluster-dns-operator/pull/144.

I believe https://github.com/openshift/cluster-dns-operator/pull/144 is the fix.
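For context, the scheduling fix discussed above amounts to tolerating node taints so the DNS pods are not kept off (or evicted from) not-ready nodes during the upgrade. A hedged sketch of what such a toleration can look like (the actual tolerations are in the linked PRs and are not reproduced here):

```yaml
# Illustrative tolerations letting a pod be scheduled on, and remain on,
# a node carrying the standard not-ready/unreachable taints.
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
```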

Comment 2 Hongan Li 2019-12-05 07:52:33 UTC
PR #145 was merged into https://openshift-release.svc.ci.openshift.org/releasestream/4.3.0-0.nightly/release/4.3.0-0.nightly-2019-12-03-211441.

When checking https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12062, which upgrades to 4.3.0-0.nightly-2019-12-04-214544, the message `secret "metrics-tls" not found` still appears:

Dec 04 22:50:01.382 I ns/openshift-dns-operator deployment/dns-operator Scaled up replica set dns-operator-6746ff4575 to 1
Dec 04 22:50:01.391 I ns/openshift-dns-operator pod/dns-operator-6746ff4575-dl5kq node/ created
Dec 04 22:50:01.410 I ns/openshift-dns-operator replicaset/dns-operator-6746ff4575 Created pod: dns-operator-6746ff4575-dl5kq
Dec 04 22:50:01.438 I ns/openshift-dns-operator pod/dns-operator-6746ff4575-dl5kq Successfully assigned openshift-dns-operator/dns-operator-6746ff4575-dl5kq to ip-10-0-140-114.ec2.internal
Dec 04 22:50:01.438 W ns/openshift-monitoring pod/kube-state-metrics-544fbcbfbb-qtmbk network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network (10 times)
Dec 04 22:50:01.649 W ns/openshift-dns-operator pod/dns-operator-6746ff4575-dl5kq MountVolume.SetUp failed for volume "metrics-tls" : secret "metrics-tls" not found

Comment 3 Dan Mace 2019-12-05 14:45:22 UTC
Since we didn't backport the scheduling fix to 4.2, it's expected that the DNS pods will continue to be scheduled to unready nodes prior to the 4.3 upgrade. The root problem is that the nodes aren't ready: DNS isn't causing the problem, and it would eventually report success once the nodes recover.

Maybe we should go ahead and backport the scheduling fix to 4.2.

Comment 4 Dan Mace 2019-12-05 14:57:39 UTC
Opened https://bugzilla.redhat.com/show_bug.cgi?id=1780213 to backport the scheduling fix to 4.2.z.

Comment 5 Dan Mace 2019-12-06 18:31:14 UTC
Let's re-test once the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1780213 is released.

Comment 6 Dan Mace 2019-12-11 13:27:32 UTC
This is still waiting for https://github.com/openshift/cluster-dns-operator/pull/150 to be approved for release.

Comment 7 Dan Mace 2019-12-13 00:59:21 UTC
https://github.com/openshift/cluster-dns-operator/pull/150 has merged, so once it's released, let's try an upgrade from a 4.2.z release containing that fix to the latest 4.3 release.

Comment 8 Hongan Li 2019-12-23 11:21:03 UTC
Verified with an upgrade from 4.2.0-0.nightly-2019-12-20-124216 to 4.3.0-0.nightly-2019-12-22-223447.

Comment 10 errata-xmlrpc 2020-01-23 11:14:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062