Bug 1778954 - [4.3] 4.2 to 4.3 upgrades broken when cluster-dns-operator attempts to upgrade due to missing metrics-tls secret
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.3.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On: 1778529
Blocks:
 
Reported: 2019-12-02 22:13 UTC by Miciah Dashiel Butler Masters
Modified: 2022-08-04 22:39 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1778529
Environment:
Last Closed: 2020-01-23 11:14:59 UTC
Target Upstream Version:
Embargoed:


Links
Github openshift cluster-dns-operator pull 145 (closed): "Bug 1778954: Upgrade failed due to incorrect toleration merge", last updated 2019-12-09 14:10:39 UTC

Description Miciah Dashiel Butler Masters 2019-12-02 22:13:04 UTC
+++ This bug was initially created as a clone of Bug #1778529 +++

It appears that https://github.com/openshift/cluster-dns-operator/pull/122 broke 4.2 to 4.3 upgrades last week:

Dec 01 00:41:10.988 W clusteroperator/network changed Progressing to False
Dec 01 00:41:10.988 I clusteroperator/network versions: operator 4.2.9 -> 4.3.0-0.ci-2019-11-30-234318
Dec 01 00:41:11.880 I ns/openshift-dns-operator deployment/dns-operator Scaled up replica set dns-operator-5ff9db6dc5 to 1
Dec 01 00:41:11.896 I ns/openshift-dns-operator pod/dns-operator-5ff9db6dc5-57m95 node/ created
Dec 01 00:41:11.911 I ns/openshift-dns-operator replicaset/dns-operator-5ff9db6dc5 Created pod: dns-operator-5ff9db6dc5-57m95
Dec 01 00:41:11.921 W ns/openshift-marketplace pod/redhat-operators-6567d7b4c8-nr2nn network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network (13 times)
Dec 01 00:41:11.937 I ns/openshift-dns-operator pod/dns-operator-5ff9db6dc5-57m95 Successfully assigned openshift-dns-operator/dns-operator-5ff9db6dc5-57m95 to ip-10-0-136-246.ec2.internal
Dec 01 00:41:12.124 W ns/openshift-dns-operator pod/dns-operator-5ff9db6dc5-57m95 MountVolume.SetUp failed for volume "metrics-tls" : secret "metrics-tls" not found
Dec 01 00:41:15.922 I node/ip-10-0-135-144.ec2.internal Node ip-10-0-135-144.ec2.internal status is now: NodeReady (4 times)


https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11893

Last passed Nov 22nd
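[Editor's note] The MountVolume.SetUp error above means the dns-operator Deployment references a secret named "metrics-tls" in the openshift-dns-operator namespace that does not exist at the time the pod is created. As a quick way to confirm whether the secret is present, here is a minimal standalone client-go sketch; it is not part of the operator, and it assumes a kubeconfig with access to the cluster and a client-go release with context-aware calls. The same check is simply "oc get secret metrics-tls -n openshift-dns-operator".

package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig loading rules (the same files oc/kubectl use).
	cfg, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
		clientcmd.NewDefaultClientConfigLoadingRules(),
		&clientcmd.ConfigOverrides{},
	).ClientConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// The dns-operator pod mounts a volume backed by this secret; while the secret is
	// absent, kubelet keeps reporting the MountVolume.SetUp failure seen in the log above.
	_, err = client.CoreV1().Secrets("openshift-dns-operator").Get(context.TODO(), "metrics-tls", metav1.GetOptions{})
	switch {
	case apierrors.IsNotFound(err):
		fmt.Println(`secret "metrics-tls" not found`)
	case err != nil:
		panic(err)
	default:
		fmt.Println(`secret "metrics-tls" exists`)
	}
}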

--- Additional comment from Dan Mace on 2019-12-02 21:52:16 UTC ---

The only problem I see here is that two CoreDNS pods were scheduled to nodes with a Ready=Unknown condition, causing DNS to report degraded:

ip-10-0-136-246.ec2.internal
  dns-default-p8zxt
ip-10-0-142-83.ec2.internal
  dns-default-42x8h

https://github.com/openshift/cluster-dns-operator/pull/140 was supposed to fix the scheduling issue, but the fix was incomplete because the operator wasn't actually rolling out the new toleration changes. Miciah has fixed that in https://github.com/openshift/cluster-dns-operator/pull/144.

I believe https://github.com/openshift/cluster-dns-operator/pull/144 is the fix.
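[Editor's note] For context on what "incorrect toleration merge" can mean in practice: if the operator's DaemonSet reconciliation compares only some fields, or merges tolerations instead of replacing them, then a changed toleration list in the desired manifest never triggers a rollout. The following is a hypothetical sketch of that class of bug and its fix, not the operator's actual code; daemonSetChanged is an illustrative helper name.

package dnsutil

import (
	"reflect"

	appsv1 "k8s.io/api/apps/v1"
)

// daemonSetChanged reports whether current must be updated to match desired,
// and returns the updated copy. Hypothetical helper for illustration only;
// it assumes both specs have at least one container.
func daemonSetChanged(current, desired *appsv1.DaemonSet) (bool, *appsv1.DaemonSet) {
	updated := current.DeepCopy()
	changed := false

	// A minimal check might only look at the container image...
	if updated.Spec.Template.Spec.Containers[0].Image != desired.Spec.Template.Spec.Containers[0].Image {
		updated.Spec.Template.Spec.Containers[0].Image = desired.Spec.Template.Spec.Containers[0].Image
		changed = true
	}

	// ...and the easy-to-miss part: tolerations must be compared and replaced
	// outright (not merged or ignored), or a new toleration set such as the
	// scheduling change from PR #140 never rolls out to the running DaemonSet.
	if !reflect.DeepEqual(updated.Spec.Template.Spec.Tolerations, desired.Spec.Template.Spec.Tolerations) {
		updated.Spec.Template.Spec.Tolerations = desired.Spec.Template.Spec.Tolerations
		changed = true
	}

	return changed, updated
}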

Comment 2 Hongan Li 2019-12-05 07:52:33 UTC
PR #145 was merged in https://openshift-release.svc.ci.openshift.org/releasestream/4.3.0-0.nightly/release/4.3.0-0.nightly-2019-12-03-211441.

When checking https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/12062, which upgrades to 4.3.0-0.nightly-2019-12-04-214544, I still found the message: secret "metrics-tls" not found

Dec 04 22:50:01.382 I ns/openshift-dns-operator deployment/dns-operator Scaled up replica set dns-operator-6746ff4575 to 1
Dec 04 22:50:01.391 I ns/openshift-dns-operator pod/dns-operator-6746ff4575-dl5kq node/ created
Dec 04 22:50:01.410 I ns/openshift-dns-operator replicaset/dns-operator-6746ff4575 Created pod: dns-operator-6746ff4575-dl5kq
Dec 04 22:50:01.438 I ns/openshift-dns-operator pod/dns-operator-6746ff4575-dl5kq Successfully assigned openshift-dns-operator/dns-operator-6746ff4575-dl5kq to ip-10-0-140-114.ec2.internal
Dec 04 22:50:01.438 W ns/openshift-monitoring pod/kube-state-metrics-544fbcbfbb-qtmbk network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network (10 times)
Dec 04 22:50:01.649 W ns/openshift-dns-operator pod/dns-operator-6746ff4575-dl5kq MountVolume.SetUp failed for volume "metrics-tls" : secret "metrics-tls" not found

Comment 3 Dan Mace 2019-12-05 14:45:22 UTC
Since we didn't backport the scheduling fix to 4.2, it's expected that the DNS pods will continue to be scheduled to unready nodes prior to the 4.3 upgrade. The root problem is that the nodes aren't ready. DNS isn't causing the problems and would eventually report success once the nodes are fixed.

Maybe we should go ahead and backport the scheduling fix to 4.2.
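[Editor's note] Background on why the pods land on unready nodes at all: the node lifecycle controller taints nodes whose Ready condition is False or Unknown with node.kubernetes.io/not-ready and node.kubernetes.io/unreachable, and a blanket toleration with operator: Exists matches every taint, so DaemonSet pods keep being placed on (and staying on) those nodes. The sketch below contrasts such a blanket toleration with a narrower set; the exact toleration lists used by the CoreDNS DaemonSet before and after the fix are assumptions here, not taken from the operator's manifests.

package dnsutil

import corev1 "k8s.io/api/core/v1"

// tolerateEverything matches any taint, including node.kubernetes.io/not-ready
// and node.kubernetes.io/unreachable, so pods keep being scheduled to unready nodes.
var tolerateEverything = []corev1.Toleration{
	{Operator: corev1.TolerationOpExists},
}

// tolerateMastersOnly tolerates only the master NoSchedule taint; the NoSchedule
// taints placed on not-ready/unreachable nodes are no longer tolerated, so the
// scheduler keeps new DNS pods off nodes tainted as not ready.
var tolerateMastersOnly = []corev1.Toleration{
	{
		Key:      "node-role.kubernetes.io/master",
		Operator: corev1.TolerationOpExists,
		Effect:   corev1.TaintEffectNoSchedule,
	},
}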

Comment 4 Dan Mace 2019-12-05 14:57:39 UTC
Opened https://bugzilla.redhat.com/show_bug.cgi?id=1780213 to backport the scheduling fix to 4.2.z.

Comment 5 Dan Mace 2019-12-06 18:31:14 UTC
Let's re-test once the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1780213 is released.

Comment 6 Dan Mace 2019-12-11 13:27:32 UTC
This is still waiting for https://github.com/openshift/cluster-dns-operator/pull/150 to be approved for release.

Comment 7 Dan Mace 2019-12-13 00:59:21 UTC
https://github.com/openshift/cluster-dns-operator/pull/150 merged, so once that's released, let's try the upgrade from a 4.2.z release containing that fix to the latest 4.3 release.

Comment 8 Hongan Li 2019-12-23 11:21:03 UTC
Verified with an upgrade from 4.2.0-0.nightly-2019-12-20-124216 to 4.3.0-0.nightly-2019-12-22-223447.

Comment 10 errata-xmlrpc 2020-01-23 11:14:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

