Bug 1778529 - 4.2 to 4.3 upgrades broken when cluster-dns-operator attempts to upgrade due to missing metrics-tls secret
Summary: 4.2 to 4.3 upgrades broken when cluster-dns-operator attempts to upgrade due to missing metrics-tls secret
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: DNS
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.4.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks: 1778954
Reported: 2019-12-01 22:35 UTC by Clayton Coleman
Modified: 2020-05-13 21:53 UTC (History)
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The DNS operator did not monitor or update the DNS DaemonSet's tolerations.
Consequence: If a user or a previous version of the DNS operator set incorrect tolerations on the DNS DaemonSet, the operator did not correct them.
Fix: The DNS operator now checks the DNS DaemonSet's tolerations and updates them if they are not what the operator expects.
Result: The DNS DaemonSet now always has the expected tolerations.
Clone Of:
: 1778954 (view as bug list)
Environment:
Last Closed: 2020-05-13 21:53:23 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift cluster-dns-operator pull 144 None closed Bug 1778529: daemonsetConfigChanged: Check tolerations 2020-11-06 19:15:32 UTC
Red Hat Product Errata RHBA-2020:0581 None None None 2020-05-13 21:53:26 UTC

Description Clayton Coleman 2019-12-01 22:35:44 UTC
It appears that https://github.com/openshift/cluster-dns-operator/pull/122 broke 4.2 to 4.3 upgrades last week:

Dec 01 00:41:10.988 W clusteroperator/network changed Progressing to False
Dec 01 00:41:10.988 I clusteroperator/network versions: operator 4.2.9 -> 4.3.0-0.ci-2019-11-30-234318
Dec 01 00:41:11.880 I ns/openshift-dns-operator deployment/dns-operator Scaled up replica set dns-operator-5ff9db6dc5 to 1
Dec 01 00:41:11.896 I ns/openshift-dns-operator pod/dns-operator-5ff9db6dc5-57m95 node/ created
Dec 01 00:41:11.911 I ns/openshift-dns-operator replicaset/dns-operator-5ff9db6dc5 Created pod: dns-operator-5ff9db6dc5-57m95
Dec 01 00:41:11.921 W ns/openshift-marketplace pod/redhat-operators-6567d7b4c8-nr2nn network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network (13 times)
Dec 01 00:41:11.937 I ns/openshift-dns-operator pod/dns-operator-5ff9db6dc5-57m95 Successfully assigned openshift-dns-operator/dns-operator-5ff9db6dc5-57m95 to ip-10-0-136-246.ec2.internal
Dec 01 00:41:12.124 W ns/openshift-dns-operator pod/dns-operator-5ff9db6dc5-57m95 MountVolume.SetUp failed for volume "metrics-tls" : secret "metrics-tls" not found
Dec 01 00:41:15.922 I node/ip-10-0-135-144.ec2.internal Node ip-10-0-135-144.ec2.internal status is now: NodeReady (4 times)


https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11893

Last passed Nov 22nd

Comment 1 Dan Mace 2019-12-02 21:52:16 UTC
The only problem I see here is that two CoreDNS pods were scheduled to nodes with a Ready=Unknown condition, causing DNS to report Degraded:

ip-10-0-136-246.ec2.internal
  dns-default-p8zxt
ip-10-0-142-83.ec2.internal
  dns-default-42x8h

https://github.com/openshift/cluster-dns-operator/pull/140 was supposed to fix the scheduling issue, but the fix was incomplete because the operator wasn't actually rolling out the new toleration changes. Miciah has fixed that in https://github.com/openshift/cluster-dns-operator/pull/144.

I believe https://github.com/openshift/cluster-dns-operator/pull/144 is the fix.
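The fix in PR 144 makes the operator treat the DaemonSet's tolerations as managed state: compare what is currently on the DaemonSet against what the operator expects, and trigger an update on any mismatch. A minimal sketch of that kind of check, using a hypothetical simplified Toleration struct rather than the operator's actual code or the real corev1.Toleration type:

```go
package main

import (
	"fmt"
	"reflect"
)

// Toleration is a hypothetical, simplified stand-in for a
// Kubernetes toleration (key/operator/effect only).
type Toleration struct {
	Key      string
	Operator string
	Effect   string
}

// tolerationsChanged reports whether the DaemonSet's current
// tolerations differ from the operator's expected set. A true
// result means the operator must roll out an update, which is
// the behavior PR 144 adds: stale or user-modified tolerations
// are no longer left in place.
func tolerationsChanged(current, expected []Toleration) bool {
	return !reflect.DeepEqual(current, expected)
}

func main() {
	expected := []Toleration{
		{Key: "node-role.kubernetes.io/master", Operator: "Exists", Effect: "NoSchedule"},
	}
	// Tolerations as a previous operator version might have left them.
	stale := []Toleration{}

	fmt.Println(tolerationsChanged(stale, expected))    // prints true: update needed
	fmt.Println(tolerationsChanged(expected, expected)) // prints false: nothing to do
}
```

The exact-match comparison is the important design point: instead of only adding missing tolerations, the operator overwrites any divergence, so a bad state set by an earlier version (the scenario in this bug) is repaired on the next sync.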

Comment 4 Hongan Li 2020-02-06 06:14:36 UTC
Didn't see the issue in recent 4.4 upgrade testing; moving to verified.

Comment 6 errata-xmlrpc 2020-05-13 21:53:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

