Bug 1711373

Summary:	dns operator fails to integrate with metrics and stops syncing status
Product:	OpenShift Container Platform	Reporter:	Dan Mace <dmace>
Component:	Networking	Assignee:	Dan Mace <dmace>
Networking sub component:	DNS	QA Contact:	Hongan Li <hongli>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	high	CC:	aos-bugs
Version:	4.1.0
Target Milestone:	---
Target Release:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-10-16 06:29:06 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Dan Mace 2019-05-17 15:55:01 UTC

Description of problem:

The DNS operator was observed failing to integrate with metrics and reporting status sync errors, and the only recovery was to restart the operator. This is related to an earlier report for the ingress operator (https://bugzilla.redhat.com/show_bug.cgi?id=1687640). It seems the earlier fixes were not enough.

The DNS operator code shared with the ingress operator (https://github.com/openshift/cluster-ingress-operator/pull/166) appears to have an issue that leads to the current bug. When the code suspects the Kube client discovery info is stale, it re-creates the client in a possibly broken way.

It seems likely the ingress operator also suffers from the bug. Both may have gone undetected by existing e2e tests (which is another problem).
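To illustrate the failure mode described above, here is a minimal, stdlib-only Go sketch of one way to handle suspected stale discovery info: refresh the cached set of kinds in place on a miss instead of replacing the client. It is not the code from either linked pull request, and every name in it (kindCache, discoverFunc, errNoMatch) is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoMatch stands in for the "no matches for kind" error seen in the logs.
var errNoMatch = errors.New("no matches for kind")

// discoverFunc is a hypothetical stand-in for API discovery: it returns the
// set of kinds the API server serves at the time of the call.
type discoverFunc func() map[string]bool

// kindCache is a hypothetical cache of discovered kinds.
type kindCache struct {
	discover discoverFunc
	known    map[string]bool
}

func newKindCache(d discoverFunc) *kindCache {
	return &kindCache{discover: d, known: d()}
}

// lookup checks the cache and, on a miss, refreshes it once in place before
// giving up. The cache object itself is never thrown away and rebuilt.
func (c *kindCache) lookup(kind string) error {
	if c.known[kind] {
		return nil
	}
	c.known = c.discover() // a miss may only mean the cache is stale
	if c.known[kind] {
		return nil
	}
	return fmt.Errorf("%w %q", errNoMatch, kind)
}

func main() {
	served := map[string]bool{"Network": true}
	cache := newKindCache(func() map[string]bool {
		snapshot := make(map[string]bool, len(served))
		for k, v := range served {
			snapshot[k] = v
		}
		return snapshot
	})

	// The kind is not registered yet, so the lookup fails.
	fmt.Println(cache.lookup("ServiceMonitor"))

	// Simulate another operator registering the kind later.
	served["ServiceMonitor"] = true

	// The miss triggers a refresh, and the kind is now found.
	fmt.Println(cache.lookup("ServiceMonitor"))
}
```

In this sketch a lookup that fails before the kind is registered succeeds on a later attempt without restarting anything, which is the behavior the expected results call for.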

Version-Release number of selected component (if applicable):

Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0+d21f2fc-138-dirty", GitCommit:"d21f2fc", GitTreeState:"dirty", BuildDate:"2019-05-15T17:30:49Z", GoVersion:"go1.11.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+d21f2fc", GitCommit:"d21f2fc", GitTreeState:"clean", BuildDate:"2019-05-15T17:27:21Z", GoVersion:"go1.11.8", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

Create a cluster with the installer in AWS.


Actual results:

$ oc get -n openshift-dns servicemonitor
No resources found.

time="2019-05-16T16:18:27Z" level=error msg="failed to reconcile request /default: failed to ensure dns default: failed to integrate metrics with openshift-monitoring for dns default: failed to ensure servicemonitor for default: no matches for kind \"ServiceMonitor\" in version \"monitoring.coreos.com/v1\""

time="2019-05-16T21:00:22Z" level=info msg="reconciling request: /default"
time="2019-05-16T21:00:22Z" level=error msg="failed to reconcile request /default: [failed to ensure dns default: failed to get cluster IP from network config: failed to get network 'cluster': no matches for kind \"Network\" in version \"config.openshift.io/v1\", failed to sync operator status: failed to get clusteroperator dns: no matches for kind \"ClusterOperator\" in version \"config.openshift.io/v1\"]"

The operator never seems to recover on its own; restarting it is the only known recovery.

Expected results:

Functioning DNS operator integrated with metrics and successfully publishing status.

Additional info:

Comment 1 Hongan Li 2019-05-20 02:39:30 UTC

Just installed cluster on AWS and it looks good.

$ oc get clusterversions.config.openshift.io 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-18-050636   True        False         25m     Cluster version is 4.1.0-0.nightly-2019-05-18-050636

$ oc get -n openshift-dns servicemonitor
NAME          AGE
dns-default   34m

Comment 2 Dan Mace 2019-05-20 14:01:16 UTC

The problem occurs with some probability based on a race with the monitoring operator. To be clear, the ingress operator should continue functioning whatever the outcome of the race.
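As a rough sketch of that expectation (not the operator's actual reconcile code), the Go example below treats a missing ServiceMonitor kind as a reason to retry later rather than a reason to stop. The reconcile function, the outcome type, and errKindNotServed are hypothetical names introduced only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errKindNotServed stands in for "no matches for kind" from the API server.
var errKindNotServed = errors.New("no matches for kind")

// outcome is a hypothetical record of what one reconcile pass achieved.
type outcome struct {
	dnsReady     bool
	statusSynced bool
	requeue      bool // true when metrics integration should be retried later
}

// reconcile runs three hypothetical steps. Losing the race with the
// monitoring operator only defers metrics integration; DNS and status
// publishing still proceed.
func reconcile(ensureDNS, ensureMetrics, syncStatus func() error) (outcome, error) {
	var out outcome
	if err := ensureDNS(); err != nil {
		return out, fmt.Errorf("failed to ensure dns: %w", err)
	}
	out.dnsReady = true

	if err := ensureMetrics(); err != nil {
		if !errors.Is(err, errKindNotServed) {
			return out, fmt.Errorf("failed to integrate metrics: %w", err)
		}
		out.requeue = true // kind not registered yet, try again later
	}

	if err := syncStatus(); err != nil {
		return out, fmt.Errorf("failed to sync operator status: %w", err)
	}
	out.statusSynced = true
	return out, nil
}

func main() {
	ok := func() error { return nil }
	kindMissing := func() error {
		return fmt.Errorf("failed to ensure servicemonitor: %w", errKindNotServed)
	}

	out, err := reconcile(ok, kindMissing, ok)
	fmt.Printf("%+v %v\n", out, err)
}
```

The design choice is to classify the error: only the "kind not served" case is deferred, and any other metrics failure is still reported.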

Comment 3 Dan Mace 2019-06-05 15:25:30 UTC

The fix in https://github.com/openshift/cluster-ingress-operator/pull/244 can be applied to dns-operator.

Comment 5 Hongan Li 2019-06-25 10:25:38 UTC

Verified with 4.2.0-0.nightly-2019-06-25-003324; the issue has been fixed.
The DNS operator reported the errors but eventually recovered on its own.

$ oc -n openshift-dns get servicemonitor
NAME          AGE
dns-default   3h57m


time="2019-06-25T06:19:47Z" level=error msg="failed to reconcile request /default: failed to ensure dns default: failed to integrate metrics with openshift-monitoring for dns default: failed to ensure servicemonitor for default: no matches for kind \"ServiceMonitor\" in version \"monitoring.coreos.com/v1\""
......
time="2019-06-25T06:24:25Z" level=error msg="failed to reconcile request /default: failed to ensure dns default: failed to integrate metrics with openshift-monitoring for dns default: failed to ensure servicemonitor for default: no matches for kind \"ServiceMonitor\" in version \"monitoring.coreos.com/v1\""

time="2019-06-25T06:24:28Z" level=info msg="created servicemonitor openshift-dns/dns-default"

Comment 7 errata-xmlrpc 2019-10-16 06:29:06 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922