Description of problem:

The DNS operator was observed failing to integrate with metrics and reporting status sync errors; the only recovery was to restart the operator. This is related to an earlier report for the ingress operator (https://bugzilla.redhat.com/show_bug.cgi?id=1687640); it seems the earlier fixes were not enough.

The DNS operator is based on code shared with the ingress operator (https://github.com/openshift/cluster-ingress-operator/pull/166), and that shared code appears to have an issue that leads to the current bug: when the code suspects the Kube client's discovery info is stale, the client is re-created in a possibly broken way. It seems likely the ingress operator also suffers from the bug. Both may have gone undetected by existing e2e tests (which is another problem).

Version-Release number of selected component (if applicable):

Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0+d21f2fc-138-dirty", GitCommit:"d21f2fc", GitTreeState:"dirty", BuildDate:"2019-05-15T17:30:49Z", GoVersion:"go1.11.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+d21f2fc", GitCommit:"d21f2fc", GitTreeState:"clean", BuildDate:"2019-05-15T17:27:21Z", GoVersion:"go1.11.8", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

Create a cluster with the installer in AWS.

Actual results:

$ oc get -n openshift-dns servicemonitor
No resources found.

time="2019-05-16T16:18:27Z" level=error msg="failed to reconcile request /default: failed to ensure dns default: failed to integrate metrics with openshift-monitoring for dns default: failed to ensure servicemonitor for default: no matches for kind \"ServiceMonitor\" in version \"monitoring.coreos.com/v1\""
time="2019-05-16T21:00:22Z" level=info msg="reconciling request: /default"
time="2019-05-16T21:00:22Z" level=error msg="failed to reconcile request /default: [failed to ensure dns default: failed to get cluster IP from network config: failed to get network 'cluster': no matches for kind \"Network\" in version \"config.openshift.io/v1\", failed to sync operator status: failed to get clusteroperator dns: no matches for kind \"ClusterOperator\" in version \"config.openshift.io/v1\"]"

The operator never seems to recover on its own: the service monitor retries get throttled and eventually fail permanently.

Expected results:

A functioning DNS operator, integrated with metrics and successfully publishing status.

Additional info:
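For context, the "no matches for kind" errors above come from the client's RESTMapper, which resolves kinds from API discovery data. A minimal sketch of the suspected failure mode, assuming a controller-runtime setup of that era (the API calls are real client-go/controller-runtime functions, but the wiring is illustrative and not the operator's actual code):

package main

import (
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/apiutil"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

func main() {
	cfg := config.GetConfigOrDie()

	// Discovery runs exactly once, here. The resulting mapper caches the
	// snapshot forever, so any CRD registered after this point (e.g.
	// ServiceMonitor) stays unresolvable until the process restarts.
	mapper, err := apiutil.NewDiscoveryRESTMapper(cfg)
	if err != nil {
		panic(err)
	}

	// Re-creating the client later with the same stale mapper does not
	// pick up the new kinds either, which matches the observed symptom.
	cl, err := client.New(cfg, client.Options{Mapper: mapper})
	if err != nil {
		panic(err)
	}
	_ = cl
}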
Just installed cluster on AWS and it looks good.

$ oc get clusterversions.config.openshift.io
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-18-050636   True        False         25m     Cluster version is 4.1.0-0.nightly-2019-05-18-050636

$ oc get -n openshift-dns servicemonitor
NAME          AGE
dns-default   34m
The problem occurs intermittently, depending on the outcome of a race with the monitoring operator (the DNS operator can start before the ServiceMonitor CRD has been registered). To be clear, the ingress operator should continue functioning whatever the outcome of the race.
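To illustrate, a minimal sketch of a reconcile loop that tolerates the race (hypothetical names such as reconciler and ensureDNS, not the operator's actual code), assuming a resettable RESTMapper like the one sketched under the next comment:

package operator

import (
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/client-go/restmapper"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// reconciler is a hypothetical stand-in for the operator's reconciler.
type reconciler struct {
	mapper *restmapper.DeferredDiscoveryRESTMapper
}

// ensureDNS stands in for the operator's real ensure logic.
func (r *reconciler) ensureDNS(req reconcile.Request) error { return nil }

func (r *reconciler) Reconcile(req reconcile.Request) (reconcile.Result, error) {
	if err := r.ensureDNS(req); err != nil {
		if meta.IsNoMatchError(err) {
			// The kind is missing from the cached discovery data; refresh
			// the mapper and requeue instead of failing permanently. With
			// a fixed startup-time mapper, retrying alone never succeeds,
			// which matches the behavior reported in this bug.
			r.mapper.Reset()
			return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
		}
		return reconcile.Result{}, err
	}
	return reconcile.Result{}, nil
}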
The fix in https://github.com/openshift/cluster-ingress-operator/pull/244 can be applied to dns-operator.
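For reference, a minimal sketch of one way such a fix can be structured, assuming the approach is a RESTMapper backed by cached discovery that can be invalidated at runtime (the actual change in the PR may differ):

package operator

import (
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/discovery/cached/memory"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/restmapper"
)

// newResettableMapper builds a RESTMapper on top of a memory-cached
// discovery client. Calling Reset() on the returned mapper invalidates
// the cache, so kinds registered after startup (such as ServiceMonitor)
// become resolvable without restarting the whole process.
func newResettableMapper(cfg *rest.Config) (*restmapper.DeferredDiscoveryRESTMapper, error) {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return nil, err
	}
	cached := memory.NewMemCacheClient(dc)
	return restmapper.NewDeferredDiscoveryRESTMapper(cached), nil
}

The mapper can then be handed to the controller-runtime client via client.Options{Mapper: mapper}, so a single Reset() makes newly registered kinds visible to all subsequent requests.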
Verified with 4.2.0-0.nightly-2019-06-25-003324; the issue has been fixed. The DNS operator reported the errors at first but eventually recovered on its own.

$ oc -n openshift-dns get servicemonitor
NAME          AGE
dns-default   3h57m

time="2019-06-25T06:19:47Z" level=error msg="failed to reconcile request /default: failed to ensure dns default: failed to integrate metrics with openshift-monitoring for dns default: failed to ensure servicemonitor for default: no matches for kind \"ServiceMonitor\" in version \"monitoring.coreos.com/v1\""
......
time="2019-06-25T06:24:25Z" level=error msg="failed to reconcile request /default: failed to ensure dns default: failed to integrate metrics with openshift-monitoring for dns default: failed to ensure servicemonitor for default: no matches for kind \"ServiceMonitor\" in version \"monitoring.coreos.com/v1\""
time="2019-06-25T06:24:28Z" level=info msg="created servicemonitor openshift-dns/dns-default"
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922