Bug 1711373 - dns operator fails to integrate with metrics and stops syncing status
Summary: dns operator fails to integrate with metrics and stops syncing status
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: DNS
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Dan Mace
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-17 15:55 UTC by Dan Mace
Modified: 2019-10-16 06:29 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:29:06 UTC
Target Upstream Version:


Attachments: None


Links
System: Github openshift/cluster-dns-operator pull 116 (commit 495fb11ffea8744fa57986eca3d8ff0555adf6ad) - Last Updated: 2020-09-04 09:13:31 UTC
System: Red Hat Product Errata RHBA-2019:2922 - Last Updated: 2019-10-16 06:29:17 UTC

Description Dan Mace 2019-05-17 15:55:01 UTC
Description of problem:

The DNS operator was observed failing to integrate with metrics and reporting status sync errors, and the only recovery was to restart the operator. This is related to an earlier report for the ingress operator (https://bugzilla.redhat.com/show_bug.cgi?id=1687640). It seems the earlier fixes were not enough.

The client-rebuilding behavior the DNS operator shares with the ingress operator (https://github.com/openshift/cluster-ingress-operator/pull/166) appears to have an issue that leads to the current bug: when the code suspects the Kube client discovery info is stale, the client is re-created in a possibly broken way.

It seems likely the ingress operator also suffers from this bug. Both instances may have gone undetected by existing e2e tests (which is a problem in itself).

Version-Release number of selected component (if applicable):

Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0+d21f2fc-138-dirty", GitCommit:"d21f2fc", GitTreeState:"dirty", BuildDate:"2019-05-15T17:30:49Z", GoVersion:"go1.11.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+d21f2fc", GitCommit:"d21f2fc", GitTreeState:"clean", BuildDate:"2019-05-15T17:27:21Z", GoVersion:"go1.11.8", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

Create a cluster with the installer in AWS.


Actual results:

$ oc get -n openshift-dns servicemonitor
No resources found.

time="2019-05-16T16:18:27Z" level=error msg="failed to reconcile request /default: failed to ensure dns default: failed to integrate metrics with openshift-monitoring for dns default: failed to ensure servicemonitor for default: no matches for kind \"ServiceMonitor\" in version \"monitoring.coreos.com/v1\""

time="2019-05-16T21:00:22Z" level=info msg="reconciling request: /default"
time="2019-05-16T21:00:22Z" level=error msg="failed to reconcile request /default: [failed to ensure dns default: failed to get cluster IP from network config: failed to get network 'cluster': no matches for kind \"Network\" in version \"config.openshift.io/v1\", failed to sync operator status: failed to get clusteroperator dns: no matches for kind \"ClusterOperator\" in version \"config.openshift.io/v1\"]"

The operator never seems to recover on its own — the service monitor retries get throttled and eventually fail permanently.

Expected results:

Functioning DNS operator integrated with metrics and successfully publishing status.

Additional info:

Comment 1 Hongan Li 2019-05-20 02:39:30 UTC
Just installed a cluster on AWS and it looks good.

$ oc get clusterversions.config.openshift.io 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-18-050636   True        False         25m     Cluster version is 4.1.0-0.nightly-2019-05-18-050636

$ oc get -n openshift-dns servicemonitor
NAME          AGE
dns-default   34m

Comment 2 Dan Mace 2019-05-20 14:01:16 UTC
The problem occurs with some probability based on a race with the monitoring operator. To be clear, the ingress operator should continue functioning whatever the outcome of the race.

Comment 3 Dan Mace 2019-06-05 15:25:30 UTC
The fix in https://github.com/openshift/cluster-ingress-operator/pull/244 can be applied to dns-operator.
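One way to tolerate the race with the monitoring operator (a sketch of the general idea, not necessarily what the linked PR implements) is to treat the missing-kind error as transient and keep retrying with capped exponential backoff, rather than letting the retries throttle out permanently. The helpers `ensureServiceMonitor` and `reconcileWithBackoff` below are hypothetical names for illustration:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errNoKindMatch = errors.New(`no matches for kind "ServiceMonitor"`)

// ensureServiceMonitor models the reconcile step that fails until the
// monitoring operator has registered the ServiceMonitor CRD (simulated
// here as succeeding once the attempt counter reaches crdReadyAt).
func ensureServiceMonitor(crdReadyAt int, attempt *int) error {
	*attempt++
	if *attempt < crdReadyAt {
		return errNoKindMatch
	}
	return nil
}

// reconcileWithBackoff retries the transient missing-kind error with capped
// exponential backoff instead of giving up; any other error surfaces at once.
func reconcileWithBackoff(crdReadyAt, maxAttempts int, base, maxDelay time.Duration) (int, error) {
	attempt := 0
	delay := base
	for i := 0; i < maxAttempts; i++ {
		err := ensureServiceMonitor(crdReadyAt, &attempt)
		if err == nil {
			return attempt, nil
		}
		if !errors.Is(err, errNoKindMatch) {
			return attempt, err // not transient: surface immediately
		}
		time.Sleep(delay)
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
	return attempt, fmt.Errorf("gave up after %d attempts: %w", attempt, errNoKindMatch)
}

func main() {
	// The CRD "appears" on the 4th attempt; the reconciler rides out the race.
	attempts, err := reconcileWithBackoff(4, 10, time.Millisecond, 8*time.Millisecond)
	fmt.Println(attempts, err == nil)
}
```

This matches the behavior verified in comment 5 below: the operator logs the error repeatedly and then succeeds once the CRD is registered.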

Comment 5 Hongan Li 2019-06-25 10:25:38 UTC
Verified with 4.2.0-0.nightly-2019-06-25-003324; the issue has been fixed.
The DNS operator reported the errors for a while but eventually recovered on its own.

$ oc -n openshift-dns get servicemonitor
NAME          AGE
dns-default   3h57m


time="2019-06-25T06:19:47Z" level=error msg="failed to reconcile request /default: failed to ensure dns default: failed to integrate metrics with openshift-monitoring for dns default: failed to ensure servicemonitor for default: no matches for kind \"ServiceMonitor\" in version \"monitoring.coreos.com/v1\""
......
time="2019-06-25T06:24:25Z" level=error msg="failed to reconcile request /default: failed to ensure dns default: failed to integrate metrics with openshift-monitoring for dns default: failed to ensure servicemonitor for default: no matches for kind \"ServiceMonitor\" in version \"monitoring.coreos.com/v1\""

time="2019-06-25T06:24:28Z" level=info msg="created servicemonitor openshift-dns/dns-default"

Comment 7 errata-xmlrpc 2019-10-16 06:29:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

