Bug 1860142 - coredns_forward_healthcheck_broken_count_total metric is not working for DNS forwarding
Summary: coredns_forward_healthcheck_broken_count_total metric is not working for DNS ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: DNS
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Daneyon Hansen
QA Contact: Arvind iyengar
URL:
Whiteboard:
Depends On:
Blocks: 1862584
Reported: 2020-07-23 19:25 UTC by Daneyon Hansen
Modified: 2020-10-27 16:17 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1862584 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:16:56 UTC
Target Upstream Version:


Attachments
Prometheus graph data from patched cluster version (158.72 KB, image/png)
2020-08-21 09:18 UTC, Arvind iyengar


Links
System ID Private Priority Status Summary Last Updated
Github openshift coredns pull 35 0 None closed Bug 1860142: Adds HealthcheckBrokenCount to forward plugin 2021-02-15 08:32:49 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:17:14 UTC

Description Daneyon Hansen 2020-07-23 19:25:45 UTC
Description of problem:

The CoreDNS forward plugin [1] performs health checking of upstream servers when an error is encountered. However, the resulting health check metrics are only exposed when a Corefile server block specifies the prometheus plugin. The DNS Operator does not specify the prometheus plugin per server block.

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Always

Steps to Reproduce:
1. Create a cluster
2. Update the operator.dns/default resource by specifying a forwarding server [2]. To simulate a failure scenario, [2] should specify a failed DNS server.
3. rsh into a pod and perform an nslookup for a host in the zone specified in [2]. The lookup should fail.
4. curl the metrics endpoint of the server used in step 3.

Actual results:

No healthcheck metric fired.

Expected results:

The 'coredns_forward_healthcheck_broken_count_total' metric to fire.

Additional info:
[1] https://coredns.io/plugins/forward/
[2] https://docs.openshift.com/container-platform/4.5/networking/dns-operator.html#nw-dns-forward_dns-operator

Comment 1 Daneyon Hansen 2020-07-23 19:33:13 UTC
A related issue was discovered while reproducing this BZ's main issue: the CoreDNS forward plugin does not register [3] the `coredns_forward_healthcheck_broken_count_total` metric [4]. PR [5] was opened upstream to fix this. It should be cherry-picked to openshift/coredns and backported to 4.4 (the first release with DNS forwarding). To simulate this failure scenario, [2] should specify a failed DNS server as the first server and a functioning server as the second server, then complete steps 3-4 of the reproducer.

[3] https://github.com/coredns/coredns/blob/v1.6.6/plugin/forward/setup.go#L35
[4] https://github.com/coredns/coredns/blob/v1.6.6/plugin/forward/metrics.go#L36-L41
[5] https://github.com/coredns/coredns/pull/4021
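The effect of a metric that is defined but never registered can be illustrated with a toy model. This is only a sketch: the `Counter` and `Registry` classes below are hypothetical stand-ins written for this example, not the Prometheus client API or CoreDNS code, and the `to` label on the failure counter is omitted for brevity.

```python
class Counter:
    """Toy counter: holds a name and a value."""

    def __init__(self, name):
        self.name = name
        self.value = 0

    def inc(self):
        self.value += 1


class Registry:
    """Toy registry: only registered metrics show up in a scrape."""

    def __init__(self):
        self._metrics = []

    def register(self, metric):
        self._metrics.append(metric)

    def scrape(self):
        return {m.name: m.value for m in self._metrics}


failure = Counter("coredns_forward_healthcheck_failure_count_total")
broken = Counter("coredns_forward_healthcheck_broken_count_total")

registry = Registry()
registry.register(failure)  # registered, so it is visible to scrapes
# 'broken' is defined but deliberately not registered (the situation in this comment)

failure.inc()
broken.inc()  # the counter still increments internally
before_fix = registry.scrape()

registry.register(broken)  # the kind of change the upstream PR is described as making
after_fix = registry.scrape()

print(sorted(before_fix))
print(sorted(after_fix))
```

In this model the broken counter is incremented in both cases, but it only appears in the scrape output once it has been registered, which matches the symptom of the metric never showing up on the metrics endpoint.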

Comment 2 Daneyon Hansen 2020-07-23 20:34:20 UTC
After further analysis, only the bug identified in https://bugzilla.redhat.com/show_bug.cgi?id=1860142#c1 exists. Here are the reproducer steps that indicate https://bugzilla.redhat.com/show_bug.cgi?id=1860142#c0 is not a bug:

1. Update dnses.operator.openshift.io/default with a DNS forwarding configuration. Note that 216.239.32.1 and 216.239.32.2 are failed servers.

$ oc get dnses.operator.openshift.io/default -o yaml
apiVersion: operator.openshift.io/v1
kind: DNS
<SNIP>
spec:
  servers:
  - forwardPlugin:
      upstreams:
      - 216.239.32.1
      - 216.239.32.2
    name: google
    zones:
    - daneyon.com

2. Verify the Corefile configuration has been updated for the cluster DNS server (10.131.0.3) used for testing.

$ oc get po/dns-default-4df9v -n openshift-dns -o wide
NAME                READY   STATUS    RESTARTS   AGE     IP           NODE                                         NOMINATED NODE   READINESS GATES
dns-default-4df9v   3/3     Running   0          5h14m   10.131.0.3   ip-10-0-198-229.us-west-2.compute.internal   <none>           <none>

$ oc exec dns-default-4df9v -n openshift-dns -c dns -- cat /etc/coredns/Corefile
# google
daneyon.com:5353 {
    forward . 216.239.32.1 216.239.32.2
}
.:5353 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        upstream
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf {
        policy sequential
    }
    cache 30
    reload
}
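The mapping from the `servers` entry in step 1 to the first server block in the Corefile above can be sketched as follows. The `render_server_block` helper is hypothetical and written only for illustration; it is not the DNS Operator's actual template code, and the port default of 5353 is taken from the Corefile shown here.

```python
def render_server_block(name, zones, upstreams, port=5353):
    """Render one Corefile server block for a forwarding server (illustrative only)."""
    zone_list = " ".join(f"{zone}:{port}" for zone in zones)
    lines = [
        f"# {name}",
        f"{zone_list} {{",
        f"    forward . {' '.join(upstreams)}",
        "}",
    ]
    return "\n".join(lines)


# Values copied from the DNS resource in step 1.
block = render_server_block(
    name="google",
    zones=["daneyon.com"],
    upstreams=["216.239.32.1", "216.239.32.2"],
)
print(block)
```

Note that the rendered block, like the one in the Corefile above, contains no prometheus line of its own, yet the forward health check metrics for these upstreams are still visible in step 4.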

3. Get the token used for scraping the metrics endpoint:

$ oc serviceaccounts -n openshift-monitoring get-token prometheus-k8s
eyJhbGciOiJSUzI1NiIsImtpZCI6ImkxTE44SWJKVC12cFUzTG1zdzRrb05BNkI1SzA0bFlGLTkybUJqbU81YWMifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cy10b2tlbi1wcmI2ciIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjA1NGRlZjliLTAwZWMtNDU5YS1iNzdmLWM4MDk3ODFmNmI3YSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpvcGVuc2hpZnQtbW9uaXRvcmluZzpwcm9tZXRoZXVzLWs4cyJ9.Dd4fku6-TVu6f60JXWdH7mlwPU-woxq_xx0jjIFcEI5O0zl82E5irkNaY-yTrpjrLHV-vrn1kdERJNLmo5t-b79YSHP6g2kyB9bDzXgioot1gnuA780jRE7KcDOgOuQqUfQU2OGajN-bqDdEYe80gfgOqVBLy00MFEpvssrqR5grsOCnpRWk0X1DbIykYrE8JVullB27AEIvmBZ4_PWhnDJgENHDoVvZfAdx9M9jgNkQMhviS6N-kQD9_htnIUxRFACPihS2IYQfhwyygFEWJ2h34hxvYYzorSt8In-DHJkjMo_iCScHUSvfQuFZWHkoNmL_VEsJPflfv_nyQ60nwA

4. rsh into a pod, perform a DNS query using the server from step 2 and scrape the metrics endpoint.

$ oc rsh -n openshift-ingress router-default-78777b5df4-bp6gn

sh-4.2$ nslookup -port=5353 resume.daneyon.com 10.131.0.3
Server:		10.131.0.3
Address:	10.131.0.3#5353

** server can't find resume.daneyon.com: SERVFAIL

sh-4.2$ curl -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6ImkxTE44SWJKVC12cFUzTG1zdzRrb05BNkI1SzA0bFlGLTkybUJqbU81YWMifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cy10b2tlbi1wcmI2ciIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjA1NGRlZjliLTAwZWMtNDU5YS1iNzdmLWM4MDk3ODFmNmI3YSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpvcGVuc2hpZnQtbW9uaXRvcmluZzpwcm9tZXRoZXVzLWs4cyJ9.Dd4fku6-TVu6f60JXWdH7mlwPU-woxq_xx0jjIFcEI5O0zl82E5irkNaY-yTrpjrLHV-vrn1kdERJNLmo5t-b79YSHP6g2kyB9bDzXgioot1gnuA780jRE7KcDOgOuQqUfQU2OGajN-bqDdEYe80gfgOqVBLy00MFEpvssrqR5grsOCnpRWk0X1DbIykYrE8JVullB27AEIvmBZ4_PWhnDJgENHDoVvZfAdx9M9jgNkQMhviS6N-kQD9_htnIUxRFACPihS2IYQfhwyygFEWJ2h34hxvYYzorSt8In-DHJkjMo_iCScHUSvfQuFZWHkoNmL_VEsJPflfv_nyQ60nwA" -k https://10.131.0.3:9154/metrics | grep healthcheck
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 30796    0 30796    0     0   989k      0 --:--:-- --:--:--
# HELP coredns_forward_healthcheck_failure_count_total Counter of the number of failed healthchecks.
# TYPE coredns_forward_healthcheck_failure_count_total counter
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 4061
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 1344

5. Verify that the coredns_forward_healthcheck_failure_count_total counters are increasing:

sh-4.2$ curl -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6ImkxTE44SWJKVC12cFUzTG1zdzRrb05BNkI1SzA0bFlGLTkybUJqbU81YWMifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cy10b2tlbi1wcmI2ciIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjA1NGRlZjliLTAwZWMtNDU5YS1iNzdmLWM4MDk3ODFmNmI3YSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpvcGVuc2hpZnQtbW9uaXRvcmluZzpwcm9tZXRoZXVzLWs4cyJ9.Dd4fku6-TVu6f60JXWdH7mlwPU-woxq_xx0jjIFcEI5O0zl82E5irkNaY-yTrpjrLHV-vrn1kdERJNLmo5t-b79YSHP6g2kyB9bDzXgioot1gnuA780jRE7KcDOgOuQqUfQU2OGajN-bqDdEYe80gfgOqVBLy00MFEpvssrqR5grsOCnpRWk0X1DbIykYrE8JVullB27AEIvmBZ4_PWhnDJgENHDoVvZfAdx9M9jgNkQMhviS6N-kQD9_htnIUxRFACPihS2IYQfhwyygFEWJ2h34hxvYYzorSt8In-DHJkjMo_iCScHUSvfQuFZWHkoNmL_VEsJPflfv_nyQ60nwA" -k https://10.131.0.3:9154/metrics | grep healthcheck
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 30799    0 30799    0     0  1032k      0 --:--:-- --:--:-- --:--:-- 1074k
# HELP coredns_forward_healthcheck_failure_count_total Counter of the number of failed healthchecks.
# TYPE coredns_forward_healthcheck_failure_count_total counter
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 4063
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 1345

Note that the coredns_forward_healthcheck_broken_count_total metric should also have been triggered, but it was not because the forward plugin does not register it. [5] fixes this.
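The manual comparison between the scrapes in steps 4 and 5 can also be scripted. The following is a minimal sketch, assuming each sample line has the form `series value` with no trailing timestamp; `parse_samples` and `counter_deltas` are names made up for this example.

```python
def parse_samples(text):
    """Map each series (name plus labels) to its value, skipping # HELP / # TYPE lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


def counter_deltas(first_scrape, second_scrape):
    """Return the per-series increase between two scrapes."""
    first = parse_samples(first_scrape)
    second = parse_samples(second_scrape)
    return {s: second[s] - first[s] for s in first if s in second}


# Sample lines copied from the output in steps 4 and 5.
STEP_4 = """# TYPE coredns_forward_healthcheck_failure_count_total counter
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 4061
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 1344
"""
STEP_5 = """# TYPE coredns_forward_healthcheck_failure_count_total counter
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 4063
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 1345
"""

deltas = counter_deltas(STEP_4, STEP_5)
increasing = all(delta > 0 for delta in deltas.values())
for series, delta in deltas.items():
    print(series, delta)
print("all counters increasing:", increasing)
```

With the values above, the counter for 216.239.32.1:53 increases by 2 and the counter for 216.239.32.2:53 increases by 1 between the two scrapes.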

Comment 3 Andrew McDermott 2020-07-30 09:58:10 UTC
I'm adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 4 Daneyon Hansen 2020-08-06 21:25:33 UTC
https://github.com/openshift/cluster-dns-operator/pull/185 adds alert rules for these metrics.

Comment 6 Daneyon Hansen 2020-08-19 20:38:24 UTC
Adding alerts for health check metrics (https://github.com/povilasv/coredns-mixin/pull/6) has been moved into a separate BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1870354. Moving this BZ to ON_QA so the addition of metric coredns_forward_healthcheck_broken_count_total can be tested.

Comment 7 Arvind iyengar 2020-08-21 09:17:07 UTC
The patch has been tested in release version "4.6.0-0.nightly-2020-08-20-234448". The "coredns_forward_healthcheck_broken_count_total" metric is now triggered correctly, and the data is also populated in the Prometheus UI:
-----
sh-4.2# for i in {1..10}; do curl -sS -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6ImRPN2g4SXZqSXJScUQ2bGtEUEh0VXI5S2F3aXFFWktTUXh2UU5Zb1J3QTQifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cy10b2tlbi00Z3MyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImRjNGMxN2M3LWM4ZWUtNDA1NC04Njg3LTA4MTVjMzNhYWFhZCIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpvcGVuc2hpZnQtbW9uaXRvcmluZzpwcm9tZXRoZXVzLWs4cyJ9.kW3f2m4luvyLr5oWqCAcKIe6nVWXvkuYBTg6tq3g3QnAexksbwxiTCULiFwjMpS5AwbudI9P0nzBdu-dBLmF1VWtLEgml3_S-GbtPeXBelnxVPEfCMWWPEe4UNl2p-XkxfUV3OAP_4KYubhbg940us_cdZXVZrct57Tdu0kCxaAVNV4wUsrLghblAmXPOE1O5hnE34s0ATcYRF-AXLB9jxIXT42m7JyaylPkrY6hqP--6KfD_49klv1Ucknn_7wLdJIe-w9LohYfPIk4g4Of20F4LSNif2LRjiGbYs6VJxylxF5tFn1Ft1nGLF_JCF3f9-XoQCWR5IWHDWaBUVf1jg" -k https://10.128.2.3:9154/metrics  | grep -i healthcheck  | grep -iv "^#"; sleep 2; done
coredns_forward_healthcheck_broken_count_total 22
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 8047
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 8047
coredns_forward_healthcheck_broken_count_total 22
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 8049
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 8049
coredns_forward_healthcheck_broken_count_total 22
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 8050
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 8050
coredns_forward_healthcheck_broken_count_total 22
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 8051
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 8051
coredns_forward_healthcheck_broken_count_total 22
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 8053
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 8053
coredns_forward_healthcheck_broken_count_total 22
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 8054
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 8054
coredns_forward_healthcheck_broken_count_total 22
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 8055
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 8055
coredns_forward_healthcheck_broken_count_total 22
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 8057
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 8057
coredns_forward_healthcheck_broken_count_total 22
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 8058
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 8058
coredns_forward_healthcheck_broken_count_total 22
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 8059
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 8059
-----

Comment 8 Arvind iyengar 2020-08-21 09:18:22 UTC
Created attachment 1712152
Prometheus graph data from patched cluster version

Comment 11 errata-xmlrpc 2020-10-27 16:16:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

