+++ This bug was initially created as a clone of Bug #1860142 +++

Description of problem:
The CoreDNS forward plugin [1] performs health checking of upstream servers when an error is encountered. However, these health-check metrics are only exposed when a Corefile server block specifies the prometheus plugin, and the DNS Operator does not specify the prometheus plugin per server block.

Version-Release number of selected component (if applicable): 4.6

How reproducible: Always

Steps to Reproduce:
1. Create a cluster.
2. Update the dnses.operator.openshift.io/default resource by specifying a forwarding server [2]. To simulate a failure scenario, [2] should specify a failed DNS server.
3. rsh into a pod and perform an nslookup for a host in the zone specified in [2]. The lookup should fail.
4. curl the metrics endpoint of the server used in step 3.

Actual results:
No healthcheck metric is reported.

Expected results:
The 'coredns_forward_healthcheck_broken_total' metric is reported.

Additional info:
[1] https://coredns.io/plugins/forward/
[2] https://docs.openshift.com/container-platform/4.5/networking/dns-operator.html#nw-dns-forward_dns-operator

--- Additional comment from Daneyon Hansen on 2020-07-23 19:33:13 UTC ---

A similar issue was discovered while reproducing this BZ's main issue: the CoreDNS forward plugin does not register [3] the `coredns_forward_healthcheck_broken_total` metric [4]. [5] was created upstream. This PR should be cherry-picked to openshift/coredns and backported to 4.4 (the first release with DNS forwarding). To simulate this failure scenario, [2] should specify a failed DNS server as the first server and a functioning server as the second server, then complete steps 3-4 of the reproducer.
[3] https://github.com/coredns/coredns/blob/v1.6.6/plugin/forward/setup.go#L35
[4] https://github.com/coredns/coredns/blob/v1.6.6/plugin/forward/metrics.go#L36-L41
[5] https://github.com/coredns/coredns/pull/4021

--- Additional comment from Daneyon Hansen on 2020-07-23 20:34:20 UTC ---

After further analysis, only the bug identified in https://bugzilla.redhat.com/show_bug.cgi?id=1860142#c1 exists. Here are the reproducer steps indicating that https://bugzilla.redhat.com/show_bug.cgi?id=1860142#c0 is not a bug:

1. Update dnses.operator.openshift.io/default with a DNS forwarding configuration. Note that 216.239.32.1 and 216.239.32.2 are failed servers.

$ oc get dnses.operator.openshift.io/default -o yaml
apiVersion: operator.openshift.io/v1
kind: DNS
<SNIP>
spec:
  servers:
  - forwardPlugin:
      upstreams:
      - 216.239.32.1
      - 216.239.32.2
    name: google
    zones:
    - daneyon.com

2. Verify the Corefile configuration has been updated for the cluster DNS server (10.131.0.3) used for testing.

$ oc get po/dns-default-4df9v -n openshift-dns -o wide
NAME                READY   STATUS    RESTARTS   AGE     IP           NODE                                         NOMINATED NODE   READINESS GATES
dns-default-4df9v   3/3     Running   0          5h14m   10.131.0.3   ip-10-0-198-229.us-west-2.compute.internal   <none>           <none>

$ oc exec dns-default-4df9v -n openshift-dns -c dns -- cat /etc/coredns/Corefile
# google
daneyon.com:5353 {
    forward . 216.239.32.1 216.239.32.2
}
.:5353 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        upstream
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf {
        policy sequential
    }
    cache 30
    reload
}

3.
Get the token used for scraping the metrics endpoint:

$ oc serviceaccounts -n openshift-monitoring get-token prometheus-k8s
eyJhbGciOiJSUzI1NiIsImtpZCI6ImkxTE44SWJKVC12cFUzTG1zdzRrb05BNkI1SzA0bFlGLTkybUJqbU81YWMifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cy10b2tlbi1wcmI2ciIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjA1NGRlZjliLTAwZWMtNDU5YS1iNzdmLWM4MDk3ODFmNmI3YSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpvcGVuc2hpZnQtbW9uaXRvcmluZzpwcm9tZXRoZXVzLWs4cyJ9.Dd4fku6-TVu6f60JXWdH7mlwPU-woxq_xx0jjIFcEI5O0zl82E5irkNaY-yTrpjrLHV-vrn1kdERJNLmo5t-b79YSHP6g2kyB9bDzXgioot1gnuA780jRE7KcDOgOuQqUfQU2OGajN-bqDdEYe80gfgOqVBLy00MFEpvssrqR5grsOCnpRWk0X1DbIykYrE8JVullB27AEIvmBZ4_PWhnDJgENHDoVvZfAdx9M9jgNkQMhviS6N-kQD9_htnIUxRFACPihS2IYQfhwyygFEWJ2h34hxvYYzorSt8In-DHJkjMo_iCScHUSvfQuFZWHkoNmL_VEsJPflfv_nyQ60nwA

4. rsh into a pod, perform a DNS query using the server from step 2, and scrape the metrics endpoint.
$ oc rsh -n openshift-ingress router-default-78777b5df4-bp6gn
sh-4.2$ nslookup -port=5353 resume.daneyon.com 10.131.0.3
Server:		10.131.0.3
Address:	10.131.0.3#5353

** server can't find resume.daneyon.com: SERVFAIL

sh-4.2$ curl -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6ImkxTE44SWJKVC12cFUzTG1zdzRrb05BNkI1SzA0bFlGLTkybUJqbU81YWMifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cy10b2tlbi1wcmI2ciIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjA1NGRlZjliLTAwZWMtNDU5YS1iNzdmLWM4MDk3ODFmNmI3YSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpvcGVuc2hpZnQtbW9uaXRvcmluZzpwcm9tZXRoZXVzLWs4cyJ9.Dd4fku6-TVu6f60JXWdH7mlwPU-woxq_xx0jjIFcEI5O0zl82E5irkNaY-yTrpjrLHV-vrn1kdERJNLmo5t-b79YSHP6g2kyB9bDzXgioot1gnuA780jRE7KcDOgOuQqUfQU2OGajN-bqDdEYe80gfgOqVBLy00MFEpvssrqR5grsOCnpRWk0X1DbIykYrE8JVullB27AEIvmBZ4_PWhnDJgENHDoVvZfAdx9M9jgNkQMhviS6N-kQD9_htnIUxRFACPihS2IYQfhwyygFEWJ2h34hxvYYzorSt8In-DHJkjMo_iCScHUSvfQuFZWHkoNmL_VEsJPflfv_nyQ60nwA" -k https://10.131.0.3:9154/metrics | grep healthcheck
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 30796    0 30796    0     0   989k      0 --:--:-- --:--:-- --:--:--
# HELP coredns_forward_healthcheck_failure_count_total Counter of the number of failed healthchecks.
# TYPE coredns_forward_healthcheck_failure_count_total counter
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 4061
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 1344

5.
Verify that the coredns_forward_healthcheck_failure_count_total counters are increasing:

sh-4.2$ curl -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6ImkxTE44SWJKVC12cFUzTG1zdzRrb05BNkI1SzA0bFlGLTkybUJqbU81YWMifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cy10b2tlbi1wcmI2ciIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjA1NGRlZjliLTAwZWMtNDU5YS1iNzdmLWM4MDk3ODFmNmI3YSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpvcGVuc2hpZnQtbW9uaXRvcmluZzpwcm9tZXRoZXVzLWs4cyJ9.Dd4fku6-TVu6f60JXWdH7mlwPU-woxq_xx0jjIFcEI5O0zl82E5irkNaY-yTrpjrLHV-vrn1kdERJNLmo5t-b79YSHP6g2kyB9bDzXgioot1gnuA780jRE7KcDOgOuQqUfQU2OGajN-bqDdEYe80gfgOqVBLy00MFEpvssrqR5grsOCnpRWk0X1DbIykYrE8JVullB27AEIvmBZ4_PWhnDJgENHDoVvZfAdx9M9jgNkQMhviS6N-kQD9_htnIUxRFACPihS2IYQfhwyygFEWJ2h34hxvYYzorSt8In-DHJkjMo_iCScHUSvfQuFZWHkoNmL_VEsJPflfv_nyQ60nwA" -k https://10.131.0.3:9154/metrics | grep healthcheck
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 30799    0 30799    0     0  1032k      0 --:--:-- --:--:-- --:--:-- 1074k
# HELP coredns_forward_healthcheck_failure_count_total Counter of the number of failed healthchecks.
# TYPE coredns_forward_healthcheck_failure_count_total counter
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 4063
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 1345

Note that the coredns_forward_healthcheck_broken_total metric should also have been incremented, but is not, due to the missing metric registration addressed by [5].
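The increasing-counter check in step 5 can be sketched offline, without a cluster. The snippet below is a minimal illustration only (the scrape contents are hard-coded samples mirroring the two scrapes above, not live data): it compares each labeled counter across two scrapes and confirms every value increased.

```shell
# Compare two saved metric scrapes; confirm each failure counter increased.
# scrape1/scrape2 are hard-coded samples mirroring the reproducer output.
scrape1='coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 4061
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 1344'
scrape2='coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 4063
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 1345'

increasing=yes
# Each line is "<metric{labels}> <value>"; look up the same label in scrape2.
while read -r label v1; do
  v2=$(printf '%s\n' "$scrape2" | grep -F "$label" | awk '{print $2}')
  [ "$v2" -gt "$v1" ] || increasing=no
done <<EOF
$scrape1
EOF
echo "counters increasing: $increasing"
```

Against a live cluster one would capture the two scrapes with the curl command from step 4 a few seconds apart instead of hard-coding them.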
--- Additional comment from Andrew McDermott on 2020-07-30 09:58:10 UTC ---

I’m adding UpcomingSprint because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
The merge was made into the "4.5.0-0.nightly-2020-09-12-013926" release. With this version, the "coredns_forward_healthcheck_failure_count_total" metric increments properly:

-----
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-12-013926   True        False         163m    Cluster version is 4.5.0-0.nightly-2020-09-12-013926

$ oc -n openshift-ingress rsh router-default-f767cc9c4-fs6dj
sh-4.2$ for i in {1..10}; do curl -sS -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IjNPWjlWRGlvaklRTDhkeTI0eHlMa0ZINFZ4X3FMWTBTaHV6T2FIZGpkOVkifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cy10b2tlbi05a3R6dCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjI2ZDI5YWRiLWI1Y2ItNDAxYS04MTc3LWI0MTI0ODJlNWM0NSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpvcGVuc2hpZnQtbW9uaXRvcmluZzpwcm9tZXRoZXVzLWs4cyJ9.mzsNKq27p_QiviyKLgJBiDxjD1KoGXxg_d4r0HLUtmOdMc5xswuaka5gR6jicmQWIgYTcv5V_R_tJZkxYYaOOfnt52op56ZFxRZZke722Ba1hLxRdc3QGLsEDaYnRgT53_FPj-pUrWHmua4Y3LRxszp0W0lNTXlkD8L5h1OwmJHLB36oNJ8RDUfaQprHNgtRuk_qGOglU1firTxE-dm-QudHspkQz0xheX6SsdDqjn-dGFLQubO7r6JRSorxVC0FYLxyRauuGjqoizu6sXk2hIyIR-DnL0li1168GXWZ5xPg5NaBq4AClfIkcnd2VGofeLJmkDvOX5FqSKY4ynYW_g" -k https://10.131.0.2:9154/metrics | grep -i healthcheck | grep -iv "^#"; sleep 2; done
coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 973
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 972
coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 975
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 973
coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 976
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 975
coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 977
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 976
coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 979
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 977
coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 980
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 979
coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 982
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 980
coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 983
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 982
coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 984
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 983
coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 986
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 984
-----
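The key change visible in the verification output is that the broken-healthcheck metric now appears at all: before the registration fix in [5], the metric was never registered, so a scrape simply omitted it. A minimal sketch of that presence check, using a hard-coded sample (not live data) mirroring one loop iteration above:

```shell
# Check a saved scrape for the broken-healthcheck metric; a pre-fix build
# would omit this line entirely because the metric was never registered.
scrape='coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 973
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 972'

if printf '%s\n' "$scrape" | grep -q '^coredns_forward_healthcheck_broken_count_total'; then
  result=registered
else
  result=missing
fi
echo "broken-healthcheck metric: $result"
```

Against a live cluster, the same grep applied to the curl output above distinguishes a fixed build (line present) from a pre-fix build (line absent).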
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5.11 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3719