+++ This bug was initially created as a clone of Bug #1860142 +++

Description of problem:
The CoreDNS forward plugin [1] performs health checking of upstream servers when an error is encountered. However, these health-check metrics are only exposed when a Corefile server block specifies the prometheus plugin, and the DNS Operator does not specify the prometheus plugin per server block.

Version-Release number of selected component (if applicable): 4.6

How reproducible: Always

Steps to Reproduce:
1. Create a cluster.
2. Update the dnses.operator.openshift.io/default resource by specifying a forwarding server [2]. To simulate a failure scenario, [2] should specify a failed DNS server.
3. rsh into a pod and perform an nslookup for a host in the zone specified in [2]. The lookup should fail.
4. curl the metrics endpoint of the server used in step 3.

Actual results:
No healthcheck metric is reported.

Expected results:
The 'coredns_forward_healthcheck_broken_total' metric is reported.

Additional info:
[1] https://coredns.io/plugins/forward/
[2] https://docs.openshift.com/container-platform/4.5/networking/dns-operator.html#nw-dns-forward_dns-operator

--- Additional comment from Daneyon Hansen on 2020-07-23 19:33:13 UTC ---

A similar issue was discovered while reproducing this BZ's main issue: the CoreDNS forward plugin does not register [3] the `coredns_forward_healthcheck_broken_total` metric [4]. [5] was created upstream. This PR should be cherry-picked to openshift/coredns and backported to 4.4 (the first release with DNS forwarding). To simulate this failure scenario, [2] should specify a failed DNS server as the first server and a functioning server as the second server, then complete steps 3-4 of the reproducer.
[3] https://github.com/coredns/coredns/blob/v1.6.6/plugin/forward/setup.go#L35
[4] https://github.com/coredns/coredns/blob/v1.6.6/plugin/forward/metrics.go#L36-L41
[5] https://github.com/coredns/coredns/pull/4021

--- Additional comment from Daneyon Hansen on 2020-07-23 20:34:20 UTC ---

After further analysis, only the bug identified in https://bugzilla.redhat.com/show_bug.cgi?id=1860142#c1 exists. Here are the reproducer steps indicating that https://bugzilla.redhat.com/show_bug.cgi?id=1860142#c0 is not a bug:

1. Update dnses.operator.openshift.io/default with a DNS forwarding configuration. Note that 216.239.32.1 and 216.239.32.2 are failed servers.

$ oc get dnses.operator.openshift.io/default -o yaml
apiVersion: operator.openshift.io/v1
kind: DNS
<SNIP>
spec:
  servers:
  - forwardPlugin:
      upstreams:
      - 216.239.32.1
      - 216.239.32.2
    name: google
    zones:
    - daneyon.com

2. Verify the Corefile configuration has been updated for the cluster DNS server (10.131.0.3) used for testing.

$ oc get po/dns-default-4df9v -n openshift-dns -o wide
NAME                READY   STATUS    RESTARTS   AGE     IP           NODE                                         NOMINATED NODE   READINESS GATES
dns-default-4df9v   3/3     Running   0          5h14m   10.131.0.3   ip-10-0-198-229.us-west-2.compute.internal   <none>           <none>

$ oc exec dns-default-4df9v -n openshift-dns -c dns -- cat /etc/coredns/Corefile
# google
daneyon.com:5353 {
    forward . 216.239.32.1 216.239.32.2
}
.:5353 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        upstream
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf {
        policy sequential
    }
    cache 30
    reload
}

3.
Get the token used for scraping the metrics endpoint:

$ oc serviceaccounts -n openshift-monitoring get-token prometheus-k8s
eyJhbGciOiJSUzI1NiIsImtpZCI6ImkxTE44SWJKVC12cFUzTG1zdzRrb05BNkI1SzA0bFlGLTkybUJqbU81YWMifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cy10b2tlbi1wcmI2ciIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjA1NGRlZjliLTAwZWMtNDU5YS1iNzdmLWM4MDk3ODFmNmI3YSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpvcGVuc2hpZnQtbW9uaXRvcmluZzpwcm9tZXRoZXVzLWs4cyJ9.Dd4fku6-TVu6f60JXWdH7mlwPU-woxq_xx0jjIFcEI5O0zl82E5irkNaY-yTrpjrLHV-vrn1kdERJNLmo5t-b79YSHP6g2kyB9bDzXgioot1gnuA780jRE7KcDOgOuQqUfQU2OGajN-bqDdEYe80gfgOqVBLy00MFEpvssrqR5grsOCnpRWk0X1DbIykYrE8JVullB27AEIvmBZ4_PWhnDJgENHDoVvZfAdx9M9jgNkQMhviS6N-kQD9_htnIUxRFACPihS2IYQfhwyygFEWJ2h34hxvYYzorSt8In-DHJkjMo_iCScHUSvfQuFZWHkoNmL_VEsJPflfv_nyQ60nwA

4. rsh into a pod, perform a DNS query using the server from step 2, and scrape the metrics endpoint.
$ oc rsh -n openshift-ingress router-default-78777b5df4-bp6gn
sh-4.2$ nslookup -port=5353 resume.daneyon.com 10.131.0.3
Server:		10.131.0.3
Address:	10.131.0.3#5353

** server can't find resume.daneyon.com: SERVFAIL

sh-4.2$ curl -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6ImkxTE44SWJKVC12cFUzTG1zdzRrb05BNkI1SzA0bFlGLTkybUJqbU81YWMifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cy10b2tlbi1wcmI2ciIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjA1NGRlZjliLTAwZWMtNDU5YS1iNzdmLWM4MDk3ODFmNmI3YSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpvcGVuc2hpZnQtbW9uaXRvcmluZzpwcm9tZXRoZXVzLWs4cyJ9.Dd4fku6-TVu6f60JXWdH7mlwPU-woxq_xx0jjIFcEI5O0zl82E5irkNaY-yTrpjrLHV-vrn1kdERJNLmo5t-b79YSHP6g2kyB9bDzXgioot1gnuA780jRE7KcDOgOuQqUfQU2OGajN-bqDdEYe80gfgOqVBLy00MFEpvssrqR5grsOCnpRWk0X1DbIykYrE8JVullB27AEIvmBZ4_PWhnDJgENHDoVvZfAdx9M9jgNkQMhviS6N-kQD9_htnIUxRFACPihS2IYQfhwyygFEWJ2h34hxvYYzorSt8In-DHJkjMo_iCScHUSvfQuFZWHkoNmL_VEsJPflfv_nyQ60nwA" -k https://10.131.0.3:9154/metrics | grep healthcheck
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 30796    0 30796    0     0   989k      0 --:--:-- --:--:-- --:--:--
# HELP coredns_forward_healthcheck_failure_count_total Counter of the number of failed healthchecks.
# TYPE coredns_forward_healthcheck_failure_count_total counter
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 4061
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 1344

5.
Verify that the coredns_forward_healthcheck_failure_count_total counters are increasing:

sh-4.2$ curl -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6ImkxTE44SWJKVC12cFUzTG1zdzRrb05BNkI1SzA0bFlGLTkybUJqbU81YWMifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cy10b2tlbi1wcmI2ciIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjA1NGRlZjliLTAwZWMtNDU5YS1iNzdmLWM4MDk3ODFmNmI3YSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpvcGVuc2hpZnQtbW9uaXRvcmluZzpwcm9tZXRoZXVzLWs4cyJ9.Dd4fku6-TVu6f60JXWdH7mlwPU-woxq_xx0jjIFcEI5O0zl82E5irkNaY-yTrpjrLHV-vrn1kdERJNLmo5t-b79YSHP6g2kyB9bDzXgioot1gnuA780jRE7KcDOgOuQqUfQU2OGajN-bqDdEYe80gfgOqVBLy00MFEpvssrqR5grsOCnpRWk0X1DbIykYrE8JVullB27AEIvmBZ4_PWhnDJgENHDoVvZfAdx9M9jgNkQMhviS6N-kQD9_htnIUxRFACPihS2IYQfhwyygFEWJ2h34hxvYYzorSt8In-DHJkjMo_iCScHUSvfQuFZWHkoNmL_VEsJPflfv_nyQ60nwA" -k https://10.131.0.3:9154/metrics | grep healthcheck
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 30799    0 30799    0     0  1032k      0 --:--:-- --:--:-- --:--:-- 1074k
# HELP coredns_forward_healthcheck_failure_count_total Counter of the number of failed healthchecks.
# TYPE coredns_forward_healthcheck_failure_count_total counter
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 4063
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 1345

Note that the coredns_forward_healthcheck_broken_total metric should also have been incremented, but is not, due to the missing metric registration addressed by [5].
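The increasing-counter check in step 5 can be sketched offline, without a cluster. The snippet below is a minimal illustration only (the scrape contents are hard-coded samples mirroring the two scrapes above, not live data): it compares each labeled counter across two scrapes and confirms every value increased.

```shell
# Compare two saved metric scrapes; confirm each failure counter increased.
# scrape1/scrape2 are hard-coded samples mirroring the reproducer output.
scrape1='coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 4061
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 1344'
scrape2='coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 4063
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 1345'

increasing=yes
# Each line is "<metric{labels}> <value>"; look up the same label in scrape2.
while read -r label v1; do
  v2=$(printf '%s\n' "$scrape2" | grep -F "$label" | awk '{print $2}')
  [ "$v2" -gt "$v1" ] || increasing=no
done <<EOF
$scrape1
EOF
echo "counters increasing: $increasing"
```

Against a live cluster one would capture the two scrapes with the curl command from step 4 a few seconds apart instead of hard-coding them.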
--- Additional comment from Andrew McDermott on 2020-07-30 09:58:10 UTC ---

I’m adding UpcomingSprint because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
The merge was made into the "4.5.0-0.nightly-2020-09-12-013926" release. With this version, the "coredns_forward_healthcheck_failure_count_total" metric increments properly:

-----
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-12-013926   True        False         163m    Cluster version is 4.5.0-0.nightly-2020-09-12-013926

$ oc -n openshift-ingress rsh router-default-f767cc9c4-fs6dj
sh-4.2$ for i in {1..10}; do curl -sS -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IjNPWjlWRGlvaklRTDhkeTI0eHlMa0ZINFZ4X3FMWTBTaHV6T2FIZGpkOVkifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cy10b2tlbi05a3R6dCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjI2ZDI5YWRiLWI1Y2ItNDAxYS04MTc3LWI0MTI0ODJlNWM0NSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpvcGVuc2hpZnQtbW9uaXRvcmluZzpwcm9tZXRoZXVzLWs4cyJ9.mzsNKq27p_QiviyKLgJBiDxjD1KoGXxg_d4r0HLUtmOdMc5xswuaka5gR6jicmQWIgYTcv5V_R_tJZkxYYaOOfnt52op56ZFxRZZke722Ba1hLxRdc3QGLsEDaYnRgT53_FPj-pUrWHmua4Y3LRxszp0W0lNTXlkD8L5h1OwmJHLB36oNJ8RDUfaQprHNgtRuk_qGOglU1firTxE-dm-QudHspkQz0xheX6SsdDqjn-dGFLQubO7r6JRSorxVC0FYLxyRauuGjqoizu6sXk2hIyIR-DnL0li1168GXWZ5xPg5NaBq4AClfIkcnd2VGofeLJmkDvOX5FqSKY4ynYW_g" -k https://10.131.0.2:9154/metrics | grep -i healthcheck | grep -iv "^#"; sleep 2; done
coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 973
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 972
coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 975
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 973
coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 976
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 975
coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 977
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 976
coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 979
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 977
coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 980
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 979
coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 982
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 980
coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 983
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 982
coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 984
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 983
coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 986
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 984
-----
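The key change visible in the verification output is that the broken-healthcheck metric now appears at all: before the registration fix in [5], the metric was never registered, so a scrape simply omitted it. A minimal sketch of that presence check, using a hard-coded sample (not live data) mirroring one loop iteration above:

```shell
# Check a saved scrape for the broken-healthcheck metric; a pre-fix build
# would omit this line entirely because the metric was never registered.
scrape='coredns_forward_healthcheck_broken_count_total 1
coredns_forward_healthcheck_failure_count_total{to="216.239.32.1:53"} 973
coredns_forward_healthcheck_failure_count_total{to="216.239.32.2:53"} 972'

if printf '%s\n' "$scrape" | grep -q '^coredns_forward_healthcheck_broken_count_total'; then
  result=registered
else
  result=missing
fi
echo "broken-healthcheck metric: $result"
```

Against a live cluster, the same grep applied to the curl output above distinguishes a fixed build (line present) from a pre-fix build (line absent).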
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5.11 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3719