Due to competing priorities and lack of capacity, we have still not been able to engage on this issue. Based on comment 25, it seems that at least in some cases the alert is legitimate, and that the application is sending spurious requests that elicit SERVFAIL responses from the upstream resolver.

Comment 21 asks whether CoreDNS could be reporting NXDOMAIN errors as SERVFAIL errors. I could be mistaken, but I don't believe this is the case. As far as I can tell, CoreDNS's "forward" plugin returns a SERVFAIL response only if no upstream resolver can be reached or if an upstream resolver returns a SERVFAIL response, and CoreDNS's "kubernetes" plugin returns a SERVFAIL response only if CoreDNS hasn't synced with the Kubernetes API (which should only need to happen once, when CoreDNS starts). Could you check the "coredns_forward_responses_total" metric to see whether the "forward" plugin is reporting SERVFAIL responses?

Note that OpenShift 4.10 adds an API to make it easier to increase CoreDNS's logging verbosity; cf. <https://github.com/openshift/enhancements/pull/931/files>. This should make it easier to diagnose these sorts of issues.
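In case it helps with checking "coredns_forward_responses_total", here is a minimal sketch that tallies a CoreDNS metric by rcode from a saved copy of the metrics output in Prometheus text format. The sample data below is made up for illustration, including the upstream addresses and counts, and the "rcode" and "to" label names are my assumption about how the "forward" plugin labels this metric; please adjust to what your cluster actually reports.

```python
import re
from collections import defaultdict

# Hypothetical sample of metrics output; the addresses, counts, and label
# names are assumptions for illustration, not data from this bug.
SAMPLE = '''\
# TYPE coredns_forward_responses_total counter
coredns_forward_responses_total{rcode="NOERROR",to="10.0.0.2:53"} 1200
coredns_forward_responses_total{rcode="SERVFAIL",to="10.0.0.2:53"} 7
coredns_forward_responses_total{rcode="NXDOMAIN",to="10.0.0.3:53"} 42
coredns_forward_responses_total{rcode="SERVFAIL",to="10.0.0.3:53"} 3
coredns_dns_responses_total{rcode="SERVFAIL",server="dns://:5353",zone="."} 10
'''

# Matches a labelled sample line: metric name, {labels}, then the value.
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def rcode_counts(text, metric="coredns_forward_responses_total"):
    """Sum the samples of `metric` per rcode label, across all other labels."""
    counts = defaultdict(float)
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m or m.group("name") != metric:
            continue  # skips comment lines and unrelated metrics
        labels = dict(LABEL_RE.findall(m.group("labels")))
        counts[labels.get("rcode", "unknown")] += float(m.group("value"))
    return dict(counts)


if __name__ == "__main__":
    for rcode, total in sorted(rcode_counts(SAMPLE).items()):
        print(f"{rcode}: {total:g}")
```

A nonzero SERVFAIL total for the "forward" metric would suggest that the SERVFAIL responses originate from, or on the way to, the upstream resolver rather than from the "kubernetes" plugin. Passing a different metric name as the second argument tallies that metric instead.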
These two upstream changes may be relevant in that they improve the logging and metrics for certain errors:

> do not log NOERROR in log plugin when response is not available
>
> the log plugin logs NOERROR rcode in case of no response is written, this PR instead changes this to log placeholder ( - ), which at least does not mislead the reader of logs

https://github.com/coredns/coredns/pull/4725

> when no response is written, fallback to status of next plugin in prometheus plugin
>
> when no response is written from up the chain of plugins, the default value of dnstest.Recorder for rcode (0) is used as rcode reported to the coredns_dns_responses_total metric, which is misleading and wrong. This PR changes the behaviour that when no response is written, the return status of the next plugin is used.

https://github.com/coredns/coredns/pull/4727

We will ship a version of CoreDNS with these changes in OpenShift 4.10.0.
There is an additional upstream change in CoreDNS that may be of interest to people following this BZ: "plugin/prometheus: write rcode properly to the metrics" <https://github.com/coredns/coredns/pull/5126>. The related issue, <https://github.com/coredns/coredns/issues/5125>, is as follows:

> Hello, after bump to the latest 1.8.7 CoreDNS we noticed that CoreDNS prometheus metric related to DNS responses (`coredns_dns_responses_total`) shows wrong rcode (label `rcode`). Even when resolution ends in NXDOMAIN, the metric shows it as NOERROR.

OpenShift 4.11.0 will include CoreDNS 1.9.2, which includes <https://github.com/coredns/coredns/pull/5126>.

Aside from that, I notice that the cases linked to this BZ are all closed now. Please let me know if this BZ still requires attention.
Closing per comment 30. Please re-open the BZ or file a new bug if further attention is needed on the issue.