Bug 1968415 - CoreDNSErrorsHigh alert generated without any issue in the cluster.
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: aos-network-edge-staff
QA Contact: Melvin Joseph
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-06-07 11:14 UTC by Vedanti Jaypurkar
Modified: 2022-10-27 17:33 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-27 17:32:12 UTC
Target Upstream Version:
Embargoed:



Comment 26 Miciah Dashiel Butler Masters 2022-01-06 18:48:20 UTC
Due to priorities and lack of capacity, we have still not been able to engage on this issue.  

Based on comment 25, it seems that at least in some cases the alert is legitimate, and that the application is sending spurious requests that elicit SERVFAIL responses from the upstream resolver.  

Comment 21 asks whether CoreDNS could be reporting NXDOMAIN errors as SERVFAIL errors.  I could be mistaken, but I don't believe this is the case.  As far as I can tell, CoreDNS's "forward" plugin returns a SERVFAIL response only if no upstream resolver can be reached or an upstream resolver itself returns a SERVFAIL response, and CoreDNS's "kubernetes" plugin returns a SERVFAIL response only if CoreDNS hasn't synced with the Kubernetes API (which should only need to happen once, when CoreDNS starts).  Could you check the "coredns_forward_responses_total" metric to see whether the "forward" plugin is reporting SERVFAIL responses?  
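
As a rough illustration of that check, the following sketch sums "coredns_forward_responses_total" samples by their "rcode" label from a CoreDNS metrics dump in the Prometheus text format.  The sample text, the upstream addresses, and the helper name `count_by_rcode` are made up for illustration; how the metrics text is obtained from a given cluster is left out.

```python
import re
from collections import defaultdict

# One sample line of the Prometheus text exposition format:
#   metric_name{label="value",...} 123
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def count_by_rcode(metrics_text, metric="coredns_forward_responses_total"):
    """Sum the samples of `metric` per rcode label (hypothetical helper)."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m is None or m.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        totals[labels.get("rcode", "unknown")] += float(m.group("value"))
    return dict(totals)


# Made-up metrics text; the upstream addresses and counts are assumptions.
SAMPLE = '''
# TYPE coredns_forward_responses_total counter
coredns_forward_responses_total{rcode="NOERROR",to="10.0.0.2:53"} 950
coredns_forward_responses_total{rcode="NXDOMAIN",to="10.0.0.2:53"} 40
coredns_forward_responses_total{rcode="SERVFAIL",to="10.0.0.2:53"} 7
coredns_forward_responses_total{rcode="SERVFAIL",to="10.0.0.3:53"} 3
'''

counts = count_by_rcode(SAMPLE)
print(counts.get("SERVFAIL", 0.0))
```

A nonzero SERVFAIL count here would suggest that the SERVFAIL responses originate from (or on the way to) an upstream resolver rather than from the "kubernetes" plugin.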

Note that OpenShift 4.10 adds an API to make it easier to increase CoreDNS's logging verbosity; cf. <https://github.com/openshift/enhancements/pull/931/files>.  This should make it easier to diagnose these sorts of issues.

Comment 27 Miciah Dashiel Butler Masters 2022-01-30 23:09:30 UTC
These two upstream changes may be relevant in that they improve the logging and metrics for certain errors:

> do not log NOERROR in log plugin when response is not available

> the log plugin logs NOERROR rcode in case of no response is written, this PR instead changes this to log placeholder ( - ), which at least does not mislead the reader of logs

https://github.com/coredns/coredns/pull/4725

> when no response is written, fallback to status of next plugin in prometheus plugin

> when no response is written from up the chain of plugins, the default value of dnstest.Recorder for rcode (0) is used as rcode reported to the coredns_dns_responses_total metric, which is misleading and wrong. This PR changes the behaviour that when no response is written, the return status of the next plugin is used.

https://github.com/coredns/coredns/pull/4727

We will ship a version of CoreDNS with these changes in OpenShift 4.10.0.

Comment 30 Miciah Dashiel Butler Masters 2022-06-03 17:13:43 UTC
There is an additional upstream change in CoreDNS that may be of interest to people following this BZ: "plugin/prometheus: write rcode properly to the metrics" <https://github.com/coredns/coredns/pull/5126>.  The related issue, <https://github.com/coredns/coredns/issues/5125>, is as follows: 

> Hello, after bump to the latest 1.8.7 CoreDNS we noticed that CoreDNS prometheus metric related to DNS responses (`coredns_dns_responses_total`) shows wrong rcode (label `rcode`). Even when resolution ends in NXDOMAIN, the metric shows it as NOERROR.  

OpenShift 4.11.0 will include CoreDNS 1.9.2, which includes <https://github.com/coredns/coredns/pull/5126>.  
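
To show why the rcode label matters for an alert like this one, here is a minimal sketch that computes the share of SERVFAIL responses between two snapshots of "coredns_dns_responses_total", summed per rcode.  The snapshot values and the helper name `servfail_ratio` are hypothetical, and the actual CoreDNSErrorsHigh rule expression and threshold are not quoted in this BZ, so this is only an approximation of what such a rule might evaluate.

```python
def servfail_ratio(before, after):
    """Return the share of SERVFAIL responses between two counter snapshots.

    `before` and `after` map rcode -> cumulative count, for example the
    per-rcode sums of coredns_dns_responses_total. Counter resets are
    not handled in this sketch.
    """
    deltas = {rcode: after[rcode] - before.get(rcode, 0) for rcode in after}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0  # no new responses between the snapshots
    return deltas.get("SERVFAIL", 0) / total


# Hypothetical snapshots taken a few minutes apart.
before = {"NOERROR": 1000, "SERVFAIL": 10}
after = {"NOERROR": 1900, "NXDOMAIN": 50, "SERVFAIL": 60}

ratio = servfail_ratio(before, after)
print(f"SERVFAIL share: {ratio:.2%}")
```

If a CoreDNS version records the wrong rcode for some responses, the per-rcode counts fed into a calculation like this are skewed, which is why the metrics fixes above are relevant when judging whether the alert is legitimate.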

Aside from that, I notice that the cases linked to this BZ are all closed now.  Please let me know if this BZ still requires attention.

Comment 35 Miciah Dashiel Butler Masters 2022-10-27 17:32:12 UTC
Closing per comment 30.  Please re-open the BZ or file a new bug if further attention is needed on the issue.

