Bug 2020489

Summary: coredns_dns metrics do not include custom zone data because the CoreDNS prometheus plugin is not defined in the custom server blocks
Product: OpenShift Container Platform
Component: Networking
Sub component: DNS
Version: 4.10
Target Release: 4.10.0
Hardware: Unspecified
OS: Linux
Severity: medium
Priority: medium
Status: CLOSED ERRATA
Reporter: Jie Wu <jiewu>
Assignee: Miciah Dashiel Butler Masters <mmasters>
QA Contact: Shudi Li <shudili>
CC: aos-bugs, hongli, mmasters
Type: Bug
Last Closed: 2022-03-10 16:25:33 UTC

Doc Type: Bug Fix
Doc Text:
Cause: The DNS operator did not enable the "prometheus" plugin in server blocks for custom upstream resolvers.
Consequence: CoreDNS did not report metrics for upstream resolvers and only reported metrics for the default server block.
Fix: The DNS operator was changed to enable the "prometheus" plugin in all server blocks.
Result: CoreDNS now reports Prometheus metrics for custom upstream resolvers.

Description Jie Wu 2021-11-05 03:53:28 UTC
Description of problem:
My test environment is 4.6.13, but all 4.x releases have this issue.

The 'coredns_dns_*' metrics, such as 'coredns_dns_request_count_total', do not include the custom zone data because the CoreDNS prometheus plugin is not defined in the custom zone server blocks.

The CoreDNS metrics plugin documentation (https://coredns.io/plugins/metrics/) states that prometheus enables Prometheus metrics and notes that 'This plugin can only be used once per Server Block', so every server block that should report metrics must declare the plugin itself.

Currently, the CoreDNS configuration defines the prometheus plugin in the '.' zone block, but the other zone blocks do not define it.
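
For reference, the fix amounts to putting a prometheus directive in every server block. A minimal sketch (placeholder zone and upstream values, not taken from this cluster):

example.com:5353 {
    forward . 203.0.113.1
    prometheus :9153
}
.:5353 {
    forward . /etc/resolv.conf
    prometheus :9153
}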

OpenShift release version:
4.x

Cluster Platform:
all

How reproducible:
Always; see the detailed steps below.

Steps to Reproduce (in detail):
1. Configure the DNS forwarding for the custom zones.
https://docs.openshift.com/container-platform/4.9/networking/dns-operator.html#nw-dns-forward_dns-operator

# oc edit dns.operator/default
apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  name: default
spec:
  servers:
  - name: foo-server 
    zones: 
      - foo.com
    forwardPlugin:
      upstreams: 
        - 1.1.1.1
        - 2.2.2.2:5353
  - name: bar-server
    zones:
      - bar.com
      - example.com
    forwardPlugin:
      upstreams:
        - 3.3.3.3
        - 4.4.4.4:5454

2. Open a shell in one of the OAuth pods and run DNS queries against the custom forwarding zones.
# oc -n openshift-authentication rsh oauth-openshift-xxxxxxxxxx
$ curl example.com
$ curl foo.com
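
The curl commands above are only used to trigger DNS resolution through the cluster resolver; any lookup against the custom zones works. For example, assuming the tools are present in the image (tooling varies by image):

$ getent hosts example.com
$ nslookup foo.com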

3. Open the Prometheus web console and run the following PromQL query:
sum by (zone) (coredns_dns_request_count_total)

Element	        Value
{zone="."}	804685

Actual results:
Only the '.' zone metrics are reported.

PromQL-> sum by (zone) (coredns_dns_request_count_total)
Element	        Value
{zone="."}	804685

# oc -n openshift-dns edit cm dns-default
  Corefile: |
    # foo-server
    foo.com:5353 {
        forward . 1.1.1.1 2.2.2.2:5353
    }
    # bar-server
    bar.com:5353 example.com:5353 {
        forward . 3.3.3.3 4.4.4.4:5454
    }
    .:5353 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            upstream
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            policy sequential
        }
        cache 30
        reload
    }

Expected results:
All zone metrics should be included.

PromQL-> sum by (zone) (coredns_dns_request_count_total)
Element	Value
{zone="."}	804685
{zone="example.com."}	4644
{zone="foo.com."}	10

# oc -n openshift-dns edit cm dns-default
  Corefile: |
    # foo-server
    foo.com:5353 {
        forward . 1.1.1.1 2.2.2.2:5353
        prometheus :9153    <- the prometheus plugin should be defined for this zone
    }
    # bar-server
    bar.com:5353 example.com:5353 {
        forward . 3.3.3.3 4.4.4.4:5454
        prometheus :9153    <- the prometheus plugin should be defined for this zone
    }
    .:5353 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            upstream
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            policy sequential
        }
        cache 30
        reload
    }

Impact of the problem:
Customers cannot get the custom zone metrics for data analysis and troubleshooting.

Additional info:
This issue can be verified by setting the DNS operator to unmanaged and modifying the CoreDNS configuration manually.

To set the DNS operator to unmanaged, change the spec of the ClusterVersion as shown below.
# oc edit clusterversion

Append the following overrides entry for the dns-operator:
spec:
  overrides:
  - group: apps/v1
    kind: Deployment
    name: dns-operator
    namespace: openshift-dns-operator
    unmanaged: true
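
Equivalently, the override can be applied non-interactively (a sketch; note that a merge patch replaces any existing overrides list wholesale):

# oc patch clusterversion version --type=merge -p '{"spec":{"overrides":[{"group":"apps/v1","kind":"Deployment","name":"dns-operator","namespace":"openshift-dns-operator","unmanaged":true}]}}'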

Scale the dns-operator deployment down to 0 replicas.
# oc -n openshift-dns-operator scale --replicas=0 deployment/dns-operator

Confirm that the dns-operator deployment has 0 replicas.
# oc -n  openshift-dns-operator  get deployment
NAME           READY   UP-TO-DATE   AVAILABLE   AGE
dns-operator   0/0     0            0           15d

Modify the default CoreDNS configuration.
# oc -n openshift-dns edit cm dns-default
  Corefile: |
    # foo-server
    foo.com:5353 {
        forward . 1.1.1.1 2.2.2.2:5353
        prometheus :9153    <- Add this line; the prometheus plugin will then collect metrics for this zone
    }
    # bar-server
    bar.com:5353 example.com:5353 {
        forward . 3.3.3.3 4.4.4.4:5454
        prometheus :9153    <- Add this line; the prometheus plugin will then collect metrics for this zone
    }
    .:5353 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            upstream
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            policy sequential
        }
        cache 30
        reload
    }
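
Because the reload plugin is enabled in the default server block, CoreDNS should pick up the ConfigMap change without a pod restart once the kubelet syncs the ConfigMap. One way to confirm (a sketch; substitute a real pod name):

# oc -n openshift-dns get pods
# oc -n openshift-dns logs dns-default-xxxxx -c dns | grep -i reload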


Comment 1 Miciah Dashiel Butler Masters 2021-11-10 17:09:26 UTC
Setting blocker- because this does not represent a regression or security issue.  However, we will work on this promptly to ensure we report full metrics for CoreDNS.

Comment 4 Shudi Li 2021-11-15 02:27:50 UTC
Verified it with 4.10.0-0.nightly-2021-11-14-184249 and passed

1.
% oc get clusterversion                         
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-11-14-184249   True        False         81m     Cluster version is 4.10.0-0.nightly-2021-11-14-184249
%

2. Configure the DNS forwarding for the two custom zones
% oc edit dns.operator/default
apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  name: default
spec:
  servers:
  - name: foo-server 
    zones: 
      - foo1.com
    forwardPlugin:
      upstreams: 
        - 1.1.1.1
        - 2.2.2.2:5353
  - name: bar-server
    zones:
      - bar2.com
      - example.com
    forwardPlugin:
      upstreams:
        - 3.3.3.3
        - 4.4.4.4:5454

3. Check cm dns-default; the prometheus plugin is now added by default:
% oc -n openshift-dns get cm dns-default -o yaml
apiVersion: v1
data:
  Corefile: |
    # foo-server
    foo1.com:5353 {
        prometheus 127.0.0.1:9153                     <---
        forward . 1.1.1.1 2.2.2.2:5353 {
            policy random
        }
        errors
        bufsize 512
        cache 900 {
            denial 9984 30
        }
    }
    # bar-server
    bar2.com:5353 example.com:5353 {
        prometheus 127.0.0.1:9153                  <---
        forward . 3.3.3.3 4.4.4.4:5454 {
            policy random
        }
        errors
        bufsize 512
        cache 900 {
            denial 9984 30
        }
    }
    .:5353 {
        bufsize 512
        errors
        health {
            lameduck 20s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus 127.0.0.1:9153
        ...

4. Send traffic to foo1.com and bar2.com from an oauth-openshift-xxx pod, as described in the bug

5. Check the coredns_dns metrics in the Prometheus web console by running the PromQL query:
sum by (zone) (coredns_dns_requests_total)
zone                value
bar2.com.            4
foo1.com.            4
.                    38273
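
For ongoing monitoring, the same metric also supports a per-zone request rate (standard PromQL; the 5-minute window is an arbitrary choice):

sum by (zone) (rate(coredns_dns_requests_total[5m]))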

Comment 7 errata-xmlrpc 2022-03-10 16:25:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056