Description of problem:
One of the dns pods continuously restarts with the error "Failed to list ... dial tcp 172.30.0.1:443: connect: no route to host", yet the DNS operator reports Available.

Version-Release number of selected component (if applicable):
Payload: 4.1.0-0.nightly-2019-05-15-151517

How reproducible:
Always

Steps to Reproduce:
1. Follow the doc https://docs.google.com/document/d/1ONkxdDmQVLBNJrSJymfKPrndo7b4vgCA2zwL9xHYx6A/edit to perform certificate recovery.
2. After the recovery is done, run the e2e test and check the cluster status.

Actual results:
2. The cluster can run the openshift/conformance e2e test, but one of the dns pods restarts continuously while the dns clusteroperator stays Available:

[yinzhou@192 Downloads]$ oc get po -n openshift-dns
NAME                READY   STATUS    RESTARTS   AGE
dns-default-5rl5v   2/2     Running   5          9h
dns-default-6tr68   2/2     Running   137        9h
dns-default-bdkjx   2/2     Running   2          9h
dns-default-d45g2   2/2     Running   2          9h
dns-default-k5rjp   2/2     Running   5          9h
dns-default-pfwsc   2/2     Running   7          9h

[yinzhou@192 Downloads]$ oc get clusteroperator dns
NAME   VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE
dns    4.1.0-0.nightly-2019-05-15-151517    True        False         False      9h

The failing container's logs on the node:

[root@ip-10-0-175-169 ~]# crictl ps -a
CONTAINER ID    IMAGE                                                              CREATED          STATE     NAME   ATTEMPT   POD ID
578dad58d9dc5   44ed977fdb334e53eedbad02a1fb51e9a6618e3208954ae72a1493c0ecf2f195   12 seconds ago   Running   dns    129       938d100592881

[root@ip-10-0-175-169 ~]# crictl logs -f 578dad58d9dc5
E0517 14:38:03.733441       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:315: Failed to list *v1.Service: Get https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0517 14:38:03.733521       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:322: Failed to list *v1.Namespace: Get https://172.30.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0517 14:38:03.733446       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:317: Failed to list *v1.Endpoints: Get https://172.30.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0517 14:38:07.829383       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:315: Failed to list *v1.Service: Get https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0517 14:38:07.829419       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:322: Failed to list *v1.Namespace: Get https://172.30.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0517 14:38:07.829383       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:317: Failed to list *v1.Endpoints: Get https://172.30.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
.:5353
2019-05-17T14:38:08.457Z [INFO] CoreDNS-1.3.1
2019-05-17T14:38:08.457Z [INFO] linux/amd64, go1.10.8,
 CoreDNS-1.3.1
 linux/amd64, go1.10.8,

Expected results:
2. The pod should not restart. Or, failing that, the clusteroperator's status should be "DEGRADED".

Additional info:
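As a quick triage step (commands assumed, not from the original report), one can confirm from inside the failing pod's dns container that the apiserver service IP is unreachable. The pod name and the presence of a shell and curl inside the CoreDNS image are assumptions here:

$ oc -n openshift-dns exec dns-default-6tr68 -c dns -- \
    sh -c 'curl -sk -o /dev/null -w "%{http_code}\n" --max-time 5 https://172.30.0.1:443/healthz'
# A healthy pod prints an HTTP status (200, or 403 if unauthenticated access is denied);
# the failing pod should time out or report "no route to host", matching the reflector errors above.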
The dns resource status looks good:

$ oc get -n openshift-dns-operator dns.operator.openshift.io -o yaml
apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
  kind: DNS
  metadata:
    creationTimestamp: "2019-05-17T05:06:38Z"
    finalizers:
    - dns.operator.openshift.io/dns-controller
    generation: 1
    name: default
    resourceVersion: "361812"
    selfLink: /apis/operator.openshift.io/v1/dnses/default
    uid: 8d4c58df-7861-11e9-9842-02a4275cc94e
  spec: {}
  status:
    clusterDomain: cluster.local
    clusterIP: 172.30.0.10
    conditions:
    - lastTransitionTime: "2019-05-17T18:56:04Z"
      message: Not all Nodes running DaemonSet pod
      reason: DaemonSetDegraded
      status: "True"
      type: Degraded
    - lastTransitionTime: "2019-05-17T18:56:04Z"
      message: 5 Nodes running a DaemonSet pod, want 6
      reason: Reconciling
      status: "True"
      type: Progressing
    - lastTransitionTime: "2019-05-17T05:34:02Z"
      message: Minimum number of Nodes running DaemonSet pod
      reason: AsExpected
      status: "True"
      type: Available
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

The operator status seems to be misreporting Degraded=False given the nonzero unavailable dns replicas:

$ oc get clusteroperators/dns -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-05-17T05:06:39Z"
  generation: 1
  name: dns
  resourceVersion: "20172"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/dns
  uid: 8db6e6b4-7861-11e9-9842-02a4275cc94e
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-05-17T05:34:02Z"
    message: All desired DNS DaemonSets available and operand Namespace exists
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2019-05-17T05:34:02Z"
    message: Desired and available number of DNS DaemonSets are equal
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-05-17T05:34:02Z"
    message: At least 1 DNS DaemonSet available
    reason: AsExpected
    status: "True"
    type: Available
  extension: null
  relatedObjects:
  - group: ""
    name: openshift-dns-operator
    resource: namespaces
  - group: ""
    name: openshift-dns
    resource: namespaces
  versions:
  - name: operator
    version: 4.1.0-0.nightly-2019-05-15-151517
  - name: coredns
    version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b3e6664558ae1d1a3b773673c5998f1239eccc3ade3b7b4f85aae4f86b54f390
  - name: openshift-cli
    version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:98995ecf1afb6121c0947d9d645dd0ce63b79c55650045db53a18e4ee8190a97

So, something's definitely going wrong with the operator's status reporting. Separately, we need to understand why the failing CoreDNS pod can't communicate with the apiserver in the first place. In addition, please note that cluster DNS services are still functional from the node on which the CoreDNS pod is failing; requests will be routed through the SDN to other CoreDNS pods on other nodes. However, there is necessarily some (unmeasured) performance impact during the outage.
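For reference, the mismatch the operator should be reacting to is visible directly in the DaemonSet status. The fields below are standard DaemonSet status fields, though this particular check is my own suggestion rather than something taken from the report:

$ oc -n openshift-dns get daemonset dns-default \
    -o jsonpath='{.status.desiredNumberScheduled} desired, {.status.numberAvailable} available{"\n"}'
# In the state captured above this would print "6 desired, 5 available",
# which is exactly what dns.operator/default's Degraded=True condition reports.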
Just a quick update on the underlying cause of the CoreDNS pod crash loop. It looks like containers in the CoreDNS pod's network namespace can't route to the apiserver service IP (172.30.0.1). There _might_ be something SDN (nftables) related here. A similar alertmanager container on the same node has no issues, but it shows some possibly benign yet visible nftables differences. I've attached some state dumps. It would be useful to get the SDN folks to take a quick look for anything obvious that stands out.
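For anyone repeating the investigation, here is a rough sketch of how state like this can be collected on the affected node; the exact commands and the CRI-O inspect output layout are assumptions on my part, not the precise steps used for the attachments:

# Capture the host firewall and OVS flow state.
iptables-save > /tmp/iptables.dump
ovs-ofctl -O OpenFlow13 dump-flows br0 > /tmp/ovs-flows.dump

# Resolve the CoreDNS pod's network namespace and test routing to the apiserver service IP from inside it.
POD_ID=$(crictl pods --name dns-default -q | head -n 1)
NETNS=$(crictl inspectp "$POD_ID" | grep -o '/var/run/netns/[^"]*' | head -n 1)
nsenter --net="$NETNS" ip route get 172.30.0.1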
Looks like the three rules for the DNS service's ClusterIP correspond to the three ports defined on the service, so that's okay.
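To cross-check that (commands assumed, not taken from the attachments): the dns-default service exposes three ports, so three service entries for its ClusterIP (172.30.0.10) are expected in the node's NAT table:

$ oc -n openshift-dns get svc dns-default \
    -o jsonpath='{range .spec.ports[*]}{.name} {.protocol}/{.port}{"\n"}{end}'
# On the node, each listed port should have a matching rule for the ClusterIP.
iptables-save -t nat | grep 172.30.0.10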
A DNS resource is considered "Available" as long as A) the Service has been assigned a ClusterIP and B) at least 1 DaemonSet pod reports a status of "Available". The clusteroperator/dns reports "Degraded" if A) the operand namespace does not exist, B) the number of DNS resources is 0, or C) the number of DNS resources reporting "Available" does not match the total number of DNS resources for the cluster. Should the operator's "Degraded" condition instead be based on whether any DNS resource reports "Degraded", as opposed to using the "Available" DNS status condition?
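As a concrete illustration of the mismatch being discussed (assuming jsonpath filter support in the installed oc), the per-DNS condition and the clusteroperator condition can be compared side by side; on the broken cluster the first command prints True while the second prints False:

$ oc get dns.operator/default -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}{"\n"}'
$ oc get clusteroperator/dns -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}{"\n"}'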
Verified with 4.2.0-0.nightly-2019-06-25-003324; the issue has been fixed.

1. Add the below ovs rule to drop all traffic from one DNS pod to 172.30.0.1:

ovs-ofctl -O openflow13 add-flow br0 "table=20, priority=500,ip,in_port=10,nw_src=$dnsPodIP,nw_dst=172.30.0.1 actions=drop"

2. Kill the coredns process to force it to restart.

3. Check the dns pod and logs:

$ oc get pod -n openshift-dns
NAME                READY   STATUS    RESTARTS   AGE
dns-default-zfrq9   1/2     Running   7          98m

E0626 07:38:50.728491       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:317: Failed to list *v1.Endpoints: Get https://172.30.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout

4. Check the default DNS CR status:

$ oc get -n openshift-dns-operator dns.operator/default -o yaml
status:
  conditions:
  - lastTransitionTime: "2019-06-26T07:18:38Z"
    message: Not all Nodes running DaemonSet pod
    reason: DaemonSetDegraded
    status: "True"
    type: Degraded
  - lastTransitionTime: "2019-06-26T07:18:38Z"
    message: 5 Nodes running a DaemonSet pod, want 6
    reason: Reconciling
    status: "True"
    type: Progressing
  - lastTransitionTime: "2019-06-25T06:18:09Z"
    message: Minimum number of Nodes running DaemonSet pod
    reason: AsExpected
    status: "True"
    type: Available

5. Check clusteroperator/dns and ensure the status is "DEGRADED":

$ oc get co/dns
NAME   VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE
dns    4.2.0-0.nightly-2019-06-25-003324    True        True          True       25h
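One note for anyone repeating this verification (cleanup step assumed, not part of the original comment): the injected drop flow should be removed afterwards so that node's DNS pod can reach the apiserver again.

# Remove the drop rule added for the test; $dnsPodIP is the same variable
# used when the flow was installed in step 1.
ovs-ofctl -O openflow13 del-flows br0 "table=20,ip,in_port=10,nw_src=$dnsPodIP,nw_dst=172.30.0.1"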
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922