Bug 1711364
| Summary: | One of the DNS pods continuously restarts, but the DNS operator's status is available | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | zhou ying <yinzhou> |
| Component: | Networking | Assignee: | Miciah Dashiel Butler Masters <mmasters> |
| Networking sub component: | DNS | QA Contact: | Hongan Li <hongli> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | low | CC: | aos-bugs, dhansen, dmace, mfisher |
| Version: | 4.1.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-16 06:29:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
zhou ying
2019-05-17 15:22:15 UTC
The dns resource status looks good (the DNS resource correctly reports Degraded=True and Progressing=True for the missing DaemonSet pod):
$ oc get -n openshift-dns-operator dns.operator.openshift.io -o yaml
apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
  kind: DNS
  metadata:
    creationTimestamp: "2019-05-17T05:06:38Z"
    finalizers:
    - dns.operator.openshift.io/dns-controller
    generation: 1
    name: default
    resourceVersion: "361812"
    selfLink: /apis/operator.openshift.io/v1/dnses/default
    uid: 8d4c58df-7861-11e9-9842-02a4275cc94e
  spec: {}
  status:
    clusterDomain: cluster.local
    clusterIP: 172.30.0.10
    conditions:
    - lastTransitionTime: "2019-05-17T18:56:04Z"
      message: Not all Nodes running DaemonSet pod
      reason: DaemonSetDegraded
      status: "True"
      type: Degraded
    - lastTransitionTime: "2019-05-17T18:56:04Z"
      message: 5 Nodes running a DaemonSet pod, want 6
      reason: Reconciling
      status: "True"
      type: Progressing
    - lastTransitionTime: "2019-05-17T05:34:02Z"
      message: Minimum number of Nodes running DaemonSet pod
      reason: AsExpected
      status: "True"
      type: Available
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
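As an aside, here is a minimal, standalone Go sketch (not the actual cluster-dns-operator code; the types and happy-path messages below are simplified stand-ins) of how per-DNS conditions like the ones above can be derived from the DaemonSet pod counts:

package main

import "fmt"

// Simplified stand-ins for the DaemonSet status fields the operator looks at
// and for the operator-style conditions shown in the dump above.
type daemonSetStatus struct {
	DesiredNumberScheduled int32 // nodes that should run a DNS pod (6 above)
	NumberAvailable        int32 // nodes actually running an available pod (5 above)
}

type condition struct {
	Type, Status, Reason, Message string
}

// computeDNSConditions mirrors the reported behaviour: Degraded/Progressing
// when not every node runs a DNS pod, Available as long as at least one does.
// The non-degraded messages are illustrative, not taken from the operator.
func computeDNSConditions(ds daemonSetStatus) []condition {
	conds := []condition{}

	if ds.NumberAvailable < ds.DesiredNumberScheduled {
		conds = append(conds,
			condition{"Degraded", "True", "DaemonSetDegraded", "Not all Nodes running DaemonSet pod"},
			condition{"Progressing", "True", "Reconciling",
				fmt.Sprintf("%d Nodes running a DaemonSet pod, want %d", ds.NumberAvailable, ds.DesiredNumberScheduled)})
	} else {
		conds = append(conds,
			condition{"Degraded", "False", "AsExpected", "All Nodes running DaemonSet pod"},
			condition{"Progressing", "False", "AsExpected", "Desired and available number of DaemonSet pods are equal"})
	}

	if ds.NumberAvailable > 0 {
		conds = append(conds, condition{"Available", "True", "AsExpected", "Minimum number of Nodes running DaemonSet pod"})
	} else {
		conds = append(conds, condition{"Available", "False", "NoDaemonSetPods", "No Nodes running DaemonSet pod"})
	}
	return conds
}

func main() {
	for _, c := range computeDNSConditions(daemonSetStatus{DesiredNumberScheduled: 6, NumberAvailable: 5}) {
		fmt.Printf("%-12s %-5s %s\n", c.Type, c.Status, c.Message)
	}
}

With DesiredNumberScheduled=6 and NumberAvailable=5, this prints Degraded=True, Progressing=True, and Available=True, matching the dump above.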
The clusteroperator status, however, seems to be misreporting Degraded=False given the nonzero number of unavailable DNS replicas:
$ oc get clusteroperators/dns -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-05-17T05:06:39Z"
  generation: 1
  name: dns
  resourceVersion: "20172"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/dns
  uid: 8db6e6b4-7861-11e9-9842-02a4275cc94e
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-05-17T05:34:02Z"
    message: All desired DNS DaemonSets available and operand Namespace exists
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2019-05-17T05:34:02Z"
    message: Desired and available number of DNS DaemonSets are equal
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-05-17T05:34:02Z"
    message: At least 1 DNS DaemonSet available
    reason: AsExpected
    status: "True"
    type: Available
  extension: null
  relatedObjects:
  - group: ""
    name: openshift-dns-operator
    resource: namespaces
  - group: ""
    name: openshift-dns
    resource: namespaces
  versions:
  - name: operator
    version: 4.1.0-0.nightly-2019-05-15-151517
  - name: coredns
    version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b3e6664558ae1d1a3b773673c5998f1239eccc3ade3b7b4f85aae4f86b54f390
  - name: openshift-cli
    version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:98995ecf1afb6121c0947d9d645dd0ce63b79c55650045db53a18e4ee8190a97
So, something is definitely going wrong with the operator's status reporting.
Separately, we need to understand why the failing CoreDNS pod can't communicate with the apiserver in the first place.
In addition, please note that cluster DNS services are still functional from the node on which the CoreDNS pod is failing — requests will be routed through the SDN to other CoreDNS pods on other nodes. However, there is necessarily some (unmeasured) performance impact during the outage.
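For illustration, here is a small, standalone Go sketch (an assumption-based example, not part of this bug's reproduction steps; it uses the cluster DNS Service IP 172.30.0.10 reported above and kubernetes.default.svc.cluster.local as an arbitrary well-known name) that checks resolution against the cluster DNS Service rather than any single CoreDNS pod:

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Query the cluster DNS Service IP directly (172.30.0.10 is the clusterIP
	// reported in the DNS resource status above). The Service load-balances
	// across all available CoreDNS endpoints, so resolution can keep working
	// even while one CoreDNS pod is crash-looping.
	resolver := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, "172.30.0.10:53")
		},
	}

	// Any in-cluster Service name would do here.
	addrs, err := resolver.LookupHost(context.Background(), "kubernetes.default.svc.cluster.local")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("resolved addresses:", addrs)
}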
Just a quick update on the underlying cause of the CoreDNS pod crash loop. It looks like containers in the CoreDNS pod's network namespace can't route to the apiserver IP address (172.30.0.1). There _might_ be something SDN (nftables) related here. The similar alertmanager container on the same node has no issues and has some possibly benign but visible nftables differences. I've attached some state dumps. It would be useful to get the SDN folks to take a quick look for anything obvious that stands out.

Looks like the three rules for the dns pod clusterIP correspond to the three ports defined on its service, so that's okay.

A DNS resource is considered "Available" as long as A) the Service has been assigned a ClusterIP and B) at least 1 DaemonSet pod reports a status of "Available". The clusteroperator/dns reports "Degraded" if A) the operand namespace does not exist, B) the number of DNS resources is 0, or C) the number of DNS resources reporting "Available" does not match the total number of DNS resources for the cluster. Should the operator "Degraded" condition instead be based on whether any DNS reports "Degraded", as opposed to using the "Available" DNS status condition? (A sketch contrasting the two approaches appears at the end of this report.)

Verified with 4.2.0-0.nightly-2019-06-25-003324; the issue has been fixed. Verification steps:

1. Add the following OVS rule to drop all traffic from one DNS pod to 172.30.0.1:

ovs-ofctl -O openflow13 add-flow br0 "table=20, priority=500,ip,in_port=10,nw_src=$dnsPodIP,nw_dst=172.30.0.1 actions=drop"

2. Kill the coredns process to force it to restart.

3. Check the DNS pod and its logs:

$ oc get pod -n openshift-dns
NAME                READY   STATUS    RESTARTS   AGE
dns-default-zfrq9   1/2     Running   7          98m

E0626 07:38:50.728491 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:317: Failed to list *v1.Endpoints: Get https://172.30.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout

4. Check the default DNS CR status:

$ oc get -n openshift-dns-operator dns.operator/default -o yaml
status:
  conditions:
  - lastTransitionTime: "2019-06-26T07:18:38Z"
    message: Not all Nodes running DaemonSet pod
    reason: DaemonSetDegraded
    status: "True"
    type: Degraded
  - lastTransitionTime: "2019-06-26T07:18:38Z"
    message: 5 Nodes running a DaemonSet pod, want 6
    reason: Reconciling
    status: "True"
    type: Progressing
  - lastTransitionTime: "2019-06-25T06:18:09Z"
    message: Minimum number of Nodes running DaemonSet pod
    reason: AsExpected
    status: "True"
    type: Available

5. Check clusteroperator/dns and ensure the status is "DEGRADED":

$ oc get co/dns
NAME   VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE
dns    4.2.0-0.nightly-2019-06-25-003324    True        True          True       25h

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
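As referenced above, here is a minimal, standalone Go sketch (simplified types, not the operator's actual implementation) contrasting the two ways the clusteroperator "Degraded" condition could be merged from the per-DNS conditions: counting DNSes that report "Available" (the pre-fix behaviour described above) versus propagating any DNS that reports "Degraded":

package main

import "fmt"

// Simplified stand-in for the per-DNS status conditions shown earlier in
// this report.
type dnsStatus struct {
	Name      string
	Available bool
	Degraded  bool
}

// degradedFromAvailable models the behaviour described above: the
// clusteroperator only reports Degraded when the number of DNSes reporting
// Available does not match the total (or there are no DNSes at all).
func degradedFromAvailable(dnses []dnsStatus) bool {
	if len(dnses) == 0 {
		return true
	}
	available := 0
	for _, d := range dnses {
		if d.Available {
			available++
		}
	}
	return available != len(dnses)
}

// degradedFromDegraded models the alternative raised in the question above:
// report Degraded whenever any DNS reports Degraded.
func degradedFromDegraded(dnses []dnsStatus) bool {
	if len(dnses) == 0 {
		return true
	}
	for _, d := range dnses {
		if d.Degraded {
			return true
		}
	}
	return false
}

func main() {
	// The situation from this bug: the single default DNS is Available
	// (at least one DaemonSet pod is running) but also Degraded (not all
	// nodes are running a pod).
	dnses := []dnsStatus{{Name: "default", Available: true, Degraded: true}}

	fmt.Println("Degraded (merged from Available):", degradedFromAvailable(dnses)) // false -> the misreport in this bug
	fmt.Println("Degraded (merged from Degraded): ", degradedFromDegraded(dnses))  // true  -> matches the verified post-fix behaviour
}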