Description of problem:
One of the dns pods continuously restarts with the error "Failed to list ... dial tcp 172.30.0.1:443: connect: no route to host", yet the DNS operator reports Available.

Version-Release number of selected component (if applicable):
Payload: 4.1.0-0.nightly-2019-05-15-151517

How reproducible:
Always

Steps to Reproduce:
1. Follow the doc https://docs.google.com/document/d/1ONkxdDmQVLBNJrSJymfKPrndo7b4vgCA2zwL9xHYx6A/edit to perform certificate recovery.
2. After the recovery is done, run the e2e test and check the cluster status.

Actual results:
2. The cluster can run the openshift/conformance e2e test, but one of the dns pods restarts continuously while the dns clusteroperator stays Available:

[yinzhou@192 Downloads]$ oc get po -n openshift-dns
NAME                READY   STATUS    RESTARTS   AGE
dns-default-5rl5v   2/2     Running   5          9h
dns-default-6tr68   2/2     Running   137        9h
dns-default-bdkjx   2/2     Running   2          9h
dns-default-d45g2   2/2     Running   2          9h
dns-default-k5rjp   2/2     Running   5          9h
dns-default-pfwsc   2/2     Running   7          9h

[yinzhou@192 Downloads]$ oc get clusteroperator dns
NAME   VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE
dns    4.1.0-0.nightly-2019-05-15-151517    True        False         False      9h

The failing container's logs on the node:

[root@ip-10-0-175-169 ~]# crictl ps -a
CONTAINER ID    IMAGE                                                              CREATED          STATE     NAME   ATTEMPT   POD ID
578dad58d9dc5   44ed977fdb334e53eedbad02a1fb51e9a6618e3208954ae72a1493c0ecf2f195   12 seconds ago   Running   dns    129       938d100592881

[root@ip-10-0-175-169 ~]# crictl logs -f 578dad58d9dc5
E0517 14:38:03.733441       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:315: Failed to list *v1.Service: Get https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0517 14:38:03.733521       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:322: Failed to list *v1.Namespace: Get https://172.30.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0517 14:38:03.733446       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:317: Failed to list *v1.Endpoints: Get https://172.30.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0517 14:38:07.829383       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:315: Failed to list *v1.Service: Get https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0517 14:38:07.829419       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:322: Failed to list *v1.Namespace: Get https://172.30.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0517 14:38:07.829383       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:317: Failed to list *v1.Endpoints: Get https://172.30.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
.:5353
2019-05-17T14:38:08.457Z [INFO] CoreDNS-1.3.1
2019-05-17T14:38:08.457Z [INFO] linux/amd64, go1.10.8,
 CoreDNS-1.3.1
 linux/amd64, go1.10.8,

Expected results:
2. The pod should not restart. Or, failing that, the clusteroperator's status should be "DEGRADED".

Additional info:
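As a quick triage step (commands assumed, not from the original report), one can confirm from inside the failing pod's dns container that the apiserver service IP is unreachable. The pod name and the presence of a shell and curl inside the CoreDNS image are assumptions here:

$ oc -n openshift-dns exec dns-default-6tr68 -c dns -- \
    sh -c 'curl -sk -o /dev/null -w "%{http_code}\n" --max-time 5 https://172.30.0.1:443/healthz'
# A healthy pod prints an HTTP status (200, or 403 if unauthenticated access is denied);
# the failing pod should time out or report "no route to host", matching the reflector errors above.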
The dns resource status looks good:

$ oc get -n openshift-dns-operator dns.operator.openshift.io -o yaml
apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
  kind: DNS
  metadata:
    creationTimestamp: "2019-05-17T05:06:38Z"
    finalizers:
    - dns.operator.openshift.io/dns-controller
    generation: 1
    name: default
    resourceVersion: "361812"
    selfLink: /apis/operator.openshift.io/v1/dnses/default
    uid: 8d4c58df-7861-11e9-9842-02a4275cc94e
  spec: {}
  status:
    clusterDomain: cluster.local
    clusterIP: 172.30.0.10
    conditions:
    - lastTransitionTime: "2019-05-17T18:56:04Z"
      message: Not all Nodes running DaemonSet pod
      reason: DaemonSetDegraded
      status: "True"
      type: Degraded
    - lastTransitionTime: "2019-05-17T18:56:04Z"
      message: 5 Nodes running a DaemonSet pod, want 6
      reason: Reconciling
      status: "True"
      type: Progressing
    - lastTransitionTime: "2019-05-17T05:34:02Z"
      message: Minimum number of Nodes running DaemonSet pod
      reason: AsExpected
      status: "True"
      type: Available
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

The operator status seems to be misreporting Degraded=False given the nonzero unavailable dns replicas:

$ oc get clusteroperators/dns -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-05-17T05:06:39Z"
  generation: 1
  name: dns
  resourceVersion: "20172"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/dns
  uid: 8db6e6b4-7861-11e9-9842-02a4275cc94e
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-05-17T05:34:02Z"
    message: All desired DNS DaemonSets available and operand Namespace exists
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2019-05-17T05:34:02Z"
    message: Desired and available number of DNS DaemonSets are equal
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-05-17T05:34:02Z"
    message: At least 1 DNS DaemonSet available
    reason: AsExpected
    status: "True"
    type: Available
  extension: null
  relatedObjects:
  - group: ""
    name: openshift-dns-operator
    resource: namespaces
  - group: ""
    name: openshift-dns
    resource: namespaces
  versions:
  - name: operator
    version: 4.1.0-0.nightly-2019-05-15-151517
  - name: coredns
    version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b3e6664558ae1d1a3b773673c5998f1239eccc3ade3b7b4f85aae4f86b54f390
  - name: openshift-cli
    version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:98995ecf1afb6121c0947d9d645dd0ce63b79c55650045db53a18e4ee8190a97

So, something's definitely going wrong with the operator's status reporting. Separately, we need to understand why the failing CoreDNS pod can't communicate with the apiserver in the first place. In addition, please note that cluster DNS services are still functional from the node on which the CoreDNS pod is failing; requests will be routed through the SDN to other CoreDNS pods on other nodes. However, there is necessarily some (unmeasured) performance impact during the outage.
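For reference, the mismatch the operator should be reacting to is visible directly in the DaemonSet status. The fields below are standard DaemonSet status fields, though this particular check is my own suggestion rather than something taken from the report:

$ oc -n openshift-dns get daemonset dns-default \
    -o jsonpath='{.status.desiredNumberScheduled} desired, {.status.numberAvailable} available{"\n"}'
# In the state captured above this would print "6 desired, 5 available",
# which is exactly what dns.operator/default's Degraded=True condition reports.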
Just a quick update on the underlying cause of the CoreDNS pod crash loop. It looks like containers in the CoreDNS pod's network namespace can't route to the apiserver service IP (172.30.0.1). There _might_ be something SDN (nftables) related here. A similar alertmanager container on the same node has no issues, but it shows some possibly benign yet visible nftables differences. I've attached some state dumps. It would be useful to get the SDN folks to take a quick look for anything obvious that stands out.
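For anyone repeating the investigation, here is a rough sketch of how state like this can be collected on the affected node; the exact commands and the CRI-O inspect output layout are assumptions on my part, not the precise steps used for the attachments:

# Capture the host firewall and OVS flow state.
iptables-save > /tmp/iptables.dump
ovs-ofctl -O OpenFlow13 dump-flows br0 > /tmp/ovs-flows.dump

# Resolve the CoreDNS pod's network namespace and test routing to the apiserver service IP from inside it.
POD_ID=$(crictl pods --name dns-default -q | head -n 1)
NETNS=$(crictl inspectp "$POD_ID" | grep -o '/var/run/netns/[^"]*' | head -n 1)
nsenter --net="$NETNS" ip route get 172.30.0.1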
Looks like the three rules for the DNS service's ClusterIP correspond to the three ports defined on the service, so that's okay.
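To cross-check that (commands assumed, not taken from the attachments): the dns-default service exposes three ports, so three service entries for its ClusterIP (172.30.0.10) are expected in the node's NAT table:

$ oc -n openshift-dns get svc dns-default \
    -o jsonpath='{range .spec.ports[*]}{.name} {.protocol}/{.port}{"\n"}{end}'
# On the node, each listed port should have a matching rule for the ClusterIP.
iptables-save -t nat | grep 172.30.0.10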
A DNS resource is considered "Available" as long as A) the Service has been assigned a ClusterIP and B) at least 1 DaemonSet pod reports a status of "Available". The clusteroperator/dns reports "Degraded" if A) the operand namespace does not exist, B) the number of DNS resources is 0, or C) the number of DNS resources reporting "Available" does not match the total number of DNS resources for the cluster. Should the operator's "Degraded" condition instead be based on whether any DNS resource reports "Degraded", as opposed to using the "Available" DNS status condition?
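As a concrete illustration of the mismatch being discussed (assuming jsonpath filter support in the installed oc), the per-DNS condition and the clusteroperator condition can be compared side by side; on the broken cluster the first command prints True while the second prints False:

$ oc get dns.operator/default -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}{"\n"}'
$ oc get clusteroperator/dns -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}{"\n"}'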
Verified with 4.2.0-0.nightly-2019-06-25-003324; the issue has been fixed.

1. Add the below ovs rule to drop all traffic from one DNS pod to 172.30.0.1:

ovs-ofctl -O openflow13 add-flow br0 "table=20, priority=500,ip,in_port=10,nw_src=$dnsPodIP,nw_dst=172.30.0.1 actions=drop"

2. Kill the coredns process to force it to restart.

3. Check the dns pod and logs:

$ oc get pod -n openshift-dns
NAME                READY   STATUS    RESTARTS   AGE
dns-default-zfrq9   1/2     Running   7          98m

E0626 07:38:50.728491       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:317: Failed to list *v1.Endpoints: Get https://172.30.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout

4. Check the default DNS CR status:

$ oc get -n openshift-dns-operator dns.operator/default -o yaml
status:
  conditions:
  - lastTransitionTime: "2019-06-26T07:18:38Z"
    message: Not all Nodes running DaemonSet pod
    reason: DaemonSetDegraded
    status: "True"
    type: Degraded
  - lastTransitionTime: "2019-06-26T07:18:38Z"
    message: 5 Nodes running a DaemonSet pod, want 6
    reason: Reconciling
    status: "True"
    type: Progressing
  - lastTransitionTime: "2019-06-25T06:18:09Z"
    message: Minimum number of Nodes running DaemonSet pod
    reason: AsExpected
    status: "True"
    type: Available

5. Check clusteroperator/dns and ensure the status is "DEGRADED":

$ oc get co/dns
NAME   VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE
dns    4.2.0-0.nightly-2019-06-25-003324    True        True          True       25h
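One note for anyone repeating this verification (cleanup step assumed, not part of the original comment): the injected drop flow should be removed afterwards so that node's DNS pod can reach the apiserver again.

# Remove the drop rule added for the test; $dnsPodIP is the same variable
# used when the flow was installed in step 1.
ovs-ofctl -O openflow13 del-flows br0 "table=20,ip,in_port=10,nw_src=$dnsPodIP,nw_dst=172.30.0.1"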
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922