Description of problem:

During a rollout, the dns-operator Degraded status flaps. Noticed most recently in https://bugzilla.redhat.com/show_bug.cgi?id=1760473#c1:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/11/build-log.txt | grep 'changed Degraded'
Oct 09 14:30:46.581 E clusteroperator/openshift-samples changed Degraded to True: APIServerError: Operation cannot be fulfilled on imagestreams.image.openshift.io "jenkins-agent-nodejs": the object has been modified; please apply your changes to the latest version and try again error replacing imagestream [jenkins-agent-nodejs];
Oct 09 14:30:48.117 W clusteroperator/openshift-samples changed Degraded to False
Oct 09 14:33:55.344 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available

But I think a better example of flapping is:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/8197/build-log.txt | grep 'clusteroperator/dns changed Degraded'
Oct 10 01:35:36.401 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:37:35.199 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
Oct 10 01:39:35.364 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:40:07.031 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
Oct 10 01:40:29.366 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:40:51.153 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
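To quantify the flapping rather than eyeball the grep output, the transitions can be counted per operator. A minimal sketch, using a few of the log lines above as a hardcoded sample (the real check would of course run against the full build-log.txt):

```shell
#!/bin/sh
# Sample of the "changed Degraded" lines from the build log above.
cat > /tmp/build-log-sample.txt <<'EOF'
Oct 10 01:35:36.401 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:37:35.199 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
Oct 10 01:39:35.364 E clusteroperator/dns changed Degraded to True: NotAllDNSesAvailable: Not all desired DNS DaemonSets available
Oct 10 01:40:07.031 W clusteroperator/dns changed Degraded to False: AsExpected: All desired DNS DaemonSets available and operand Namespace exists
EOF

# Field 5 is the clusteroperator name; more than one "to True"
# for the same operator during a single run indicates flapping.
grep 'changed Degraded to True' /tmp/build-log-sample.txt |
  awk '{print $5}' | sort | uniq -c
```

For this sample the count for clusteroperator/dns is 2; in the full 8197 log above it is 3.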
verified with 4.3.0-0.nightly-2019-10-16-194525.

### Force one DNS pod to be unavailable; the "Degraded" status is False:

$ oc -n openshift-dns get ds
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
dns-default   5         5         4       5            4           kubernetes.io/os=linux   6h26m

$ oc get dns.operator -o yaml
<---snip--->
  conditions:
  - lastTransitionTime: "2019-10-17T02:40:34Z"
    message: ClusterIP assigned to DNS Service and minimum DaemonSet pods running
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2019-10-17T07:41:39Z"
    message: 4 Nodes running a DaemonSet pod, want 5
    reason: Reconciling
    status: "True"
    type: Progressing

$ oc get co/dns
NAME   VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
dns    4.3.0-0.nightly-2019-10-16-194525   True        True          False      6h28m

### Force two DNS pods to be unavailable; the "Degraded" status is True:

$ oc get ds -n openshift-dns
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
dns-default   5         5         3       5            3           kubernetes.io/os=linux   6h40m

$ oc get dns.operator -o yaml
<---snip--->
  conditions:
  - lastTransitionTime: "2019-10-17T07:54:03Z"
    message: Too many unavailable CoreDNS pods (2 > 1 max unavailable)
    reason: MaxUnavailableExceeded
    status: "True"
    type: Degraded

$ oc get co/dns
NAME   VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
dns    4.3.0-0.nightly-2019-10-16-194525   True        True          True       6h40m
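The behavior above matches a simple threshold: Degraded goes True only when the number of unavailable pods exceeds max unavailable (1, per the MaxUnavailableExceeded message). A minimal shell sketch of that check; the function name and structure are illustrative assumptions, not the operator's actual (Go) code:

```shell
#!/bin/sh
# Illustrative model of the Degraded check: report Degraded=True
# when unavailable pods exceed max unavailable.
max_unavailable=1   # assumed current setting, per "2 > 1 max unavailable"

dns_degraded() {    # usage: dns_degraded DESIRED AVAILABLE
  unavailable=$(( $1 - $2 ))
  if [ "$unavailable" -gt "$max_unavailable" ]; then
    echo True       # i.e. "Too many unavailable CoreDNS pods"
  else
    echo False
  fi
}

dns_degraded 5 4    # one pod unavailable  -> False
dns_degraded 5 3    # two pods unavailable -> True
```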
just noticed that two nodes (one master and one worker) were possibly "NotReady" at the same time during the upgrade, so DNS Degraded with reason NotAllDNSesAvailable can still be found.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.2     True        True          178m    Working towards 4.3.0-0.nightly-2019-11-02-092336: 13% complete

$ oc get node
NAME                                                        STATUS                        ROLES    AGE     VERSION
compute-0                                                   Ready                         worker   7h26m   v1.14.6+7e13ab9a7
control-plane-0                                             NotReady,SchedulingDisabled   master   7h26m   v1.16.2
control-plane-1                                             Ready                         master   7h26m   v1.14.6+7e13ab9a7
control-plane-2                                             Ready                         master   7h26m   v1.14.6+7e13ab9a7
rhel77-0.weinliu-422-upgrade2.qe.devcluster.openshift.com   Ready                         worker   4h25m   v1.14.6+0365fa172
rhel77-1.weinliu-422-upgrade2.qe.devcluster.openshift.com   NotReady,SchedulingDisabled   worker   4h25m   v1.14.6+0365fa172

$ oc get ds -n openshift-dns
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
dns-default   6         6         4       6            4           kubernetes.io/os=linux   7h25m

$ oc get co/dns -o yaml
<---snip--->
status:
  conditions:
  - lastTransitionTime: "2019-11-05T08:53:50Z"
    message: Not all desired DNS DaemonSets available
    reason: NotAllDNSesAvailable
    status: "True"
    type: Degraded
  - lastTransitionTime: "2019-11-05T08:53:30Z"
    message: Not all DNS DaemonSets available.
    reason: Reconciling
    status: "True"
    type: Progressing
  - lastTransitionTime: "2019-11-05T04:09:04Z"
    message: At least 1 DNS DaemonSet available
    reason: AsExpected
    status: "True"
    type: Available

$ oc get dns.operator -o yaml
<---snip--->
status:
  clusterDomain: cluster.local
  clusterIP: 172.30.0.10
  conditions:
  - lastTransitionTime: "2019-11-05T08:53:50Z"
    message: Too many unavailable CoreDNS pods (2 > 1 max unavailable)
    reason: MaxUnavailableExceeded
    status: "True"
    type: Degraded
  - lastTransitionTime: "2019-11-05T08:53:50Z"
    message: 4 Nodes running a DaemonSet pod, want 6
    reason: Reconciling
    status: "True"
    type: Progressing
  - lastTransitionTime: "2019-11-05T04:09:04Z"
    message: Minimum number of Nodes running DaemonSet pod
    reason: AsExpected
    status: "True"
    type: Available

So I am reopening this to consider whether we should adjust max unavailable to 2.
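For what it's worth, the same arithmetic shows why raising max unavailable to 2 would avoid the Degraded condition in this particular snapshot (6 desired, 4 available). A quick sketch, not a recommendation by itself:

```shell
#!/bin/sh
# Counts from the DaemonSet snapshot above: 6 desired, 4 available.
desired=6
available=4
unavailable=$(( desired - available ))   # 2

# Compare the current assumed setting (1) with the proposed one (2).
for max in 1 2; do
  if [ "$unavailable" -gt "$max" ]; then
    echo "maxUnavailable=$max: Degraded=True ($unavailable > $max)"
  else
    echo "maxUnavailable=$max: Degraded=False ($unavailable <= $max)"
  fi
done
```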
Hongan,

Prior to the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1753059 (which just merged yesterday), we were scheduling DNS pods even on tainted/unschedulable nodes, which looks like what you're seeing. In this scenario, I wonder whether the desired replica count should be 4 instead of 6. I think the toleration fix would achieve that. Mind giving it another try with the latest payload? If that helps, I'd prefer to leave max unavailable at 1 for the time being.

Thank you for your detailed testing!
Yes, that makes sense. Actually, there are two phases during an upgrade. The first is the co/dns upgrade: all nodes should be ready, and the max unavailable DNS pod count is 1 during that time (fixed in this BZ). The second is the co/machine-config upgrade: it may mark more than one node as NotReady simultaneously, which then causes DNS to report Degraded. If the desired replica count can be adjusted properly in this case, that should fix the issue. So let's move to https://bugzilla.redhat.com/show_bug.cgi?id=1753059 to track the latter.

Moving this back to verified. Thank you, Dan.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062