Bug 1705649
Summary: | [reliability] Cluster with halted master did not reschedule operators after 5m of being down | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
Component: | Node | Assignee: | ravig <rgudimet> |
Status: | CLOSED ERRATA | QA Contact: | Sunil Choudhary <schoudha> |
Severity: | urgent | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.1.0 | CC: | aos-bugs, decarr, eparis, gblomqui, gklein, jokerman, mmccomas, rgudimet, sjenning |
Target Milestone: | --- | ||
Target Release: | 4.1.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2019-06-04 10:48:19 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Clayton Coleman
2019-05-02 16:33:56 UTC
The specific "tolerate down node for X seconds" should probably be set on operators to 1-2m instead of using the default. That would ensure progress. Also, this could be zone thresholds.

```
$ oc get machine -A
NAMESPACE               NAME                                                INSTANCE              STATE     TYPE       REGION     ZONE         AGE
openshift-machine-api   ci-ln-gchzbyt-d5d6b-497pq-master-0                  i-0b8ba53a09e86b120   running   m4.xlarge  us-east-1  us-east-1a   122m
openshift-machine-api   ci-ln-gchzbyt-d5d6b-497pq-master-1                  i-06ef695e0b5d5d8f3   running   m4.xlarge  us-east-1  us-east-1b   122m
openshift-machine-api   ci-ln-gchzbyt-d5d6b-497pq-master-2                  i-019e54566efae9ccc   running   m4.xlarge  us-east-1  us-east-1a   122m
openshift-machine-api   ci-ln-gchzbyt-d5d6b-497pq-worker-us-east-1a-6k5cz   i-071d957aabcec3819   running   m4.large   us-east-1  us-east-1a   121m
openshift-machine-api   ci-ln-gchzbyt-d5d6b-497pq-worker-us-east-1a-vzvhw   i-0730151cbbc6b5c6f   running   m4.large   us-east-1  us-east-1a   121m
openshift-machine-api   ci-ln-gchzbyt-d5d6b-497pq-worker-us-east-1b-vprmg   i-0fc1d784a6c99335f   running   m4.large   us-east-1  us-east-1b   121m

$ oc get nodes -o wide -L failure-domain.beta.kubernetes.io/zone
NAME                           STATUS     ROLES    AGE    VERSION             INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                   KERNEL-VERSION         CONTAINER-RUNTIME                          ZONE
ip-10-0-129-154.ec2.internal   Ready      master   124m   v1.13.4+48f1990d7   10.0.129.154   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190502.1 (Ootpa)   4.18.0-80.el8.x86_64   cri-o://1.13.7-1.rhaos4.1.git4258573.el8   us-east-1a
ip-10-0-129-97.ec2.internal    NotReady   master   124m   v1.13.4+48f1990d7   10.0.129.97    <none>        Red Hat Enterprise Linux CoreOS 410.8.20190502.1 (Ootpa)   4.18.0-80.el8.x86_64   cri-o://1.13.7-1.rhaos4.1.git4258573.el8   us-east-1a
ip-10-0-131-44.ec2.internal    Ready      master   42m    v1.13.4+48f1990d7   10.0.131.44    <none>        Red Hat Enterprise Linux CoreOS 410.8.20190502.1 (Ootpa)   4.18.0-80.el8.x86_64   cri-o://1.13.7-1.rhaos4.1.git4258573.el8   us-east-1a
ip-10-0-135-184.ec2.internal   Ready      worker   119m   v1.13.4+48f1990d7   10.0.135.184   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190502.1 (Ootpa)   4.18.0-80.el8.x86_64   cri-o://1.13.7-1.rhaos4.1.git4258573.el8   us-east-1a
ip-10-0-141-8.ec2.internal     Ready      worker   119m   v1.13.4+48f1990d7   10.0.141.8     <none>        Red Hat Enterprise Linux CoreOS 410.8.20190502.1 (Ootpa)   4.18.0-80.el8.x86_64   cri-o://1.13.7-1.rhaos4.1.git4258573.el8   us-east-1a
ip-10-0-152-43.ec2.internal    Ready      worker   119m   v1.13.4+48f1990d7   10.0.152.43    <none>        Red Hat Enterprise Linux CoreOS 410.8.20190502.1 (Ootpa)   4.18.0-80.el8.x86_64   cri-o://1.13.7-1.rhaos4.1.git4258573.el8   us-east-1b
ip-10-0-159-171.ec2.internal   Ready      master   124m   v1.13.4+48f1990d7   10.0.159.171   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190502.1 (Ootpa)   4.18.0-80.el8.x86_64   cri-o://1.13.7-1.rhaos4.1.git4258573.el8   us-east-1b
```

Ok, there are a few issues here:

1) Components tolerating all taints, or the unreachable taint for infinite time.

I wrote an e2e test to find these components: https://github.com/openshift/origin/pull/22752

The current list of offenders is:

openshift-apiserver-operator/openshift-apiserver-operator
openshift-authentication-operator/authentication-operator
openshift-authentication/integrated-oauth-server
openshift-cloud-credential-operator/cloud-credential-operator
openshift-cluster-machine-approver/machine-approver
openshift-cluster-node-tuning-operator/cluster-node-tuning-operator
openshift-cluster-storage-operator/cluster-storage-operator
openshift-cluster-version/cluster-version-operator
openshift-console/downloads
openshift-controller-manager-operator/openshift-controller-manager-operator
openshift-dns-operator/dns-operator
openshift-ingress-operator/ingress-operator
openshift-kube-apiserver-operator/kube-apiserver-operator
openshift-kube-controller-manager-operator/kube-controller-manager-operator
openshift-kube-scheduler-operator/openshift-kube-scheduler-operator
openshift-machine-config-operator/etcd-quorum-guard
openshift-marketplace/marketplace-operator
openshift-monitoring/cluster-monitoring-operator
openshift-operator-lifecycle-manager/catalog-operator
openshift-operator-lifecycle-manager/olm-operator
openshift-operator-lifecycle-manager/olm-operators
openshift-operator-lifecycle-manager/packageserver
openshift-service-ca-operator/service-ca-operator
openshift-service-ca/apiservice-cabundle-injector
openshift-service-ca/configmap-cabundle-injector
openshift-service-ca/service-serving-cert-signer
openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator
openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator

2) Mutating admission plugins adding tolerations to pods.

https://github.com/openshift/origin/tree/master/vendor/k8s.io/kubernetes/plugin/pkg/admission/defaulttolerationseconds

Adds a not-ready and unreachable toleration for 300s. This is supposed to align with the 5m default pod-eviction-timeout on the kube-controller-manager. Fragile and must stay in sync.

https://github.com/openshift/origin/tree/master/vendor/k8s.io/kubernetes/plugin/pkg/admission/podtolerationrestriction

Adds a memory-pressure toleration (infinite duration) for non-besteffort pods.

This is a huge problem in my mind in that it steals eviction policy away from the kubelet.

Consider the kubelet coming under disk pressure. It will set the disk-pressure condition. TaintNodeByCondition will taint the node with the disk-pressure taint. All pods that don't have a toleration with NoExecute effect for that taint are evicted by the TaintBasedEviction controller. Meanwhile, the kubelet is also evicting pods, more intelligently ranked on disk usage.

I think we might have to disable TaintNodeByCondition and TaintBasedEviction until we can fix some of these glaring issues. Ravi, what do you think?

> This is a huge problem in my mind in that it steals eviction policy away from the kubelet.
>
> Consider the kubelet coming under disk pressure. It will set the disk-pressure condition. TaintNodeByCondition will taint the node with the disk-pressure taint. All pods that don't have a toleration with NoExecute effect for that taint are evicted by the TaintBasedEviction controller. Meanwhile, the kubelet is also evicting pods, more intelligently ranked on disk usage.

Ok, Ravi pointed out to me this isn't an issue, because the taint added by TaintNodeByCondition is a NoSchedule taint, not a NoExecute taint. So that is fine.

> I think we might have to disable TaintNodeByCondition and TaintBasedEviction until we can fix some of these glaring issues.

The issue isn't as bad as I first thought, but this is still an escape hatch if we need it.

The tl;dr here is that the pod-eviction-timeout on the KCM no longer has any effect; the defaulttolerationseconds admission plugin now controls the default pod eviction timeout. Also, the timeout is now a PER-POD setting via the tolerationSeconds on the unreachable taint, with the default set by the defaulttolerationseconds admission plugin. That plugin has flags to control the default tolerationSeconds: https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/plugin/pkg/admission/defaulttolerationseconds/admission.go#L34-L40 The default is 300s (5m) to align with the historical default for pod-eviction-timeout on the KCM.
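To make the per-pod behavior described above concrete, here is a rough Python sketch of what the defaulttolerationseconds admission plugin does, based only on the description in this bug (the real plugin is Go code in k8s.io/kubernetes; the function and constant names here are illustrative, and the sketch simplifies the plugin's toleration-matching logic):

```python
# Illustrative sketch of the defaulttolerationseconds admission behavior:
# if a pod does not already tolerate the not-ready or unreachable taint,
# add a NoExecute toleration with a 300s default (matching the historical
# 5m pod-eviction-timeout on the KCM). This is a simplified model, not
# the actual Go plugin.

DEFAULT_TOLERATION_SECONDS = 300  # the plugin's flag-controlled default

NOT_READY = "node.kubernetes.io/not-ready"
UNREACHABLE = "node.kubernetes.io/unreachable"

def default_toleration_seconds(pod_spec):
    """Mutate a pod spec (a plain dict) the way the plugin would."""
    tolerations = pod_spec.setdefault("tolerations", [])
    present = {t.get("key") for t in tolerations}
    for key in (NOT_READY, UNREACHABLE):
        if key not in present:
            tolerations.append({
                "key": key,
                "operator": "Exists",
                "effect": "NoExecute",
                "tolerationSeconds": DEFAULT_TOLERATION_SECONDS,
            })
    return pod_spec

# A pod with no tolerations picks up both 300s defaults; a pod that
# already sets its own tolerationSeconds (e.g. 120) keeps it for that key.
spec = default_toleration_seconds({})
```

Because the timeout is now per-pod, an operator can opt into the shorter 1-2m window suggested in the description simply by declaring its own not-ready/unreachable tolerations with a smaller tolerationSeconds, and the plugin leaves those alone.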
Current list of pending PRs:

https://github.com/openshift/cluster-monitoring-operator/pull/342
https://github.com/openshift/machine-api-operator/pull/310
https://github.com/openshift/cluster-samples-operator/pull/142
https://github.com/openshift/cloud-credential-operator/pull/64
https://github.com/openshift/cluster-kube-apiserver-operator/pull/454
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/246
https://github.com/openshift/cluster-openshift-apiserver-operator/pull/200
https://github.com/openshift/cluster-storage-operator/pull/31
https://github.com/openshift/cluster-node-tuning-operator/pull/50
https://github.com/openshift/cluster-machine-approver/pull/25
https://github.com/openshift/console-operator/pull/224
https://github.com/openshift/cluster-ingress-operator/pull/226
https://github.com/operator-framework/operator-lifecycle-manager/pull/843
https://github.com/openshift/service-ca-operator/pull/56

One more: https://github.com/openshift/cluster-kube-scheduler-operator/pull/124

Merged so far:

https://github.com/openshift/cluster-samples-operator/pull/142
https://github.com/openshift/cluster-storage-operator/pull/31
https://github.com/openshift/console-operator/pull/224

12 left...

No additional PRs merged today. Ravi, can you provide color as to why these aren't merging? CI? Unresponsive component owners?
More have merged:

https://github.com/openshift/cluster-monitoring-operator/pull/342
https://github.com/openshift/machine-api-operator/pull/310
https://github.com/openshift/cloud-credential-operator/pull/64
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/246
https://github.com/openshift/cluster-ingress-operator/pull/226
https://github.com/openshift/service-ca-operator/pull/56

These remain:

https://github.com/openshift/cluster-kube-apiserver-operator/pull/454
https://github.com/openshift/cluster-kube-scheduler-operator/pull/124
https://github.com/openshift/cluster-openshift-apiserver-operator/pull/200
https://github.com/openshift/cluster-node-tuning-operator/pull/50
https://github.com/openshift/cluster-machine-approver/pull/25
https://github.com/operator-framework/operator-lifecycle-manager/pull/843

Appears only https://github.com/operator-framework/operator-lifecycle-manager/pull/843 is outstanding. Thank you Ravi! So close.

Whoops, 2 left:

https://github.com/openshift/cluster-kube-apiserver-operator/pull/454
https://github.com/operator-framework/operator-lifecycle-manager/pull/843

As of 4.1.0-0.ci-2019-05-07-132218 the e2e test shows:

openshift-apiserver-operator/openshift-apiserver-operator tolerates all taints
openshift-authentication/integrated-oauth-server tolerates all taints
openshift-authentication/oauth-openshift tolerates all taints
openshift-cloud-credential-operator/cloud-credential-operator tolerates all taints
openshift-cluster-machine-approver/machine-approver tolerates all taints
openshift-cluster-node-tuning-operator/cluster-node-tuning-operator tolerates all taints
openshift-cluster-version/cluster-version-operator tolerates all taints
openshift-dns-operator/dns-operator tolerates all taints
openshift-ingress-operator/ingress-operator tolerates all taints
openshift-kube-apiserver-operator/kube-apiserver-operator tolerates all taints
openshift-kube-controller-manager-operator/kube-controller-manager-operator tolerates all taints
openshift-kube-scheduler-operator/openshift-kube-scheduler-operator tolerates all taints
openshift-machine-config-operator/etcd-quorum-guard tolerates node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with no tolerationSeconds
openshift-marketplace/marketplace-operator tolerates all taints
openshift-network-operator/network-operator tolerates node.kubernetes.io/not-ready with no tolerationSeconds
openshift-operator-lifecycle-manager/catalog-operator tolerates all taints
openshift-operator-lifecycle-manager/olm-operator tolerates all taints
openshift-operator-lifecycle-manager/olm-operators tolerates all taints
openshift-operator-lifecycle-manager/packageserver tolerates all taints
openshift-service-ca-operator/service-ca-operator tolerates all taints

There is still some weirdness going on somewhere: https://github.com/openshift/cloud-credential-operator/blob/master/manifests/01_deployment.yaml#L172-L183

However, in my cluster, the tolerations are:

```
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 120
- operator: Exists    # <----
- effect: NoSchedule
  key: node-role.kubernetes.io/master
  operator: Exists
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 120
- effect: NoSchedule
  key: node.kubernetes.io/memory-pressure
  operator: Exists
```

Is there yet another admission plugin doing something?

Clayton, hit some fun here. Apparently, whatever the CVO does to apply the new operator deployment manifests merges the new toleration list with the old.

tl;dr There would be no way to remove tolerations from an operator deployment going forward.
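The merge problem described above can be shown with a toy sketch. This is not the CVO's actual merge code (that lives in its Go resource-merge logic); it is a simplified model of "union the new toleration list with the old", which is enough to reproduce the symptom:

```python
# Toy model of the merge behavior described above: applying a new
# manifest unions its tolerations with whatever is already on the
# deployment, so a toleration removed from the manifest (here the
# blanket {"operator": "Exists"}) survives the apply.
# Hypothetical helper; not the real CVO implementation.

def naive_merge_tolerations(existing, desired):
    """Union-merge: keep every existing entry, append new ones."""
    merged = list(existing)
    for tol in desired:
        if tol not in merged:
            merged.append(tol)
    return merged

# What the deployment already had (the stale tolerate-everything entry).
old = [{"operator": "Exists"}]

# What the fixed manifest now declares instead.
new = [{"key": "node.kubernetes.io/not-ready", "operator": "Exists",
        "effect": "NoExecute", "tolerationSeconds": 120}]

merged = naive_merge_tolerations(old, new)
# The blanket toleration is still present after the "apply".
```

This matches what the cloud-credential-operator pod showed: the manifest's new 120s tolerations were added, but the old bare `operator: Exists` entry was never removed.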
New list of unaddressed components (on a fresh cluster now):

openshift-authentication-operator/authentication-operator tolerates all taints
openshift-authentication/oauth-openshift tolerates all taints
openshift-dns-operator/dns-operator tolerates all taints
openshift-machine-config-operator/etcd-quorum-guard tolerates node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with no tolerationSeconds
openshift-marketplace/marketplace-operator tolerates all taints

Now merged:

https://github.com/openshift/cluster-kube-apiserver-operator/pull/454
https://github.com/operator-framework/operator-lifecycle-manager/pull/843

> openshift-authentication-operator/authentication-operator tolerates all taints
> openshift-authentication/oauth-openshift tolerates all taints

https://github.com/openshift/cluster-authentication-operator/pull/128

> openshift-dns-operator/dns-operator tolerates all taints

https://github.com/openshift/cluster-dns-operator/pull/107

> openshift-machine-config-operator/etcd-quorum-guard tolerates node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with no tolerationSeconds

https://github.com/openshift/machine-config-operator/pull/700

> openshift-marketplace/marketplace-operator tolerates all taints

https://github.com/operator-framework/operator-marketplace/pull/178 (merged, but not seeing the effect in 4.1.0-0.ci-2019-05-07-180607)

I was able to confirm marketplace-operator is fixed on 4.1.0-0.ci-2019-05-07-213620. https://github.com/openshift/machine-config-operator/pull/700 and https://github.com/openshift/cluster-authentication-operator/pull/128 are merged but not in a release yet. Only https://github.com/openshift/cluster-dns-operator/pull/107 remains unmerged.

Last one through. Let's go, new release payloads!

As of 4.1.0-0.ci-2019-05-08-025712, only one left.
openshift-operator-lifecycle-manager/olm-operators tolerates all taints

Must have missed it in https://github.com/operator-framework/operator-lifecycle-manager/pull/843 :-/

https://github.com/operator-framework/operator-lifecycle-manager/pull/850 merged; just waiting for a release build.

Abandoning the attempt to change olm-operators. It is actually not an operator. Adding it to the e2e test whitelist.

This should pass now, and we can move to MODIFIED after it does: https://github.com/openshift/origin/pull/22752

I ran the e2e locally against 4.1.0-0.ci-2019-05-08-172654 and it passed.

I'm moving this to MODIFIED as the code is fixed. We should continue to track whether https://github.com/openshift/origin/pull/22752 merges, since that is an e2e test to prevent this mess from coming back. But QE should be able to validate now.

https://github.com/openshift/origin/pull/22752 merged, so we don't lose this ground.

Note to QE: The validation for this is to ensure that all _operator_ pods do not 1) tolerate all taints or 2) tolerate the not-ready or unreachable taint with no tolerationSeconds (i.e. indefinitely).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
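The QE validation note above can be expressed as a small check. This is a hypothetical helper that mirrors the intent of the e2e test, not the test's actual code; it flags the two failure modes named in the note, given a pod's toleration list (in Kubernetes, a toleration with `operator: Exists` and no key matches every taint):

```python
# Flags the two problems QE is asked to check for:
#   1) a blanket toleration (operator: Exists with no key), which matches all taints
#   2) a not-ready/unreachable toleration with no tolerationSeconds (i.e. indefinite)
# Hypothetical helper, mirroring the e2e test's intent rather than its code.

EVICTION_TAINTS = {"node.kubernetes.io/not-ready", "node.kubernetes.io/unreachable"}

def toleration_problems(tolerations):
    """Return a list of human-readable findings for a pod's tolerations."""
    problems = []
    for tol in tolerations:
        if tol.get("operator") == "Exists" and not tol.get("key"):
            problems.append("tolerates all taints")
        elif tol.get("key") in EVICTION_TAINTS and "tolerationSeconds" not in tol:
            problems.append(f"tolerates {tol['key']} with no tolerationSeconds")
    return problems

# A fixed operator pod (bounded tolerationSeconds) produces no findings;
# a pod with the stale blanket toleration is flagged.
ok = [{"key": "node.kubernetes.io/unreachable", "operator": "Exists",
       "effect": "NoExecute", "tolerationSeconds": 120}]
bad = [{"operator": "Exists"}]
```

In practice QE would feed this the tolerations pulled from each operator pod (e.g. via `oc get pods -A -o json`) and expect no findings outside the whitelist.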