Bug 1650196
Summary: | Adding node taint kills critical cluster pods on tainted node | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Jarle Bjørgeengen <jarle.bjorgeengen> |
Component: | Networking | Assignee: | Phil Cameron <pcameron> |
Status: | CLOSED ERRATA | QA Contact: | zhaozhanqi <zzhao> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 3.10.0 | CC: | aos-bugs, gblomqui, jarle.bjorgeengen, jokerman, mmccomas, sjenning |
Target Milestone: | --- | ||
Target Release: | 3.11.z | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause:
Consequence:
Fix: openshift-ansible PR 11550 and PR 10731
Result:
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2019-07-23 19:56:23 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Jarle Bjørgeengen
2018-11-15 15:03:25 UTC
Hi Jarle,

The pods you mentioned, even though critical, have no toleration for the `NoExecute` taint. To be clear, critical pods in general don't have tolerations unless they are explicitly specified in the pod spec or the pods are created through a DaemonSet (DS). Are you noticing any issue with those pods not having the toleration? I am curious about your use case here: is this causing an upgrade or some other cluster functionality to fail?

Hi Ravig,

I expected that critical pods, such as the pods running core cluster node services (sdn, ovs, and so on), would have all tolerations provided by the installation. That is, I did not expect the node to go offline when adding those taints. It does not help to add tolerations to pods from user deployments: nothing will be scheduled on the node, because the node is no longer running the services required to receive _any_ pods.

The ovs and sdn issues appear to be fixed by:

OVS: tolerate taints #10731
https://github.com/openshift/openshift-ansible/pull/10731/files
openshift-merge-robot merged 1 commit into openshift:release-3.11 from vrutkovs:tolerate-ovs on Nov 20, 2018

SDN: tolerate taints #11550
https://github.com/openshift/openshift-ansible/pull/11550
openshift-merge-robot merged 1 commit into openshift:release-3.11 from vrutkovs:3.11-sdn-tolerations on Apr 24

commit aea7524b8b20b59b0238feb9df36a3b2de413dab (HEAD, upstream/pr/10735)
Author: Vadim Rutkovsky <roignac>
Date: Tue Nov 20 15:55:46 2018 +0100

OVS: tolerate taints

Could you try a recent version?

$ oc version
...
Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0", GitTreeState:"clean", BuildDate:"2019-06-14T18:41:57Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

$ oc adm taint node node.lab.variantweb.net dedicated=special-user:NoExecute
node/node.lab.variantweb.net tainted

$ oc get node node.lab.variantweb.net -ojson | jq '.spec.taints'
[
  {
    "effect": "NoExecute",
    "key": "dedicated",
    "value": "special-user"
  }
]

$ ogpa -o wide | grep node.lab
openshift-node   sync-w54sm   1/1   Running   0   18m   10.42.10.208   node.lab.variantweb.net   <none>
openshift-sdn    ovs-vlfb5    1/1   Running   0   18m   10.42.10.208   node.lab.variantweb.net   <none>
openshift-sdn    sdn-l9g5h    1/1   Running   0   18m   10.42.10.208   node.lab.variantweb.net   <none>

I do see that the node-exporter is still evicted, but that is an issue for monitoring. This bug is fixed.

FYI: opened https://bugzilla.redhat.com/show_bug.cgi?id=1720758 to track the node-exporter bug.

Verified this bug on v3.10.153. All openshift-sdn pods keep working even when the node is tainted with esAccess=true:NoExecute.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1753
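
For anyone reading this report later, the eviction behaviour discussed above can be sketched as a small model of taint/toleration matching. This is an illustrative Python sketch, not OpenShift or Kubernetes code: the names `Taint`, `Toleration`, `tolerates` and `is_evicted` are hypothetical, `tolerationSeconds` is ignored, and the catch-all `Exists` toleration in the usage lines is assumed for illustration rather than taken from the PRs linked in this report.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with operator "Exists" matches any key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches any effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this single toleration matches the given taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    if toleration.operator == "Equal":
        return toleration.value == taint.value
    return False  # unknown operator never matches


def is_evicted(tolerations: Iterable[Toleration], taints: Iterable[Taint]) -> bool:
    """A running pod is evicted if any NoExecute taint has no matching toleration."""
    tolerations = list(tolerations)
    return any(
        taint.effect == "NoExecute"
        and not any(tolerates(t, taint) for t in tolerations)
        for taint in taints
    )


# The taint applied in this report: dedicated=special-user:NoExecute
node_taints = [Taint(key="dedicated", value="special-user", effect="NoExecute")]

# A pod without tolerations (the original sdn/ovs situation) is evicted.
pod_without_tolerations = is_evicted([], node_taints)

# Assumption: a catch-all toleration (operator "Exists", no key, no effect).
pod_with_catch_all = is_evicted([Toleration(operator="Exists")], node_taints)

# A toleration for a different effect does not protect against NoExecute.
pod_with_noschedule_only = is_evicted(
    [Toleration(key="dedicated", operator="Exists", effect="NoSchedule")],
    node_taints,
)
```

In this model a pod without tolerations is evicted by the `dedicated=special-user:NoExecute` taint, while a pod carrying a catch-all `Exists` toleration keeps running, which matches what the sync, ovs and sdn pods show in the verification output.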