Description of problem:
When a node is tainted, the sync daemonset will not run a pod on it. This leads to failures during installation (e.g. 'Wait for sync DS to set annotations on master nodes').

Version-Release number of the following components:
v3.11.16

How reproducible:
100%

Steps to Reproduce:
1. Taint a node
2. Upgrade from 3.10 to 3.11
3. An error occurs during the installation

Expected results:
The sync daemonset is required for node configuration to be applied, so it should tolerate most taints.

Additional info:
The taint that interfered with my upgrade was 'NodeWithImpairedVolumes=true:NoSchedule'. This taint can be applied by the storage system without notice whenever EBS issues are encountered.
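For reference, a toleration with operator "Exists" and no key matches every taint, so one hypothetical mitigation (an assumption on my part, not necessarily what the fix ends up doing) would be to patch such a toleration into the sync DS pod template (namespace and DS name taken from the comments below):

# oc patch ds sync -n openshift-node --type=json \
    -p '[{"op":"add","path":"/spec/template/spec/tolerations","value":[{"operator":"Exists"}]}]'  # hypothetical sketch, not the shipped fix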
Created PR to master: https://github.com/openshift/openshift-ansible/pull/10310
3.11 cherry-pick: https://github.com/openshift/openshift-ansible/pull/10646
Fix is available in openshift-ansible-3.11.42-1
Tested the fix in the .43 build, but pods are still getting affected by the taint. Here are the details.

Version:
openshift v3.11.43
kubernetes v1.11.0+d4cacc0

Steps to reproduce:
1. $node is the compute node where the pods are running.
2. Note the sync, ovs, and sdn pods running on the node:
# oc get pods -n openshift-node -o wide | grep $node ; oc get pods -n openshift-sdn -o wide | grep $node
sync-vkxwk 1/1 Running
ovs-gjkts 1/1 Running
sdn-c7c7f 1/1 Running
3. Note that there are 3 pods running: one each for sync, ovs, and sdn.
4. Taint the node:
# oc adm taint node $node NodeWithImpairedVolumes=true:NoExecute
# oc describe node $node | grep -i taint
Taints: NodeWithImpairedVolumes=true:NoExecute
5. Restart the API and controllers:
# master-restart api
# master-restart controllers
6. Note the sync, ovs, and sdn pods again:
sync-vkxwk 0/1 CrashLoopBackOff
sdn-c7c7f 0/1 CrashLoopBackOff

The ovs pod doesn't even show up, and the sync and sdn pods go into CrashLoopBackOff constantly.
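One thing worth checking before blaming the taint: whether the .43 pod templates actually carry the new tolerations. A quick inspection, assuming the DS names and namespaces shown above:

# oc get ds sync -n openshift-node -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'
# oc get ds ovs -n openshift-sdn -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'
# oc get ds sdn -n openshift-sdn -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'

Note that CrashLoopBackOff means the pod was scheduled and its container is starting and then failing; a taint that was not tolerated would instead keep the pod Pending (NoSchedule) or evict it (NoExecute), so the crash loop may have a separate cause.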
(In reply to Siva Reddy from comment #4)
> Tested the fix in the .43 build, but pods are still getting affected by
> the taint. Here are the details.

Do sync pods get affected by this taint? Do sync pods start on tainted nodes?
> Do pods start on tainted nodes?
They are going into CrashLoopBackOff.

> What's in the logs for these containers?
I have sent you the cluster details via private message.

> Why would a master api/controller restart be required?
I was just following the steps in the guide for applying taints and tolerations, so it may not be needed:
https://docs.openshift.com/container-platform/3.11/admin_guide/scheduling/taints_tolerations.html
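To isolate whether the taint is the cause, it can be removed without restarting the API or controllers (the trailing '-' deletes a taint):

# oc adm taint node $node NodeWithImpairedVolumes=true:NoExecute-
# oc describe node $node | grep -i taint

If the pods keep crash-looping with the taint gone, the failure is unrelated to toleration of the taint.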
The sync daemonsets are tolerating the taints put on the node.

Version:
oc v3.11.44
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server openshift v3.11.44
kubernetes v1.11.0+d4cacc0

Steps to reproduce:
1. Taint the node:
# oc adm taint node $node NodeWithImpairedVolumes=true:NoExecute
# oc describe node $node | grep -i taint
2. Delete the sync DS and recreate it:
# oc get ds sync -o yaml > sync-ds.yaml
# oc delete ds sync
# oc create -f sync-ds.yaml
3. Note the sync pods.

The sync pods get created without any issue in spite of the taint being present on the node.
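For anyone re-verifying, the tolerations that allow this can be inspected directly, and the per-node pod listing from the earlier reproduction can be repeated (assuming, as above, the sync DS lives in openshift-node):

# oc get ds sync -n openshift-node -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'
# oc get pods -n openshift-node -o wide | grep $node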
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0024