Bug 1635462 - sync daemonset should tolerate taints
Summary: sync daemonset should tolerate taints
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.11.z
Assignee: Vadim Rutkovsky
QA Contact: Siva Reddy
URL:
Whiteboard:
Depends On:
Blocks: 1685952 1690200
Reported: 2018-10-03 01:44 UTC by Justin Pierce
Modified: 2019-03-19 01:33 UTC
CC List: 8 users

Fixed In Version: openshift-ansible-3.11.44-1.git.0.11d174e.el7.noarch.rpm
Doc Type: Bug Fix
Doc Text:
Cause: The sync daemonset did not run on all nodes.
Consequence: The upgrade failed, as some nodes did not have the annotation set.
Fix: The sync daemonset tolerates all taints and runs on all nodes.
Result: The upgrade succeeds.
Clone Of:
: 1685952 (view as bug list)
Environment:
Last Closed: 2019-01-10 09:04:01 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1635804 0 unspecified CLOSED openshift-sdn daemonsets should tolerate taints 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHBA-2019:0024 0 None None None 2019-01-10 09:04:07 UTC

Internal Links: 1635804

Description Justin Pierce 2018-10-03 01:44:44 UTC
Description of problem:
When a node is tainted, the sync daemonset will not run a pod on it. This leads to failures during installation (e.g. 'Wait for sync DS to set annotations on master nodes').

Version-Release number of the following components:
v3.11.16

How reproducible:
100%

Steps to Reproduce:
1. Taint a node
2. Upgrade from 3.10 to 3.11
3. An error will occur during the installation

Expected results:
The sync daemonset is required for node configuration to be applied, so it seems like it should tolerate most taints.

Additional info:
The taint which interfered with my upgrade was 'NodeWithImpairedVolumes=true:NoSchedule'. This taint can be applied by the storage system without notice whenever EBS issues are encountered.
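For reference, the standard Kubernetes way to make a daemonset run on every node regardless of taints is a wildcard toleration in its pod template: a toleration with `operator: Exists` and no key or effect matches all taints. A minimal sketch of such a pod-template fragment (field names are standard Kubernetes; this is not the exact patch from the fix):

```yaml
# DaemonSet pod-template fragment: tolerate every taint.
# An Exists toleration with no key and no effect matches all taints,
# including NodeWithImpairedVolumes=true:NoSchedule.
spec:
  template:
    spec:
      tolerations:
      - operator: Exists
```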

Comment 1 Vadim Rutkovsky 2018-10-03 08:54:40 UTC
Created PR to master: https://github.com/openshift/openshift-ansible/pull/10310

Comment 2 Vadim Rutkovsky 2018-11-09 06:41:12 UTC
3.11 cherrypick https://github.com/openshift/openshift-ansible/pull/10646

Comment 3 Vadim Rutkovsky 2018-11-09 06:42:14 UTC
Fix is available in openshift-ansible-3.11.42-1

Comment 4 Siva Reddy 2018-11-13 19:43:09 UTC
Tested the fix in the 3.11.43 build, but pods are still getting affected by the taint. Here are the details.

Version:
openshift v3.11.43
kubernetes v1.11.0+d4cacc0

Steps to reproduce:
1. $node is the compute node where the pods are running
2. Note the sync, ovs, and sdn pods running on it
   # oc get pods -n openshift-node -o wide | grep $node ;  oc get pods -n openshift-sdn -o wide | grep $node ;
    sync-vkxwk   1/1       Running   
    ovs-gjkts   1/1       Running   
    sdn-c7c7f   1/1       Running   
3. Note that sync, ovs, and sdn each have a pod running on the node
4. taint the node
    # oc adm taint node $node NodeWithImpairedVolumes=true:NoExecute
    # oc describe node $node | grep -i taint
      Taints:             NodeWithImpairedVolumes=true:NoExecute
5. restart the api and controllers
   # master-restart api
   # master-restart controllers                                                                                                                        
6. Note the sync and sdn pods:
   sync-vkxwk   0/1       CrashLoopBackOff
   sdn-c7c7f   0/1       CrashLoopBackOff

   The ovs pod doesn't even show up, and the sync and sdn pods go into CrashLoopBackOff constantly.

Comment 5 Vadim Rutkovsky 2018-11-14 07:58:05 UTC
(In reply to Siva Reddy from comment #4)
> tested the fix in 43 build but still pods are getting affected by the taint.
> Here are details.

Do sync pods get affected by this taint?
Do sync pods start on tainted nodes?

Comment 6 Siva Reddy 2018-11-15 16:47:44 UTC
Do pods start on tainted nodes?
   They are going into CrashLoopBackOff.
What's in the logs for these containers?
   I have sent you the cluster details via private message.
Why would a master api/controller restart be required?
   I was just following the steps in the guide to apply taints and tolerations; it may not be needed:
https://docs.openshift.com/container-platform/3.11/admin_guide/scheduling/taints_tolerations.html
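For context, the guide linked above covers per-taint tolerations. Unlike the tolerate-everything fix for the sync daemonset, a toleration scoped to only the specific taint used in this reproduction would look something like this (a sketch of the standard Kubernetes fields, not the actual fix):

```yaml
# Pod-spec fragment: tolerate only NodeWithImpairedVolumes=true:NoExecute.
tolerations:
- key: "NodeWithImpairedVolumes"
  operator: "Equal"
  value: "true"
  effect: "NoExecute"
```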

Comment 7 Siva Reddy 2018-11-16 10:58:59 UTC
The sync daemonsets are tolerating the taints put on the node.

Version:
oc v3.11.44
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server 
openshift v3.11.44
kubernetes v1.11.0+d4cacc0

Steps to reproduce:
1. taint the node
   # oc adm taint node $node NodeWithImpairedVolumes=true:NoExecute
   # oc describe node $node | grep -i taint
2. delete the sync pod ds and recreate it
   # oc get ds sync -o yaml > sync-ds.yaml
   # oc delete ds sync
   # oc create -f sync-ds.yaml
3. Note the sync pods

    The sync pods get created without any issue in spite of the taint being present on the node.

Comment 9 errata-xmlrpc 2019-01-10 09:04:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0024

