Bug 1635462

Summary: sync daemonset should tolerate taints
Product: OpenShift Container Platform Reporter: Justin Pierce <jupierce>
Component: Installer    Assignee: Vadim Rutkovsky <vrutkovs>
Status: CLOSED ERRATA QA Contact: Siva Reddy <schituku>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 3.11.0    CC: aos-bugs, jialiu, jokerman, mmccomas, schituku, smossber, vrutkovs, wmeng
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openshift-ansible-3.11.44-1.git.0.11d174e.el7.noarch.rpm Doc Type: Bug Fix
Doc Text:
Cause: the sync daemonset did not run on all nodes.
Consequence: upgrades failed, as some nodes did not have the annotation set.
Fix: the sync daemonset now tolerates all taints and runs on all nodes.
Result: upgrades succeed.
Story Points: ---
Clone Of:
: 1685952 (view as bug list) Environment:
Last Closed: 2019-01-10 09:04:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1685952, 1690200    

Description Justin Pierce 2018-10-03 01:44:44 UTC
Description of problem:
When a node is tainted, the sync daemonset will not run a pod on it. This leads to failures during installation (e.g. 'Wait for sync DS to set annotations on master nodes').

Version-Release number of the following components:
v3.11.16

How reproducible:
100%

Steps to Reproduce:
1. Taint a node
2. Upgrade from 3.10 to 3.11
3. An error will occur during the installation

Expected results:
The sync daemonset is required for node configuration to be applied, so it should tolerate most (if not all) taints.

Additional info:
The taint which interfered with my upgrade was 'NodeWithImpairedVolumes=true:NoSchedule' - this can be applied by the storage system without notice whenever ebs issues are encountered.

Comment 1 Vadim Rutkovsky 2018-10-03 08:54:40 UTC
Created PR to master: https://github.com/openshift/openshift-ansible/pull/10310
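
For reference, a DaemonSet that must land on every node regardless of taints typically carries a blanket toleration. The linked PR is the authoritative change; the following is only a sketch of what such a change looks like, and its exact placement in the sync DaemonSet is an assumption:

```yaml
# Hypothetical fragment of the sync DaemonSet pod template.
# A toleration with no key and operator "Exists" matches every
# taint (any key, any effect), so sync pods schedule on all nodes.
spec:
  template:
    spec:
      tolerations:
      - operator: Exists
```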

Comment 2 Vadim Rutkovsky 2018-11-09 06:41:12 UTC
3.11 cherrypick https://github.com/openshift/openshift-ansible/pull/10646

Comment 3 Vadim Rutkovsky 2018-11-09 06:42:14 UTC
Fix is available in openshift-ansible-3.11.42-1

Comment 4 Siva Reddy 2018-11-13 19:43:09 UTC
Tested the fix in the 3.11.43 build, but pods are still affected by the taint. Details below.

Version:
openshift v3.11.43
kubernetes v1.11.0+d4cacc0

Steps to reproduce:
1. $node is the compute node where the pods are running
2. Note the running sync, ovs, and sdn pods
   # oc get pods -n openshift-node -o wide | grep $node ;  oc get pods -n openshift-sdn -o wide | grep $node ;
    sync-vkxwk   1/1       Running   
    ovs-gjkts   1/1       Running   
    sdn-c7c7f   1/1       Running   
3. Note that there are 3 pods running on the node: one each for sync, ovs, and sdn
4. taint the node
    # oc adm taint node $node NodeWithImpairedVolumes=true:NoExecute
    # oc describe node $node | grep -i taint
      Taints:             NodeWithImpairedVolumes=true:NoExecute
5. restart the api and controllers
   # master-restart api
   # master-restart controllers                                                                                                                        
6. Note the pods for sync and ovs,sdn pods
   sync-vkxwk   0/1       CrashLoopBackOff
   sdn-c7c7f   0/1       CrashLoopBackOff 


     The ovs pod does not show up at all, and the sync and sdn pods go into CrashLoopBackOff repeatedly.
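
Background note (general Kubernetes behavior, not from this report): a NoSchedule taint only prevents new pods from being scheduled, while NoExecute additionally evicts running pods that do not tolerate it. A toleration scoped to just this one taint, covering both effects, would look roughly like the sketch below; the actual fix instead tolerates all taints:

```yaml
# Hypothetical narrow toleration for this specific taint only:
tolerations:
- key: NodeWithImpairedVolumes
  operator: Exists
  effect: NoSchedule
- key: NodeWithImpairedVolumes
  operator: Exists
  effect: NoExecute
```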

Comment 5 Vadim Rutkovsky 2018-11-14 07:58:05 UTC
(In reply to Siva Reddy from comment #4)
> tested the fix in 43 build but still pods are getting affected by the taint.
> Here are details.

Do sync pods get affected by this taint?
Do sync pods start on tainted nodes?

Comment 6 Siva Reddy 2018-11-15 16:47:44 UTC
Do pods start on tainted nodes?
   They are going into CrashLoopBackOff.
What's in the logs for these containers?
   I have sent you the cluster details via private message.
Why would a master api/controller restart be required?
   I was just following the steps in the guide to apply taints and tolerations; it may not be needed:
https://docs.openshift.com/container-platform/3.11/admin_guide/scheduling/taints_tolerations.html

Comment 7 Siva Reddy 2018-11-16 10:58:59 UTC
The sync daemonset is tolerating the taints put on the node.

Version:
oc v3.11.44
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server 
openshift v3.11.44
kubernetes v1.11.0+d4cacc0

Steps to reproduce:
1. taint the node
   # oc adm taint node $node NodeWithImpairedVolumes=true:NoExecute
   # oc describe node $node | grep -i taint
2. delete the sync pod ds and recreate it
   # oc get ds sync -o yaml > sync-ds.yaml
   # oc delete ds sync
   # oc create -f sync-ds.yaml
3. Note the sync pods

    The sync pods are created without any issue despite the taint being present on the node.

Comment 9 errata-xmlrpc 2019-01-10 09:04:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0024