Bug 1635462

Summary: sync daemonset should tolerate taints
Product: OpenShift Container Platform Reporter: Justin Pierce <jupierce>
Component: Installer    Assignee: Vadim Rutkovsky <vrutkovs>
Status: CLOSED ERRATA QA Contact: Siva Reddy <schituku>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 3.11.0    CC: aos-bugs, jialiu, jokerman, mmccomas, schituku, smossber, vrutkovs, wmeng
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openshift-ansible-3.11.44-1.git.0.11d174e.el7.noarch.rpm Doc Type: Bug Fix
Doc Text:
Cause: the sync daemonset did not run on all nodes.
Consequence: upgrades failed, as some nodes did not have the annotation set.
Fix: the sync daemonset now tolerates all taints and runs on all nodes.
Result: upgrades succeed.
Story Points: ---
Clone Of:
: 1685952 (view as bug list) Environment:
Last Closed: 2019-01-10 09:04:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1685952, 1690200    

Description Justin Pierce 2018-10-03 01:44:44 UTC
Description of problem:
When a node is tainted, the sync daemonset will not run a pod on it. This leads to failures during installation (e.g. 'Wait for sync DS to set annotations on master nodes').

Version-Release number of the following components:
v3.11.16

How reproducible:
100%

Steps to Reproduce:
1. Taint a node
2. Upgrade from 3.10 to 3.11
3. An error will occur during the installation

Expected results:
The sync daemonset is required for node configuration to be applied, so it should tolerate most (if not all) taints.

Additional info:
The taint which interfered with my upgrade was 'NodeWithImpairedVolumes=true:NoSchedule' - this can be applied by the storage system without notice whenever ebs issues are encountered.

Comment 1 Vadim Rutkovsky 2018-10-03 08:54:40 UTC
Created PR to master: https://github.com/openshift/openshift-ansible/pull/10310
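
For reference, a DaemonSet that must land on every node regardless of taints typically carries a blanket toleration. The linked PR is the authoritative change; the following is only a sketch of what such a change looks like, and its exact placement in the sync DaemonSet is an assumption:

```yaml
# Hypothetical fragment of the sync DaemonSet pod template.
# A toleration with no key and operator "Exists" matches every
# taint (any key, any effect), so sync pods schedule on all nodes.
spec:
  template:
    spec:
      tolerations:
      - operator: Exists
```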

Comment 2 Vadim Rutkovsky 2018-11-09 06:41:12 UTC
3.11 cherrypick https://github.com/openshift/openshift-ansible/pull/10646

Comment 3 Vadim Rutkovsky 2018-11-09 06:42:14 UTC
Fix is available in openshift-ansible-3.11.42-1

Comment 4 Siva Reddy 2018-11-13 19:43:09 UTC
Tested the fix in the 3.11.43 build, but pods are still affected by the taint. Details below.

Version:
openshift v3.11.43
kubernetes v1.11.0+d4cacc0

Steps to reproduce:
1. $node is the compute node where the pods are running
2. Note the running sync, ovs, and sdn pods
   # oc get pods -n openshift-node -o wide | grep $node ;  oc get pods -n openshift-sdn -o wide | grep $node ;
    sync-vkxwk   1/1       Running   
    ovs-gjkts   1/1       Running   
    sdn-c7c7f   1/1       Running   
3. Note that there are 3 pods running on the node: one each for sync, ovs, and sdn
4. taint the node
    # oc adm taint node $node NodeWithImpairedVolumes=true:NoExecute
    # oc describe node $node | grep -i taint
      Taints:             NodeWithImpairedVolumes=true:NoExecute
5. restart the api and controllers
   # master-restart api
   # master-restart controllers                                                                                                                        
6. Note the pods for sync and ovs,sdn pods
   sync-vkxwk   0/1       CrashLoopBackOff
   sdn-c7c7f   0/1       CrashLoopBackOff 


     The ovs pod does not show up at all, and the sync and sdn pods go into CrashLoopBackOff repeatedly.
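
Background note (general Kubernetes behavior, not from this report): a NoSchedule taint only prevents new pods from being scheduled, while NoExecute additionally evicts running pods that do not tolerate it. A toleration scoped to just this one taint, covering both effects, would look roughly like the sketch below; the actual fix instead tolerates all taints:

```yaml
# Hypothetical narrow toleration for this specific taint only:
tolerations:
- key: NodeWithImpairedVolumes
  operator: Exists
  effect: NoSchedule
- key: NodeWithImpairedVolumes
  operator: Exists
  effect: NoExecute
```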

Comment 5 Vadim Rutkovsky 2018-11-14 07:58:05 UTC
(In reply to Siva Reddy from comment #4)
> tested the fix in 43 build but still pods are getting affected by the taint.
> Here are details.

Do sync pods get affected by this taint?
Do sync pods start on tainted nodes?

Comment 6 Siva Reddy 2018-11-15 16:47:44 UTC
Do pods start on tainted nodes?
   They are going into CrashLoopBackOff.
What's in the logs for these containers?
   I have sent you the cluster details via private message.
Why would a master api/controller restart be required?
   I was just following the steps in the guide to apply taints and tolerations; it may not be needed:
https://docs.openshift.com/container-platform/3.11/admin_guide/scheduling/taints_tolerations.html

Comment 7 Siva Reddy 2018-11-16 10:58:59 UTC
The sync daemonset is tolerating the taints put on the node.

Version:
oc v3.11.44
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server 
openshift v3.11.44
kubernetes v1.11.0+d4cacc0

Steps to reproduce:
1. taint the node
   # oc adm taint node $node NodeWithImpairedVolumes=true:NoExecute
   # oc describe node $node | grep -i taint
2. delete the sync pod ds and recreate it
   # oc get ds sync -o yaml > sync-ds.yaml
   # oc delete ds sync
   # oc create -f sync-ds.yaml
3. Note the sync pods

    The sync pods are created without any issue despite the taint being present on the node.

Comment 9 errata-xmlrpc 2019-01-10 09:04:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0024