Bug 1635462 - sync daemonset should tolerate taints
Summary: sync daemonset should tolerate taints
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.11.z
Assignee: Vadim Rutkovsky
QA Contact: Siva Reddy
URL:
Whiteboard:
Depends On:
Blocks: 1685952 1690200
Reported: 2018-10-03 01:44 UTC by Justin Pierce
Modified: 2019-03-19 01:33 UTC
CC List: 8 users

Fixed In Version: openshift-ansible-3.11.44-1.git.0.11d174e.el7.noarch.rpm
Doc Type: Bug Fix
Doc Text:
Cause: The sync daemonset did not run on all nodes.
Consequence: The upgrade failed, as some nodes did not have the annotation set.
Fix: The sync daemonset tolerates all taints and runs on all nodes.
Result: The upgrade succeeds.
Clone Of:
: 1685952 (view as bug list)
Environment:
Last Closed: 2019-01-10 09:04:01 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1635804 0 unspecified CLOSED openshift-sdn daemonsets should tolerate taints 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHBA-2019:0024 0 None None None 2019-01-10 09:04:07 UTC

Internal Links: 1635804

Description Justin Pierce 2018-10-03 01:44:44 UTC
Description of problem:
When a node is tainted, the sync daemonset will not run a pod on it. This leads to failures during installation (e.g. 'Wait for sync DS to set annotations on master nodes').

Version-Release number of the following components:
v3.11.16

How reproducible:
100%

Steps to Reproduce:
1. Taint a node
2. Upgrade from 3.10 to 3.11
3. An error will occur during the installation

Expected results:
The sync daemonset is required for node configuration to be applied, so it seems like it should tolerate most taints.

Additional info:
The taint which interfered with my upgrade was 'NodeWithImpairedVolumes=true:NoSchedule'. This taint can be applied by the storage system without notice whenever EBS issues are encountered.
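For reference, the standard Kubernetes way to make a daemonset run on every node regardless of taints is a wildcard toleration in its pod template: a toleration with `operator: Exists` and no key or effect matches all taints. A minimal sketch of such a pod-template fragment (field names are standard Kubernetes; this is not the exact patch from the fix):

```yaml
# DaemonSet pod-template fragment: tolerate every taint.
# An Exists toleration with no key and no effect matches all taints,
# including NodeWithImpairedVolumes=true:NoSchedule.
spec:
  template:
    spec:
      tolerations:
      - operator: Exists
```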

Comment 1 Vadim Rutkovsky 2018-10-03 08:54:40 UTC
Created PR to master: https://github.com/openshift/openshift-ansible/pull/10310

Comment 2 Vadim Rutkovsky 2018-11-09 06:41:12 UTC
3.11 cherrypick https://github.com/openshift/openshift-ansible/pull/10646

Comment 3 Vadim Rutkovsky 2018-11-09 06:42:14 UTC
Fix is available in openshift-ansible-3.11.42-1

Comment 4 Siva Reddy 2018-11-13 19:43:09 UTC
Tested the fix in the 3.11.43 build, but pods are still getting affected by the taint. Here are the details.

Version:
openshift v3.11.43
kubernetes v1.11.0+d4cacc0

Steps to reproduce:
1. $node is the compute node where the pods are running
2. Note the sync, ovs, and sdn pods running on it
   # oc get pods -n openshift-node -o wide | grep $node ;  oc get pods -n openshift-sdn -o wide | grep $node ;
    sync-vkxwk   1/1       Running   
    ovs-gjkts   1/1       Running   
    sdn-c7c7f   1/1       Running   
3. Note that sync, ovs, and sdn each have a pod running on the node
4. taint the node
    # oc adm taint node $node NodeWithImpairedVolumes=true:NoExecute
    # oc describe node $node | grep -i taint
      Taints:             NodeWithImpairedVolumes=true:NoExecute
5. restart the api and controllers
   # master-restart api
   # master-restart controllers                                                                                                                        
6. Note the sync and sdn pods:
   sync-vkxwk   0/1       CrashLoopBackOff
   sdn-c7c7f   0/1       CrashLoopBackOff

   The ovs pod doesn't even show up, and the sync and sdn pods go into CrashLoopBackOff constantly.

Comment 5 Vadim Rutkovsky 2018-11-14 07:58:05 UTC
(In reply to Siva Reddy from comment #4)
> tested the fix in 43 build but still pods are getting affected by the taint.
> Here are details.

Do sync pods get affected by this taint?
Do sync pods start on tainted nodes?

Comment 6 Siva Reddy 2018-11-15 16:47:44 UTC
Do pods start on tainted nodes?
   They are going into CrashLoopBackOff.
What's in the logs for these containers?
   I have sent you the cluster details via private message.
Why would a master api/controller restart be required?
   I was just following the steps in the guide to apply taints and tolerations; it may not be needed:
https://docs.openshift.com/container-platform/3.11/admin_guide/scheduling/taints_tolerations.html
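For context, the guide linked above covers per-taint tolerations. Unlike the tolerate-everything fix for the sync daemonset, a toleration scoped to only the specific taint used in this reproduction would look something like this (a sketch of the standard Kubernetes fields, not the actual fix):

```yaml
# Pod-spec fragment: tolerate only NodeWithImpairedVolumes=true:NoExecute.
tolerations:
- key: "NodeWithImpairedVolumes"
  operator: "Equal"
  value: "true"
  effect: "NoExecute"
```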

Comment 7 Siva Reddy 2018-11-16 10:58:59 UTC
The sync daemonsets are tolerating the taints put on the node.

Version:
oc v3.11.44
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server 
openshift v3.11.44
kubernetes v1.11.0+d4cacc0

Steps to reproduce:
1. taint the node
   # oc adm taint node $node NodeWithImpairedVolumes=true:NoExecute
   # oc describe node $node | grep -i taint
2. delete the sync pod ds and recreate it
   # oc get ds sync -o yaml > sync-ds.yaml
   # oc delete ds sync
   # oc create -f sync-ds.yaml
3. Note the sync pods

    The sync pods get created without any issue in spite of the taint being present on the node.

Comment 9 errata-xmlrpc 2019-01-10 09:04:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0024

