Description of problem:
The openshift-sdn daemonsets perform critical network setup for nodes. The pods should therefore tolerate most taints a customer or component might apply to a node. For example, the storage subsystem will automatically apply 'NodeWithImpairedVolumes=true:NoSchedule'. At present, this inhibits sdn pods from running on the node until the taint is manually removed. Likewise, if a customer taints a node, they should not lose sdn functionality.

Version-Release number of the following components:
v3.11.16

Steps to Reproduce:
1. Apply a taint to a node.
2. Observe that the sdn daemonset pods will not run on the node.
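For reference, this is straightforward to reproduce with the taint mentioned above; a minimal sketch (the node name is a placeholder, namespace/daemonset layout is the 3.11 one described in this report):

# oc adm taint node <node> NodeWithImpairedVolumes=true:NoSchedule
# oc get pods -n openshift-sdn -o wide | grep <node>
# oc adm taint node <node> NodeWithImpairedVolumes=true:NoSchedule-

While the taint is present, an sdn or ovs pod that is deleted or restarted on that node is not rescheduled there; the trailing '-' on the last command removes the taint again.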
PR https://github.com/openshift/openshift-ansible/pull/10646
Fix is available in openshift-ansible-3.11.42-1
Tested the fix in the 3.11.43 build, but the pods are still affected by the taint. Details below.

Version:
openshift v3.11.43
kubernetes v1.11.0+d4cacc0

Steps to reproduce:
1. $node is the compute node where the pods are running.
2. Note the pods running for sync, ovs, and sdn:
# oc get pods -n openshift-node -o wide | grep $node ; oc get pods -n openshift-sdn -o wide | grep $node
sync-vkxwk 1/1 Running
ovs-gjkts 1/1 Running
sdn-c7c7f 1/1 Running
3. Note that there are three pods on the node, one each for sync, ovs, and sdn.
4. Taint the node:
# oc adm taint node $node NodeWithImpairedVolumes=true:NoExecute
# oc describe node $node | grep -i taint
Taints: NodeWithImpairedVolumes=true:NoExecute
5. Restart the API and controllers:
# master-restart api
# master-restart controllers
6. Check the sync, ovs, and sdn pods again:
sync-vkxwk 0/1 CrashLoopBackOff
sdn-c7c7f 0/1 CrashLoopBackOff

The ovs pod does not even show up, and the sync and sdn pods go into CrashLoopBackOff constantly.
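To narrow down which daemonset is missing the toleration, it may help to dump the tolerations each one carries; a sketch, assuming the daemonset names sync, ovs and sdn match the pod names above:

# oc -n openshift-node get daemonset sync -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'
# oc -n openshift-sdn get daemonset ovs -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'
# oc -n openshift-sdn get daemonset sdn -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'

A daemonset that is meant to ignore arbitrary taints should show a toleration with operator "Exists" and no key; if the ovs output is empty, the NoExecute taint used above will evict its pod.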
(In reply to Siva Reddy from comment #3)
> Tested the fix in the 3.11.43 build, but the pods are still affected by the taint.
> 6. Check the sync, ovs, and sdn pods again:
> sync-vkxwk 0/1 CrashLoopBackOff
> sdn-c7c7f 0/1 CrashLoopBackOff

Do pods start on tainted nodes? What's in the logs for these containers? Why would a master api/controller restart be required?
> Do pods start on tainted nodes?
They are going into CrashLoopBackOff.

> What's in the logs for these containers?
I have sent you the cluster details via private message.

> Why would a master api/controller restart be required?
I was just following the steps in the guide for applying taints and tolerations; it may not be needed:
https://docs.openshift.com/container-platform/3.11/admin_guide/scheduling/taints_tolerations.html
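(For reference, the taint itself can be applied and removed with oc adm taint alone, so the restart may indeed not be needed; a sketch, where the trailing '-' removes the taint:

# oc adm taint node $node NodeWithImpairedVolumes=true:NoExecute
# oc adm taint node $node NodeWithImpairedVolumes=true:NoExecute-
)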
Can you attach the logs of the crashlooping sdn pods? Taints don't break running pods by themselves: a NoSchedule taint only affects whether the pod is scheduled, and a NoExecute taint can evict it, but neither should cause a crashloop. If the pod is crashlooping it's a different (but perhaps related) bug.
Yes, I agree. My guess is that the NoExecute taint is terminating the ovs pod, and the sdn pod is crashlooping as a result. The environment is no longer available, but I will spin up a new one and attach the logs.
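Once the new environment is up, a quick way to check that guess before attaching logs could be to look for eviction events and at the previous container log (the pod name is a placeholder):

# oc -n openshift-sdn get events --sort-by=.lastTimestamp | grep -i -e taint -e evict
# oc -n openshift-sdn logs <sdn-pod> --previous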
Created attachment 1506492 [details] sdn pod log
Yes, it looks like the OVS daemonset was missed as well and does not tolerate the taints. Vadim, can you take care of that?
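Until the openshift-ansible change lands, a possible stop-gap (only a sketch; the exact toleration used by the fix may differ) would be to patch a blanket toleration onto the ovs daemonset directly:

# oc -n openshift-sdn patch daemonset ovs --type=json \
  -p '[{"op":"add","path":"/spec/template/spec/tolerations","value":[{"operator":"Exists"}]}]'

A toleration with only operator "Exists" (no key, no effect) matches every taint, which is the behavior wanted for these node-level daemonsets.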
Ah, that would explain it
3.11 PR - https://github.com/openshift/openshift-ansible/pull/10731
Fix is available in openshift-ansible-3.11.45-1
Verified that the sdn daemonsets tolerate the taints and their pods do not crash after the nodes are tainted.

Version:
openshift v3.11.51
kubernetes v1.11.0+d4cacc0
oc v3.11.50
openshift-ansible-3.11.51-1.git.0.51c90a3.el7.noarch.rpm

Verification steps:
1. $node is the compute node where the pods are running.
2. Note the pods running for sync, ovs, and sdn:
# oc get pods -n openshift-node -o wide | grep $node ; oc get pods -n openshift-sdn -o wide | grep $node
sync-vkxwk 1/1 Running
ovs-gjkts 1/1 Running
sdn-c7c7f 1/1 Running
3. Note that there are three pods on the node, one each for sync, ovs, and sdn.
4. Taint the node:
# oc adm taint node $node NodeWithImpairedVolumes=true:NoExecute
# oc describe node $node | grep -i taint
Taints: NodeWithImpairedVolumes=true:NoExecute
5. Check the sync, ovs, and sdn pods again:
All the pods are running without crashing.
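As an optional extra check, the ovs daemonset (the one that was missing the toleration earlier) can be inspected to confirm the fixed build ships it and that its pod stays on the tainted node; a sketch:

# oc -n openshift-sdn get daemonset ovs -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'
# oc get pods -n openshift-sdn -o wide | grep $node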
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3743