Bug 1960446

Summary:	nmstate operator doesn't handle nodes with taints
Product:	OpenShift Container Platform	Reporter:	Alan Chan <alchan>
Component:	Networking	Assignee:	Ben Nemec <bnemec>
Networking sub component:	kubernetes-nmstate-operator	QA Contact:	Oleg Sher <osher>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	low
Priority:	low	CC:	aos-bugs, eparis, imelofer, jokerman, sdasu, tsedovic, welin
Version:	4.7	Keywords:	Triaged
Target Milestone:	---
Target Release:	4.8.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: Incorrect toleration setting on nmstate-handler pod. Consequence: Handler pods could not be deployed on infra nodes with NoSchedule taints, which made it impossible to configure networking on such nodes with the nmstate-operator. Fix: Handler pod toleration was changed to allow deployment on all nodes. Result: The nmstate-operator can now be used to configure networking on all nodes, regardless of taint.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-08-10 11:27:36 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1970127

Description Alan Chan 2021-05-13 21:26:01 UTC

Description of problem:

The default tolerations of the nmstate-handler daemonset don't allow it to deploy to nodes with other taints:

$ oc get ds nmstate-handler -oyaml | yq e '.spec.template.spec.tolerations' -
- effect: NoSchedule
  key: node-role.kubernetes.io/master
  operator: Exists


Version-Release number of selected component (if applicable):

$ oc version
Client Version: 4.7.9
Server Version: 4.7.9
Kubernetes Version: v1.20.0+7d0a2b2

$ oc -n openshift-nmstate get csv | grep nmstate
kubernetes-nmstate-operator.v4.7.0   Kubernetes NMState Operator        4.7.0-202104250659.p0                                Succeeded


How reproducible:

1. Have nodes with taints, then deploy the nmstate operator and create an NMState instance.

2. Check which nodes the nmstate-handler pods get deployed on.

$ oc get nodes -o custom-columns=NODE:.metadata.name,TAINTS:.spec.taints
NODE                                         TAINTS
ip-10-0-135-5.us-east-2.compute.internal     <none>
ip-10-0-139-144.us-east-2.compute.internal   [map[effect:NoSchedule key:node-role.kubernetes.io/master]]
ip-10-0-145-42.us-east-2.compute.internal    [map[effect:NoSchedule key:infra value:reserved] map[effect:NoExecute key:infra value:reserved]]
ip-10-0-152-162.us-east-2.compute.internal   [map[effect:NoSchedule key:node.ocs.openshift.io/storage value:true]]
ip-10-0-165-116.us-east-2.compute.internal   [map[effect:NoSchedule key:node-role.kubernetes.io/master]]
ip-10-0-170-249.us-east-2.compute.internal   <none>
ip-10-0-171-65.us-east-2.compute.internal    [map[effect:NoSchedule key:infra value:reserved] map[effect:NoExecute key:infra value:reserved]]
ip-10-0-178-136.us-east-2.compute.internal   [map[effect:NoSchedule key:node.ocs.openshift.io/storage value:true]]
ip-10-0-196-0.us-east-2.compute.internal     [map[effect:NoSchedule key:infra value:reserved] map[effect:NoExecute key:infra value:reserved]]
ip-10-0-202-245.us-east-2.compute.internal   <none>
ip-10-0-207-208.us-east-2.compute.internal   [map[effect:NoSchedule key:node.ocs.openshift.io/storage value:true]]
ip-10-0-218-204.us-east-2.compute.internal   [map[effect:NoSchedule key:node-role.kubernetes.io/master]]

$ oc -n openshift-nmstate get pod -l name=nmstate-handler -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName --sort-by='{.spec.nodeName}'
NAME                    NODE
nmstate-handler-77kzd   ip-10-0-135-5.us-east-2.compute.internal
nmstate-handler-jb6cb   ip-10-0-139-144.us-east-2.compute.internal
nmstate-handler-nwpxt   ip-10-0-165-116.us-east-2.compute.internal
nmstate-handler-6cxl6   ip-10-0-170-249.us-east-2.compute.internal
nmstate-handler-tv7xv   ip-10-0-202-245.us-east-2.compute.internal
nmstate-handler-78nsx   ip-10-0-218-204.us-east-2.compute.internal

Only 6 pods are running, although there are 12 nodes in total. The nmstate-handler pods run only on the master nodes (whose taint is tolerated) and on the worker nodes with no taints.
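
The 6-of-12 count can be reproduced offline with a small Python sketch. It is a simplified model of Kubernetes taint/toleration matching for scheduling only, not the scheduler's actual code; the names `tolerates`, `schedulable`, and `NODE_TAINTS` are made up for illustration, and the taints are copied from the node listing above.

```python
# Simplified model of Kubernetes taint/toleration matching (scheduling only).
# Function and variable names are illustrative, not part of any real API.

def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    # An empty effect in the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # An empty key (used with operator Exists) matches every taint key.
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    # Exists ignores the value; the default operator Equal compares values.
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(tolerations, taints):
    """A pod fits a node only if every NoSchedule/NoExecute taint is tolerated."""
    hard = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


MASTER = [{"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}]
INFRA = [
    {"key": "infra", "value": "reserved", "effect": "NoSchedule"},
    {"key": "infra", "value": "reserved", "effect": "NoExecute"},
]
STORAGE = [
    {"key": "node.ocs.openshift.io/storage", "value": "true", "effect": "NoSchedule"}
]

# Node groups from the listing: 3 untainted, 3 master, 3 infra, 3 storage.
NODE_TAINTS = [[]] * 3 + [MASTER] * 3 + [INFRA] * 3 + [STORAGE] * 3

# The toleration currently set on the nmstate-handler daemonset.
DEFAULT_TOLERATIONS = [
    {"key": "node-role.kubernetes.io/master", "operator": "Exists", "effect": "NoSchedule"}
]

eligible = sum(schedulable(DEFAULT_TOLERATIONS, taints) for taints in NODE_TAINTS)
print(f"{eligible} of {len(NODE_TAINTS)} nodes accept the handler pod")
```

Under this model the infra and storage nodes are rejected because the only toleration present is keyed on node-role.kubernetes.io/master.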


Expected results:

- The operator needs to handle nodes with taints better. It may need to expose a tolerations API in the NMState resource, or set something like the following on the handler to tolerate all taints:

      tolerations:
      - operator: "Exists"
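
Why a toleration with no key and operator "Exists" covers every taint can be shown with a short Python sketch. It assumes the standard Kubernetes matching rule (an empty key or empty effect in a toleration acts as a wildcard); the `tolerates` helper is hypothetical and only models that rule, and whether the eventual fix uses exactly this toleration is not stated here.

```python
# Hypothetical helper modelling the Kubernetes toleration matching rule.
def tolerates(toleration, taint):
    # Empty effect in the toleration matches any taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # Empty key in the toleration matches any taint key.
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    # Exists ignores the taint value; Equal (the default) compares it.
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


# The proposed blanket toleration: no key, no effect, operator Exists.
BLANKET = {"operator": "Exists"}

# Every distinct taint that appears in this report.
TAINTS_SEEN = [
    {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"},
    {"key": "infra", "value": "reserved", "effect": "NoSchedule"},
    {"key": "infra", "value": "reserved", "effect": "NoExecute"},
    {"key": "node.ocs.openshift.io/storage", "value": "true", "effect": "NoSchedule"},
]

covers_all = all(tolerates(BLANKET, taint) for taint in TAINTS_SEEN)
print(covers_all)
```

Because the blanket toleration also matches NoExecute taints, handler pods would not be evicted from the infra nodes in this cluster either.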

Comment 1 kevin 2021-05-30 05:37:09 UTC

Hello everyone, we are also facing the same issue: NMState cannot handle nodes with taints.

Actual results:

nmstate-handler-rwvqd
0/8 nodes are available: 3 node(s) had taint {node.ocs.openshift.io/storage: true}, that the pod didn't tolerate, 5 node(s) didn't match Pod's node affinity.

Is there any plan to solve this issue?

Comment 2 kevin 2021-05-30 05:38:59 UTC

Our OpenShift version is 4.7.11, and the NMState Operator version is 4.7.0-202104250659.p0.

Comment 3 Ben Nemec 2021-06-01 22:05:27 UTC

A fix merged upstream recently: https://github.com/nmstate/kubernetes-nmstate/pull/755

That will need to be pulled in downstream and backported to 4.7.

Comment 6 Oleg Sher 2021-06-29 16:29:20 UTC

The current nmstate version installed from OperatorHub is
kubernetes-nmstate-operator.v4.7.0   Kubernetes NMState Operator   4.7.0-202103010125.p0 

and the bug was opened against
$ oc -n openshift-nmstate get csv | grep nmstate
kubernetes-nmstate-operator.v4.7.0   Kubernetes NMState Operator   4.7.0-202104250659.p0 

so the fix can't be verified.

Comment 8 Ben Nemec 2021-07-06 16:20:21 UTC

*** Bug 1977577 has been marked as a duplicate of this bug. ***

Comment 9 Tomas Sedovic 2021-07-20 16:27:31 UTC

The pull request linked to this BZ has been merged into the 4.8 branch, so it's fixed in 4.9 and 4.8. We still need a backport to 4.7, which is also linked in this BZ.

Comment 15 errata-xmlrpc 2021-08-10 11:27:36 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.4 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2983