Bug 1960446 - nmstate operator doesn't handle nodes with taints
Summary: nmstate operator doesn't handle nodes with taints
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.8.z
Assignee: Ben Nemec
QA Contact: Oleg Sher
URL:
Whiteboard:
Duplicates: 1977577
Depends On:
Blocks: 1970127
Reported: 2021-05-13 21:26 UTC by Alan Chan
Modified: 2021-08-10 11:28 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Incorrect toleration setting on nmstate-handler pod. Consequence: Handler pods could not be deployed on infra nodes with NoSchedule taints, which made it impossible to configure networking on such nodes with the nmstate-operator. Fix: Handler pod toleration was changed to allow deployment on all nodes. Result: The nmstate-operator can now be used to configure networking on all nodes, regardless of taint.
Clone Of:
Environment:
Last Closed: 2021-08-10 11:27:36 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift kubernetes-nmstate pull 192 | 0 | None | closed | Bug 1960446: UPSTREAM: <carry>: Change handler toleration to "operator: exists" (#… | 2021-07-20 16:25:55 UTC
Red Hat Product Errata | RHSA-2021:2983 | 0 | None | None | None | 2021-08-10 11:28:24 UTC

Description Alan Chan 2021-05-13 21:26:01 UTC
Description of problem:

With its default tolerations, the nmstate-handler daemonset is not deployed to nodes that carry any taint other than the master taint:

$ oc get ds nmstate-handler -oyaml | yq e '.spec.template.spec.tolerations' -
- effect: NoSchedule
  key: node-role.kubernetes.io/master
  operator: Exists


Version-Release number of selected component (if applicable):

$ oc version
Client Version: 4.7.9
Server Version: 4.7.9
Kubernetes Version: v1.20.0+7d0a2b2

$ oc -n openshift-nmstate get csv | grep nmstate
kubernetes-nmstate-operator.v4.7.0   Kubernetes NMState Operator        4.7.0-202104250659.p0                                Succeeded


How reproducible:

1. Have nodes with taints, then deploy the nmstate operator and create an NMState instance.

2. Check which nodes the handler pods get deployed on.

$ oc get nodes -o custom-columns=NODE:.metadata.name,TAINTS:.spec.taints
NODE                                         TAINTS
ip-10-0-135-5.us-east-2.compute.internal     <none>
ip-10-0-139-144.us-east-2.compute.internal   [map[effect:NoSchedule key:node-role.kubernetes.io/master]]
ip-10-0-145-42.us-east-2.compute.internal    [map[effect:NoSchedule key:infra value:reserved] map[effect:NoExecute key:infra value:reserved]]
ip-10-0-152-162.us-east-2.compute.internal   [map[effect:NoSchedule key:node.ocs.openshift.io/storage value:true]]
ip-10-0-165-116.us-east-2.compute.internal   [map[effect:NoSchedule key:node-role.kubernetes.io/master]]
ip-10-0-170-249.us-east-2.compute.internal   <none>
ip-10-0-171-65.us-east-2.compute.internal    [map[effect:NoSchedule key:infra value:reserved] map[effect:NoExecute key:infra value:reserved]]
ip-10-0-178-136.us-east-2.compute.internal   [map[effect:NoSchedule key:node.ocs.openshift.io/storage value:true]]
ip-10-0-196-0.us-east-2.compute.internal     [map[effect:NoSchedule key:infra value:reserved] map[effect:NoExecute key:infra value:reserved]]
ip-10-0-202-245.us-east-2.compute.internal   <none>
ip-10-0-207-208.us-east-2.compute.internal   [map[effect:NoSchedule key:node.ocs.openshift.io/storage value:true]]
ip-10-0-218-204.us-east-2.compute.internal   [map[effect:NoSchedule key:node-role.kubernetes.io/master]]

$ oc -n openshift-nmstate get pod -l name=nmstate-handler -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName --sort-by='{.spec.nodeName}'
NAME                    NODE
nmstate-handler-77kzd   ip-10-0-135-5.us-east-2.compute.internal
nmstate-handler-jb6cb   ip-10-0-139-144.us-east-2.compute.internal
nmstate-handler-nwpxt   ip-10-0-165-116.us-east-2.compute.internal
nmstate-handler-6cxl6   ip-10-0-170-249.us-east-2.compute.internal
nmstate-handler-tv7xv   ip-10-0-202-245.us-east-2.compute.internal
nmstate-handler-78nsx   ip-10-0-218-204.us-east-2.compute.internal

There are only 6 pods running, but there are 12 nodes in total. The nmstate-handler pods run only on the master nodes and on the worker nodes with no taints.
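
The gap between 6 pods and 12 nodes follows from how Kubernetes matches tolerations against taints. Below is a minimal Python sketch of that matching, not the scheduler's actual code. The names `tolerates`, `schedulable`, and the constants are made up for illustration; the taint data is copied from the node listing above.

```python
# Illustrative model of Kubernetes taint/toleration matching.
# This is a sketch for reasoning about the report, not scheduler code.

def tolerates(toleration, taint):
    """Return True if one toleration matches one taint."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # An empty key (only valid with operator Exists) matches every taint key.
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    # Exists ignores the value; Equal (the default) compares it.
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(tolerations, taints):
    """A pod can land on a node only if every hard taint is tolerated."""
    hard = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


MASTER = [{"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}]
INFRA = [
    {"key": "infra", "value": "reserved", "effect": "NoSchedule"},
    {"key": "infra", "value": "reserved", "effect": "NoExecute"},
]
STORAGE = [
    {"key": "node.ocs.openshift.io/storage", "value": "true", "effect": "NoSchedule"}
]

# Taint sets of the 12 nodes, in the order of the listing above.
NODE_TAINTS = [
    [], MASTER, INFRA, STORAGE, MASTER, [],
    INFRA, STORAGE, INFRA, [], STORAGE, MASTER,
]

# Default toleration reported for the nmstate-handler daemonset.
DEFAULT_TOLERATIONS = [
    {"key": "node-role.kubernetes.io/master", "operator": "Exists", "effect": "NoSchedule"}
]

eligible = [t for t in NODE_TAINTS if schedulable(DEFAULT_TOLERATIONS, t)]
print(len(eligible), "of", len(NODE_TAINTS), "nodes are eligible")
```

Under these assumptions only the three untainted workers and the three masters are eligible, which matches the six pods listed.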


Expected results:

- The operator needs to handle nodes with taints better. It may be necessary to add a tolerations API to the NMState kind resource. Something like the following, set in the appropriate place, would tolerate all taints:

      tolerations:
      - operator: "Exists"
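
As a rough illustration of why this snippet tolerates every taint: in Kubernetes, a toleration with no key and operator Exists matches all taint keys and values, and a toleration with no effect matches all effects. The Python sketch below only checks for that wildcard form; the helper name `tolerates_everything` is hypothetical and nothing here comes from the operator's code.

```python
def tolerates_everything(toleration):
    """True for the wildcard form: no key, no effect, operator Exists."""
    return (
        not toleration.get("key")
        and not toleration.get("effect")
        and toleration.get("operator") == "Exists"
    )


# Current default from the nmstate-handler daemonset in this report.
current = {
    "key": "node-role.kubernetes.io/master",
    "operator": "Exists",
    "effect": "NoSchedule",
}
# Proposed toleration from the snippet above.
proposed = {"operator": "Exists"}

print(tolerates_everything(current))   # False: restricted by key and effect
print(tolerates_everything(proposed))  # True: nothing restricts the match
```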

Comment 1 kevin 2021-05-30 05:37:09 UTC
Hello everyone, we are facing the same issue: NMState cannot handle nodes with taints.

Actual results:

nmstate-handler-rwvqd
0/8 nodes are available: 3 node(s) had taint {node.ocs.openshift.io/storage: true}, that the pod didn't tolerate, 5 node(s) didn't match Pod's node affinity.

Is there a plan to solve this issue?

Comment 2 kevin 2021-05-30 05:38:59 UTC
Our OpenShift version is 4.7.11 and the NMState Operator version is 4.7.0-202104250659.p0.

Comment 3 Ben Nemec 2021-06-01 22:05:27 UTC
A fix merged upstream recently: https://github.com/nmstate/kubernetes-nmstate/pull/755

That will need to be pulled in downstream and backported to 4.7.

Comment 6 Oleg Sher 2021-06-29 16:29:20 UTC
The current nmstate version installed from OperatorHub is
kubernetes-nmstate-operator.v4.7.0   Kubernetes NMState Operator   4.7.0-202103010125.p0

and the bug was opened for
$ oc -n openshift-nmstate get csv | grep nmstate
kubernetes-nmstate-operator.v4.7.0   Kubernetes NMState Operator   4.7.0-202104250659.p0

so the fix can't be verified.

Comment 8 Ben Nemec 2021-07-06 16:20:21 UTC
*** Bug 1977577 has been marked as a duplicate of this bug. ***

Comment 9 Tomas Sedovic 2021-07-20 16:27:31 UTC
The pull request linked to this BZ has been merged into the 4.8 branch, so the issue is fixed in 4.9 and 4.8. We still need a backport to 4.7, which is also linked in this BZ.

Comment 15 errata-xmlrpc 2021-08-10 11:27:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.4 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2983

