Bug 2012920

Summary: nncp in progressing state forever when cluster is having Windows node
Product: Container Native Virtualization (CNV)
Reporter: nijin ashok <nashok>
Component: Networking
Assignee: Quique Llorente <ellorent>
Status: CLOSED ERRATA
QA Contact: Adi Zavalkovsky <azavalko>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 2.6.6
CC: azavalko, cnv-qe-bugs, danken, ellorent, gveitmic
Target Milestone: ---
Target Release: 4.10.0
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: kubernetes-nmstate-handler-container-v4.10.0-19
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-03-16 15:56:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description nijin ashok 2021-10-11 15:30:34 UTC
Description of problem:

By default, the "nodeSelector" for the nmstate-handler ds is "beta.kubernetes.io/arch=amd64". This selector also matches Windows nodes added to the cluster, so "nmstate-handler" pods are scheduled on Windows nodes as well, where they stay in "pending" status.

If I understand the code correctly, NodesRunningNmstate is calculated by comparing the output of "get nodes --selector=beta.kubernetes.io/arch=amd64" with the pod.Spec.NodeName of each pod from "get pods --selector 'component=kubernetes-nmstate-handler'". This also counts the Windows nodes, since nmstate-handler is scheduled on them even though its status is "pending".
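
As a rough illustration of the calculation described above, here is a simplified Python model. It is not the actual kubernetes-nmstate code: the node names, the labels, and the `matches` and `nodes_running_nmstate` helpers are hypothetical, and the assumption that the pod phase is ignored follows the reporter's reading of the code.

```python
# Simplified model; NOT the real kubernetes-nmstate implementation.
# Node names and labels are made up for illustration.
nodes = {
    "linux-1": {"beta.kubernetes.io/arch": "amd64", "kubernetes.io/os": "linux"},
    "linux-2": {"beta.kubernetes.io/arch": "amd64", "kubernetes.io/os": "linux"},
    "win-1": {"beta.kubernetes.io/arch": "amd64", "kubernetes.io/os": "windows"},
}

# Handler pods as (pod.Spec.NodeName, phase) pairs.
handler_pods = [("linux-1", "Running"), ("linux-2", "Running"), ("win-1", "Pending")]


def matches(labels, selector):
    """Return True when every selector key/value is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def nodes_running_nmstate(nodes, pods, selector):
    """Intersect selector-matched nodes with nodes that have a handler pod."""
    selected = {name for name, labels in nodes.items() if matches(labels, selector)}
    # Assumption: only the node name is compared, so the pod phase is ignored.
    scheduled = {node_name for node_name, _phase in pods}
    return sorted(selected & scheduled)


counted = nodes_running_nmstate(
    nodes, handler_pods, {"beta.kubernetes.io/arch": "amd64"}
)
print(counted)
```

In this model the Windows node is counted even though its handler pod never leaves the pending phase.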

So when any nncp is created, it waits for an nnce to be created on each Windows node as well. Since nmstate-handler always stays in pending status on a Windows node, the nnce is never created there, and hence the nncp stays in the "ConfigurationProgressing" state forever.

~~~
    message: Policy is progressing 7/9 nodes finished
    reason: ConfigurationProgressing
    status: Unknown
    type: Available
~~~

Here, the two unfinished nodes are the Windows nodes.
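
The 7/9 count can be reproduced with a small Python sketch. This is a simplified model and not the real controller logic: the `policy_condition` helper, the node names, and the success values in the final return are placeholders chosen for this example.

```python
# Simplified model of the policy condition; NOT the real kubernetes-nmstate code.
def policy_condition(expected_nodes, enactment_state):
    """Summarize policy progress from per-node enactment states."""
    finished = [n for n in expected_nodes if enactment_state.get(n) == "Available"]
    total = len(expected_nodes)
    if len(finished) < total:
        return {
            "type": "Available",
            "status": "Unknown",
            "reason": "ConfigurationProgressing",
            "message": f"Policy is progressing {len(finished)}/{total} nodes finished",
        }
    # Placeholder success values, used only in this sketch.
    return {
        "type": "Available",
        "status": "True",
        "reason": "Done",
        "message": f"{total}/{total} nodes finished",
    }


linux_nodes = [f"linux-{i}" for i in range(1, 8)]  # seven Linux nodes
windows_nodes = ["win-1", "win-2"]  # two Windows nodes, never get an enactment

# Only the Linux nodes ever report a finished enactment.
enactment_state = {name: "Available" for name in linux_nodes}

condition = policy_condition(linux_nodes + windows_nodes, enactment_state)
print(condition["message"])
```

As long as the Windows nodes are part of the expected set, the finished count can never reach the total.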


Version-Release number of selected component (if applicable):

v2.6.6

How reproducible:

100%

Steps to Reproduce:

Try to create an nncp on an OpenShift cluster that has Windows nodes.

Actual results:

The nncp stays in the progressing state forever when the cluster has Windows nodes.

Expected results:

Probably, we also have to add "beta.kubernetes.io/os: linux" in nodeSelector for nmstate-handler ds since nmstate won't work with Windows nodes.
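
A minimal Python sketch of the effect of adding an OS label to the selector. The node names and the `matches` helper are hypothetical, and the sketch uses the `kubernetes.io/os` label that the later comments settle on rather than the `beta.` variant.

```python
# Illustrative only: models label-selector matching with plain dictionaries.
def matches(labels, selector):
    """Return True when every selector key/value is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Hypothetical cluster with one Windows node.
nodes = {
    "linux-1": {"beta.kubernetes.io/arch": "amd64", "kubernetes.io/os": "linux"},
    "linux-2": {"beta.kubernetes.io/arch": "amd64", "kubernetes.io/os": "linux"},
    "win-1": {"beta.kubernetes.io/arch": "amd64", "kubernetes.io/os": "windows"},
}

old_selector = {"beta.kubernetes.io/arch": "amd64"}
new_selector = {**old_selector, "kubernetes.io/os": "linux"}

old_targets = sorted(n for n, labels in nodes.items() if matches(labels, old_selector))
new_targets = sorted(n for n, labels in nodes.items() if matches(labels, new_selector))

print(old_targets)
print(new_targets)
```

With the extra label, the Windows node no longer matches, so no handler pod would be scheduled there.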

Additional info:

Comment 1 Dan Kenigsberg 2021-10-11 20:53:14 UTC
Thanks for filing this bug, Nijin. We do not support OpenShift Virtualization on Windows workers. virt-handler has kubernetes.io/os=linux as its nodeSelector. It makes sense to add it to the network DaemonSets: bridge-marker, kube-cni-linux-bridge-plugin, nmstate-handler, ovs-cni-amd64.

Comment 2 nijin ashok 2021-10-12 02:34:16 UTC
(In reply to Dan Kenigsberg from comment #1)
> Thanks for filing this bug, Nijin. We do not support OpenShift
> Virtualization on Windows workers.

Thank you, Dan. Yes, but I think we should be able to ignore Windows worker nodes if they are added to the same cluster.

> virt-handler has kubernetes.io/os=linux
> as its nodeSelector. It makes sense to add it to the network DaemonSets:
> bridge-marker, kube-cni-linux-bridge-plugin, nmstate-handler, ovs-cni-amd64.

I think that will help here.

Comment 3 Adi Zavalkovsky 2021-12-21 11:54:56 UTC
The network DaemonSets still hold the old nodeSelector: "beta.kubernetes.io/arch=amd64"

Comment 4 Petr Horáček 2022-01-20 12:11:25 UTC
It seems that https://github.com/nmstate/kubernetes-nmstate/pull/856 did not fix the issue. It may be due to CNAO overwriting the placement configuration https://github.com/kubevirt/cluster-network-addons-operator/blob/8d0037553962ff72226a817036214b6017fcce20/data/nmstate/operand/operator.yaml#L28.

Comment 5 Petr Horáček 2022-01-20 13:23:51 UTC
Grooming: Meni raised that a better and more explicit approach would be to fail the NNCP if its selector matches unsupported (Windows) nodes, instead of silently ignoring them.
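
The fail-fast idea could look roughly like the following Python sketch. Everything here is an assumption made for illustration: the `validate_policy` helper, the `UnsupportedNodes` reason, and the message text do not come from kubernetes-nmstate.

```python
# Hypothetical validation sketch; names and messages are invented for illustration.
def matches(labels, selector):
    """Return True when every selector key/value is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def validate_policy(policy_selector, nodes):
    """Return a failing condition if the policy selects any non-Linux node."""
    unsupported = sorted(
        name
        for name, labels in nodes.items()
        if matches(labels, policy_selector)
        and labels.get("kubernetes.io/os") != "linux"
    )
    if unsupported:
        return {
            "status": "False",
            "reason": "UnsupportedNodes",  # made-up reason name
            "message": "Policy selector matches unsupported nodes: "
            + ", ".join(unsupported),
        }
    return None  # nothing to report; the policy may proceed


nodes = {
    "linux-1": {"kubernetes.io/os": "linux"},
    "win-1": {"kubernetes.io/os": "windows"},
}

# An empty selector matches every node, including the Windows one.
failure = validate_policy({}, nodes)
# A selector restricted to Linux nodes passes validation.
ok = validate_policy({"kubernetes.io/os": "linux"}, nodes)
print(failure)
print(ok)
```

The trade-off is that the user gets an explicit failure instead of a policy that silently skips nodes it appears to select.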

Comment 6 Quique Llorente 2022-01-24 13:19:40 UTC
We need to set linux as the default placement configuration in CNAO: https://github.com/kubevirt/cluster-network-addons-operator/blob/main/pkg/network/placement_configuration.go#L54-L56

Comment 7 Quique Llorente 2022-01-24 13:32:37 UTC
Looks like we already did that in CNAO, just for nmstate: https://github.com/kubevirt/cluster-network-addons-operator/pull/1124. Are we sure we are testing it?

Comment 8 Quique Llorente 2022-01-24 13:39:08 UTC
Checking an OCP 4.10 cluster, it looks like os: linux is there:

nodeSelector:
  beta.kubernetes.io/arch: amd64
  kubernetes.io/os: linux

I think we can close this bz.

Comment 9 Adi Zavalkovsky 2022-01-31 08:05:13 UTC
Verified on OCP 4.10, k-nmstate-handler-4.10.0-41.

Same results as Quique:
nodeSelector:
  beta.kubernetes.io/arch: amd64
  kubernetes.io/os: linux

The nodes hold the kubernetes.io/os: linux label.

Comment 14 errata-xmlrpc 2022-03-16 15:56:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0947