Description of problem:
By default, the nodeSelector of the nmstate-handler DaemonSet is "beta.kubernetes.io/arch=amd64". This also matches Windows nodes added to the cluster, so nmstate-handler pods are scheduled on Windows nodes as well and remain in "Pending" status.

If I understand the code correctly, NodesRunningNmstate is calculated by comparing the output of "get nodes --selector=beta.kubernetes.io/arch=amd64" against pod.Spec.NodeName from "get pods --selector 'component=kubernetes-nmstate-handler'". This count includes Windows nodes, since nmstate-handler is scheduled on them even though the pods stay Pending. So when any NNCP is created, it waits for an NNCE to be created on the Windows nodes as well. Since nmstate-handler is always Pending on Windows nodes, the NNCE is never created, and the NNCP remains in "ConfigurationProgressing" state forever:

~~~
  message: Policy is progressing 7/9 nodes finished
  reason: ConfigurationProgressing
  status: Unknown
  type: Available
~~~

Here, the two unfinished nodes are Windows nodes.

Version-Release number of selected component (if applicable):
v2.6.6

How reproducible:
100%

Steps to Reproduce:
1. Create an NNCP on an OpenShift cluster that has Windows nodes.

Actual results:
The NNCP stays in progressing state forever when the cluster has Windows nodes.

Expected results:
We probably also have to add "beta.kubernetes.io/os: linux" to the nodeSelector of the nmstate-handler DaemonSet, since nmstate will not work on Windows nodes.

Additional info:
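The fix suggested above would look roughly like this in the DaemonSet spec. This is a minimal sketch, not the actual shipped manifest; the metadata and the existing label values are taken from the report, everything else is illustrative:

```yaml
# Hypothetical excerpt of the nmstate-handler DaemonSet pod template.
# Adding the os selector keeps the handler off Windows nodes.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nmstate-handler
spec:
  template:
    spec:
      nodeSelector:
        beta.kubernetes.io/arch: amd64
        kubernetes.io/os: linux   # proposed addition: never schedule on Windows nodes
```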
Thanks for filing this bug, Nijin. We do not support OpenShift Virtualization on Windows workers. virt-handler has kubernetes.io/os=linux as its nodeSelector. It makes sense to add it to the network DaemonSets: bridge-marker, kube-cni-linux-bridge-plugin, nmstate-handler, ovs-cni-amd64.
(In reply to Dan Kenigsberg from comment #1)
> Thanks for filing this bug, Nijin. We do not support OpenShift
> Virtualization on Windows workers.

Thank you, Dan. Yes, but I think we should be able to ignore Windows worker nodes if they are added to the same cluster.

> virt-handler has kubernetes.io/os=linux
> as its nodeSelector. It makes sense to add it to the network DaemonSets:
> bridge-marker, kube-cni-linux-bridge-plugin, nmstate-handler, ovs-cni-amd64.

I think that will help here.
The network DaemonSets still hold the old nodeSelector: "beta.kubernetes.io/arch=amd64".
It seems that https://github.com/nmstate/kubernetes-nmstate/pull/856 did not fix the issue. This may be because CNAO overwrites the placement configuration: https://github.com/kubevirt/cluster-network-addons-operator/blob/8d0037553962ff72226a817036214b6017fcce20/data/nmstate/operand/operator.yaml#L28.
Grooming: Meni raised that a better and more explicit approach would be to fail the NNCP if its selector matches unsupported (Windows) nodes, instead of silently ignoring them.
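Until something like that lands, a user-side mitigation under the current behaviour is to scope the NNCP's own nodeSelector to Linux nodes so Windows nodes are never matched. A sketch; the policy name and bridge interface are illustrative, only the spec.nodeSelector field is the point:

```yaml
apiVersion: nmstate.io/v1beta1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: br1-policy             # illustrative name
spec:
  nodeSelector:
    kubernetes.io/os: linux    # restrict the policy to Linux nodes only
  desiredState:
    interfaces:
      - name: br1              # illustrative linux-bridge interface
        type: linux-bridge
        state: up
```

With this selector, no NNCE is expected for Windows nodes, so the policy can reach Available.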
We need to set linux as the default placement configuration in CNAO: https://github.com/kubevirt/cluster-network-addons-operator/blob/main/pkg/network/placement_configuration.go#L54-L56
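Besides changing the built-in default, the placement can also be overridden per cluster through the NetworkAddonsConfig placementConfiguration field. A sketch, assuming the CNAO API field names are unchanged from the version linked above:

```yaml
# Hypothetical NetworkAddonsConfig overriding operand placement so the
# workload DaemonSets (including nmstate-handler) stay on Linux nodes.
apiVersion: networkaddonsoperator.network.kubevirt.io/v1
kind: NetworkAddonsConfig
metadata:
  name: cluster
spec:
  nmstate: {}
  placementConfiguration:
    workloads:
      nodeSelector:
        kubernetes.io/os: linux   # keep operand pods off Windows nodes
```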
Looks like we already did that in CNAO, just for nmstate: https://github.com/kubevirt/cluster-network-addons-operator/pull/1124. Are we sure we are testing it?
Checking an OCP 4.10 cluster, it looks like "os: linux" is there:

  nodeSelector:
    beta.kubernetes.io/arch: amd64
    kubernetes.io/os: linux

I think we can close this bz.
Verified on OCP 4.10, k-nmstate-handler-4.10.0-41. Same results as Quique:

  nodeSelector:
    beta.kubernetes.io/arch: amd64
    kubernetes.io/os: linux

Nodes hold the kubernetes.io/os: linux label.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0947