Description of problem:
If a cluster admin sets defaultNodeSelector on the cluster, a subsequent CNV upgrade can get stuck, since multiple daemonsets/deployments currently have specific nodeSelectors that can conflict with the defaultNodeSelector selection.

Version-Release number of selected component (if applicable):
Tested this against 4.9.1, but this problem likely exists in other versions as well.

How reproducible:
100%

Steps to Reproduce:
1. Set defaultNodeSelector: node-role.kubernetes.io/worker= on scheduler.spec (one way to do this is sketched at the end of this comment)
2. Start a CNV upgrade
3. csv.status.conditions shows that the upgrade is stuck; multiple pods stay in Pending state

Actual results:
The following pods are in Pending state:
==============================
[cnv-qe-jenkins@iuo-dbn1-491-nc8v7-executor ~]$ kubectl get pods -n openshift-cnv -o wide | grep -v Running
NAME                                                  READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
bridge-marker-5x25c                                   0/1     Pending   0          35m   <none>   <none>   <none>           <none>
kube-cni-linux-bridge-plugin-4bjl2                    0/1     Pending   0          34m   <none>   <none>   <none>           <none>
kube-cni-linux-bridge-plugin-kzd9p                    0/1     Pending   0          35m   <none>   <none>   <none>           <none>
kubemacpool-mac-controller-manager-79756b9774-sgwxk   0/1     Pending   0          34m   <none>   <none>   <none>           <none>
nmstate-cert-manager-778d998dd-7sg8r                  0/1     Pending   0          34m   <none>   <none>   <none>           <none>
nmstate-handler-q2d5c                                 0/1     Pending   0          34m   <none>   <none>   <none>           <none>
nmstate-webhook-64c4d67888-ck56x                      0/1     Pending   0          34m   <none>   <none>   <none>           <none>
nmstate-webhook-64c4d67888-vbpzs                      0/1     Pending   0          34m   <none>   <none>   <none>           <none>
[cnv-qe-jenkins@iuo-dbn1-491-nc8v7-executor ~]$
==============================

hco.status.conditions:
================
{
    "lastTransitionTime": "2022-02-18T00:22:03Z",
    "message": "NetworkAddonsConfig is progressing: DaemonSet \"openshift-cnv/kube-cni-linux-bridge-plugin\" is not available (awaiting 2 nodes)\nDaemonSet \"openshift-cnv/bridge-marker\" is not available (awaiting 1 nodes)\nDaemonSet \"openshift-cnv/nmstate-handler\" is not available (awaiting 1 nodes)\nDeployment \"openshift-cnv/kubemacpool-mac-controller-manager\" is not available (awaiting 1 nodes)\nDeployment \"openshift-cnv/nmstate-webhook\" is not available (awaiting 2 nodes)\nDeployment \"openshift-cnv/nmstate-cert-manager\" is not available (awaiting 1 nodes)",
    "observedGeneration": 2,
    "reason": "NetworkAddonsConfigProgressing",
    "status": "False",
    "type": "Upgradeable"
}
========================

Expected results:
The CNV upgrade completes successfully.

Additional info:
Workaround: after annotating the openshift-cnv project with an empty node selector and deleting the pending pods, the upgrade completes successfully.
======
kubectl annotate namespace openshift-cnv openshift.io/node-selector=
namespace/openshift-cnv annotated

[cnv-qe-jenkins@iuo-dbn1-491-nc8v7-executor ~]$ kubectl get hco kubevirt-hyperconverged -n openshift-cnv -o json | jq ".status.versions"
[
  {
    "name": "operator",
    "version": "v4.9.2"
  }
]
======

Logging this based on Dan's suggestion post customer case.
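For reference, one possible way to apply step 1 of the reproducer (not part of the original report, shown only as a sketch) is a merge patch on the cluster Scheduler resource:
================
kubectl patch scheduler cluster --type merge -p '{"spec":{"defaultNodeSelector":"node-role.kubernetes.io/worker="}}'
================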
On a cluster where scheduler.spec.defaultNodeSelector is set to node-role.kubernetes.io/worker= and an upgrade is initiated, after a while the InstallPlan (ip) was created as expected:
================
[cnv-qe-jenkins@c01-dbn-prod2-n8kx2-executor ~]$ kubectl get ip -A
NAMESPACE                 NAME            CSV                                          APPROVAL    APPROVED
openshift-cnv             install-cwzs4   kubevirt-hyperconverged-operator.v4.10.1     Manual      true
openshift-cnv             install-jvntf   kubevirt-hyperconverged-operator.v4.10.0     Manual      true
openshift-local-storage   install-jmgpp   local-storage-operator.4.10.0-202204090935   Automatic   true
openshift-storage         install-dkm2q   mcg-operator.v4.10.1                         Automatic   true
[cnv-qe-jenkins@c01-dbn-prod2-n8kx2-executor ~]$
================

On approving it, the associated CSV was seen stuck in the Pending phase:
================
[cnv-qe-jenkins@c01-dbn-prod2-n8kx2-executor ~]$ kubectl get csv -n openshift-cnv kubevirt-hyperconverged-operator.v4.10.1 -o json | jq ".status.conditions"
[
  {
    "lastTransitionTime": "2022-04-27T21:35:10Z",
    "lastUpdateTime": "2022-04-27T21:35:10Z",
    "message": "requirements not yet checked",
    "phase": "Pending",
    "reason": "RequirementsUnknown"
  },
  {
    "lastTransitionTime": "2022-04-27T21:35:10Z",
    "lastUpdateTime": "2022-04-27T21:35:10Z",
    "message": "one or more requirements couldn't be found",
    "phase": "Pending",
    "reason": "RequirementsNotMet"
  }
]
=================

The cluster never upgraded.
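For completeness, a Manual InstallPlan such as install-cwzs4 above can be approved from the command line with a merge patch; this is only a sketch of one way to do it (approving through the OpenShift console achieves the same):
================
kubectl patch installplan install-cwzs4 -n openshift-cnv --type merge -p '{"spec":{"approved":true}}'
================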
Debarati, basically you reproduced the bug that prevented you from testing the new code. The fix was introduced only in the HCO 4.10.1 code, and the effect of the bug is precisely to block the upgrade to the next version. So, because the bug is definitely present in the 4.10.0 code, the HCO operator cannot reach 4.10.1 and therefore never consumes the fresh code that contains the fix. The only option is to manually apply the workaround (annotating the openshift-cnv project with an empty node selector and deleting the pending pods, see the sketch below) to let the upgrade complete. We can only fix this for future versions; on past versions the bug will still happen.
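A sketch of the workaround commands, assuming the stuck pods can be selected with status.phase=Pending (any equivalent way of deleting the pending pods also works):
================
# Override the cluster-wide defaultNodeSelector with an empty selector for the openshift-cnv namespace
kubectl annotate namespace openshift-cnv openshift.io/node-selector=

# Delete the pods stuck in Pending so they are recreated and can be scheduled
kubectl delete pods -n openshift-cnv --field-selector=status.phase=Pending
================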
Validated by upgrading from 4.10.1 -> 4.11.0.
=====================
[cnv-qe-jenkins@c01-dbn-prod1-zzqtf-executor ~]$ kubectl get csv -A
I0502 20:10:22.417428  119255 request.go:665] Waited for 1.116228093s due to client-side throttling, not priority and fairness, request: GET:https://api.c01-dbn-prod1.cnv-qe.rhcloud.com:6443/apis/ssp.kubevirt.io/v1beta1?timeout=32s
NAMESPACE                              NAME                                         DISPLAY                       VERSION               REPLACES                                   PHASE
openshift-cnv                          kubevirt-hyperconverged-operator.v4.11.0     OpenShift Virtualization      4.11.0                kubevirt-hyperconverged-operator.v4.10.1   Succeeded
openshift-local-storage                local-storage-operator.4.10.0-202204090935   Local Storage                 4.10.0-202204090935                                              Succeeded
openshift-operator-lifecycle-manager   packageserver                                Package Server                0.19.0                                                           Succeeded
openshift-storage                      mcg-operator.v4.10.1                         NooBaa Operator               4.10.1                mcg-operator.v4.10.0                       Succeeded
openshift-storage                      ocs-operator.v4.10.1                         OpenShift Container Storage   4.10.1                ocs-operator.v4.10.0                       Succeeded
openshift-storage                      odf-operator.v4.10.1                         OpenShift Data Foundation     4.10.1                odf-operator.v4.10.0                       Succeeded

[cnv-qe-jenkins@c01-dbn-prod1-zzqtf-executor ~]$ kubectl get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  creationTimestamp: "2022-04-28T11:16:01Z"
  generation: 2
  name: cluster
  resourceVersion: "6003867"
  uid: 7bd899d3-208f-4cd8-871d-270bbcd10bc2
spec:
  defaultNodeSelector: node-role.kubernetes.io/worker=
  mastersSchedulable: false
  policy:
    name: ""
status: {}

[cnv-qe-jenkins@c01-dbn-prod1-zzqtf-executor ~]$ kubectl get namespace openshift-cnv -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/node-selector: ""
    openshift.io/sa.scc.mcs: s0:c26,c15
    openshift.io/sa.scc.supplemental-groups: 1000680000/10000
    openshift.io/sa.scc.uid-range: 1000680000/10000
  creationTimestamp: "2022-04-28T12:22:52Z"
  labels:
    kubernetes.io/metadata.name: openshift-cnv
    name: openshift-cnv
    olm.operatorgroup.uid/627c6cf8-cbe6-425e-aec6-3924e02a634c: ""
    olm.operatorgroup.uid/712fdaa1-2db1-4543-bf73-3b9852b481ae: ""
    openshift.io/cluster-monitoring: "true"
  name: openshift-cnv
  resourceVersion: "5981609"
  uid: a67d2f2a-6115-47e6-bd66-5f71c4e02ab0
spec:
  finalizers:
  - kubernetes
status:
  phase: Active

[cnv-qe-jenkins@c01-dbn-prod1-zzqtf-executor ~]$ kubectl get pods -n openshift-cnv | grep -v Running
NAME   READY   STATUS   RESTARTS   AGE
[cnv-qe-jenkins@c01-dbn-prod1-zzqtf-executor ~]$
=====================
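As a quick spot check (a sketch only; the full namespace YAML above already contains the same information), the namespace annotations, including the empty openshift.io/node-selector visible after the upgrade, can be read directly:
================
kubectl get namespace openshift-cnv -o jsonpath='{.metadata.annotations}'
================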
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.1 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4668