Description of problem:
If a cluster admin sets defaultNodeSelector on the cluster, a subsequent CNV upgrade can get stuck, since multiple daemonsets/deployments currently have specific nodeSelectors that can conflict with the defaultNodeSelector selection.

Version-Release number of selected component (if applicable):
Tested this against 4.9.1, but this problem likely exists in other versions as well.

How reproducible:
100%

Steps to Reproduce:
1. Set defaultNodeSelector: node-role.kubernetes.io/worker= on scheduler.spec (one way to do this is sketched at the end of this comment)
2. Start a CNV upgrade
3. csv.status.conditions shows that the upgrade is stuck; multiple pods stay in Pending state

Actual results:
The following pods are in Pending state:
==============================
[cnv-qe-jenkins@iuo-dbn1-491-nc8v7-executor ~]$ kubectl get pods -n openshift-cnv -o wide | grep -v Running
NAME                                                  READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
bridge-marker-5x25c                                   0/1     Pending   0          35m   <none>   <none>   <none>           <none>
kube-cni-linux-bridge-plugin-4bjl2                    0/1     Pending   0          34m   <none>   <none>   <none>           <none>
kube-cni-linux-bridge-plugin-kzd9p                    0/1     Pending   0          35m   <none>   <none>   <none>           <none>
kubemacpool-mac-controller-manager-79756b9774-sgwxk   0/1     Pending   0          34m   <none>   <none>   <none>           <none>
nmstate-cert-manager-778d998dd-7sg8r                  0/1     Pending   0          34m   <none>   <none>   <none>           <none>
nmstate-handler-q2d5c                                 0/1     Pending   0          34m   <none>   <none>   <none>           <none>
nmstate-webhook-64c4d67888-ck56x                      0/1     Pending   0          34m   <none>   <none>   <none>           <none>
nmstate-webhook-64c4d67888-vbpzs                      0/1     Pending   0          34m   <none>   <none>   <none>           <none>
[cnv-qe-jenkins@iuo-dbn1-491-nc8v7-executor ~]$
==============================

hco.status.conditions:
================
{
    "lastTransitionTime": "2022-02-18T00:22:03Z",
    "message": "NetworkAddonsConfig is progressing: DaemonSet \"openshift-cnv/kube-cni-linux-bridge-plugin\" is not available (awaiting 2 nodes)\nDaemonSet \"openshift-cnv/bridge-marker\" is not available (awaiting 1 nodes)\nDaemonSet \"openshift-cnv/nmstate-handler\" is not available (awaiting 1 nodes)\nDeployment \"openshift-cnv/kubemacpool-mac-controller-manager\" is not available (awaiting 1 nodes)\nDeployment \"openshift-cnv/nmstate-webhook\" is not available (awaiting 2 nodes)\nDeployment \"openshift-cnv/nmstate-cert-manager\" is not available (awaiting 1 nodes)",
    "observedGeneration": 2,
    "reason": "NetworkAddonsConfigProgressing",
    "status": "False",
    "type": "Upgradeable"
}
========================

Expected results:
The CNV upgrade completes successfully.

Additional info:
Workaround: after annotating the openshift-cnv project with an empty node selector and deleting the pending pods, the upgrade completes successfully.
======
kubectl annotate namespace openshift-cnv openshift.io/node-selector=
namespace/openshift-cnv annotated

[cnv-qe-jenkins@iuo-dbn1-491-nc8v7-executor ~]$ kubectl get hco kubevirt-hyperconverged -n openshift-cnv -o json | jq ".status.versions"
[
  {
    "name": "operator",
    "version": "v4.9.2"
  }
]
======

Logging this based on Dan's suggestion post customer case.
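For reference, one possible way to apply step 1 of the reproducer (not part of the original report, shown only as a sketch) is a merge patch on the cluster Scheduler resource:
================
kubectl patch scheduler cluster --type merge -p '{"spec":{"defaultNodeSelector":"node-role.kubernetes.io/worker="}}'
================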
On a cluster where scheduler.spec.defaultNodeSelector is set to node-role.kubernetes.io/worker= and an upgrade is initiated, after a while the InstallPlan (ip) was created as expected:
================
[cnv-qe-jenkins@c01-dbn-prod2-n8kx2-executor ~]$ kubectl get ip -A
NAMESPACE                 NAME            CSV                                          APPROVAL    APPROVED
openshift-cnv             install-cwzs4   kubevirt-hyperconverged-operator.v4.10.1     Manual      true
openshift-cnv             install-jvntf   kubevirt-hyperconverged-operator.v4.10.0     Manual      true
openshift-local-storage   install-jmgpp   local-storage-operator.4.10.0-202204090935   Automatic   true
openshift-storage         install-dkm2q   mcg-operator.v4.10.1                         Automatic   true
[cnv-qe-jenkins@c01-dbn-prod2-n8kx2-executor ~]$
================

On approving it, the associated CSV was seen stuck in the Pending phase:
================
[cnv-qe-jenkins@c01-dbn-prod2-n8kx2-executor ~]$ kubectl get csv -n openshift-cnv kubevirt-hyperconverged-operator.v4.10.1 -o json | jq ".status.conditions"
[
  {
    "lastTransitionTime": "2022-04-27T21:35:10Z",
    "lastUpdateTime": "2022-04-27T21:35:10Z",
    "message": "requirements not yet checked",
    "phase": "Pending",
    "reason": "RequirementsUnknown"
  },
  {
    "lastTransitionTime": "2022-04-27T21:35:10Z",
    "lastUpdateTime": "2022-04-27T21:35:10Z",
    "message": "one or more requirements couldn't be found",
    "phase": "Pending",
    "reason": "RequirementsNotMet"
  }
]
=================

The cluster never upgraded.
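For completeness, a Manual InstallPlan such as install-cwzs4 above can be approved from the command line with a merge patch; this is only a sketch of one way to do it (approving through the OpenShift console achieves the same):
================
kubectl patch installplan install-cwzs4 -n openshift-cnv --type merge -p '{"spec":{"approved":true}}'
================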
Debarati, basically you reproduced the bug that prevented you from testing the new code. The fix was introduced only in the HCO 4.10.1 code, and the effect of the bug is precisely to block the upgrade to the next version. So, because the bug is definitely present in the 4.10.0 code, the HCO operator cannot reach 4.10.1 and therefore never consumes the fresh code that contains the fix. The only option is to manually apply the workaround (annotating the openshift-cnv project with an empty node selector and deleting the pending pods, see the sketch below) to let the upgrade complete. We can only fix this for future versions; on past versions the bug will still happen.
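A sketch of the workaround commands, assuming the stuck pods can be selected with status.phase=Pending (any equivalent way of deleting the pending pods also works):
================
# Override the cluster-wide defaultNodeSelector with an empty selector for the openshift-cnv namespace
kubectl annotate namespace openshift-cnv openshift.io/node-selector=

# Delete the pods stuck in Pending so they are recreated and can be scheduled
kubectl delete pods -n openshift-cnv --field-selector=status.phase=Pending
================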
Validated by upgrading from 4.10.1 -> 4.11.0.
=====================
[cnv-qe-jenkins@c01-dbn-prod1-zzqtf-executor ~]$ kubectl get csv -A
I0502 20:10:22.417428  119255 request.go:665] Waited for 1.116228093s due to client-side throttling, not priority and fairness, request: GET:https://api.c01-dbn-prod1.cnv-qe.rhcloud.com:6443/apis/ssp.kubevirt.io/v1beta1?timeout=32s
NAMESPACE                              NAME                                         DISPLAY                       VERSION               REPLACES                                   PHASE
openshift-cnv                          kubevirt-hyperconverged-operator.v4.11.0     OpenShift Virtualization      4.11.0                kubevirt-hyperconverged-operator.v4.10.1   Succeeded
openshift-local-storage                local-storage-operator.4.10.0-202204090935   Local Storage                 4.10.0-202204090935                                              Succeeded
openshift-operator-lifecycle-manager   packageserver                                Package Server                0.19.0                                                           Succeeded
openshift-storage                      mcg-operator.v4.10.1                         NooBaa Operator               4.10.1                mcg-operator.v4.10.0                       Succeeded
openshift-storage                      ocs-operator.v4.10.1                         OpenShift Container Storage   4.10.1                ocs-operator.v4.10.0                       Succeeded
openshift-storage                      odf-operator.v4.10.1                         OpenShift Data Foundation     4.10.1                odf-operator.v4.10.0                       Succeeded

[cnv-qe-jenkins@c01-dbn-prod1-zzqtf-executor ~]$ kubectl get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  creationTimestamp: "2022-04-28T11:16:01Z"
  generation: 2
  name: cluster
  resourceVersion: "6003867"
  uid: 7bd899d3-208f-4cd8-871d-270bbcd10bc2
spec:
  defaultNodeSelector: node-role.kubernetes.io/worker=
  mastersSchedulable: false
  policy:
    name: ""
status: {}

[cnv-qe-jenkins@c01-dbn-prod1-zzqtf-executor ~]$ kubectl get namespace openshift-cnv -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/node-selector: ""
    openshift.io/sa.scc.mcs: s0:c26,c15
    openshift.io/sa.scc.supplemental-groups: 1000680000/10000
    openshift.io/sa.scc.uid-range: 1000680000/10000
  creationTimestamp: "2022-04-28T12:22:52Z"
  labels:
    kubernetes.io/metadata.name: openshift-cnv
    name: openshift-cnv
    olm.operatorgroup.uid/627c6cf8-cbe6-425e-aec6-3924e02a634c: ""
    olm.operatorgroup.uid/712fdaa1-2db1-4543-bf73-3b9852b481ae: ""
    openshift.io/cluster-monitoring: "true"
  name: openshift-cnv
  resourceVersion: "5981609"
  uid: a67d2f2a-6115-47e6-bd66-5f71c4e02ab0
spec:
  finalizers:
  - kubernetes
status:
  phase: Active

[cnv-qe-jenkins@c01-dbn-prod1-zzqtf-executor ~]$ kubectl get pods -n openshift-cnv | grep -v Running
NAME   READY   STATUS   RESTARTS   AGE
[cnv-qe-jenkins@c01-dbn-prod1-zzqtf-executor ~]$
=====================
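As a quick spot check (a sketch only; the full namespace YAML above already contains the same information), the namespace annotations, including the empty openshift.io/node-selector visible after the upgrade, can be read directly:
================
kubectl get namespace openshift-cnv -o jsonpath='{.metadata.annotations}'
================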
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.1 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4668