Description of problem:
If a workload placement policy is set for KubeVirt, it is possible for changes to the node resource to produce a rogue VM.

Version-Release number of selected component (if applicable):
All

How reproducible:
Every time

Steps to Reproduce:
1. Label the workload nodes.
2. Configure the HCO or KubeVirt CR `Spec.Workloads` to run workloads only on the workload nodes (a minimal sketch follows below).
3. Start VMs.
4. Remove the workload label from one of the nodes running VMs.

Actual results:
virt-handler is removed from the node, leaving an unmanaged VM.

Expected results:
We should not allow the node label change if it would remove the virt-handler pod from a node with running VM(s).

Additional info:
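A minimal sketch of steps 1-2, assuming the workload-comp label key used in the verification later in this bug and the default HyperConverged CR name and namespace shipped with OpenShift Virtualization:

  oc label node <workload-node> workload-comp=gpu-workload

  apiVersion: hco.kubevirt.io/v1beta1
  kind: HyperConverged
  metadata:
    name: kubevirt-hyperconverged
    namespace: openshift-cnv
  spec:
    workloads:
      nodePlacement:
        nodeSelector:
          workload-comp: gpu-workload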
add event to alert human operator: https://github.com/kubevirt/kubevirt/pull/4952
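For reference, one way to look for such an event from the CLI (a sketch under the assumption that the event is recorded against the affected VirtualMachineInstance; the exact event reason is defined in the PR above and not repeated here):

  oc get events -n <vm-namespace> --field-selector involvedObject.kind=VirtualMachineInstance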
*** Bug 1931803 has been marked as a duplicate of this bug. ***
To reproduce, follow the steps in the description:
1) Start some VMs.
2) Add a label to the workload nodes.
3) Set hco.spec.workloads.nodePlacement.nodeSelector to match the label from step 2.
4) Remove the label from one node (with VMs on it).
5) Observe that virt-handler is still running on that node after the attempted label change (see the sketch after this list).
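A minimal sketch of steps 4-5, assuming the virt-handler pods carry the usual kubevirt.io=virt-handler label (the node and label names are the ones used in the verification below):

  oc label node node07.redhat.com workload-comp-
  oc get pods -n openshift-cnv -l kubevirt.io=virt-handler -o wide
  oc get vmi -o wide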
Note that in addition to the alert, it will in the future also be possible to change the spec.workloads selectors without shutting down VMs (https://github.com/kubevirt/kubevirt/pull/5221), to actually resolve the issues reported by the alert.
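Until that change lands, a hedged sketch of how the condition behind the alert could be cleared today, assuming either option is acceptable in the environment (node/VM names are the ones from the verification below):

  # re-add the label so the virt-handler DaemonSet is scheduled back onto the node
  oc label node node07.redhat.com workload-comp=gpu-workload

  # or, move the affected VMI off the node (requires the VM to be live-migratable)
  virtctl migrate vm2-rhel84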
Summary: virt-handler is NO LONGER running on the node after attempted label change.

[kbidarka@localhost secureboot]$ oc get pods -n openshift-cnv | grep virt-handler
virt-handler-5cw4k   1/1   Running   0   4d23h
virt-handler-fjb2t   1/1   Running   0   4d23h
virt-handler-kcnw9   1/1   Running   0   4d23h

[kbidarka@localhost secureboot]$ oc get nodes
NAME                STATUS   ROLES    AGE   VERSION
node02.redhat.com   Ready    master   5d    v1.21.0-rc.0+c656d63
node03.redhat.com   Ready    master   5d    v1.21.0-rc.0+c656d63
node04.redhat.com   Ready    master   5d    v1.21.0-rc.0+c656d63
node05.redhat.com   Ready    worker   5d    v1.21.0-rc.0+c656d63
node06.redhat.com   Ready    worker   5d    v1.21.0-rc.0+c656d63
node07.redhat.com   Ready    worker   5d    v1.21.0-rc.0+c656d63

oc label node node05.redhat.com workload-comp=gpu-workload
oc label node node06.redhat.com workload-comp=gpu-workload
oc label node node07.redhat.com workload-comp=gpu-workload

---

[kbidarka@localhost secureboot]$ oc get node node06.redhat.com -o yaml | grep workload
    workload-comp: gpu-workload
[kbidarka@localhost secureboot]$ oc get node node05.redhat.com -o yaml | grep workload
    workload-comp: gpu-workload
[kbidarka@localhost secureboot]$ oc get node node07.redhat.com -o yaml | grep workload
    workload-comp: gpu-workload

[kbidarka@localhost secureboot]$ oc get vmi
NAME                AGE   PHASE     IP             NODENAME
vm-rhel84           24s   Running   70.xxx.2.236   node06.redhat.com
vm2-rhel84          23s   Running   70.xxx.2.227   node07.redhat.com
vm2-rhel84-secref   22s   Running   70.xxx.2.169   node05.redhat.com

---

[kbidarka@localhost secureboot]$ oc get pods -n openshift-cnv | grep virt-handler
virt-handler-c4sv4   1/1   Running   0   5m26s
virt-handler-qsfpx   1/1   Running   0   4m34s
virt-handler-wct2d   1/1   Running   0   67s

[kbidarka@localhost secureboot]$ oc label node node07.redhat.com workload-comp-
node/node07.redhat.com labeled
[kbidarka@localhost secureboot]$ oc get node node07.redhat.com -o yaml | grep workload

[kbidarka@localhost secureboot]$ oc get vmi
NAME                AGE    PHASE     IP             NODENAME
vm-rhel84           4m2s   Running   70.xxx.2.236   node06.redhat.com
vm2-rhel84          4m1s   Running   70.xxx.2.227   node07.redhat.com
vm2-rhel84-secref   4m     Running   70.xxx.2.169   node05.redhat.com

[kbidarka@localhost secureboot]$ oc get pods -n openshift-cnv | grep virt-handler
virt-handler-c4sv4   1/1   Running   0   21m
virt-handler-qsfpx   1/1   Running   0   21m

@Ashley: As seen above, virt-handler is "no longer" running on the node after removal of the label.
1) Can you please confirm what the expected behaviour is here?
2) Looking at the PRs linked in this bug, should we only expect to see an alert in the UI notifying the admin that there is an 'orphaned vmi'?
3) I am assuming we would still continue to see "no virt-handler running" on the node whose label was removed. Need confirmation.

---
NOTE:
1) Tried looking at the UI for the alerts, but see the message below:
AlertmanagerReceiversNotConfigured: "Alerts are not configured to be sent to a notification system, meaning that you may not be notified in a timely fashion when important failures occur."
2) Will try to configure the "AlertmanagerReceivers".
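For reference, a minimal sketch of point 2 (configuring an Alertmanager receiver from the CLI), assuming the default OpenShift monitoring stack where the configuration lives in the alertmanager-main secret in openshift-monitoring; the receiver details are placeholders:

  oc -n openshift-monitoring get secret alertmanager-main -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d > alertmanager.yaml
  # edit alertmanager.yaml to add a receiver (e.g. a webhook_configs entry pointing at a placeholder URL) and route alerts to it
  oc -n openshift-monitoring create secret generic alertmanager-main --from-file=alertmanager.yaml --dry-run=client -o yaml | oc -n openshift-monitoring replace -f -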
We do see an alert fired in Prometheus after 1 hr.
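A hedged sketch of checking the firing alert from the CLI, assuming the cluster still exposes the prometheus-k8s route in openshift-monitoring and that the logged-in user's token is accepted by its proxy:

  TOKEN=$(oc whoami -t)
  HOST=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')
  # list all currently firing alerts via the Prometheus HTTP API
  curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" --data-urlencode 'query=ALERTS{alertstate="firing"}'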
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2920