Description of problem:
Applying a NodeMaintenance manifest migrates a VM twice.

[kbidarka@localhost node-drain]$ cat maintenance.yaml
apiVersion: nodemaintenance.kubevirt.io/v1beta1
kind: NodeMaintenance
metadata:
  name: nodemaintenance-test
spec:
  nodeName: kbid25vrm-hnmbm-worker-0-gshsn
  reason: "Test node maintenance"

(cnv-tests) [kbidarka@localhost node-drain]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
virt-launcher-vm-rhel82-nfs-7xn9c   1/1     Running   0          2m39s

(cnv-tests) [kbidarka@localhost node-drain]$ oc get vmim
No resources found.

(cnv-tests) [kbidarka@localhost node-drain]$ oc get nodes
NAME                             STATUS   ROLES    AGE   VERSION
kbid25vrm-hnmbm-master-0         Ready    master   8d    v1.19.0+db1fc96
kbid25vrm-hnmbm-master-1         Ready    master   8d    v1.19.0+db1fc96
kbid25vrm-hnmbm-master-2         Ready    master   8d    v1.19.0+db1fc96
kbid25vrm-hnmbm-worker-0-c298l   Ready    worker   8d    v1.19.0+db1fc96
kbid25vrm-hnmbm-worker-0-gshsn   Ready    worker   8d    v1.19.0+db1fc96
kbid25vrm-hnmbm-worker-0-t8644   Ready    worker   8d    v1.19.0+db1fc96

(cnv-tests) [kbidarka@localhost node-drain]$ oc get vmi
NAME            AGE     PHASE     IP            NODENAME
vm-rhel82-nfs   3m19s   Running   10.128.3.91   kbid25vrm-hnmbm-worker-0-gshsn

(cnv-tests) [kbidarka@localhost node-drain]$ oc apply -f maintenance.yaml
nodemaintenance.nodemaintenance.kubevirt.io/nodemaintenance-test created

(cnv-tests) [kbidarka@localhost node-drain]$ oc get pods
NAME                                READY   STATUS     RESTARTS   AGE
virt-launcher-vm-rhel82-nfs-7xn9c   1/1     Running    0          4m50s
virt-launcher-vm-rhel82-nfs-g5cxs   0/1     Init:0/1   0          3s

(cnv-tests) [kbidarka@localhost node-drain]$ oc get vmi
NAME            AGE     PHASE     IP            NODENAME
vm-rhel82-nfs   4m54s   Running   10.128.3.91   kbid25vrm-hnmbm-worker-0-gshsn

(cnv-tests) [kbidarka@localhost node-drain]$ oc get vmim
NAME                        AGE
kubevirt-evacuation-c9m27   15s

(cnv-tests) [kbidarka@localhost node-drain]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
virt-launcher-vm-rhel82-nfs-7xn9c   1/1     Running   0          5m7s
virt-launcher-vm-rhel82-nfs-g5cxs   1/1     Running   0          20s

(cnv-tests) [kbidarka@localhost node-drain]$ oc get vmi
NAME            AGE     PHASE     IP            NODENAME
vm-rhel82-nfs   5m11s   Running   10.129.2.70   kbid25vrm-hnmbm-worker-0-c298l

(cnv-tests) [kbidarka@localhost node-drain]$ oc get vmim
NAME                        AGE
kubevirt-evacuation-c9m27   27s
kubevirt-evacuation-zgn5s   3s

(cnv-tests) [kbidarka@localhost node-drain]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
virt-launcher-vm-rhel82-nfs-g5cxs   1/1     Running   0          30s
virt-launcher-vm-rhel82-nfs-jvt5b   1/1     Running   0          6s

(cnv-tests) [kbidarka@localhost node-drain]$ oc get vmi
NAME            AGE     PHASE     IP            NODENAME
vm-rhel82-nfs   5m22s   Running   10.129.2.70   kbid25vrm-hnmbm-worker-0-c298l

(cnv-tests) [kbidarka@localhost node-drain]$ oc get vmim
NAME                        AGE
kubevirt-evacuation-c9m27   40s
kubevirt-evacuation-zgn5s   16s

(cnv-tests) [kbidarka@localhost node-drain]$ oc get pods
NAME                                READY   STATUS      RESTARTS   AGE
virt-launcher-vm-rhel82-nfs-g5cxs   0/1     Completed   0          43s
virt-launcher-vm-rhel82-nfs-jvt5b   1/1     Running     0          19s

(cnv-tests) [kbidarka@localhost node-drain]$ oc get vmi
NAME            AGE     PHASE     IP            NODENAME
vm-rhel82-nfs   5m36s   Running   10.131.0.94   kbid25vrm-hnmbm-worker-0-t8644

(cnv-tests) [kbidarka@localhost node-drain]$ oc get vmim
NAME                        AGE
kubevirt-evacuation-c9m27   53s
kubevirt-evacuation-zgn5s   29s

Version-Release number of selected component (if applicable):
CNV-2.5

How reproducible:

[kbidarka@localhost node-drain]$ cat maintenance.yaml
apiVersion: nodemaintenance.kubevirt.io/v1beta1
kind: NodeMaintenance
metadata:
  name: nodemaintenance-test
spec:
  nodeName: kbid25vrm-hnmbm-worker-0-gshsn
  reason: "Test node maintenance"

[kbidarka@localhost node-drain]$ oc apply -f maintenance.yaml
nodemaintenance.nodemaintenance.kubevirt.io/nodemaintenance-test created

Steps to Reproduce:
1. Create a VM that is migratable, i.e. one whose storage uses the RWX accessMode.
2. Migrate the VM by applying a NodeMaintenance manifest for its node.

Actual results:
Applying a NodeMaintenance manifest migrates the VM twice.

Expected results:
Applying a NodeMaintenance manifest should not migrate the VM twice.

Additional info:
In the end, the VMI is migrated twice; that is, the VMI moves from the source node to a second node, and then on to a third node.
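For reference, a minimal sketch of what such a migratable VM can look like. This is illustrative only (the names, sizes, and PVC are assumptions, not the exact manifest used above); the two relevant pieces are evictionStrategy: LiveMigrate, which tells KubeVirt to live-migrate the VMI on eviction instead of shutting it down, and the ReadWriteMany access mode on the backing PVC, which is what makes the VM migratable:

apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  name: vm-rhel82-nfs
spec:
  running: true
  template:
    spec:
      evictionStrategy: LiveMigrate        # live-migrate on eviction instead of shutting down
      domain:
        devices:
          disks:
          - name: rootdisk
            disk:
              bus: virtio
        resources:
          requests:
            memory: 2Gi
      volumes:
      - name: rootdisk
        persistentVolumeClaim:
          claimName: rhel82-nfs-pvc        # hypothetical PVC name
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rhel82-nfs-pvc
spec:
  accessModes:
  - ReadWriteMany                          # RWX, so source and target nodes can attach the disk concurrently
  resources:
    requests:
      storage: 20Gi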
Marc, is this something you could take on?
Hi, this is strange; we have never seen anything like this so far. I can at least try to help here, but I guess we need much more info.

- Is this reproducible?
- Can you please provide:
  - the complete node maintenance CR(s) (are there multiple, by any chance?)
  - the NMO logs
  - everything from KubeVirt which could help figure out why the 2nd eviction happens, probably:
    - virt-controller logs
    - virt-launcher logs of the 2nd VM
    - the full migration CRs (not sure what's being tracked there, but maybe something useful)

Thanks
Created attachment 1726213 [details]
migration logs via oc adm drain

We see 2 VM pods and 2 vmim objects even for a migration triggered via "oc adm drain".
> even for migration triggered via "oc adm drain"

So it's not an NMO issue. I'd suggest asking someone familiar with the migration/eviction code to have a look at the logs.
KubeVirt can migrate VMs away from a node by watching for a specific, configurable taint. As far as I know, HCO explicitly configures the not-schedulable taint as the migration trigger. HCO should just stop setting that taint now, since we now have the eviction webhook, which works much more precisely.
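To illustrate the mechanism (a sketch from memory; the exact ConfigMap key and field names vary across KubeVirt versions, so treat this as an assumption rather than the exact HCO-rendered config): the drain taint KubeVirt watches is configurable, and pointing it at node.kubernetes.io/unschedulable means that merely cordoning a node, which every drain does, starts a taint-based evacuation in addition to the one started by the eviction webhook, which would explain the two vmim objects:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kubevirt-config
  namespace: openshift-cnv
data:
  migrations: |-
    nodeDrainTaintKey: node.kubernetes.io/unschedulable   # problematic trigger; KubeVirt's own default is kubevirt.io/drain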
Can you confirm this behavior still manifests on the latest builds available to QE?

The taint mentioned in Comment #6 was addressed here (merged Oct 28th):
https://github.com/kubevirt/hyperconverged-cluster-operator/pull/904

But it's not clear which HCO version was used to create the logs in Comment #3 (Nov 3rd).
Created attachment 1728755 [details]
Upstream fix
While working with "oc adm drain":
-----------------------------------

Red Hat Enterprise Linux 8.2 (Ootpa)
Kernel 4.18.0-193.19.1.el8_2.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm-rhel82 login: cloud-user
Password:
[cloud-user@vm-rhel82 ~]$

[kbidarka@localhost windows10]$ oc get pods
NAME                            READY   STATUS    RESTARTS   AGE
virt-launcher-vm-rhel82-gspc4   1/1     Running   0          116s

[kbidarka@localhost windows10]$ oc get vmi
NAME        AGE   PHASE     IP            NODENAME
vm-rhel82   2m    Running   10.131.0.59   kbid25vz-mbng4-worker-0-prkk5

[kbidarka@localhost windows10]$ oc get vmim
No resources found in default namespace.

[kbidarka@localhost windows10]$ oc adm drain kbid25vz-mbng4-worker-0-prkk5 --delete-local-data --ignore-daemonsets=true --force
node/kbid25vz-mbng4-worker-0-prkk5 cordoned
WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-pl777, openshift-cnv/bridge-marker-ccj9c, openshift-cnv/hostpath-provisioner-m5jtt, openshift-cnv/kube-cni-linux-bridge-plugin-phvn8, openshift-cnv/kubevirt-node-labeller-w9cxf, openshift-cnv/nmstate-handler-sshc2, openshift-cnv/ovs-cni-amd64-4w74c, openshift-cnv/virt-handler-qmg7n, openshift-dns/dns-default-gkttv, openshift-image-registry/node-ca-x4b88, openshift-local-storage/local-block-local-diskmaker-7sdqj, openshift-local-storage/local-block-local-provisioner-x9v7l, openshift-machine-config-operator/machine-config-daemon-d6sh7, openshift-manila-csi-driver/csi-nodeplugin-nfsplugin-24tlx, openshift-manila-csi-driver/openstack-manila-csi-nodeplugin-whk9l, openshift-monitoring/node-exporter-76k62, openshift-multus/multus-mr5j9, openshift-multus/network-metrics-daemon-rdf9l, openshift-sdn/ovs-5swxp, openshift-sdn/sdn-clh8s, openshift-storage/csi-cephfsplugin-zw6dv, openshift-storage/csi-rbdplugin-8fptl; deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: openshift-marketplace/ocs-catalogsource-gcn2s
evicting pod openshift-storage/noobaa-operator-7f47757d65-4dvcr
evicting pod openshift-storage/rook-ceph-mgr-a-849dbb44b-kp5xh
evicting pod openshift-storage/rook-ceph-mon-a-78655795b8-nl9jw
evicting pod openshift-storage/rook-ceph-osd-2-5f7777d864-gsf6t
evicting pod openshift-local-storage/local-storage-operator-7b588b45db-cm2cw
evicting pod openshift-storage/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-5575665lqjwg
evicting pod openshift-storage/ocs-metrics-exporter-7589bff9-sx5lj
evicting pod openshift-storage/rook-ceph-osd-prepare-ocs-deviceset-0-data-0-mxlc2-gmbtl
evicting pod openshift-storage/noobaa-db-0
evicting pod openshift-storage/rook-ceph-drain-canary-kbid25vz-mbng4-worker-0-prkk5-57999p4nlp
evicting pod openshift-storage/noobaa-core-0
evicting pod default/virt-launcher-vm-rhel82-gspc4
evicting pod openshift-marketplace/af127f59fc5834664c72391f64f2d521092ee545d832d33c10000ffefds276x
evicting pod openshift-storage/rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-ddfb75658t8k6
evicting pod openshift-marketplace/55aca83220d7b2cf5747b8a71c6ef1fa14edee8acdf16b3fffa774b02dgcj5f
evicting pod recycle-pvs/recycle-pvs-7785b654c8-jxfpv
evicting pod openshift-marketplace/ocs-catalogsource-gcn2s
evicting pod openshift-storage/rook-ceph-operator-54d9586bf8-lrsxj
evicting pod openshift-storage/csi-cephfsplugin-provisioner-ccc98c6d9-vc4fh
evicting pod openshift-storage/csi-rbdplugin-provisioner-774c48d46-ffx6r
evicting pod openshift-storage/rook-ceph-crashcollector-kbid25vz-mbng4-worker-0-prkk5-dcbqtghj
evicting pod openshift-storage/noobaa-endpoint-6b44f96598-5qpdk
evicting pod openshift-storage/ocs-operator-69f76f978b-6pxhs
error when evicting pod "rook-ceph-osd-2-5f7777d864-gsf6t" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
pod/rook-ceph-osd-prepare-ocs-deviceset-0-data-0-mxlc2-gmbtl evicted
pod/55aca83220d7b2cf5747b8a71c6ef1fa14edee8acdf16b3fffa774b02dgcj5f evicted
I1117 23:27:03.662129   31761 request.go:645] Throttling request took 1.000225705s, request: POST:https://api.kbid25vz.cnv-qe.rhcloud.com:6443/api/v1/namespaces/openshift-storage/pods/rook-ceph-drain-canary-kbid25vz-mbng4-worker-0-prkk5-57999p4nlp/eviction
error when evicting pod "virt-launcher-vm-rhel82-gspc4" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
pod/af127f59fc5834664c72391f64f2d521092ee545d832d33c10000ffefds276x evicted
pod/ocs-catalogsource-gcn2s evicted
pod/noobaa-operator-7f47757d65-4dvcr evicted
evicting pod openshift-storage/rook-ceph-osd-2-5f7777d864-gsf6t
error when evicting pod "rook-ceph-osd-2-5f7777d864-gsf6t" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/virt-launcher-vm-rhel82-gspc4
error when evicting pod "virt-launcher-vm-rhel82-gspc4" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
pod/rook-ceph-operator-54d9586bf8-lrsxj evicted
pod/rook-ceph-mgr-a-849dbb44b-kp5xh evicted
pod/rook-ceph-mon-a-78655795b8-nl9jw evicted
pod/rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-5575665lqjwg evicted
pod/csi-cephfsplugin-provisioner-ccc98c6d9-vc4fh evicted
pod/ocs-metrics-exporter-7589bff9-sx5lj evicted
pod/noobaa-db-0 evicted
I1117 23:27:13.807073   31761 request.go:645] Throttling request took 2.304445657s, request: GET:https://api.kbid25vz.cnv-qe.rhcloud.com:6443/api/v1/namespaces/openshift-storage/pods/rook-ceph-drain-canary-kbid25vz-mbng4-worker-0-prkk5-57999p4nlp
evicting pod openshift-storage/rook-ceph-osd-2-5f7777d864-gsf6t
pod/rook-ceph-drain-canary-kbid25vz-mbng4-worker-0-prkk5-57999p4nlp evicted
error when evicting pod "rook-ceph-osd-2-5f7777d864-gsf6t" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
pod/rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-ddfb75658t8k6 evicted
pod/ocs-operator-69f76f978b-6pxhs evicted
evicting pod default/virt-launcher-vm-rhel82-gspc4
pod/csi-rbdplugin-provisioner-774c48d46-ffx6r evicted
pod/local-storage-operator-7b588b45db-cm2cw evicted
error when evicting pod "virt-launcher-vm-rhel82-gspc4" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod openshift-storage/rook-ceph-osd-2-5f7777d864-gsf6t
error when evicting pod "rook-ceph-osd-2-5f7777d864-gsf6t" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/virt-launcher-vm-rhel82-gspc4
error when evicting pod "virt-launcher-vm-rhel82-gspc4" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod openshift-storage/rook-ceph-osd-2-5f7777d864-gsf6t
error when evicting pod "rook-ceph-osd-2-5f7777d864-gsf6t" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/virt-launcher-vm-rhel82-gspc4
pod/virt-launcher-vm-rhel82-gspc4 evicted
evicting pod openshift-storage/rook-ceph-osd-2-5f7777d864-gsf6t
error when evicting pod "rook-ceph-osd-2-5f7777d864-gsf6t" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod openshift-storage/rook-ceph-osd-2-5f7777d864-gsf6t
pod/noobaa-endpoint-6b44f96598-5qpdk evicted
pod/rook-ceph-crashcollector-kbid25vz-mbng4-worker-0-prkk5-dcbqtghj evicted
pod/recycle-pvs-7785b654c8-jxfpv evicted
pod/noobaa-core-0 evicted
pod/rook-ceph-osd-2-5f7777d864-gsf6t evicted
node/kbid25vz-mbng4-worker-0-prkk5 evicted

After the node drain too, we now see only 1 pod and only 1 vmim object:

[kbidarka@localhost windows10]$ oc get pods
NAME                            READY   STATUS    RESTARTS   AGE
virt-launcher-vm-rhel82-bw8mn   1/1     Running   0          54s

[kbidarka@localhost windows10]$ oc get vmi
NAME        AGE    PHASE     IP            NODENAME
vm-rhel82   4m6s   Running   10.128.2.40   kbid25vz-mbng4-worker-0-bnlvs

[kbidarka@localhost windows10]$ oc get vmim
NAME                        AGE
kubevirt-evacuation-phlxt   66s

As seen below, the migration was also successful, as confirmed by the guest's uptime:

[kbidarka@localhost windows10]$ virtctl console vm-rhel82
Successfully connected to vm-rhel82 console. The escape sequence is ^]

[cloud-user@vm-rhel82 ~]$ uptime
 12:59:28 up 5 min,  1 user,  load average: 0.10, 0.50, 0.28
[cloud-user@vm-rhel82 ~]$
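As an aside, the repeated "Cannot evict pod as it would violate the pod's disruption budget" errors for the virt-launcher pod in the drain output above are expected: KubeVirt maintains a PodDisruptionBudget for each live-migratable VMI, so the plain API eviction keeps being refused until the migration has moved the workload and the source pod can go away. A rough, illustrative sketch of what such a generated PDB looks like (the name and label value are placeholders, not taken from this cluster):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: kubevirt-disruption-budget-abcde   # generated name; placeholder here
  namespace: default
spec:
  minAvailable: 1                          # blocks outright eviction of the virt-launcher pod
  selector:
    matchLabels:
      kubevirt.io/created-by: <vmi-uid>    # matches the virt-launcher pod(s) of this VMI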
While working with "Node Maintenance":
---------------------------------------

[kbidarka@localhost final]$ oc get vmi
NAME        AGE   PHASE     IP            NODENAME
vm-rhel82   32m   Running   10.128.2.40   kbid25vz-mbng4-worker-0-bnlvs

[kbidarka@localhost final]$ oc get vmim
NAME                        AGE
kubevirt-evacuation-phlxt   29m

[kbidarka@localhost final]$ oc get pods
NAME                            READY   STATUS    RESTARTS   AGE
virt-launcher-vm-rhel82-bw8mn   1/1     Running   0          29m

[kbidarka@localhost final]$ cat maintenance.yaml
apiVersion: nodemaintenance.kubevirt.io/v1beta1
kind: NodeMaintenance
metadata:
  name: nodemaintenance-test
spec:
  nodeName: kbid25vz-mbng4-worker-0-bnlvs
  reason: "Test node maintenance"

[kbidarka@localhost final]$ oc apply -f maintenance.yaml
nodemaintenance.nodemaintenance.kubevirt.io/nodemaintenance-test created

[kbidarka@localhost final]$ oc get pods
NAME                            READY   STATUS            RESTARTS   AGE
virt-launcher-vm-rhel82-9sz7g   0/1     PodInitializing   0          8s
virt-launcher-vm-rhel82-bw8mn   1/1     Running           0          30m

[kbidarka@localhost final]$ oc get vmim
NAME                        AGE
kubevirt-evacuation-phlxt   30m
kubevirt-evacuation-wgwlw   16s

[kbidarka@localhost final]$ oc get vmi
NAME        AGE   PHASE     IP            NODENAME
vm-rhel82   33m   Running   10.128.2.40   kbid25vz-mbng4-worker-0-bnlvs

[kbidarka@localhost final]$ oc get pods
NAME                            READY   STATUS    RESTARTS   AGE
virt-launcher-vm-rhel82-9sz7g   1/1     Running   0          26s
virt-launcher-vm-rhel82-bw8mn   1/1     Running   0          30m

[kbidarka@localhost final]$ oc get vmi
NAME        AGE   PHASE     IP            NODENAME
vm-rhel82   33m   Running   10.128.2.40   kbid25vz-mbng4-worker-0-bnlvs

[kbidarka@localhost final]$ oc get vmim
NAME                        AGE
kubevirt-evacuation-phlxt   30m
kubevirt-evacuation-wgwlw   36s

[kbidarka@localhost final]$ oc get pods
NAME                            READY   STATUS      RESTARTS   AGE
virt-launcher-vm-rhel82-9sz7g   1/1     Running     0          39s
virt-launcher-vm-rhel82-bw8mn   0/1     Completed   0          30m

[kbidarka@localhost final]$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.5.1   OpenShift Virtualization   2.5.1     kubevirt-hyperconverged-operator.v2.5.0   Succeeded

[kbidarka@localhost final]$ oc get pods
NAME                            READY   STATUS    RESTARTS   AGE
virt-launcher-vm-rhel82-9sz7g   1/1     Running   0          8m43s

[kbidarka@localhost final]$ oc get vmim
NAME                        AGE
kubevirt-evacuation-phlxt   39m
kubevirt-evacuation-wgwlw   8m52s

[kbidarka@localhost final]$ oc get vmi
NAME        AGE   PHASE     IP            NODENAME
vm-rhel82   42m   Running   10.131.0.70   kbid25vz-mbng4-worker-0-prkk5

Note that kubevirt-evacuation-phlxt is left over from the earlier drain test; the NodeMaintenance created only the single new migration kubevirt-evacuation-wgwlw, so the VMI was migrated exactly once.
Verified with container-native-virtualization/virt-operator/images/v2.5.1-1