Description of problem:
The evacuation controller hits an out-of-bounds slice access, which makes the virt-controller panic. The index calculations in the evacuation controller are not protected against negative values, so the computed index can become negative.

Version-Release number of selected component (if applicable):
4.13.0

How reproducible:
100%

Steps to Reproduce:
0. IMPORTANT: make sure the KubeVirt control-plane components are deployed on infra nodes, not on worker nodes. The reproduction involves a drain, and we do not want the controllers to be evicted during the reproduction.
1. In the KubeVirt configuration, set:
   spec.configuration.migrations.parallelMigrationsPerCluster: 200
   spec.configuration.migrations.parallelOutboundMigrationsPerNode: 100
2. Add a custom label to one of the worker nodes, for example "type=worker001".
3. Create 5 migratable VMIs with a nodeSelector of "type=worker001".
4. Drain the worker node with the label "type=worker001".
5. Make sure you see 5 pending VM instance migrations ("oc get vmim").
6. Wait 4-5 minutes and observe the status of the virt-controller pods.

Actual results:
The virt-controller panics with an out-of-bounds slice access.

Expected results:
The virt-controller keeps running and the evacuation migrations are processed without a panic.

Additional info:
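A minimal Go sketch of the failure mode and the kind of guard the description says is missing, assuming a hypothetical candidate-selection helper (the names and logic below are illustrative only, not the actual virt-controller code):

package main

import "fmt"

// selectCandidates is a hypothetical helper that illustrates the pattern:
// slicing with an index computed from remaining migration capacity. Without
// the clamp below, running > maxParallel yields a negative index and Go
// panics with "slice bounds out of range".
func selectCandidates(pending []string, maxParallel, running int) []string {
	free := maxParallel - running // can go negative when running > maxParallel
	if free < 0 {
		free = 0 // guard against a negative slice index
	}
	if free > len(pending) {
		free = len(pending)
	}
	return pending[:free]
}

func main() {
	pending := []string{"vm-fedora1", "vm-fedora2", "vm-fedora3"}
	// Without the clamp this would evaluate pending[:-2] and panic.
	fmt.Println(selectCandidates(pending, 3, 5))
}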
Tested with build: CNV-v4.13.0.rhel9-1884

Steps:
1. Check that the control-plane components are not on worker nodes:
$ oc get nodes
NAME                                 STATUS   ROLES                  AGE   VERSION
c01-zpeng-413-dff6b-master-0         Ready    control-plane,master   43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-master-1         Ready    control-plane,master   43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-master-2         Ready    control-plane,master   43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-worker-0-fdmgv   Ready    worker                 43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-worker-0-j6bj6   Ready    worker                 43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-worker-0-jfjgb   Ready    worker                 43h   v1.26.2+dc93b13

2. Set the migration config in the kubevirt CR:
migrations:
  allowAutoConverge: false
  allowPostCopy: false
  completionTimeoutPerGiB: 800
  parallelMigrationsPerCluster: 200
  parallelOutboundMigrationsPerNode: 100
  progressTimeout: 150

3. Add a label to a worker node:
$ oc label node c01-zpeng-413-dff6b-worker-0-fdmgv type=worker001
node/c01-zpeng-413-dff6b-worker-0-fdmgv labeled

4. Create 5 VMs and add the nodeSelector:
spec:
  nodeSelector:
    type: worker001

$ oc get vmi
NAME         AGE     PHASE     IP             NODENAME                             READY
vm-fedora1   17m     Running   10.131.0.231   c01-zpeng-413-dff6b-worker-0-fdmgv   True
vm-fedora2   15m     Running   10.131.0.232   c01-zpeng-413-dff6b-worker-0-fdmgv   True
vm-fedora3   11m     Running   10.131.0.234   c01-zpeng-413-dff6b-worker-0-fdmgv   True
vm-fedora4   8m13s   Running   10.131.0.235   c01-zpeng-413-dff6b-worker-0-fdmgv   True
vm-fedora5   4m37s   Running   10.131.0.236   c01-zpeng-413-dff6b-worker-0-fdmgv   True

5. Drain the labeled worker node:
$ oc adm cordon c01-zpeng-413-dff6b-worker-0-fdmgv
node/c01-zpeng-413-dff6b-worker-0-fdmgv cordoned
$ oc adm drain c01-zpeng-413-dff6b-worker-0-fdmgv --ignore-daemonsets=true --delete-emptydir-data=true

6. Make sure there are 5 pending migrations:
$ oc get vmim
NAME                        PHASE        VMI
kubevirt-evacuation-2hvw6   Scheduling   vm-fedora1
kubevirt-evacuation-5tfgc   Scheduling   vm-fedora2
kubevirt-evacuation-6zkst   Scheduling   vm-fedora5
kubevirt-evacuation-gzwbx   Scheduling   vm-fedora4
kubevirt-evacuation-h2tlv   Scheduling   vm-fedora3

Wait more than 10 minutes and observe the status of the virt-controller pods:
$ oc get pods -n openshift-cnv | grep virt-controller
virt-controller-5cc6f78f8f-nvd59   1/1   Running   0   14m
virt-controller-5cc6f78f8f-s2wdb   1/1   Running   0   43h

No panic happened. Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.13.0 Images security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3205