Bug 2171395

Summary: virt-controller crashes because of out-of-bounds slice access in evacuation controller
Product: Container Native Virtualization (CNV) Reporter: Igor Bezukh <ibezukh>
Component: Virtualization    Assignee: Igor Bezukh <ibezukh>
Status: CLOSED ERRATA QA Contact: zhe peng <zpeng>
Severity: high Docs Contact:
Priority: high    
Version: 4.13.0    CC: acardace, kbidarka
Target Milestone: ---   
Target Release: 4.13.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: hco-bundle-registry-container-v4.13.0.rhel9-1689 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 2185068 (view as bug list)    Environment:
Last Closed: 2023-05-18 02:57:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2185068    

Description Igor Bezukh 2023-02-20 09:46:27 UTC
Description of problem:
The evacuation controller hits an out-of-bounds slice access, which causes virt-controller to panic. The problem is that the index calculation in the evacuation controller is not guarded against negative values, so the computed index can become negative.
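The faulty pattern can be sketched as follows. This is an illustrative minimal reproduction, not the actual KubeVirt code: the function and variable names are invented, and the assumed mechanism is that the controller derives a slice index from "configured limit minus in-flight migrations", which goes negative when more migrations are already outstanding than the limit allows.

```go
package main

import "fmt"

// selectCandidates returns the VMIs that may start migrating now.
// freeSpots = limit - inFlight can go negative when more migrations
// are already in flight than the configured limit allows; without
// the clamp below, candidates[:freeSpots] panics with a slice
// out-of-range error, crashing the controller.
func selectCandidates(candidates []string, limit, inFlight int) []string {
	freeSpots := limit - inFlight
	if freeSpots < 0 { // the guard the bug was missing
		freeSpots = 0
	}
	if freeSpots > len(candidates) {
		freeSpots = len(candidates)
	}
	return candidates[:freeSpots]
}

func main() {
	pending := []string{"vm-fedora1", "vm-fedora2", "vm-fedora3"}
	// limit=2, inFlight=5: the unclamped index would be -3 and panic
	fmt.Println(selectCandidates(pending, 2, 5)) // prints []
	fmt.Println(selectCandidates(pending, 4, 2)) // prints [vm-fedora1 vm-fedora2]
}
```

With the clamp in place, an over-subscribed node simply yields zero new candidates instead of a panic, which matches the verified behavior below (migrations stay pending, virt-controller keeps running).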


Version-Release number of selected component (if applicable):
4.13.0

How reproducible:
100%

Steps to Reproduce:
0. IMPORTANT: make sure that the KubeVirt control-plane components are deployed on infra nodes, not on worker nodes. The reproduction involves a node drain, and we don't want the controllers themselves to be evicted during it.
1. In the KubeVirt configuration, set:
  spec.configuration.migrations.parallelMigrationsPerCluster: 200
  spec.configuration.migrations.parallelOutboundMigrationsPerNode: 100

2. Add a custom label to one of the worker nodes, for example "type=worker001"
3. Create 5 migratable VMIs with a nodeSelector of "type=worker001"
4. Drain the worker node with the label "type=worker001"
5. Make sure you see 5 pending VM instance migrations ("oc get vmim")
6. Wait 4-5 minutes, observe the virt-controller pods status

Actual results:
virt-controller panics due to the out-of-bounds slice access in the evacuation controller.

Expected results:
No virt-controller panic; the migrations remain pending and the virt-controller pods stay in Running state.

Additional info:

Comment 1 zhe peng 2023-03-30 07:19:26 UTC
Tested with build: CNV-v4.13.0.rhel9-1884

Steps:
1. Check that control-plane components are not on worker nodes:
$ oc get nodes
NAME                                 STATUS   ROLES                  AGE   VERSION
c01-zpeng-413-dff6b-master-0         Ready    control-plane,master   43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-master-1         Ready    control-plane,master   43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-master-2         Ready    control-plane,master   43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-worker-0-fdmgv   Ready    worker                 43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-worker-0-j6bj6   Ready    worker                 43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-worker-0-jfjgb   Ready    worker                 43h   v1.26.2+dc93b13

2. Set the migration config in the KubeVirt CR:
 migrations:
      allowAutoConverge: false
      allowPostCopy: false
      completionTimeoutPerGiB: 800
      parallelMigrationsPerCluster: 200
      parallelOutboundMigrationsPerNode: 100
      progressTimeout: 150

3. Add a label to a worker node:
$ oc label node c01-zpeng-413-dff6b-worker-0-fdmgv type=worker001
node/c01-zpeng-413-dff6b-worker-0-fdmgv labeled

4. Create 5 VMs with a nodeSelector:
spec:
      nodeSelector:
        type: worker001

$ oc get vmi
NAME         AGE     PHASE     IP             NODENAME                             READY
vm-fedora1   17m     Running   10.131.0.231   c01-zpeng-413-dff6b-worker-0-fdmgv   True
vm-fedora2   15m     Running   10.131.0.232   c01-zpeng-413-dff6b-worker-0-fdmgv   True
vm-fedora3   11m     Running   10.131.0.234   c01-zpeng-413-dff6b-worker-0-fdmgv   True
vm-fedora4   8m13s   Running   10.131.0.235   c01-zpeng-413-dff6b-worker-0-fdmgv   True
vm-fedora5   4m37s   Running   10.131.0.236   c01-zpeng-413-dff6b-worker-0-fdmgv   True

5. Drain the worker node with the label:
$ oc adm cordon c01-zpeng-413-dff6b-worker-0-fdmgv
node/c01-zpeng-413-dff6b-worker-0-fdmgv cordoned

$ oc adm drain c01-zpeng-413-dff6b-worker-0-fdmgv --ignore-daemonsets=true --delete-emptydir-data=true

6. Make sure there are 5 pending migrations:
$ oc get vmim
NAME                        PHASE        VMI
kubevirt-evacuation-2hvw6   Scheduling   vm-fedora1
kubevirt-evacuation-5tfgc   Scheduling   vm-fedora2
kubevirt-evacuation-6zkst   Scheduling   vm-fedora5
kubevirt-evacuation-gzwbx   Scheduling   vm-fedora4
kubevirt-evacuation-h2tlv   Scheduling   vm-fedora3

Wait more than 10 minutes, then observe the virt-controller pods' status:
$ oc get pods -n openshift-cnv | grep virt-controller
virt-controller-5cc6f78f8f-nvd59                                  1/1     Running   0             14m
virt-controller-5cc6f78f8f-s2wdb                                  1/1     Running   0             43h

No panic occurred.
Moving to VERIFIED.

Comment 3 errata-xmlrpc 2023-05-18 02:57:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.13.0 Images security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3205