Bug 2171395 - virt-controller crashes because of out-of-bound slice access in evacuation controller
Summary: virt-controller crashes because of out-of-bound slice access in evacuation controller
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.13.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.13.0
Assignee: Igor Bezukh
QA Contact: zhe peng
URL:
Whiteboard:
Depends On:
Blocks: 2185068
 
Reported: 2023-02-20 09:46 UTC by Igor Bezukh
Modified: 2023-05-18 02:58 UTC (History)
2 users

Fixed In Version: hco-bundle-registry-container-v4.13.0.rhel9-1689
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned to: 2185068 (view as bug list)
Environment:
Last Closed: 2023-05-18 02:57:49 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github kubevirt kubevirt pull 9338 0 None Merged [virt-controller] fix out-of-bound slice index in evacuation controller 2023-03-03 15:01:59 UTC
Red Hat Issue Tracker CNV-25883 0 None None None 2023-02-20 09:48:11 UTC
Red Hat Product Errata RHSA-2023:3205 0 None None None 2023-05-18 02:58:01 UTC

Description Igor Bezukh 2023-02-20 09:46:27 UTC
Description of problem:
The evacuation controller performs an out-of-bound slice access, which causes virt-controller to panic. The index calculation in the evacuation controller is not guarded against negative values, so the computed slice index can become negative.
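The fix (upstream PR kubevirt/kubevirt#9338) guards the index calculation. The following is a minimal Go sketch of the failure mode and the guard; the function and variable names are illustrative, not the actual KubeVirt code, which derives the remaining migration slots from the configured limits:

```go
package main

import "fmt"

// selectCandidates is an illustrative stand-in for the evacuation
// controller's selection logic: it takes at most "limit - active"
// VMIs from the candidate list. Without the clamp below, an active
// count above the limit makes the bound negative, and slicing with
// a negative bound panics at runtime ("slice bounds out of range"),
// crashing virt-controller.
func selectCandidates(candidates []string, limit, active int) []string {
	free := limit - active
	if free < 0 {
		// The fix: never let the slice bound go negative.
		free = 0
	}
	if free > len(candidates) {
		free = len(candidates)
	}
	return candidates[:free]
}

func main() {
	vmis := []string{"vm-fedora1", "vm-fedora2", "vm-fedora3"}
	fmt.Println(selectCandidates(vmis, 5, 3)) // [vm-fedora1 vm-fedora2]
	fmt.Println(selectCandidates(vmis, 2, 5)) // [] instead of a panic
}
```

Before the guard, the second call would have evaluated vmis[:-3] and panicked, matching the crash described above.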


Version-Release number of selected component (if applicable):
4.13.0

How reproducible:
100%

Steps to Reproduce:
0. IMPORTANT: make sure that the KubeVirt control-plane components are deployed on infra nodes, not on worker nodes. The reproduction involves a node drain, and we don't want the controllers to be evicted during it.
1. In the KubeVirt configuration, set
  spec.configuration.migrations.parallelMigrationsPerCluster: 200
  spec.configuration.migrations.parallelOutboundMigrationsPerNode: 100

2. Add custom label on one of the worker nodes, for example "type=worker001"
3. Create 5 migratable VMIs with nodeSelector of "type=worker001"
4. Drain the worker node with the label "type=worker001"
5. Make sure you see 5 pending VM instance migrations ("oc get vmim")
6. Wait 4-5 minutes and observe the virt-controller pods' status

Actual results:
virt-controller panics because of the out-of-bound slice access and the pod enters a restart loop.

Expected results:
virt-controller keeps running and the pending migrations are handled without a panic.

Additional info:

Comment 1 zhe peng 2023-03-30 07:19:26 UTC
Tested with build: CNV-v4.13.0.rhel9-1884

Steps:
1. Check that control-plane components are not on worker nodes
$ oc get nodes
NAME                                 STATUS   ROLES                  AGE   VERSION
c01-zpeng-413-dff6b-master-0         Ready    control-plane,master   43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-master-1         Ready    control-plane,master   43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-master-2         Ready    control-plane,master   43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-worker-0-fdmgv   Ready    worker                 43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-worker-0-j6bj6   Ready    worker                 43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-worker-0-jfjgb   Ready    worker                 43h   v1.26.2+dc93b13

2. Set the migration config in the KubeVirt CR
 migrations:
      allowAutoConverge: false
      allowPostCopy: false
      completionTimeoutPerGiB: 800
      parallelMigrationsPerCluster: 200
      parallelOutboundMigrationsPerNode: 100
      progressTimeout: 150

3. Add a label to a worker node
$ oc label node c01-zpeng-413-dff6b-worker-0-fdmgv type=worker001
node/c01-zpeng-413-dff6b-worker-0-fdmgv labeled

4. Create 5 VMs and add a nodeSelector
spec:
      nodeSelector:
        type: worker001

$ oc get vmi
NAME         AGE     PHASE     IP             NODENAME                             READY
vm-fedora1   17m     Running   10.131.0.231   c01-zpeng-413-dff6b-worker-0-fdmgv   True
vm-fedora2   15m     Running   10.131.0.232   c01-zpeng-413-dff6b-worker-0-fdmgv   True
vm-fedora3   11m     Running   10.131.0.234   c01-zpeng-413-dff6b-worker-0-fdmgv   True
vm-fedora4   8m13s   Running   10.131.0.235   c01-zpeng-413-dff6b-worker-0-fdmgv   True
vm-fedora5   4m37s   Running   10.131.0.236   c01-zpeng-413-dff6b-worker-0-fdmgv   True

5. Drain the worker node with the label
$ oc adm cordon c01-zpeng-413-dff6b-worker-0-fdmgv
node/c01-zpeng-413-dff6b-worker-0-fdmgv cordoned

$ oc adm drain c01-zpeng-413-dff6b-worker-0-fdmgv --ignore-daemonsets=true --delete-emptydir-data=true

6. Make sure there are 5 pending migrations
$ oc get vmim
NAME                        PHASE        VMI
kubevirt-evacuation-2hvw6   Scheduling   vm-fedora1
kubevirt-evacuation-5tfgc   Scheduling   vm-fedora2
kubevirt-evacuation-6zkst   Scheduling   vm-fedora5
kubevirt-evacuation-gzwbx   Scheduling   vm-fedora4
kubevirt-evacuation-h2tlv   Scheduling   vm-fedora3

Wait more than 10 minutes and observe the virt-controller pods' status:
$ oc get pods -n openshift-cnv | grep virt-controller
virt-controller-5cc6f78f8f-nvd59                                  1/1     Running   0             14m
virt-controller-5cc6f78f8f-s2wdb                                  1/1     Running   0             43h

No panic happened.
Moving to VERIFIED.

Comment 3 errata-xmlrpc 2023-05-18 02:57:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.13.0 Images security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3205

