Bug 1903667
| Summary: | VMIs get created with an undefined state field, which is confusing to the user | ||
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | guy chen <guchen> |
| Component: | Virtualization | Assignee: | Jed Lejosne <jlejosne> |
| Status: | CLOSED ERRATA | QA Contact: | guy chen <guchen> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 2.5.0 | CC: | cnv-qe-bugs, fdeutsch, ipinto, jlejosne, kbidarka, sgott |
| Target Milestone: | --- | ||
| Target Release: | 4.8.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | hco-bundle-registry-container-v4.8.0-347 virt-operator-container-v4.8.0-58 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 14:21:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
guy chen
2020-12-02 15:39:44 UTC
Description of problem:
I ran a load test that starts a batch of 1K VMs on the system, i.e. I ran the command virtctl start {VMS} 1K times.
There are 283 VMIs that have no status:
[root]# oc get vmi | grep -c Run
106
[root]# oc get vmi | grep -c Failed
330
[root]# oc get vmi | grep -c Scheduling
137
[root]# oc get vmi | grep -c Pending
144
Thus 283 of the 1000 VMIs (1000 - 106 - 330 - 137 - 144) have no status.
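For reference, a more direct way to count VMIs whose phase is unset (a sketch only, assuming jq is available; .status.phase is the field shown in the PHASE column):

oc get vmi -o json | jq '[.items[] | select((.status.phase // "") == "")] | length'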
Version-Release number of selected component (if applicable):
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.6.0-rc.4 True False 21d Cluster version is 4.6.0-rc.4
NAME DISPLAY VERSION REPLACES PHASE
kubevirt-hyperconverged-operator.v2.5.0 OpenShift Virtualization 2.5.0 kubevirt-hyperconverged-operator.v2.4.3 Succeeded
How reproducible:
Always
Steps to Reproduce:
1. Create 1K PVCs
2. Create 1K VMs
3. Run virtctl start VM_NAME 1K times (a minimal loop sketch follows below)
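A minimal sketch of step 3, assuming the 1K VMs were created with names vm-1 .. vm-1000 (the exact names and namespace are assumptions):

#!/bin/bash
# Start each of the 1000 pre-created VMs one by one.
for number in {1..1000}
do
    virtctl start "vm-${number}"
done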
Actual results:
Some VMIs have no status.
Expected results:
All VMIs have a status.
Additional info:
A VMI with no status looks like vm-109 in the listing below:
oc get vmi | more
NAME AGE PHASE IP NODENAME
vm-1 6d2h Running 10.131.3.93 f25-h03-000-r730xd.rdu2.scalelab.redhat.com
vm-10 6d2h Running 10.128.5.241 f25-h11-000-r730xd.rdu2.scalelab.redhat.com
vm-100 6d2h Running 10.130.2.164 f25-h09-000-r730xd.rdu2.scalelab.redhat.com
vm-1000 12m Pending
vm-101 6d2h Running 10.128.5.253 f25-h11-000-r730xd.rdu2.scalelab.redhat.com
vm-102 6d2h Running 10.128.5.254 f25-h11-000-r730xd.rdu2.scalelab.redhat.com
vm-103 6d2h Running 10.129.2.7 f19-h27-000-r620.rdu2.scalelab.redhat.com
vm-104 6d2h Running 10.130.4.193 f25-h07-000-r730xd.rdu2.scalelab.redhat.com
vm-105 6d2h Running 10.128.4.10 f25-h11-000-r730xd.rdu2.scalelab.redhat.com
vm-106 6d2h Running 10.129.4.23 f25-h05-000-r730xd.rdu2.scalelab.redhat.com
vm-107 19m Failed
vm-108 19m Failed
vm-109 5m40s
vm-11 6d2h Running 10.130.4.179 f25-h07-000-r730xd.rdu2.scalelab.redhat.com
vm-110 5m17s
vm-111 9m52s Pending
vm-112 22m Failed
vm-113 23m Failed
vm-114 5m19s
vm-115 4m40s
With these VMIs that don't have a status, are you able to interact with them at all, e.g. stop the VMI or connect to it via console? My impression, just looking at the timestamps, is that there was a point in time where VMIs (days earlier) were fine, a point where they were failing, a point where they were still pending, and a point where we were just not getting any status at all. With that, I'm of the impression that these VMIs never ran at all. This PR might also be related to this issue: https://github.com/kubevirt/kubevirt/pull/4772

@kbidarka the PR you mentioned does reduce the amount of time a VMI can be in a blank state, but https://github.com/kubevirt/kubevirt/pull/4705 actually fixes the issue by creating VMIs as Pending right away.

@Guy, while populating multiple VMs during scale testing, were you able to reproduce this bug with 4.8.0?

The bug fix was not merged into the version I tested. Once I deploy a version with the fix in it, I will verify it.

Verify with: virt-operator-container-v4.8.0-63
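As a quick single-VM spot check of the fix described above (the VMI should report a phase, starting with Pending, as soon as it is created), something like the following could be used. This is a sketch only; the vm-cirros manifest and the bug-status namespace are assumptions borrowed from the bulk-creation script below.

oc apply -f vm-cirros.yaml -n bug-status
virtctl start vm-cirros -n bug-status
# The phase should never be empty; Pending (or a later phase) is expected:
oc get vmi vm-cirros -n bug-status -o jsonpath='{.status.phase}{"\n"}'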
Create 1000 VMs and check the VM status.
script:
----
#!/bin/bash
# Create 1000 VMs (vm-cirros-501 .. vm-cirros-1500) in the bug-status namespace
# by templating vm-cirros.yaml with a unique name for each VM.
for number in {501..1500}
do
    echo "vm: $number"
    cp vm-cirros.yaml vm-cirros-new.yaml
    sed -i 's|vm-cirros|vm-cirros-'"${number}"'|g' vm-cirros-new.yaml
    oc apply -f vm-cirros-new.yaml -n bug-status
done
exit 0
VMs are in the statuses Running, Scheduling, Pending, or Scheduled.
Did not see any VM with no status (a quick listing check is sketched below).
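One way to confirm this, assuming the default oc get vmi column layout (NAME AGE PHASE IP NODENAME): list any VMI whose PHASE column is empty; the command should print nothing when every VMI has a phase.

oc get vmi -n bug-status --no-headers | awk 'NF < 3 { print $1 }'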
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2920