1937609 – VM cannot be restarted

Summary: VM cannot be restarted

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Container Native Virtualization (CNV)
Classification:	Red Hat
Component:	Virtualization
Sub Component:
Version:	4.8.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Prita Narayan
QA Contact:	Kedar Bidarkar
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-03-11 06:48 UTC by Guohua Ouyang
Modified:	2023-11-13 08:19 UTC (History)
CC List:	8 users (show)
Fixed In Version:	hco-bundle-registry-container-v4.11.0-491
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-09-14 19:28:21 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
vm-example (6.04 KB, text/plain) 2021-03-23 02:37 UTC, Guohua Ouyang

Links
System	ID	Priority	Status	Summary	Last Updated
Github	kubevirt kubevirt pull 7494	None	open	VM with RunStrategyHalted now accepts manual stop request...	2022-05-27 13:32:54 UTC
Github	kubevirt kubevirt pull 7860	None	open	[release-0.53] VM with RunStrategyHalted now accepts manual stop request...	2022-06-07 11:41:37 UTC
Red Hat Issue Tracker	CNV-10900	None	None	None	2023-11-13 08:19:19 UTC
Red Hat Product Errata	RHSA-2022:6526	None	None	None	2022-09-14 19:28:30 UTC

Description Guohua Ouyang 2021-03-11 06:48:02 UTC

Description of problem:
VM cannot be restarted multiple times.


Version-Release number of selected component (if applicable):
$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v4.8.0   OpenShift Virtualization   4.8.0     kubevirt-hyperconverged-operator.v2.5.3   Succeeded


How reproducible:
100%

Steps to Reproduce:
1. $ oc get vmi vm-example
NAME         AGE   PHASE     IP            NODENAME
vm-example   68s   Running   10.129.3.36   uit02-qm9jh-worker-0-wzgtv

2. $ virtctl restart vm-example
VM vm-example was scheduled to restart

3. $ oc get vmi vm-example
NAME         AGE   PHASE        IP    NODENAME
vm-example   3s    Scheduling

$ oc get vmi vm-example
NAME         AGE   PHASE       IP    NODENAME
vm-example   8s    Scheduled         uit02-qm9jh-worker-0-wzgtv

$ oc get vmi vm-example
NAME         AGE   PHASE     IP            NODENAME
vm-example   15s   Running   10.129.3.37   uit02-qm9jh-worker-0-wzgtv
4. $ virtctl restart vm-example
VM vm-example was scheduled to restart

5. $ virtctl restart  vm-example
Error restarting VirtualMachine Internal error occurred: unable to complete request: stop/start already underway


Actual results:
The VM is never restarted at step 4, even though it reports that the VM was scheduled to restart.
It reports an error at step 5.

Expected results:
VM can be restarted multiple times.

Additional info:

Comment 1 sgott 2021-03-15 21:19:14 UTC

In step 4, KubeVirt reported that the VMI was scheduled to restart. Did you wait for it? Did it successfully re-start?

As written, step 5 is expected behavior. We designed it this way because it would be even more confusing (and not very useful) to queue up multiple re-starts in succession.
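
To illustrate the rule, here is a minimal Python sketch of "reject a second request while one is pending". It is a toy model and not KubeVirt's actual code: the class `RestartTracker` and its methods are hypothetical names. Only the two message strings are copied from the reproduction steps in this bug.

```python
class RestartTracker:
    """Toy model: at most one stop/start request may be pending per VM."""

    def __init__(self) -> None:
        # Names of VMs that currently have a stop/start request in flight.
        self._pending = set()

    def request_restart(self, vm_name: str) -> str:
        # A second request is refused instead of being queued.
        if vm_name in self._pending:
            raise RuntimeError(
                "unable to complete request: stop/start already underway"
            )
        self._pending.add(vm_name)
        return f"VM {vm_name} was scheduled to restart"

    def mark_done(self, vm_name: str) -> None:
        # Called (in this model) once the VM has been stopped and started again.
        self._pending.discard(vm_name)
```

In this model, the call in step 4 succeeds, the call in step 5 raises, and a new restart is accepted again only after `mark_done` has been called for that VM.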

Comment 2 Guohua Ouyang 2021-03-16 00:06:51 UTC

(In reply to sgott from comment #1)
> In step 4, KubeVirt reported that the VMI was scheduled to restart. Did you
> wait for it? Did it successfully re-start?
> 
> As written, step 5 is expected behavior. We designed it this way because it
> would be even more confusing (and not very useful) to queue up multiple
> re-starts in succession.

======
Actual results:
The VM is never restarted at step 4 (it stays Running), even though it reports that the VM was scheduled to restart.

Comment 3 sgott 2021-03-16 12:44:14 UTC

Thanks for clarifying. I've altered the title of this BZ to reflect the problem more directly: the VM cannot be restarted at all.

David, is restart subject to the same rules as shutdown? i.e. does it use ACPI events and await a graceful stop? Kill the VMI after a grace period timeout?

Comment 4 Kedar Bidarkar 2021-03-17 13:20:36 UTC

Will try to reproduce this bug.

Comment 5 Kedar Bidarkar 2021-03-18 20:10:00 UTC

I tried to reproduce the bug, but was unable to with CNV-v4.8.0.


 [kbidarka@localhost nfs]$ oc get vm 
NAME            AGE     VOLUME
vm-nfs-rhel83   7m55s   
 [kbidarka@localhost nfs]$ oc get vmi 
No resources found in default namespace.
 [kbidarka@localhost nfs]$ virtctl start vm-nfs-rhel83
VM vm-nfs-rhel83 was scheduled to start
 [kbidarka@localhost nfs]$ oc get vmi 
NAME            AGE   PHASE        IP    NODENAME
vm-nfs-rhel83   3s    Scheduling         
 [kbidarka@localhost nfs]$ oc get vmi 
NAME            AGE   PHASE     IP             NODENAME
vm-nfs-rhel83   13s   Running   xx.yy.zz.142   cnv-qe.redhat.com
 [kbidarka@localhost nfs]$ oc get vmi 
NAME            AGE   PHASE     IP             NODENAME
vm-nfs-rhel83   71s   Running   xx.yy.zz.142   cnv-qe.redhat.com
 [kbidarka@localhost nfs]$ #virtctl restart vm-nfs-rhel83
 [kbidarka@localhost nfs]$ virtctl console vm-nfs-rhel83
Successfully connected to vm-nfs-rhel83 console. The escape sequence is ^]

Red Hat Enterprise Linux 8.3 (Ootpa)
Kernel 4.18.0-240.12.1.el8_3.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm-nfs-rhel83 login: cloud-user
Password: 
[cloud-user@vm-nfs-rhel83 ~]$ sudo su - 
[root@vm-nfs-rhel83 ~]#  [kbidarka@localhost nfs]$ 
 [kbidarka@localhost nfs]$ 
 [kbidarka@localhost nfs]$ virtctl restart vm-nfs-rhel83
VM vm-nfs-rhel83 was scheduled to restart
 [kbidarka@localhost nfs]$ oc get vmi 
NAME            AGE     PHASE       IP             NODENAME
vm-nfs-rhel83   2m54s   Succeeded   xx.yy.zz.142   cnv-qe.redhat.com
 [kbidarka@localhost nfs]$ oc get vmi -o wide 
NAME            AGE   PHASE     IP    NODENAME   LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   0s    Pending                                      
 [kbidarka@localhost nfs]$ oc get vmi -o wide 
NAME            AGE   PHASE     IP             NODENAME                                         LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   6s    Running   xx.yy.zz.143   cnv-qe.redhat.com   True              
 [kbidarka@localhost nfs]$ oc get vmi -o wide 
NAME            AGE   PHASE     IP             NODENAME                                         LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   63s   Running   xx.yy.zz.143   cnv-qe.redhat.com   True              
 [kbidarka@localhost nfs]$ virtctl console vm-nfs-rhel83
Successfully connected to vm-nfs-rhel83 console. The escape sequence is ^]

Red Hat Enterprise Linux 8.3 (Ootpa)
Kernel 4.18.0-240.12.1.el8_3.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm-nfs-rhel83 login: cloud-user
Password: 
Last login: Thu Mar 18 15:46:49 on ttyS0
[cloud-user@vm-nfs-rhel83 ~]$  [kbidarka@localhost nfs]$ 
 [kbidarka@localhost nfs]$ virtctl restart vm-nfs-rhel83
VM vm-nfs-rhel83 was scheduled to restart
 [kbidarka@localhost nfs]$ oc get vmi -o wide 
NAME            AGE   PHASE       IP             NODENAME                                         LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   96s   Succeeded   xx.yy.zz.143   cnv-qe.redhat.com   True              
 [kbidarka@localhost nfs]$ oc get vmi -o wide 
NAME            AGE   PHASE        IP    NODENAME   LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   0s    Scheduling                                      
 [kbidarka@localhost nfs]$ oc get vmi -o wide 
NAME            AGE   PHASE       IP    NODENAME                                         LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   3s    Scheduled         cnv-qe.redhat.com   True              
vm-nfs-rhel83   49s   Running   xx.yy.zz.144   cnv-qe.redhat.com   True              
 [kbidarka@localhost nfs]$ virtctl console vm-nfs-rhel83
Successfully connected to vm-nfs-rhel83 console. The escape sequence is ^]

Red Hat Enterprise Linux 8.3 (Ootpa)
Kernel 4.18.0-240.12.1.el8_3.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm-nfs-rhel83 login: cloud-user
Password: 
Last login: Thu Mar 18 15:49:22 on ttyS0
[cloud-user@vm-nfs-rhel83 ~]$  [kbidarka@localhost nfs]$ 
 [kbidarka@localhost nfs]$ virtctl restart vm-nfs-rhel83
VM vm-nfs-rhel83 was scheduled to restart
 [kbidarka@localhost nfs]$ oc get vmi -o wide 
NAME            AGE   PHASE       IP             NODENAME                                         LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   77s   Succeeded   xx.yy.zz.144   cnv-qe.redhat.com   True              
 [kbidarka@localhost nfs]$ oc get vmi -o wide 
NAME            AGE   PHASE   IP    NODENAME   LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   0s                                               
 [kbidarka@localhost nfs]$ oc get vmi -o wide 
NAME            AGE   PHASE        IP    NODENAME   LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   2s    Scheduling                                      
 [kbidarka@localhost nfs]$ oc get vmi -o wide 
NAME            AGE   PHASE       IP    NODENAME                                         LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   5s    Scheduled         cnv-qe.redhat.com   True              
 [kbidarka@localhost nfs]$ oc get vmi -o wide 
NAME            AGE   PHASE     IP             NODENAME                                         LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   16s   Running   xx.yy.zz.146   cnv-qe.redhat.com   True              


--------------------------------------------------------

"upstream-version": "0.39.0-rc.0-55-ge95b9bc",
"url": "cnv/virt-operator/images/v4.8.0-15",

Comment 6 Guohua Ouyang 2021-03-19 01:35:28 UTC

Could you try to restart the VM just after it becomes Running? Is this a valid scenario?

Comment 7 sgott 2021-03-22 12:48:48 UTC

Jed, can you take a look at this?

Comment 8 Jed Lejosne 2021-03-22 13:05:06 UTC

Sure thing!

Comment 9 Jed Lejosne 2021-03-22 13:49:00 UTC

@gouyang could you please include the yaml for that VM?
After a few attempts with a simple VM, I wasn't able to reproduce this issue.

Comment 10 Guohua Ouyang 2021-03-23 02:37:55 UTC

Created attachment 1765437 [details]
vm-example

This is the default VM YAML from the console (create wizard -> 'With YAML').

I also didn't see this problem on a simple VM; it seems to happen only with VMs that use common templates.

Comment 12 Jed Lejosne 2021-03-23 12:31:58 UTC

For some reason, that VM takes about 4 minutes to boot, probably something to investigate!
I started the VM and after a few seconds requested a restart, and I had to wait the whole 4 minutes for the VM to finish booting and finally do its graceful restart.

@gouyang could you please make sure the VM is not just taking a (very) long time to restart?
If not, could you please include a capture of the VNC console after ~5 minutes?

@sgott I'd love an answer to your question "is restart subject to the same rules as shutdown? i.e. does it use ACPI events and await a graceful stop? Kill the VMI after a grace period timeout?". I'll see if I can find an answer in the code!

Comment 13 Guohua Ouyang 2021-03-24 00:43:56 UTC

(In reply to Jed Lejosne from comment #12)
> For some reason, that VM takes about 4 minutes to boot, probably something
> to investigate!
> I started the VM and after a few seconds requested a restart, and I had to
> wait the whole 4 minutes for the VM to finish booting and finally do its
> graceful restart.
> 

It takes about 3 minutes for the VM to restart, which matches the graceful termination period defined in the template.

ref: https://github.com/kubevirt/common-templates/blob/f30ca1cac08e600bc4102516f8c504b08543413d/templates/fedora.tpl.yaml#L132

> @gouyang could you please make sure the VM is not just taking a
> (very) long time to restart?

- If I wait long enough before restarting the VM (for example, by watching the VNC console until the login prompt appears), the problem does not occur.
- If I restart the VM immediately after it becomes 'Running', it hits the problem.

The simple VM does not have this problem because no `terminationGracePeriodSeconds` is defined in it.

> If not, could you please include a capture of the VNC console after ~5
> minutes?
> 
> @sgott I'd love an answer to your question "is restart subject to
> the same rules as shutdown? i.e. does it use ACPI events and await a
> graceful stop? Kill the VMI after a grace period timeout?". I'll see if I
> can find an answer in the code!

It looks like restart uses the same rules as shutdown.

Comment 14 sgott 2021-04-01 17:25:28 UTC

Based on Comment #13, we're closing this as notabug--since the VM does restart. Please re-open if you feel this is in error.

Comment 15 Guohua Ouyang 2021-04-02 01:17:40 UTC

I can't agree that this is not a bug.
It clearly shows that the VM cannot be restarted just after it becomes `running`. Why should a user have to wait for some time before performing a restart?

If we don't plan to improve this, we at least need to document it.

Comment 16 sgott 2021-04-05 14:43:25 UTC

Guohua, it's not clear what we can usefully do for you here. VMI restart is implemented as a shutdown followed by a start. This means we need to use ACPI events to signal a graceful shutdown (or that fails and TerminationGracePeriodSeconds is exceeded and shutdown occurs anyways). This means the OS needs to be capable of responding to ACPI events--in particular it needs to be running.

In your case, the guest you're using takes an unusually long time to boot up--before it can respond to ACPI events and shut back down. It's not clear why your guest takes so long to boot, but the problem appears to be your VMI.

The problem here is that generalized advice such as "if you restart right away, things will take a long time to reboot" just isn't universally true. The fact that reboot takes four minutes is specific to this VMI.
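
The timing described above can be sketched as a small Python model. It is a simplification based only on what this thread reports, not on the KubeVirt source: it assumes a single ACPI shutdown signal is sent when the restart is requested, and that the signal is lost if the guest is not yet able to handle it. The function name and the 10-second shutdown duration are hypothetical; the 20-second and 180-second values are the approximate figures mentioned in this bug.

```python
def seconds_until_stopped(requested_at: float, acpi_ready_at: float,
                          grace_period: float, shutdown_duration: float) -> float:
    """Seconds between the restart request and the guest actually stopping.

    requested_at      -- seconds after boot start when restart was requested
    acpi_ready_at     -- seconds after boot start when the guest handles ACPI
    grace_period      -- terminationGracePeriodSeconds for the VMI
    shutdown_duration -- how long a graceful guest shutdown takes (assumed)
    """
    if requested_at >= acpi_ready_at:
        # The guest sees the ACPI event and shuts down gracefully,
        # but never later than the grace period allows.
        return min(shutdown_duration, grace_period)
    # The ACPI event arrives too early and is ignored, so the VMI is
    # only stopped once the grace period has been exceeded.
    return grace_period


early = seconds_until_stopped(5, 20, 180, 10)   # restart requested during boot
late = seconds_until_stopped(60, 20, 180, 10)   # restart requested after boot
print(early, late)
```

Under these assumptions, a restart requested 5 seconds into boot waits the full 180 seconds, while the same request made after the guest is up completes in about 10 seconds. A VM with no long grace period would not show the difference.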

So let's recap.

This VMI does restart.
Rebooting while a reboot is in progress isn't supposed to work.
This VMI takes a long time to boot up.

What's the path forward in your view?

Comment 17 Guohua Ouyang 2021-04-08 02:32:11 UTC

From my perspective, restarting, shutting down, or deleting a 'Running' VM is a very normal operation, and users do not expect to wait for some time before the action is actually performed.
1. What do you think about adding an extra status that indicates the OS is capable of responding to ACPI events, so operations like restart/delete/shutdown can be performed smoothly?
2. If we don't improve the core/backend, do we need to take some action in the UI, such as preventing users from performing these actions until the OS is capable of responding to ACPI events?
3. Do we need to document this behavior?

Comment 18 Guohua Ouyang 2021-04-08 02:55:30 UTC

(In reply to sgott from comment #16)

> This VMI does restart.
> Rebooting while a reboot is in progress isn't supposed to work.
> This VMI takes a long time to boot up.
> 
> What's the path forward in your view?

A possible improvement: can the VM be restarted as soon as the OS is capable of responding to ACPI events (20s), rather than waiting for TerminationGracePeriodSeconds (3 minutes)?

Comment 19 sgott 2022-01-28 20:31:14 UTC

Guohua, we're planning on allowing "--grace-period=0 --force" to be called for a VM that's already being shut down. That is to say a second API call with a shorter timeout will be honored right away. I think that might also address this issue in a reasonable way. For cases such as this where the user needs a VM to shut down faster, this would be a mechanism to make that happen.
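
As a rough illustration of "a second API call with a shorter timeout will be honored right away", the following Python sketch picks the earliest deadline among all stop requests received for a VM. This is an assumed model for discussion only; the function name `kill_deadline` and the tuple layout are hypothetical and do not come from the KubeVirt API.

```python
from typing import Iterable, Tuple


def kill_deadline(stop_requests: Iterable[Tuple[float, float]]) -> float:
    """Return the time (in seconds) at which the VMI is forcibly stopped.

    Each request is an (issued_at, grace_period) pair in seconds.
    In this model the earliest resulting deadline wins, so a later
    request with a shorter grace period can shorten the wait.
    """
    deadlines = [issued_at + grace for issued_at, grace in stop_requests]
    if not deadlines:
        raise ValueError("at least one stop request is required")
    return min(deadlines)


first_only = kill_deadline([(0, 180)])            # original request alone
with_forced = kill_deadline([(0, 180), (30, 0)])  # forced request at t=30
print(first_only, with_forced)
```

With only the original request, the VMI would be stopped at 180 seconds. Adding a request with a grace period of 0 at the 30-second mark moves the deadline to 30 seconds, which is the behavior this comment describes for a user who needs the VM to shut down faster.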

Comment 20 Guohua Ouyang 2022-02-07 01:50:50 UTC

It's good that the backend will support a second API call with a shorter timeout. The CLI (virtctl) and the UI should also have a way to pass the timeout parameter. I'm glad to test it again once it's ready.

Comment 21 sgott 2022-05-27 13:32:41 UTC

Re-assigning this to Prita as this appears to be fixed by https://github.com/kubevirt/kubevirt/pull/7494

Comment 22 sgott 2022-06-07 11:41:37 UTC

Backport PR is still open. Moving this back to POST.

Comment 24 Guohua Ouyang 2022-06-22 02:00:54 UTC

Verified the bug on the latest CNV 4.11; the VM can be restarted normally.
On every restart, the VM is stopped first and becomes Running again after some time.

Comment 26 errata-xmlrpc 2022-09-14 19:28:21 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6526
