Description of problem:
VM cannot be restarted multiple times in succession.

Version-Release number of selected component (if applicable):
$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v4.8.0   OpenShift Virtualization   4.8.0     kubevirt-hyperconverged-operator.v2.5.3   Succeeded

How reproducible:
100%

Steps to Reproduce:
1. $ oc get vmi vm-example
   NAME         AGE   PHASE     IP            NODENAME
   vm-example   68s   Running   10.129.3.36   uit02-qm9jh-worker-0-wzgtv
2. $ virtctl restart vm-example
   VM vm-example was scheduled to restart
3. $ oc get vmi vm-example
   NAME         AGE   PHASE        IP   NODENAME
   vm-example   3s    Scheduling
   $ oc get vmi vm-example
   NAME         AGE   PHASE       IP   NODENAME
   vm-example   8s    Scheduled        uit02-qm9jh-worker-0-wzgtv
   $ oc get vmi vm-example
   NAME         AGE   PHASE     IP            NODENAME
   vm-example   15s   Running   10.129.3.37   uit02-qm9jh-worker-0-wzgtv
4. $ virtctl restart vm-example
   VM vm-example was scheduled to restart
5. $ virtctl restart vm-example
   Error restarting VirtualMachine Internal error occurred: unable to complete request: stop/start already underway

Actual results:
The VM is never restarted at step 4, even though it reports that the VM was scheduled to restart. Step 5 reports an error.

Expected results:
The VM can be restarted multiple times.

Additional info:
In step 4, KubeVirt reported that the VMI was scheduled to restart. Did you wait for it? Did it successfully re-start? As written, step 5 is expected behavior. We designed it this way because it would be even more confusing (and not very useful) to queue up multiple re-starts in succession.
(In reply to sgott from comment #1)
> In step 4, KubeVirt reported that the VMI was scheduled to restart. Did you
> wait for it? Did it successfully re-start?
>
> As written, step 5 is expected behavior. We designed it this way because it
> would be even more confusing (and not very useful) to queue up multiple
> re-starts in succession.

======
Actual results:
The VM is never restarted at step 4 (it keeps running), even though it reports that the VM was scheduled to restart.
Thanks for clarifying. I've altered the title of this BZ to reflect the problem more directly: the VM cannot be restarted at all. David, is restart subject to the same rules as shutdown? i.e. does it use ACPI events and await a graceful stop? Kill the VMI after a grace period timeout?
Trying to reproduce this bug.
Was trying to reproduce the bug, but was unable to with CNV-v4.8.0.

[kbidarka@localhost nfs]$ oc get vm
NAME            AGE     VOLUME
vm-nfs-rhel83   7m55s
[kbidarka@localhost nfs]$ oc get vmi
No resources found in default namespace.
[kbidarka@localhost nfs]$ virtctl start vm-nfs-rhel83
VM vm-nfs-rhel83 was scheduled to start
[kbidarka@localhost nfs]$ oc get vmi
NAME            AGE   PHASE        IP   NODENAME
vm-nfs-rhel83   3s    Scheduling
[kbidarka@localhost nfs]$ oc get vmi
NAME            AGE   PHASE     IP             NODENAME
vm-nfs-rhel83   13s   Running   xx.yy.zz.142   cnv-qe.redhat.com
[kbidarka@localhost nfs]$ oc get vmi
NAME            AGE   PHASE     IP             NODENAME
vm-nfs-rhel83   71s   Running   xx.yy.zz.142   cnv-qe.redhat.com
[kbidarka@localhost nfs]$ #virtctl restart vm-nfs-rhel83
[kbidarka@localhost nfs]$ virtctl console vm-nfs-rhel83
Successfully connected to vm-nfs-rhel83 console. The escape sequence is ^]

Red Hat Enterprise Linux 8.3 (Ootpa)
Kernel 4.18.0-240.12.1.el8_3.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm-nfs-rhel83 login: cloud-user
Password:
[cloud-user@vm-nfs-rhel83 ~]$ sudo su -
[root@vm-nfs-rhel83 ~]#

[kbidarka@localhost nfs]$ virtctl restart vm-nfs-rhel83
VM vm-nfs-rhel83 was scheduled to restart
[kbidarka@localhost nfs]$ oc get vmi
NAME            AGE     PHASE       IP             NODENAME
vm-nfs-rhel83   2m54s   Succeeded   xx.yy.zz.142   cnv-qe.redhat.com
[kbidarka@localhost nfs]$ oc get vmi -o wide
NAME            AGE   PHASE     IP   NODENAME   LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   0s    Pending
[kbidarka@localhost nfs]$ oc get vmi -o wide
NAME            AGE   PHASE     IP             NODENAME            LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   6s    Running   xx.yy.zz.143   cnv-qe.redhat.com   True
[kbidarka@localhost nfs]$ oc get vmi -o wide
NAME            AGE   PHASE     IP             NODENAME            LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   63s   Running   xx.yy.zz.143   cnv-qe.redhat.com   True
[kbidarka@localhost nfs]$ virtctl console vm-nfs-rhel83
Successfully connected to vm-nfs-rhel83 console. The escape sequence is ^]

Red Hat Enterprise Linux 8.3 (Ootpa)
Kernel 4.18.0-240.12.1.el8_3.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm-nfs-rhel83 login: cloud-user
Password:
Last login: Thu Mar 18 15:46:49 on ttyS0
[cloud-user@vm-nfs-rhel83 ~]$

[kbidarka@localhost nfs]$ virtctl restart vm-nfs-rhel83
VM vm-nfs-rhel83 was scheduled to restart
[kbidarka@localhost nfs]$ oc get vmi -o wide
NAME            AGE   PHASE       IP             NODENAME            LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   96s   Succeeded   xx.yy.zz.143   cnv-qe.redhat.com   True
[kbidarka@localhost nfs]$ oc get vmi -o wide
NAME            AGE   PHASE        IP   NODENAME   LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   0s    Scheduling
[kbidarka@localhost nfs]$ oc get vmi -o wide
NAME            AGE   PHASE       IP   NODENAME            LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   3s    Scheduled        cnv-qe.redhat.com   True
vm-nfs-rhel83   49s   Running     xx.yy.zz.144   cnv-qe.redhat.com   True
[kbidarka@localhost nfs]$ virtctl console vm-nfs-rhel83
Successfully connected to vm-nfs-rhel83 console. The escape sequence is ^]

Red Hat Enterprise Linux 8.3 (Ootpa)
Kernel 4.18.0-240.12.1.el8_3.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm-nfs-rhel83 login: cloud-user
Password:
Last login: Thu Mar 18 15:49:22 on ttyS0
[cloud-user@vm-nfs-rhel83 ~]$

[kbidarka@localhost nfs]$ virtctl restart vm-nfs-rhel83
VM vm-nfs-rhel83 was scheduled to restart
[kbidarka@localhost nfs]$ oc get vmi -o wide
NAME            AGE   PHASE       IP             NODENAME            LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   77s   Succeeded   xx.yy.zz.144   cnv-qe.redhat.com   True
[kbidarka@localhost nfs]$ oc get vmi -o wide
NAME            AGE   PHASE   IP   NODENAME   LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   0s
[kbidarka@localhost nfs]$ oc get vmi -o wide
NAME            AGE   PHASE        IP   NODENAME   LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   2s    Scheduling
[kbidarka@localhost nfs]$ oc get vmi -o wide
NAME            AGE   PHASE       IP   NODENAME            LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   5s    Scheduled        cnv-qe.redhat.com   True
[kbidarka@localhost nfs]$ oc get vmi -o wide
NAME            AGE   PHASE     IP             NODENAME            LIVE-MIGRATABLE   PAUSED
vm-nfs-rhel83   16s   Running   xx.yy.zz.146   cnv-qe.redhat.com   True

--------------------------------------------------------
"upstream-version": "0.39.0-rc.0-55-ge95b9bc",
"url": "cnv/virt-operator/images/v4.8.0-15",
Could you try to restart the VM just after it becomes Running? Is this a valid scenario?
Jed, can you take a look at this?
Sure thing!
@gouyang could you please include the yaml for that VM? After a few attempts with a simple VM, I wasn't able to reproduce this issue.
Created attachment 1765437 [details]
vm-example

Provided the default VM YAML from the console: create wizard -> 'With YAML'. I also didn't see this problem on a simple VM; it seems to happen only with VMs that use common templates.
For some reason, that VM takes about 4 minutes to boot, probably something to investigate!
I started the VM and after a few seconds requested a restart, and I had to wait the whole 4 minutes for the VM to finish booting and finally do its graceful restart.

@gouyang could you please make sure the VM is not just taking a (very) long time to restart? If not, could you please include a capture of the VNC console after ~5 minutes?

@sgott I'd love an answer to your question "is restart subject to the same rules as shutdown? i.e. does it use ACPI events and await a graceful stop? Kill the VMI after a grace period timeout?". I'll see if I can find an answer in the code!
(In reply to Jed Lejosne from comment #12)
> For some reason, that VM takes about 4 minutes to boot, probably something
> to investigate!
> I started the VM and after a few seconds requested a restart, and I had to
> wait the whole 4 minutes for the VM to finish booting and finally do its
> graceful restart.

It takes about 3 minutes for the VM to restart because of the graceful termination period defined in the template.
ref: https://github.com/kubevirt/common-templates/blob/f30ca1cac08e600bc4102516f8c504b08543413d/templates/fedora.tpl.yaml#L132

> @gouyang could you please make sure the VM is not just taking a
> (very) long time to restart?

- If we wait long enough before restarting the VM (for example, by watching the VNC console until the login prompt appears), the problem does not occur.
- If we restart the VM immediately once it becomes 'Running', it hits the problem.

The simple VM does not have this problem because no `terminationGracePeriodSeconds` is defined in it.

> If not, could you please include a capture of the VNC console after ~5
> minutes?
>
> @sgott I'd love an answer to your question "is restart subject to
> the same rules as shutdown? i.e. does it use ACPI events and await a
> graceful stop? Kill the VMI after a grace period timeout?". I'll see if I
> can find an answer in the code!

It looks like the same rules as shutdown apply.
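For reference, the field in question lives in the VMI template inside the VM spec. A minimal sketch (the name and values below are illustrative, not copied from the attached vm-example YAML):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-example          # hypothetical name, for illustration only
spec:
  running: true
  template:
    spec:
      # KubeVirt waits up to this long for the guest to respond to the
      # ACPI shutdown event before killing the VMI. Common templates set
      # this to 180s, which is why the restart appears "stuck" for ~3 min.
      terminationGracePeriodSeconds: 180
      domain:
        devices: {}
```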
Based on Comment #13, we're closing this as notabug--since the VM does restart. Please re-open if you feel this is in error.
I can't agree that this is not a bug. It clearly shows that the VM cannot be restarted just after it becomes `Running`; why should the user have to wait before performing a restart? If we don't consider improving this, we at least need to document the behavior.
Guohua, it's not clear what we can usefully do for you here.

VMI restart is implemented as a shutdown followed by a start. This means we need to use ACPI events to signal a graceful shutdown (or that fails, TerminationGracePeriodSeconds is exceeded, and shutdown occurs anyway). This means the OS needs to be capable of responding to ACPI events--in particular, it needs to be running.

In your case, the guest you're using takes an unusually long time to boot up--before it can respond to ACPI events and shut back down. It's not clear why your guest takes so long to boot, but the problem appears to be your VMI. The trouble here is that generalized advice of "if you restart right away, things will take a long time to reboot" just isn't universally true. The fact that reboot takes four minutes is specific to this VMI.

So let's recap:
- This VMI does restart.
- Rebooting while a reboot is in progress isn't supposed to work.
- This VMI takes a long time to boot up.

What's the path forward in your view?
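As a quick sanity check on any VM showing this behavior, the configured grace period can be read straight from the VM spec (a sketch; `vm-example` is a placeholder name, and an empty result means the field is unset):

```
# How long KubeVirt will wait for an ACPI-initiated shutdown
# before killing the VMI during a restart:
oc get vm vm-example \
  -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'
```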
From my perspective, restarting/shutting down/deleting a 'Running' VM is completely normal; users don't expect to have to wait before the action is actually performed.

1. What do you think about having an extra status that indicates the OS is capable of responding to ACPI events, so operations like restart/delete/shutdown can be performed smoothly?
2. If we don't improve the core/backend, do we need to take some action on the UI, like preventing the user from performing such actions until the OS is capable of responding to ACPI events?
3. Do we need to document this behavior?
(In reply to sgott from comment #16)
> This VMI does restart.
> Rebooting while a reboot is in progress isn't supposed to work.
> This VMI takes a long time to boot up.
>
> What's the path forward in your view?

An improvement here could be: can the VM be restarted once the OS is capable of responding to ACPI events (~20s), rather than waiting out the full TerminationGracePeriodSeconds (3 minutes)?
Guohua, we're planning on allowing "--grace-period=0 --force" to be called for a VM that's already being shut down. That is to say a second API call with a shorter timeout will be honored right away. I think that might also address this issue in a reasonable way. For cases such as this where the user needs a VM to shut down faster, this would be a mechanism to make that happen.
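Once that lands, usage would presumably look something like the following (a sketch based on the flags described above; the exact flag spelling in the released virtctl may differ):

```
# First restart: begins a graceful shutdown, waiting up to
# terminationGracePeriodSeconds for the guest to react to ACPI.
virtctl restart vm-example

# Second call while the shutdown is underway: override it with a
# zero grace period and force the VMI to terminate immediately.
virtctl stop vm-example --grace-period 0 --force
```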
It's good that the backend will support a second API call with a shorter timeout. The CLI (virtctl) and UI should have a way to pass the timeout parameter; I'm glad to test it again once it's ready.
Re-assigning this to Prita as this appears to be fixed by https://github.com/kubevirt/kubevirt/pull/7494
Backport PR is still open. Moving this back to POST.
Verified the bug on the latest CNV 4.11; the VM can be restarted normally. On every restart it stops the VM first, and the VM becomes running again after some time.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:6526