Bug 2037216

Summary: VM imported from configuration is stuck in WaitForLaunch once removed quickly after power-off
Product: [oVirt] ovirt-engine
Reporter: Yury.Panchenko
Component: BLL.Virt
Assignee: Arik <ahadas>
Status: CLOSED CURRENTRELEASE
QA Contact: Qin Yuan <qiyuan>
Severity: high
Priority: unspecified
Version: 4.4.9.5
CC: ahadas, bugs, dfodor, michal.skrivanek, nsoffer, Yury.Panchenko
Target Milestone: ovirt-4.4.10-1
Flags: pm-rhel: ovirt-4.4+, michal.skrivanek: exception+
Target Release: 4.4.10.5
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ovirt-engine-4.4.10.5
Last Closed: 2022-02-04 10:33:37 UTC
Type: Bug
oVirt Team: Virt
Attachments: Engine and vdsm logs

Description Yury.Panchenko 2022-01-05 09:28:39 UTC
Created attachment 1849004 [details]
Engine and vdsm logs

Description of problem:
The user performs a restore to original through the Veeam RHV backup application.
Restore to original means the application deletes the current VM from oVirt and then restores the same VM with the same VM id.
If the original VM has a snapshot, the restored VM won't start.

Version-Release number of selected component (if applicable):
oVirt 4.4.9
RHV 4.4.9

How reproducible:
Always

Steps to Reproduce:
1. Create a snapshot on the original VM.
2. Delete the original VM through the API (2022-01-04 19:46:09 EVENT_ID: USER_REMOVE_VM_FINISHED(113)).
3. Import the XML configuration from the VM backup (2022-01-04 19:46:19 Running command: ImportVmFromConfigurationCommand).
4. Delete all NICs and disks from the imported VM.
5. Attach new disks and upload the content through image transfer.
6. Finish the VM restore.
7. Start the restored VM (a minimal SDK sketch of steps 2, 3 and 7 follows).
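
Steps 2, 3 and 7 map to the following minimal sketch with the oVirt Python SDK (ovirtsdk4). The connection details, cluster name and ovf_data are assumed placeholders, not values taken from this bug, and the NIC/disk replacement of steps 4-6 is omitted:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Assumed placeholders; replace with real values:
vm_id = '53f312ba-113c-413b-ac06-7212aceee65a'  # id of the original VM (from the logs below)
ovf_data = '...'                                # OVF configuration saved by the backup

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='...',
    ca_file='ca.pem',
)
vms_service = connection.system_service().vms_service()
vm_service = vms_service.vm_service(vm_id)

vm_service.stop()    # power off the original VM
vm_service.remove()  # remove it right after power-off, with no extra wait

# Re-import the same VM (same id) from the backed-up OVF configuration;
# on the engine side this runs ImportVmFromConfigurationCommand.
imported = vms_service.add(
    types.Vm(
        cluster=types.Cluster(name='mycluster'),
        initialization=types.Initialization(
            configuration=types.Configuration(
                type=types.ConfigurationType.OVF,
                data=ovf_data,
            ),
        ),
    ),
)

# Steps 4-6 (replacing NICs/disks, image transfer) are omitted here.
vms_service.vm_service(imported.id).start()  # gets stuck in WaitForLaunch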

Actual results:
The VM can't start.

Expected results:
The VM starts successfully.

Additional info:
2022-01-04 20:42:08,833+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-42) [] VM '53f312ba-113c-413b-ac06-7212aceee65a'(null) moved from 'WaitForLaunch' --> 'Up'
2022-01-04 20:42:08,833+01 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-42) [] Failed during monitoring vm: 53f312ba-113c-413b-ac06-7212aceee65a , error is: {}: java.lang.NullPointerException

2022-01-04 20:42:08,833+01 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-42) [] Exception:: java.lang.NullPointerException

Comment 1 Yury.Panchenko 2022-01-10 17:35:42 UTC
Hello there.
It's a really problematic issue, because the customer can't easily work around it.
Could you try to fix it in the next release?
Thanks.

Comment 2 Arik 2022-01-10 19:09:27 UTC
(In reply to Yury.Panchenko from comment #1)
> Hello there.
> It's a really problematic issue, because the customer can't easily work around it.
> Could you try to fix it in the next release?
> Thanks.

Hi Yury,
Can you please help us understand why this is a problematic issue from your side?
To me it seems like a rare race between handling a powered-off VM and removing it, not a common scenario (we didn't get reports on it, and this code was introduced a couple of years ago), and it can be resolved by restarting ovirt-engine.
Did customers report it, or is it something that was identified in your automation?

Comment 3 Yury.Panchenko 2022-01-11 12:06:40 UTC
Hello Arik,
> Can you please help us understand why this is a problematic issue from your side?
> To me it seems like a rare race between handling a powered-off VM and removing it, not a common scenario (we didn't get reports on it, and this code was introduced a couple of years ago), and it can be resolved by restarting ovirt-engine.
The customer usually does a restore in urgent cases when the data is needed right away, so in this case they can't power on the VM and can't get access to the data.
Restarting the whole oVirt engine isn't a good idea, because it causes downtime for other oVirt users, and the customer or restore admin sometimes has no access to the oVirt VM.

> Did customers report it, or is it something that was identified in your automation?
We hit this in our own testing.

Comment 5 Arik 2022-01-11 14:43:04 UTC
OK, so it's more likely to be an issue for automation than an issue we'll face in the "real world".
Yet, as the fix is easy to backport and relatively safe, we'll backport it to 4.4.10.

Comment 6 Yury.Panchenko 2022-01-11 16:06:51 UTC
> Yet, as the fix is easy to backport and relatively safe, we'll backport it to 4.4.10.
That's good, thanks.

> So it's more likely to be an issue for automation than an issue we'll face in the "real world"
No, for RHV backup the customer only needs to click through a few steps in a VM restore wizard.

Comment 7 Arik 2022-01-11 16:54:53 UTC
(In reply to Yury.Panchenko from comment #6)
> > So it's more likely to be an issue for automation than an issue we'll face in the "real world"
> No, for RHV backup the customer only needs to click through a few steps in a
> VM restore wizard.

So would an existing VM be powered off and removed when following the restore wizard? If so, how do you determine that the VM has been powered off?

Comment 8 Yury.Panchenko 2022-01-11 17:38:00 UTC
> So would an existing VM be powered off and removed when following the restore wizard? If so, how do you determine that the VM has been powered off?
It's part of the restore process. The restore algorithm is:
1) Check that the original VM exists.
2) If the VM exists and is powered on, power it off.
3) If the VM exists, delete it.
4) Restore the VM from backup.

Comment 9 Arik 2022-01-11 19:44:43 UTC
I see, thanks
It can take up to 15s until the VM is set to Down (and only then can it be removed), but in practice it's almost immediate because we get an event (note that we don't always process the event, and in case we don't, the call to delete/remove the VM in (3) would fail).
It would be safer to loop between (2) and (3) until the VM is in the Down state, and to add some delay there to ensure we handle (i.e., clean up after) the powered-off VM (otherwise there may be some leftovers); a sketch follows below.
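
A minimal sketch of that safer sequence with the Python SDK, reusing connection and vm_id from the sketch in the description; the 120s timeout and 5s delay are illustrative assumptions, not values mandated by the engine:

import time
import ovirtsdk4.types as types

vm_service = connection.system_service().vms_service().vm_service(vm_id)
vm_service.stop()

# Between (2) and (3): poll until the engine reports the VM as Down,
# instead of removing it immediately after power-off.
deadline = time.monotonic() + 120  # illustrative upper bound
while time.monotonic() < deadline:
    if vm_service.get().status == types.VmStatus.DOWN:
        break
    time.sleep(2)
else:
    raise RuntimeError('VM did not reach Down in time')

# Small extra delay so the engine finishes cleaning up after the
# powered-off VM before it is removed (avoids leftovers).
time.sleep(5)
vm_service.remove()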

Comment 10 Arik 2022-01-11 19:46:25 UTC
(In reply to Arik from comment #9)
> I see, thanks
> It can take up to 15s until the VM is set to Down (and only then can it be
> removed)

Actually, in the worst case it might take even longer (if there are many VMs running on the host and we get many events that make us ignore not only the event for the transition to Down but also the polling cycles for the host).

Comment 11 Qin Yuan 2022-01-18 10:02:49 UTC
The issue can still be reproduced on ovirt-engine-4.4.10.4 with the following steps:

1. Create a VM from a template.
2. Copy the disk (make a backup).
3. Run the VM.
4. Take a snapshot of the VM.
5. Retrieve the configuration data of the VM from the data attribute under the snapshot:
GET /api/vms/{vm:id}/snapshots/{snapshot:id} HTTP/1.1
All-Content: true
Accept: application/xml
Content-type: application/xml

6. Delete the VM right after the stop operation is launched.
7. Import a new VM from the XML (a rough SDK equivalent of steps 5 and 7 is sketched after step 9):
POST /api/vms/ HTTP/1.1
Accept: application/xml
Content-type: application/xml

<vm>
    <cluster>
        <name>cluster_name</name>
    </cluster>
    <name>_NAME_</name>
    <initialization>
        <configuration>
            <data>
            <!-- the OVF data retrieved in step 5 -->
            </data>
            <type>ovf</type>
        </configuration>
    </initialization>
    ...
</vm>
8. Attach the disk copied in step 2 to the new VM.
9. Run the new VM.
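
As referenced in step 7, a rough Python-SDK equivalent of the two REST calls above (steps 5 and 7), reusing the connection object from the earlier sketches. Whether the snapshot lookup exposes the All-Content header as an all_content keyword is an assumption here; if not, the raw REST calls shown above remain authoritative:

import ovirtsdk4.types as types

# vm_id / snapshot_id: ids of the VM and snapshot from steps 1-4 (assumed defined).
system = connection.system_service()

# Step 5: fetch the snapshot with all content to obtain the OVF data.
snap_service = (system.vms_service()
                      .vm_service(vm_id)
                      .snapshots_service()
                      .snapshot_service(snapshot_id))
snapshot = snap_service.get(all_content=True)  # assumed mapping of 'All-Content: true'
ovf_data = snapshot.initialization.configuration.data

# Step 7: the POST /api/vms call, with the OVF wrapped in
# initialization/configuration, as in the XML body above.
new_vm = system.vms_service().add(
    types.Vm(
        name='_NAME_',
        cluster=types.Cluster(name='cluster_name'),
        initialization=types.Initialization(
            configuration=types.Configuration(
                type=types.ConfigurationType.OVF,
                data=ovf_data,
            ),
        ),
    ),
)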


engine.log:
2022-01-18 12:01:07,697+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [] VM '2ecab6f3-3b8f-4944-ac66-6f771e50f554'(null) moved from 'WaitForLaunch' --> 'Up'
2022-01-18 12:01:07,697+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [] Failed during monitoring vm: 2ecab6f3-3b8f-4944-ac66-6f771e50f554 , error is: {}: java.lang.NullPointerException

2022-01-18 12:01:07,697+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [] Exception:: java.lang.NullPointerException

Comment 13 Qin Yuan 2022-01-25 11:11:10 UTC
Verified with:
ovirt-engine-4.4.10.5-0.2.el8ev.noarch

Steps:
Run the steps in comment #11 30 times

Results:
The issue didn't occur in 30 tests.

engine log of one test:

2022-01-25 10:04:21,759+02 INFO  [org.ovirt.engine.core.bll.StopVmCommand] (default task-15) [995c53e0-368d-4039-932c-e984734e6b72] Running command: StopVmCommand internal: false. Entities affected :  ID: 4c68e027-f386-406a-8aea-4617740bfb3e Type: VMAction group STOP_VM with role type USER
2022-01-25 10:04:23,184+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-19) [2adbcb45] VM '4c68e027-f386-406a-8aea-4617740bfb3e'(test) moved from 'Up' --> 'Down'
2022-01-25 10:04:23,267+02 INFO  [org.ovirt.engine.core.bll.RemoveVmCommand] (default task-15) [5c9ca588-ff53-40de-96e2-60b150e07d1d] Lock Acquired to object 'EngineLock:{exclusiveLocks='[4c68e027-f386-406a-8aea-4617740bfb3e=VM]', sharedLocks=''}'
2022-01-25 10:04:24,894+02 INFO  [org.ovirt.engine.core.bll.RemoveVmCommand] (EE-ManagedThreadFactory-engine-Thread-5269) [5c9ca588-ff53-40de-96e2-60b150e07d1d] Ending command 'org.ovirt.engine.core.bll.RemoveVmCommand' successfully.
2022-01-25 10:05:04,654+02 INFO  [org.ovirt.engine.core.bll.exportimport.ImportVmFromConfigurationCommand] (default task-15) [6833d81b-326f-4f14-b314-733d205248c9] Lock Acquired to object 'EngineLock:{exclusiveLocks='[test=VM_NAME, 4c68e027-f386-406a-8aea-4617740bfb3e=VM]', sharedLocks='[4c68e027-f386-406a-8aea-4617740bfb3e=REMOTE_VM]'}'
2022-01-25 10:05:34,999+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-15) [cb419012-ec75-45f2-ab07-c3b553fe8820] EVENT_ID: USER_ATTACH_DISK_TO_VM(2,016), Disk latest-rhel-guest-image-8.5-infra was successfully attached to VM test by admin@internal-authz.
2022-01-25 10:06:05,132+02 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (default task-15) [fbdc497e-2772-4291-aa09-4825bd41df5f] Lock Acquired to object 'EngineLock:{exclusiveLocks='[4c68e027-f386-406a-8aea-4617740bfb3e=VM]', sharedLocks=''}'
2022-01-25 10:06:07,555+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-19) [2adbcb45] VM '4c68e027-f386-406a-8aea-4617740bfb3e'(test) moved from 'WaitForLaunch' --> 'PoweringUp'
2022-01-25 10:06:20,469+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-23) [2adbcb45] VM '4c68e027-f386-406a-8aea-4617740bfb3e'(test) moved from 'PoweringUp' --> 'Up'
2022-01-25 10:06:20,481+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-23) [2adbcb45] EVENT_ID: USER_RUN_VM(32), VM test started on Host host_mixed_2