Bug 2037216 - VM imported from configuration is stuck in WaitForLaunch once removed quickly after power-off
Summary: VM imported from configuration is stuck in WaitForLaunch once removed quickly...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.4.9.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.10-1
Target Release: 4.4.10.5
Assignee: Arik
QA Contact: Qin Yuan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-05 09:28 UTC by Yury.Panchenko
Modified: 2022-02-04 10:33 UTC (History)
6 users (show)

Fixed In Version: ovirt-engine-4.4.10.5
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-02-04 10:33:37 UTC
oVirt Team: Virt
pm-rhel: ovirt-4.4+
michal.skrivanek: exception+


Attachments (Terms of Use)
Engine and vdsm logs (1.18 MB, application/x-7z-compressed)
2022-01-05 09:28 UTC, Yury.Panchenko


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-44423 0 None None None 2022-01-05 09:29:55 UTC
oVirt gerrit 118220 0 master MERGED core: avoid creating vm-manager when vm is not in db 2022-01-10 14:25:05 UTC
oVirt gerrit 118231 0 ovirt-engine-4.4 MERGED core: avoid creating vm-manager when vm is not in db 2022-01-11 15:20:07 UTC
oVirt gerrit 118296 0 master MERGED core: avoid creating vm-manager when vm is not in db part 2 2022-01-19 18:12:19 UTC
oVirt gerrit 118308 0 ovirt-engine-4.4 MERGED core: avoid creating vm-manager when vm is not in db part 2 2022-01-19 18:51:16 UTC

Description Yury.Panchenko 2022-01-05 09:28:39 UTC
Created attachment 1849004 [details]
Engine and vdsm logs

Description of problem:
The user performs a "restore to original" through the Veeam backup application for RHV.
"Restore to original" means the application deletes the current VM from oVirt and then restores the same VM with the same VM id.
If the original VM has a snapshot, the restored VM won't start.

Version-Release number of selected component (if applicable):
oVirt 4.4.9
RHV 4.4.9

How reproducible:
Always

Steps to Reproduce:
1. Create a snapshot on the original VM
2. Delete the original VM through the API (2022-01-04 19:46:09 EVENT_ID: USER_REMOVE_VM_FINISHED(113))
3. Take the XML configuration from the VM backup and import it (2022-01-04 19:46:19 Running command: ImportVmFromConfigurationCommand)
4. Delete all NICs and disks from the imported VM
5. Attach new disks and upload the content through image transfer
6. Finish the VM restore
7. Start the restored VM

Actual results:
The VM can't start

Expected results:
The VM starts successfully

Additional info:
2022-01-04 20:42:08,833+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-42) [] VM '53f312ba-113c-413b-ac06-7212aceee65a'(null) moved from 'WaitForLaunch' --> 'Up'
2022-01-04 20:42:08,833+01 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-42) [] Failed during monitoring vm: 53f312ba-113c-413b-ac06-7212aceee65a , error is: {}: java.lang.NullPointerException

2022-01-04 20:42:08,833+01 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-42) [] Exception:: java.lang.NullPointerException
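The NullPointerException comes from the monitoring code processing an event for a VM id that no longer has a database row. The merged gerrit patches are titled "avoid creating vm-manager when vm is not in db"; the guard pattern they describe can be sketched as follows. This is an illustrative Python sketch of the pattern, not the actual engine code (which is Java), and the class and field names are hypothetical:

```python
# Hypothetical sketch of the guard pattern behind the fix: skip monitoring
# events for VM ids that no longer exist in the database, instead of
# creating a manager object backed by a missing row (which led to the NPE).

class VmsMonitoring:
    def __init__(self, db_vms):
        # db_vms: dict mapping vm_id -> vm record (stand-in for the DB)
        self.db_vms = db_vms
        self.skipped = []

    def analyze(self, vm_id, new_status):
        vm = self.db_vms.get(vm_id)
        if vm is None:
            # The VM was removed between the event being emitted and being
            # processed; drop the stale event rather than dereference a
            # missing record.
            self.skipped.append(vm_id)
            return None
        vm["status"] = new_status
        return vm
```

With this guard, a status event for a removed VM is ignored instead of crashing the monitoring cycle and leaving the re-imported VM stuck in WaitForLaunch.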

Comment 1 Yury.Panchenko 2022-01-10 17:35:42 UTC
Hello there.
This is a really problematic issue, because customers can't easily work around it.
Could you try to fix it in the next release?
Thanks.

Comment 2 Arik 2022-01-10 19:09:27 UTC
(In reply to Yury.Panchenko from comment #1)
> Hello there.
> This is a really problematic issue, because customers can't easily work around it.
> Could you try to fix it in the next release?
> Thanks.

Hi Yury,
Can you please help us understand why this is a problematic issue on your side?
To me it seems like a rare race between handling a powered-off VM and removing it, not a common scenario (we didn't get reports on this and that code was introduced a couple of years ago), and it can be resolved by restarting ovirt-engine.
Did customers report it, or is it something that was identified in your automation?

Comment 3 Yury.Panchenko 2022-01-11 12:06:40 UTC
Hello Arik,
> Can you please help us understand why this is a problematic issue on your side?
> To me it seems like a rare race between handling a powered-off VM and removing it, not a common scenario (we didn't get reports on this and that code was introduced a couple of years ago), and it can be resolved by restarting ovirt-engine.
Customers usually do a restore in urgent cases when they need the data right away, so in this case they can't power on the VM and can't get access to the data.
Restarting the whole oVirt engine isn't a good option, because it causes downtime for other oVirt users, and the customer or restore admin sometimes has no access to the oVirt engine VM.

> Did customers report it, or is it something that was identified in your automation?
We hit this in our testing.

Comment 5 Arik 2022-01-11 14:43:04 UTC
OK, so it's more likely to be an issue for automation than an issue we'll face in the "real world".
Yet, as the fix is easy to backport and relatively safe, we'll backport it to 4.4.10 nevertheless.

Comment 6 Yury.Panchenko 2022-01-11 16:06:51 UTC
> Yet, as the fix is easy to backport and relatively safe, we'll backport it to 4.4.10 nevertheless.
That's good, thanks.

> So it's more likely to be an issue for automation than an issue we'll face in the "real world"
No; for an RHV backup customer it's enough to click through a few steps in the VM restore wizard.

Comment 7 Arik 2022-01-11 16:54:53 UTC
(In reply to Yury.Panchenko from comment #6)
> > So it's more likely to be an issue for automation than an issue we'll face in the "real world"
> No; for an RHV backup customer it's enough to click through a few steps in the VM restore wizard.

So would an existing VM be powered off and removed as part of the restore wizard? If so, how do you determine that the VM has been powered off?

Comment 8 Yury.Panchenko 2022-01-11 17:38:00 UTC
> So would an existing VM be powered off and removed as part of the restore wizard? If so, how do you determine that the VM has been powered off?
It's part of the restore process.
The restore algorithm:
1) Check whether the original VM exists
2) If the VM exists and is powered on, power it off
3) If the VM exists, delete it
4) Restore the VM from backup
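The four steps above can be sketched as follows. This is an illustrative Python sketch written against a hypothetical `client` object, not Veeam's actual code; the real application talks to the oVirt REST API:

```python
# Illustrative sketch of the restore-to-original sequence (steps 1-4 above).
# The client object and its methods are hypothetical stand-ins for calls to
# the oVirt REST API.

def restore_to_original(client, vm_name, backup):
    vm = client.find_vm(vm_name)        # 1) check whether the original VM exists
    if vm is not None:
        if vm["status"] == "up":        # 2) if powered on, power it off
            client.stop_vm(vm["id"])
        client.remove_vm(vm["id"])      # 3) delete the original VM
    return client.import_vm(backup)     # 4) restore the VM from backup
```

Note that issuing the remove immediately after the stop, as written here, is exactly the timing that triggers this bug: the engine may still be processing the power-off when the VM row disappears from the database.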

Comment 9 Arik 2022-01-11 19:44:43 UTC
I see, thanks.
It can take up to 15s until the VM is set to Down (and only then can it be removed), but in practice it's almost immediate because we get an event (note that we don't always process the event, and in case we don't, the call to delete/remove the VM in (3) would fail).
It would be safer to loop between (2) and (3) until the VM is in the Down state, and to add some delay there to ensure we handle (i.e., clean up after) the powered-off VM (otherwise, there may be some leftovers).
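The loop suggested above could look like the following sketch. The poll interval, timeout, settle delay, and the `get_vm_status` callable are all illustrative placeholders, not part of any oVirt API:

```python
import time

# Sketch of the safer client-side sequence: poll until the VM reports Down
# before removing it, with a timeout and a small settle delay. get_vm_status
# is a hypothetical stand-in for a status query against the engine.

def wait_until_down(get_vm_status, timeout=60.0, interval=1.0, settle=2.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_vm_status() == "down":
            # Give the engine a moment to finish cleaning up after the
            # powered-off VM before the caller issues the remove.
            time.sleep(settle)
            return True
        time.sleep(interval)
    return False
```

The caller would invoke this between the power-off and the remove, and abort the restore (rather than remove a VM in an unknown state) if it returns False.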

Comment 10 Arik 2022-01-11 19:46:25 UTC
(In reply to Arik from comment #9)
> I see, thanks
> It can take up to 15s until the VM is set to Down (and only then can be
> removed)

Actually, in the worst case it might take even longer (if there are many VMs running on the host and we get many events, we may skip not only the event for the transition to Down but also polling cycles for the host).

Comment 11 Qin Yuan 2022-01-18 10:02:49 UTC
The issue can still be reproduced on ovirt-engine-4.4.10.4 with the following steps:

1. Create a VM from a template
2. Copy the disk (make a backup)
3. Run the VM
4. Take a snapshot of the VM
5. Retrieve the configuration data of the VM from the data attribute under the snapshot:
GET /api/vms/{vm:id}/snapshots/{snapshot:id} HTTP/1.1
All-Content: true
Accept: application/xml
Content-type: application/xml

6. Delete the VM right after the stop operation is launched
7. Import a new VM from the XML:
POST /api/vms/ HTTP/1.1
Accept: application/xml
Content-type: application/xml

<vm>
    <cluster>
        <name>cluster_name</name>
    </cluster>
    <name>_NAME_</name>
    <initialization>
        <configuration>
            <data>
            <!-- the ovf data is the one got in step 5 -->
            </data>
            <type>ovf</type>
        </configuration>
    </initialization>
    ...
</vm>
8. Attach the disk copied in step 2 to the new VM
9. Run the new VM


engine.log:
2022-01-18 12:01:07,697+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [] VM '2ecab6f3-3b8f-4944-ac66-6f771e50f554'(null) moved from 'WaitForLaunch' --> 'Up'
2022-01-18 12:01:07,697+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [] Failed during monitoring vm: 2ecab6f3-3b8f-4944-ac66-6f771e50f554 , error is: {}: java.lang.NullPointerException

2022-01-18 12:01:07,697+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [] Exception:: java.lang.NullPointerException

Comment 13 Qin Yuan 2022-01-25 11:11:10 UTC
Verified with:
ovirt-engine-4.4.10.5-0.2.el8ev.noarch

Steps:
Run the steps in comment #11 30 times

Results:
The issue didn't occur in 30 tests.

engine log of one test:

2022-01-25 10:04:21,759+02 INFO  [org.ovirt.engine.core.bll.StopVmCommand] (default task-15) [995c53e0-368d-4039-932c-e984734e6b72] Running command: StopVmCommand internal: false. Entities affected :  ID: 4c68e027-f386-406a-8aea-4617740bfb3e Type: VMAction group STOP_VM with role type USER
2022-01-25 10:04:23,184+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-19) [2adbcb45] VM '4c68e027-f386-406a-8aea-4617740bfb3e'(test) moved from 'Up' --> 'Down'
2022-01-25 10:04:23,267+02 INFO  [org.ovirt.engine.core.bll.RemoveVmCommand] (default task-15) [5c9ca588-ff53-40de-96e2-60b150e07d1d] Lock Acquired to object 'EngineLock:{exclusiveLocks='[4c68e027-f386-406a-8aea-4617740bfb3e=VM]', sharedLocks=''}'
2022-01-25 10:04:24,894+02 INFO  [org.ovirt.engine.core.bll.RemoveVmCommand] (EE-ManagedThreadFactory-engine-Thread-5269) [5c9ca588-ff53-40de-96e2-60b150e07d1d] Ending command 'org.ovirt.engine.core.bll.RemoveVmCommand' successfully.
2022-01-25 10:05:04,654+02 INFO  [org.ovirt.engine.core.bll.exportimport.ImportVmFromConfigurationCommand] (default task-15) [6833d81b-326f-4f14-b314-733d205248c9] Lock Acquired to object 'EngineLock:{exclusiveLocks='[test=VM_NAME, 4c68e027-f386-406a-8aea-4617740bfb3e=VM]', sharedLocks='[4c68e027-f386-406a-8aea-4617740bfb3e=REMOTE_VM]'}'
2022-01-25 10:05:34,999+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-15) [cb419012-ec75-45f2-ab07-c3b553fe8820] EVENT_ID: USER_ATTACH_DISK_TO_VM(2,016), Disk latest-rhel-guest-image-8.5-infra was successfully attached to VM test by admin@internal-authz.
2022-01-25 10:06:05,132+02 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (default task-15) [fbdc497e-2772-4291-aa09-4825bd41df5f] Lock Acquired to object 'EngineLock:{exclusiveLocks='[4c68e027-f386-406a-8aea-4617740bfb3e=VM]', sharedLocks=''}'
2022-01-25 10:06:07,555+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-19) [2adbcb45] VM '4c68e027-f386-406a-8aea-4617740bfb3e'(test) moved from 'WaitForLaunch' --> 'PoweringUp'
2022-01-25 10:06:20,469+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-23) [2adbcb45] VM '4c68e027-f386-406a-8aea-4617740bfb3e'(test) moved from 'PoweringUp' --> 'Up'
2022-01-25 10:06:20,481+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-23) [2adbcb45] EVENT_ID: USER_RUN_VM(32), VM test started on Host host_mixed_2

