Bug 1610917 - External VMs automatically deleted when powered down
Summary: External VMs automatically deleted when powered down
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 4.2.3.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.3.1
Target Release: ---
Assignee: Francesco Romani
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-08-01 15:49 UTC by Andy G
Modified: 2019-03-19 10:07 UTC
CC List: 6 users

Fixed In Version: v4.30.9
Doc Type: If docs needed, set a value
Doc Text:
Vdsm takes ownership of all the VMs running on a given host, including any VM defined outside the system, for example by the user with command-line tools. Vdsm also used to undefine such VMs when they were shut down, i.e. remove their configuration from the libvirt instance running on the host. Undefining on shutdown is now done only for VMs created within the system: Vdsm still manages (stops, migrates, ...) externally defined VMs, but leaves them defined on shutdown.
Clone Of:
Environment:
Last Closed: 2019-03-19 10:07:09 UTC
oVirt Team: Virt
Embargoed:
rule-engine: ovirt-4.3+




Links
oVirt gerrit 95550 (master, MERGED): virt: don't undefine external VMs (last updated 2020-07-20 11:47:57 UTC)
oVirt gerrit 95568 (ovirt-4.2, ABANDONED): virt: don't undefine external VMs (last updated 2020-07-20 11:47:57 UTC)

Description Andy G 2018-08-01 15:49:27 UTC
I have an oVirt host running on top of CentOS 7.5.  I cannot be 100% sure which version of oVirt is running on the host, but the engine VM reports 4.2.3.5-1.el7.centos for itself and it is likely that the host is on the same version.  oVirt cockpit reports the following for the host:

OS Version:        RHEL - 7 - 5.1804.el7.centos
OS Description:    CentOS Linux 7 (Core)
Kernel Version:    3.10.0 - 862.2.3.el7.x86_64
KVM Version:       2.10.0 - 21.el7_5.2.1
LIBVIRT Version:   libvirt-3.9.0-14.el7_5.2
VDSM Version:      vdsm-4.20.27.1-1.el7.centos
SPICE Version:     0.14.0 - 2.el7
GlusterFS Version: [N/A]
CEPH Version:      librbd1-0.94.5-2.el7
Kernel Features:   PTI: 1, IBRS: 0, RETP: 1

The problem I have is that I need to be able to create a VM directly on the host machine, not managed by oVirt at all.  On the previous version of oVirt (i.e. 4.1) this was possible, but now that I've upgraded to 4.2, the newly created external VM is deleted from the host entirely as soon as it shuts down.

I am using "virsh define MyVm.xml" as the method to create the external VM on the host.

I have also tested this with a new server built from scratch, running first CentOS alone without oVirt installed, then with oVirt 4.1 installed, and then with 4.1 upgraded to 4.2.

In the first two cases, using the above command creates the VM, which can be run and shut down without being deleted.  However, as soon as oVirt 4.2 is installed on the server, doing a "Shutdown" or "Force Off" from virt-manager, or even a "sudo shutdown" from inside the VM itself, causes the VM definition to be deleted as soon as the VM has shut down.

What is interesting is that an external VM created before oVirt 4.2 is installed, e.g. while oVirt 4.1 is installed, is safe from deletion: even after oVirt 4.2 is installed, starting and stopping that VM does not delete it.  Only VMs created after oVirt 4.2 is installed are deleted in this fashion.
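
For reference, the same define / start / shut down flow can also be driven through libvirt-python instead of virsh; a minimal sketch, reusing the MyVm.xml path mentioned above:

  import libvirt

  # Connect to the host's system libvirt instance (the same one Vdsm talks to).
  conn = libvirt.open('qemu:///system')

  # Define a persistent domain from an XML file, equivalent to "virsh define MyVm.xml".
  with open('MyVm.xml') as f:
      dom = conn.defineXML(f.read())

  dom.create()    # start the VM, equivalent to "virsh start"
  # ... later ...
  dom.shutdown()  # graceful shutdown; the domain definition should survive this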

Comment 1 Andy G 2018-08-01 15:59:05 UTC
I note that the XML definition file has the line

  <on_poweroff>destroy</on_poweroff>

But I also note that all the other VMs have the same line, so this seems not to be the factor.  Beyond that, there is no observable difference between the XML of a fully functional VM and one that is deleted as soon as it is shut down, so I can only think that it is due to some configuration option inside oVirt that I am not aware of.  It is not possible, however, to set the "delete protection" option on the VM from inside oVirt, since it complains that it cannot change the settings of an external VM.  Again, none of the external VMs have this flag set, so this cannot be the factor either.

Comment 2 Andy G 2018-11-06 08:43:24 UTC
Just updated to oVirt 4.2.7 and now all external VMs are automatically deleted when they close down.  On the plus side, at least the behaviour is consistent; on the down side, it's insane!!!

Anyway, it seems there is now a workaround:

engine-config -s MaintenanceVdsIgnoreExternalVms=true

This configuration prevents the deletion of external VMs.

For me, this is solution enough.  And for anyone else who follows in my footsteps, I hope the solution above also works for you :o)

Comment 3 Andy G 2018-11-06 09:14:47 UTC
Sorry, re-opening bug.

I spoke too soon.  The "fix" above is not a fix.

The external VMs still delete themselves after they are shut down ... if given a few more seconds to do so.  In my testing, I was too quick.

And now all external VMs are doing this, so it is really bad :o(

Comment 4 Andy G 2018-11-08 07:59:03 UTC
Ok, I have some more data (with help from the libvirt mailing list), and I can confirm that it is definitely oVirt that is deleting the external VMs once they close down.

From the vdsm.log file I can see the "acknowledgement" that the VM has been shut down:

2018-11-08 08:21:30,072+0100 INFO  (libvirt/events) [virt.vm] (vmId='0bab4766-4765-40c1-abaf-a1c1774ac0ff') underlying process disconnected (vm:1062)
2018-11-08 08:21:30,072+0100 INFO  (libvirt/events) [virt.vm] (vmId='0bab4766-4765-40c1-abaf-a1c1774ac0ff') Release VM resources (vm:5283)
2018-11-08 08:21:30,072+0100 INFO  (libvirt/events) [virt.vm] (vmId='0bab4766-4765-40c1-abaf-a1c1774ac0ff') Stopping connection (guestagent:442)
2018-11-08 08:21:30,072+0100 INFO  (libvirt/events) [virt.vm] (vmId='0bab4766-4765-40c1-abaf-a1c1774ac0ff') Stopping connection (guestagent:442)
2018-11-08 08:21:30,072+0100 WARN  (libvirt/events) [root] File: /var/lib/libvirt/qemu/channels/0bab4766-4765-40c1-abaf-a1c1774ac0ff.com.redhat.rhevm.vdsm already removed (fileutils:51)
2018-11-08 08:21:30,073+0100 WARN  (libvirt/events) [root] File: /var/lib/libvirt/qemu/channel/target/domain-12-Norah/org.qemu.guest_agent.0 already removed (fileutils:51)
2018-11-08 08:21:30,073+0100 INFO  (libvirt/events) [vdsm.api] START inappropriateDevices(thiefId='0bab4766-4765-40c1-abaf-a1c1774ac0ff') from=internal, task_id=5ab358c8-afd3-4b5c-8df3-933f622e20e6 (api:46)
2018-11-08 08:21:30,075+0100 INFO  (libvirt/events) [vdsm.api] FINISH inappropriateDevices return=None from=internal, task_id=5ab358c8-afd3-4b5c-8df3-933f622e20e6 (api:52)
2018-11-08 08:21:30,076+0100 INFO  (libvirt/events) [virt.vm] (vmId='0bab4766-4765-40c1-abaf-a1c1774ac0ff') Changed state to Down: User shut down from within the guest (code=7) (vm:1693)
2018-11-08 08:21:30,076+0100 INFO  (libvirt/events) [virt.vm] (vmId='0bab4766-4765-40c1-abaf-a1c1774ac0ff') Stopping connection (guestagent:442)

But then 15 seconds later:

2018-11-08 08:21:44,728+0100 INFO  (jsonrpc/5) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:172.16.1.102,56482, vmId=0bab4766-4765-40c1-abaf-a1c1774ac0ff (api:46)
2018-11-08 08:21:44,731+0100 INFO  (jsonrpc/5) [api.virt] FINISH destroy return={'status': {'message': 'Machine destroyed', 'code': 0}} from=::ffff:172.16.1.102,56482, vmId=0bab4766-4765-40c1-abaf-a1c1774ac0ff (api:52)

And the corresponding debug from libvirt:

2018-11-08 08:21:44.729+0000: 49630: debug : virDomainUndefine:6242 : dom=0x7f9d64003420, (VM: name=Norah, uuid=0bab4766-4765-40c1-abaf-a1c1774ac0ff)
2018-11-08 08:21:44.729+0000: 49630: debug : qemuDomainObjBeginJobInternal:4625 : Starting job: modify (vm=0x7f9d8429be10 name=Norah, current job=none async=none)
2018-11-08 08:21:44.729+0000: 49630: debug : qemuDomainObjBeginJobInternal:4666 : Started job: modify (async=none vm=0x7f9d8429be10 name=Norah)
2018-11-08 08:21:44.730+0000: 49630: info : qemuDomainUndefineFlags:7560 : Undefining domain 'Norah'

Please, please, please can there be a way to prevent this behaviour?

Comment 5 Francesco Romani 2018-11-20 11:42:33 UTC
(In reply to Andy G from comment #4)

> Please, please, please can there be a way to prevent this behaviour?

Yes.
There could be a simple way that does not interfere with the expected oVirt behaviour. Please be aware that this is a very, very specific corner case.
Are you willing to test some Vdsm patches?

The flow will be:
- add some metadata (https://libvirt.org/formatdomain.html#elementsMetadata) to your external VMs that you want oVirt to leave alone
- Vdsm will skip these VMs, like it does today for guestfs or for external VMs in the down state (https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/recovery.py#L41); a rough sketch of such a check follows below
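
To illustrate the idea only (this is a sketch, not the actual Vdsm code; the namespace URI and tag name are placeholders for whatever we settle on), such a check could look roughly like this in libvirt-python:

  import xml.etree.ElementTree as ET
  import libvirt

  OVIRT_VM_NS = 'http://ovirt.org/vm/1.0'

  def should_ignore(dom):
      # True if the domain carries an 'ignore me' tag in its <metadata> section.
      root = ET.fromstring(dom.XMLDesc(0))
      metadata = root.find('metadata')
      if metadata is None:
          return False
      return metadata.find('{%s}ignore' % OVIRT_VM_NS) is not None

  conn = libvirt.open('qemu:///system')
  for dom in conn.listAllDomains():
      if should_ignore(dom):
          print('External VM %s has "ignore me" metadata, skipped' % dom.UUIDString())
          continue
      # ... normal recovery/adoption logic would go here ...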

Comment 6 Francesco Romani 2018-11-20 12:17:47 UTC
One of the basic assumptions of Vdsm is that it owns the libvirt instance on the host. Every VM running on a host on which Vdsm is installed is meant to be either managed or adopted (i.e. external VMs are detected and adopted) by Vdsm.

So this BZ falls into an "unsupported" area, even though I reckon the behaviour looks annoying in this (again, VERY specific and uncommon) use case.

However, tools we use and rely on to implement important flows, like guestfs, may need to run VMs that Vdsm should leave alone.

I'd like to take this chance to introduce a universal and supported way to signal such VMs to Vdsm, which could also solve this issue as a side effect.

Comment 7 Andy G 2018-11-20 14:40:49 UTC
(In reply to Francesco Romani from comment #5)
> Yes.
> There could be a simple way that does not interfere with the expected oVirt
> behaviour. Please be aware that this is a very, very specific corner case.
> Are you willing to test some Vdsm patches?

I would be very happy to test patches.  I appreciate that I'm pushing oVirt at the edges of its normal use.  Thanks for your help :o)

Comment 8 Francesco Romani 2018-11-20 16:20:01 UTC
(In reply to Andy G from comment #7)
> (In reply to Francesco Romani from comment #5)
> > Yes.
> > There could be a simple way that does not interfere with the expected oVirt
> > behaviour. Please be aware that this is a very, very specific corner case.
> > Are you willing to test some Vdsm patches?
> 
> I would be very happy to test patches.  I appreciate that I'm pushing oVirt
> at the edges of its normal use.  Thanks for your help :o)

Great!

The patch itself: https://gerrit.ovirt.org/#/c/95550/

The steps for testing:

First, amend the domain XML of the VM you want oVirt to skip. You need to add this tag in the metadata section:

  <ovirt-vm:ignore/>

An example of how it could look:

https://gerrit.ovirt.org/#/c/95550/2/tests/virt/data/domain_ignore.xml

This tag is ignored by the current oVirt code, but it is used by the patched code.
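
If you prefer not to hand-edit the XML, the same element can probably be attached via libvirt's metadata API; a minimal sketch, assuming the libvirt-python bindings and a placeholder domain name:

  import libvirt

  conn = libvirt.open('qemu:///system')
  dom = conn.lookupByName('MyVm')

  # Store an <ignore/> element under the ovirt-vm namespace in the persistent
  # configuration; equivalent to adding the tag shown above to the <metadata> section.
  dom.setMetadata(libvirt.VIR_DOMAIN_METADATA_ELEMENT,
                  '<ignore/>',
                  'ovirt-vm',
                  'http://ovirt.org/vm/1.0',
                  libvirt.VIR_DOMAIN_AFFECT_CONFIG)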

Here you can find RPMs for CentOS: https://jenkins.ovirt.org/job/vdsm_4.2_build-artifacts-on-demand-el7-x86_64/61/artifact/
These RPMs contain Vdsm 4.2.7 rc3 plus the patch that should fix this issue.

Upgrade Vdsm and restart it; that should be enough.

Once Vdsm is upgraded, it should leave alone VMs with that metadata element in their domain XML. If so, you should see something like this in the logs:

  External VM $VMID has 'ignore me' metadata, skipped

meaning that Vdsm detected that VM, and decided to skip it.

Let me know here and/or on the oVirt mailing lists if you need any help, and whether the patch fixes your issue.

PS: please be aware that even if this patch fixes your issue, because we are dealing with a *very* specific corner case, we will still need to discuss the merge.

Comment 9 Andy G 2018-11-20 18:03:49 UTC
Wonderful! Success!

After a few false starts I found that the metadata section had to be defined as follows:

  <metadata xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
    <ovirt-vm:ignore/>
  </metadata>


And it works :o)

Thanks!

Comment 10 Francesco Romani 2018-11-21 08:03:56 UTC
(In reply to Andy G from comment #9)
> Wonderful! Success!
> 
> After a few false-starts I found that the metadata section had to be defined
> as follows:
> 
>   <metadata xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
>     <ovirt-vm:ignore/>
>   </metadata>

You may place the "xmlns" part in a few places, and all of them are legal. The above snippet does indeed look correct.

> And it works :o)

Great. Let's wait a few hours/days before claiming success, to make sure the periodic check that Vdsm does is working too :)

> Thanks!

No problem! now we (Vdsm developers) will discuss the merge process. Stay tuned.

Comment 11 Andy G 2018-11-22 08:00:35 UTC
(In reply to Francesco Romani from comment #10)
> (In reply to Andy G from comment #9)
> > And it works :o)
> 
> Great. Let's wait a few hours/days before claiming success, to make sure
> the periodic check that Vdsm does is working too :)

I think it is all good.  I have left one VM powered off for over 24 hours, and I have had others which have come up and gone down and not been deleted.  So it is looking good.

> > Thanks!
> 
> No problem! now we (Vdsm developers) will discuss the merge process. Stay
> tuned.

Thank you very much.

Comment 12 Francesco Romani 2018-11-22 14:39:38 UTC
(In reply to Andy G from comment #11)
> (In reply to Francesco Romani from comment #10)
> > (In reply to Andy G from comment #9)
> > > And it works :o)
> > 
> > Great. Let's wait a few hours/days before claiming success, to make sure
> > the periodic check that Vdsm does is working too :)
> 
> I think it is all good.  I have left one VM powered off for over 24 hours,
> and I have had others which have come up and gone down and not been deleted.
> So it is looking good.
> 
> > > Thanks!
> > 
> > No problem! now we (Vdsm developers) will discuss the merge process. Stay
> > tuned.
> 
> Thank you very much.

Hi! We (vdsm developers) had a chat about how to fix this issue.
We agreed on a different approach: instead of making the external VMs invisible to Vdsm,
we chose to keep them visible, but avoid undefining them when they are destroyed (see the sketch after the list below).

Advantages of the new approach:
1. oVirt Engine is (and must be kept) aware of the external VMs
2. minimal management is still possible from the oVirt Engine UI
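
As an illustration only (not the actual patch), the destroy path conceptually becomes something like the following, where is_external is a hypothetical flag recording whether the VM was created by the engine or merely adopted:

  import libvirt

  def destroy_vm(conn, vm_uuid, is_external):
      # Stop a VM; remove its libvirt definition only if it was created by the system.
      dom = conn.lookupByUUIDString(vm_uuid)
      try:
          dom.destroy()        # forcefully stop the running domain
      except libvirt.libvirtError:
          pass                 # the domain is already down
      if is_external:
          # Externally defined VM: leave its definition in place on the host.
          return
      dom.undefineFlags(0)     # managed VM: drop its definition, as before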

So, here's a new batch of RPMs patched on top of oVirt 4.2.7rc3: http://jenkins.ovirt.org/job/vdsm_4.2_build-artifacts-on-demand-el7-x86_64/62/

To test them:
- upgrading from pristine Vdsm
1. just upgrade the RPM packages and play with it :)

- upgrading from the previous patch
1. install the new RPMs. This may require some forcing, sorry about that
2. the metadata tag is no longer needed, but it will be happily ignored

Sorry for the inconvenience!

Comment 13 Ryan Barry 2019-01-21 14:53:39 UTC
Re-targeting to 4.3.1 since it is missing a patch, an acked blocker flag, or both

Comment 14 Ryan Barry 2019-01-24 00:17:19 UTC
Francesco, can we push these patches forward?

Comment 15 Andy G 2019-02-05 19:32:04 UTC
From the testing I've done here, the patches seem to be working well, certainly for me.

Comment 16 Francesco Romani 2019-02-06 07:22:17 UTC
(In reply to Andy G from comment #15)
> From the testing I've done here, the patches seem to be working well,
> certainly for me.

Thanks Andy. Did you try again with the updated patches described in https://bugzilla.redhat.com/show_bug.cgi?id=1610917#c12 ?

Comment 17 Andy G 2019-02-08 10:29:45 UTC
Yes, I did.  Thank you :o)

Comment 18 Francesco Romani 2019-02-20 11:15:02 UTC
relevant patch merged

Comment 19 meital avital 2019-03-14 09:52:37 UTC
Verification version:
ovirt-engine-4.3.2.1-0.0.master.20190310172919.git0b3fbad.el7
vdsm-4.40.0-59.git5533158.el7.x86_64
qemu-kvm-ev-2.12.0-18.el7_6.3.1.x86_64
libvirt-client-4.5.0-10.el7_6.4.x86_64

Verification scenario:
1) Create an external VM on the host using the virt-install command.
2) After the installation is completed, power off the VM.
3) Observe vdsm.log and verify that "Will not undefine external VM" is logged for this VM ID, for example:
2019-03-14 11:23:48,971+0200 INFO  (jsonrpc/3) [virt.vm] (vmId='4d5387ed-0de8-4dd9-8898-a8ef5ae7021f') Will not undefine external VM 4d5387ed-0de8-4dd9-8898-a8ef5ae7021f (vm:2390)
4) Keep the VM powered off for a few hours and verify it was not deleted (a scripted check is sketched below).
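
For step 4, a quick scripted check (assuming libvirt-python on the host and a placeholder VM name) could be:

  import libvirt

  conn = libvirt.open('qemu:///system')
  # listDefinedDomains() returns the names of domains that are defined but not running;
  # the external VM should still show up here after it has been powered off.
  assert 'external-vm' in conn.listDefinedDomains(), 'external VM definition was removed'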

Comment 20 Sandro Bonazzola 2019-03-19 10:07:09 UTC
This bug is included in the oVirt 4.3.1 release, published on February 28th 2019.

Since the problem described in this bug report should be resolved in the oVirt 4.3.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

