Bug 1547479 - Make sure hosted engine disks have exclusive access flags even when libvirt XML is used via OVF
Status: CLOSED CURRENTRELEASE
Product: ovirt-engine
Classification: oVirt
Component: BLL.HostedEngine
Version: 4.2.2
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.2.2
Target Release: 4.2.2.6
Assignee: Andrej Krejcir
QA Contact: Nikolai Sednev
Depends On: 1564873
 
Reported: 2018-02-21 12:16 UTC by Martin Sivák
Modified: 2018-04-27 07:22 UTC
CC: 7 users

Fixed In Version: ovirt-engine-4.2.2.6
Doc Type: No Doc Update
Last Closed: 2018-04-27 07:22:56 UTC
oVirt Team: SLA
rule-engine: ovirt-4.2+
rule-engine: blocker+
msivak: devel_ack+


Attachments
Vdsm logs (623.42 KB, text/plain)
2018-02-26 18:47 UTC, Andrej Krejcir


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 88043 0 master MERGED core: Add lease devices for hosted engine disks to libvirt XML 2018-02-27 17:01:07 UTC
oVirt gerrit 88076 0 master ABANDONED vm: Add extra devices to the domain xml 2018-02-23 09:29:46 UTC
oVirt gerrit 88118 0 master ABANDONED vm: always prepare leases when starting 2018-02-26 13:29:16 UTC
oVirt gerrit 88226 0 master MERGED vm: leases: handle drive leases in fixLeases 2018-02-27 15:30:27 UTC
oVirt gerrit 88233 0 master MERGED vmdevices: refresh only disk devices 2018-02-27 15:49:38 UTC
oVirt gerrit 88264 0 ovirt-4.2 MERGED vm: leases: handle drive leases in fixLeases 2018-02-28 13:21:35 UTC
oVirt gerrit 88265 0 ovirt-4.2 MERGED vmdevices: refresh only disk devices 2018-02-28 13:21:39 UTC
oVirt gerrit 88281 0 ovirt-engine-4.2 MERGED core: Add lease devices for hosted engine disks to libvirt XML 2018-02-28 16:54:48 UTC
oVirt gerrit 88304 0 master MERGED virt: introduce the domxml_preprocess module 2018-03-02 10:52:03 UTC
oVirt gerrit 88305 0 master MERGED virt: domxml_preprocess: move fixers in the module 2018-03-02 10:52:06 UTC
oVirt gerrit 88306 0 master MERGED virt: domxml_preprocess: make devices if missing 2018-03-02 10:52:09 UTC
oVirt gerrit 88307 0 master MERGED tests: domxml_preprocess: kickstart tests 2018-03-02 10:52:13 UTC
oVirt gerrit 88308 0 master ABANDONED virt: domxml_preprocess: handle placeholders early 2018-03-27 07:12:01 UTC
oVirt gerrit 88410 0 master MERGED tests: domxml_preprocess: test passthrough 2018-03-02 19:29:25 UTC
oVirt gerrit 89438 0 v2.2.z MERGED xml: consume libvirt XML from OVF only if safe 2018-03-26 12:41:06 UTC
oVirt gerrit 89448 0 v2.2.z MERGED xml: consume libvirt XML from OVF only if safe 2018-03-26 13:02:03 UTC
oVirt gerrit 89498 0 ovirt-4.2 MERGED virt: introduce the domxml_preprocess module 2018-04-03 09:03:38 UTC
oVirt gerrit 89499 0 ovirt-4.2 MERGED virt: domxml_preprocess: move fixers in the module 2018-04-03 09:03:44 UTC
oVirt gerrit 89500 0 ovirt-4.2 MERGED virt: domxml_preprocess: make devices if missing 2018-04-03 09:03:53 UTC
oVirt gerrit 89501 0 ovirt-4.2 MERGED tests: domxml_preprocess: kickstart tests 2018-04-03 09:04:05 UTC
oVirt gerrit 89502 0 ovirt-4.2 MERGED tests: domxml_preprocess: test passthrough 2018-04-03 09:04:18 UTC

Description Martin Sivák 2018-02-21 12:16:09 UTC
Description of problem:

The engine does not seem to support the exclusive access flag for disks, which we used to pass to vdsm via vm.conf's disk['shared'] = 'exclusive'.
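
For illustration, a minimal sketch of such a vm.conf-style disk entry; only the 'shared' key is the one this bug is about, the other fields and the path are hypothetical:

    # Hypothetical minimal vm.conf-style disk entry; only 'shared' matters here.
    he_disk = {
        'device': 'disk',
        'iface': 'virtio',
        'path': '/rhev/data-center/mnt/.../hosted-engine.img',  # placeholder path
        'shared': 'exclusive',  # tells vdsm to guard the disk with a sanlock lease
    }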


Version-Release number of selected component (if applicable):

4.2.2 only

Steps to Reproduce:
1. Install hosted engine with two hosts and enable global maintenance
2. Wait for OVF store to be generated (or trigger it by editing the HE VM)
3. Restart the HE VM on the host where it is running
4. Make sure it can't be started on a second host while it is running on the first one

Actual results:

I expect we will be able to start a second instance of the VM, because the exclusive access flag is not set when the VM is started using the libvirt XML approach.

Expected results:

Just one VM can run.

Additional info:

The old code did it here: https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/lib/ovf/ovf2VmParams.py#L77

This can be fixed on the agent side (by injecting the proper elements into the XML) or on the engine side (by either adding support for 'exclusive' or injecting it into the HE XML when saving it to the OVF store).
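
A minimal sketch of the agent-side option (injecting a lease element into the domain XML before starting the VM); the function and parameter names are hypothetical, the <lease> layout follows libvirt's lease device format:

    import xml.etree.ElementTree as ET

    def inject_disk_lease(dom_xml, sd_id, vol_id, lease_path, offset):
        # Add a <lease> device for the HE disk, mirroring what vdsm used to
        # derive from disk['shared'] = 'exclusive' in vm.conf.
        dom = ET.fromstring(dom_xml)
        devices = dom.find('devices')
        lease = ET.SubElement(devices, 'lease')
        ET.SubElement(lease, 'lockspace').text = sd_id  # sanlock lockspace (storage domain)
        ET.SubElement(lease, 'key').text = vol_id       # resource key (the volume)
        ET.SubElement(lease, 'target', path=lease_path, offset=str(offset))
        return ET.tostring(dom, encoding='unicode')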

Comment 1 Andrej Krejcir 2018-02-21 13:41:37 UTC
Looking at the VDSM source around [1] and [2]: if VDSM sees the 'shared:exclusive' flag in vm.conf, it adds a lease device for the disk to the libvirt XML.

When the HE VM is started using libvirt XML directly, this lease device has to be there. But currently, the engine does not generate it.


[1] - https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vmdevices/storage.py#L278
[2] - https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vmdevices/storage.py#L532
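
A condensed paraphrase of the behaviour those two spots implement (illustrative, not the literal storage.py source; the dict keys are assumptions):

    def drive_lease_info(drive_conf, volume_chain):
        # A drive marked exclusive contributes one <lease> device per volume
        # lease in its chain, so sanlock blocks a second start elsewhere.
        if drive_conf.get('shared') != 'exclusive':
            return []
        return [(vol['domainID'], vol['volumeID'],
                 vol['leasePath'], vol['leaseOffset'])
                for vol in volume_chain]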

Comment 2 Francesco Romani 2018-02-22 14:42:34 UTC
(In reply to Andrej Krejcir from comment #1)
> Looking at the VDSM source around [1] and [2]: if VDSM sees the
> 'shared:exclusive' flag in vm.conf, it adds a lease device for the disk
> to the libvirt XML.
> 
> When the HE VM is started using libvirt XML directly, this lease device has
> to be there. But currently, the engine does not generate it.
> 
> 
> [1] - https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vmdevices/storage.py#L278
> [2] - https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vmdevices/storage.py#L532

This is correct, but there is more. Let's see how it should work.

First of all, Vdsm does *not* yet pass the XML from Engine straight through: some devices may need per-host customizations (ports, paths), so Vdsm must read the XML, initialize some data structures (Device objects), and rebuild the XML it submits to libvirt.

The XML actually submitted to libvirt is expected to be largely identical to what Engine sent, with a few changes in some key areas, mostly device-related.

The devices that can be changed are thus few: leases, graphics devices, disks, and NICs.

If Engine sends the leases, Vdsm must parse them, build the corresponding objects, and reflect them in the final XML; if that doesn't happen, it is a different bug.
So, if Engine (or any client) creates the leases in the first place, it should already work.
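
To make that concrete, a rough sketch of the pass-through-with-fixups flow (hypothetical function; a graphics port stands in for the per-host details):

    import xml.etree.ElementTree as ET

    def preprocess_domain_xml(engine_xml, spice_port):
        # Keep the Engine XML largely intact, but patch the few per-host
        # device details before submitting the result to libvirt.
        dom = ET.fromstring(engine_xml)
        for graphics in dom.iter('graphics'):
            graphics.set('port', str(spice_port))  # per-host customization
        return ET.tostring(dom, encoding='unicode')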

Now, about patch http://gerrit.ovirt.org/88076:
We have a bug, meaning Vdsm does not generate the leases for drives if Engine doesn't send them. Vdsm did so in the past, so it is a regression and 88076 is a fix, but we need to make sure that it still works if Engine starts to send leases: we don't want to have them twice.

Comment 3 Andrej Krejcir 2018-02-26 18:47:55 UTC
Created attachment 1400983 [details]
Vdsm logs

Attached vdsm logs from starting the HE VM with a domain XML that contains a lease with placeholders. The start fails around 2018-02-26 19:39:00,173.
The error is probably because the HE VM uses the zero storage pool ID.

Comment 4 Michal Skrivanek 2018-02-27 09:00:00 UTC
The current fixLeases() code handles just VM leases. That can be changed relatively easily to account for volume leases too. Francesco will post that ASAP.
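
A hedged sketch of what that extension could look like (all names here are hypothetical; the actual change is gerrit 88226 in the links above):

    def fix_leases(dom, lookup_lease):
        # Resolve placeholder path/offset on every <lease> device, whether it
        # is the single VM lease or a drive (volume) lease; lookup_lease() is
        # a hypothetical host-side resolver keyed by (lockspace, key).
        for lease in dom.findall('./devices/lease'):
            info = lookup_lease(lease.findtext('lockspace'),
                                lease.findtext('key'))
            if info is None:
                continue
            target = lease.find('target')
            target.set('path', info['path'])
            target.set('offset', str(info['offset']))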

Comment 5 Nikolai Sednev 2018-02-28 08:06:33 UTC
Which type of deployment should be used, ansible or vintage?
Steps to Reproduce:
1. Install hosted engine with two hosts and enable global maintenance
2. Wait for OVF store to be generated (or trigger it by editing the HE VM)
3. Restart the HE VM on the host where it is running
4. Make sure it can't be started on a second host while it is running on the first one

Should steps 2-4 be executed while the hosts are still in global maintenance?

Comment 6 Martin Sivák 2018-02-28 08:53:35 UTC
Nikolai, the answers are:
- It does not matter, but try Node 0 if possible
- And yes, keep everything in global maintenance to stay in full control.

Comment 7 Francesco Romani 2018-03-15 13:42:37 UTC
Patch 88308 is not really needed. All other patches are merged, so it seems we can move to MODIFIED.

Comment 8 RHV bug bot 2018-03-16 15:02:41 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Open patch attached]

For more info please contact: infra

Comment 9 Simone Tiraboschi 2018-03-26 12:47:02 UTC
For QE, please explicitly test this scenario as well: deploy with a previous beta (4.2.2 RC1 or RC2), wait for the engine to create the OVF_STORE disks, and upgrade the hosts to the latest ovirt-hosted-engine-ha without upgrading the engine VM.

Shut down the engine VM and wait for a host to take over; ensure that we don't have any split-brain.

Comment 10 Sandro Bonazzola 2018-03-28 12:46:03 UTC
Moving back to POST since not all referenced patches are merged yet.

Comment 11 Francesco Romani 2018-03-28 12:51:36 UTC
Additional patches confused the bot: the newly attached patches aren't needed to fix this BZ. They are related, but not required.

+++

In gerrit, I used the Related-To tag, not the Bug-Url tag. Looks like the bot needs to be fixed.

Comment 12 Eyal Edri 2018-04-01 11:07:45 UTC
Can you explain the motivation behind 'Related-To' patches?
What purpose do they serve in terms of tracking changes? Isn't it enough to have bugs depending on each other in Bugzilla for the same purpose?

We can, of course, fix the bot to handle 'Related-To' properly; I just want to understand whether there is still a reason behind it and whether it serves a purpose.

Comment 13 Francesco Romani 2018-04-05 12:44:44 UTC
(In reply to Eyal Edri from comment #12)
> Can you explain the motivation behind 'Related-To' patches?
> What purpose do they serve in terms of tracking changes? Isn't it enough to
> have bugs depending on each other in Bugzilla for the same purpose?
> 
> We can, of course, fix the bot to handle 'Related-To' properly; I just want
> to understand whether there is still a reason behind it and whether it
> serves a purpose.

'Related-To' doesn't serve a particular purpose; it is just additional metadata.
We (developers) - or at least I - use it to document why we made a patch. Sometimes a patch is meant to improve some area of code after we fixed a bug. Sometimes we choose a different approach to solve a bug for the master branch than for a stable branch - in master we usually have more freedom to refactor things or to clean up.

In general, "Related-To" is like "Bug-Url", but weaker.

Comment 14 Nikolai Sednev 2018-04-08 07:44:46 UTC
I don't really understand the change. In the old code, sanlock blocked the engine VM from being started on another host while the SHE VM was already running elsewhere.
I have never seen the SHE VM running on both hosts, because sanlock did not allow it.
How will the end result be different from what we had before the fix?

Comment 15 Eyal Edri 2018-04-08 09:11:13 UTC
(In reply to Francesco Romani from comment #13)
> (In reply to Eyal Edri from comment #12)
> > Can you explain the motivation behind 'Related-To' patches?
> > What purpose do they serve in terms of tracking changes? Isn't it enough
> > to have bugs depending on each other in Bugzilla for the same purpose?
> > 
> > We can, of course, fix the bot to handle 'Related-To' properly; I just
> > want to understand whether there is still a reason behind it and whether
> > it serves a purpose.
> 
> 'Related-To' doesn't serve a particular purpose; it is just additional
> metadata.
> We (developers) - or at least I - use it to document why we made a patch.
> Sometimes a patch is meant to improve some area of code after we fixed a
> bug. Sometimes we choose a different approach to solve a bug for the master
> branch than for a stable branch - in master we usually have more freedom to
> refactor things or to clean up.
> 
> In general, "Related-To" is like "Bug-Url", but weaker.

I've opened https://ovirt-jira.atlassian.net/browse/OVIRT-1962 to discuss a potential update to the hooks logic; please add there what you think the hooks should do in case of 'Related-To'.

Comment 16 Martin Sivák 2018-04-08 12:08:30 UTC
(In reply to Nikolai Sednev from comment #14)
> I don't really understand the change. In the old code, sanlock blocked the
> engine VM from being started on another host while the SHE VM was already
> running elsewhere.
> I have never seen the SHE VM running on both hosts, because sanlock did not
> allow it.
> How will the end result be different from what we had before the fix?

The libvirt XML method needs to configure the lock through a different API. That is all; the end result will be the same.

Comment 17 Nikolai Sednev 2018-04-22 15:52:22 UTC
Works for me on these components:
ovirt-hosted-engine-setup-2.2.18-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.10-1.el7ev.noarch
rhvm-appliance-4.2-20180420.0.el7.noarch
Linux 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)

Comment 18 Nikolai Sednev 2018-04-22 15:52:41 UTC
ovirt-engine-setup-base-4.2.3.2-0.1.el7.noarch

Comment 19 Sandro Bonazzola 2018-04-27 07:22:56 UTC
This bugzilla is included in the oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

