Bug 1547479

Summary: Make sure hosted engine disks have exclusive access flags even when libvirt XML is used via OVF
Product: [oVirt] ovirt-engine
Reporter: Martin Sivák <msivak>
Component: BLL.HostedEngine
Assignee: Andrej Krejcir <akrejcir>
Status: CLOSED CURRENTRELEASE
QA Contact: Nikolai Sednev <nsednev>
Severity: urgent
Priority: urgent
Version: 4.2.2
CC: akrejcir, bugs, eedri, fromani, michal.skrivanek, msivak, stirabos
Target Milestone: ovirt-4.2.2
Keywords: Triaged
Target Release: 4.2.2.6
Flags: rule-engine: ovirt-4.2+, rule-engine: blocker+, msivak: devel_ack+
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ovirt-engine-4.2.2.6
Doc Type: No Doc Update
Last Closed: 2018-04-27 07:22:56 UTC
Type: Bug
oVirt Team: SLA
Bug Depends On: 1564873    
Attachments:
Vdsm logs (attached in comment 3)

Description Martin Sivák 2018-02-21 12:16:09 UTC
Description of problem:

The engine does not seem to support the exclusive access flag for disks, which we used to pass to vdsm via vm.conf's disk['shared'] = 'exclusive'.
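
For illustration, a minimal vm.conf-style drive entry using that flag could look like this (field names follow the vm.conf convention; UUIDs are placeholders, not the real HE values):

    # Old flow: the 'shared' key in the drive dict is what triggered the
    # sanlock lease on the hosted engine disk.
    drive = {
        'device': 'disk',
        'iface': 'virtio',
        'format': 'raw',
        'domainID': 'sd-uuid-placeholder',
        'imageID': 'img-uuid-placeholder',
        'volumeID': 'vol-uuid-placeholder',
        'shared': 'exclusive',
    }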


Version-Release number of selected component (if applicable):

4.2.2 only

Steps to Reproduce:
1. Install hosted engine, two hosts and enable global maintenance
2. Wait for OVF store to be generated (or trigger it by editing the HE VM)
3. Restart the HE VM on the host where it is running
4. Make sure it can't be started on a second host while it is running on the first one

Actual results:

We expect the VM can also be started on the second host, because the exclusive access flag is not set when the VM is started using the libvirt XML approach.

Expected results:

Only one instance of the VM can run at a time.

Additional info:

The old code did it here: https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/lib/ovf/ovf2VmParams.py#L77

This can be fixed on the agent side (injecting the proper elements into the XML) or on the engine side (either by adding support for the exclusive flag or by injecting it into the HE XML when saving it to the OVF store).
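
A minimal sketch of the agent-side option, assuming the sanlock lockspace is the storage domain UUID and the key is the volume UUID (function and parameter names are illustrative, not actual project code):

    import xml.etree.ElementTree as ET

    def inject_disk_lease(dom_xml, sd_uuid, vol_uuid, lease_path, lease_offset):
        # Append a <lease> device so libvirt asks sanlock to guard the disk.
        root = ET.fromstring(dom_xml)
        devices = root.find('devices')
        lease = ET.SubElement(devices, 'lease')
        ET.SubElement(lease, 'lockspace').text = sd_uuid   # lockspace = storage domain
        ET.SubElement(lease, 'key').text = vol_uuid        # resource key = volume
        ET.SubElement(lease, 'target', path=lease_path, offset=str(lease_offset))
        return ET.tostring(root, encoding='unicode')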

Comment 1 Andrej Krejcir 2018-02-21 13:41:37 UTC
Looking at the VDSM source around [1] and [2]: if VDSM sees the 'shared': 'exclusive' flag in vm.conf, it adds a lease device for the disk to the libvirt XML.

When the HE VM is started using libvirt XML directly, this lease device has to be there. But currently, the engine does not generate it.


[1] - https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vmdevices/storage.py#L278
[2] - https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vmdevices/storage.py#L532
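
For illustration, the device in question looks roughly like this (all values are placeholders, not the real HE lockspace/key/path):

    # The <lease> element that must appear under <devices> in the domain
    # XML for sanlock to protect the disk.
    MISSING_LEASE_DEVICE = """\
    <lease>
      <lockspace>sd-uuid-placeholder</lockspace>
      <key>vol-uuid-placeholder</key>
      <target path='/path/to/volume.lease' offset='0'/>
    </lease>
    """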

Comment 2 Francesco Romani 2018-02-22 14:42:34 UTC
(In reply to Andrej Krejcir from comment #1)
> Looking at the VDSM source around [1] and [2]: if VDSM sees the 'shared':
> 'exclusive' flag in vm.conf, it adds a lease device for the disk to the
> libvirt XML.
> 
> When the HE VM is started using libvirt XML directly, this lease device has
> to be there. But currently, the engine does not generate it.
> 
> 
> [1] - https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vmdevices/storage.py#L278
> [2] - https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vmdevices/storage.py#L532

This is correct, but there is more. Let's see how it should work.

First of all, Vdsm does *not* yet pass the XML from Engine straight through: some devices may need per-host customizations (ports, paths), so Vdsm must read the XML, initialize some data structures (Device objects) and rebuild the XML that is submitted to libvirt.

The XML actually submitted to libvirt is expected to be largely identical to what Engine sent, with a few changes in some key areas, mostly device-related.

The devices that can be changed are thus few: leases, graphics devices, disks and NICs.

If Engine sends the leases, Vdsm must parse them, build the corresponding objects and reflect them in the final XML; otherwise it is a different bug.
So, if Engine (or any client) creates the leases in the first place, it should already work.

Now, about patch http://gerrit.ovirt.org/88076:
We have a bug, meaning Vdsm does not generate the leases for drives if Engine doesn't send them. Vdsm did so in the past, so it is a regression and 88076 is a fix, but we need to make sure that it still works if Engine starts to send leases: we don't want to have them twice.
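
A minimal sketch of the kind of guard this implies, assuming drives and XML leases are matched on (storage domain, volume) pairs; names are assumptions, not the actual Vdsm code:

    def needs_drive_lease(drive, xml_leases):
        # xml_leases: set of (lockspace, key) pairs already present in the
        # Engine-sent domain XML; only synthesize a lease when it is missing.
        if drive.get('shared') != 'exclusive':
            return False
        return (drive['domainID'], drive['volumeID']) not in xml_leases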

Comment 3 Andrej Krejcir 2018-02-26 18:47:55 UTC
Created attachment 1400983 [details]
Vdsm logs

Attached are vdsm logs from starting the HE VM with a domain XML that contains a lease with placeholders. The start fails around 2018-02-26 19:39:00,173.
The error is probably caused by the HE VM using the zero storage pool ID.

Comment 4 Michal Skrivanek 2018-02-27 09:00:00 UTC
The current fixLeases() code handles just VM leases. It can be changed relatively easily to account for volume leases too. Francesco will post that ASAP.
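
A minimal sketch of such an extension, assuming the values to fill in are the lease target path and offset, which only the host can resolve (names are assumptions, not the actual Vdsm code):

    def fix_leases(dom, lookup_lease_info):
        # dom: xml.etree.ElementTree.Element of the domain XML.
        # lookup_lease_info: resolves (lockspace, key) -> (path, offset) on
        # this host; the same loop covers the VM lease and volume leases.
        for lease in dom.findall('./devices/lease'):
            lockspace = lease.findtext('lockspace')
            key = lease.findtext('key')
            path, offset = lookup_lease_info(lockspace, key)
            target = lease.find('target')
            target.set('path', path)           # per-host value, unknown to Engine
            target.set('offset', str(offset))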

Comment 5 Nikolai Sednev 2018-02-28 08:06:33 UTC
Which type of deployment should be used, ansible or vintage?
Steps to Reproduce:
1. Install hosted engine, two hosts and enable global maintenance
2. Wait for OVF store to be generated (or trigger it by editing the HE VM)
3. Restart the HE VM on the host where it is running
4. Make sure it can't be started on a second host while it is running on the first one

Should steps 2-4 be executed while the hosts are still in global maintenance?

Comment 6 Martin Sivák 2018-02-28 08:53:35 UTC
Nikolai, the answers are:
- It does not matter, but try Node 0 if possible
- and yes, keep everything in global maintenance to be in full control.

Comment 7 Francesco Romani 2018-03-15 13:42:37 UTC
Patch 88308 is not really needed. All other patches are merged, so it seems we can move to MODIFIED.

Comment 8 RHV bug bot 2018-03-16 15:02:41 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Open patch attached]

For more info please contact: infra

Comment 9 Simone Tiraboschi 2018-03-26 12:47:02 UTC
For QE, please explicitly test this scenario as well: deploy with a previous beta (4.2.2 RC1 or RC2), wait for the engine to create the OVF_STORE disks and upgrade the hosts to the latest ovirt-hosted-engine-ha without upgrading the engine VM.

Shut down the engine VM and wait for a host to take over; ensure that we don't get any split brain.

Comment 10 Sandro Bonazzola 2018-03-28 12:46:03 UTC
Moving back to POST since not all referenced patches are merged yet.

Comment 11 Francesco Romani 2018-03-28 12:51:36 UTC
Additional patches confused the bot: the newly attached patches aren't needed to fix this BZ. They are related, but not required.

+++

In gerrit, I used the Related-To tag, not the Bug-Url tag. Looks like the bot needs to be fixed.

Comment 12 Eyal Edri 2018-04-01 11:07:45 UTC
Can you explain the motivation behind 'Related-To' patches?
What purpose do they serve in terms of tracking changes? Isn't it enough to have bugs depending on each other in Bugzilla for the same purpose?

We can, of course, fix the bot to handle 'Related-To' properly; I just want to understand whether there is still a reason behind it and it serves a purpose.

Comment 13 Francesco Romani 2018-04-05 12:44:44 UTC
(In reply to Eyal Edri from comment #12)
> Can you explain the motivation behind 'Related-To' patches?
> What purpose do they serve in terms of tracking changes? Isn't it enough to
> have bugs depending on each other in Bugzilla for the same purpose?
> 
> We can, of course, fix the bot to handle 'Related-To' properly; I just want
> to understand whether there is still a reason behind it and it serves a
> purpose.

'Related-To' doesn't serve a particular purpose; it is just additional metadata.
We (developers), or at least I, use it to document why we made a patch. Sometimes a patch is meant to improve some area of code after we fixed a bug. Sometimes we choose a different approach to solve a bug on the master branch than on a stable branch; in "master" we usually have more freedom to refactor things or to clean up.

In general, "Related-To" is like "Bug-Url", but weaker.

Comment 14 Nikolai Sednev 2018-04-08 07:44:46 UTC
I don't really understand the change. In the old code, sanlock blocked the engine VM from being started on another host while the SHE VM was already running elsewhere.
I have never seen the SHE VM running on both hosts, because sanlock did not allow it.
How will the end result differ from what we had before the fix?

Comment 15 Eyal Edri 2018-04-08 09:11:13 UTC
(In reply to Francesco Romani from comment #13)
> (In reply to Eyal Edri from comment #12)
> > Can you explain the motivation behind 'Related-To' patches?
> > What purpose do they serve in terms of tracking changes? Isn't it enough to
> > have bugs depending on each other in Bugzilla for the same purpose?
> > 
> > We can, of course, fix the bot to handle 'Related-To' properly; I just want
> > to understand whether there is still a reason behind it and it serves a
> > purpose.
> 
> 'Related-To' doesn't serve a particular purpose; it is just additional
> metadata.
> We (developers), or at least I, use it to document why we made a patch.
> Sometimes a patch is meant to improve some area of code after we fixed a
> bug. Sometimes we choose a different approach to solve a bug on the master
> branch than on a stable branch; in "master" we usually have more freedom to
> refactor things or to clean up.
> 
> In general, "Related-To" is like "Bug-Url", but weaker.

I've opened https://ovirt-jira.atlassian.net/browse/OVIRT-1962 to discuss a potential update to the hooks logic; please add info there on what you think the hooks should do in the case of 'Related-To'.

Comment 16 Martin Sivák 2018-04-08 12:08:30 UTC
(In reply to Nikolai Sednev from comment #14)
> I don't really understand the change. In the old code, sanlock blocked the
> engine VM from being started on another host while the SHE VM was already
> running elsewhere.
> I have never seen the SHE VM running on both hosts, because sanlock did not
> allow it.
> How will the end result differ from what we had before the fix?

The libvirt XML method needs to configure the lock through a different API (a lease device in the domain XML instead of the vm.conf flag). That is all; the end result will be the same.
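
A rough sketch of that equivalence, with the lockspace/key mapping assumed from comments 1 and 2 (not actual project code):

    def vmconf_flag_to_lease(drive):
        # What vm.conf expressed as a per-disk flag, the libvirt XML flow
        # expresses as a standalone lease device; sanlock ends up holding
        # the same exclusive resource either way.
        assert drive['shared'] == 'exclusive'
        return {
            'lockspace': drive['domainID'],  # assumed: lockspace = storage domain
            'key': drive['volumeID'],        # assumed: resource key = volume
        }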

Comment 17 Nikolai Sednev 2018-04-22 15:52:22 UTC
Works for me on these components:
ovirt-hosted-engine-setup-2.2.18-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.10-1.el7ev.noarch
rhvm-appliance-4.2-20180420.0.el7.noarch
Linux 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)

Comment 18 Nikolai Sednev 2018-04-22 15:52:41 UTC
ovirt-engine-setup-base-4.2.3.2-0.1.el7.noarch

Comment 19 Sandro Bonazzola 2018-04-27 07:22:56 UTC
This bug is included in the oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.