Bug 1547479
Summary: Make sure hosted engine disks have exclusive access flags even when libvirt XML is used via OVF
Product: [oVirt] ovirt-engine
Reporter: Martin Sivák <msivak>
Component: BLL.HostedEngine
Assignee: Andrej Krejcir <akrejcir>
Status: CLOSED CURRENTRELEASE
QA Contact: Nikolai Sednev <nsednev>
Severity: urgent
Priority: urgent
Version: 4.2.2
CC: akrejcir, bugs, eedri, fromani, michal.skrivanek, msivak, stirabos
Target Milestone: ovirt-4.2.2
Keywords: Triaged
Target Release: 4.2.2.6
Flags: rule-engine: ovirt-4.2+, rule-engine: blocker+, msivak: devel_ack+
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ovirt-engine-4.2.2.6
Doc Type: No Doc Update
Story Points: ---
Last Closed: 2018-04-27 07:22:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: SLA
Cloudforms Team: ---
Bug Depends On: 1564873
Description (Martin Sivák, 2018-02-21 12:16:09 UTC)
Looking at the VDSM source around [1] and [2]: if VDSM sees the 'shared:exclusive' flag in vm.conf, it adds a lease device for the disk to the libvirt XML.

When the HE VM is started using libvirt XML directly, this lease device has to be there. But currently, the engine does not generate it.

[1] https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vmdevices/storage.py#L278
[2] https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vmdevices/storage.py#L532

(In reply to Andrej Krejcir from comment #1)
> Looking at the VDSM source around [1] and [2]. If the VDSM sees
> 'shared:exclusive' flag in the vm.conf, it adds a lease device for the disk
> to the libvirt XML
>
> When the HE VM is started using libvirt XML directly, this lease device has
> to be there. But currently, the engine does not generate it.

This is correct, but there is more. Let's see how it should work.

First of all, Vdsm does *not* simply pass through the XML from Engine: some devices may need per-host customizations (ports, paths), so Vdsm must read the XML, initialize some data structures (Device objects), and rebuild the XML to submit to libvirt. The XML actually submitted to libvirt is expected to be largely identical to what Engine sent, with a few changes in some key areas, mostly device-related.

The devices that can be changed are thus few: leases, graphics devices, disks, NICs. If Engine sends the leases, Vdsm must parse them, build the corresponding objects, and reflect them in the final XML; anything else is a different bug. So, if Engine (or any client) creates the leases in the first place, it should already work.

Now, about patch http://gerrit.ovirt.org/88076: we do have a bug, meaning Vdsm does not generate the leases for drives if Engine doesn't send them. Vdsm did so in the past, so it is a regression and 88076 is a fix, but we need to make sure that it still works if Engine starts to send leases: we don't want to have them twice.

Created attachment 1400983 [details]
Vdsm logs
Attached vdsm logs from starting the HE VM using a domain XML that contains a lease with placeholders. The start fails around 2018-02-26 19:39:00,173. The error is probably because the HE VM uses an all-zero storage pool ID.
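For context, a disk lease device in libvirt domain XML looks roughly like the fragment below. The UUID-like values are invented, and the placeholder format shown in the target path/offset attributes is an assumption for illustration, not copied from the attached logs:

```xml
<devices>
  <lease>
    <!-- lockspace: the storage domain UUID; key: the volume UUID (values invented) -->
    <lockspace>158d1eb7-0000-0000-0000-000000000000</lockspace>
    <key>9eaa286d-0000-0000-0000-000000000000</key>
    <!-- placeholder values to be resolved on the host before the XML reaches libvirt -->
    <target path="LEASE-PATH:key:lockspace" offset="LEASE-OFFSET:key:lockspace"/>
  </lease>
</devices>
```

With such a lease in place, sanlock refuses to grant the same lease to a second qemu process, which is what enforces exclusive access to the HE disk.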
The current fixLeases() code handles just VM leases. That can be changed relatively easily to account for volume leases too. Francesco will post that ASAP.

Which type of deployment should be used, Ansible or vintage?

Steps to Reproduce:
1. Install hosted engine with two hosts and enable global maintenance
2. Wait for the OVF store to be generated (or trigger it by editing the HE VM)
3. Restart the HE VM on the host where it is running
4. Make sure it can't be started on a second host while it is running on the first one

Should steps 2-4 be executed while the hosts are still in global maintenance?

Nikolai, the answers are:
- It does not matter, but try Node 0 if possible
- Yes, keep everything in global maintenance to be in full control

Patch 88308 is not really needed. All other patches are merged; it seems we can move to MODIFIED.

INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Open patch attached] For more info please contact: infra

For QE, please explicitly test also this scenario: deploy with a previous beta (4.2.2 RC1 or RC2), wait for the engine to create the OVF_STORE disks, and upgrade the hosts to the latest ovirt-hosted-engine-ha without upgrading the engine VM. Shut down the engine VM and wait for a host to take over; ensure that we don't have any split brain.

Moving back to POST since not all referenced patches are merged yet.

The additional patches confused the bot: the newly attached patches aren't needed to fix this BZ. They are related, not required. In gerrit, I used the Related-To tag, not the Bug-Url tag. Looks like the bot needs to be fixed.

Can you explain the motivation behind 'related-to' patches? What purpose do they serve in terms of tracking changes? Isn't it enough to have bugs depending on each other in Bugzilla for the same purpose? We can, of course, fix the bot to handle related-to properly; I just want to understand if there is still a reason behind it and whether it serves a purpose.
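The fixLeases() pass mentioned in the comments above can be sketched roughly as follows. This is illustrative only, not the actual Vdsm code: the function name is borrowed from the comment, while the placeholder format and the `resolve` callback are assumptions.

```python
# Illustrative sketch only -- not the actual Vdsm fixLeases() implementation.
import xml.etree.ElementTree as ET

def fix_leases(dom_xml, resolve):
    """Replace placeholder path/offset values on every <lease> device.

    `resolve` maps (lockspace, key) to a concrete (path, offset); in real
    code this information would come from the storage/sanlock layer, and
    would have to cover volume leases as well as the VM lease.
    """
    root = ET.fromstring(dom_xml)
    for lease in root.iter('lease'):
        lockspace = lease.findtext('lockspace')
        key = lease.findtext('key')
        path, offset = resolve(lockspace, key)
        target = lease.find('target')
        target.set('path', path)
        target.set('offset', str(offset))
    return ET.tostring(root, encoding='unicode')

# A domain XML fragment with assumed placeholder lease values.
DOM = """
<domain>
  <devices>
    <lease>
      <lockspace>sd-uuid</lockspace>
      <key>vol-uuid</key>
      <target path="LEASE-PATH:vol-uuid:sd-uuid" offset="LEASE-OFFSET:vol-uuid:sd-uuid"/>
    </lease>
  </devices>
</domain>
"""

fixed = fix_leases(DOM, lambda ls, k: ("/run/vdsm/example.lease", 0))
assert "LEASE-PATH" not in fixed and 'offset="0"' in fixed
```

The point of the sketch is the shape of the pass: walk every lease device, not just the VM lease, and substitute host-local values before the XML is submitted to libvirt.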
(In reply to Eyal Edri from comment #12)
> Can you explain the motivation behind 'related-to' patches?
> What purpose do they serve in terms of tracking changes? Isn't it enough to
> have bugs depending on each other in Bugzilla for the same purpose?

'Related-To' doesn't serve a particular purpose; it is just additional metadata. We (developers), or at least I, use it to document why we made a patch. Sometimes a patch is meant to improve some area of code after we fixed a bug. Sometimes we choose a different approach to solve a bug for the master branch versus a stable branch; in master we usually have more freedom to refactor things or clean up. In general, "Related-To" is like "Bug-Url", but weaker.

I've opened https://ovirt-jira.atlassian.net/browse/OVIRT-1962 to discuss a potential update to the hooks logic; please add info there on what you think the hooks should do in case of 'related-to'.

I don't really understand the change. In the old code sanlock was blocking the engine VM from getting started on another host while the SHE VM was already running elsewhere. I have never seen the SHE VM running on both hosts, because sanlock was not allowing it. In what way will the end result be different from what we had before the fix?

(In reply to Nikolai Sednev from comment #14)
> In what way will the end result be different from what we had before the fix?

The libvirtxml method needs to configure the lock (different API). That is all. The end result will be the same.

Works for me on these components:
ovirt-hosted-engine-setup-2.2.18-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.10-1.el7ev.noarch
rhvm-appliance-4.2-20180420.0.el7.noarch
Linux 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)
ovirt-engine-setup-base-4.2.3.2-0.1.el7.noarch

This bugzilla is included in the oVirt 4.2.2 release, published on March 28th 2018. Since the problem described in this bug report should be resolved in the oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.