|Summary:||After upgrading a single-host HE (engine has one storage domain) from 3.5 to 3.6, the master storage domain is inactive for the first 2-5 minutes|
|Product:||[oVirt] ovirt-engine||Reporter:||Artyom <alukiano>|
|Component:||BLL.HostedEngine||Assignee:||Yanir Quinn <yquinn>|
|Status:||CLOSED WONTFIX||QA Contact:||meital avital <mavital>|
|Version:||3.6.0||CC:||alukiano, amureini, bugs, dfediuck, laravot, nsoffer, sbonazzo, tnisan, ylavi|
|Fixed In Version:||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2017-07-26 08:36:56 UTC||Type:||Bug|
|oVirt Team:||SLA||RHEL 7.3 requirements from Atomic Host:|
|Cloudforms Team:||---||Target Upstream Version:|
|Bug Depends On:||1393902|
Description Artyom 2015-12-27 12:18:36 UTC
Created attachment 1109801 [details]
engine and vdsm logs

Description of problem:
After upgrading a single-host HE (engine has one storage domain) from 3.5 to 3.6, the master storage domain is in state inactive for the first 2-5 minutes.

Version-Release number of selected component (if applicable):
3.5 - Red Hat Enterprise Virtualization Hypervisor release 7.1 (20151015.0.el7ev)
=================================
vdsm-4.16.27-1.el7ev.x86_64
ovirt-hosted-engine-setup-188.8.131.52-1.el7ev.noarch
ovirt-hosted-engine-ha-184.108.40.206-1.el7ev.noarch

3.6 - Red Hat Enterprise Virtualization Hypervisor (Beta) release 7.2 (20151221.1.el7ev)
==============================================================================================
ovirt-hosted-engine-setup-220.127.116.11-1.el7ev.noarch
ovirt-hosted-engine-ha-18.104.22.168-1.el7ev.noarch
vdsm-4.17.13-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy 3.5 HE on a single RHEV-H host
2. Add a storage domain to the engine
3. Enable global maintenance
4. Upgrade the engine from 3.5 to 3.6
5. Power off the engine VM
6. Upgrade the host to 3.6 via PXE or USB key (RHEV-H upgrade process; you cannot upgrade it via the engine, because you have only one host)

Actual results:
For the first 2-5 minutes the master storage domain is in state inactive and the engine shows error messages:

2015-12-22 12:14:59,688 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-35)  IrsBroker::Failed::GetStoragePoolInfoVDS: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: u'spUUID=00000002-0002-0002-0002-0000000001df, msdUUID=8c5ba712-5b3a-4114-852b-11bf9a6605be'

Expected results:
I am not sure whether this is a bug, but I expect that after the upgrade the whole system works correctly without any error messages from the vdsm or engine side.

Additional info:
Tested also on RHEL; the engine shows the same error.
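The 2-5 minute inactive window described above can be pictured as a polling loop: the engine keeps querying the storage pool and keeps failing until the host's master-domain VG becomes visible again after the reboot. A minimal, hypothetical Python sketch of that behaviour (all names, intervals, and timings are illustrative assumptions, not actual engine code):

```python
import itertools

# Hypothetical simulation of the engine's storage-pool monitoring after
# an unexpected HE host reboot. Numbers are illustrative assumptions.
POLL_INTERVAL_S = 10          # assumed monitoring poll interval
VG_AVAILABLE_AFTER_S = 180    # assume the VG reappears ~3 minutes after boot


def get_storage_pool_info(elapsed_s: int) -> str:
    """Mimic GetStoragePoolInfoVDS: fail until the master VG is back."""
    if elapsed_s < VG_AVAILABLE_AFTER_S:
        raise RuntimeError("IRSNoMasterDomainException: Cannot find master domain")
    return "Active"


def recover_master_domain() -> int:
    """Poll until the master domain reports Active; return elapsed seconds."""
    for attempt in itertools.count():
        elapsed = attempt * POLL_INTERVAL_S
        try:
            get_storage_pool_info(elapsed)
            return elapsed
        except RuntimeError:
            continue  # engine logs an IrsBrokerCommand error and retries


print(recover_master_domain())  # → 180
```

The point of the sketch is only that nothing in this flow reconnects storage proactively: the inactive window lasts exactly as long as it takes the VG to come back plus the polling cadence.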
Comment 1 Doron Fediuck 2016-01-10 08:51:12 UTC
Nir, care to look into this issue?
Comment 2 Allon Mureinik 2016-01-13 13:18:09 UTC
Liron, as QA contact, please take a look at this.
Comment 3 Liron Aravot 2016-01-14 08:40:53 UTC
When the host is restarted it loses its storage connections, so when you start the engine again (and the engine assumes that the host is still UP) it takes time to recover. I assume we could improve the HE upgrade process to avoid this issue. Adding needinfo? on rgolan.
Comment 4 Liron Aravot 2016-01-14 09:03:49 UTC
Rethinking this, we can use an existing mechanism to solve it. Will update here.
Comment 5 Liron Aravot 2016-01-14 14:02:08 UTC
The solution is that on engine start we should make sure that InitVdsOnUp is executed immediately for the host that is running the HE VM.
Comment 6 Allon Mureinik 2016-01-14 14:25:16 UTC
Upgrading a host while it's UP violates vdsm's contract with the engine. Pushing out to 4.0, and we should rethink this as part of a holistic upgrade solution from HE's side.
Comment 7 Red Hat Bugzilla Rules Engine 2016-01-14 14:25:18 UTC
This bug is marked for z-stream, yet the milestone is for a major version, therefore the milestone has been reset. Please set the correct milestone or drop the z stream flag.
Comment 8 Doron Fediuck 2016-01-19 14:06:20 UTC
Roy, can you please review the HE flow? We should not be upgrading a host in UP state.
Comment 9 Roy Golan 2016-02-22 13:57:10 UTC
@Liron isn't this the initVdsOnUp area?
Comment 10 Roy Golan 2016-03-02 13:35:59 UTC
Comment 11 Yaniv Lavi 2016-03-02 13:37:45 UTC
How risky would it be to fix this?
Comment 12 Allon Mureinik 2016-04-13 12:09:27 UTC
Liron, Roy, what's going on with this BZ?

In comment 5 Liron commented:
> The solution is that the engine start we should make sure that InitVdsOnUp
> is executed for the host that is running the HE vm immediately.

Then Roy asked in comment 9:
> @Liron isn't this the initVdsOnUp area?

I don't understand the decision here. Please clarify WHAT needs to be done, and in which part of the system. Pushing out until we get a proper answer to this question.
Comment 13 Roy Golan 2016-04-18 07:03:20 UTC
The logs show that an attempt to connect a domain after this single host's reboot fails with "Cannot find master domain:". During this time VDSM says:

lvm::375::Storage.LVM::(_reloadvgs) lvm vgs failed: 5  [' Volume group "8c5ba712-5b3a-4114-852b-11bf9a6605be" not found', ' Cannot process volume group 8c5ba712-5b3a-4114-852b-11bf9a6605be']

After a few minutes the connection to that domain succeeds, once the VG is available. Why does it take time to see the VG on boot?
Comment 14 Liron Aravot 2016-06-16 10:54:52 UTC
rgolan, please see comment https://bugzilla.redhat.com/show_bug.cgi?id=1294348#c3

When a "regular" host is upgraded, it's first moved to maintenance mode, then upgraded, and then activated (InitVdsOnUp runs), so all the storage connections (and the other operations done in that flow) are set up again.
The HE host is rebooted while in the engine it still appears as UP, so it takes time for the engine to recover the host (as expected).

We need to make sure that on expected HE host reboots the InitVdsOnUp flow is executed again, so the host goes through the activation flow, in order to avoid the recovery time.
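The difference between the two flows described above, and the fix proposed in comments 5 and 14, can be sketched as follows. This is a hypothetical simulation, not engine code: the function names, the dict-based host model, and all timings are illustrative assumptions.

```python
# Hypothetical sketch of the proposed fix: on engine start, run the
# activation flow (InitVdsOnUp) immediately for the host running the HE VM,
# instead of waiting for monitoring to recover the seemingly-UP host.
# All names and timings are illustrative assumptions, not engine behaviour.

RECOVERY_TIMEOUT_S = 300   # assumed worst-case monitoring recovery window
ACTIVATION_COST_S = 15     # assumed cost of an explicit InitVdsOnUp run


def init_vds_on_up(host: dict) -> int:
    """Re-run the activation flow: reconnect storage, return time taken."""
    host["storage_connected"] = True
    return ACTIVATION_COST_S


def on_engine_start(host: dict, run_init_vds_on_up: bool) -> int:
    """Return seconds until the host's storage is usable again."""
    if run_init_vds_on_up and host["runs_he_vm"]:
        return init_vds_on_up(host)   # proposed fix: activate immediately
    return RECOVERY_TIMEOUT_S         # current behaviour: wait for recovery


he_host = {"runs_he_vm": True, "storage_connected": False}
print(on_engine_start(dict(he_host), run_init_vds_on_up=False))  # → 300
print(on_engine_start(dict(he_host), run_init_vds_on_up=True))   # → 15
```

The sketch mirrors the regular-host upgrade (maintenance, upgrade, activate) by making the activation step explicit, which is exactly what the in-place HE host reboot skips today.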
Comment 15 Allon Mureinik 2016-07-06 12:10:53 UTC
(In reply to Liron Aravot from comment #14)
> rgolan, please see comment
> https://bugzilla.redhat.com/show_bug.cgi?id=1294348#c3
>
> When a "regular" host is upgraded, its first moved to maintenance mode, then
> its being upgraded and then activated (InitVdsOnUp runs) so all the storage
> connections (and other operation done on that flow) are done again.
> The HE host is being rebooted while in the engine it still appears as UP, so
> it takes time to the engine to recover the host (as expected).
>
> We need to makre sure that on expected HE host reboots the InitVdsOnUp flow
> is executed again so the host is going through the activation flow in order
> to avoid the recovery time.

So, what's the AI here?
Comment 16 Liron Aravot 2016-07-07 07:41:09 UTC
Check out the last part of the comment :) -
> We need to make sure that on expected HE host reboots the InitVdsOnUp flow
> is executed again so the host is going through the activation flow in order
> to avoid the recovery time.

As this was pending on needinfo? for a while, moving to the HE component.
Comment 17 Martin Sivák 2016-10-26 12:26:15 UTC
Please take a look at the needinfo and figure out if we have to do anything with this at all or it should be handled by storage or infra.
Comment 18 Yanir Quinn 2016-10-27 11:14:43 UTC
What are the risks if we run InitVdsOnUp while the host is in UP state (on expected HE host reboots)? It's a bit vague and the consequences aren't clear.
In this particular upgrade use case, which doesn't happen often, is it worth adding this functionality instead of accepting 5 minutes of downtime?
Comment 19 Liron Aravot 2016-10-30 11:35:15 UTC
(In reply to Yanir from comment #18)
> What are the risks if we use InitVdsOnUp during UP state of the engine ? (on
> expected HE host reboots)
> its a bit vague and the consequences aren't clear.

That flow runs whenever a host is activated/recovered, in order to initialize it. There are a few nits (like avoiding moving the HE host to NonOperational status) that we need to look at when running it for the HE host - if it turns out to be relevant we can look into that. If we do decide that we want to run it, we'll have to decide how to run it and how to handle those cases.

> In this particular use case of upgrade which doesnt happen a lot , is it
> worth to add this functionality instead of a 5 minutes downtime ?

I'm not entirely sure - applying the fix will improve things, but I can't say whether the current situation is good enough to leave unhandled for now - it will require some work.
Comment 20 Martin Sivák 2016-11-23 13:44:12 UTC
Moving to 4.2 as we need to cooperate with the storage team on this.
Comment 21 Nir Soffer 2017-05-22 08:30:33 UTC
I think the use case in this bug is not very interesting. With this configuration you have no HA, so downtime is not that important. If you want any HA you should have at least 3 hosts and more than one storage domain.

Can we test this flow with a proper HE setup? Does it still take 2-5 minutes?
Comment 22 Artyom 2017-05-22 10:01:25 UTC
With 2 or more hosts, I do not see this problem.
Comment 23 Doron Fediuck 2017-07-26 08:36:56 UTC
We won't be pursuing this issue any more. If someone needs it, patches are welcome.