Created attachment 1109801 [details]
engine and vdsm logs

Description of problem:
After upgrading a single-host HE (engine has one storage domain) from 3.5 to 3.6, the master storage domain stays inactive for the first 2-5 minutes.

Version-Release number of selected component (if applicable):
3.5 - Red Hat Enterprise Virtualization Hypervisor release 7.1 (20151015.0.el7ev)
=================================
vdsm-4.16.27-1.el7ev.x86_64
ovirt-hosted-engine-setup-1.2.6.1-1.el7ev.noarch
ovirt-hosted-engine-ha-1.2.7.2-1.el7ev.noarch

3.6 - Red Hat Enterprise Virtualization Hypervisor (Beta) release 7.2 (20151221.1.el7ev)
==============================================================================================
ovirt-hosted-engine-setup-1.3.1.3-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.3.6-1.el7ev.noarch
vdsm-4.17.13-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy 3.5 HE on a single RHEV-H host
2. Add a storage domain to the engine
3. Enable global maintenance
4. Upgrade the engine from 3.5 to 3.6
5. Power off the engine VM
6. Upgrade the host to 3.6 via PXE or USB key (the RHEV-H upgrade process; you cannot upgrade via the engine, because you have only one host)

Actual results:
For the first 2-5 minutes the master storage domain is inactive and the engine logs error messages:

2015-12-22 12:14:59,688 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-35) [] IrsBroker::Failed::GetStoragePoolInfoVDS: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: u'spUUID=00000002-0002-0002-0002-0000000001df, msdUUID=8c5ba712-5b3a-4114-852b-11bf9a6605be'

Expected results:
I am not sure whether this is a bug, but I expect that after the upgrade the whole system works correctly without any error messages from the vdsm or engine side.

Additional info:
Tested it also on RHEL; the engine shows the same error.
Nir, care to look into this issue?
Liron, as QA contact, please take a look at this.
When the host is restarted it loses its storage connections, so when you start the engine again (while the engine still assumes the host is UP) it takes time to recover. I assume we can improve the HE upgrade process to avoid that issue. Adding needinfo? on rgolan.
Rethinking this, we can use an existing mechanism to solve it. Will update here.
The solution is that on engine start we should make sure that InitVdsOnUp is executed immediately for the host that is running the HE VM.
Upgrading a host while it's UP violates vdsm's contract with the engine. Pushing out to 4.0, and we should rethink this as part of a holistic upgrade solution from HE's side.
This bug is marked for z-stream, yet the milestone is for a major version, therefore the milestone has been reset. Please set the correct milestone or drop the z stream flag.
Roy, can you please review the HE flow? We should not be upgrading a host in UP state.
@Liron isn't this the initVdsOnUp area?
ping
How risky will it be to fix this?
Liron, Roy, what is going on with this BZ? In comment 5 Liron commented: > The solution is that on engine start we should make sure that InitVdsOnUp > is executed immediately for the host that is running the HE VM. Then Roy asked in comment 9: > @Liron isn't this the initVdsOnUp area? I don't understand the decision here. Please clarify WHAT needs to be done, and in which part of the system. Pushing out until we get a proper answer to this question.
The logs show that an attempt to connect a domain after this single-host reboot fails with "Cannot find master domain:". During this time VDSM reports: lvm::375::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] [' Volume group "8c5ba712-5b3a-4114-852b-11bf9a6605be" not found', ' Cannot process volume group 8c5ba712-5b3a-4114-852b-11bf9a6605be'] After a few minutes the connection to that domain succeeds, once the VG is available. Why does it take time to see the VG on boot?
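The behavior in the logs — `vgs` failing repeatedly until the VG eventually appears — is what you would see from a poll-until-ready loop with a long deadline. A minimal sketch of such a loop (hypothetical helper names, not actual vdsm code; `check_vg` stands in for shelling out to LVM):

```python
import time


def wait_for_vg(vg_uuid, check_vg, timeout=300, interval=10):
    """Poll until the volume group becomes visible, or give up.

    check_vg is a callable that returns True once `vgs <vg_uuid>`
    would succeed; in a real implementation it would invoke LVM.
    With the defaults this retries every 10 seconds for up to
    5 minutes, matching the 2-5 minute delay seen in this bug.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check_vg(vg_uuid):
            return True
        time.sleep(interval)
    return False
```

The point of the sketch is that nothing is "stuck": the domain monitor simply keeps retrying until device discovery after boot makes the VG visible, which is where the 2-5 minutes of inactive-domain time comes from.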
rgolan, please see comment https://bugzilla.redhat.com/show_bug.cgi?id=1294348#c3 When a "regular" host is upgraded, it is first moved to maintenance mode, then upgraded, and then activated (InitVdsOnUp runs), so all the storage connections (and the other operations done in that flow) are established again. The HE host is rebooted while the engine still shows it as UP, so it takes time for the engine to recover the host (as expected). We need to make sure that on expected HE host reboots the InitVdsOnUp flow is executed again, so the host goes through the activation flow and the recovery time is avoided.
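The proposed fix can be sketched as follows. This is illustrative only — the names and the injected callbacks are hypothetical, and the real change would live in the engine's (Java) host-monitoring code, not in a standalone function like this:

```python
def on_engine_start(hosts, is_hosted_engine_host, init_vds_on_up):
    """On engine startup, immediately re-run host initialization for
    the host carrying the hosted-engine VM, instead of waiting for
    the normal recovery cycle to notice it was rebooted.

    hosts                 -- list of host records (dicts here, for brevity)
    is_hosted_engine_host -- predicate: does this host run the HE VM?
    init_vds_on_up        -- callback standing in for the InitVdsOnUp flow
    """
    initialized = []
    for host in hosts:
        # The HE host may have been rebooted (e.g. during an upgrade)
        # while the engine DB still records it as UP, so its storage
        # connections must be re-established explicitly.
        if host["status"] == "UP" and is_hosted_engine_host(host):
            init_vds_on_up(host)
            initialized.append(host["name"])
    return initialized
```

The design point is the trigger: instead of relying on the periodic recovery cycle to detect the stale connections, the activation flow is forced once, at engine start, for the one host the engine could not have observed going down.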
(In reply to Liron Aravot from comment #14) > rgolan, please see comment > https://bugzilla.redhat.com/show_bug.cgi?id=1294348#c3 > > When a "regular" host is upgraded, it is first moved to maintenance mode, > then upgraded, and then activated (InitVdsOnUp runs), so all the storage > connections (and the other operations done in that flow) are established again. > The HE host is rebooted while the engine still shows it as UP, so it takes > time for the engine to recover the host (as expected). > > We need to make sure that on expected HE host reboots the InitVdsOnUp flow > is executed again, so the host goes through the activation flow and the > recovery time is avoided. So, what's the action item here?
Check out the last part of the comment :) - > We need to make sure that on expected HE host reboots the InitVdsOnUp flow > is executed again, so the host goes through the activation flow and the > recovery time is avoided. As this was pending on needinfo? for a while, moving to the HE component.
Please take a look at the needinfo and figure out whether we have to do anything with this at all, or whether it should be handled by storage or infra.
What are the risks if we run InitVdsOnUp while the engine is in UP state? (on expected HE host reboots) It's a bit vague and the consequences aren't clear. In this particular upgrade use case, which doesn't happen often, is it worth adding this functionality instead of accepting 5 minutes of downtime?
(In reply to Yanir from comment #18) > What are the risks if we run InitVdsOnUp while the engine is in UP state? (on > expected HE host reboots) > it's a bit vague and the consequences aren't clear. > That flow runs whenever a host is activated/recovered, in order to initialize it. There are a few nits (like avoiding moving the HE host to NonOperational status) that we need to look at when running it for the HE host - if it turns out to be relevant, we can look into that. If we do decide we want to run it, we'll have to decide how to run it and how to handle those cases. > In this particular upgrade use case, which doesn't happen often, is it > worth adding this functionality instead of accepting 5 minutes of downtime? I'm not entirely sure - applying the fix will improve things, but I can't say whether the current situation is good enough to leave unhandled for now; fixing it will require some work.
Moving to 4.2 as we need to cooperate with the storage team on this.
I think the use case in this bug is not very interesting. With this configuration you have no HA, so downtime is not that important. If you want any HA you should have at least 3 hosts and more than one storage domain. Can we test this flow with a proper HE setup? Does it still take 2-5 minutes?
In the case of 2 or more hosts, I cannot see such a problem.
We won't be pursuing this issue any more. If someone needs it, patches are welcome.