Bug 920694 - engine: deactivating the master domain and concurrently putting all hosts in maintenance leaves hosts non-op upon activation
Summary: engine: deactivating the master domain and concurrently putting all hosts in maintenance leaves hosts non-op upon activation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.3.0
Assignee: Liron Aravot
QA Contact: Elad
URL:
Whiteboard: storage
Depends On:
Blocks: 948448
 
Reported: 2013-03-12 14:54 UTC by Dafna Ron
Modified: 2018-12-02 15:56 UTC
CC: 13 users

Fixed In Version: is2
Doc Type: Bug Fix
Doc Text:
Previously, deactivating the master domain and concurrently putting all hosts in maintenance left the hosts non-operational upon activation. With this update, a host that runs through InitVdsOnUp does not attempt to reconstruct. In case of a failure during ConnectStoragePool, the host fails in initializeStorage only if the master domain is not in an inactive or unknown status and the exception was not an XmlRpcRunTimeException (see the sketch after this field list).
Clone Of:
Environment:
Last Closed: 2014-01-21 17:15:06 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
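For illustration only (all names hypothetical, not the actual engine code): a minimal sketch of the failure-handling rule in the Doc Text above, assuming a simple domain-status enum and a stand-in exception type for the XML-RPC transport failure.

// Hypothetical sketch of the Doc Text rule: a ConnectStoragePool failure fails
// host initialization only when the master domain is NOT inactive/unknown AND
// the error is NOT a transport-level (XML-RPC runtime) failure.
public class StorageInitSketch {

    enum DomainStatus { ACTIVE, INACTIVE, UNKNOWN, LOCKED, MAINTENANCE }

    // Stand-in for a transport-level failure such as XmlRpcRunTimeException.
    static class TransportException extends RuntimeException { }

    static boolean shouldFailInitialization(DomainStatus masterStatus, RuntimeException error) {
        boolean masterInactiveOrUnknown =
                masterStatus == DomainStatus.INACTIVE || masterStatus == DomainStatus.UNKNOWN;
        boolean transportFailure = error instanceof TransportException;
        // Fail initializeStorage only if neither condition explains the failure.
        return !masterInactiveOrUnknown && !transportFailure;
    }

    public static void main(String[] args) {
        // Master domain inactive (e.g. it was just deactivated): do not fail the host.
        System.out.println(shouldFailInitialization(
                DomainStatus.INACTIVE, new RuntimeException("304: Cannot find master domain")));
        // Master domain active and a real storage error: fail initialization.
        System.out.println(shouldFailInitialization(
                DomainStatus.ACTIVE, new RuntimeException("storage error")));
    }
}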


Attachments (Terms of Use)
logs (1.05 MB, application/x-gzip), 2013-03-12 14:54 UTC, Dafna Ron
logs (701.57 KB, application/x-gzip), 2013-05-13 14:23 UTC, Dafna Ron


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2014:0038 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Virtualization Manager 3.3.0 update 2014-01-21 22:03:06 UTC
oVirt gerrit 13709 0 None None None Never
oVirt gerrit 14767 0 None None None Never

Description Dafna Ron 2013-03-12 14:54:20 UTC
Created attachment 709016 [details]
logs

Description of problem:

It seems that when host #1 is contending to become SPM and the second host fails to connect to the pool (since there is no SPM yet), the engine will send a reconstruct command.
In this case the reconstruct fails since the domain is locked
(on CanDoAction), but if we have more than two hosts I think we can have a race which will cause a wrong master version.

Version-Release number of selected component (if applicable):

sf10
vdsm-4.10.2-11.0.el6ev.x86_64

How reproducible:

100%

Steps to Reproduce:
1. In a two-host cluster with 1 NFS domain, put the HSM host and the master domain in maintenance
  
Actual results:

We send a reconstruct while there is no SPM yet when activating the host:

2013-03-12 04:01:37,146 WARN  [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-3-thread-47) [40598f79] CanDoAction of action ReconstructMasterDomain failed. Reasons:VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_STORAGE_DOMAIN_STATUS_ILLEGAL2,$status Locked


Expected results:

We should not send a reconstruct - it can cause a wrong master domain or version.

Additional info: logs

Comment 2 Liron Aravot 2013-03-14 09:40:15 UTC
What happens here is:
1. We have 2 hosts, one in maintenance, one host up; the master domain is in maintenance.
2. We activate the master domain; the domain moves to Locked and SPM election starts:
2013-03-12 04:01:23,921 INFO  [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (ajp-/127.0.0.1:8702-2) Lock Acquired to object EngineLock [exclusiveLocks= key: b37a4aef-09dd-4a4e-ae1e-a3bdb12c4ba5 value: STORAGE

2013-03-12 04:01:23,980 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (pool-3-thread-49) [44e5d289] START, ConnectStorageServerVDSCommand(HostName = gold-vdsd, HostId = 83834e1f-9e60-41b5-a9cc-16460a8a2fe2, storagePoolId = 00000000-0000-0000-0000-000000000000, storageType = NFS, connectionList = [{ id: 7c95f6ba-088d-48b2-ba29-16f0eef40115, connection: orion.qa.lab.tlv.redhat.com:/export/Dafna/data32, iqn: null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 5814b864

2013-03-12 04:01:24,093 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (pool-3-thread-48) [6056cb9d] START, ActivateStorageDomainVDSCommand( storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, ignoreFailoverLimit = false, compatabilityVersion = null, storageDomainId = b37a4aef-09dd-4a4e-ae1e-a3bdb12c4ba5), log id: 38285687


2013-03-12 04:01:28,010 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-3-thread-48) [69ec80bf] starting spm on vds gold-vdsd, storage pool NFS, prevId -1, LVER 1


3. We activate the second host and InitVdsOnUp runs. As the domain is locked, no storage server
connection occurs (connect isn't done for locked domains).

2013-03-12 04:01:28,867 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (pool-3-thread-49) [5176bb26] Running command: ActivateVdsCommand internal: false. Entities affected :  ID: 2982e993-2ca5-42bb-86ed-8db10986c47e Type: VDS


2013-03-12 04:01:31,605 INFO  [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (QuartzScheduler_Worker-52) [1a1b7976] Running command: InitVdsOnUpCommand internal: true.
2013-03-12 04:01:31,610 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (QuartzScheduler_Worker-52) [5764c26a] Running command: ConnectHostToStoragePoolServersCommand internal: true. Entities affected :  ID: 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7 Type: StoragePool
2013-03-12 04:01:31,622 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (pool-3-thread-47) START, ConnectStoragePoolVDSCommand(HostName = gold-vdsc, HostId = 2982e993-2ca5-42bb-86ed-8db10986c47e, storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, vds_spm_id = 1, masterDomainId = b37a4aef-09dd-4a4e-ae1e-a3bdb12c4ba5, masterVersion = 1), log id: 38e8264a



4. ConnectStoragePool failed because ConnectStorageServer wasn't done:

2013-03-12 04:01:37,142 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-3-thread-47) Command org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand return value 
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=304, mMessage=Cannot find master domain: 'spUUID=5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, msdUUID=b37a4aef-09dd-4a4e-ae1e-a3bdb12c4ba5']]


5. Reconstruct is attempted from the host; the reconstruct fails.

6. The host moves to non-operational:
2013-03-12 04:01:37,159 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-52) [47a4d5e2] Running command: SetNonOperationalVdsCommand internal: true. Entities affected :  ID: 2982e993-2ca5-42bb-86ed-8db10986c47e Type: VDS
2013-03-12 04:01:37,161 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-52) [47a4d5e2] START, SetVdsStatusVDSCommand(HostName = gold-vdsc, HostId = 2982e993-2ca5-42bb-86ed-8db10986c47e, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 7e46c6e6


7. The host is recovered by the host auto-recovery after a few minutes:
2013-03-12 04:05:00,003 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-31) Autorecovering 1 hosts
2013-03-12 04:05:00,003 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-31) Autorecovering hosts id: 2982e993-2ca5-42bb-86ed-8db10986c47e, name : gold-vdsc




Basically the issue resembles bug 917576; possibly we can also perform connect storage server for domains that are locked, to avoid such a situation in most cases.
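For illustration only (hypothetical names, plain modern Java, not the engine code): a minimal sketch of the gap walked through above. If the connect step skips Locked domains, a host activated while the master is still being activated gets no storage server connection, so the subsequent ConnectStoragePool fails with "Cannot find master domain"; including Locked domains in the filter is the idea floated here.

// Hypothetical sketch: which domains get a ConnectStorageServer call before
// ConnectStoragePool. Skipping LOCKED domains reproduces the failure above;
// including them would let the host mount the master while it is being activated.
import java.util.List;
import java.util.stream.Collectors;

public class ConnectFilterSketch {

    enum DomainStatus { ACTIVE, LOCKED, INACTIVE, MAINTENANCE }

    record Domain(String id, DomainStatus status) { }

    static List<Domain> domainsToConnect(List<Domain> domains, boolean includeLocked) {
        return domains.stream()
                .filter(d -> d.status() == DomainStatus.ACTIVE
                        || (includeLocked && d.status() == DomainStatus.LOCKED))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Domain> pool = List.of(new Domain("master", DomainStatus.LOCKED));
        // Current behavior sketched in this comment: nothing is connected,
        // so ConnectStoragePool later fails with "Cannot find master domain".
        System.out.println(domainsToConnect(pool, false));
        // Suggested alternative: the locked master is connected first.
        System.out.println(domainsToConnect(pool, true));
    }
}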

Comment 3 Liron Aravot 2013-03-18 16:18:22 UTC
Sorry, I didn't complete my thought - we can perform the connect also for a locked domain, but I don't want it to cause other issues and possible races (like connecting a host to the pool while deactivating the master, for example). Allon/Ayal, how do we want to proceed with this?

Comment 4 Ayal Baron 2013-03-19 21:42:54 UTC
Liron, reconstruct in InitVdsOnUp should only happen if there are no other hosts connected to the pool.
There should never be a case where reconstruct is sent while other hosts are already connected to the pool.
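For illustration only (hypothetical names, not the engine code): a minimal sketch of the rule stated above, attempting reconstruct from InitVdsOnUp only when no other host in the pool is already Up.

// Hypothetical sketch of the "no other connected hosts" guard described above.
import java.util.List;

public class ReconstructGuardSketch {

    enum HostStatus { UP, MAINTENANCE, NON_OPERATIONAL, INSTALLING }

    record Host(String name, HostStatus status) { }

    static boolean mayTriggerReconstruct(Host activatingHost, List<Host> poolHosts) {
        // True only if no *other* host in the pool is currently Up / connected.
        return poolHosts.stream()
                .filter(h -> !h.name().equals(activatingHost.name()))
                .noneMatch(h -> h.status() == HostStatus.UP);
    }

    public static void main(String[] args) {
        Host activating = new Host("gold-vdsc", HostStatus.INSTALLING);
        List<Host> pool = List.of(activating, new Host("gold-vdsd", HostStatus.UP));
        // gold-vdsd is already Up, so this host should not trigger a reconstruct.
        System.out.println(mayTriggerReconstruct(activating, pool)); // false
    }
}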

Comment 5 Liron Aravot 2013-04-10 07:11:28 UTC
I added a patch here that solves and prevents the bug - reconstruct will now happen only if the domain is not active/locked/maintenance, so we won't hit it.

The other issue here is that the host can move to non-operational when the domain is locked/being activated/etc. and the host is activated, because of a race. This is a general issue that should be handled in a larger scope in a different bug - we can solve it by connecting to locked domains as well, or by adding a mutual lock between the connect and the domain flows (which I didn't mention before because I'm not really a fan of it and I don't know how much we care about that race). Regardless, a host that moved to non-op should be recovered by the host auto-recovery.

To sum it up - the provided patch solves the reconstruct issue; if there's another issue that we want to solve, that's another BZ.
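For illustration only (hypothetical names, not the actual patch): a minimal sketch of the status guard described in this comment, skipping reconstruct whenever the master domain is Active, Locked, or in Maintenance.

// Hypothetical sketch of the master-domain status guard described above.
import java.util.EnumSet;
import java.util.Set;

public class MasterStatusGuardSketch {

    enum DomainStatus { ACTIVE, LOCKED, MAINTENANCE, INACTIVE, UNKNOWN }

    private static final Set<DomainStatus> SKIP_RECONSTRUCT =
            EnumSet.of(DomainStatus.ACTIVE, DomainStatus.LOCKED, DomainStatus.MAINTENANCE);

    static boolean shouldAttemptReconstruct(DomainStatus masterStatus) {
        // Reconstruct only when the master is in none of the "do not touch" states.
        return !SKIP_RECONSTRUCT.contains(masterStatus);
    }

    public static void main(String[] args) {
        System.out.println(shouldAttemptReconstruct(DomainStatus.LOCKED));   // false: domain is being activated
        System.out.println(shouldAttemptReconstruct(DomainStatus.INACTIVE)); // true
    }
}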

Comment 6 Dafna Ron 2013-05-13 14:22:51 UTC
tested on sf16.

There was no CanDoAction failure this time.
My host became non-operational with "wrong master domain or version" while the domains remained in "active" state, and when I activate the host it becomes non-operational again.
When I added a second host, it too became non-operational.

Setup: 1 host with 2 domains.
I put the host in maintenance and deactivated the master domain.
When activating the host we get a wrong master domain or version and we cannot recover.



full logs will be attached.

Comment 7 Dafna Ron 2013-05-13 14:23:18 UTC
Created attachment 747244 [details]
logs

Comment 8 Ayal Baron 2013-05-14 05:06:41 UTC
This is a race between all hosts (single host in this case) going down (user moving host to maintenance) and manual deactivation of the master domain.

Comment 12 Elad 2013-07-15 11:51:29 UTC
Checked on 3.3 (is5)

With one domain in maintenance, one active host and one host in maintenance:
- activated the domain
- SPM election began
- activated the second host
- the second host failed to connect to the pool

Barak, is that the desirable behavior?

Comment 13 Elad 2013-07-15 11:55:18 UTC
Continuing my previous comment:
- the second host failed to connect to the pool and became 'non-operational'
- after a few minutes, the host became 'UP' after auto-recovery

Comment 14 Allon Mureinik 2013-07-15 12:25:21 UTC
Sounds OK to me.
Barak - am I missing anything?

Comment 16 Elad 2013-08-08 14:20:02 UTC
(In reply to Allon Mureinik from comment #14)
> Sounds OK to me.
> Barak - am I missing anything?

Allon, can I mark as verified?

Comment 17 Allon Mureinik 2013-08-08 14:41:06 UTC
(In reply to Elad from comment #16)
> (In reply to Allon Mureinik from comment #14)
> > Sounds OK to me.
> > Barak - am I missing anything?
> 
> Allon, can I mark as verified?
Please do.

Comment 18 Elad 2013-08-21 11:20:25 UTC
Verified on 3.3 (is5); results are as described in my comments #12 and #13.

Comment 19 Charlie 2013-11-28 00:25:01 UTC
This bug is currently attached to errata RHEA-2013:15231. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 21 errata-xmlrpc 2014-01-21 17:15:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0038.html

