Created attachment 709016 [details]
logs

Description of problem:
It seems that when host #1 is contending to become SPM and the second host fails to connect to the pool (since there is no SPM yet), the engine will send a reconstruct command. In this case the reconstruct fails because the domain is Locked (on CanDoAction), but if we have more than two hosts I think we can have a race which will cause a wrong master version.

Version-Release number of selected component (if applicable):
sf10
vdsm-4.10.2-11.0.el6ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. In a two-host cluster with 1 NFS domain, put the HSM host and the master domain in maintenance

Actual results:
We are sending reconstruct while there is no SPM yet when activating the host:

2013-03-12 04:01:37,146 WARN  [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (pool-3-thread-47) [40598f79] CanDoAction of action ReconstructMasterDomain failed. Reasons:VAR__ACTION__RECONSTRUCT_MASTER,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_STORAGE_DOMAIN_STATUS_ILLEGAL2,$status Locked

Expected results:
We should not send reconstruct - it can cause a wrong master domain or version.

Additional info:
logs
What happens here is:

1. We have 2 hosts, one in maintenance, one host up; the master domain is in maintenance.

2. We activate the master domain; the domain moves to Locked and SPM election starts:

2013-03-12 04:01:23,921 INFO  [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (ajp-/127.0.0.1:8702-2) Lock Acquired to object EngineLock [exclusiveLocks= key: b37a4aef-09dd-4a4e-ae1e-a3bdb12c4ba5 value: STORAGE
2013-03-12 04:01:23,980 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (pool-3-thread-49) [44e5d289] START, ConnectStorageServerVDSCommand( HostName = gold-vdsd, HostId = 83834e1f-9e60-41b5-a9cc-16460a8a2fe2, storagePoolId = 00000000-0000-0000-0000-000000000000, storageType = NFS, connectionList = [{ id: 7c95f6ba-088d-48b2-ba29-16f0eef40115, connection: orion.qa.lab.tlv.redhat.com:/export/Dafna/data32, iqn: null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 5814b864
2013-03-12 04:01:24,093 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (pool-3-thread-48) [6056cb9d] START, ActivateStorageDomainVDSCommand( storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, ignoreFailoverLimit = false, compatabilityVersion = null, storageDomainId = b37a4aef-09dd-4a4e-ae1e-a3bdb12c4ba5), log id: 38285687
2013-03-12 04:01:28,010 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-3-thread-48) [69ec80bf] starting spm on vds gold-vdsd, storage pool NFS, prevId -1, LVER 1

3. We activate the second host; InitVdsOnUp runs. As the domain is Locked, no storage server connection occurs (connect isn't done for Locked domains):

2013-03-12 04:01:28,867 INFO  [org.ovirt.engine.core.bll.ActivateVdsCommand] (pool-3-thread-49) [5176bb26] Running command: ActivateVdsCommand internal: false. 
Entities affected : ID: 2982e993-2ca5-42bb-86ed-8db10986c47e Type: VDS
2013-03-12 04:01:31,605 INFO  [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (QuartzScheduler_Worker-52) [1a1b7976] Running command: InitVdsOnUpCommand internal: true.
2013-03-12 04:01:31,610 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (QuartzScheduler_Worker-52) [5764c26a] Running command: ConnectHostToStoragePoolServersCommand internal: true. Entities affected : ID: 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7 Type: StoragePool
2013-03-12 04:01:31,622 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (pool-3-thread-47) START, ConnectStoragePoolVDSCommand(HostName = gold-vdsc, HostId = 2982e993-2ca5-42bb-86ed-8db10986c47e, storagePoolId = 5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, vds_spm_id = 1, masterDomainId = b37a4aef-09dd-4a4e-ae1e-a3bdb12c4ba5, masterVersion = 1), log id: 38e8264a

4. Connect storage pool fails, as connect storage server wasn't done:

2013-03-12 04:01:37,142 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-3-thread-47) Command org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand return value StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=304, mMessage=Cannot find master domain: 'spUUID=5e1e9d7a-ba64-48cd-84b1-a7e3e67829b7, msdUUID=b37a4aef-09dd-4a4e-ae1e-a3bdb12c4ba5']]

5. Reconstruct is attempted from the host; the reconstruct fails.

6. The host moves to Non Operational:

2013-03-12 04:01:37,159 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-52) [47a4d5e2] Running command: SetNonOperationalVdsCommand internal: true. 
Entities affected : ID: 2982e993-2ca5-42bb-86ed-8db10986c47e Type: VDS
2013-03-12 04:01:37,161 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-52) [47a4d5e2] START, SetVdsStatusVDSCommand(HostName = gold-vdsc, HostId = 2982e993-2ca5-42bb-86ed-8db10986c47e, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 7e46c6e6

7. The host is recovered by host auto-recovery after a few minutes:

2013-03-12 04:05:00,003 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-31) Autorecovering 1 hosts
2013-03-12 04:05:00,003 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-31) Autorecovering hosts id: 2982e993-2ca5-42bb-86ed-8db10986c47e, name : gold-vdsc

Basically the issue resembles bug 917576; possibly we can perform connect storage server also for domains that are Locked, to avoid such a situation in most cases.
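The mitigation floated above (connecting storage servers for Locked domains too) can be sketched as a small policy change. This is a hypothetical illustration only - the class, enum, and method names below are made up and are not the actual oVirt engine code:

```java
// Hypothetical sketch of the suggested mitigation: also run connect storage
// server for Locked (mid-activation) domains, so a host activated during
// master activation can still reach the master domain.
public class ConnectPolicy {
    enum DomainStatus { Active, Locked, Maintenance, Inactive }

    // Behavior seen in the logs above: only Active domains get a storage
    // server connection, so the newly activated host reaches
    // ConnectStoragePool unconnected and fails with "Cannot find master domain".
    static boolean connectCurrent(DomainStatus status) {
        return status == DomainStatus.Active;
    }

    // Suggested relaxation: treat a Locked domain as connectable as well.
    static boolean connectProposed(DomainStatus status) {
        return status == DomainStatus.Active || status == DomainStatus.Locked;
    }
}
```

Under the current policy a Locked master yields no connection; under the proposed one it does, which would avoid the failing ConnectStoragePool in most cases (at the cost of the races discussed in the next comment).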
Sorry, didn't complete - we can perform the connect also for a Locked domain, but I don't want it to cause other issues and possible races (like connecting a host to the pool while deactivating the master, for example) - Allon/Ayal, how do we want to proceed with this?
Liron, reconstruct in InitVdsOnUp should only happen if there are no other hosts connected to the pool. There should never be a case where reconstruct is sent while other hosts are already connected to the pool.
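The invariant stated here can be sketched roughly as follows - hypothetical names, not the real InitVdsOnUp code:

```java
// Hypothetical sketch of the invariant above: attempt reconstruct during
// InitVdsOnUp only when no *other* host is already connected to the pool.
// Names are illustrative, not the actual engine API.
public class ReconstructOnInitVds {
    record Host(String id, boolean connectedToPool) {}

    static boolean shouldAttemptReconstruct(String activatingHostId, Host... hosts) {
        for (Host h : hosts) {
            // Another connected host means the pool's state is already being
            // reported; sending reconstruct now risks a wrong master version.
            if (!h.id().equals(activatingHostId) && h.connectedToPool()) {
                return false;
            }
        }
        return true;
    }
}
```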
I added here a patch that solves and prevents the bug - reconstruct will now happen only if the domain is not Active/Locked/Maintenance, so we won't hit it. The other issue here is that the host can move to Non Operational when the domain is Locked/being activated/etc. and the host is activated, because of a race - this is a general issue that should be handled in a larger scope in a different bug. We could solve it by connecting to Locked domains as well, or by adding a mutual lock between the connect and the domain flows (which I didn't mention before because I'm not really a fan of it, and I don't know how much we care about that race) - regardless, a host that moved to Non Operational should be recovered by the host auto-recovery. To sum it up: the provided patch solves the reconstruct issue; if there's another issue we want to solve, that's another BZ.
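The status guard this patch describes can be sketched as below. This is a minimal illustration with made-up names - the actual change lives in the engine's ReconstructMasterDomainCommand validation, not in this class:

```java
import java.util.EnumSet;
import java.util.Set;

// Minimal sketch of the guard described above: reconstruct is refused while
// the master domain is Active (healthy), Locked (another flow owns it), or
// in Maintenance (deliberately down). Names are illustrative.
public class ReconstructGuard {
    enum StorageDomainStatus { Active, Locked, Maintenance, Inactive, Unknown }

    private static final Set<StorageDomainStatus> BLOCKING =
            EnumSet.of(StorageDomainStatus.Active,
                       StorageDomainStatus.Locked,
                       StorageDomainStatus.Maintenance);

    static boolean canReconstruct(StorageDomainStatus masterStatus) {
        return !BLOCKING.contains(masterStatus);
    }
}
```

With this guard, the scenario in the description (reconstruct attempted while the master is Locked during SPM election) is filtered out before any command is sent.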
Tested on sf16. There was no CanDoAction failure this time. My host became Non Operational with "wrong master domain or version" while the domains remained in "Active" state, and when I activate the host it becomes Non Operational; when I added the second host, it too became Non Operational. Setup: 1 host with 2 domains. I put the host in maintenance and deactivated the master domain. When activating the host we get a "wrong master domain or version" error and we cannot recover. Full logs will be attached.
Created attachment 747244 [details] logs
This is a race between all hosts (single host in this case) going down (user moving host to maintenance) and manual deactivation of the master domain.
Checked on 3.3 (is5) with one domain in maintenance, one active host, and one host in maintenance:
- activated the domain
- SPM election began
- activated the second host
- the second host failed to connect to the pool

Barak, is that the desirable behavior?
Continuing my previous comment:
- the second host failed to connect to the pool and became Non Operational
- after a few minutes, the host became Up after auto-recovery
Sounds OK to me. Barak - am I missing anything?
(In reply to Allon Mureinik from comment #14) > Sounds OK to me. > Barak - am I missing anything? Allon, can I mark as verified?
(In reply to Elad from comment #16) > (In reply to Allon Mureinik from comment #14) > > Sounds OK to me. > > Barak - am I missing anything? > > Allon, can I mark as verified? Please do.
Verified on 3.3 (is5); results are according to my comments #12 and #13.
This bug is currently attached to errata RHEA-2013:15231. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag. Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information: * Cause: What actions or circumstances cause this bug to present. * Consequence: What happens when the bug presents. * Fix: What was done to fix the bug. * Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore') Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug. For further details on the Cause, Consequence, Fix, Result format please refer to: https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes Thanks in advance.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2014-0038.html