Setting as urgent, because SPM failover can cause other severe failures; if you disagree, change to high.

Description of problem:
Failure to attach an unreachable ISO domain causes SPM failover. This flow is incorrect: the SPM should report to the engine that it cannot mount the ISO domain, but there is no reason to initiate SPM failover. Indeed, the error is reported, but then SpmStop is initiated.

engine.log:
~~~~~~~~~~~
[mku: issue Attach SD for ISO domain de6fe43b-e85c-48e2-82f8-38963d42fe4c]
2014-09-10 19:53:58,413 INFO [org.ovirt.engine.core.bll.storage.AttachStorageDomainToPoolCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] Running command: AttachStorageDomainToPoolCommand internal: false. Entities affected : ID: de6fe43b-e85c-48e2-82f8-38963d42fe4c Type: Storage
2014-09-10 19:56:03,990 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] START, AttachStorageDomainVDSCommand( storagePoolId = 00000002-0002-0002-0002-00000000018a, ignoreFailoverLimit = false, storageDomainId = de6fe43b-e85c-48e2-82f8-38963d42fe4c), log id: 5ce92c8f
[mku: fails to attach the domain, since it is unreachable from the host]
2014-09-10 19:56:05,066 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] Failed in AttachStorageDomainVDS method
2014-09-10 19:56:05,130 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] IrsBroker::Failed::AttachStorageDomainVDS due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to AttachStorageDomainVDS, error = Storage domain does not exist: ('de6fe43b-e85c-48e2-82f8-38963d42fe4c',), code = 358
[mku: SPM failover??? why?]
2014-09-10 19:56:05,187 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] START, SpmStopVDSCommand(HostName = rhel34-mku, HostId = 49b33266-8770-45e7-904d-bd704a648e72, storagePoolId = 00000002-0002-0002-0002-00000000018a), log id: 7743ee06
2014-09-10 19:56:05,266 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] SpmStopVDSCommand::Stopping SPM on vds rhel34-mku, pool id 00000002-0002-0002-0002-00000000018a
2014-09-10 19:56:05,406 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] FINISH, SpmStopVDSCommand, log id: 7743ee06
2014-09-10 19:56:05,406 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] Irs placed on server 49b33266-8770-45e7-904d-bd704a648e72 failed. Proceed Failover
~~~~~~~~~~~

vdsm.log:
~~~~~~~~~
Thread-39::INFO::2014-09-10 19:53:59,392::logUtils::44::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'connection': 'rhevm34-mku-lab.usersys.redhat.com:/var/lib/exports/iso', 'iqn': '', 'user': '', 'tpgt': '', 'password': '******', 'id': '8fb87ff9-957e-490d-ac60-46ab5bb4e85d', 'port': ''}], options=None)
Thread-39::DEBUG::2014-09-10 19:53:59,410::mount::202::Storage.Misc.excCmd::(_runcmd) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 rhevm34-mku-lab.usersys.redhat.com:/var/lib/exports/iso /rhev/data-center/mnt/rhevm34-mku-lab.usersys.redhat.com:_var_lib_exports_iso' (cwd None)
Thread-39::ERROR::2014-09-10 19:56:04,492::storageServer::209::StorageServer.MountConnection::(connect) Mount failed: (32, ';mount.nfs: Connection timed out\n')
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storageServer.py", line 207, in connect
    self._mount.mount(self.options, self._vfsType)
  File "/usr/share/vdsm/storage/mount.py", line 198, in mount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 214, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (32, ';mount.nfs: Connection timed out\n')
Thread-38::INFO::2014-09-10 19:56:04,624::logUtils::44::dispatcher::(wrapper) Run and protect: attachStorageDomain(sdUUID='de6fe43b-e85c-48e2-82f8-38963d42fe4c', spUUID='00000002-0002-0002-0002-00000000018a', options=None)
Thread-38::INFO::2014-09-10 19:56:04,630::sp::869::Storage.StoragePool::(attachSD) sdUUID=de6fe43b-e85c-48e2-82f8-38963d42fe4c spUUID=00000002-0002-0002-0002-00000000018a
Thread-38::ERROR::2014-09-10 19:56:05,577::sdc::143::Storage.StorageDomainCache::(_findDomain) domain de6fe43b-e85c-48e2-82f8-38963d42fe4c not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 171, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('de6fe43b-e85c-48e2-82f8-38963d42fe4c',)
Thread-38::ERROR::2014-09-10 19:56:05,652::dispatcher::65::Storage.Dispatcher.Protect::(run) {'status': {'message': "Storage domain does not exist: ('de6fe43b-e85c-48e2-82f8-38963d42fe4c',)", 'code': 358}}
Thread-39::DEBUG::2014-09-10 19:56:05,760::BindingXMLRPC::251::vds::(wrapper) client [10.10.178.220] flowID [2d32dff6]
Thread-39::INFO::2014-09-10 19:56:05,761::logUtils::44::dispatcher::(wrapper) Run and protect: getSpmStatus(spUUID='00000002-0002-0002-0002-00000000018a', options=None)
Thread-39::INFO::2014-09-10 19:56:05,762::logUtils::47::dispatcher::(wrapper) Run and protect: getSpmStatus, Return response: {'spm_st': {'spmId': 1, 'spmStatus': 'SPM', 'spmLver': 13638}}
Thread-39::DEBUG::2014-09-10 19:56:05,762::task::1185::TaskManager.Task::(prepare) Task=`1fd160db-c69c-47fc-ab5e-729991450fd0`::finished: {'spm_st': {'spmId': 1, 'spmStatus': 'SPM', 'spmLver': 13638}}
Thread-39::INFO::2014-09-10 19:56:05,889::logUtils::44::dispatcher::(wrapper) Run and protect: spmStop(spUUID='00000002-0002-0002-0002-00000000018a', options=None)
Thread-39::DEBUG::2014-09-10 19:56:05,890::resourceManager::198::ResourceManager.Request::(__init__) ResName=`Storage.00000002-0002-0002-0002-00000000018a`ReqID=`1db2a323-37b5-4d42-b726-20b4e2bec1aa`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '616' at 'spmStop'
~~~~~~~~~

Version-Release number of selected component (if applicable):
3.4.2 (and earlier)

How reproducible:
100%

Steps to Reproduce:
1. Detach the ISO domain from the Data Center.
2. Block NFS traffic from the NFS server providing the ISO domain.
3. Try attaching the ISO domain back to the Data Center.

Actual results:
The operation fails with an error in Events: "Failed to attach Storage Domain ISO_DOMAIN to Data Center Default. (User: admin)". The SPM role moves to another host in the Data Center.

Expected results:
The error should be reported as it is now, but the SPM role should not fail over.
This happens only when "acquire" succeeds and "renew" fails. The positive flow is usually quite fast (a few seconds), but if the ISO domain becomes unreachable between acquire and renew, the host is fenced. So if you try to attach an ISO domain that is not reachable, you usually don't get fenced (unless, as I said, the first write is successful).
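The timing described above can be restated as a toy model (illustrative only; all names here are made up for this sketch and are not sanlock's or VDSM's real API): fencing happens only in the narrow window where the domain becomes unreachable after a successful acquire but before the first renew.

```java
// Toy model of the acquire/renew window described above.
// Illustrative sketch only: real lease handling is done by sanlock.
class LeaseWindow {
    interface Reachability { boolean reachable(); }

    static String attachOutcome(Reachability atAcquire, Reachability atRenew) {
        if (!atAcquire.reachable()) {
            // The usual unreachable-ISO case: the attach just fails.
            return "attach fails, error reported, no fencing";
        }
        if (!atRenew.reachable()) {
            // Domain vanished between acquire and the first renew.
            return "renew fails, host fenced";
        }
        return "attached";
    }
}
```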
Federico and all,

This is an engine bug (probably). It is IrsBroker that initiates the failover. However, why is the exception returned by VDSM an IRSNoMasterDomainException? This is what is happening in engine.log:

~~~
2014-09-15 16:36:28,408 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (org.ovirt.thread.pool-4-thread-27) [1df625e] Failed in AttachStorageDomainVDS method
2014-09-15 16:36:28,497 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-27) [1df625e] IrsBroker::Failed::AttachStorageDomainVDS due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to AttachStorageDomainVDS, error = Storage domain does not exist: ('de6fe43b-e85c-48e2-82f8-38963d42fe4c',), code = 358
2014-09-15 16:36:28,550 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-27) [1df625e] START, SpmStopVDSCommand(HostName = rhel34-mku, HostId = 49b33266-8770-45e7-904d-bd704a648e72, storagePoolId = 00000002-0002-0002-0002-00000000018a), log id: 2c06f55d
2014-09-15 16:36:28,627 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-27) [1df625e] SpmStopVDSCommand::Stopping SPM on vds rhel34-mku, pool id 00000002-0002-0002-0002-00000000018a
2014-09-15 16:36:28,748 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-27) [1df625e] FINISH, SpmStopVDSCommand, log id: 2c06f55d
2014-09-15 16:36:28,748 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-27) [1df625e] Irs placed on server 49b33266-8770-45e7-904d-bd704a648e72 failed. Proceed Failover
~~~

And this is the piece of the IrsBroker exception handler that runs here:

~~~
catch (IRSNoMasterDomainException ex) {
    getVDSReturnValue().setExceptionString(ex.toString());
    getVDSReturnValue().setExceptionObject(ex);
    getVDSReturnValue().setVdsError(ex.getVdsError());
    log.errorFormat("IrsBroker::Failed::{0}", getCommandName());
    log.errorFormat("Exception: {0}", ex.getMessage());
    if ((ex.getVdsError() == null
            || ex.getVdsError().getCode() != VdcBllErrors.StoragePoolWrongMaster)
            && getCurrentIrsProxyData().getHasVdssForSpmSelection()) {
        failover();
    } else {
        isStartReconst
~~~
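The condition in that handler can be restated as a small predicate (a sketch, not the engine's actual code; the constant below is a placeholder for `VdcBllErrors.StoragePoolWrongMaster`, whose real value lives in the engine's error enum): failover() runs unless the VDS error code is StoragePoolWrongMaster, or there is no other host eligible for SPM selection.

```java
// Sketch of the failover decision made in the handler above.
// Illustrative only; the real logic is in IrsBrokerCommand.
class FailoverDecision {
    // Placeholder for VdcBllErrors.StoragePoolWrongMaster (value is made up).
    static final int STORAGE_POOL_WRONG_MASTER = -1;

    // A null code models ex.getVdsError() == null.
    static boolean shouldFailover(Integer vdsErrorCode, boolean hasVdssForSpmSelection) {
        boolean notWrongMaster =
                vdsErrorCode == null || vdsErrorCode != STORAGE_POOL_WRONG_MASTER;
        return notWrongMaster && hasVdssForSpmSelection;
    }
}
```

With the error seen in this bug (code 358, StorageDomainDoesNotExist) the predicate is true, which is why failover proceeds even though the failure concerns only the ISO domain.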
Looking into AttachStorageDomainVDSCommand.java, we see how the exception is formed:

~~~
@Override
protected VDSExceptionBase createDefaultConcreteException(String errorMessage) {
    StorageDomain domainFromDb =
            DbFacade.getInstance().getStorageDomainDao().get(getParameters().getStorageDomainId());
    if (domainFromDb == null
            || domainFromDb.getStorageDomainType() == StorageDomainType.ImportExport) {
        return new IrsOperationFailedNoFailoverException(errorMessage);
    }
    return super.createDefaultConcreteException(errorMessage);
}
~~~

Why aren't we checking for the ISO domain as well? ISO exists as an option in businessentities/StorageDomainType.java:

~~~
public enum StorageDomainType {
    Master,
    Data,
    ISO,
    ImportExport,
    Image,
    Unknown;
    ...
~~~

To me this still sounds like a backend bug. There is no reason to proceed with SPM failover if an ISO domain is inaccessible.
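A minimal sketch of the change this comment suggests (illustrative only, not the actual engine patch): treat ISO domains the same way as ImportExport domains, so that an attach failure maps to IrsOperationFailedNoFailoverException instead of triggering SPM failover.

```java
// Illustrative sketch of the proposed type check, not the real engine code.
enum StorageDomainType { Master, Data, ISO, ImportExport, Image, Unknown }

class AttachExceptionPolicy {
    // True when an attach failure for this domain type should map to a
    // no-failover exception (i.e. the SPM role must stay where it is).
    static boolean noFailover(StorageDomainType type) {
        return type == StorageDomainType.ImportExport
                || type == StorageDomainType.ISO;   // the proposed addition
    }
}
```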
Verified using vt8. No SPM failover; the event log shows: "Failed to attach Storage Domain rhevm-3-iso-lion to Data Center Default. (User: admin)"
The doc text was copied from the 3.4.4 clone bug #1157212
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0158.html