Bug 1140430

Summary: Failure to Attach ISO domain causes SPM failover
Product: Red Hat Enterprise Virtualization Manager
Reporter: Marina Kalinin <mkalinin>
Component: ovirt-engine
Assignee: Federico Simoncelli <fsimonce>
Status: CLOSED ERRATA
QA Contact: Aharon Canan <acanan>
Severity: high
Docs Contact:
Priority: high
Version: 3.4.0
CC: acanan, adahms, amureini, bkorren, ecohen, fsimonce, gklein, gwatson, iheim, laravot, lpeer, rbalakri, Rhev-m-bugs, scohen, tnisan, yeylon
Target Milestone: ---
Keywords: ZStream
Target Release: 3.5.0
Hardware: All
OS: Linux
Whiteboard: storage
Fixed In Version: org.ovirt.engine-root-3.5.0-18
Doc Type: Bug Fix
Doc Text:
Previously, failing to attach an ISO domain caused the storage pool manager to fail over and start an attempt to select a new storage pool manager. This behavior could cause a failover storm if the domain itself was corrupted, leaving the system without a storage pool manager for a prolonged time. With this update, a failed attempt to attach an ISO domain to a data center triggers an error message, but does not cause the storage pool manager to fail over.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-02-11 18:09:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1157212    

Description Marina Kalinin 2014-09-11 00:41:53 UTC
//Setting as urgent, just because SPM failover can cause other severe failures.
//If you disagree, change to high.

Description of problem:
Failure to attach an unreachable ISO domain causes SPM failover.
This flow is incorrect: the SPM should report to the engine that it cannot mount the ISO domain, but there is no reason to initiate SPM failover.
Indeed, the error is reported, but then SpmStop is initiated:


engine.log:
~~~~~~~~~~~
[mku: issue Attach SD for ISO domain de6fe43b-e85c-48e2-82f8-38963d42fe4c]
2014-09-10 19:53:58,413 INFO  [org.ovirt.engine.core.bll.storage.AttachStorageDomainToPoolCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] Running command: AttachStorageDomainToPoolCommand internal: false
. Entities affected :  ID: de6fe43b-e85c-48e2-82f8-38963d42fe4c Type: Storage
2014-09-10 19:56:03,990 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] START, AttachStorageDomainVDSCommand( storagePoolId = 00000002-0002-0002-0002-00000000018a, ignoreFailoverLimit = false, storageDomainId = de6fe43b-e85c-48e2-82f8-38963d42fe4c), log id: 5ce92c8f
[mku: fails to attach domain, since unreachable from the host]
2014-09-10 19:56:05,066 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] Failed in AttachStorageDomainVDS method
2014-09-10 19:56:05,130 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] IrsBroker::Failed::AttachStorageDomainVDS due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to AttachStorageDomainVDS, error = Storage domain does not exist: ('de6fe43b-e85c-48e2-82f8-38963d42fe4c',), code = 358
[mku: SPM failover??? why?]
2014-09-10 19:56:05,187 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] START, SpmStopVDSCommand(HostName = rhel34-mku, HostId = 49b33266-8770-45e7-904d-bd704a648e72, storagePoolId = 00000002-0002-0002-0002-00000000018a), log id: 7743ee06
2014-09-10 19:56:05,266 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] SpmStopVDSCommand::Stopping SPM on vds rhel34-mku, pool id 00000002-0002-0002-0002-00000000018a
2014-09-10 19:56:05,406 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] FINISH, SpmStopVDSCommand, log id: 7743ee06
2014-09-10 19:56:05,406 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] Irs placed on server 49b33266-8770-45e7-904d-bd704a648e72 failed. Proceed Failover

vdsm.log:
~~~~~~~~~
Thread-39::INFO::2014-09-10 19:53:59,392::logUtils::44::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'connection': 'rhevm34-mku-lab.usersys.redhat.com:/var/lib/exports/iso', 'iqn': '', 'user': '', 'tpgt': '', 'password': '******', 'id': '8fb87ff9-957e-490d-ac60-46ab5bb4e85d', 'port': ''}], options=None)
Thread-39::DEBUG::2014-09-10 19:53:59,410::mount::202::Storage.Misc.excCmd::(_runcmd) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 rhevm34-mku-lab.usersys.redhat.com:/var/lib/exports/iso /rhev/data-center/mnt/rhevm34-mku-lab.usersys.redhat.com:_var_lib_exports_iso' (cwd None)

Thread-39::ERROR::2014-09-10 19:56:04,492::storageServer::209::StorageServer.MountConnection::(connect) Mount failed: (32, ';mount.nfs: Connection timed out\n')
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storageServer.py", line 207, in connect
    self._mount.mount(self.options, self._vfsType)
  File "/usr/share/vdsm/storage/mount.py", line 198, in mount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 214, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (32, ';mount.nfs: Connection timed out\n')

Thread-38::INFO::2014-09-10 19:56:04,624::logUtils::44::dispatcher::(wrapper) Run and protect: attachStorageDomain(sdUUID='de6fe43b-e85c-48e2-82f8-38963d42fe4c', spUUID='00000002-0002-0002-0002-00000000018a', options=None)
Thread-38::INFO::2014-09-10 19:56:04,630::sp::869::Storage.StoragePool::(attachSD) sdUUID=de6fe43b-e85c-48e2-82f8-38963d42fe4c spUUID=00000002-0002-0002-0002-00000000018a

Thread-38::ERROR::2014-09-10 19:56:05,577::sdc::143::Storage.StorageDomainCache::(_findDomain) domain de6fe43b-e85c-48e2-82f8-38963d42fe4c not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 171, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('de6fe43b-e85c-48e2-82f8-38963d42fe4c',)
Thread-38::ERROR::2014-09-10 19:56:05,652::dispatcher::65::Storage.Dispatcher.Protect::(run) {'status': {'message': "Storage domain does not exist: ('de6fe43b-e85c-48e2-82f8-38963d42fe4c',)", 'code': 358}}
Thread-39::DEBUG::2014-09-10 19:56:05,760::BindingXMLRPC::251::vds::(wrapper) client [10.10.178.220] flowID [2d32dff6]
Thread-39::INFO::2014-09-10 19:56:05,761::logUtils::44::dispatcher::(wrapper) Run and protect: getSpmStatus(spUUID='00000002-0002-0002-0002-00000000018a', options=None)
Thread-39::INFO::2014-09-10 19:56:05,762::logUtils::47::dispatcher::(wrapper) Run and protect: getSpmStatus, Return response: {'spm_st': {'spmId': 1, 'spmStatus': 'SPM', 'spmLver': 13638}}
Thread-39::DEBUG::2014-09-10 19:56:05,762::task::1185::TaskManager.Task::(prepare) Task=`1fd160db-c69c-47fc-ab5e-729991450fd0`::finished: {'spm_st': {'spmId': 1, 'spmStatus': 'SPM', 'spmLver': 13638}}

Thread-39::INFO::2014-09-10 19:56:05,889::logUtils::44::dispatcher::(wrapper) Run and protect: spmStop(spUUID='00000002-0002-0002-0002-00000000018a', options=None)
Thread-39::DEBUG::2014-09-10 19:56:05,890::resourceManager::198::ResourceManager.Request::(__init__) ResName=`Storage.00000002-0002-0002-0002-00000000018a`ReqID=`1db2a323-37b5-4d42-b726-20b4e2bec1aa`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '616' at 'spmStop'


Version-Release number of selected component (if applicable):
3.4.2 (and before)

How reproducible:
100%

Steps to Reproduce:
1. Detach ISO domain from the Data Center.
2. Block NFS traffic from nfs servers providing the ISO domain.
3. Try attaching the ISO domain back to the Data Center.

Actual results:
The operation fails with an error in Events:
Failed to attach Storage Domain ISO_DOMAIN to Data Center Default. (User: admin)
SPM role moved to another host in the Data Center.

Expected results:
The error should be reported, as it is now.
But the SPM role should not fail over.

Comment 2 Federico Simoncelli 2014-09-12 14:26:56 UTC
This happens only when "acquire" succeeds and "renew" fails. The positive flow is usually quite fast (a few seconds), but if the ISO domain becomes unreachable between acquire and renew, the host is fenced.

So if you try to attach an ISO domain that is not reachable, you usually don't get fenced (unless, as I said, the first write is successful).
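The two failure points described above can be sketched as a tiny decision model. This is illustrative Python only, with hypothetical names; the real logic lives in vdsm's SPM lease handling:

```python
def attach_outcome(acquire_ok: bool, renew_ok: bool) -> str:
    """Model the two failure points in the attach flow (sketch only)."""
    if not acquire_ok:
        # Domain unreachable from the start: attach fails cleanly,
        # the host is not fenced.
        return "attach-error"
    if not renew_ok:
        # Domain became unreachable between acquire and renew:
        # the lease cannot be renewed and the host is fenced.
        return "fenced"
    return "attached"
```

The narrow window between acquire and renew is why an unreachable ISO domain only sometimes ends in fencing.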

Comment 7 Marina Kalinin 2014-09-15 20:45:16 UTC
Federico and all,
This is an engine bug (probably). It is IrsBroker that initiates the failover. However, why is the exception returned by vdsm an IRSNoMasterDomainException?

This is what is happening in engine.log:
~~~
2014-09-15 16:36:28,408 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (org.ovirt.thread.pool-4-thread-27) [1df625e] Failed in AttachStorageDomainVDS method
2014-09-15 16:36:28,497 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-27) [1df625e] IrsBroker::Failed::AttachStorageDomainVDS due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to AttachStorageDomainVDS, error = Storage domain does not exist: ('de6fe43b-e85c-48e2-82f8-38963d42fe4c',), code = 358
2014-09-15 16:36:28,550 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-27) [1df625e] START, SpmStopVDSCommand(HostName = rhel34-mku, HostId = 49b33266-8770-45e7-904d-bd704a648e72, storagePoolId = 00000002-0002-0002-0002-00000000018a), log id: 2c06f55d
2014-09-15 16:36:28,627 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-27) [1df625e] SpmStopVDSCommand::Stopping SPM on vds rhel34-mku, pool id 00000002-0002-0002-0002-00000000018a
2014-09-15 16:36:28,748 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-27) [1df625e] FINISH, SpmStopVDSCommand, log id: 2c06f55d
2014-09-15 16:36:28,748 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-27) [1df625e] Irs placed on server 49b33266-8770-45e7-904d-bd704a648e72 failed. Proceed Failover
~~~


And this is the piece of IrsBroker exception handler that is happening here:
~~~
            catch (IRSNoMasterDomainException ex) {
                getVDSReturnValue().setExceptionString(ex.toString());
                getVDSReturnValue().setExceptionObject(ex);
                getVDSReturnValue().setVdsError(ex.getVdsError());
                log.errorFormat("IrsBroker::Failed::{0}", getCommandName());
                log.errorFormat("Exception: {0}", ex.getMessage());

                if ((ex.getVdsError() == null || ex.getVdsError().getCode() != VdcBllErrors.StoragePoolWrongMaster)
                        && getCurrentIrsProxyData().getHasVdssForSpmSelection()) {
                    failover();
                } else {
                    isStartReconst

~~~
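Stripped of the plumbing, the failover decision in that handler reduces to one condition. A minimal Python sketch (the numeric value below is a placeholder, not the real engine constant; the actual check uses VdcBllErrors.StoragePoolWrongMaster):

```python
STORAGE_POOL_WRONG_MASTER = 324  # hypothetical stand-in for VdcBllErrors.StoragePoolWrongMaster

def should_failover(vds_error_code, has_vdss_for_spm_selection):
    # Fail over unless vdsm reported "wrong master", and only if other
    # hosts are eligible for SPM selection.
    return ((vds_error_code is None
             or vds_error_code != STORAGE_POOL_WRONG_MASTER)
            and has_vdss_for_spm_selection)
```

With code 358 ("Storage domain does not exist") returned for the unreachable ISO domain, this condition holds, which is why SpmStop is issued.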

Comment 8 Marina Kalinin 2014-09-15 21:26:38 UTC
Looking into AttachStorageDomainVDSCommand.java, we see how we form the exception:
~~~
    @Override
    protected VDSExceptionBase createDefaultConcreteException(String errorMessage) {
        StorageDomain domainFromDb = DbFacade.getInstance().getStorageDomainDao().get(getParameters().getStorageDomainId());
        if (domainFromDb == null || domainFromDb.getStorageDomainType() == StorageDomainType.ImportExport) {
            return new IrsOperationFailedNoFailoverException(errorMessage);
        }
        return super.createDefaultConcreteException(errorMessage);
    }
~~~

Why aren't we checking for the ISO domain type as well? ISO exists as an option in businessentities/StorageDomainType.java:

~~~
public enum StorageDomainType {

    Master,
    Data,
    ISO,
    ImportExport,
    Image,
    Unknown;
...
~~~

This still sounds like a backend bug to me.
There is no reason to proceed with SPM failover if the ISO domain is inaccessible.
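The suggested fix can be sketched as extending the no-failover check in createDefaultConcreteException to cover ISO domains as well as ImportExport. This is an illustrative Python stand-in for that Java method; the exception class names are from the engine, the rest is a sketch:

```python
# Domain types whose failed attach should NOT trigger SPM failover.
NO_FAILOVER_DOMAIN_TYPES = {"ISO", "ImportExport"}

def exception_for_failed_attach(domain_type):
    """Return the name of the exception class the engine should raise.

    domain_type is None when the domain is not yet in the engine DB.
    """
    if domain_type is None or domain_type in NO_FAILOVER_DOMAIN_TYPES:
        # Report the error, but do not trigger SPM failover.
        return "IrsOperationFailedNoFailoverException"
    return "IRSErrorException"
```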

Comment 13 Aharon Canan 2014-10-30 08:17:34 UTC
Verified using vt8.

No SPM failover.

From the event log: "Failed to attach Storage Domain rhevm-3-iso-lion to Data Center Default. (User: admin)"

Comment 14 Tal Nisan 2014-11-26 11:35:34 UTC
The doc text was copied from the 3.4.4 clone bug #1157212

Comment 16 errata-xmlrpc 2015-02-11 18:09:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0158.html