Bug 1140430 - Failure to Attach ISO domain causes SPM failover
Summary: Failure to Attach ISO domain causes SPM failover
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.4.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.5.0
Assignee: Federico Simoncelli
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On:
Blocks: 1157212
 
Reported: 2014-09-11 00:41 UTC by Marina Kalinin
Modified: 2019-05-20 11:20 UTC (History)
16 users

Fixed In Version: org.ovirt.engine-root-3.5.0-18
Doc Type: Bug Fix
Doc Text:
Previously, failing to attach an ISO domain caused the Storage Pool Manager (SPM) to fail over and an attempt to select a new SPM to begin. If the domain itself was corrupted, this could cause a failover storm, leaving the system without an SPM for a prolonged time. With this update, a failed attempt to attach an ISO domain to a data center triggers an error message but does not cause the SPM to fail over.
Clone Of:
: 1157212 (view as bug list)
Environment:
Last Closed: 2015-02-11 18:09:03 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 1203833 0 None None None Never
Red Hat Product Errata RHSA-2015:0158 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Virtualization Manager 3.5.0 2015-02-11 22:38:50 UTC
oVirt gerrit 33290 0 master MERGED vdsbroker: no spm failover on attach iso domain failure 2021-02-03 19:09:12 UTC
oVirt gerrit 33665 0 ovirt-engine-3.5 MERGED vdsbroker: no spm failover on attach iso domain failure 2021-02-03 19:09:12 UTC

Description Marina Kalinin 2014-09-11 00:41:53 UTC
// Setting as urgent, because SPM failover can cause other severe failures.
// If you disagree, change to high.

Description of problem:
Failure to attach an unreachable ISO domain causes SPM failover.
This flow is incorrect: the SPM should report to the engine that it cannot mount the ISO domain, but there is no reason to initiate an SPM failover.
Indeed, the error is reported, but then SpmStop is initiated:


engine.log:
~~~~~~~~~~~
[mku: issue Attach SD for ISO domain de6fe43b-e85c-48e2-82f8-38963d42fe4c]
2014-09-10 19:53:58,413 INFO  [org.ovirt.engine.core.bll.storage.AttachStorageDomainToPoolCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] Running command: AttachStorageDomainToPoolCommand internal: false
. Entities affected :  ID: de6fe43b-e85c-48e2-82f8-38963d42fe4c Type: Storage
2014-09-10 19:56:03,990 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] START, AttachStorageDomainVDSCommand( storagePoolId = 00000002-0002-0002-0002-00000000018a, ignoreFailoverLimit = false, storageDomainId = de6fe43b-e85c-48e2-82f8-38963d42fe4c), log id: 5ce92c8f
[mku: fails to attach domain, since unreachable from the host]
2014-09-10 19:56:05,066 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] Failed in AttachStorageDomainVDS method
2014-09-10 19:56:05,130 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] IrsBroker::Failed::AttachStorageDomainVDS due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to AttachStorageDomainVDS, error = Storage domain does not exist: ('de6fe43b-e85c-48e2-82f8-38963d42fe4c',), code = 358
[mku: SPM failover??? why?]
2014-09-10 19:56:05,187 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] START, SpmStopVDSCommand(HostName = rhel34-mku, HostId = 49b33266-8770-45e7-904d-bd704a648e72, storagePoolId = 00000002-0002-0002-0002-00000000018a), log id: 7743ee06
2014-09-10 19:56:05,266 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] SpmStopVDSCommand::Stopping SPM on vds rhel34-mku, pool id 00000002-0002-0002-0002-00000000018a
2014-09-10 19:56:05,406 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] FINISH, SpmStopVDSCommand, log id: 7743ee06
2014-09-10 19:56:05,406 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-41) [2d32dff6] Irs placed on server 49b33266-8770-45e7-904d-bd704a648e72 failed. Proceed Failover

vdsm.log:
~~~~~~~~~
Thread-39::INFO::2014-09-10 19:53:59,392::logUtils::44::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'connection': 'rhevm34-mku-
lab.usersys.redhat.com:/var/lib/exports/iso', 'iqn': '', 'user': '', 'tpgt': '', 'password': '******', 'id': '8fb87ff9-957e-490d-ac60-46ab5bb4e85d', 'port': ''}], options=None)
Thread-39::DEBUG::2014-09-10 19:53:59,410::mount::202::Storage.Misc.excCmd::(_runcmd) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 rhevm34-mku-lab.usersys.redhat.com:/va
r/lib/exports/iso /rhev/data-center/mnt/rhevm34-mku-lab.usersys.redhat.com:_var_lib_exports_iso' (cwd None)

Thread-39::ERROR::2014-09-10 19:56:04,492::storageServer::209::StorageServer.MountConnection::(connect) Mount failed: (32, ';mount.nfs: Connection timed out\n')
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storageServer.py", line 207, in connect
    self._mount.mount(self.options, self._vfsType)
  File "/usr/share/vdsm/storage/mount.py", line 198, in mount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 214, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (32, ';mount.nfs: Connection timed out\n')

Thread-38::INFO::2014-09-10 19:56:04,624::logUtils::44::dispatcher::(wrapper) Run and protect: attachStorageDomain(sdUUID='de6fe43b-e85c-48e2-82f8-38963d42fe4c', spUUID='00000002-0002-0002-0002-00000000018a', options=None)
Thread-38::INFO::2014-09-10 19:56:04,630::sp::869::Storage.StoragePool::(attachSD) sdUUID=de6fe43b-e85c-48e2-82f8-38963d42fe4c spUUID=00000002-0002-0002-0002-00000000018a

Thread-38::ERROR::2014-09-10 19:56:05,577::sdc::143::Storage.StorageDomainCache::(_findDomain) domain de6fe43b-e85c-48e2-82f8-38963d42fe4c not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 171, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('de6fe43b-e85c-48e2-82f8-38963d42fe4c',)
Thread-38::ERROR::2014-09-10 19:56:05,652::dispatcher::65::Storage.Dispatcher.Protect::(run) {'status': {'message': "Storage domain does not exist: ('de6fe43b-e85c-48e2-82f8-38963d42fe4c',)", 'code': 358}}
Thread-39::DEBUG::2014-09-10 19:56:05,760::BindingXMLRPC::251::vds::(wrapper) client [10.10.178.220] flowID [2d32dff6]
Thread-39::INFO::2014-09-10 19:56:05,761::logUtils::44::dispatcher::(wrapper) Run and protect: getSpmStatus(spUUID='00000002-0002-0002-0002-00000000018a', options=None)
Thread-39::INFO::2014-09-10 19:56:05,762::logUtils::47::dispatcher::(wrapper) Run and protect: getSpmStatus, Return response: {'spm_st': {'spmId': 1, 'spmStatus': 'SPM', 'spmLver': 13638}}
Thread-39::DEBUG::2014-09-10 19:56:05,762::task::1185::TaskManager.Task::(prepare) Task=`1fd160db-c69c-47fc-ab5e-729991450fd0`::finished: {'spm_st': {'spmId': 1, 'spmStatus': 'SPM', 'spmLver': 13638}}

Thread-39::INFO::2014-09-10 19:56:05,889::logUtils::44::dispatcher::(wrapper) Run and protect: spmStop(spUUID='00000002-0002-0002-0002-00000000018a', options=None)
Thread-39::DEBUG::2014-09-10 19:56:05,890::resourceManager::198::ResourceManager.Request::(__init__) ResName=`Storage.00000002-0002-0002-0002-00000000018a`ReqID=`1db2a323-37b5-4d42-b726-20b4e2bec1aa`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '616' at 'spmStop'


Version-Release number of selected component (if applicable):
3.4.2 (and before)

How reproducible:
100%

Steps to Reproduce:
1. Detach ISO domain from the Data Center.
2. Block NFS traffic from the NFS server providing the ISO domain.
3. Try attaching the ISO domain back to the Data Center.

Actual results:
The operation fails with error in Events:
Failed to attach Storage Domain ISO_DOMAIN to Data Center Default. (User: admin)
SPM role moved to another host in the Data Center.

Expected results:
The error should be reported, as it is now, but the SPM role should not fail over.

Comment 2 Federico Simoncelli 2014-09-12 14:26:56 UTC
This happens only when "acquire" succeeds and "renew" fails. The positive flow is usually quite fast (a few seconds), but if the ISO domain becomes unreachable between acquire and renew, then the host is fenced.

So if you try to attach an ISO domain that is not reachable, you usually don't get fenced (unless, as I said, the first write is successful).

Comment 7 Marina Kalinin 2014-09-15 20:45:16 UTC
Federico and all,
This is probably an engine bug. It is IrsBroker that initiates the failover. However, why is the exception returned by vdsm an IRSNoMasterDomainException?

This is what is happening in engine.log:
~~~
2014-09-15 16:36:28,408 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (org.ovirt.thread.pool-4-thread-27) [1df625e] Failed in AttachStorageDomainVDS method
2014-09-15 16:36:28,497 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-27) [1df625e] IrsBroker::Failed::AttachStorageDomainVDS due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to AttachStorageDomainVDS, error = Storage domain does not exist: ('de6fe43b-e85c-48e2-82f8-38963d42fe4c',), code = 358
2014-09-15 16:36:28,550 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-27) [1df625e] START, SpmStopVDSCommand(HostName = rhel34-mku, HostId = 49b33266-8770-45e7-904d-bd704a648e72, storagePoolId = 00000002-0002-0002-0002-00000000018a), log id: 2c06f55d
2014-09-15 16:36:28,627 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-27) [1df625e] SpmStopVDSCommand::Stopping SPM on vds rhel34-mku, pool id 00000002-0002-0002-0002-00000000018a
2014-09-15 16:36:28,748 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-27) [1df625e] FINISH, SpmStopVDSCommand, log id: 2c06f55d
2014-09-15 16:36:28,748 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-27) [1df625e] Irs placed on server 49b33266-8770-45e7-904d-bd704a648e72 failed. Proceed Failover
~~~


And this is the piece of the IrsBroker exception handler that is hit here:
~~~
            catch (IRSNoMasterDomainException ex) {
                getVDSReturnValue().setExceptionString(ex.toString());
                getVDSReturnValue().setExceptionObject(ex);
                getVDSReturnValue().setVdsError(ex.getVdsError());
                log.errorFormat("IrsBroker::Failed::{0}", getCommandName());
                log.errorFormat("Exception: {0}", ex.getMessage());

                if ((ex.getVdsError() == null || ex.getVdsError().getCode() != VdcBllErrors.StoragePoolWrongMaster)
                        && getCurrentIrsProxyData().getHasVdssForSpmSelection()) {
                    failover();
                } else {
                    isStartReconst...

~~~
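The guard quoted above decides whether failover() runs. A compact, hypothetical restatement of that condition (the class, constant value, and method names here are illustrative stand-ins, not the actual oVirt engine code):

```java
// Hypothetical restatement of the guard in the IRSNoMasterDomainException
// handler quoted above: failover is attempted unless the VDSM error is
// "wrong master" or no other host is eligible for SPM selection.
class FailoverGateSketch {

    // Illustrative value; the real constant is VdcBllErrors.StoragePoolWrongMaster.
    static final int STORAGE_POOL_WRONG_MASTER = 324;

    static boolean shouldFailover(Integer vdsErrorCode, boolean hasVdssForSpmSelection) {
        // A null vdsError or any code other than "wrong master" still fails over,
        // which is why code 358 (StorageDomainDoesNotExist) triggers it here.
        boolean wrongMaster = vdsErrorCode != null && vdsErrorCode == STORAGE_POOL_WRONG_MASTER;
        return !wrongMaster && hasVdssForSpmSelection;
    }
}
```

This makes the reported behavior visible: error code 358 from the failed ISO attach falls into the failover branch even though nothing is wrong with the SPM itself.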

Comment 8 Marina Kalinin 2014-09-15 21:26:38 UTC
Looking into AttachStorageDomainVDSCommand.java, we see how we form the exception:
~~~
    @Override
    protected VDSExceptionBase createDefaultConcreteException(String errorMessage) {
        StorageDomain domainFromDb = DbFacade.getInstance().getStorageDomainDao().get(getParameters().getStorageDomainId());
        if (domainFromDb == null || domainFromDb.getStorageDomainType() == StorageDomainType.ImportExport) {
            return new IrsOperationFailedNoFailoverException(errorMessage);
        }
        return super.createDefaultConcreteException(errorMessage);
    }
~~~

Why aren't we checking for ISO domain as well? ISO Domain exists as an option in businessentities/StorageDomainType.java:

~~~
public enum StorageDomainType {

    Master,
    Data,
    ISO,
    ImportExport,
    Image,
    Unknown;
...
~~~

To me this still sounds like a backend bug.
There is no reason to proceed with an SPM failover if the ISO domain is inaccessible.
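The merged gerrit patches ("vdsbroker: no spm failover on attach iso domain failure") take this direction. A minimal, self-contained sketch of the idea — simplified stand-in types, not the actual engine classes — in which ISO domains join ImportExport in mapping to a no-failover exception:

```java
// Simplified sketch of the proposed check; the real engine code lives in
// AttachStorageDomainVDSCommand.createDefaultConcreteException(). All types
// here are illustrative stand-ins, not the actual oVirt classes.
class AttachFailureSketch {

    enum StorageDomainType { Master, Data, ISO, ImportExport, Image, Unknown }

    static class VDSExceptionBase extends RuntimeException {
        VDSExceptionBase(String msg) { super(msg); }
    }

    // Marker type: the IrsBroker error handler skips failover() for this.
    static class IrsOperationFailedNoFailoverException extends VDSExceptionBase {
        IrsOperationFailedNoFailoverException(String msg) { super(msg); }
    }

    // Before the fix only a missing domain or an ImportExport domain took the
    // early-return branch; the fix adds ISO to the same branch.
    static VDSExceptionBase createConcreteException(StorageDomainType type, String errorMessage) {
        if (type == null
                || type == StorageDomainType.ImportExport
                || type == StorageDomainType.ISO) {
            return new IrsOperationFailedNoFailoverException(errorMessage);
        }
        return new VDSExceptionBase(errorMessage);
    }

    // Mirrors the IrsBroker handler: fail over only for exceptions that are
    // not explicitly marked as no-failover.
    static boolean shouldFailover(VDSExceptionBase ex) {
        return !(ex instanceof IrsOperationFailedNoFailoverException);
    }
}
```

With this shape, a failed attachStorageDomain on an ISO domain still surfaces the error to the user, but the current SPM stays in place.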

Comment 13 Aharon Canan 2014-10-30 08:17:34 UTC
Verified using vt8.

No SPM failover.

From the event log: "Failed to attach Storage Domain rhevm-3-iso-lion to Data Center Default. (User: admin)"

Comment 14 Tal Nisan 2014-11-26 11:35:34 UTC
The doc text was copied from the 3.4.4 clone bug #1157212

Comment 16 errata-xmlrpc 2015-02-11 18:09:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0158.html

