Bug 1001637 - [engine-backend] engine sends ActiveStorageDomain to vdsm even though ConnectStorageServer failed on host
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.3.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 3.4.0
Assignee: Liron Aravot
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On:
Blocks:
Reported: 2013-08-27 12:41 UTC by Elad
Modified: 2016-02-10 17:03 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-09-15 15:54:21 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
logs (693.32 KB, application/x-gzip)
2013-08-27 12:41 UTC, Elad
no flags

Description Elad 2013-08-27 12:41:45 UTC
Created attachment 790952
logs

Description of problem:
When the storage server is inaccessible and vdsm fails to perform connectStorageServer, the engine proceeds with the storage domain activation flow and sends ActivateStorageDomain to vdsm. If the master domain is active, ActivateStorageDomain succeeds and the inaccessible domain is reported as active (a false positive).
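The reported flow can be modeled with a minimal sketch. Everything in it is hypothetical: `Host`, `connect_storage_server`, `activate_storage_domain`, and `activate_flow_as_reported` are illustrative names, not ovirt-engine or vdsm APIs, and the activation step is simplified to assume, as this report describes, that it succeeds whenever the master domain is active.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical stand-in for a vdsm host; not the real vdsm API."""

    reachable_servers: set
    active_domains: set = field(default_factory=set)

    def connect_storage_server(self, server: str) -> bool:
        # Simplified connectStorageServer: succeeds only if the server is reachable.
        return server in self.reachable_servers

    def activate_storage_domain(self, domain: str, master_active: bool) -> bool:
        # Assumption taken from this report: with the master domain active,
        # activation succeeds without checking that the domain's server is reachable.
        if master_active:
            self.active_domains.add(domain)
            return True
        return False


def activate_flow_as_reported(host: Host, domain: str, server: str, master_active: bool) -> bool:
    """Hypothetical model of the engine flow described in this bug."""
    if not host.connect_storage_server(server):
        # The failure is logged, but the flow continues anyway.
        print(f"connectStorageServer to {server} failed; continuing")
    return host.activate_storage_domain(domain, master_active)


if __name__ == "__main__":
    host = Host(reachable_servers={"master.example"})
    ok = activate_flow_as_reported(host, "sd-non-master", "blocked.example", master_active=True)
    # Prints True and the domain is marked active although its server is blocked.
    print(ok, host.active_domains)
```

The connect result never influences the outcome, which is why the blocked domain ends up reported as active.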

Version-Release number of selected component (if applicable):
rhevm-3.3.0-0.16.master.el6ev.noarch
vdsm-4.12.0-72.git287bb7e.el6ev.x86_64


How reproducible:
100%

Steps to Reproduce:
On a file-based storage pool with more than one storage domain (SD), where the SDs reside on different storage servers:
1) Put the non-master domain into maintenance.
2) Block connectivity from all hosts in the cluster to the non-master storage server (the one in maintenance).
3) Activate the domain.

Actual results:
ConnectStorageServer fails on vdsm:

Thread-1316::ERROR::2013-08-27 14:36:17,248::storageServer::209::StorageServer.MountConnection::(connect) Mount failed: (32, ';mount.nfs: Operation not permitted\n')
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storageServer.py", line 207, in connect
    self._mount.mount(self.options, self._vfsType)
  File "/usr/share/vdsm/storage/mount.py", line 222, in mount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 238, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (32, ';mount.nfs: Operation not permitted\n')
Thread-1316::ERROR::2013-08-27 14:36:17,250::hsm::2367::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2364, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 215, in connect
    raise e
MountError: (32, ';mount.nfs: Operation not permitted\n')



And on the engine:

2013-08-27 14:35:10,102 ERROR [org.ovirt.engine.core.bll.storage.POSIXFSStorageHelper] (pool-5-thread-50) The connection with details lion.qa.lab:/export/elad/elad5 failed because of error code 477 and error message is: problem while trying to mount target
2013-08-27 14:35:10,105 ERROR [org.ovirt.engine.core.bll.storage.ConnectStorageToVdsCommand] (pool-5-thread-50) Transaction rolled-back for command: org.ovirt.engine.core.bll.storage.ConnectStorageToVdsCommand.

Even though ConnectStorageServer failed, the engine proceeds with ActivateStorageDomain:

2013-08-27 14:35:25,196 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (pool-5-thread-50) [6404f91d] START, ActivateStorageDomainVDSCommand( storagePoolId = 7a93c0d1-1316-40e2-b946-3180c3415007, ignoreFailoverLimit = false, storageDomainId = 66ae8355-db6a-4b17-a0a5-71d462946344), log id: 4b5ace97

The activation completes successfully and the domain is reported as 'Active'. This happens because the master domain is active.


Expected results:
The engine should fail the flow and not send ActivateStorageDomain to the host.
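The expected behavior can be sketched as a guard that aborts before the activation step. This is a hypothetical illustration only: `StorageConnectionError` and `activate_domain_guarded` are invented names, and the two callables stand in for the ConnectStorageServer and ActivateStorageDomain verbs; it is not how ovirt-engine is structured, and Comment 2 explains why the engine deliberately does not do this.

```python
class StorageConnectionError(Exception):
    """Hypothetical error type; not an actual ovirt-engine exception."""


def activate_domain_guarded(connect_storage_server, activate_storage_domain, domain, server):
    """Fail the activation flow when the connection step fails.

    Both callables are hypothetical stand-ins for the ConnectStorageServer
    and ActivateStorageDomain verbs; each returns True on success.
    """
    if not connect_storage_server(server):
        # Stop here: ActivateStorageDomain is never sent to the host.
        raise StorageConnectionError(
            f"connectStorageServer to {server} failed; not activating {domain}"
        )
    return activate_storage_domain(domain)


if __name__ == "__main__":
    sent = []

    def record_activation(domain):
        # Records that an activation request would have been sent.
        sent.append(domain)
        return True

    try:
        activate_domain_guarded(lambda server: False, record_activation, "sd2", "blocked.example")
    except StorageConnectionError as err:
        print(err)
    print(sent)
```

With several hosts, a real guard would also need a policy for partial failures (for example, failing only when no host connected); this sketch covers the single-host case from the report.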

Additional info:
logs

Comment 1 Elad 2013-08-27 23:41:35 UTC
***End on engine = And on engine***

Comment 2 Ayal Baron 2013-09-15 15:54:21 UTC
The engine ignores the result of connectStorageServer in most (all?) cases, since in many cases the following operation can still succeed, and it is not worth trying to identify ahead of time which operations would succeed and which would not.
Also, once we get rid of the pool there will be no 'activate' operation so this is doubly not interesting.

