Bug 1001637

Summary: [engine-backend] engine sends ActivateStorageDomain to vdsm even though ConnectStorageServer failed on host
Product: Red Hat Enterprise Virtualization Manager Reporter: Elad <ebenahar>
Component: ovirt-engine Assignee: Liron Aravot <laravot>
Status: CLOSED WONTFIX QA Contact: Aharon Canan <acanan>
Severity: low Docs Contact:
Priority: unspecified    
Version: 3.3.0 CC: abaron, acathrow, amureini, iheim, lpeer, Rhev-m-bugs, scohen, yeylon
Target Milestone: --- Keywords: Triaged
Target Release: 3.4.0   
Hardware: x86_64   
OS: Unspecified   
Whiteboard: storage
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-09-15 15:54:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
logs none

Description Elad 2013-08-27 12:41:45 UTC
Created attachment 790952
logs

Description of problem:
When the storage server is inaccessible and vdsm fails to perform connectStorageServer, the engine still proceeds with the storage domain activation flow and sends ActivateStorageDomain to vdsm. If the master domain is active, ActivateStorageDomain succeeds and the inaccessible domain is reported as active (false positive).

Version-Release number of selected component (if applicable):
rhevm-3.3.0-0.16.master.el6ev.noarch
vdsm-4.12.0-72.git287bb7e.el6ev.x86_64


How reproducible:
100%

Steps to Reproduce:
On a file pool with more than one storage domain (SD), each on a different storage server:
1) Move the non-master domain to maintenance
2) Block connectivity to the non-master storage server (the one whose domain is in maintenance) from all hosts in the cluster
3) Activate the domain

Actual results:
ConnectStorageServer fails on vdsm:

Thread-1316::ERROR::2013-08-27 14:36:17,248::storageServer::209::StorageServer.MountConnection::(connect) Mount failed: (32, ';mount.nfs: Operation not permitted\n')
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storageServer.py", line 207, in connect
    self._mount.mount(self.options, self._vfsType)
  File "/usr/share/vdsm/storage/mount.py", line 222, in mount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 238, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (32, ';mount.nfs: Operation not permitted\n')
Thread-1316::ERROR::2013-08-27 14:36:17,250::hsm::2367::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2364, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 215, in connect
    raise e
MountError: (32, ';mount.nfs: Operation not permitted\n')



End on engine:

2013-08-27 14:35:10,102 ERROR [org.ovirt.engine.core.bll.storage.POSIXFSStorageHelper] (pool-5-thread-50) The connection with details lion.qa.lab:/export/elad/elad5 failed because of error code 477 and error message is: problem while trying to mount target
2013-08-27 14:35:10,105 ERROR [org.ovirt.engine.core.bll.storage.ConnectStorageToVdsCommand] (pool-5-thread-50) Transaction rolled-back for command: org.ovirt.engine.core.bll.storage.ConnectStorageToVdsCommand.

Even though ConnectStorageServer failed, the engine proceeds with ActivateStorageDomain:

2013-08-27 14:35:25,196 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ActivateStorageDomainVDSCommand] (pool-5-thread-50) [6404f91d] START, ActivateStorageDomainVDSCommand( storagePoolId = 7a93c0d1-1316-40e2-b946-3180c3415007, ignoreFailoverLimit = false, storageDomainId = 66ae8355-db6a-4b17-a0a5-71d462946344), log id: 4b5ace97

The activation ends successfully and the domain is reported as 'Active'. This happens because the master domain is active.
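
To spot this sequence in an engine log, something like the following could be used. It is only a rough sketch: it assumes the log line layout shown in the excerpts above (timestamp, level, logger class in brackets, thread name in parentheses), and the function name is made up for illustration. It pairs a ConnectStorageToVdsCommand ERROR with a later ActivateStorageDomainVDSCommand START on the same thread.

```python
import re

# Leading timestamp, level, logger class, and thread name of an engine log line
# (layout assumed from the excerpts in this report).
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>\w+)\s+\[(?P<logger>[\w.]+)\] \((?P<thread>[^)]+)\)"
)


def find_activation_after_failed_connect(lines):
    """Return (connect_failure_ts, activate_start_ts, thread) tuples.

    Hypothetical helper: flags an activation START that follows a
    connect ERROR logged by the same thread.
    """
    failed = {}  # thread name -> timestamp of its last connect failure
    hits = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # traceback lines and other continuation lines
        logger = m.group("logger").rsplit(".", 1)[-1]
        thread = m.group("thread")
        if logger == "ConnectStorageToVdsCommand" and m.group("level") == "ERROR":
            failed[thread] = m.group("ts")
        elif (
            logger == "ActivateStorageDomainVDSCommand"
            and "START" in line
            and thread in failed
        ):
            hits.append((failed.pop(thread), m.group("ts"), thread))
    return hits
```

Feeding it the two engine lines quoted above (the rollback at 14:35:10,105 and the START at 14:35:25,196) should yield a single hit for pool-5-thread-50.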


Expected results:
The engine should fail the flow and not send ActivateStorageDomain to the host.
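
A minimal sketch of the expected behaviour, written in Python for brevity (the engine itself is Java, and this is not its code). All names here (activate_domain, FakeHost, StorageConnectionError, the two host methods) are hypothetical; the only detail taken from the logs is that a non-zero status such as 477 means the connect failed.

```python
class StorageConnectionError(Exception):
    """Raised when the host could not connect to the storage server."""


class FakeHost:
    """Hypothetical stand-in for a host, for illustration only."""

    def __init__(self, connect_status):
        self.connect_status = connect_status
        self.activate_calls = 0

    def connect_storage_server(self, domain_id):
        # 0 means success; any other value is treated as an error code.
        return self.connect_status

    def activate_storage_domain(self, domain_id):
        self.activate_calls += 1
        return True


def activate_domain(host, domain_id):
    """Activate a domain only if the storage connection succeeded."""
    status = host.connect_storage_server(domain_id)
    if status != 0:
        # Fail the flow here; the activate call is never sent to the host.
        raise StorageConnectionError(
            "connect failed for %s with error code %s" % (domain_id, status)
        )
    return host.activate_storage_domain(domain_id)
```

With FakeHost(477), activate_domain raises and activate_calls stays at 0; with FakeHost(0) the activation goes through.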

Additional info:
logs

Comment 1 Elad 2013-08-27 23:41:35 UTC
***End on engine = And on engine***

Comment 2 Ayal Baron 2013-09-15 15:54:21 UTC
The engine ignores the result of connectStorageServer in most (all?) cases, since the following operation can often succeed anyway and it's not worth trying to identify ahead of time which would and which wouldn't.
Also, once we get rid of the pool there will be no 'activate' operation, so this is doubly uninteresting.
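
For illustration, the "ignore the connect result and let the next operation decide" approach described in this comment might look roughly like this. It is a sketch in Python with hypothetical names (activate_best_effort and the two host methods), not the engine's actual Java code.

```python
import logging

log = logging.getLogger("sketch")


def activate_best_effort(host, domain_id):
    """Try to connect, then attempt activation regardless of the result."""
    status = host.connect_storage_server(domain_id)
    if status != 0:
        # Record the failure but do not abort: the following operation
        # may still succeed, and its result is what gets reported.
        log.warning(
            "connect failed for %s with code %s; trying activation anyway",
            domain_id,
            status,
        )
    return host.activate_storage_domain(domain_id)
```

The trade-off is the one this bug describes: the flow is simpler, but the reported state depends entirely on what the activate call returns.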