Bug 1303977

Summary: KeyError when primary server used to mount gluster volume is down
Product: [oVirt] vdsm
Component: General
Version: 4.17.18
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Assignee: Ala Hino <ahino>
Reporter: Sahina Bose <sabose>
QA Contact: Elad <ebenahar>
CC: acanan, amureini, bugs, sabose, tnisan, ylavi
Target Milestone: ovirt-3.6.5
Target Release: 4.17.25
Flags: rule-engine: ovirt-3.6.z+, ylavi: planning_ack+, amureini: devel_ack+, acanan: testing_ack+
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
oVirt Team: Storage
Last Closed: 2016-04-21 14:36:05 UTC
Bug Blocks: 1258386

Description Sahina Bose 2016-02-02 15:28:38 UTC
Description of problem:

In a replica 3 gluster volume deployment with hosts (h1, h2, h3), the storage domain was mounted as h1:vol with backup-volfile-servers set to h2:h3.

However, when the h1 server is down, validation of the gluster volume fails (it uses the primary server to fetch the volfile), and the mount does not succeed even though the backup volfile servers were online.


jsonrpc.Executor/3::ERROR::2016-01-29 17:45:10,180::hsm::2473::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2470, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 221, in connect
    self.validate()
  File "/usr/share/vdsm/storage/storageServer.py", line 346, in validate
    replicaCount = self.volinfo['replicaCount']
  File "/usr/share/vdsm/storage/storageServer.py", line 333, in volinfo
    self._volinfo = self._get_gluster_volinfo()
  File "/usr/share/vdsm/storage/storageServer.py", line 371, in _get_gluster_volinfo
    return volinfo[self._volname]
KeyError: u'engine'
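A side note on the traceback above: the bare dict lookup in `_get_gluster_volinfo` turns "no data about this volume from the queried server" into an opaque KeyError rather than a clear connection error. A minimal sketch of the mechanism (the values are illustrative, not taken from vdsm):

```python
# Illustrative only: if the volume-info query against the primary server
# comes back without our volume (shown here as an empty dict), the plain
# dict lookup raises KeyError: 'engine' instead of a connection error.
volinfo = {}          # hypothetical reply when h1 cannot be reached
volname = "engine"

try:
    replica_count = volinfo[volname]["replicaCount"]
except KeyError as err:
    replica_count = None  # the caller sees KeyError, not "h1 is down"
    print("KeyError:", err)
```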

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create a storage domain using a gluster replica 3 volume, and provide the backup-volfile-servers option; for instance, h1:/vol1 with mount options backup-volfile-servers=h2:h3
2. Bring one of the hypervisor nodes down, along with the h1 server.
3. Activate the hypervisor node, which will try to mount the gluster storage domain; this results in an error.

Expected results:
Storage domain should be accessible using the backup-volfile-servers.

Additional info:

Comment 1 Sahina Bose 2016-02-04 10:30:40 UTC
Related, but a different scenario: if glusterd is not running on the primary server used for the mount, the following error is thrown

jsonrpc.Executor/7::ERROR::2016-02-04 15:12:33,219::hsm::2473::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2470, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 221, in connect
    self.validate()
  File "/usr/share/vdsm/storage/storageServer.py", line 346, in validate
    replicaCount = self.volinfo['replicaCount']
  File "/usr/share/vdsm/storage/storageServer.py", line 333, in volinfo
    self._volinfo = self._get_gluster_volinfo()
  File "/usr/share/vdsm/storage/storageServer.py", line 370, in _get_gluster_volinfo
    self._volfileserver)
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
    **kwargs)
  File "<string>", line 2, in glusterVolumeInfo
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
GlusterCmdExecFailedException: Command execution failed
error: Connection failed. Please check if gluster daemon is operational.
return code: 1

Comment 2 Yaniv Lavi 2016-02-08 12:43:50 UTC
We simply fuse-mount the Gluster servers; if this fails, we fail to connect.
If we should not fail on this, Gluster needs to be able to fuse-mount even when the primary is down.

Comment 3 Ala Hino 2016-02-08 13:31:09 UTC
The failure occurs because we get vol info from the primary gluster server before doing the mount. If the server is down, we fail before we even attempt the mount.
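One possible direction for the pre-validation step described above is to try the backup volfile servers for the volume-info query when the primary is unreachable. This is a hedged sketch only; `GlusterError`, `get_volinfo`, and the `query` callback are hypothetical names, not the actual vdsm fix:

```python
class GlusterError(Exception):
    """Stand-in for GlusterCmdExecFailedException (hypothetical)."""


def get_volinfo(volname, servers, query):
    """Return info for `volname`, trying each volfile server in turn.

    `query(volname, server)` is a caller-supplied function that fetches
    a volume-info dict from one server (e.g. the primary first, then
    each entry of backup-volfile-servers).
    """
    last_err = None
    for server in servers:
        try:
            info = query(volname, server)
            return info[volname]       # KeyError if volume not reported
        except (GlusterError, KeyError) as err:
            last_err = err             # server unreachable; try the next
    raise last_err
```

With servers = ["h1", "h2", "h3"], a query that fails against h1 falls through to h2, so validation no longer depends on the primary server alone.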

Comment 4 Sahina Bose 2016-02-08 14:21:27 UTC
Removing needinfo as Ala has answered the question. It's not a gluster mount issue, but a pre-validation issue, as mentioned in Comment 3.

Comment 5 Allon Mureinik 2016-03-13 11:11:56 UTC
The patch needs to be backported to the 3.6 branch - returning status to POST.

Comment 6 Eyal Edri 2016-03-31 08:35:18 UTC
Bugs moved prematurely to ON_QA since they didn't have a target release.
Note that only bugs with a set target release will move to ON_QA.

Comment 7 Elad 2016-04-03 11:51:59 UTC
The gluster domain remains accessible after following the scenario described in the description:

1. Create a storage domain using a gluster replica 3 volume, and provide the backup-volfile-servers option; for instance, h1:/vol1 with mount options backup-volfile-servers=h2:h3
2. Bring one of the hypervisor nodes down, along with the h1 server.
3. Activate the hypervisor node

Both hosts have access to the domain even though the primary Gluster server is down.

Verified using:
rhevm-3.6.5-0.1.el6.noarch
vdsm-4.17.25-0.el7ev.noarch
glusterfs-devel-3.7.8-4.el7.x86_64
glusterfs-rdma-3.7.8-4.el7.x86_64
glusterfs-fuse-3.7.8-4.el7.x86_64
glusterfs-server-3.7.8-4.el7.x86_64
python-gluster-3.7.8-4.el7.noarch
glusterfs-ganesha-3.7.8-4.el7.x86_64
glusterfs-debuginfo-3.7.8-4.el7.x86_64
glusterfs-client-xlators-3.7.8-4.el7.x86_64
glusterfs-extra-xlators-3.7.8-4.el7.x86_64
glusterfs-geo-replication-3.7.8-4.el7.x86_64
glusterfs-libs-3.7.8-4.el7.x86_64
glusterfs-3.7.8-4.el7.x86_64
nfs-ganesha-gluster-2.3.0-1.el7.x86_64
glusterfs-resource-agents-3.7.8-4.el7.noarch
glusterfs-cli-3.7.8-4.el7.x86_64
glusterfs-api-devel-3.7.8-4.el7.x86_64
glusterfs-api-3.7.8-4.el7.x86_64