Bug 1303977 - KeyError when primary server used to mount gluster volume is down
Status: CLOSED CURRENTRELEASE
Product: vdsm
Classification: oVirt
Component: General
Version: 4.17.18
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-3.6.5
Target Release: 4.17.25
Assigned To: Ala Hino
QA Contact: Elad
Depends On:
Blocks: Gluster-HC-1
Reported: 2016-02-02 10:28 EST by Sahina Bose
Modified: 2016-04-21 10:36 EDT
CC List: 6 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-04-21 10:36:05 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt-3.6.z+
ylavi: planning_ack+
amureini: devel_ack+
acanan: testing_ack+




External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 53785 master MERGED gluster: Don't fail connect server when getting volume info 2016-03-08 11:34 EST
oVirt gerrit 54689 ovirt-3.6 MERGED gluster: Don't fail connect server when getting volume info 2016-03-16 05:50 EDT

Description Sahina Bose 2016-02-02 10:28:38 EST
Description of problem:

In a replica 3 Gluster volume deployment with hosts h1, h2 and h3, the storage domain was mounted from h1:vol with backup-volfile-servers set to h2:h3.

However, when the h1 server is down, validation of the Gluster volume fails (it uses the primary server to fetch the volume file), and the mount does not succeed even though the backup volfile servers were online.


jsonrpc.Executor/3::ERROR::2016-01-29 17:45:10,180::hsm::2473::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2470, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 221, in connect
    self.validate()
  File "/usr/share/vdsm/storage/storageServer.py", line 346, in validate
    replicaCount = self.volinfo['replicaCount']
  File "/usr/share/vdsm/storage/storageServer.py", line 333, in volinfo
    self._volinfo = self._get_gluster_volinfo()
  File "/usr/share/vdsm/storage/storageServer.py", line 371, in _get_gluster_volinfo
    return volinfo[self._volname]
KeyError: u'engine'
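
The KeyError itself is a plain dict lookup miss: whatever the volume info call returns when h1 cannot be reached has no entry for the volume, so indexing by the volume name at storageServer.py line 371 raises. A minimal illustration (the empty dict is an assumption about the returned data, not taken from the logs):

    # Minimal illustration of the failure mode; the empty dict is assumed.
    volname = u'engine'
    volinfo = {}          # no usable info came back from the down primary server
    volinfo[volname]      # raises KeyError: u'engine', as in the traceback above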

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create a storage domain using a Gluster replica 3 volume and provide the backup-volfile-servers option; for instance, h1:/vol1 as the path and backup-volfile-servers=h2:h3 in the mount options (see the mount sketch after these steps).
2. Bring one of the hypervisor nodes down, and bring the h1 server down.
3. Activate the hypervisor node, which will try to mount the Gluster storage domain, resulting in an error.
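
For reference, a rough sketch of the FUSE mount this configuration corresponds to -- not the command vdsm actually builds, and the mount point path is only illustrative. With backup-volfile-servers, the gluster client retries the volfile fetch against h2 and h3 when h1 is unreachable, so the mount itself can still succeed:

    # Rough illustration only (hypothetical mount point); requires root and glusterfs-fuse.
    import subprocess

    subprocess.check_call([
        "mount", "-t", "glusterfs",
        "-o", "backup-volfile-servers=h2:h3",        # fall back to h2, then h3
        "h1:/vol1",                                  # primary volfile server and volume
        "/rhev/data-center/mnt/glusterSD/h1:_vol1",  # illustrative mount point
    ])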

Expected results:
The storage domain should be accessible using the backup-volfile-servers.

Additional info:
Comment 1 Sahina Bose 2016-02-04 05:30:40 EST
Related, but a different scenario: if glusterd is not running on the primary server used for the mount, the following error is thrown

jsonrpc.Executor/7::ERROR::2016-02-04 15:12:33,219::hsm::2473::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2470, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 221, in connect
    self.validate()
  File "/usr/share/vdsm/storage/storageServer.py", line 346, in validate
    replicaCount = self.volinfo['replicaCount']
  File "/usr/share/vdsm/storage/storageServer.py", line 333, in volinfo
    self._volinfo = self._get_gluster_volinfo()
  File "/usr/share/vdsm/storage/storageServer.py", line 370, in _get_gluster_volinfo
    self._volfileserver)
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
    **kwargs)
  File "<string>", line 2, in glusterVolumeInfo
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
GlusterCmdExecFailedException: Command execution failed
error: Connection failed. Please check if gluster daemon is operational.
return code: 1
Comment 2 Yaniv Lavi 2016-02-08 07:43:50 EST
We simply FUSE mount the Gluster servers; if this fails, we fail to connect.
If we should not fail on this, Gluster needs to be able to FUSE mount even when the primary is down.
Comment 3 Ala Hino 2016-02-08 08:31:09 EST
The failure occurs because we get the volume info from the primary Gluster server before doing the mount. If that server is down, we fail before ever attempting the mount.
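
Based on the title of the linked gerrit patch ("gluster: Don't fail connect server when getting volume info"), a minimal sketch of that direction follows -- not the merged code. fetch_volinfo is a hypothetical stand-in for the supervdsm glusterVolumeInfo call seen in the tracebacks, and the caller would then skip the replicaCount validation when no info comes back:

    # Hedged sketch, not the actual vdsm change: log the volume-info failure
    # instead of raising, so connectStorageServer can still attempt the FUSE
    # mount, which can succeed through backup-volfile-servers.
    import logging

    log = logging.getLogger("Storage.StorageServer")

    def get_gluster_volinfo(fetch_volinfo, volname, volfileserver):
        # fetch_volinfo is a hypothetical callable wrapping the supervdsm
        # glusterVolumeInfo call; volname and volfileserver match the
        # attribute names seen in the traceback.
        try:
            return fetch_volinfo(volname, volfileserver)[volname]
        except Exception:
            log.warning("Could not get volume info for %r from %r; "
                        "skipping volume validation", volname, volfileserver)
            return None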
Comment 4 Sahina Bose 2016-02-08 09:21:27 EST
Removing needinfo as Ala has answered the question. It's not a Gluster mount issue, but a pre-validation issue, as mentioned in Comment 3.
Comment 5 Allon Mureinik 2016-03-13 07:11:56 EDT
The patch needs to be backported to the 3.6 branch - returning status to POST.
Comment 6 Eyal Edri 2016-03-31 04:35:18 EDT
Bugs were moved prematurely to ON_QA since they didn't have a target release.
Note that only bugs with a set target release will move to ON_QA.
Comment 7 Elad 2016-04-03 07:51:59 EDT
The Gluster domain remains accessible after following the scenario described in the description:

1. Create a storage domain using a Gluster replica 3 volume and provide the backup-volfile-servers option; for instance, h1:/vol1 as the path and backup-volfile-servers=h2:h3 in the mount options.
2. Bring one of the hypervisor nodes down, and bring the h1 server down.
3. Activate the hypervisor node.

Both hosts have access to the domain even though the primary Gluster server is down.

Verified using:
rhevm-3.6.5-0.1.el6.noarch
vdsm-4.17.25-0.el7ev.noarch
glusterfs-devel-3.7.8-4.el7.x86_64
glusterfs-rdma-3.7.8-4.el7.x86_64
glusterfs-fuse-3.7.8-4.el7.x86_64
glusterfs-server-3.7.8-4.el7.x86_64
python-gluster-3.7.8-4.el7.noarch
glusterfs-ganesha-3.7.8-4.el7.x86_64
glusterfs-debuginfo-3.7.8-4.el7.x86_64
glusterfs-client-xlators-3.7.8-4.el7.x86_64
glusterfs-extra-xlators-3.7.8-4.el7.x86_64
glusterfs-geo-replication-3.7.8-4.el7.x86_64
glusterfs-libs-3.7.8-4.el7.x86_64
glusterfs-3.7.8-4.el7.x86_64
nfs-ganesha-gluster-2.3.0-1.el7.x86_64
glusterfs-resource-agents-3.7.8-4.el7.noarch
glusterfs-cli-3.7.8-4.el7.x86_64
glusterfs-api-devel-3.7.8-4.el7.x86_64
glusterfs-api-3.7.8-4.el7.x86_64
