Bug 1303977 - KeyError when primary server used to mount gluster volume is down
Status: CLOSED CURRENTRELEASE
Product: vdsm
Classification: oVirt
Component: General
Version: 4.17.18
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-3.6.5
Target Release: 4.17.25
Assigned To: Ala Hino
QA Contact: Elad
Depends On:
Blocks: Gluster-HC-1
Reported: 2016-02-02 10:28 EST by Sahina Bose
Modified: 2016-04-21 10:36 EDT
CC List: 6 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-04-21 10:36:05 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt-3.6.z+
ylavi: planning_ack+
amureini: devel_ack+
acanan: testing_ack+




External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 53785 master MERGED gluster: Don't fail connect server when getting volume info 2016-03-08 11:34 EST
oVirt gerrit 54689 ovirt-3.6 MERGED gluster: Don't fail connect server when getting volume info 2016-03-16 05:50 EDT

Description Sahina Bose 2016-02-02 10:28:38 EST
Description of problem:

In a replica 3 Gluster volume deployment with hosts h1, h2 and h3, the storage domain was mounted from h1:vol with backup-volfile-servers set to h2:h3.

However, when the h1 server is down, validation of the Gluster volume fails (it uses the primary server to fetch the volume file), and the mount does not succeed even though the backup volfile servers were online.


jsonrpc.Executor/3::ERROR::2016-01-29 17:45:10,180::hsm::2473::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2470, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 221, in connect
    self.validate()
  File "/usr/share/vdsm/storage/storageServer.py", line 346, in validate
    replicaCount = self.volinfo['replicaCount']
  File "/usr/share/vdsm/storage/storageServer.py", line 333, in volinfo
    self._volinfo = self._get_gluster_volinfo()
  File "/usr/share/vdsm/storage/storageServer.py", line 371, in _get_gluster_volinfo
    return volinfo[self._volname]
KeyError: u'engine'
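
The KeyError itself is a plain dict lookup miss: whatever the volume info call returns when h1 cannot be reached has no entry for the volume, so indexing by the volume name at storageServer.py line 371 raises. A minimal illustration (the empty dict is an assumption about the returned data, not taken from the logs):

    # Minimal illustration of the failure mode; the empty dict is assumed.
    volname = u'engine'
    volinfo = {}          # no usable info came back from the down primary server
    volinfo[volname]      # raises KeyError: u'engine', as in the traceback above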

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create a storage domain using a Gluster replica 3 volume and provide the backup-volfile-servers option; for instance, h1:/vol1 as the path and backup-volfile-servers=h2:h3 in the mount options (see the mount sketch after these steps).
2. Bring one of the hypervisor nodes down, and bring the h1 server down.
3. Activate the hypervisor node, which will try to mount the Gluster storage domain, resulting in an error.
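
For reference, a rough sketch of the FUSE mount this configuration corresponds to -- not the command vdsm actually builds, and the mount point path is only illustrative. With backup-volfile-servers, the gluster client retries the volfile fetch against h2 and h3 when h1 is unreachable, so the mount itself can still succeed:

    # Rough illustration only (hypothetical mount point); requires root and glusterfs-fuse.
    import subprocess

    subprocess.check_call([
        "mount", "-t", "glusterfs",
        "-o", "backup-volfile-servers=h2:h3",        # fall back to h2, then h3
        "h1:/vol1",                                  # primary volfile server and volume
        "/rhev/data-center/mnt/glusterSD/h1:_vol1",  # illustrative mount point
    ])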

Expected results:
The storage domain should be accessible using the backup-volfile-servers.

Additional info:
Comment 1 Sahina Bose 2016-02-04 05:30:40 EST
Related, but a different scenario: if glusterd is not running on the primary server used for the mount, the following error is thrown

jsonrpc.Executor/7::ERROR::2016-02-04 15:12:33,219::hsm::2473::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2470, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 221, in connect
    self.validate()
  File "/usr/share/vdsm/storage/storageServer.py", line 346, in validate
    replicaCount = self.volinfo['replicaCount']
  File "/usr/share/vdsm/storage/storageServer.py", line 333, in volinfo
    self._volinfo = self._get_gluster_volinfo()
  File "/usr/share/vdsm/storage/storageServer.py", line 370, in _get_gluster_volinfo
    self._volfileserver)
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
    **kwargs)
  File "<string>", line 2, in glusterVolumeInfo
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
GlusterCmdExecFailedException: Command execution failed
error: Connection failed. Please check if gluster daemon is operational.
return code: 1
Comment 2 Yaniv Lavi 2016-02-08 07:43:50 EST
We simply FUSE mount the Gluster servers; if this fails, we fail to connect.
If we should not fail on this, Gluster needs to be able to FUSE mount even when the primary is down.
Comment 3 Ala Hino 2016-02-08 08:31:09 EST
The failure occurs because we get the volume info from the primary Gluster server before doing the mount. If that server is down, we fail before ever attempting the mount.
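
Based on the title of the linked gerrit patch ("gluster: Don't fail connect server when getting volume info"), a minimal sketch of that direction follows -- not the merged code. fetch_volinfo is a hypothetical stand-in for the supervdsm glusterVolumeInfo call seen in the tracebacks, and the caller would then skip the replicaCount validation when no info comes back:

    # Hedged sketch, not the actual vdsm change: log the volume-info failure
    # instead of raising, so connectStorageServer can still attempt the FUSE
    # mount, which can succeed through backup-volfile-servers.
    import logging

    log = logging.getLogger("Storage.StorageServer")

    def get_gluster_volinfo(fetch_volinfo, volname, volfileserver):
        # fetch_volinfo is a hypothetical callable wrapping the supervdsm
        # glusterVolumeInfo call; volname and volfileserver match the
        # attribute names seen in the traceback.
        try:
            return fetch_volinfo(volname, volfileserver)[volname]
        except Exception:
            log.warning("Could not get volume info for %r from %r; "
                        "skipping volume validation", volname, volfileserver)
            return None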
Comment 4 Sahina Bose 2016-02-08 09:21:27 EST
Removing needinfo as Ala has answered the question. It's not a Gluster mount issue, but a pre-validation issue, as mentioned in Comment 3.
Comment 5 Allon Mureinik 2016-03-13 07:11:56 EDT
The patch needs to be backported to the 3.6 branch - returning status to POST.
Comment 6 Eyal Edri 2016-03-31 04:35:18 EDT
Bugs were moved prematurely to ON_QA since they didn't have a target release.
Note that only bugs with a set target release will move to ON_QA.
Comment 7 Elad 2016-04-03 07:51:59 EDT
The Gluster domain remains accessible after following the scenario described in the description:

1. Create a storage domain using a Gluster replica 3 volume and provide the backup-volfile-servers option; for instance, h1:/vol1 as the path and backup-volfile-servers=h2:h3 in the mount options.
2. Bring one of the hypervisor nodes down, and bring the h1 server down.
3. Activate the hypervisor node.

Both hosts have access to the domain even though the primary Gluster server is down.

Verified using:
rhevm-3.6.5-0.1.el6.noarch
vdsm-4.17.25-0.el7ev.noarch
glusterfs-devel-3.7.8-4.el7.x86_64
glusterfs-rdma-3.7.8-4.el7.x86_64
glusterfs-fuse-3.7.8-4.el7.x86_64
glusterfs-server-3.7.8-4.el7.x86_64
python-gluster-3.7.8-4.el7.noarch
glusterfs-ganesha-3.7.8-4.el7.x86_64
glusterfs-debuginfo-3.7.8-4.el7.x86_64
glusterfs-client-xlators-3.7.8-4.el7.x86_64
glusterfs-extra-xlators-3.7.8-4.el7.x86_64
glusterfs-geo-replication-3.7.8-4.el7.x86_64
glusterfs-libs-3.7.8-4.el7.x86_64
glusterfs-3.7.8-4.el7.x86_64
nfs-ganesha-gluster-2.3.0-1.el7.x86_64
glusterfs-resource-agents-3.7.8-4.el7.noarch
glusterfs-cli-3.7.8-4.el7.x86_64
glusterfs-api-devel-3.7.8-4.el7.x86_64
glusterfs-api-3.7.8-4.el7.x86_64
