Bug 875076

Summary: [RHEV-RHS] Storage domain becomes inactive after rebalance
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: shylesh <shmohan>
Component: glusterfs
Assignee: shishir gowda <sgowda>
Status: CLOSED NOTABUG
QA Contact: shylesh <shmohan>
Severity: unspecified
Docs Contact:
Priority: low
Version: unspecified
CC: amarts, asriram, grajaiya, nsathyan, rhs-bugs, shaines, vbellur, vbhat, vinaraya
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: Converting a volume with just one brick into a multi-brick volume (using add-brick) is not supported. Consequence: After the add-brick and rebalance, the volume does not function properly (many operations on the volume fail). Workaround (if any): Start with a volume that has at least 2 bricks. Result: If the volume has at least 2 bricks to start with, this issue never occurs.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-01-31 04:49:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 862981, 865669    
Attachments: mnt, vdsm, engine logs

Description shylesh 2012-11-09 13:24:49 UTC
Created attachment 641528 [details]
mnt, vdsm, engine logs,

Description of problem:
Rebalance on a single-brick distribute volume makes the storage domain inactive.

Version-Release number of selected component (if applicable):
glusterfs-fuse-3.3.0rhsvirt1-8.el6rhs.x86_64
vdsm-gluster-4.9.6-16.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-3.3.0rhsvirt1-8.el6rhs.x86_64
glusterfs-server-3.3.0rhsvirt1-8.el6rhs.x86_64
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
glusterfs-rdma-3.3.0rhsvirt1-8.el6rhs.x86_64
gluster-swift-doc-1.4.8-4.el6.noarch
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-object-1.4.8-4.el6.noarch
glusterfs-geo-replication-3.3.0rhsvirt1-8.el6rhs.x86_64


How reproducible:


Steps to Reproduce:
1. Create a single-brick distribute volume.
2. Create 2 VMs on this volume.
3. Add bricks and run rebalance (see the command sketch below).
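
Roughly, the gluster CLI sequence for these steps looks like the following (a sketch only: hostnames and brick paths are placeholders modelled on the volume info further down, and the volume options used in the report are omitted):

# 1. create and start a single-brick distribute volume
gluster volume create vmstore rhs-client36.lab.eng.blr.redhat.com:/brick3
gluster volume start vmstore

# 2. mount the volume on the hypervisor, create the storage domain and the 2 VMs on it

# 3. expand the volume and rebalance it
gluster volume add-brick vmstore rhs-client37.lab.eng.blr.redhat.com:/brick2
gluster volume rebalance vmstore start
gluster volume rebalance vmstore status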

  
Actual results:
After the rebalance completes successfully, the storage domain becomes inactive, although the VMs remain active.
 

Additional info:
[root@rhs-client44 ~]# gluster v info vmstore
 
Volume Name: vmstore
Type: Distribute
Volume ID: ab561be9-69e9-41ba-abda-f3d9f77db079
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: rhs-client36.lab.eng.blr.redhat.com:/brick3
Brick2: rhs-client37.lab.eng.blr.redhat.com:/brick2
Brick3: rhs-client43.lab.eng.blr.redhat.com:/brick2
Brick4: rhs-client44.lab.eng.blr.redhat.com:/brick2
Options Reconfigured:
cluster.subvols-per-directory: 1
storage.owner-uid: 36
storage.owner-gid: 36
cluster.eager-lock: enable
storage.linux-aio: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off


on the hypervisor mount 
======================
[root@rhs-gp-srv4 rhs-client36.lab.eng.blr.redhat.com:vmstore]# ll
ls: cannot access 13a3d358-65bd-4d03-bfcf-e6bcb6c8a176: No such file or directory
total 0
??????????? ? ?    ?      ?            ? 13a3d358-65bd-4d03-bfcf-e6bcb6c8a176



The permissions appear scrambled (stat on the storage domain directory fails).

Attached: mnt, vdsm, and engine logs.





mnt log
======
[2012-11-09 18:07:45.604963] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 1091519: GETATTR 140550774108352 (ba87ed4b-5b55-4ef5-b53e-50d3238ebec1) resolution failed
[2012-11-09 18:07:55.955351] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in /13a3d358-65bd-4d03-bfcf-e6bcb6c8a176. holes=1 overlaps=1
[2012-11-09 18:07:55.956655] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in <gfid:ba87ed4b-5b55-4ef5-b53e-50d3238ebec1>. holes=1 overlaps=1
[2012-11-09 18:07:55.956707] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: ba87ed4b-5b55-4ef5-b53e-50d3238ebec1: failed to resolve (Invalid argument)
[2012-11-09 18:07:55.956731] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 1091538: GETATTR 140550774108352 (ba87ed4b-5b55-4ef5-b53e-50d3238ebec1) resolution failed
[2012-11-09 18:08:06.303144] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in /13a3d358-65bd-4d03-bfcf-e6bcb6c8a176. holes=0 overlaps=2
[2012-11-09 18:08:06.304492] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in <gfid:ba87ed4b-5b55-4ef5-b53e-50d3238ebec1>. holes=1 overlaps=1
[2012-11-09 18:08:06.304526] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: ba87ed4b-5b55-4ef5-b53e-50d3238ebec1: failed to resolve (Invalid argument)
[2012-11-09 18:08:06.304565] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 1095142: GETATTR 140550774108352 (ba87ed4b-5b55-4ef5-b53e-50d3238ebec1) resolution failed
[2012-11-09 18:08:16.646334] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in /13a3d358-65bd-4d03-bfcf-e6bcb6c8a176. holes=0 overlaps=2
[2012-11-09 18:08:16.647740] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in <gfid:ba87ed4b-5b55-4ef5-b53e-50d3238ebec1>. holes=1 overlaps=1
[2012-11-09 18:08:16.647773] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: ba87ed4b-5b55-4ef5-b53e-50d3238ebec1: failed to resolve (Invalid argument)
[2012-11-09 18:08:16.647789] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 1110890: GETATTR 140550774108352 (ba87ed4b-5b55-4ef5-b53e-50d3238ebec1) resolution failed






vdsm logs
=========
Thread-6625::DEBUG::2012-11-09 18:53:54,639::resourceManager::565::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.1cfe1c
4b-1d33-46ab-9c53-4fb65d7a892d', Clearing records.
Thread-6625::ERROR::2012-11-09 18:53:54,640::task::853::TaskManager.Task::(_setError) Task=`8a12bb1e-d17d-42d4-8c31-03deac10a056`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 853, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 895, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 648, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1178, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1522, in getMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=1cfe1c4b-1d33-46ab-9c53-4fb65d7a892d, msdUUID=13a3d358-65bd-4d03-bfcf-e6bcb6c8a176'
Thread-6625::DEBUG::2012-11-09 18:53:54,640::task::872::TaskManager.Task::(_run) Task=`8a12bb1e-d17d-42d4-8c31-03deac10a056`::Task._run: 8a12bb1e-d17d-42d4-8c31-03deac10a056 ('1cfe1c4b-1d33-46ab-9c53-4fb65d7a892d', 1, '1cfe1c4b-1d33-46ab-9c53-4fb65d7a892d', '13a3d358-65bd-4d03-bfcf-e6bcb6c8a176', 1) {} failed - stopping task
Thread-6625::DEBUG::2012-11-09 18:53:54,641::task::1199::TaskManager.Task::(stop) Task=`8a12bb1e-d17d-42d4-8c31-03deac10a056`::stopping in state preparing (force False)

Comment 3 shylesh 2012-11-09 13:56:28 UTC
After manually unmounting the volume from the RHEL-H host and then activating the domain ... things started working.

Comment 5 Amar Tumballi 2012-11-12 05:50:06 UTC
> [2012-11-09 18:07:55.955351] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in /13a3d358-65bd-4d03-bfcf-e6bcb6c8a176. holes=1 overlaps=1

An overlap of hash ranges is never a good sign for DHT, and hence the storage domain is seen as inactive. This does not affect already-open fds (i.e. the running VMs), hence the observation is consistent with this. Still trying to figure out what is wrong with the volume.

Can the reporter confirm whether expanding a volume that already has more than 1 brick, followed by a rebalance, works fine? If that is the case, I would reduce the severity of this bug, since in the ideal scenario we would not support any volume with fewer than 4 bricks.
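
For reference, the layout ranges that dht_layout_normalize complains about can be inspected directly on the bricks by reading the trusted.glusterfs.dht xattr of the affected directory. A sketch, assuming the getfattr utility from the attr package is available on the brick servers, with the directory name taken from the logs above:

# run on each brick server against its own brick path
getfattr -n trusted.glusterfs.dht -e hex /brick3/13a3d358-65bd-4d03-bfcf-e6bcb6c8a176
getfattr -n trusted.glusterfs.dht -e hex /brick2/13a3d358-65bd-4d03-bfcf-e6bcb6c8a176

Each value encodes the hash range that brick owns for the directory; ranges that overlap or leave gaps across the bricks correspond to the holes/overlaps counts in the log.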

Comment 6 shylesh 2012-11-12 11:58:25 UTC
(In reply to comment #5)
> > [2012-11-09 18:07:55.955351] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in /13a3d358-65bd-4d03-bfcf-e6bcb6c8a176. holes=1 overlaps=1
> 
> An overlap of hash ranges is never a good sign for DHT, and hence the
> storage domain is seen as inactive. This does not affect already-open fds
> (i.e. the running VMs), hence the observation is consistent with this.
> Still trying to figure out what is wrong with the volume.
> 
> Can the reporter confirm whether expanding a volume that already has more
> than 1 brick, followed by a rebalance, works fine? If that is the case, I
> would reduce the severity of this bug, since in the ideal scenario we
> would not support any volume with fewer than 4 bricks.

I tried with a 3x2 distributed-replicate volume and a 5-brick distribute volume, but the issue still persists (the two layouts are sketched below).
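
For clarity, the two layouts referred to above would be created along these lines (hostnames and brick paths are placeholders, not the actual test setup):

# 3x2 distributed-replicate: 3 replica pairs across 6 bricks
gluster volume create testvol replica 2 host1:/brick host2:/brick host3:/brick host4:/brick host5:/brick host6:/brick

# 5-brick plain distribute
gluster volume create testvol2 host1:/brick host2:/brick host3:/brick host4:/brick host5:/brick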

Comment 8 Vidya Sakar 2013-01-07 10:56:55 UTC
This is not high priority since it is hit only when the volume is created as a single-brick distribute volume, which is not a recommended configuration anyway. This will not be fixed in update 4; marking it for update 5 for now.

Comment 9 Amar Tumballi 2013-01-30 05:43:07 UTC
I recommend closing this as NOTABUG, considering this is not a valid configuration. Any thoughts?

Comment 10 Scott Haines 2013-01-31 04:49:37 UTC
Per the 01/31 tiger team meeting, closing as invalid.