Bug 875076 - [RHEV-RHS] Storage domain becomes inactive after rebalance
Summary: [RHEV-RHS] Storage domain becomes inactive after rebalance
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: unspecified
Hardware: x86_64
OS: Linux
Priority: low
Severity: unspecified
Target Milestone: ---
Assignee: shishir gowda
QA Contact: shylesh
URL:
Whiteboard:
Depends On:
Blocks: 862981 865669
 
Reported: 2012-11-09 13:24 UTC by shylesh
Modified: 2013-12-09 01:34 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: Changing a volume with just one brick to multiple bricks (using add-brick) is not supported. Consequence: After add-brick and rebalance, the volume does not function properly (i.e., many operations on the volume fail). Workaround (if any): Start with a volume that has at least 2 bricks. Result: If the volume has at least 2 bricks to start with, this issue never occurs.
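A minimal sketch of the documented workaround, assuming hypothetical host names and brick paths: create the volume with at least two bricks from the start, so that add-brick and rebalance operate on a multi-brick layout.

# hypothetical example: start with a 2-brick distribute volume instead of a single brick
gluster volume create vmstore server1:/bricks/vmstore server2:/bricks/vmstore
gluster volume start vmstore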
Clone Of:
Environment:
Last Closed: 2013-01-31 04:49:37 UTC
Embargoed:


Attachments
mnt, vdsm, engine logs (1.65 MB, application/x-gzip)
2012-11-09 13:24 UTC, shylesh

Description shylesh 2012-11-09 13:24:49 UTC
Created attachment 641528
mnt, vdsm, engine logs

Description of problem:
Rebalance on a single-brick distribute volume makes the storage domain inactive.

Version-Release number of selected component (if applicable):
glusterfs-fuse-3.3.0rhsvirt1-8.el6rhs.x86_64
vdsm-gluster-4.9.6-16.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-3.3.0rhsvirt1-8.el6rhs.x86_64
glusterfs-server-3.3.0rhsvirt1-8.el6rhs.x86_64
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
glusterfs-rdma-3.3.0rhsvirt1-8.el6rhs.x86_64
gluster-swift-doc-1.4.8-4.el6.noarch
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-object-1.4.8-4.el6.noarch
glusterfs-geo-replication-3.3.0rhsvirt1-8.el6rhs.x86_64


How reproducible:


Steps to Reproduce:
1. Created a single-brick distribute volume.
2. Created 2 VMs on this volume.
3. Ran add-brick and rebalance (see the CLI sketch below).
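The gluster CLI sequence for the steps above would look roughly like the following; host names and brick paths are illustrative approximations, not necessarily the exact ones used in this report.

gluster volume create vmstore rhs-client36.lab.eng.blr.redhat.com:/brick3
gluster volume start vmstore
# ... create the RHEV storage domain on the mounted volume and create 2 VMs ...
gluster volume add-brick vmstore rhs-client37.lab.eng.blr.redhat.com:/brick2 rhs-client43.lab.eng.blr.redhat.com:/brick2 rhs-client44.lab.eng.blr.redhat.com:/brick2
gluster volume rebalance vmstore start
gluster volume rebalance vmstore status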

  
Actual results:
After the rebalance completes successfully, the storage domain becomes inactive even though the VMs remain active.
 

Additional info:
[root@rhs-client44 ~]# gluster v info vmstore
 
Volume Name: vmstore
Type: Distribute
Volume ID: ab561be9-69e9-41ba-abda-f3d9f77db079
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: rhs-client36.lab.eng.blr.redhat.com:/brick3
Brick2: rhs-client37.lab.eng.blr.redhat.com:/brick2
Brick3: rhs-client43.lab.eng.blr.redhat.com:/brick2
Brick4: rhs-client44.lab.eng.blr.redhat.com:/brick2
Options Reconfigured:
cluster.subvols-per-directory: 1
storage.owner-uid: 36
storage.owner-gid: 36
cluster.eager-lock: enable
storage.linux-aio: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off


on the hypervisor mount 
======================
[root@rhs-gp-srv4 rhs-client36.lab.eng.blr.redhat.com:vmstore]# ll
ls: cannot access 13a3d358-65bd-4d03-bfcf-e6bcb6c8a176: No such file or directory
total 0
??????????? ? ?    ?      ?            ? 13a3d358-65bd-4d03-bfcf-e6bcb6c8a176



The permissions show up as scrambled because stat on the directory fails.

Attached mnt, vdsm, and engine logs.





mnt log
======
[2012-11-09 18:07:45.604963] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 1091519: GETATTR 140550774108352 (ba87ed4b-5b55-4ef5-b53e-50d3238ebec1) resolution failed
[2012-11-09 18:07:55.955351] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in /13a3d358-65bd-4d03-bfcf-e6bcb6c8a176. holes=1 overlaps=1
[2012-11-09 18:07:55.956655] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in <gfid:ba87ed4b-5b55-4ef5-b53e-50d3238ebec1>. holes=1 overlaps=1
[2012-11-09 18:07:55.956707] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: ba87ed4b-5b55-4ef5-b53e-50d3238ebec1: failed to resolve (Invalid argument)
[2012-11-09 18:07:55.956731] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 1091538: GETATTR 140550774108352 (ba87ed4b-5b55-4ef5-b53e-50d3238ebec1) resolution failed
[2012-11-09 18:08:06.303144] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in /13a3d358-65bd-4d03-bfcf-e6bcb6c8a176. holes=0 overlaps=2
[2012-11-09 18:08:06.304492] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in <gfid:ba87ed4b-5b55-4ef5-b53e-50d3238ebec1>. holes=1 overlaps=1
[2012-11-09 18:08:06.304526] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: ba87ed4b-5b55-4ef5-b53e-50d3238ebec1: failed to resolve (Invalid argument)
[2012-11-09 18:08:06.304565] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 1095142: GETATTR 140550774108352 (ba87ed4b-5b55-4ef5-b53e-50d3238ebec1) resolution failed
[2012-11-09 18:08:16.646334] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in /13a3d358-65bd-4d03-bfcf-e6bcb6c8a176. holes=0 overlaps=2
[2012-11-09 18:08:16.647740] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in <gfid:ba87ed4b-5b55-4ef5-b53e-50d3238ebec1>. holes=1 overlaps=1
[2012-11-09 18:08:16.647773] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: ba87ed4b-5b55-4ef5-b53e-50d3238ebec1: failed to resolve (Invalid argument)
[2012-11-09 18:08:16.647789] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 1110890: GETATTR 140550774108352 (ba87ed4b-5b55-4ef5-b53e-50d3238ebec1) resolution failed






vdsm logs
=========
Thread-6625::DEBUG::2012-11-09 18:53:54,639::resourceManager::565::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.1cfe1c4b-1d33-46ab-9c53-4fb65d7a892d', Clearing records.
Thread-6625::ERROR::2012-11-09 18:53:54,640::task::853::TaskManager.Task::(_setError) Task=`8a12bb1e-d17d-42d4-8c31-03deac10a056`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 853, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 895, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 648, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1178, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1522, in getMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=1cfe1c4b-1d33-46ab-9c53-4fb65d7a892d, msdUUID=13a3d358-65bd-4d03-bfcf-e6bcb6c8a176'
Thread-6625::DEBUG::2012-11-09 18:53:54,640::task::872::TaskManager.Task::(_run) Task=`8a12bb1e-d17d-42d4-8c31-03deac10a056`::Task._run: 8a12bb1e-d17d-42d4-8c31-03deac10a056 ('1cfe1c4b-1d33-46ab-9c53-4fb65d7a892d', 1, '1cfe1c4b-1d33-46ab-9c53-4fb65d7a892d', '13a3d358-65bd-4d03-bfcf-e6bcb6c8a176', 1) {} failed - stopping task
Thread-6625::DEBUG::2012-11-09 18:53:54,641::task::1199::TaskManager.Task::(stop) Task=`8a12bb1e-d17d-42d4-8c31-03deac10a056`::stopping in state preparing (force False)

Comment 3 shylesh 2012-11-09 13:56:28 UTC
Manually unmounting the volume from the RHEL-H host and then activating the domain made things start working again.
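A rough sketch of that manual workaround on the hypervisor; the mount path below is a typical RHEV-style path and is assumed, not taken from the logs.

# unmount the stale gluster mount on the hypervisor
umount /rhev/data-center/mnt/rhs-client36.lab.eng.blr.redhat.com:_vmstore
# then re-activate the storage domain from the RHEV-M UI, which remounts the volume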

Comment 5 Amar Tumballi 2012-11-12 05:50:06 UTC
> [2012-11-09 18:07:55.955351] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in /13a3d358-65bd-4d03-bfcf-e6bcb6c8a176. holes=1 overlaps=1

An overlap of ranges is never a good sign for DHT, and hence the storage domain is seen as inactive. This does not affect already-open fds (i.e., running VMs), hence the observation is valid. Still trying to figure out what is wrong with the volume.

Can the reporter confirm whether expansion and rebalance of a volume with more than one brick works fine? If that is the case, I would reduce the severity of this, as we would not support any volume with fewer than 4 bricks in an ideal scenario.
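One way to inspect the on-disk DHT layout that dht_layout_normalize is complaining about (assuming root access to the brick servers and the brick paths from the volume info above) is to dump the layout xattr of the directory on each brick and compare the hash ranges for holes and overlaps:

# run on each brick server; the hex value encodes the hash range assigned to that brick
getfattr -n trusted.glusterfs.dht -e hex /brick3/13a3d358-65bd-4d03-bfcf-e6bcb6c8a176   # on rhs-client36; use /brick2/... on the other servers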

Comment 6 shylesh 2012-11-12 11:58:25 UTC
(In reply to comment #5)
> > [2012-11-09 18:07:55.955351] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in /13a3d358-65bd-4d03-bfcf-e6bcb6c8a176. holes=1 overlaps=1
> 
> An overlap of ranges is never a good sign for DHT, and hence the storage
> domain is seen as inactive. This does not affect already-open fds (i.e.,
> running VMs), hence the observation is valid. Still trying to figure out
> what is wrong with the volume.
> 
> Can the reporter confirm whether expansion and rebalance of a volume with
> more than one brick works fine? If that is the case, I would reduce the
> severity of this, as we would not support any volume with fewer than 4
> bricks in an ideal scenario.

I tried with 3x2 distributed-replicate and 5-brick distribute volumes, but the issue still persists.

Comment 8 Vidya Sakar 2013-01-07 10:56:55 UTC
This is not high priority since it is hit only when the volume created is a single-brick distribute volume, which is not a recommended configuration anyway. This will not be fixed in update 4; marking it for update 5 for now.

Comment 9 Amar Tumballi 2013-01-30 05:43:07 UTC
I recommend closing this as NOTABUG, considering this is not a valid configuration. Any thoughts?

Comment 10 Scott Haines 2013-01-31 04:49:37 UTC
Per the 01/31 tiger team meeting, closing as invalid.

