Created attachment 641528 [details]
mnt, vdsm, engine logs

Description of problem:
Rebalance on a single-brick distribute volume makes the storage domain inactive.

Version-Release number of selected component (if applicable):
glusterfs-fuse-3.3.0rhsvirt1-8.el6rhs.x86_64
vdsm-gluster-4.9.6-16.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-3.3.0rhsvirt1-8.el6rhs.x86_64
glusterfs-server-3.3.0rhsvirt1-8.el6rhs.x86_64
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
glusterfs-rdma-3.3.0rhsvirt1-8.el6rhs.x86_64
gluster-swift-doc-1.4.8-4.el6.noarch
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-object-1.4.8-4.el6.noarch
glusterfs-geo-replication-3.3.0rhsvirt1-8.el6rhs.x86_64

How reproducible:

Steps to Reproduce (see the gluster CLI sketch at the end of this comment):
1. Created a single-brick distribute volume
2. Created 2 VMs on this volume
3. Ran add-brick and rebalance

Actual results:
After the rebalance completes successfully, the storage domain becomes inactive even though the VMs remain active.

Additional info:
[root@rhs-client44 ~]# gluster v info vmstore

Volume Name: vmstore
Type: Distribute
Volume ID: ab561be9-69e9-41ba-abda-f3d9f77db079
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: rhs-client36.lab.eng.blr.redhat.com:/brick3
Brick2: rhs-client37.lab.eng.blr.redhat.com:/brick2
Brick3: rhs-client43.lab.eng.blr.redhat.com:/brick2
Brick4: rhs-client44.lab.eng.blr.redhat.com:/brick2
Options Reconfigured:
cluster.subvols-per-directory: 1
storage.owner-uid: 36
storage.owner-gid: 36
cluster.eager-lock: enable
storage.linux-aio: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off

On the hypervisor mount
=======================
[root@rhs-gp-srv4 rhs-client36.lab.eng.blr.redhat.com:vmstore]# ll
ls: cannot access 13a3d358-65bd-4d03-bfcf-e6bcb6c8a176: No such file or directory
total 0
??????????? ? ? ? ? ? 13a3d358-65bd-4d03-bfcf-e6bcb6c8a176

The permissions are scrambled.

Attached mnt, vdsm, and engine logs.

mnt log
=======
[2012-11-09 18:07:45.604963] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 1091519: GETATTR 140550774108352 (ba87ed4b-5b55-4ef5-b53e-50d3238ebec1) resolution failed
[2012-11-09 18:07:55.955351] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in /13a3d358-65bd-4d03-bfcf-e6bcb6c8a176. holes=1 overlaps=1
[2012-11-09 18:07:55.956655] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in <gfid:ba87ed4b-5b55-4ef5-b53e-50d3238ebec1>. holes=1 overlaps=1
[2012-11-09 18:07:55.956707] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: ba87ed4b-5b55-4ef5-b53e-50d3238ebec1: failed to resolve (Invalid argument)
[2012-11-09 18:07:55.956731] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 1091538: GETATTR 140550774108352 (ba87ed4b-5b55-4ef5-b53e-50d3238ebec1) resolution failed
[2012-11-09 18:08:06.303144] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in /13a3d358-65bd-4d03-bfcf-e6bcb6c8a176. holes=0 overlaps=2
[2012-11-09 18:08:06.304492] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in <gfid:ba87ed4b-5b55-4ef5-b53e-50d3238ebec1>. holes=1 overlaps=1
[2012-11-09 18:08:06.304526] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: ba87ed4b-5b55-4ef5-b53e-50d3238ebec1: failed to resolve (Invalid argument)
[2012-11-09 18:08:06.304565] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 1095142: GETATTR 140550774108352 (ba87ed4b-5b55-4ef5-b53e-50d3238ebec1) resolution failed
[2012-11-09 18:08:16.646334] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in /13a3d358-65bd-4d03-bfcf-e6bcb6c8a176. holes=0 overlaps=2
[2012-11-09 18:08:16.647740] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in <gfid:ba87ed4b-5b55-4ef5-b53e-50d3238ebec1>. holes=1 overlaps=1
[2012-11-09 18:08:16.647773] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: ba87ed4b-5b55-4ef5-b53e-50d3238ebec1: failed to resolve (Invalid argument)
[2012-11-09 18:08:16.647789] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 1110890: GETATTR 140550774108352 (ba87ed4b-5b55-4ef5-b53e-50d3238ebec1) resolution failed

vdsm logs
=========
Thread-6625::DEBUG::2012-11-09 18:53:54,639::resourceManager::565::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.1cfe1c4b-1d33-46ab-9c53-4fb65d7a892d', Clearing records.
Thread-6625::ERROR::2012-11-09 18:53:54,640::task::853::TaskManager.Task::(_setError) Task=`8a12bb1e-d17d-42d4-8c31-03deac10a056`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 853, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 895, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 648, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1178, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1522, in getMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=1cfe1c4b-1d33-46ab-9c53-4fb65d7a892d, msdUUID=13a3d358-65bd-4d03-bfcf-e6bcb6c8a176'
Thread-6625::DEBUG::2012-11-09 18:53:54,640::task::872::TaskManager.Task::(_run) Task=`8a12bb1e-d17d-42d4-8c31-03deac10a056`::Task._run: 8a12bb1e-d17d-42d4-8c31-03deac10a056 ('1cfe1c4b-1d33-46ab-9c53-4fb65d7a892d', 1, '1cfe1c4b-1d33-46ab-9c53-4fb65d7a892d', '13a3d358-65bd-4d03-bfcf-e6bcb6c8a176', 1) {} failed - stopping task
Thread-6625::DEBUG::2012-11-09 18:53:54,641::task::1199::TaskManager.Task::(stop) Task=`8a12bb1e-d17d-42d4-8c31-03deac10a056`::stopping in state preparing (force False)
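For anyone retrying this, a rough gluster CLI sketch of the reproduction steps above (hostnames and brick paths are placeholders, not the exact ones used in this setup; creating the storage domain and the 2 VMs on the volume is done from RHEV-M and is not shown):

# 1. single-brick distribute volume
gluster volume create vmstore rhs-node1:/brick1
gluster volume set vmstore storage.owner-uid 36
gluster volume set vmstore storage.owner-gid 36
gluster volume start vmstore

# 2. create the storage domain on the volume and start 2 VMs from RHEV-M

# 3. expand the volume and rebalance
gluster volume add-brick vmstore rhs-node2:/brick1
gluster volume rebalance vmstore start
gluster volume rebalance vmstore status    # wait until it reports completed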
You manually unmounted the volume from the RHEL-H host and then activated the domain ... things started working.
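If that is the workaround, on the hypervisor it would have been roughly the following (the mount point path is a guess based on the prompt in the description; VDSM's actual path under /rhev/data-center/mnt may differ), after which the domain is activated again from RHEV-M so VDSM remounts the volume:

umount /rhev/data-center/mnt/rhs-client36.lab.eng.blr.redhat.com:vmstore
# a lazy unmount (umount -l) may be needed while the running VMs still hold
# open files on this mount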
> [2012-11-09 18:07:55.955351] I [dht-layout.c:593:dht_layout_normalize] 3-vmstore-dht: found anomalies in /13a3d358-65bd-4d03-bfcf-e6bcb6c8a176. holes=1 overlaps=1

An overlap of ranges is never a good sign for DHT, and that is why the storage domain is seen as inactive. It does not affect already-open fds (i.e., the running VMs), so that observation is consistent. Still trying to figure out what is wrong with the volume.

Can the reporter confirm whether expansion and rebalance of a volume with more than one brick works fine? If so, I would reduce the severity of this, since we would not support any volume with fewer than 4 bricks in the ideal scenario.
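One way to see the overlap directly is to dump the layout xattr for that directory on every brick and compare the hash ranges. A sketch, assuming the standard trusted.glusterfs.dht xattr (run on each brick server against its own brick path from the volume info above, i.e. /brick3 on rhs-client36 and /brick2 on the others):

getfattr -d -m . -e hex /brick2/13a3d358-65bd-4d03-bfcf-e6bcb6c8a176
# the last two 32-bit words of trusted.glusterfs.dht are the start and end of
# the hash range owned by that brick; across all bricks the ranges should tile
# 0x00000000-0xffffffff with no holes and no overlaps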
(In reply to comment #5)
> Can the reporter confirm whether expansion and rebalance of a volume with
> more than one brick works fine? If so, I would reduce the severity of this,
> since we would not support any volume with fewer than 4 bricks in the ideal
> scenario.

I tried with a 3x2 distributed-replicate volume and a 5-brick distribute volume, but the issue still persists.
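For completeness, the retest volumes would have been created along these lines (hostnames and brick paths are placeholders; a 3x2 distributed-replicate is 6 bricks with replica 2), followed by the same add-brick and rebalance steps as in the description (add-brick in pairs for the replicated volume):

# 3x2 distributed-replicate
gluster volume create vmtest replica 2 h1:/b1 h2:/b1 h3:/b1 h4:/b1 h5:/b1 h6:/b1
gluster volume start vmtest

# 5-brick plain distribute
gluster volume create vmtest2 h1:/b2 h2:/b2 h3:/b2 h4:/b2 h5:/b2
gluster volume start vmtest2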
This is not high priority since it is hit only when the volume created is a single-brick distribute volume, which is not a recommended configuration anyway. This will not be fixed in update 4; marking it for update 5 for now.
I recommend this be closed as NOTABUG, considering this is not a valid configuration. Any thoughts?
Per the 01/31 tiger team meeting, closing as INVALID.