Bug 865669

Summary: [RHEV-RHS] part of the cluster should be available even if one of the brick is down
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: shylesh <shmohan>
Component: glusterfs
Assignee: shishir gowda <sgowda>
Status: CLOSED WONTFIX
QA Contact: SATHEESARAN <sasundar>
Severity: high
Priority: medium
Version: unspecified
CC: amarts, grajaiya, kparthas, nsathyan, rhs-bugs, sdharane, sgowda, shaines, vagarwal, vbellur
Keywords: Reopened, ZStream
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Story Points: ---
Environment: virt rhev integration
Last Closed: 2013-10-03 06:23:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Bug Depends On: 875076
Attachments: mnt, vdsm, rebalance, brick logs

Description shylesh 2012-10-12 05:32:12 UTC
Created attachment 625801 [details]
mnt, vdsm, rebalance, brick logs

Description of problem:

While testing rebalance on a pure distribute volume, there were some disconnects on the bricks, and the VMs hosted on this volume were not able to hibernate.

Version-Release number of selected component (if applicable):
[root@rhs-gp-srv4 ~]# rpm -qa | grep gluster
glusterfs-server-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-debuginfo-3.3.0rhsvirt1-7.el6rhs.x86_64
vdsm-gluster-4.9.6-14.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-fuse-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-geo-replication-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-devel-3.3.0rhsvirt1-7.el6rhs.x86_64
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
gluster-swift-doc-1.4.8-4.el6.noarch
glusterfs-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-rdma-3.3.0rhsvirt1-7.el6rhs.x86_64
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-object-1.4.8-4.el6.noarch


How reproducible:
Intermittent

Steps to Reproduce:
1. Created a single-brick volume (see the command sketch below).
2. Made this volume a storage domain and created some VMs on it.
3. Kept doing add-brick and rebalance until the brick count was 4.
4. Then tried removing one of the bricks.
5. While remove-brick was in progress, tried to pause a VM, but the operation failed.
6. The log shows some disconnect messages for one of the bricks.
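
A rough sketch of the gluster CLI sequence behind the steps above (illustration only; <VOL> and the server:/brick paths are placeholders, not the exact names used in this run):

# step 1: single-brick distribute volume
gluster volume create <VOL> server1:/brick1
gluster volume start <VOL>
# step 2: add the volume as a storage domain in RHEV-M and create some VMs on it
# step 3: repeat add-brick + rebalance until the volume has 4 bricks
gluster volume add-brick <VOL> server2:/brick2
gluster volume rebalance <VOL> start
# step 4: start removing one of the bricks
gluster volume remove-brick <VOL> server2:/brick2 start
# step 5: while remove-brick is in progress, try to pause/hibernate a VM from RHEV-M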
  
Actual results:
While one of the bricks was disconnected, the VMs hosted on the volume could not be paused/hibernated.

Expected results:
Even if one of the bricks is down, the data on the remaining bricks should still be available.

Additional info:

Volume Name: vmstore
Type: Distribute
Volume ID: 91aa3e01-6330-44b7-acf1-9e5a20570cc8
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv4.lab.eng.blr.redhat.com:/another
Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/another
Brick3: rhs-gp-srv15.lab.eng.blr.redhat.com:/another
Options Reconfigured:
cluster.eager-lock: enable
storage.linux-aio: off
performance.read-ahead: disable
performance.stat-prefetch: disable
performance.io-cache: disable
performance.quick-read: disable
performance.write-behind: enable

One of the bricks has been removed: "rhs-gp-srv15.lab.eng.blr.redhat.com:/another"

All logs are attached.

Comment 2 shylesh 2012-10-12 05:43:44 UTC
The brick that was actually removed is
"rhs-gp-srv12.lab.eng.blr.redhat.com:/another"; sorry for the typo in the comment above.

Comment 4 Amar Tumballi 2012-11-05 12:06:57 UTC
Please set "gluster volume set <VOL> cluster.subvols-per-directory 1" on the volume and try again.
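
As a minimal sketch (with <VOL> as a placeholder), the option can be applied and then checked under the "Options Reconfigured" section of the volume info output:

# set the DHT option suggested above, then verify it took effect
gluster volume set <VOL> cluster.subvols-per-directory 1
gluster volume info <VOL>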

Comment 5 shylesh 2013-01-16 12:50:22 UTC
I tried with "subvols-per-directory 1", but I could still see the storage domain going down.

Here are the details
====================
[root@rhs-gp-srv6 81904dae-9b04-4748-b757-19679fabb1f7]# gluster v info
 
Volume Name: distribute
Type: Distribute
Volume ID: ca30b6d7-e02b-4fe9-b581-5dd8dedd5205
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv9.lab.eng.blr.redhat.com:/brick1/disk1
Brick2: rhs-gp-srv6.lab.eng.blr.redhat.com:/brick1/disk1
Brick3: rhs-gp-srv9.lab.eng.blr.redhat.com:/brick2
Options Reconfigured:
cluster.subvols-per-directory: 1
storage.owner-gid: 36
storage.owner-uid: 36
cluster.eager-lock: enable
storage.linux-aio: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off



ls on BRICK1:
=============
[root@rhs-gp-srv9 81904dae-9b04-4748-b757-19679fabb1f7]# ll images/*
images/9899a2cc-f7e6-46f3-a28b-01be5c155d5d:
total 13008308
-rw-rw---- 2 vdsm kvm 20782391296 Jan 16 15:38 2e95c610-4b2d-4598-816c-3fa765371ea7
-rw-rw---- 2 vdsm kvm     1048576 Jan 16 10:08 2e95c610-4b2d-4598-816c-3fa765371ea7.lease
-rw-r--r-- 2 vdsm kvm         268 Jan 16 10:08 2e95c610-4b2d-4598-816c-3fa765371ea7.meta

images/ac9c4442-1b7e-4f72-a18c-1870069265fc:
total 29248220
-rw-rw---- 2 vdsm kvm 31511646208 Jan 16 15:00 752089d5-ba3d-457b-8755-ade373448573
-rw-rw---- 2 vdsm kvm     1048576 Jan 16 10:51 752089d5-ba3d-457b-8755-ade373448573.lease
-rw-r--r-- 2 vdsm kvm         268 Jan 16 10:51 752089d5-ba3d-457b-8755-ade373448573.meta

images/f7ee0951-82fe-4005-aed3-686b0aa30212:
total 20972548
-rw-rw---- 2 vdsm kvm 21474836480 Jan 16 15:31 d2b30c6d-86c3-4d3b-a217-fbc5dd79a82e
-rw-rw---- 2 vdsm kvm     1048576 Jan 16 10:43 d2b30c6d-86c3-4d3b-a217-fbc5dd79a82e.lease
-rw-r--r-- 2 vdsm kvm         274 Jan 16 10:43 d2b30c6d-86c3-4d3b-a217-fbc5dd79a82e.meta

images/f92089cf-46a1-4157-8df2-5449a1de2be8:
total 0



From Brick2
=============
[root@rhs-gp-srv6 81904dae-9b04-4748-b757-19679fabb1f7]# ll images/*
images/9899a2cc-f7e6-46f3-a28b-01be5c155d5d:
total 0

images/ac9c4442-1b7e-4f72-a18c-1870069265fc:
total 0

images/f7ee0951-82fe-4005-aed3-686b0aa30212:
total 0

images/f92089cf-46a1-4157-8df2-5449a1de2be8:
total 565964
-rw-rw---- 2 vdsm kvm 1073741824 Jan 16 15:01 fb6f201c-c4e5-4efe-b28f-aabb498585cf
-rw-rw---- 2 vdsm kvm    1048576 Jan 16 11:01 fb6f201c-c4e5-4efe-b28f-aabb498585cf.lease
-rw-r--r-- 2 vdsm kvm        267 Jan 16 11:01 fb6f201c-c4e5-4efe-b28f-aabb498585cf.meta


From Brick3:
============
[root@rhs-gp-srv9 81904dae-9b04-4748-b757-19679fabb1f7]# ll images/*
images/9899a2cc-f7e6-46f3-a28b-01be5c155d5d:
total 0

images/ac9c4442-1b7e-4f72-a18c-1870069265fc:
total 0

images/f7ee0951-82fe-4005-aed3-686b0aa30212:
total 0

images/f92089cf-46a1-4157-8df2-5449a1de2be8:
total 0



Initially I had Brick1 and Brick2, and I created 3 VMs on the storage domain,
then added a new brick (Brick3) and started rebalance.
After the rebalance was over, I brought down Brick2, which hosted only one VM. Eventually the VM belonging to Brick2 got paused, which was expected, but pausing the other VMs failed with the message "Error while executing action: Cannot hibernate VM. Low disk space on relevant Storage Domain." because the storage domain was down.

From the mount log
==========================
[2013-01-16 15:24:05.486847] W [client3_1-fops.c:2655:client3_1_lookup_cbk] 1-distribute-client-1: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-01-16 15:24:05.487452] W [client3_1-fops.c:2571:client3_1_opendir_cbk] 1-distribute-client-1: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-01-16 15:24:05.488655] W [client3_1-fops.c:2356:client3_1_readdirp_cbk] 1-distribute-client-1: remote operation failed: Transport endpoint is not connected
[2013-01-16 15:24:05.489666] W [client3_1-fops.c:2655:client3_1_lookup_cbk] 1-distribute-client-1: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-01-16 15:24:05.490003] W [client3_1-fops.c:2655:client3_1_lookup_cbk] 1-distribute-client-1: remote operation failed: Transport endpoint is not connected. Path: /81904dae-9b04-4748-b757-19679fabb1f7 (32c7da5f-6b59-4b99-8cf4-5c6a08e4965e)
[2013-01-16 15:24:05.490371] W [client3_1-fops.c:529:client3_1_stat_cbk] 1-distribute-client-1: remote operation failed: Transport endpoint is not connected
[2013-01-16 15:24:05.490957] W [client3_1-fops.c:2655:client3_1_lookup_cbk] 1-distribute-client-1: remote operation failed: Transport endpoint is not connected. Path: /81904dae-9b04-4748-b757-19679fabb1f7/dom_md (00000000-0000-0000-0000-000000000000)
[2013-01-16 15:24:05.491266] I [dht-layout.c:598:dht_layout_normalize] 1-distribute-dht: found anomalies in /81904dae-9b04-4748-b757-19679fabb1f7/dom_md. holes=1 overlaps=0
[2013-01-16 15:24:05.491300] W [dht-selfheal.c:872:dht_selfheal_directory] 1-distribute-dht: 1 subvolumes down -- not fixing
[2013-01-16 15:24:05.491365] W [client3_1-fops.c:2655:client3_1_lookup_cbk] 1-distribute-client-1: remote operation failed: Transport endpoint is not connected. Path: /81904dae-9b04-4748-b757-19679fabb1f7/dom_md (06015e5d-386c-491e-9388-1f6fb4a2df62)
[2013-01-16 15:24:05.491712] W [client3_1-fops.c:2655:client3_1_lookup_cbk] 1-distribute-client-1: remote operation failed: Transport endpoint is not connected. Path: <gfid:06015e5d-386c-491e-9388-1f6fb4a2df62> (00000000-0000-0000-0000-000000000000)
[2013-01-16 15:24:05.492046] I [dht-layout.c:598:dht_layout_normalize] 1-distribute-dht: found anomalies in <gfid:06015e5d-386c-491e-9388-1f6fb4a2df62>. holes=1 overlaps=0
[2013-01-16 15:24:05.492077] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: 06015e5d-386c-491e-9388-1f6fb4a2df62: failed to resolve (Invalid argument)
[2013-01-16 15:24:05.492095] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 11932079: GETATTR 140403424420180 (06015e5d-386c-491e-9388-1f6fb4a2df62) resolution failed
=================================================================

From Brick1:
========================
[root@rhs-gp-srv9 81904dae-9b04-4748-b757-19679fabb1f7]# getfattr -d -e hex -m . /brick1/disk1/81904dae-9b04-4748-b757-19679fabb1f7/dom_md/
getfattr: Removing leading '/' from absolute path names
# file: brick1/disk1/81904dae-9b04-4748-b757-19679fabb1f7/dom_md/
trusted.gfid=0x06015e5d386c491e93881f6fb4a2df62
trusted.glusterfs.dht=0x000000010000000000000000ffffffff


From Brick2
========================
[root@rhs-gp-srv6 81904dae-9b04-4748-b757-19679fabb1f7]# getfattr -d -e hex -m . /brick1/disk1/81904dae-9b04-4748-b757-19679fabb1f7/dom_md/
getfattr: Removing leading '/' from absolute path names
# file: brick1/disk1/81904dae-9b04-4748-b757-19679fabb1f7/dom_md/
trusted.gfid=0x06015e5d386c491e93881f6fb4a2df62
trusted.glusterfs.dht=0x000000010000000000000000ffffffff


From Brick3
==========================
[root@rhs-gp-srv9 81904dae-9b04-4748-b757-19679fabb1f7]# getfattr -d -e hex -m . /brick2/81904dae-9b04-4748-b757-19679fabb1f7/dom_md/ 
getfattr: Removing leading '/' from absolute path names
# file: brick2/81904dae-9b04-4748-b757-19679fabb1f7/dom_md/
trusted.gfid=0x06015e5d386c491e93881f6fb4a2df62
trusted.glusterfs.dht=0x00000001000000000000000000000000
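
For readability, a small sketch of one way to decode the trusted.glusterfs.dht values shown above, assuming the commonly described on-disk layout of four big-endian 32-bit fields (count, hash type, range start, range end); this is illustrative only, not an authoritative description of the xattr format:

# illustrative decoding only; the field layout is an assumption
val=000000010000000000000000ffffffff   # value from Brick1/Brick2 above
echo "count=$((16#${val:0:8})) type=$((16#${val:8:8})) start=$((16#${val:16:8})) end=$((16#${val:24:8}))"
# -> count=1 type=0 start=0 end=4294967295, i.e. the full hash range;
#    Brick3's value ends in 00000000, which decodes to an empty range (start=0, end=0).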


On hypervisor mount
====================
[root@rhs-client1 81904dae-9b04-4748-b757-19679fabb1f7]# ll
ls: cannot access dom_md: No such file or directory
total 0
??????????? ? ?    ?     ?            ? dom_md
drwxr-xr-x. 6 vdsm kvm 534 Jan 16 11:28 images
drwxr-xr-x. 4 vdsm kvm  84 Jan 16 11:28 master


Conclusion: part of the cluster was not available after one of the bricks went down.

Comment 6 shishir gowda 2013-01-24 05:05:06 UTC
The availability of the cluster depends on which subvolume went down. Even with "subvols-per-directory=1" set, we cannot guarantee availability all the time. What if the subvolume that went down held the cluster's information?

"subvols-per-directory=1" option only tries to reduce the percentage of error.
If such kind of a resilence is required, using a distributed-replica volume is a must. As killing a brick in a distribute volume does not guarantee availability.
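
For illustration, a minimal sketch of creating the suggested distributed-replicate volume (a 2x2 layout; hostnames and brick paths are placeholders):

gluster volume create <VOL> replica 2 \
    server1:/brick1 server2:/brick1 \
    server3:/brick1 server4:/brick1
gluster volume start <VOL>
# With replica 2, each file is stored on two bricks, so a single brick going
# down within a replica pair does not take that part of the namespace offline.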

Comment 7 shishir gowda 2013-01-24 05:22:55 UTC
Additionally, the error message is indicative of the subvolume being down, which led to holes:
[2013-01-16 15:24:05.491266] I [dht-layout.c:598:dht_layout_normalize] 1-distribute-dht: found anomalies in /81904dae-9b04-4748-b757-19679fabb1f7/dom_md. holes=1 overlaps=0

Comment 8 Amar Tumballi 2013-02-28 09:26:27 UTC
This bug should be marked ON_QA again because the patch that makes 'subvols-per-directory=1' the default has now been accepted.

Comment 9 Scott Haines 2013-09-23 19:46:56 UTC
Targeting for 2.1.z (Big Bend) U1.

Comment 11 Amar Tumballi 2013-10-03 06:23:20 UTC
If a user is not using 'replicate' in their setup, we can't/won't support any high availability. With a distribute-only setup, it is expected that a subvolume going down makes part of the filesystem inaccessible.

We will not fix this issue for distribute-only volumes in the near future. High availability in GlusterFS comes only with the 'replicate' translator loaded in the graph.