Created attachment 625801
mnt, vdsm, rebalance, brick logs

Description of problem:
While testing rebalance on a pure distribute volume there were some disconnects on the bricks, and VMs hosted on this volume were not able to hibernate.

Version-Release number of selected component (if applicable):
[root@rhs-gp-srv4 ~]# rpm -qa | grep gluster
glusterfs-server-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-debuginfo-3.3.0rhsvirt1-7.el6rhs.x86_64
vdsm-gluster-4.9.6-14.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-fuse-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-geo-replication-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-devel-3.3.0rhsvirt1-7.el6rhs.x86_64
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
gluster-swift-doc-1.4.8-4.el6.noarch
glusterfs-3.3.0rhsvirt1-7.el6rhs.x86_64
glusterfs-rdma-3.3.0rhsvirt1-7.el6rhs.x86_64
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-object-1.4.8-4.el6.noarch

How reproducible:
Intermittent

Steps to Reproduce:
1. Created a single-brick volume
2. Made it a storage domain and created some VMs
3. Kept doing add-brick and rebalance till the brick count was 4
4. Then tried removing one of the bricks (see the command sketch after this comment)
5. While remove-brick was in progress, tried to pause a VM, which failed
6. The log shows disconnect messages for one of the bricks

Actual results:

Expected results:
Even if one of the bricks is down, the remaining partial data should be available.

Additional info:

Volume Name: vmstore
Type: Distribute
Volume ID: 91aa3e01-6330-44b7-acf1-9e5a20570cc8
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv4.lab.eng.blr.redhat.com:/another
Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/another
Brick3: rhs-gp-srv15.lab.eng.blr.redhat.com:/another
Options Reconfigured:
cluster.eager-lock: enable
storage.linux-aio: off
performance.read-ahead: disable
performance.stat-prefetch: disable
performance.io-cache: disable
performance.quick-read: disable
performance.write-behind: enable

The brick that has been removed is "rhs-gp-srv15.lab.eng.blr.redhat.com:/another".

Attached all logs.
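For reference, steps 3-5 above correspond roughly to the following CLI sequence. The exact invocations were not captured in the report, so the brick names are illustrative (taken from the volume info above and the document's own <VOL>-style placeholders):

gluster volume add-brick vmstore <new-server>:/another    # repeated until the brick count reached 4
gluster volume rebalance vmstore start
gluster volume rebalance vmstore status                   # wait for the rebalance to complete
gluster volume remove-brick vmstore rhs-gp-srv15.lab.eng.blr.redhat.com:/another start
gluster volume remove-brick vmstore rhs-gp-srv15.lab.eng.blr.redhat.com:/another status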
The brick which was actually removed is "rhs-gp-srv12.lab.eng.blr.redhat.com:/another"; sorry for the typo in the above comment.
Fix @ https://code.engineering.redhat.com/gerrit/#/c/123/ https://code.engineering.redhat.com/gerrit/#/c/122/1
Please use "gluster volume set <VOL> cluster.subvols-per-directory 1" on the volume and try again.
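For example, on the "vmstore" volume from the original report (the volume name here is just taken from that comment), the option would be set and verified roughly like this:

gluster volume set vmstore cluster.subvols-per-directory 1
gluster volume info vmstore    # the option should now appear under "Options Reconfigured"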
I tried with "subvols-per-directory 1" but I could still see the storage domain going down. Here are the details:
===================
[root@rhs-gp-srv6 81904dae-9b04-4748-b757-19679fabb1f7]# gluster v info

Volume Name: distribute
Type: Distribute
Volume ID: ca30b6d7-e02b-4fe9-b581-5dd8dedd5205
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv9.lab.eng.blr.redhat.com:/brick1/disk1
Brick2: rhs-gp-srv6.lab.eng.blr.redhat.com:/brick1/disk1
Brick3: rhs-gp-srv9.lab.eng.blr.redhat.com:/brick2
Options Reconfigured:
cluster.subvols-per-directory: 1
storage.owner-gid: 36
storage.owner-uid: 36
cluster.eager-lock: enable
storage.linux-aio: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off

ls on Brick1:
=============
[root@rhs-gp-srv9 81904dae-9b04-4748-b757-19679fabb1f7]# ll images/*
images/9899a2cc-f7e6-46f3-a28b-01be5c155d5d:
total 13008308
-rw-rw---- 2 vdsm kvm 20782391296 Jan 16 15:38 2e95c610-4b2d-4598-816c-3fa765371ea7
-rw-rw---- 2 vdsm kvm 1048576 Jan 16 10:08 2e95c610-4b2d-4598-816c-3fa765371ea7.lease
-rw-r--r-- 2 vdsm kvm 268 Jan 16 10:08 2e95c610-4b2d-4598-816c-3fa765371ea7.meta

images/ac9c4442-1b7e-4f72-a18c-1870069265fc:
total 29248220
-rw-rw---- 2 vdsm kvm 31511646208 Jan 16 15:00 752089d5-ba3d-457b-8755-ade373448573
-rw-rw---- 2 vdsm kvm 1048576 Jan 16 10:51 752089d5-ba3d-457b-8755-ade373448573.lease
-rw-r--r-- 2 vdsm kvm 268 Jan 16 10:51 752089d5-ba3d-457b-8755-ade373448573.meta

images/f7ee0951-82fe-4005-aed3-686b0aa30212:
total 20972548
-rw-rw---- 2 vdsm kvm 21474836480 Jan 16 15:31 d2b30c6d-86c3-4d3b-a217-fbc5dd79a82e
-rw-rw---- 2 vdsm kvm 1048576 Jan 16 10:43 d2b30c6d-86c3-4d3b-a217-fbc5dd79a82e.lease
-rw-r--r-- 2 vdsm kvm 274 Jan 16 10:43 d2b30c6d-86c3-4d3b-a217-fbc5dd79a82e.meta

images/f92089cf-46a1-4157-8df2-5449a1de2be8:
total 0

From Brick2:
=============
[root@rhs-gp-srv6 81904dae-9b04-4748-b757-19679fabb1f7]# ll images/*
images/9899a2cc-f7e6-46f3-a28b-01be5c155d5d:
total 0

images/ac9c4442-1b7e-4f72-a18c-1870069265fc:
total 0

images/f7ee0951-82fe-4005-aed3-686b0aa30212:
total 0

images/f92089cf-46a1-4157-8df2-5449a1de2be8:
total 565964
-rw-rw---- 2 vdsm kvm 1073741824 Jan 16 15:01 fb6f201c-c4e5-4efe-b28f-aabb498585cf
-rw-rw---- 2 vdsm kvm 1048576 Jan 16 11:01 fb6f201c-c4e5-4efe-b28f-aabb498585cf.lease
-rw-r--r-- 2 vdsm kvm 267 Jan 16 11:01 fb6f201c-c4e5-4efe-b28f-aabb498585cf.meta

From Brick3:
============
[root@rhs-gp-srv9 81904dae-9b04-4748-b757-19679fabb1f7]# ll images/*
images/9899a2cc-f7e6-46f3-a28b-01be5c155d5d:
total 0

images/ac9c4442-1b7e-4f72-a18c-1870069265fc:
total 0

images/f7ee0951-82fe-4005-aed3-686b0aa30212:
total 0

images/f92089cf-46a1-4157-8df2-5449a1de2be8:
total 0

Initially I had Brick1 and Brick2. I created 3 VMs on the storage domain, added a new brick (Brick3) and started rebalance. After the rebalance was over I brought down Brick2, which had only one VM hosted on it. Eventually the VM belonging to Brick2 got paused, which was expected, but when I tried pausing the other VMs it failed with the message "Error while executing action: Cannot hibernate VM. Low disk space on relevant Storage Domain." because the storage domain was down.

From the mount log:
==========================
[2013-01-16 15:24:05.486847] W [client3_1-fops.c:2655:client3_1_lookup_cbk] 1-distribute-client-1: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-01-16 15:24:05.487452] W [client3_1-fops.c:2571:client3_1_opendir_cbk] 1-distribute-client-1: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-01-16 15:24:05.488655] W [client3_1-fops.c:2356:client3_1_readdirp_cbk] 1-distribute-client-1: remote operation failed: Transport endpoint is not connected
[2013-01-16 15:24:05.489666] W [client3_1-fops.c:2655:client3_1_lookup_cbk] 1-distribute-client-1: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-01-16 15:24:05.490003] W [client3_1-fops.c:2655:client3_1_lookup_cbk] 1-distribute-client-1: remote operation failed: Transport endpoint is not connected. Path: /81904dae-9b04-4748-b757-19679fabb1f7 (32c7da5f-6b59-4b99-8cf4-5c6a08e4965e)
[2013-01-16 15:24:05.490371] W [client3_1-fops.c:529:client3_1_stat_cbk] 1-distribute-client-1: remote operation failed: Transport endpoint is not connected
[2013-01-16 15:24:05.490957] W [client3_1-fops.c:2655:client3_1_lookup_cbk] 1-distribute-client-1: remote operation failed: Transport endpoint is not connected. Path: /81904dae-9b04-4748-b757-19679fabb1f7/dom_md (00000000-0000-0000-0000-000000000000)
[2013-01-16 15:24:05.491266] I [dht-layout.c:598:dht_layout_normalize] 1-distribute-dht: found anomalies in /81904dae-9b04-4748-b757-19679fabb1f7/dom_md. holes=1 overlaps=0
[2013-01-16 15:24:05.491300] W [dht-selfheal.c:872:dht_selfheal_directory] 1-distribute-dht: 1 subvolumes down -- not fixing
[2013-01-16 15:24:05.491365] W [client3_1-fops.c:2655:client3_1_lookup_cbk] 1-distribute-client-1: remote operation failed: Transport endpoint is not connected. Path: /81904dae-9b04-4748-b757-19679fabb1f7/dom_md (06015e5d-386c-491e-9388-1f6fb4a2df62)
[2013-01-16 15:24:05.491712] W [client3_1-fops.c:2655:client3_1_lookup_cbk] 1-distribute-client-1: remote operation failed: Transport endpoint is not connected. Path: <gfid:06015e5d-386c-491e-9388-1f6fb4a2df62> (00000000-0000-0000-0000-000000000000)
[2013-01-16 15:24:05.492046] I [dht-layout.c:598:dht_layout_normalize] 1-distribute-dht: found anomalies in <gfid:06015e5d-386c-491e-9388-1f6fb4a2df62>. holes=1 overlaps=0
[2013-01-16 15:24:05.492077] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: 06015e5d-386c-491e-9388-1f6fb4a2df62: failed to resolve (Invalid argument)
[2013-01-16 15:24:05.492095] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 11932079: GETATTR 140403424420180 (06015e5d-386c-491e-9388-1f6fb4a2df62) resolution failed
=================================================================

From Brick1:
========================
[root@rhs-gp-srv9 81904dae-9b04-4748-b757-19679fabb1f7]# getfattr -d -e hex -m . /brick1/disk1/81904dae-9b04-4748-b757-19679fabb1f7/dom_md/
getfattr: Removing leading '/' from absolute path names
# file: brick1/disk1/81904dae-9b04-4748-b757-19679fabb1f7/dom_md/
trusted.gfid=0x06015e5d386c491e93881f6fb4a2df62
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

From Brick2:
========================
[root@rhs-gp-srv6 81904dae-9b04-4748-b757-19679fabb1f7]# getfattr -d -e hex -m . /brick1/disk1/81904dae-9b04-4748-b757-19679fabb1f7/dom_md/
getfattr: Removing leading '/' from absolute path names
# file: brick1/disk1/81904dae-9b04-4748-b757-19679fabb1f7/dom_md/
trusted.gfid=0x06015e5d386c491e93881f6fb4a2df62
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

From Brick3:
==========================
[root@rhs-gp-srv9 81904dae-9b04-4748-b757-19679fabb1f7]# getfattr -d -e hex -m . /brick2/81904dae-9b04-4748-b757-19679fabb1f7/dom_md/
getfattr: Removing leading '/' from absolute path names
# file: brick2/81904dae-9b04-4748-b757-19679fabb1f7/dom_md/
trusted.gfid=0x06015e5d386c491e93881f6fb4a2df62
trusted.glusterfs.dht=0x00000001000000000000000000000000

On hypervisor mount:
====================
[root@rhs-client1 81904dae-9b04-4748-b757-19679fabb1f7]# ll
ls: cannot access dom_md: No such file or directory
total 0
??????????? ? ? ? ? ? dom_md
drwxr-xr-x. 6 vdsm kvm 534 Jan 16 11:28 images
drwxr-xr-x. 4 vdsm kvm 84 Jan 16 11:28 master

Conclusion: part of the cluster was not available after one of the bricks went down.
The availability of the cluster depends on which subvolume went down. Even with "subvols-per-directory=1" set, we cannot guarantee availability all the time: what if the subvolume that went down held the information the cluster needs? The "subvols-per-directory=1" option only reduces the likelihood of error. If that kind of resilience is required, a distributed-replicate volume is a must, since losing a brick in a plain distribute volume does not guarantee availability.
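For comparison, a distributed-replicate volume would be created along these lines (the hostnames and brick paths below are made up purely for illustration):

gluster volume create vmstore replica 2 \
    serverA:/brick1 serverB:/brick1 \
    serverA:/brick2 serverB:/brick2
gluster volume start vmstore

With replica 2, every file and directory exists on two bricks, so losing a single brick does not take part of the namespace offline.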
Additionally, the error message is indicative of the subvolume being down, which led to holes in the directory layout:

[2013-01-16 15:24:05.491266] I [dht-layout.c:598:dht_layout_normalize] 1-distribute-dht: found anomalies in /81904dae-9b04-4748-b757-19679fabb1f7/dom_md. holes=1 overlaps=0
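The hole can be seen by comparing the trusted.glusterfs.dht ranges on each reachable brick, as was done in the earlier getfattr output: each brick's copy of the directory carries its slice of the hash range, and with one subvolume unreachable that slice has no owner, which is what "holes=1" reports. For example (path taken from the earlier comment):

getfattr -d -e hex -m trusted.glusterfs.dht /brick1/disk1/81904dae-9b04-4748-b757-19679fabb1f7/dom_md/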
This bug should be marked ON_QA again because the patch that makes 'subvols-per-directory=1' the default has now been accepted.
Targeting for 2.1.z (Big Bend) U1.
If a user is not using 'replicate' in their setup, we can't/won't support any high availability. With a distribute-only setup, it is expected that a subvolume going down makes part of the filesystem inaccessible. We will not fix this for plain distribute in the near future; high availability in GlusterFS comes only with the 'replicate' translator loaded in the graph.
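If the existing data needs to stay in place, an existing distribute volume can typically be converted by adding one new brick per existing brick while raising the replica count, assuming the installed release supports changing the replica count on add-brick (brick names below are illustrative, not from this report):

gluster volume add-brick distribute replica 2 \
    <new-server1>:/brick1/disk1 <new-server2>:/brick1/disk1 <new-server3>:/brick2

Self-heal then populates the newly added replica bricks with the existing data.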