Created attachment 620762
Attached the mnt, vdsm and rebalance logs

Description of problem:
VMs were in a paused state after rebalancing the volume.

Version-Release number of selected component (if applicable):
[root@rhs-gp-srv11 rebal]# rpm -qa | grep gluster
glusterfs-server-3.3.0rhsvirt1-6.el6rhs.x86_64
vdsm-gluster-4.9.6-14.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-fuse-3.3.0rhsvirt1-6.el6rhs.x86_64
glusterfs-geo-replication-3.3.0rhsvirt1-6.el6rhs.x86_64
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
gluster-swift-doc-1.4.8-4.el6.noarch
glusterfs-3.3.0rhsvirt1-6.el6rhs.x86_64
glusterfs-rdma-3.3.0rhsvirt1-6.el6rhs.x86_64
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-object-1.4.8-4.el6.noarch

How reproducible:

Steps to Reproduce:
1. Created a single-brick distribute volume.
2. Created a storage domain on it and created some VMs.
3. Added a brick from another peer and initiated a rebalance.
4. dd was running inside the VMs.
5. Restarted glusterd on one of the peers.

Actual results:
VMs were in a paused state after some time.
No rebalance errors were seen in the rebalance logs.

Additional info:

mnt.logs
=======
[2012-10-03 16:15:48.414640] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 3193106: GETATTR 140086284337492 (2f67f555-0f44-4679-ab8c-8a32e08b27d5) resolution failed
[2012-10-03 16:15:48.419910] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5 (b94fd3f7-448b-44bf-9f20-4f200d36ae2c)
[2012-10-03 16:15:48.420436] W [client3_1-fops.c:818:client3_1_statfs_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected
[2012-10-03 16:15:48.421933] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/dom_md (00000000-0000-0000-0000-000000000000)
[2012-10-03 16:15:48.422275] I [dht-layout.c:593:dht_layout_normalize] 1-rebal-dht: found anomalies in /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/dom_md. holes=1 overlaps=0
[2012-10-03 16:15:48.422310] W [dht-selfheal.c:875:dht_selfheal_directory] 1-rebal-dht: 1 subvolumes down -- not fixing
[2012-10-03 16:15:48.422450] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: <gfid:2f67f555-0f44-4679-ab8c-8a32e08b27d5> (00000000-0000-0000-0000-000000000000)
[2012-10-03 16:15:48.422885] I [dht-layout.c:593:dht_layout_normalize] 1-rebal-dht: found anomalies in <gfid:2f67f555-0f44-4679-ab8c-8a32e08b27d5>. holes=1 overlaps=0
[2012-10-03 16:15:48.422922] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: 2f67f555-0f44-4679-ab8c-8a32e08b27d5: failed to resolve (Invalid argument)
[2012-10-03 16:15:48.422937] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 3193116: GETATTR 140086284337492 (2f67f555-0f44-4679-ab8c-8a32e08b27d5) resolution failed
[2012-10-03 16:15:49.831243] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2012-10-03 16:15:49.831756] W [client3_1-fops.c:2566:client3_1_opendir_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2012-10-03 16:15:49.832252] W [client3_1-fops.c:2351:client3_1_readdirp_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected
[2012-10-03 16:15:49.833415] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5 (b94fd3f7-448b-44bf-9f20-4f200d36ae2c)
[2012-10-03 16:15:49.834253] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/dom_md (00000000-0000-0000-0000-000000000000)
[2012-10-03 16:15:49.834625] I [dht-layout.c:593:dht_layout_normalize] 1-rebal-dht: found anomalies in /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/dom_md. holes=1 overlaps=0
[2012-10-03 16:15:49.834650] W [dht-selfheal.c:875:dht_selfheal_directory] 1-rebal-dht: 1 subvolumes down -- not fixing
[2012-10-03 16:15:49.834794] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: <gfid:2f67f555-0f44-4679-ab8c-8a32e08b27d5> (00000000-0000-0000-0000-000000000000)
[2012-10-03 16:15:49.835165] I [dht-layout.c:593:dht_layout_normalize] 1-rebal-dht: found anomalies in <gfid:2f67f555-0f44-4679-ab8c-8a32e08b27d5>. holes=1 overlaps=0
[2012-10-03 16:15:49.835198] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: 2f67f555-0f44-4679-ab8c-8a32e08b27d5: failed to resolve (Invalid argument)
[2012-10-03 16:15:49.835214] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 3193128: GETATTR 140086284337492 (2f67f555-0f44-4679-ab8c-8a32e08b27d5) resolution failed
[2012-10-03 16:15:52.102341] I [glusterfsd-mgmt.c:1568:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2012-10-03 16:15:52.105840] I [client-handshake.c:1614:select_server_supported_programs] 1-rebal-client-0: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-03 16:15:52.106196] I [client-handshake.c:1411:client_setvolume_cbk] 1-rebal-client-0: Connected to 10.70.36.8:24013, attached to remote volume '/rebal'.
[2012-10-03 16:15:52.106216] I [client-handshake.c:1423:client_setvolume_cbk] 1-rebal-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-03 16:15:52.106229] I [client-handshake.c:1260:client_post_handshake] 1-rebal-client-0: 5 fds open - Delaying child_up until they are re-opened
[2012-10-03 16:15:52.107058] I [client-lk.c:601:decrement_reopen_fd_count] 1-rebal-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2012-10-03 16:15:52.107240] I [client-handshake.c:453:client_set_lk_version_cbk] 1-rebal-client-0: Server lk version = 1
[2012-10-03 16:16:32.139748] I [dht-layout.c:593:dht_layout_normalize] 1-rebal-dht: found anomalies in /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/2ee16797-3fe1-4595-97ef-b4b2306b412c. holes=1 overlaps=0

volume info
==========
Volume Name: rebal
Type: Distribute
Volume ID: 0952e193-a12c-420a-b752-a77c54b3bf98
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv4.lab.eng.blr.redhat.com:/rebal
Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/rebal
Looks like there is a brick down, and there are some permission issues.

[2012-10-03 16:13:47.546525] W [client3_1-fops.c:473:client3_1_open_cbk] 1-rebal-client-1: remote operation failed: Permission denied. Path: /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/6d6250b7-9b33-43d2-9e8f-b4a89dbe597e/3f183a59-71e9-4ed9-bd02-8e6309b016e2 (185ca417-c913-4b1b-b875-8e7a213df6b2)
[2012-10-03 16:13:47.546635] E [dht-helper.c:884:dht_rebalance_inprogress_task] 1-rebal-dht: (null): failed to send open() on target file at rebal-client-1

[2012-10-03 16:15:41.787932] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2012-10-03 16:15:41.788478] I [dht-layout.c:698:dht_layout_dir_mismatch] 1-rebal-dht: subvol: rebal-client-1; inode layout - 2147483647 - 4294967295; disk layout - 0 - 4294967295
[2012-10-03 16:15:41.788517] I [dht-common.c:596:dht_revalidate_cbk] 1-rebal-dht: mismatching layouts for /
[2012-10-03 16:15:41.788552] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001

Can you check whether the VMs that got paused were hosted on the brick in question?
Additionally, please verify the backend ownership and permissions of
/89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/6d6250b7-9b33-43d2-9e8f-b4a89dbe597e/3f183a59-71e9-4ed9-bd02-8e6309b016e2 (185ca417-c913-4b1b-b875-8e7a213df6b2)
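For triaging a client log like the one quoted in this comment, a small script can tally warnings and errors by translator and callback, which makes "brick down" versus "permission denied" patterns easier to spot. The following is only a minimal sketch: the regular expression is inferred from the log lines in this report and is an assumption, not an official description of the GlusterFS log format, and `summarize` and `SAMPLE` are hypothetical names.

```python
import re
from collections import Counter

# ASSUMPTION: line shape inferred from the logs quoted in this report:
# [timestamp] LEVEL [file.c:line:function] translator: message
LINE_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] "
    r"(?P<level>[A-Z]) "
    r"\[(?P<src>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] "
    r"(?P<xlator>[^:]+): (?P<msg>.*)"
)


def summarize(lines, levels=("W", "E")):
    """Count log entries per (level, translator, function) for the given levels."""
    counts = Counter()
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if match and match.group("level") in levels:
            key = (match.group("level"), match.group("xlator"), match.group("func"))
            counts[key] += 1
    return counts


# Three lines copied from the logs in this report.
SAMPLE = [
    "[2012-10-03 16:13:47.546635] E [dht-helper.c:884:dht_rebalance_inprogress_task] "
    "1-rebal-dht: (null): failed to send open() on target file at rebal-client-1",
    "[2012-10-03 16:15:41.787932] W [client3_1-fops.c:2650:client3_1_lookup_cbk] "
    "1-rebal-client-0: remote operation failed: Transport endpoint is not connected. "
    "Path: / (00000000-0000-0000-0000-000000000001)",
    "[2012-10-03 16:15:41.788517] I [dht-common.c:596:dht_revalidate_cbk] "
    "1-rebal-dht: mismatching layouts for /",
]

for (level, xlator, func), n in summarize(SAMPLE).most_common():
    print(n, level, xlator, func)
```

Informational (`I`) entries are skipped by default; pass `levels=("I", "W", "E")` to include them.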
(In reply to comment #2)
> Looks like there is a brick down, and some permission issues.
>
> [2012-10-03 16:13:47.546525] W [client3_1-fops.c:473:client3_1_open_cbk] 1-rebal-client-1: remote operation failed: Permission denied. Path: /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/6d6250b7-9b33-43d2-9e8f-b4a89dbe597e/3f183a59-71e9-4ed9-bd02-8e6309b016e2 (185ca417-c913-4b1b-b875-8e7a213df6b2)
> [2012-10-03 16:13:47.546635] E [dht-helper.c:884:dht_rebalance_inprogress_task] 1-rebal-dht: (null): failed to send open() on target file at rebal-client-1
>
> [2012-10-03 16:15:41.787932] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
> [2012-10-03 16:15:41.788478] I [dht-layout.c:698:dht_layout_dir_mismatch] 1-rebal-dht: subvol: rebal-client-1; inode layout - 2147483647 - 4294967295; disk layout - 0 - 4294967295
> [2012-10-03 16:15:41.788517] I [dht-common.c:596:dht_revalidate_cbk] 1-rebal-dht: mismatching layouts for /
> [2012-10-03 16:15:41.788552] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001
>
> Can you check if the vm's which got paused were hosted on the brick in question?
> Additionally, please verify the backend ownerships/permissions of
> /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/6d6250b7-9b33-43d2-9e8f-b4a89dbe597e/3f183a59-71e9-4ed9-bd02-8e6309b016e2 (185ca417-c913-4b1b-b875-8e7a213df6b2)

[root@rhs-gp-srv11 rebal]# ll 89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/6d6250b7-9b33-43d2-9e8f-b4a89dbe597e/3f183a59-71e9-4ed9-bd02-8e6309b016e2
-rw-rwS--T. 2 vdsm kvm 106707910656 Oct 4 11:14 89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/6d6250b7-9b33-43d2-9e8f-b4a89dbe597e/3f183a59-71e9-4ed9-bd02-8e6309b016e2

I could see a sticky bit set on this file; apart from that, the owner and group are correct.

I was checking the volume status during the rebalance, and all the bricks were up.
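The mode string `-rw-rwS--T` from the listing in this comment can be unpacked with Python's standard `stat` module. This is a minimal sketch: the octal value `0o3660` is an assumption, chosen because it is the mode that renders as that string for a regular file, and `describe_mode` is a hypothetical helper name. It suggests that the setgid bit (`S`) is set in addition to the sticky bit (`T`), with no execute bits.

```python
import stat


def describe_mode(mode):
    """Break a st_mode value into its string form and special bits."""
    return {
        "string": stat.filemode(mode),
        "setuid": bool(mode & stat.S_ISUID),
        "setgid": bool(mode & stat.S_ISGID),
        "sticky": bool(mode & stat.S_ISVTX),
        "permissions": oct(mode & 0o777),
    }


# ASSUMPTION: regular file with rw for owner and group, plus setgid and sticky.
mode = stat.S_IFREG | 0o3660
print(describe_mode(mode))
```

On a live brick the same helper could be fed `os.stat(path).st_mode` instead of the hard-coded value.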
Looks like SELinux is enabled.

[root@rhs-gp-srv11 ~]# sestatus
SELinux status:                 enabled
SELinuxfs mount:                /selinux
Current mode:                   permissive
Mode from config file:          enforcing
Policy version:                 24
Policy from config file:        targeted

[root@rhs-gp-srv11 ~]# getfattr -d -e hex -m . /rebal/89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/6d6250b7-9b33-43d2-9e8f-b4a89dbe597e/
getfattr: Removing leading '/' from absolute path names
# file: rebal/89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/6d6250b7-9b33-43d2-9e8f-b4a89dbe597e/
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.gfid=0xe67ebcfb240045a685f686095d5644cc
trusted.glusterfs.dht=0x00000001000000007fffffffffffffff

open() fails with permission errors on the same client.

[2012-10-03 16:13:47.546525] W [client3_1-fops.c:473:client3_1_open_cbk] 1-rebal-client-1: remote operation failed: Permission denied. Path: /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/6d6250b7-9b33-43d2-9e8f-b4a89dbe597e/3f183a59-71e9-4ed9-bd02-8e6309b016e2 (185ca417-c913-4b1b-b875-8e7a213df6b2)
[2012-10-03 16:13:47.546635] E [dht-helper.c:884:dht_rebalance_inprogress_task] 1-rebal-dht: (null): failed to send open() on target file at rebal-client-1
[2012-10-03 16:13:50.784324] W [client3_1-fops.c:473:client3_1_open_cbk] 1-rebal-client-1: remote operation failed: Permission denied. Path: /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/6d6250b7-9b33-43d2-9e8f-b4a89dbe597e/3f183a59-71e9-4ed9-bd02-8e6309b016e2 (185ca417-c913-4b1b-b875-8e7a213df6b2)
[2012-10-03 16:13:50.784408] I [client-helpers.c:100:this_fd_set_ctx] 1-rebal-client-1: /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/6d6250b7-9b33-43d2-9e8f-b4a89dbe597e/3f183a59-71e9-4ed9-bd02-8e6309b016e2 (185ca417-c913-4b1b-b875-8e7a213df6b2): trying duplicate remote fd set.
[2012-10-03 16:13:50.784416] E [dht-helper.c:884:dht_rebalance_inprogress_task] 1-rebal-dht: (null): failed to send open() on target file at rebal-client-1

Please disable SELinux and re-run your tests.
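The hex-encoded extended attributes in the getfattr output of this comment are easier to reason about once decoded. The sketch below is illustrative only: the helper names are hypothetical, and the interpretation of `trusted.glusterfs.dht` as four big-endian unsigned 32-bit fields (count, type, start, stop) is an assumption about the on-disk layout in this GlusterFS version, not something confirmed in this report.

```python
import struct


def decode_hex_xattr(value):
    """Turn a 'getfattr -e hex' value (0x...) into raw bytes."""
    return bytes.fromhex(value[2:] if value.startswith("0x") else value)


def decode_selinux(value):
    """Decode a security.selinux value into its context string (drops the trailing NUL)."""
    return decode_hex_xattr(value).rstrip(b"\x00").decode("ascii")


def decode_dht_layout(value):
    """Decode trusted.glusterfs.dht.

    ASSUMPTION: four big-endian uint32 fields in the order count, type, start, stop.
    """
    count, kind, start, stop = struct.unpack(">IIII", decode_hex_xattr(value))
    return {"count": count, "type": kind, "start": start, "stop": stop}


# Values copied from the getfattr output in this report.
selinux = "0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000"
dht = "0x00000001000000007fffffffffffffff"

print(decode_selinux(selinux))
print(decode_dht_layout(dht))
```

Under that assumption, the decoded start and stop values (2147483647 and 4294967295) are the same numbers that appear as the "inode layout" range for rebal-client-1 in the `dht_layout_dir_mismatch` message quoted earlier in this bug.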
I was not able to reproduce this bug after SELinux was disabled, so I am closing it.