Bug 862611 - VMs were in paused state after rebalancing a distribute volume
Summary: VMs were in paused state after rebalancing a distribute volume
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: unspecified
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: shishir gowda
QA Contact: Sudhir D
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-10-03 11:41 UTC by shylesh
Modified: 2013-12-09 01:33 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-10-05 06:57:48 UTC
Embargoed:


Attachments
Attached the mnt, vdsm and rebalance logs (38.72 KB, application/x-gzip)
2012-10-03 11:41 UTC, shylesh

Description shylesh 2012-10-03 11:41:28 UTC
Created attachment 620762
Attached the mnt, vdsm and rebalance logs

Description of problem:
VMs were in paused state after rebalancing the volume

Version-Release number of selected component (if applicable):
[root@rhs-gp-srv11 rebal]# rpm -qa | grep gluster
glusterfs-server-3.3.0rhsvirt1-6.el6rhs.x86_64
vdsm-gluster-4.9.6-14.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-fuse-3.3.0rhsvirt1-6.el6rhs.x86_64
glusterfs-geo-replication-3.3.0rhsvirt1-6.el6rhs.x86_64
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
gluster-swift-doc-1.4.8-4.el6.noarch
glusterfs-3.3.0rhsvirt1-6.el6rhs.x86_64
glusterfs-rdma-3.3.0rhsvirt1-6.el6rhs.x86_64
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-object-1.4.8-4.el6.noarch


How reproducible:


Steps to Reproduce:
1. Created a single-brick distribute volume.
2. Created a storage domain on this volume and created some VMs on it.
3. Added a brick from another peer and started a rebalance.
4. Ran dd inside the VMs.
5. Restarted glusterd on one of the peers.
  
Actual results:
The VMs went into the paused state after some time.

No errors were seen in the rebalance logs.

Additional info:
mnt.logs
=======
[2012-10-03 16:15:48.414640] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 3193106: GETATTR 140086284337492 (2f67f555-0f44-4679-ab8c-8a32e08b27d5) resolution failed
[2012-10-03 16:15:48.419910] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5 (b94fd3f7-448b-44bf-9f20-4f200d36ae2c)
[2012-10-03 16:15:48.420436] W [client3_1-fops.c:818:client3_1_statfs_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected
[2012-10-03 16:15:48.421933] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/dom_md (00000000-0000-0000-0000-000000000000)
[2012-10-03 16:15:48.422275] I [dht-layout.c:593:dht_layout_normalize] 1-rebal-dht: found anomalies in /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/dom_md. holes=1 overlaps=0
[2012-10-03 16:15:48.422310] W [dht-selfheal.c:875:dht_selfheal_directory] 1-rebal-dht: 1 subvolumes down -- not fixing
[2012-10-03 16:15:48.422450] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: <gfid:2f67f555-0f44-4679-ab8c-8a32e08b27d5> (00000000-0000-0000-0000-000000000000)
[2012-10-03 16:15:48.422885] I [dht-layout.c:593:dht_layout_normalize] 1-rebal-dht: found anomalies in <gfid:2f67f555-0f44-4679-ab8c-8a32e08b27d5>. holes=1 overlaps=0
[2012-10-03 16:15:48.422922] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: 2f67f555-0f44-4679-ab8c-8a32e08b27d5: failed to resolve (Invalid argument)
[2012-10-03 16:15:48.422937] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 3193116: GETATTR 140086284337492 (2f67f555-0f44-4679-ab8c-8a32e08b27d5) resolution failed
[2012-10-03 16:15:49.831243] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2012-10-03 16:15:49.831756] W [client3_1-fops.c:2566:client3_1_opendir_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2012-10-03 16:15:49.832252] W [client3_1-fops.c:2351:client3_1_readdirp_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected
[2012-10-03 16:15:49.833415] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5 (b94fd3f7-448b-44bf-9f20-4f200d36ae2c)
[2012-10-03 16:15:49.834253] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/dom_md (00000000-0000-0000-0000-000000000000)
[2012-10-03 16:15:49.834625] I [dht-layout.c:593:dht_layout_normalize] 1-rebal-dht: found anomalies in /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/dom_md. holes=1 overlaps=0
[2012-10-03 16:15:49.834650] W [dht-selfheal.c:875:dht_selfheal_directory] 1-rebal-dht: 1 subvolumes down -- not fixing
[2012-10-03 16:15:49.834794] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: <gfid:2f67f555-0f44-4679-ab8c-8a32e08b27d5> (00000000-0000-0000-0000-000000000000)
[2012-10-03 16:15:49.835165] I [dht-layout.c:593:dht_layout_normalize] 1-rebal-dht: found anomalies in <gfid:2f67f555-0f44-4679-ab8c-8a32e08b27d5>. holes=1 overlaps=0
[2012-10-03 16:15:49.835198] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: 2f67f555-0f44-4679-ab8c-8a32e08b27d5: failed to resolve (Invalid argument)
[2012-10-03 16:15:49.835214] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 3193128: GETATTR 140086284337492 (2f67f555-0f44-4679-ab8c-8a32e08b27d5) resolution failed
[2012-10-03 16:15:52.102341] I [glusterfsd-mgmt.c:1568:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2012-10-03 16:15:52.105840] I [client-handshake.c:1614:select_server_supported_programs] 1-rebal-client-0: Using Program GlusterFS 3.3.0rhsvirt1, Num (1298437), Version (330)
[2012-10-03 16:15:52.106196] I [client-handshake.c:1411:client_setvolume_cbk] 1-rebal-client-0: Connected to 10.70.36.8:24013, attached to remote volume '/rebal'.
[2012-10-03 16:15:52.106216] I [client-handshake.c:1423:client_setvolume_cbk] 1-rebal-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2012-10-03 16:15:52.106229] I [client-handshake.c:1260:client_post_handshake] 1-rebal-client-0: 5 fds open - Delaying child_up until they are re-opened
[2012-10-03 16:15:52.107058] I [client-lk.c:601:decrement_reopen_fd_count] 1-rebal-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2012-10-03 16:15:52.107240] I [client-handshake.c:453:client_set_lk_version_cbk] 1-rebal-client-0: Server lk version = 1
[2012-10-03 16:16:32.139748] I [dht-layout.c:593:dht_layout_normalize] 1-rebal-dht: found anomalies in /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/2ee16797-3fe1-4595-97ef-b4b2306b412c. holes=1 overlaps=0
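
For triage, a mount log like the one above can be grouped by severity, translator and callback to see which failure dominates. The following is a minimal sketch, not a GlusterFS tool: the record layout is inferred only from the lines in this report, and the names LOG_RE, parse_record and summarize are hypothetical.

```python
import re
from collections import Counter

# ASSUMPTION: record layout inferred from the excerpt in this report:
# [timestamp] LEVEL [source.c:line:function] translator: message
LOG_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"\[(?P<src>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] "
    r"(?P<xlator>[^:]+): (?P<msg>.*)$"
)

# Four records copied from the mount log in this report.
SAMPLE = [
    "[2012-10-03 16:15:49.831243] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)",
    "[2012-10-03 16:15:49.833415] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5 (b94fd3f7-448b-44bf-9f20-4f200d36ae2c)",
    "[2012-10-03 16:15:49.834650] W [dht-selfheal.c:875:dht_selfheal_directory] 1-rebal-dht: 1 subvolumes down -- not fixing",
    "[2012-10-03 16:15:49.835214] E [fuse-bridge.c:543:fuse_getattr_resume] 0-glusterfs-fuse: 3193128: GETATTR 140086284337492 (2f67f555-0f44-4679-ab8c-8a32e08b27d5) resolution failed",
]


def parse_record(text):
    """Return the fields of one log record as a dict, or None if it does not match."""
    match = LOG_RE.match(text.strip())
    return match.groupdict() if match else None


def summarize(lines):
    """Count records per (level, translator, function), skipping unparseable lines."""
    counts = Counter()
    for text in lines:
        record = parse_record(text)
        if record is None:
            continue
        counts[(record["level"], record["xlator"], record["func"])] += 1
    return counts


if __name__ == "__main__":
    for (level, xlator, func), n in summarize(SAMPLE).most_common():
        print(n, level, xlator, func)
```

Lines that were hard-wrapped when pasted will not match the pattern and are skipped, so wrapped records need to be rejoined before they are counted.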




volume info
==========
Volume Name: rebal
Type: Distribute
Volume ID: 0952e193-a12c-420a-b752-a77c54b3bf98
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv4.lab.eng.blr.redhat.com:/rebal
Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/rebal

Comment 2 shishir gowda 2012-10-04 05:22:50 UTC
Looks like there is a brick down, and some permission issues.

[2012-10-03 16:13:47.546525] W [client3_1-fops.c:473:client3_1_open_cbk] 1-rebal-client-1: remote operation failed: Permission denied. Path: /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/6d6250b7-9b33-43d2-9e8f-b4a89dbe597e/3f183a59-71e9-4ed9-bd02-8e6309b016e2 (185ca417-c913-4b1b-b875-8e7a213df6b2)
[2012-10-03 16:13:47.546635] E [dht-helper.c:884:dht_rebalance_inprogress_task] 1-rebal-dht: (null): failed to send open() on target file at rebal-client-1

[2012-10-03 16:15:41.787932] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2012-10-03 16:15:41.788478] I [dht-layout.c:698:dht_layout_dir_mismatch] 1-rebal-dht: subvol: rebal-client-1; inode layout - 2147483647 - 4294967295; disk layout - 0 - 4294967295
[2012-10-03 16:15:41.788517] I [dht-common.c:596:dht_revalidate_cbk] 1-rebal-dht: mismatching layouts for /
[2012-10-03 16:15:41.788552] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 1-rebal-client-0: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001

Can you check whether the VMs that got paused were hosted on the brick in question?
Additionally, please verify the backend ownerships/permissions of /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/6d6250b7-9b33-43d2-9e8f-b4a89dbe597e/3f183a59-71e9-4ed9-bd02-8e6309b016e2 (185ca417-c913-4b1b-b875-8e7a213df6b2)

Comment 3 shylesh 2012-10-04 06:25:30 UTC
(In reply to comment #2)


[root@rhs-gp-srv11 rebal]# ll 89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/6d6250b7-9b33-43d2-9e8f-b4a89dbe597e/3f183a59-71e9-4ed9-bd02-8e6309b016e2
-rw-rwS--T. 2 vdsm kvm 106707910656 Oct  4 11:14 89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/6d6250b7-9b33-43d2-9e8f-b4a89dbe597e/3f183a59-71e9-4ed9-bd02-8e6309b016e2 


I can see that the sticky bit is set on this file; apart from that, the owner and group are correct. I was checking the volume status during the rebalance, and all the bricks were up.
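
The mode string in the listing above, "-rw-rwS--T", encodes two special bits rather than one: "S" in the group position is setgid without group execute, and "T" is sticky without other execute. A minimal sketch of decoding it with Python's stat module follows; the MODE value is reconstructed by hand from the listing, and special_bits is a hypothetical helper. As far as I understand, the GlusterFS distribute translator uses the setgid plus sticky combination to mark a file that is being migrated by rebalance, but treat that reading as an assumption; nothing in this report confirms it.

```python
import stat

# Reconstructed from "-rw-rwS--T": a regular file, read/write for owner and
# group, setgid and sticky set, and no execute bits anywhere.
MODE = stat.S_IFREG | stat.S_ISGID | stat.S_ISVTX | 0o660


def special_bits(mode):
    """List the special permission bits that are set in a st_mode value."""
    names = []
    if mode & stat.S_ISUID:
        names.append("setuid")
    if mode & stat.S_ISGID:
        names.append("setgid")
    if mode & stat.S_ISVTX:
        names.append("sticky")
    return names


if __name__ == "__main__":
    print(stat.filemode(MODE))       # symbolic form, as ls would show it
    print(oct(stat.S_IMODE(MODE)))   # permission bits only, in octal
    print(special_bits(MODE))
```

On a live brick the same check could be run against os.stat(path).st_mode instead of the hand-built constant.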

Comment 4 shishir gowda 2012-10-04 08:19:53 UTC
Looks like SELinux is enabled.

[root@rhs-gp-srv11 ~]# sestatus 
SELinux status:                 enabled
SELinuxfs mount:                /selinux
Current mode:                   permissive
Mode from config file:          enforcing
Policy version:                 24
Policy from config file:        targeted


[root@rhs-gp-srv11 ~]# getfattr -d -e hex -m . /rebal/89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/6d6250b7-9b33-43d2-9e8f-b4a89dbe597e/
getfattr: Removing leading '/' from absolute path names
# file: rebal/89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/6d6250b7-9b33-43d2-9e8f-b4a89dbe597e/
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.gfid=0xe67ebcfb240045a685f686095d5644cc
trusted.glusterfs.dht=0x00000001000000007fffffffffffffff
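
The hex values printed by getfattr above can be decoded to make them readable. This is a minimal sketch: the helper names are hypothetical, and the layout of trusted.glusterfs.dht as four big-endian 32-bit fields (count, type, start, stop) is an assumption. That assumption is at least consistent with the "inode layout - 2147483647 - 4294967295" logged for rebal-client-1 in comment 2.

```python
import struct

# Values copied from the getfattr output in this comment.
XATTRS = {
    "security.selinux": "0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000",
    "trusted.glusterfs.dht": "0x00000001000000007fffffffffffffff",
}


def hex_to_bytes(value):
    """Convert a getfattr '-e hex' value (with or without 0x) to bytes."""
    return bytes.fromhex(value[2:] if value.startswith("0x") else value)


def decode_selinux(value):
    """Decode a security.selinux value into the context string."""
    return hex_to_bytes(value).rstrip(b"\x00").decode("ascii")


def decode_dht_layout(value):
    """Decode trusted.glusterfs.dht.

    ASSUMPTION: four big-endian unsigned 32-bit fields in the order
    count, type, start, stop.
    """
    count, kind, start, stop = struct.unpack(">IIII", hex_to_bytes(value))
    return {"count": count, "type": kind, "start": start, "stop": stop}


if __name__ == "__main__":
    print(decode_selinux(XATTRS["security.selinux"]))
    print(decode_dht_layout(XATTRS["trusted.glusterfs.dht"]))
```

The decoded SELinux type for this directory is file_t. Whether that label played a part in the "Permission denied" errors is not established by this report.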


open() fails with permission errors on the same client:
[2012-10-03 16:13:47.546525] W [client3_1-fops.c:473:client3_1_open_cbk] 1-rebal-client-1: remote operation failed: Permission denied. Path: /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/6d6250b7-9b33-43d2-9e8f-b4a89dbe597e/3f183a59-71e9-4ed9-bd02-8e6309b016e2 (185ca417-c913-4b1b-b875-8e7a213df6b2)
[2012-10-03 16:13:47.546635] E [dht-helper.c:884:dht_rebalance_inprogress_task] 1-rebal-dht: (null): failed to send open() on target file at rebal-client-1
[2012-10-03 16:13:50.784324] W [client3_1-fops.c:473:client3_1_open_cbk] 1-rebal-client-1: remote operation failed: Permission denied. Path: /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/6d6250b7-9b33-43d2-9e8f-b4a89dbe597e/3f183a59-71e9-4ed9-bd02-8e6309b016e2 (185ca417-c913-4b1b-b875-8e7a213df6b2)
[2012-10-03 16:13:50.784408] I [client-helpers.c:100:this_fd_set_ctx] 1-rebal-client-1: /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/images/6d6250b7-9b33-43d2-9e8f-b4a89dbe597e/3f183a59-71e9-4ed9-bd02-8e6309b016e2 (185ca417-c913-4b1b-b875-8e7a213df6b2): trying duplicate remote fd set. 
[2012-10-03 16:13:50.784416] E [dht-helper.c:884:dht_rebalance_inprogress_task] 1-rebal-dht: (null): failed to send open() on target file at rebal-client-1

Please disable SELinux and re-run your tests.

Comment 5 shylesh 2012-10-05 06:57:48 UTC
I was not able to reproduce this bug after SELinux was disabled, so I am closing it.

