Description of problem:
Starting a rebalance operation brings VMs into a paused state.

Version-Release number of selected component (if applicable):
[root@rhs1-bb ~]# rpm -qa | grep gluster
glusterfs-server-3.4.0.3rhs-1.el6rhs.x86_64
glusterfs-fuse-3.4.0.3rhs-1.el6rhs.x86_64
glusterfs-devel-3.4.0.3rhs-1.el6rhs.x86_64
glusterfs-3.4.0.3rhs-1.el6rhs.x86_64
glusterfs-debuginfo-3.4.0.3rhs-1.el6rhs.x86_64

How reproducible:

Steps to Reproduce:
1. Created a 6x2 distributed-replicate volume.
2. Created 5 VMs on this volume.
3. Added one more pair of bricks and started fix-layout.
4. Once fix-layout was over, issued the command: gluster volume rebalance vstore start

Actual results:
Rebalance ran for some time; while it was still in progress, the VMs got paused one by one.

Additional info:
[root@rhs1-bb ~]# gluster v info

Volume Name: vstore
Type: Distributed-Replicate
Volume ID: e8fe6a61-6345-41f0-9329-a802b051a026
Status: Started
Number of Bricks: 7 x 2 = 14
Transport-type: tcp
Bricks:
Brick1: 10.70.37.76:/brick1/vs1
Brick2: 10.70.37.133:/brick1/vs1
Brick3: 10.70.37.76:/brick2/vs2
Brick4: 10.70.37.133:/brick2/vs2
Brick5: 10.70.37.76:/brick3/vs3
Brick6: 10.70.37.133:/brick3/vs3
Brick7: 10.70.37.76:/brick4/vs4
Brick8: 10.70.37.133:/brick4/vs4
Brick9: 10.70.37.76:/brick5/vs5
Brick10: 10.70.37.133:/brick5/vs5
Brick11: 10.70.37.76:/brick6/vs6
Brick12: 10.70.37.133:/brick6/vs6
Brick13: 10.70.37.134:/brick1/vs1
Brick14: 10.70.37.59:/brick1/vs1
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: on

Errors from the hypervisor mount
================================
[2013-05-06 13:11:16.803849] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vstore-client-5: remote operation failed: Bad file descriptor
[2013-05-06 13:11:16.803907] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vstore-client-4: remote operation failed: Bad file descriptor
[2013-05-06 13:11:16.803939] W [fuse-bridge.c:2127:fuse_writev_cbk] 0-glusterfs-fuse: 546926: WRITE => -1 (Bad file descriptor)
[2013-05-06 13:11:16.805217] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vstore-client-4: remote operation failed: Bad file descriptor
[2013-05-06 13:11:16.805422] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vstore-client-5: remote operation failed: Bad file descriptor
[2013-05-06 13:11:16.805451] W [fuse-bridge.c:2127:fuse_writev_cbk] 0-glusterfs-fuse: 546928: WRITE => -1 (Bad file descriptor)
[2013-05-06 13:11:16.807145] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vstore-client-4: remote operation failed: Bad file descriptor
[2013-05-06 13:11:16.807230] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vstore-client-5: remote operation failed: Bad file descriptor
[2013-05-06 13:11:16.807259] W [fuse-bridge.c:2127:fuse_writev_cbk] 0-glusterfs-fuse: 546930: WRITE => -1 (Bad file descriptor)
[2013-05-06 13:11:16.809052] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vstore-client-5: remote operation failed: Bad file descriptor
[2013-05-06 13:11:16.809995] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vstore-client-4: remote operation failed: Bad file descriptor
[2013-05-06 13:11:16.810026] W [fuse-bridge.c:2127:fuse_writev_cbk] 0-glusterfs-fuse: 546932: WRITE => -1 (Bad file descriptor)
[2013-05-06 13:11:16.811380] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vstore-client-4: remote operation failed: Bad file descriptor
[2013-05-06 13:11:16.811564] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vstore-client-5: remote operation failed: Bad file descriptor
[2013-05-06 13:11:16.811589] W [fuse-bridge.c:2127:fuse_writev_cbk] 0-glusterfs-fuse: 546934: WRITE => -1 (Bad file descriptor)

The VMs could be stopped and restarted.

Attached the sosreports.
Looks like a graph switch at the client is leading to EBADF ("Bad file descriptor") errors, due to disconnections on the server side on the old graph.

[2013-05-06 12:34:57.980819] I [server-handshake.c:567:server_setvolume] 0-vstore-server: accepted client from rhs4-bb.lab.eng.blr.redhat.com-14007-2013/05/06-12:34:52:846644-vstore-client-0-0 (version: 3.4.0.3rhs)
[2013-05-06 12:34:57.987618] I [server-handshake.c:567:server_setvolume] 0-vstore-server: accepted client from rhs3-bb.lab.eng.blr.redhat.com-13915-2013/05/06-12:34:52:855118-vstore-client-0-0 (version: 3.4.0.3rhs)
[2013-05-06 12:34:58.003871] I [server-handshake.c:567:server_setvolume] 0-vstore-server: accepted client from rhs2-bb.lab.eng.blr.redhat.com-14187-2013/05/06-12:34:52:851734-vstore-client-0-0 (version: 3.4.0.3rhs)
[2013-05-06 12:34:59.015321] I [server.c:762:server_rpc_notify] 0-vstore-server: disconnecting connectionfrom rhs2-bb.lab.eng.blr.redhat.com-14187-2013/05/06-12:34:52:851734-vstore-client-0-0
[2013-05-06 12:34:59.015377] I [server-helpers.c:726:server_connection_put] 0-vstore-server: Shutting down connection rhs2-bb.lab.eng.blr.redhat.com-14187-2013/05/06-12:34:52:851734-vstore-client-0-0
[2013-05-06 12:34:59.015434] I [server-helpers.c:614:server_connection_destroy] 0-vstore-server: destroyed connection of rhs2-bb.lab.eng.blr.redhat.com-14187-2013/05/06-12:34:52:851734-vstore-client-0-0
[2013-05-06 12:34:59.016573] I [server.c:762:server_rpc_notify] 0-vstore-server: disconnecting connectionfrom rhs4-bb.lab.eng.blr.redhat.com-14007-2013/05/06-12:34:52:846644-vstore-client-0-0
[2013-05-06 12:34:59.016635] I [server-helpers.c:726:server_connection_put] 0-vstore-server: Shutting down connection rhs4-bb.lab.eng.blr.redhat.com-14007-2013/05/06-12:34:52:846644-vstore-client-0-0
[2013-05-06 12:34:59.016729] I [server-helpers.c:614:server_connection_destroy] 0-vstore-server: destroyed connection of rhs4-bb.lab.eng.blr.redhat.com-14007-2013/05/06-12:34:52:846644-vstore-client-0-0
[2013-05-06 12:34:59.031861] I [server.c:762:server_rpc_notify] 0-vstore-server: disconnecting connectionfrom rhs3-bb.lab.eng.blr.redhat.com-13915-2013/05/06-12:34:52:855118-vstore-client-0-0
[2013-05-06 12:34:59.031924] I [server-helpers.c:726:server_connection_put] 0-vstore-server: Shutting down connection rhs3-bb.lab.eng.blr.redhat.com-13915-2013/05/06-12:34:52:855118-vstore-client-0-0
[2013-05-06 12:34:59.031974] I [server-helpers.c:614:server_connection_destroy] 0-vstore-server: destroyed connection of rhs3-bb.lab.eng.blr.redhat.com-13915-2013/05/06-12:34:52:855118-vstore-client-0-0
[2013-05-06 12:35:57.880891] E [posix.c:2135:posix_writev] 0-vstore-posix: write failed: offset 526389248, Bad file descriptor
[2013-05-06 12:35:57.880972] I [server-rpc-fops.c:1439:server_writev_cbk] 0-vstore-server: 1712: WRITEV 1 (45228a74-2dbf-4871-9bf3-4e4550aaa7a8) ==> (Bad file descriptor)
[2013-05-06 12:35:57.909136] E [posix.c:2135:posix_writev] 0-vstore-posix: write failed: offset 530587648, Bad file descriptor
[2013-05-06 12:35:57.909201] I [server-rpc-fops.c:1439:server_writev_cbk] 0-vstore-server: 1715: WRITEV 1 (45228a74-2dbf-4871-9bf3-4e4550aaa7a8) ==> (Bad file descriptor)
[2013-05-06 12:35:57.911811] E [posix.c:2135:posix_writev] 0-vstore-posix: write failed: offset 11404472320, Bad file descriptor
[2013-05-06 12:35:57.911863] I [server-rpc-fops.c:1439:server_writev_cbk] 0-vstore-server: 1718: WRITEV 1 (45228a74-2dbf-4871-9bf3-4e4550aaa7a8) ==> (Bad file descriptor)
[2013-05-06 12:35:57.914596] E [posix.c:2135:posix_writev] 0-vstore-posix: write failed: offset 666689536, Bad file descriptor
[2013-05-06 12:35:57.914644] I [server-rpc-fops.c:1439:server_writev_cbk] 0-vstore-server: 1722: WRITEV 1 (45228a74-2dbf-4871-9bf3-4e4550aaa7a8) ==> (Bad file descriptor)
[2013-05-06 12:35:57.917624] E [posix.c:2135:posix_writev] 0-vstore-posix: write failed: offset 9275346944, Bad file descriptor
[2013-05-06 12:35:57.917808] I [server-rpc-fops.c:1439:server_writev_cbk] 0-vstore-server: 1725: WRITEV 1 (45228a74-2dbf-4871-9bf3-4e4550aaa7a8) ==> (Bad file descriptor)
[2013-05-06 12:52:56.316583] I [server.c:762:server_rpc_notify] 0-vstore-server: disconnecting connectionfrom rhs1-bb.lab.eng.blr.redhat.com-12009-2013/05/06-12:34:47:766675-vstore-client-0-0
[2013-05-06 12:52:56.316703] I [server-helpers.c:726:server_connection_put] 0-vstore-server: Shutting down connection rhs1-bb.lab.eng.blr.redhat.com-12009-2013/05/06-12:34:47:766675-vstore-client-0-0
[2013-05-06 12:52:56.349423] I [server-helpers.c:460:do_fd_cleanup] 0-vstore-server: fd cleanup on /f3e8bf4f-1791-4777-bb97-ab161efa7fcc/images/f87f3951-3c46-494e-be48-124ca38ee3fa/cca8ce16-c191-42a8-8e1e-bd7635bffe81
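To make the failure mode concrete: once the server has torn down the old-graph connection (the server_connection_destroy entries above), a write resolved against an fd from that connection fails the same way a plain write to a closed descriptor does, and FUSE propagates the error to the guest. A minimal POSIX analogue (illustrative only, not GlusterFS code):

/* ebadf_demo.c - illustrative analogue of the failure mode above:
 * writing through a descriptor that is no longer valid returns -1
 * with errno EBADF, i.e. "Bad file descriptor", matching the logs.
 * Build: gcc -o ebadf_demo ebadf_demo.c */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/ebadf-demo", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    close(fd);                  /* stands in for the old-graph teardown */
    if (write(fd, "x", 1) < 0)  /* the fd is gone, as in the logs      */
        perror("write");        /* prints: write: Bad file descriptor  */
    return 0;
}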
Issue reproduced on glusterfs-server-3.4.0.8rhs-1.el6rhs.x86_64.

Environment: RHEV + RHS
RHEVM: 3.2.0-10.21.master.el6ev
Hypervisor: RHEL 6.4
RHS: 4 nodes running gluster*3.4.0.8rhs-1.el6rhs.x86_64
Volume Name: RHEV-BigBend_extra

Bricks were added to the volume and rebalance was started as given below:
-----------------------------------------------------
[Thu May 16 13:54:00 root@rhs-client45:~ ] #gluster volume rebalance RHEV-BigBend_extra start
volume rebalance: RHEV-BigBend_extra: success: Starting rebalance on volume RHEV-BigBend_extra has been successful.
ID: 35858114-cb13-48ce-a189-c499aa480810

[Thu May 16 13:54:51 root@rhs-client45:~ ] #gluster volume rebalance RHEV-BigBend_extra status
Node                                  Rebalanced-files   size     scanned   failures   status        run time in secs
---------                             -----------        ------   -------   --------   -----------   ----------------
localhost                             5                  3.0MB    13        2          in progress   59.00
rhs-client37.lab.eng.blr.redhat.com   0                  0Bytes   17        0          completed     1.00
rhs-client4.lab.eng.blr.redhat.com    0                  0Bytes   17        0          completed     1.00
rhs-client15.lab.eng.blr.redhat.com   3                  8.9KB    19        2          completed     6.00
volume rebalance: RHEV-BigBend_extra: success:

....

[Thu May 16 14:04:36 root@rhs-client45:~ ] #gluster volume rebalance RHEV-BigBend_extra status
Node                                  Rebalanced-files   size     scanned   failures   status        run time in secs
---------                             -----------        ------   -------   --------   -----------   ----------------
localhost                             7                  20.0GB   17        2          in progress   677.00
rhs-client37.lab.eng.blr.redhat.com   0                  0Bytes   17        0          completed     1.00
rhs-client4.lab.eng.blr.redhat.com    0                  0Bytes   17        0          completed     1.00
rhs-client15.lab.eng.blr.redhat.com   3                  8.9KB    19        2          completed     6.00
volume rebalance: RHEV-BigBend_extra: success:

[Thu May 16 14:06:08 root@rhs-client45:~ ] #gluster volume rebalance RHEV-BigBend_extra status
Node                                  Rebalanced-files   size     scanned   failures   status        run time in secs
---------                             -----------        ------   -------   --------   -----------   ----------------
localhost                             9                  45.0GB   23        2          completed     691.00
rhs-client37.lab.eng.blr.redhat.com   0                  0Bytes   17        0          completed     1.00
rhs-client4.lab.eng.blr.redhat.com    0                  0Bytes   17        0          completed     1.00
rhs-client15.lab.eng.blr.redhat.com   3                  8.9KB    19        2          completed     6.00
volume rebalance: RHEV-BigBend_extra: success:
-----------------------------------------------------

2 VMs got paused during the operation. They were recoverable only after they were forcefully stopped and started.
It is interesting to note that VMs being migrated during the rebalance operation seem to recover automatically from the issue reported here, as seen during the verification of BZ 923523 (comment 8).
The rebalance process on one of the nodes had logged these. Two questions:
1) It is surprising how the inode became NULL, because of which inode_ctx_get failed.
2) Why was the node-uuid not obtained in the getxattr call?

[2013-05-06 12:34:58.468919] E [dht-helper.c:1054:dht_inode_ctx_get] (-->/usr/lib64/glusterfs/3.4.0.3rhs/xlator/cluster/distribute.so(dht_lookup_linkfile_create_cbk+0x75) [0x7f38fb120c85] (-->/usr/lib64/glusterfs/3.4.0.3rhs/xlator/cluster/distribute.so(dht_layout_preset+0x5e) [0x7f38fb10819e] (-->/usr/lib64/glusterfs/3.4.0.3rhs/xlator/cluster/distribute.so(dht_inode_ctx_layout_set+0x34) [0x7f38fb1094d4]))) 0-vstore-dht: invalid argument: inode
[2013-05-06 12:34:58.468983] E [dht-helper.c:1073:dht_inode_ctx_set] (-->/usr/lib64/glusterfs/3.4.0.3rhs/xlator/cluster/distribute.so(dht_lookup_linkfile_create_cbk+0x75) [0x7f38fb120c85] (-->/usr/lib64/glusterfs/3.4.0.3rhs/xlator/cluster/distribute.so(dht_layout_preset+0x5e) [0x7f38fb10819e] (-->/usr/lib64/glusterfs/3.4.0.3rhs/xlator/cluster/distribute.so(dht_inode_ctx_layout_set+0x52) [0x7f38fb1094f2]))) 0-vstore-dht: invalid argument: inode
[2013-05-06 12:34:58.469142] E [dht-common.c:2100:dht_getxattr] 0-vstore-dht: layout is NULL
[2013-05-06 12:34:58.469215] E [dht-rebalance.c:1210:gf_defrag_migrate_data] 0-vstore-dht: Failed to get node-uuid for /f3e8bf4f-1791-4777-bb97-ab161efa7fcc/images/333561b6-2bc7-4bde-ae79-41b4a9ad56ee/5f4cacb7-fa3c-46ee-82f8-47a892113119.lease
Rejy/Shanks, there are a couple more fixes in rebalance now which should have fixed this issue in Big Bend. Can we please test this once more?
We have 2 fixes in the rebalance/remove-brick code path, bugs 976755 and 981949, merged in. In addition, bug 981708 is a client-side fix which could potentially affect this bug. Could you please re-run these tests and check whether the issue is fixed? Please re-open the bug if the issue is hit again.
This issue is still reproducible on 3.4.0.18rhs-1.el6rhs.x86_64.

RHS nodes
========
10.70.37.113
10.70.37.133

Mounted on
=========
rhs-client36.lab.eng.blr.redhat.com

Mount point
===========
/rhev/data-center/mnt/10.70.37.113:vmstore

Volume Name: vmstore
Type: Distributed-Replicate
Volume ID: 10b93f79-2a1d-4737-8632-05f57c97db93
Status: Started
Number of Bricks: 7 x 2 = 14
Transport-type: tcp
Bricks:
Brick1: 10.70.37.113:/brick1/vss1
Brick2: 10.70.37.133:/brick1/vss1
Brick3: 10.70.37.113:/brick2/vss2
Brick4: 10.70.37.133:/brick2/vss2
Brick5: 10.70.37.113:/brick3/vss3
Brick6: 10.70.37.133:/brick3/vss3
Brick7: 10.70.37.113:/brick4/vss4
Brick8: 10.70.37.133:/brick4/vss4
Brick9: 10.70.37.113:/brick4/vss5
Brick10: 10.70.37.133:/brick5/vss5
Brick11: 10.70.37.113:/brick6/vss6
Brick12: 10.70.37.133:/brick6/vss6
Brick13: 10.70.37.113:/brick1/vss7
Brick14: 10.70.37.133:/brick1/vss7
Options Reconfigured:
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
storage.owner-uid: 36
storage.owner-gid: 36

Mount log messages (first entry truncated in the capture):
48325] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-4: remote operation failed: Bad file descriptor
[2013-08-08 10:52:46.951735] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-5: remote operation failed: Bad file descriptor
[2013-08-08 10:52:46.951771] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 1311765: WRITE => -1 (Bad file descriptor)
[2013-08-08 10:52:46.971392] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-4: remote operation failed: Bad file descriptor
[2013-08-08 10:52:46.975536] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-5: remote operation failed: Bad file descriptor
[2013-08-08 10:52:46.975575] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 1311773: WRITE => -1 (Bad file descriptor)
[2013-08-08 10:52:46.997078] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-5: remote operation failed: Bad file descriptor
[2013-08-08 10:52:46.997968] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-4: remote operation failed: Bad file descriptor
[2013-08-08 10:52:46.998002] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 1311776: WRITE => -1 (Bad file descriptor)
[2013-08-08 10:52:47.020290] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-5: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.020474] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-4: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.020508] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 1311778: WRITE => -1 (Bad file descriptor)
[2013-08-08 10:52:47.038749] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-4: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.039092] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-5: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.039123] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 1311780: WRITE => -1 (Bad file descriptor)
[2013-08-08 10:52:47.045422] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-4: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.047381] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-5: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.047412] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 1311782: WRITE => -1 (Bad file descriptor)
[2013-08-08 10:52:47.053965] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-4: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.054327] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-5: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.054356] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 1311784: WRITE => -1 (Bad file descriptor)
[2013-08-08 10:52:47.063849] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-4: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.064494] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-5: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.064523] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 1311786: WRITE => -1 (Bad file descriptor)
[2013-08-08 10:52:47.073986] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-4: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.074109] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-5: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.074138] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 1311788: WRITE => -1 (Bad file descriptor)
[2013-08-08 10:52:47.083434] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-4: remote operation failed: Bad file descriptor

sosreports @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/923774/
https://code.engineering.redhat.com/gerrit/11398 fixes the issue.
While verifying this bug, the same steps led to ext4 corruption on one of the app VMs.

Impact: ext4 corruption in an app VM

Volume Name: vmstore
Type: Distributed-Replicate
Volume ID: 10b93f79-2a1d-4737-8632-05f57c97db93
Status: Started
Number of Bricks: 8 x 2 = 16
Transport-type: tcp
Bricks:
Brick1: 10.70.37.113:/brick1/vss1
Brick2: 10.70.37.133:/brick1/vss1
Brick3: 10.70.37.113:/brick2/vss2
Brick4: 10.70.37.133:/brick2/vss2
Brick5: 10.70.37.113:/brick3/vss3
Brick6: 10.70.37.133:/brick3/vss3
Brick7: 10.70.37.113:/brick4/vss4
Brick8: 10.70.37.133:/brick4/vss4
Brick9: 10.70.37.113:/brick4/vss5
Brick10: 10.70.37.133:/brick5/vss5
Brick11: 10.70.37.113:/brick6/vss6
Brick12: 10.70.37.133:/brick6/vss6
Brick13: 10.70.37.113:/brick1/vss7
Brick14: 10.70.37.133:/brick1/vss7
Brick15: 10.70.37.113:/brick1/vss8
Brick16: 10.70.37.133:/brick1/vss8
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off

Cluster info
============
RHS nodes
---------
10.70.37.113
10.70.37.133

Hypervisor
==========
rhs-client36.lab.eng.blr.redhat.com

Mount point
===========
/rhev/data-center/mnt/10.70.37.113:vmstore

Mount log messages
===================
[2013-08-14 12:08:13.805531] I [client.c:2103:client_rpc_notify] 0-vmstore-client-11: disconnected from 10.70.37.133:49170. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-08-14 12:08:13.805540] E [afr-common.c:3832:afr_notify] 0-vmstore-replicate-5: All subvolumes are down. Going offline until atleast one of them comes back up.
[2013-08-14 12:08:13.805556] I [client.c:2103:client_rpc_notify] 0-vmstore-client-12: disconnected from 10.70.37.113:49164. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-08-14 12:08:13.805574] I [client.c:2103:client_rpc_notify] 0-vmstore-client-13: disconnected from 10.70.37.133:49171. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-08-14 12:08:13.805583] E [afr-common.c:3832:afr_notify] 0-vmstore-replicate-6: All subvolumes are down. Going offline until atleast one of them comes back up.
[2013-08-14 12:08:13.806891] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 1-vmstore-client-14: remote operation failed: Permission denied. Path: /05ba73ee-552a-4eb4-9368-6db52bac31ef (00000000-0000-0000-0000-000000000000)
[2013-08-14 12:08:13.807476] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 1-vmstore-client-14: remote operation failed: Permission denied. Path: /05ba73ee-552a-4eb4-9368-6db52bac31ef (00000000-0000-0000-0000-000000000000)
[2013-08-14 12:08:13.813842] I [dht-layout.c:633:dht_layout_normalize] 1-vmstore-dht: found anomalies in /05ba73ee-552a-4eb4-9368-6db52bac31ef. holes=1 overlaps=0 missing=0 down=0 misc=1
[2013-08-14 12:08:13.813876] W [dht-selfheal.c:916:dht_selfheal_directory] 1-vmstore-dht: 1 subvolumes have unrecoverable errors
[2013-08-14 12:08:13.814402] I [dht-layout.c:633:dht_layout_normalize] 1-vmstore-dht: found anomalies in /05ba73ee-552a-4eb4-9368-6db52bac31ef. holes=1 overlaps=0 missing=0 down=0 misc=1
[2013-08-14 12:08:13.814421] W [dht-selfheal.c:916:dht_selfheal_directory] 1-vmstore-dht: 1 subvolumes have unrecoverable errors
[2013-08-14 12:08:13.815177] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 1-vmstore-client-14: remote operation failed: Permission denied. Path: /05ba73ee-552a-4eb4-9368-6db52bac31ef (140eaaf5-c667-4a71-aef1-a69a50c249b0)
[2013-08-14 12:08:13.815222] I [dht-common.c:567:dht_revalidate_cbk] 1-vmstore-dht: subvolume vmstore-replicate-7 for /05ba73ee-552a-4eb4-9368-6db52bac31ef returned -1 (Permission denied)
[2013-08-14 12:08:13.815493] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 1-vmstore-client-14: remote operation failed: Permission denied. Path: /05ba73ee-552a-4eb4-9368-6db52bac31ef (140eaaf5-c667-4a71-aef1-a69a50c249b0)
[2013-08-14 12:08:13.815510] I [dht-common.c:567:dht_revalidate_cbk] 1-vmstore-dht: subvolume vmstore-replicate-7 for /05ba73ee-552a-4eb4-9368-6db52bac31ef returned -1 (Permission denied)
[2013-08-14 12:08:13.834774] I [dht-layout.c:633:dht_layout_normalize] 1-vmstore-dht: found anomalies in <gfid:140eaaf5-c667-4a71-aef1-a69a50c249b0>. holes=1 overlaps=0 missing=1 down=0 misc=0
[2013-08-14 12:08:13.835080] I [dht-layout.c:633:dht_layout_normalize] 1-vmstore-dht: found anomalies in <gfid:140eaaf5-c667-4a71-aef1-a69a50c249b0>. holes=1 overlaps=0 missing=1 down=0 misc=0
[2013-08-14 12:08:13.835545] W [client-rpc-fops.c:519:client3_3_stat_cbk] 1-vmstore-client-14: remote operation failed: No such file or directory
[2013-08-14 12:08:13.836324] W [client-rpc-fops.c:807:client3_3_statfs_cbk] 1-vmstore-client-14: remote operation failed: No such file

There are some permission-denied errors on one of the bricks. Brick15 and Brick16 are the newly added bricks, and then rebalance was invoked.

No VM pausing was seen, but ext4 corruption was seen on the VM (attached the ext4 corruption message snapshot).

Attached the sosreports.
Created attachment 786542 [details]
ext4 corruption snapshot
Created attachment 786599 [details]
program which can be run to test the bug
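The attachment itself is not reproduced here. As a rough sketch of the kind of check such a tester might perform (hypothetical code, not the attached program; it assumes a steady write loop against a file on the gluster mount is enough to surface the EBADF during rebalance):

/* write_loop.c - hypothetical sketch of a bug tester: keep writing to an
 * open fd on the gluster mount while a rebalance runs, and report the
 * first failure, mimicking the steady guest I/O that trips this bug.
 * Build: gcc -o write_loop write_loop.c
 * Usage: ./write_loop /rhev/data-center/mnt/<server>:<volume>/testfile */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file-on-gluster-mount>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_CREAT | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char buf[4096];
    memset(buf, 'x', sizeof(buf));

    /* Write forever; on a healthy mount pwrite() never fails here. */
    for (unsigned long i = 0; ; i++) {
        off_t off = (off_t)(i % 1024) * (off_t)sizeof(buf);
        if (pwrite(fd, buf, sizeof(buf), off) < 0) {
            fprintf(stderr, "write %lu failed: %s\n", i, strerror(errno));
            return errno == EBADF ? 2 : 1;  /* 2 = the reported symptom */
        }
        usleep(10000);  /* ~100 writes/sec, like steady guest I/O */
    }
}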
The gfid handle for the mentioned gfid can be found under /<brick-path>/.glusterfs/05/ba/05ba73ee-552a-4eb4-9368-6db52bac31ef.
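For reference, the handle path follows directly from the gfid: its first two hex byte pairs become two directory levels under .glusterfs. A small sketch of that mapping (the brick path below is just an example from this volume):

/* gfid_path.c - sketch of the .glusterfs handle layout:
 * /<brick>/.glusterfs/<gfid[0:2]>/<gfid[2:4]>/<gfid> */
#include <stdio.h>

int main(void)
{
    const char *brick = "/brick1/vss1";  /* example brick from this volume */
    const char *gfid  = "05ba73ee-552a-4eb4-9368-6db52bac31ef";
    char path[512];

    snprintf(path, sizeof(path), "%s/.glusterfs/%.2s/%.2s/%s",
             brick, gfid, gfid + 2, gfid);
    printf("%s\n", path);  /* /brick1/vss1/.glusterfs/05/ba/05ba73ee-... */
    return 0;
}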
Marking this bug as verified, as the original issue is no longer reproducible; opening a separate bug for the VM corruption issue.

Verified on 3.4.0.19rhs-2.el6rhs.x86_64.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html