Description of problem: rebalance destination glusterfsd processes got killed while doing geo-rep and rebalance in parallel with large data set, which resulted in geo-rep status to be faulty. Version-Release number of selected component (if applicable):glusterfs-3.4.0.35rhs-1.el6rhs.x86_64 How reproducible: Didn't try to reproduce Steps to Reproduce: 1.Create dist-rep volume (6x2) 2.Create a large set of data like 10*10*10000 of 10KB files. 10 top level directories , in each dir , 10 dir . In each of those 10 dirs , 10000 files of 10KB size 3.Add 2 more bricks and start rebalancing 4.Parallelly start geo-rep session with this volume as master, to some slave Actual results: rebalance destination glusterfsd processes got killed. Expected results: rebalance and geo-rep should work properly. Additional info: rebalance logs >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 13-10-22 06:20:53.593735] I [client.c:2103:client_rpc_notify] 0-georep-client-12: disconnected from 10.70.36.28:49192. Client process will keep trying to connect to glusterd until brick's port is available. [2013-10-22 06:20:53.593767] E [afr-common.c:3832:afr_notify] 0-georep-replicate-6: All subvolumes are down. Going offline until atleast one of them comes back up. [2013-10-22 06:20:53.593826] W [dht-common.c:5131:dht_notify] 0-georep-dht: Received CHILD_DOWN. Exiting [2013-10-22 06:20:53.593850] I [dht-rebalance.c:1788:gf_defrag_stop] 0-: Received stop command on rebalance [2013-10-22 06:20:53.797940] I [rpc-clnt.c:1687:rpc_clnt_reconfig] 0-georep-client-13: changing port to 49194 (from 0) [2013-10-22 06:20:53.804138] E [socket.c:2158:socket_connect_finish] 0-georep-client-13: connection to 10.70.36.33:49194 failed (Connection refused) [2013-10-22 06:20:54.804353] I [rpc-clnt.c:1687:rpc_clnt_reconfig] 0-georep-client-15: changing port to 49193 (from 0) [2013-10-22 06:20:54.808487] E [socket.c:2158:socket_connect_finish] 0-georep-client-15: connection to 10.70.36.28:49193 failed (Connection refused) [2013-10-22 06:20:56.810913] I [rpc-clnt.c:1687:rpc_clnt_reconfig] 0-georep-client-13: changing port to 49194 (from 0) [2013-10-22 06:20:56.817386] E [socket.c:2158:socket_connect_finish] 0-georep-client-13: connection to 10.70.36.33:49194 failed (Connection refused) [2013-10-22 06:20:57.817340] I [rpc-clnt.c:1687:rpc_clnt_reconfig] 0-georep-client-15: changing port to 49193 (from 0) [2013-10-22 06:20:57.823627] E [socket.c:2158:socket_connect_finish] 0-georep-client-15: connection to 10.70.36.28:49193 failed (Connection refused) [2013-10-22 06:21:00.823870] I [rpc-clnt.c:1687:rpc_clnt_reconfig] 0-georep-client-13: changing port to 49194 (from 0) [2013-10-22 06:21:00.830281] E [socket.c:2158:socket_connect_finish] 0-georep-client-13: connection to 10.70.36.33:49194 failed (Connection refused) [2013-10-22 06:21:01.829839] I [rpc-clnt.c:1687:rpc_clnt_reconfig] 0-georep-client-15: changing port to 49193 (from 0) [2013-10-22 06:21:01.835901] E [socket.c:2158:socket_connect_finish] 0-georep-client-15: connection to 10.70.36.28:49193 failed (Connection refused) [2013-10-22 06:21:03.957570] I [rpc-clnt.c:1687:rpc_clnt_reconfig] 0-georep-client-12: changing port to 49192 (from 0) [2013-10-22 06:21:03.964102] E [socket.c:2158:socket_connect_finish] 0-georep-client-12: connection to 10.70.36.28:49192 failed (Connection refused) [2013-10-22 06:21:04.049124] I [dht-rebalance.c:881:dht_migrate_file] 0-georep-dht: completed migration of /level01/level11/level21/level31/526117ef%%03LOS9RPN4 from subvolume georep-replicate-2 to georep-replicate-1 [2013-10-22 06:21:04.051816] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-georep-dht: Fix layout failed for /level01/level11/level21/level31 [2013-10-22 06:21:04.052859] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-georep-dht: Fix layout failed for /level01/level11/level21 [2013-10-22 06:21:04.070432] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-georep-dht: Fix layout failed for /level01/level11 [2013-10-22 06:21:04.135330] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-georep-dht: Fix layout failed for /level01 [2013-10-22 06:21:04.146217] I [dht-rebalance.c:1766:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 63769.00 secs [2013-10-22 06:21:04.146241] I [dht-rebalance.c:1769:gf_defrag_status_get] 0-glusterfs: Files migrated: 33958, size: 434089984, lookups: 157076, failures: 4, skipped: 527 [2013-10-22 06:21:04.167697] W [glusterfsd.c:1097:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x35a68e894d] (-->/lib64/libpthread.so.0() [0x35a7007851] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x4053cd]))) 0-: received signum (15), shutting down (END) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> one of the killed brick logs >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2013-10-22 05:45:19.990706] I [server.c:773:server_rpc_notify] 0-georep-server: disconnecting connection from rhs-client9.lab.eng.blr.redhat.com-22546-2013/10/22-05:17:17:434028-georep-client-13-0, Number of pending operations: 1 [2013-10-22 05:45:20.169203] I [server-helpers.c:757:server_connection_put] 0-georep-server: Shutting down connection rhs-client9.lab.eng.blr.redhat.com-22546-2013/10/22-05:17:17:434028-georep-client-13-0 [2013-10-22 05:45:20.633103] I [server-helpers.c:590:server_log_conn_destroy] 0-georep-server: destroyed connection of rhs-client9.lab.eng.blr.redhat.com-22546-2013/10/22-05:17:17:434028-georep-client-13-0 [2013-10-22 05:45:20.633144] W [socket.c:522:__socket_rwv] 0-georep-quota: readv on /tmp/quotad.socket failed (No data available) [2013-10-22 05:45:21.325010] I [glusterfsd-mgmt.c:1559:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing [2013-10-22 05:45:22.537176] I [server.c:773:server_rpc_notify] 0-georep-server: disconnecting connection from rhs-client40.lab.eng.blr.redhat.com-14960-2013/10/22-05:44:20:188621-georep-client-13-0, Number of pending operations: 1 [2013-10-22 05:45:22.537236] I [server-helpers.c:757:server_connection_put] 0-georep-server: Shutting down connection rhs-client40.lab.eng.blr.redhat.com-14960-2013/10/22-05:44:20:188621-georep-client-13-0 [2013-10-22 05:45:22.537292] I [server-helpers.c:590:server_log_conn_destroy] 0-georep-server: destroyed connection of rhs-client40.lab.eng.blr.redhat.com-14960-2013/10/22-05:44:20:188621-georep-client-13-0 [2013-10-22 05:45:22.704865] I [server.c:773:server_rpc_notify] 0-georep-server: disconnecting connection from rhs-client39.lab.eng.blr.redhat.com-17748-2013/10/22-05:44:20:426873-georep-client-13-0, Number of pending operations: 1 [2013-10-22 05:45:22.704890] I [server-helpers.c:757:server_connection_put] 0-georep-server: Shutting down connection rhs-client39.lab.eng.blr.redhat.com-17748-2013/10/22-05:44:20:426873-georep-client-13-0 [2013-10-22 05:45:22.704914] I [server-helpers.c:590:server_log_conn_destroy] 0-georep-server: destroyed connection of rhs-client39.lab.eng.blr.redhat.com-17748-2013/10/22-05:44:20:426873-georep-client-13-0 [2013-10-22 05:45:23.518301] I [server-handshake.c:569:server_setvolume] 0-georep-server: accepted client from rhs-client9.lab.eng.blr.redhat.com-22895-2013/10/22-05:45:21:522386-georep-client-13-0 (version: 3.4.0.35rhs) [2013-10-22 05:45:23.603873] I [server.c:773:server_rpc_notify] 0-georep-server: disconnecting connection from rhs-client4.lab.eng.blr.redhat.com-9204-2013/10/22-05:44:16:685359-georep-client-13-0, Number of pending operations: 1 [2013-10-22 05:45:23.685467] I [server-helpers.c:757:server_connection_put] 0-georep-server: Shutting down connection rhs-client4.lab.eng.blr.redhat.com-9204-2013/10/22-05:44:16:685359-georep-client-13-0 [2013-10-22 05:45:23.685504] I [server-helpers.c:590:server_log_conn_destroy] 0-georep-server: destroyed connection of rhs-client4.lab.eng.blr.redhat.com-9204-2013/10/22-05:44:16:685359-georep-client-13-0 [2013-10-22 05:45:23.896776] I [server-handshake.c:569:server_setvolume] 0-georep-server: accepted client from rhs-client40.lab.eng.blr.redhat.com-14983-2013/10/22-05:45:23:528928-georep-client-13-0 (version: 3.4.0.35rhs) [2013-10-22 05:45:24.289256] I [server-handshake.c:569:server_setvolume] 0-georep-server: accepted client from rhs-client39.lab.eng.blr.redhat.com-17773-2013/10/22-05:45:23:836360-georep-client-13-0 (version: 3.4.0.35rhs) [2013-10-22 05:45:25.407579] I [server-handshake.c:569:server_setvolume] 0-georep-server: accepted client from rhs-client4.lab.eng.blr.redhat.com-9292-2013/10/22-05:45:24:603010-georep-client-13-0 (version: 3.4.0.35rhs) [2013-10-22 05:46:04.190203] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed [2013-10-22 05:46:15.039014] I [glusterfsd-mgmt.c:1559:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> geo-rep logs, >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>. [2013-10-22 12:46:32.806102] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------ [2013-10-22 12:46:32.806871] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker [2013-10-22 12:46:32.865411] I [gsyncd(/home/georep10):520:main_i] <top>: syncing: gluster://localhost:georep -> ssh://root.43.174:gluster://localhost:slave [2013-10-22 12:46:32.943844] I [gsyncd(/home/georep1):520:main_i] <top>: syncing: gluster://localhost:georep -> ssh://root.43.76:gluster://localhost:slave [2013-10-22 12:46:34.873538] E [syncdutils(/home/georep4):201:log_raise_exception] <top>: glusterfs session went down [ENOTCONN] [2013-10-22 12:46:34.874373] I [syncdutils(/home/georep4):159:finalize] <top>: exiting. [2013-10-22 12:46:34.890589] E [syncdutils(/home/georep7):201:log_raise_exception] <top>: glusterfs session went down [ENOTCONN] [2013-10-22 12:46:34.891346] I [syncdutils(/home/georep7):159:finalize] <top>: exiting. [2013-10-22 12:46:34.897224] E [resource(/home/georep4):204:errlog] Popen: command "umount -l /tmp/gsyncd-aux-mount-oZvWz9" returned with 1, saying: [2013-10-22 12:46:34.897595] E [resource(/home/georep4):207:logerr] Popen: umount> umount: /tmp/gsyncd-aux-mount-oZvWz9: not mounted [2013-10-22 12:46:34.908623] E [resource(/home/georep7):204:errlog] Popen: command "umount -l /tmp/gsyncd-aux-mount-aTIJHn" returned with 1, saying: [2013-10-22 12:46:34.909269] E [resource(/home/georep7):207:logerr] Popen: umount> umount: /tmp/gsyncd-aux-mount-aTIJHn: not mounted [2013-10-22 12:46:34.922786] E [syncdutils(/home/georep10):201:log_raise_exception] <top>: glusterfs session went down [ENOTCONN] [2013-10-22 12:46:34.923459] I [syncdutils(/home/georep10):159:finalize] <top>: exiting. [2013-10-22 12:46:34.929419] E [syncdutils(/home/georep13):201:log_raise_exception] <top>: glusterfs session went down [ENOTCONN] [2013-10-22 12:46:34.929990] I [syncdutils(/home/georep13):159:finalize] <top>: exiting. [2013-10-22 12:46:34.940953] E [resource(/home/georep10):204:errlog] Popen: command "umount -l /tmp/gsyncd-aux-mount-VnyoY_" returned with 1, saying: [2013-10-22 12:46:34.941675] E [resource(/home/georep10):207:logerr] Popen: umount> umount: /tmp/gsyncd-aux-mount-VnyoY_: not mounted [2013-10-22 12:46:34.949685] E [resource(/home/georep13):204:errlog] Popen: command "umount -l /tmp/gsyncd-aux-mount-rBSAIZ" returned with 1, saying: [2013-10-22 12:46:34.950395] E [resource(/home/georep13):207:logerr] Popen: umount> umount: /tmp/gsyncd-aux-mount-rBSAIZ: not mounted [2013-10-22 12:46:35.21224] E [syncdutils(/home/georep1):201:log_raise_exception] <top>: glusterfs session went down [ENOTCONN] [2013-10-22 12:46:35.22105] I [syncdutils(/home/georep1):159:finalize] <top>: exiting. [2013-10-22 12:46:35.39799] E [resource(/home/georep1):204:errlog] Popen: command "umount -l /tmp/gsyncd-aux-mount-31x8sW" returned with 1, saying: [2013-10-22 12:46:35.40436] E [resource(/home/georep1):207:logerr] Popen: umount> umount: /tmp/gsyncd-aux-mount-31x8sW: not mounted >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Created attachment 814872 [details] sosreport for the machine rhs-client39.lab.eng.blr.redhat.com
Created attachment 814873 [details] sosreport for the machine rhs-client9.lab.eng.blr.redhat.com
volume info , [root@rhs-client9 ~]# gluster v i georep Volume Name: georep Type: Distributed-Replicate Volume ID: 93610fea-3323-4607-9221-41f500c149f7 Status: Started Number of Bricks: 8 x 2 = 16 Transport-type: tcp Bricks: Brick1: rhs-client4.lab.eng.blr.redhat.com:/home/georep0 Brick2: rhs-client9.lab.eng.blr.redhat.com:/home/georep1 Brick3: rhs-client39.lab.eng.blr.redhat.com:/home/georep2 Brick4: rhs-client4.lab.eng.blr.redhat.com:/home/georep3 Brick5: rhs-client9.lab.eng.blr.redhat.com:/home/georep4 Brick6: rhs-client39.lab.eng.blr.redhat.com:/home/georep5 Brick7: rhs-client4.lab.eng.blr.redhat.com:/home/georep6 Brick8: rhs-client9.lab.eng.blr.redhat.com:/home/georep7 Brick9: rhs-client39.lab.eng.blr.redhat.com:/home/georep8 Brick10: rhs-client4.lab.eng.blr.redhat.com:/home/georep9 Brick11: rhs-client9.lab.eng.blr.redhat.com:/home/georep10 Brick12: rhs-client39.lab.eng.blr.redhat.com:/home/georep11 Brick13: rhs-client4.lab.eng.blr.redhat.com:/home/georep12 Brick14: rhs-client9.lab.eng.blr.redhat.com:/home/georep13 Brick15: rhs-client39.lab.eng.blr.redhat.com:/home/georep14 Brick16: rhs-client4.lab.eng.blr.redhat.com:/home/georep15 Options Reconfigured: changelog.changelog: on geo-replication.ignore-pid-check: on geo-replication.indexing: on features.quota: on rest of the sosreports are here - http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1021818/