Bug 1021818 - glusterfsd processes got killed while doing geo-rep and rebalance in parallel with large data set
Summary: glusterfsd processes got killed while doing geo-rep and rebalance in parallel...
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: 2.1
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: ---
: ---
Assignee: Nithya Balachandran
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1286189
TreeView+ depends on / blocked
 
Reported: 2013-10-22 07:22 UTC by Vijaykumar Koppad
Modified: 2015-11-27 12:21 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1286189 (view as bug list)
Environment:
Last Closed: 2015-11-27 12:18:46 UTC
Embargoed:


Attachments (Terms of Use)
sosreport for the machine rhs-client39.lab.eng.blr.redhat.com (16.25 MB, application/x-xz)
2013-10-22 07:27 UTC, Vijaykumar Koppad
no flags Details
sosreport for the machine rhs-client9.lab.eng.blr.redhat.com (15.92 MB, application/x-xz)
2013-10-22 07:30 UTC, Vijaykumar Koppad
no flags Details

Description Vijaykumar Koppad 2013-10-22 07:22:10 UTC
Description of problem: rebalance destination glusterfsd processes got killed while doing geo-rep and rebalance in parallel with large data set, which resulted in geo-rep status to be faulty. 


Version-Release number of selected component (if applicable):glusterfs-3.4.0.35rhs-1.el6rhs.x86_64


How reproducible: Didn't try to reproduce


Steps to Reproduce:
1.Create dist-rep volume (6x2)
2.Create a large set of data like 10*10*10000 of 10KB files. 
10 top level directories , in each dir , 10 dir . In each of those 
10 dirs , 10000 files of 10KB size 
3.Add 2 more bricks and start rebalancing 
4.Parallelly start geo-rep session with this volume as master, to some slave  

Actual results: rebalance destination glusterfsd processes got killed.


Expected results: rebalance and geo-rep should work properly. 


Additional info:

rebalance logs
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
13-10-22 06:20:53.593735] I [client.c:2103:client_rpc_notify] 0-georep-client-12: disconnected from 10.70.36.28:49192. Client process will keep trying to connect to glusterd until brick's port is available. 
[2013-10-22 06:20:53.593767] E [afr-common.c:3832:afr_notify] 0-georep-replicate-6: All subvolumes are down. Going offline until atleast one of them comes back up.
[2013-10-22 06:20:53.593826] W [dht-common.c:5131:dht_notify] 0-georep-dht: Received CHILD_DOWN. Exiting
[2013-10-22 06:20:53.593850] I [dht-rebalance.c:1788:gf_defrag_stop] 0-: Received stop command on rebalance
[2013-10-22 06:20:53.797940] I [rpc-clnt.c:1687:rpc_clnt_reconfig] 0-georep-client-13: changing port to 49194 (from 0)
[2013-10-22 06:20:53.804138] E [socket.c:2158:socket_connect_finish] 0-georep-client-13: connection to 10.70.36.33:49194 failed (Connection refused)
[2013-10-22 06:20:54.804353] I [rpc-clnt.c:1687:rpc_clnt_reconfig] 0-georep-client-15: changing port to 49193 (from 0)
[2013-10-22 06:20:54.808487] E [socket.c:2158:socket_connect_finish] 0-georep-client-15: connection to 10.70.36.28:49193 failed (Connection refused)
[2013-10-22 06:20:56.810913] I [rpc-clnt.c:1687:rpc_clnt_reconfig] 0-georep-client-13: changing port to 49194 (from 0)
[2013-10-22 06:20:56.817386] E [socket.c:2158:socket_connect_finish] 0-georep-client-13: connection to 10.70.36.33:49194 failed (Connection refused)
[2013-10-22 06:20:57.817340] I [rpc-clnt.c:1687:rpc_clnt_reconfig] 0-georep-client-15: changing port to 49193 (from 0)
[2013-10-22 06:20:57.823627] E [socket.c:2158:socket_connect_finish] 0-georep-client-15: connection to 10.70.36.28:49193 failed (Connection refused)
[2013-10-22 06:21:00.823870] I [rpc-clnt.c:1687:rpc_clnt_reconfig] 0-georep-client-13: changing port to 49194 (from 0)
[2013-10-22 06:21:00.830281] E [socket.c:2158:socket_connect_finish] 0-georep-client-13: connection to 10.70.36.33:49194 failed (Connection refused)
[2013-10-22 06:21:01.829839] I [rpc-clnt.c:1687:rpc_clnt_reconfig] 0-georep-client-15: changing port to 49193 (from 0)
[2013-10-22 06:21:01.835901] E [socket.c:2158:socket_connect_finish] 0-georep-client-15: connection to 10.70.36.28:49193 failed (Connection refused)
[2013-10-22 06:21:03.957570] I [rpc-clnt.c:1687:rpc_clnt_reconfig] 0-georep-client-12: changing port to 49192 (from 0)
[2013-10-22 06:21:03.964102] E [socket.c:2158:socket_connect_finish] 0-georep-client-12: connection to 10.70.36.28:49192 failed (Connection refused)
[2013-10-22 06:21:04.049124] I [dht-rebalance.c:881:dht_migrate_file] 0-georep-dht: completed migration of /level01/level11/level21/level31/526117ef%%03LOS9RPN4 from subvolume georep-replicate-2 to georep-replicate-1
[2013-10-22 06:21:04.051816] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-georep-dht: Fix layout failed for /level01/level11/level21/level31
[2013-10-22 06:21:04.052859] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-georep-dht: Fix layout failed for /level01/level11/level21
[2013-10-22 06:21:04.070432] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-georep-dht: Fix layout failed for /level01/level11
[2013-10-22 06:21:04.135330] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-georep-dht: Fix layout failed for /level01
[2013-10-22 06:21:04.146217] I [dht-rebalance.c:1766:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 63769.00 secs
[2013-10-22 06:21:04.146241] I [dht-rebalance.c:1769:gf_defrag_status_get] 0-glusterfs: Files migrated: 33958, size: 434089984, lookups: 157076, failures: 4, skipped: 527
[2013-10-22 06:21:04.167697] W [glusterfsd.c:1097:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x35a68e894d] (-->/lib64/libpthread.so.0() [0x35a7007851] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x4053cd]))) 0-: received signum (15), shutting down
(END) 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


one of the killed brick logs


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2013-10-22 05:45:19.990706] I [server.c:773:server_rpc_notify] 0-georep-server: disconnecting connection from rhs-client9.lab.eng.blr.redhat.com-22546-2013/10/22-05:17:17:434028-georep-client-13-0, Number of pending operations: 1
[2013-10-22 05:45:20.169203] I [server-helpers.c:757:server_connection_put] 0-georep-server: Shutting down connection rhs-client9.lab.eng.blr.redhat.com-22546-2013/10/22-05:17:17:434028-georep-client-13-0
[2013-10-22 05:45:20.633103] I [server-helpers.c:590:server_log_conn_destroy] 0-georep-server: destroyed connection of rhs-client9.lab.eng.blr.redhat.com-22546-2013/10/22-05:17:17:434028-georep-client-13-0  
[2013-10-22 05:45:20.633144] W [socket.c:522:__socket_rwv] 0-georep-quota: readv on /tmp/quotad.socket failed (No data available)
[2013-10-22 05:45:21.325010] I [glusterfsd-mgmt.c:1559:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2013-10-22 05:45:22.537176] I [server.c:773:server_rpc_notify] 0-georep-server: disconnecting connection from rhs-client40.lab.eng.blr.redhat.com-14960-2013/10/22-05:44:20:188621-georep-client-13-0, Number of pending operations: 1
[2013-10-22 05:45:22.537236] I [server-helpers.c:757:server_connection_put] 0-georep-server: Shutting down connection rhs-client40.lab.eng.blr.redhat.com-14960-2013/10/22-05:44:20:188621-georep-client-13-0
[2013-10-22 05:45:22.537292] I [server-helpers.c:590:server_log_conn_destroy] 0-georep-server: destroyed connection of rhs-client40.lab.eng.blr.redhat.com-14960-2013/10/22-05:44:20:188621-georep-client-13-0  
[2013-10-22 05:45:22.704865] I [server.c:773:server_rpc_notify] 0-georep-server: disconnecting connection from rhs-client39.lab.eng.blr.redhat.com-17748-2013/10/22-05:44:20:426873-georep-client-13-0, Number of pending operations: 1
[2013-10-22 05:45:22.704890] I [server-helpers.c:757:server_connection_put] 0-georep-server: Shutting down connection rhs-client39.lab.eng.blr.redhat.com-17748-2013/10/22-05:44:20:426873-georep-client-13-0
[2013-10-22 05:45:22.704914] I [server-helpers.c:590:server_log_conn_destroy] 0-georep-server: destroyed connection of rhs-client39.lab.eng.blr.redhat.com-17748-2013/10/22-05:44:20:426873-georep-client-13-0  
[2013-10-22 05:45:23.518301] I [server-handshake.c:569:server_setvolume] 0-georep-server: accepted client from rhs-client9.lab.eng.blr.redhat.com-22895-2013/10/22-05:45:21:522386-georep-client-13-0 (version: 3.4.0.35rhs)
[2013-10-22 05:45:23.603873] I [server.c:773:server_rpc_notify] 0-georep-server: disconnecting connection from rhs-client4.lab.eng.blr.redhat.com-9204-2013/10/22-05:44:16:685359-georep-client-13-0, Number of pending operations: 1
[2013-10-22 05:45:23.685467] I [server-helpers.c:757:server_connection_put] 0-georep-server: Shutting down connection rhs-client4.lab.eng.blr.redhat.com-9204-2013/10/22-05:44:16:685359-georep-client-13-0
[2013-10-22 05:45:23.685504] I [server-helpers.c:590:server_log_conn_destroy] 0-georep-server: destroyed connection of rhs-client4.lab.eng.blr.redhat.com-9204-2013/10/22-05:44:16:685359-georep-client-13-0  
[2013-10-22 05:45:23.896776] I [server-handshake.c:569:server_setvolume] 0-georep-server: accepted client from rhs-client40.lab.eng.blr.redhat.com-14983-2013/10/22-05:45:23:528928-georep-client-13-0 (version: 3.4.0.35rhs)
[2013-10-22 05:45:24.289256] I [server-handshake.c:569:server_setvolume] 0-georep-server: accepted client from rhs-client39.lab.eng.blr.redhat.com-17773-2013/10/22-05:45:23:836360-georep-client-13-0 (version: 3.4.0.35rhs)
[2013-10-22 05:45:25.407579] I [server-handshake.c:569:server_setvolume] 0-georep-server: accepted client from rhs-client4.lab.eng.blr.redhat.com-9292-2013/10/22-05:45:24:603010-georep-client-13-0 (version: 3.4.0.35rhs)
[2013-10-22 05:46:04.190203] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2013-10-22 05:46:15.039014] I [glusterfsd-mgmt.c:1559:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


geo-rep logs,


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>.
[2013-10-22 12:46:32.806102] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-10-22 12:46:32.806871] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2013-10-22 12:46:32.865411] I [gsyncd(/home/georep10):520:main_i] <top>: syncing: gluster://localhost:georep -> ssh://root.43.174:gluster://localhost:slave
[2013-10-22 12:46:32.943844] I [gsyncd(/home/georep1):520:main_i] <top>: syncing: gluster://localhost:georep -> ssh://root.43.76:gluster://localhost:slave
[2013-10-22 12:46:34.873538] E [syncdutils(/home/georep4):201:log_raise_exception] <top>: glusterfs session went down [ENOTCONN]
[2013-10-22 12:46:34.874373] I [syncdutils(/home/georep4):159:finalize] <top>: exiting.
[2013-10-22 12:46:34.890589] E [syncdutils(/home/georep7):201:log_raise_exception] <top>: glusterfs session went down [ENOTCONN]
[2013-10-22 12:46:34.891346] I [syncdutils(/home/georep7):159:finalize] <top>: exiting.
[2013-10-22 12:46:34.897224] E [resource(/home/georep4):204:errlog] Popen: command "umount -l /tmp/gsyncd-aux-mount-oZvWz9" returned with 1, saying:
[2013-10-22 12:46:34.897595] E [resource(/home/georep4):207:logerr] Popen: umount> umount: /tmp/gsyncd-aux-mount-oZvWz9: not mounted
[2013-10-22 12:46:34.908623] E [resource(/home/georep7):204:errlog] Popen: command "umount -l /tmp/gsyncd-aux-mount-aTIJHn" returned with 1, saying:
[2013-10-22 12:46:34.909269] E [resource(/home/georep7):207:logerr] Popen: umount> umount: /tmp/gsyncd-aux-mount-aTIJHn: not mounted
[2013-10-22 12:46:34.922786] E [syncdutils(/home/georep10):201:log_raise_exception] <top>: glusterfs session went down [ENOTCONN]
[2013-10-22 12:46:34.923459] I [syncdutils(/home/georep10):159:finalize] <top>: exiting.
[2013-10-22 12:46:34.929419] E [syncdutils(/home/georep13):201:log_raise_exception] <top>: glusterfs session went down [ENOTCONN]
[2013-10-22 12:46:34.929990] I [syncdutils(/home/georep13):159:finalize] <top>: exiting.
[2013-10-22 12:46:34.940953] E [resource(/home/georep10):204:errlog] Popen: command "umount -l /tmp/gsyncd-aux-mount-VnyoY_" returned with 1, saying:
[2013-10-22 12:46:34.941675] E [resource(/home/georep10):207:logerr] Popen: umount> umount: /tmp/gsyncd-aux-mount-VnyoY_: not mounted
[2013-10-22 12:46:34.949685] E [resource(/home/georep13):204:errlog] Popen: command "umount -l /tmp/gsyncd-aux-mount-rBSAIZ" returned with 1, saying:
[2013-10-22 12:46:34.950395] E [resource(/home/georep13):207:logerr] Popen: umount> umount: /tmp/gsyncd-aux-mount-rBSAIZ: not mounted
[2013-10-22 12:46:35.21224] E [syncdutils(/home/georep1):201:log_raise_exception] <top>: glusterfs session went down [ENOTCONN]
[2013-10-22 12:46:35.22105] I [syncdutils(/home/georep1):159:finalize] <top>: exiting.
[2013-10-22 12:46:35.39799] E [resource(/home/georep1):204:errlog] Popen: command "umount -l /tmp/gsyncd-aux-mount-31x8sW" returned with 1, saying:
[2013-10-22 12:46:35.40436] E [resource(/home/georep1):207:logerr] Popen: umount> umount: /tmp/gsyncd-aux-mount-31x8sW: not mounted
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Comment 2 Vijaykumar Koppad 2013-10-22 07:27:53 UTC
Created attachment 814872 [details]
sosreport for the machine rhs-client39.lab.eng.blr.redhat.com

Comment 3 Vijaykumar Koppad 2013-10-22 07:30:51 UTC
Created attachment 814873 [details]
sosreport for the machine rhs-client9.lab.eng.blr.redhat.com

Comment 4 Vijaykumar Koppad 2013-10-22 07:35:35 UTC
volume info , 
[root@rhs-client9 ~]# gluster v i georep
 
Volume Name: georep
Type: Distributed-Replicate
Volume ID: 93610fea-3323-4607-9221-41f500c149f7
Status: Started
Number of Bricks: 8 x 2 = 16
Transport-type: tcp
Bricks:
Brick1: rhs-client4.lab.eng.blr.redhat.com:/home/georep0
Brick2: rhs-client9.lab.eng.blr.redhat.com:/home/georep1
Brick3: rhs-client39.lab.eng.blr.redhat.com:/home/georep2
Brick4: rhs-client4.lab.eng.blr.redhat.com:/home/georep3
Brick5: rhs-client9.lab.eng.blr.redhat.com:/home/georep4
Brick6: rhs-client39.lab.eng.blr.redhat.com:/home/georep5
Brick7: rhs-client4.lab.eng.blr.redhat.com:/home/georep6
Brick8: rhs-client9.lab.eng.blr.redhat.com:/home/georep7
Brick9: rhs-client39.lab.eng.blr.redhat.com:/home/georep8
Brick10: rhs-client4.lab.eng.blr.redhat.com:/home/georep9
Brick11: rhs-client9.lab.eng.blr.redhat.com:/home/georep10
Brick12: rhs-client39.lab.eng.blr.redhat.com:/home/georep11
Brick13: rhs-client4.lab.eng.blr.redhat.com:/home/georep12
Brick14: rhs-client9.lab.eng.blr.redhat.com:/home/georep13
Brick15: rhs-client39.lab.eng.blr.redhat.com:/home/georep14
Brick16: rhs-client4.lab.eng.blr.redhat.com:/home/georep15
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
features.quota: on


rest of the sosreports are here - http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1021818/


Note You need to log in before you can comment on or make changes to this bug.