Bug 1463180

Summary: Rebalance failed on an EC volume, seeing CHILD_DOWN message in rebalance logs
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Prasad Desala <tdesala>
Component: core
Assignee: Mohit Agrawal <moagrawa>
Status: CLOSED WORKSFORME
QA Contact: Rahul Hinduja <rhinduja>
Severity: high
Priority: unspecified
Version: rhgs-3.3
CC: amukherj, nbalacha, nchilaka, rhs-bugs, storage-qa-internal, tdesala
Whiteboard: brick-multiplexing
Last Closed: 2017-06-23 07:31:29 UTC
Type: Bug

Description Prasad Desala 2017-06-20 10:30:30 UTC
Description of problem:
=======================
Rebalance failed on an EC volume. The rebalance log shows a CHILD_DOWN message.

Version-Release number of selected component (if applicable):
3.8.4-28.el7rhgs.x86_64

How reproducible:
Reported at first occurrence

Steps to Reproduce:
===================
1) Create a 1 x (4+2) EC volume and start it.
2) Enable brick multiplexing ("cluster.brick-multiplex") and turn on "cluster.lookup-optimize".
3) Add a few bricks to the volume.
4) Trigger rebalance.

Note: No data is present on the volume.
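
For anyone trying to reproduce this, the steps above map roughly onto the gluster CLI calls sketched below. This is only a sketch: the host names (server1..server6) and brick paths are hypothetical placeholders and not the ones used in this setup, and the helper name build_repro_commands is made up for illustration. It assumes six bricks for the initial 4+2 set and a multiple of six for the add-brick step, and that brick multiplexing is set cluster-wide via "all".

```python
import shlex


def build_repro_commands(volume, initial_bricks, extra_bricks):
    """Return the gluster CLI invocations for the four steps as argv lists.

    Brick specs are "host:/path" strings supplied by the caller; nothing
    here is executed, the commands are only assembled.
    """
    if len(initial_bricks) != 6:
        raise ValueError("a 1 x (4+2) volume needs exactly 6 bricks")
    if not extra_bricks or len(extra_bricks) % 6 != 0:
        raise ValueError("add-brick on a 4+2 volume needs a multiple of 6 bricks")
    return [
        # Step 1: 4 data + 2 redundancy bricks, then start the volume.
        ["gluster", "volume", "create", volume,
         "disperse", "6", "redundancy", "2", *initial_bricks],
        ["gluster", "volume", "start", volume],
        # Step 2: brick multiplexing is a cluster-wide option, hence "all".
        ["gluster", "volume", "set", "all", "cluster.brick-multiplex", "on"],
        ["gluster", "volume", "set", volume, "cluster.lookup-optimize", "on"],
        # Step 3: add another 4+2 set of bricks.
        ["gluster", "volume", "add-brick", volume, *extra_bricks],
        # Step 4: start rebalance and check its status.
        ["gluster", "volume", "rebalance", volume, "start"],
        ["gluster", "volume", "rebalance", volume, "status"],
    ]


if __name__ == "__main__":
    # Hypothetical hosts and paths, for illustration only.
    first = [f"server{i}:/bricks/brick1/b1" for i in range(1, 7)]
    extra = [f"server{i}:/bricks/brick2/b2" for i in range(1, 7)]
    for argv in build_repro_commands("ec", first, extra):
        print(shlex.join(argv))
```

The script only prints the commands so they can be reviewed before being run by hand on a test cluster.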

[root@dhcp43-49 smbd]# gluster v rebalance ec status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             1             0               failed        0:00:01
                             10.70.43.41                0        0Bytes             0             1             0               failed        0:00:00
                             10.70.43.35                0        0Bytes             0             1             0               failed        0:00:00
                             10.70.43.37                0        0Bytes             0             1             0               failed        0:00:00
                             10.70.43.31                0        0Bytes             0             1             0               failed        0:00:00
volume rebalance: ec: success


Rebalance output snippet:
=========================
[2017-06-20 06:22:49.045790] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-ec-client-10: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-06-20 06:22:49.047371] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-ec-client-10: Connected to ec-client-10, attached to remote volume '/bricks/brick7/b7'.
[2017-06-20 06:22:49.047426] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-ec-client-10: Server and Client lk-version numbers are not same, reopening the fds
[2017-06-20 06:22:49.047701] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-ec-client-10: Server lk version = 1
[2017-06-20 06:22:53.947560] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ec-client-1: changing port to 49154 (from 0)
[2017-06-20 06:22:53.949112] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ec-client-6: changing port to 49153 (from 0)
[2017-06-20 06:22:53.954416] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ec-client-11: changing port to 49153 (from 0)
[2017-06-20 06:22:53.961053] E [socket.c:2360:socket_connect_finish] 0-ec-client-1: connection to 10.70.43.41:49154 failed (Connection refused); disconnecting socket
[2017-06-20 06:22:53.961226] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ec-client-2: changing port to 49154 (from 0)
[2017-06-20 06:22:53.966897] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ec-client-7: changing port to 49154 (from 0)
[2017-06-20 06:22:53.967105] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-ec-client-6: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-06-20 06:22:53.973051] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-ec-client-6: Connected to ec-client-6, attached to remote volume '/bricks/brick6/b6'.
[2017-06-20 06:22:53.973103] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-ec-client-6: Server and Client lk-version numbers are not same, reopening the fds
[2017-06-20 06:22:53.973847] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-ec-client-11: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-06-20 06:22:53.974102] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-ec-client-6: Server lk version = 1
[2017-06-20 06:22:53.974774] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-ec-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-06-20 06:22:53.975806] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-ec-client-11: Connected to ec-client-11, attached to remote volume '/bricks/brick7/b7'.
[2017-06-20 06:22:53.975861] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-ec-client-11: Server and Client lk-version numbers are not same, reopening the fds
[2017-06-20 06:22:53.976668] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-ec-client-11: Server lk version = 1
[2017-06-20 06:22:53.977719] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-ec-client-2: Connected to ec-client-2, attached to remote volume '/bricks/brick5/b5'.
[2017-06-20 06:22:53.977790] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-ec-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2017-06-20 06:22:53.978806] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-ec-client-2: Server lk version = 1
[2017-06-20 06:22:53.980042] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ec-client-4: changing port to 49154 (from 0)
[2017-06-20 06:22:53.980980] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-ec-client-7: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-06-20 06:22:53.981756] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ec-client-9: changing port to 49153 (from 0)
[2017-06-20 06:22:53.991470] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-ec-client-7: Connected to ec-client-7, attached to remote volume '/bricks/brick6/b6'.
[2017-06-20 06:22:53.991537] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-ec-client-7: Server and Client lk-version numbers are not same, reopening the fds
[2017-06-20 06:22:53.991843] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ec-client-3: changing port to 49154 (from 0)
[2017-06-20 06:22:53.992354] E [socket.c:2360:socket_connect_finish] 0-ec-client-4: connection to 10.70.43.31:49154 failed (Connection refused); disconnecting socket
[2017-06-20 06:22:54.001538] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ec-client-8: changing port to 49153 (from 0)
[2017-06-20 06:22:54.001802] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-ec-client-7: Server lk version = 1
[2017-06-20 06:22:54.008582] E [socket.c:2360:socket_connect_finish] 0-ec-client-3: connection to 10.70.43.37:49154 failed (Connection refused); disconnecting socket
[2017-06-20 06:22:54.009229] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-ec-client-9: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-06-20 06:22:54.013037] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-ec-client-9: Connected to ec-client-9, attached to remote volume '/bricks/brick6/b6'.
[2017-06-20 06:22:54.013174] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-ec-client-9: Server and Client lk-version numbers are not same, reopening the fds
[2017-06-20 06:22:54.014104] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-ec-client-9: Server lk version = 1
[2017-06-20 06:22:54.018910] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-ec-client-8: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-06-20 06:22:54.022303] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-ec-client-8: Connected to ec-client-8, attached to remote volume '/bricks/brick6/b6'.
[2017-06-20 06:22:54.022360] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-ec-client-8: Server and Client lk-version numbers are not same, reopening the fds
[2017-06-20 06:22:54.022621] I [MSGID: 122061] [ec.c:323:ec_up] 0-ec-disperse-1: Going UP
[2017-06-20 06:22:54.023356] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-ec-client-8: Server lk version = 1
[2017-06-20 06:22:57.955446] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ec-client-1: changing port to 49154 (from 0)
[2017-06-20 06:22:57.961346] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ec-client-4: changing port to 49154 (from 0)
[2017-06-20 06:22:57.964976] E [socket.c:2360:socket_connect_finish] 0-ec-client-1: connection to 10.70.43.41:49154 failed (Connection refused); disconnecting socket
[2017-06-20 06:22:57.972650] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ec-client-3: changing port to 49154 (from 0)
[2017-06-20 06:22:57.977445] E [socket.c:2360:socket_connect_finish] 0-ec-client-4: connection to 10.70.43.31:49154 failed (Connection refused); disconnecting socket
[2017-06-20 06:22:57.983485] E [socket.c:2360:socket_connect_finish] 0-ec-client-3: connection to 10.70.43.37:49154 failed (Connection refused); disconnecting socket
[2017-06-20 06:22:58.970616] W [MSGID: 109073] [dht-common.c:9185:dht_notify] 0-ec-dht: Received CHILD_DOWN. Exiting
[2017-06-20 06:22:58.997791] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-ec-dht: Found anomalies in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=1 overlaps=0
[2017-06-20 06:22:58.997916] W [MSGID: 109005] [dht-selfheal.c:2111:dht_selfheal_directory] 0-ec-dht: Directory selfheal failed: 1 subvolumes down.Not fixing. path = /, gfid = 
[2017-06-20 06:22:58.998380] W [MSGID: 109075] [dht-diskusage.c:44:dht_du_info_cbk] 0-ec-dht: failed to get disk info from ec-disperse-0 [Transport endpoint is not connected]
[2017-06-20 06:22:58.999420] I [dht-rebalance.c:4211:gf_defrag_start_crawl] 0-ec-dht: gf_defrag_start_crawl using commit hash 3393583516
[2017-06-20 06:22:59.006385] I [MSGID: 109081] [dht-common.c:4258:dht_setxattr] 0-ec-dht: fixing the layout of /
[2017-06-20 06:22:59.006473] W [MSGID: 109016] [dht-selfheal.c:1738:dht_fix_layout_of_directory] 0-ec-dht: Layout fix failed: 1 subvolume(s) are down. Skipping fix layout.
[2017-06-20 06:22:59.007270] E [MSGID: 109026] [dht-rebalance.c:4253:gf_defrag_start_crawl] 0-ec-dht: fix layout on / failed
[2017-06-20 06:22:59.008677] I [MSGID: 109028] [dht-rebalance.c:4713:gf_defrag_status_get] 0-ec-dht: Rebalance is failed. Time taken is 1.00 secs
[2017-06-20 06:22:59.008719] I [MSGID: 109028] [dht-rebalance.c:4717:gf_defrag_status_get] 0-ec-dht: Files migrated: 0, size: 0, lookups: 0, failures: 1, skipped: 0
[2017-06-20 06:22:59.009391] W [glusterfsd.c:1290:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7e25) [0x7f9c0af09e25] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x561f5dcfb005] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x561f5dcfae2b] ) 0-: received signum (15), shutting down
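
To make the failing connections easier to spot in a log like the one above, a small triage script along these lines may help. It is a rough sketch, not part of gluster: the function name summarize and both regular expressions are assumptions derived only from the line formats visible in this snippet, so they may need adjusting for other log versions.

```python
import re
from collections import Counter

# Matches lines such as:
#   0-ec-client-1: connection to 10.70.43.41:49154 failed (Connection refused)
REFUSED_RE = re.compile(
    r"0-(?P<client>[\w-]+): connection to (?P<addr>[\d.]+:\d+) "
    r"failed \(Connection refused\)"
)
# Matches lines such as:
#   0-ec-dht: Received CHILD_DOWN. Exiting
CHILD_DOWN_RE = re.compile(r"0-(?P<xlator>[\w-]+): Received CHILD_DOWN")


def summarize(log_lines):
    """Count refused connections per (client, address) and list CHILD_DOWN receivers."""
    refused = Counter()
    child_down = []
    for line in log_lines:
        match = REFUSED_RE.search(line)
        if match:
            refused[(match.group("client"), match.group("addr"))] += 1
            continue
        match = CHILD_DOWN_RE.search(line)
        if match:
            child_down.append(match.group("xlator"))
    return refused, child_down


if __name__ == "__main__":
    import sys

    # Usage: python triage.py < rebalance.log
    refused, child_down = summarize(sys.stdin)
    for (client, addr), count in sorted(refused.items()):
        print(f"{client} -> {addr}: refused {count} time(s)")
    for xlator in child_down:
        print(f"CHILD_DOWN received by {xlator}")
```

In the snippet above, the refused connections are for ec-client-1, ec-client-3 and ec-client-4, all on port 49154, followed by the CHILD_DOWN on ec-dht.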

Actual results:
================
Rebalance failed.

Expected results:
=================
Rebalance should complete without any failures/issues.