Bug 1448833
Summary: | [Brick Multiplexing] heal info shows the status of the bricks as "Transport endpoint is not connected" though bricks are up | |||
---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Prasad Desala <tdesala> | |
Component: | core | Assignee: | Atin Mukherjee <amukherj> | |
Status: | CLOSED ERRATA | QA Contact: | Prasad Desala <tdesala> | |
Severity: | high | Docs Contact: | ||
Priority: | unspecified | |||
Version: | rhgs-3.3 | CC: | amukherj, ekuric, nchilaka, rhs-bugs, rkavunga, storage-qa-internal, tdesala | |
Target Milestone: | --- | Keywords: | Reopened | |
Target Release: | RHGS 3.3.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | brick-multiplexing | |||
Fixed In Version: | glusterfs-3.8.4-27 | Doc Type: | If docs needed, set a value | |
Doc Text: | | Story Points: | --- |
Clone Of: | ||||
: | 1454865 | Environment: | |
Last Closed: | 2017-09-21 04:41:45 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 1454865, 1455907, 1455912 | |||
Bug Blocks: | 1260779, 1417151 |
Description
Prasad Desala
2017-05-08 10:55:13 UTC
Hi Prasad, I was not able to reproduce the issue on a 2-node glusterfs 3.8.4-24 setup. When I tried to bring back a killed brick using the old command-line parameters, not all multiplexed bricks came back online and `gluster vol status` also showed them offline, which is when heal info gave ENOTCONN. But if I used `volume start force`, all multiplexed bricks came back up and heal info also showed the directories in split-brain. Can you give me a setup to debug where volume status shows all bricks up but heal info gives ENOTCONN?

Repeated the same scenario on glusterfs version 3.8.4-25.el7rhgs.x86_64 and the issue is not reproduced.

I have hit this issue on my brick-mux setup, which hosts about 50 volumes. It came up after soaking the testbed with different test activities.

[root@dhcp35-45 ~]# gluster v heal cross3-27 info
Brick 10.70.35.45:/rhs/brick27/cross3-27
Status: Transport endpoint is not connected
Number of entries: -

Brick 10.70.35.130:/rhs/brick27/cross3-27
Status: Connected
Number of entries: 0

Brick 10.70.35.122:/rhs/brick27/cross3-27
Status: Connected
Number of entries: 0

Brick b1 is still up and I am also able to do I/O, which gets written to all 3 disks, yet heal info still shows this; hence reopening the BZ.

The glfsheal log shows the following:

[2017-05-20 07:37:49.672672] I [afr.c:94:fix_quorum_options] 0-cross3-27-replicate-0: reindeer: incoming qtype = none
[2017-05-20 07:37:49.672710] I [afr.c:116:fix_quorum_options] 0-cross3-27-replicate-0: reindeer: quorum_count = 2147483647
[2017-05-20 07:37:49.674088] W [MSGID: 101174] [graph.c:360:_log_if_unknown_option] 0-cross3-27-readdir-ahead: option 'parallel-readdir' is not recognized
[2017-05-20 07:37:49.674206] I [MSGID: 104045] [glfs-master.c:91:notify] 0-gfapi: New graph 64686370-3335-2d34-352e-6c61622e656e (0) coming up
[2017-05-20 07:37:49.674239] I [MSGID: 114020] [client.c:2356:notify] 0-cross3-27-client-0: parent translators are ready, attempting connect on transport
[2017-05-20 07:37:49.677415] I [MSGID: 114020] [client.c:2356:notify] 0-cross3-27-client-1: parent translators are ready, attempting connect on transport
[2017-05-20 07:37:49.677806] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-cross3-27-client-0: changing port to 49152 (from 0)
[2017-05-20 07:37:49.680125] I [MSGID: 114020] [client.c:2356:notify] 0-cross3-27-client-2: parent translators are ready, attempting connect on transport
[2017-05-20 07:37:49.682924] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-cross3-27-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-20 07:37:49.683235] W [MSGID: 114043] [client-handshake.c:1104:client_setvolume_cbk] 0-cross3-27-client-0: failed to set the volume [No such file or directory]
[2017-05-20 07:37:49.683268] W [MSGID: 114007] [client-handshake.c:1133:client_setvolume_cbk] 0-cross3-27-client-0: failed to get 'process-uuid' from reply dict [Invalid argument]
[2017-05-20 07:37:49.683283] E [MSGID: 114044] [client-handshake.c:1139:client_setvolume_cbk] 0-cross3-27-client-0: SETVOLUME on remote-host failed [No such file or directory]
[2017-05-20 07:37:49.683294] I [MSGID: 114051] [client-handshake.c:1259:client_setvolume_cbk] 0-cross3-27-client-0: sending CHILD_CONNECTING event
[2017-05-20 07:37:49.683374] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-cross3-27-client-1: changing port to 49152 (from 0)
[2017-05-20 07:37:49.683434] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-cross3-27-client-0: disconnected from cross3-27-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-05-20 07:37:49.686033] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-cross3-27-client-2: changing port to 49152 (from 0)
[2017-05-20 07:37:49.688512] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-cross3-27-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-20 07:37:49.692156] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-cross3-27-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-20 07:37:49.692207] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-cross3-27-client-1: Connected to cross3-27-client-1, attached to remote volume '/rhs/brick27/cross3-27'.
[2017-05-20 07:37:49.692243] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-cross3-27-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-20 07:37:49.692353] I [MSGID: 108005] [afr-common.c:4659:afr_notify] 0-cross3-27-replicate-0: Subvolume 'cross3-27-client-1' came back up; going online.
[2017-05-20 07:37:49.695021] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-cross3-27-client-1: Server lk version = 1
[2017-05-20 07:37:49.695137] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-cross3-27-client-2: Connected to cross3-27-client-2, attached to remote volume '/rhs/brick27/cross3-27'.
[2017-05-20 07:37:49.695195] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-cross3-27-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-20 07:37:49.696395] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-cross3-27-client-2: Server lk version = 1
[2017-05-20 07:37:49.698251] I [MSGID: 104041] [glfs-resolve.c:889:__glfs_active_subvol] 0-cross3-27: switched to graph 64686370-3335-2d34-352e-6c61622e656e (0)
[2017-05-20 07:37:49.698929] W [MSGID: 114031] [client-rpc-fops.c:1102:client3_3_getxattr_cbk] 0-cross3-27-client-0: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001). Key: glusterfs.xattrop_index_gfid [Transport endpoint is not connected]

glusterd log:

[2017-05-20 07:39:41.682664] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-cross3-27-client-0: changing port to 49152 (from 0)
The message "I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-cross3-27-client-0: disconnected from cross3-27-client-0. Client process will keep trying to connect to glusterd until brick's port is available" repeated 4 times between [2017-05-20 07:39:21.667931] and [2017-05-20 07:39:37.682932]
[2017-05-20 07:39:41.685469] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-cross3-27-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
The message "I [MSGID: 114051] [client-handshake.c:1259:client_setvolume_cbk] 0-cross3-27-client-0: sending CHILD_CONNECTING event" repeated 4 times between [2017-05-20 07:39:25.670950] and [2017-05-20 07:39:41.686123]
[2017-05-20 07:39:41.686180] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-cross3-27-client-0: disconnected from cross3-27-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-05-20 07:39:45.686373] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-cross3-27-client-0: changing port to 49152 (from 0)
The message "E [MSGID: 114044] [client-handshake.c:1139:client_setvolume_cbk] 0-cross3-27-client-0: SETVOLUME on remote-host failed [No such file or directory]" repeated 4 times between [2017-05-20 07:39:29.675657] and [2017-05-20 07:39:45.689947]
[2017-05-20 07:39:45.689948] I [MSGID: 114051] [client-handshake.c:1259:client_setvolume_cbk] 0-cross3-27-client-0: sending CHILD_CONNECTING event
[2017-05-20 07:39:49.689889] I [rpc-clnt.c:2001:rpc n2: -client-0: failed to set the volume [No such file or directory]
[2017-05-20 07:40:02.749448] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-cross3-27-client-0: changing port to 49152 (from 0)
The message "I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-cross3-27-client-0: disconnected from cross3-27-client-0. Client process will keep trying to connect to glusterd until brick's port is available" repeated 4 times between [2017-05-20 07:39:42.736270] and [2017-05-20 07:39:58.749905]
[2017-05-20 07:40:02.752539] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-cross3-27-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
The message "I [MSGID: 114051] [client-handshake.c:1259:client_setvolume_cbk] 0-cross3-27-client-0: sending CHILD_CONNECTING event" repeated 4 times between [2017-05-20 07:39:46.739084] and [2017-05-20 07:40:02.752841]
[2017-05-20 07:40:02.752887] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-cross3-27-client-0: disconnected from cross3-27-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-05-20 07:40:06.752460] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-cross3-27-client-0: changing port to 49152 (from 0)
The message "E [MSGID: 114044] [client-handshake.c:1139:client_setvolume_cbk] 0-cross3-27-client-0: SETVOLUME on remote-host failed [No such file or directory]" repeated 4 times between [2017-05-20 07:39:50.743592] and [2017-05-20 07:40:06.756281]
[2017-05-20 07:40:06.756282] I [MSGID: 114051] [client-handshake.c:1259:client_setvolume_cbk] 0-cross3-27-client-0: sending CHILD_CONNECTING event
[2017-05-20 07:40:10.756047] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-cross3-27-client-0: changing port to 49152 (from 0)
The message "W [MSGID: 114007] [client-handshake.c:1133:client_setvolume_cbk] 0-cross3-27-client-0: failed to get 'process-uuid' from reply dict [Invalid argument]" repeated 4 times between [2017-05-20 07:39:54.746353] and [2017-05-20 07:40:10.760008]
[2017-05-20 07:40:10.760012] E [MSGID: 114044] [client-handshake.c:1139:client_setvolume_cbk] 0-cross3-27-client-0: SETVOLUME on remote-host failed [No such file or directory]
[2017-05-20 07:40:14.759215] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-cross3-27-client-0: changing port to 49152 (from 0)
The message "W [MSGID: 114043] [client-handshake.c:1104:client_setvolume_cbk] 0-cross3-27-client-0: failed to set the volume [No such file or directory]" repeated 4 times between [2017-05-20 07:39:58.749772] and [2017-05-20 07:40:14.762507]
[2017-05-20 07:40:14.762512] W [MSGID: 114007] [client-handshake.c:1133:client_setvolume_cbk] 0-cross3-27-client-0: failed to get 'process-uuid' from reply dict [Invalid argument]

RCA: When a volume is stopped, glusterd actually sends two terminate requests to the brick process: one during the brick-op phase and another during the commit phase. Without multiplexing this caused no problem, because the process was going to stop anyway. With multiplexing, however, each terminate is just a detach, and the detach therefore gets executed twice. Those two requests can run at the same time, and if that happens the graph entry may be deleted twice, because no locks are taken during the link modification of the graph in glusterfs_handle_detach. The linked list is then moved twice, which results in the deletion of an independent brick.
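To make the race described in the RCA concrete, below is a minimal, self-contained C sketch. It is hypothetical illustration code, not glusterfs source: the names (struct brick, handle_detach, use_lock) are invented, the attached bricks are modelled as a simple singly linked list, and the two terminate requests are modelled as two threads detaching the same brick. Without a lock around the list manipulation, the second request can advance the same link a second time, so an unrelated brick silently falls out of the list.

```c
/*
 * Hypothetical sketch of the duplicated-detach race (not glusterfs code).
 * Build: gcc -pthread detach_race_sketch.c -o detach_race_sketch
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct brick {
    const char   *name;
    struct brick *next;
};

/* Three bricks multiplexed into one process: brick-3 -> brick-2 -> brick-1 */
static struct brick b1 = { "brick-1", NULL };
static struct brick b2 = { "brick-2", &b1 };
static struct brick b3 = { "brick-3", &b2 };
static struct brick *bricks = &b3;

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static const int use_lock = 0;    /* 0 reproduces the bug, 1 mimics serializing */

struct detach_req {
    const char *name;    /* brick to detach                              */
    useconds_t  delay;   /* artificial delay to make the race observable */
};

/* Handle one terminate/detach request for a brick. */
static void *handle_detach(void *arg)
{
    struct detach_req *req = arg;

    if (use_lock)
        pthread_mutex_lock(&list_lock);

    /* Find the link that points at the brick being detached. */
    struct brick **pp = &bricks;
    while (*pp && strcmp((*pp)->name, req->name) != 0)
        pp = &(*pp)->next;

    if (*pp) {
        usleep(req->delay);  /* without the lock, both requests get here     */
        *pp = (*pp)->next;   /* unlink: a second concurrent request moves the
                                same link again and drops the brick after it */
    }

    if (use_lock)
        pthread_mutex_unlock(&list_lock);
    return NULL;
}

int main(void)
{
    /* Volume stop sends two terminate requests for the same brick:
     * one in the brick-op phase and one in the commit phase. */
    struct detach_req r1 = { "brick-3", 1000 };
    struct detach_req r2 = { "brick-3", 3000 };
    pthread_t t1, t2;

    pthread_create(&t1, NULL, handle_detach, &r1);
    pthread_create(&t2, NULL, handle_detach, &r2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* With use_lock = 0 this typically prints only "brick-1": brick-2, an
     * independent brick, was deleted by the duplicated detach. With
     * use_lock = 1 the second request no longer finds brick-3 and the
     * remaining bricks stay intact. */
    printf("bricks still attached:");
    for (struct brick *b = bricks; b; b = b->next)
        printf(" %s", b->name);
    printf("\n");
    return 0;
}
```

The use_lock flag only mimics the general idea of serializing the two detach requests for this demo; the actual change made in glusterfs is in the patches referenced below.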
Upstream patch: https://review.gluster.org/#/c/17374/
Downstream patch: https://code.engineering.redhat.com/gerrit/107687

Verified this BZ on glusterfs version 3.8.4-27.el7rhgs.x86_64. Followed the same steps as in the description; with the bricks up and running, heal info shows the brick status as Connected. Repeated the same test 3 times and did not see "Transport endpoint is not connected" as the status in the heal info output. Moving this BZ to Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774