Bug 1448833 - [Brick Multiplexing] heal info shows the status of the bricks as "Transport endpoint is not connected" though bricks are up
Summary: [Brick Multiplexing] heal info shows the status of the bricks as "Transport endpoint is not connected" though bricks are up
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: core
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Atin Mukherjee
QA Contact: Prasad Desala
URL:
Whiteboard: brick-multiplexing
Depends On: 1454865 1455907 1455912
Blocks: 1260779 1417151
 
Reported: 2017-05-08 10:55 UTC by Prasad Desala
Modified: 2018-07-13 11:40 UTC (History)
CC List: 7 users

Fixed In Version: glusterfs-3.8.4-27
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1454865 (view as bug list)
Environment:
Last Closed: 2017-09-21 04:41:45 UTC
Embargoed:


Attachments


Links:
System ID:    Red Hat Product Errata RHBA-2017:2774
Private:      0
Priority:     normal
Status:       SHIPPED_LIVE
Summary:      glusterfs bug fix and enhancement update
Last Updated: 2017-09-21 08:16:29 UTC

Description Prasad Desala 2017-05-08 10:55:13 UTC
Description of problem:
=======================
The heal info command output shows the status of the bricks as "Transport endpoint is not connected" even though the bricks are up and running.

Version-Release number of selected component (if applicable):
3.8.4-24.el7rhgs.x86_64

How reproducible:
=================
always

Steps to Reproduce:
===================
1) Create a Distributed-Replicate volume and enable brick mux on it.
2) Start the volume and FUSE mount it on a client.
3) Set cluster.self-heal-daemon to off.
4) Create 10 directories on the mount point.
5) Kill one brick of one of the replica sets in the volume and modify the permissions of all directories.
6) Start volume with force option.
7) Kill the other brick in the same replica set and modify the permissions of the directories again.
8) Start the volume with the force option, then examine the output of the `gluster volume heal <vol-name> info` command on the server (see the sketch below).
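
A rough shell sketch of the steps above; the volume name, hostnames, brick paths and mount point are placeholders rather than values from the original setup. Note that with multiplexing enabled, killing a brick PID may also take down other bricks hosted by the same process on that node.

# 1-2) create a 2x3 distributed-replicate volume, enable brick mux, start and mount it
gluster volume set all cluster.brick-multiplex on
gluster volume create testvol replica 3 \
    server1:/bricks/b1/testvol server2:/bricks/b1/testvol server3:/bricks/b1/testvol \
    server1:/bricks/b2/testvol server2:/bricks/b2/testvol server3:/bricks/b2/testvol
gluster volume start testvol
mount -t glusterfs server1:/testvol /mnt/testvol

# 3-4) turn off the self-heal daemon and create 10 directories
gluster volume set testvol cluster.self-heal-daemon off
mkdir /mnt/testvol/dir{1..10}

# 5-6) kill one brick of a replica set (PID from 'gluster volume status testvol'),
#      change permissions, then bring the bricks back with start force
kill -9 <pid-of-server1-brick>
chmod 555 /mnt/testvol/dir*
gluster volume start testvol force

# 7-8) repeat with the other brick of the same replica set, then check heal info
kill -9 <pid-of-server2-brick>
chmod 700 /mnt/testvol/dir*
gluster volume start testvol force
gluster volume heal testvol info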

Actual results:
===============
The heal info command output shows the status of the bricks as "Transport endpoint is not connected" even though the bricks are up and running.

Expected results:
=================
The heal info command should report the status of all the bricks as Connected, since all the bricks are up and running.

Tested the same scenario with brick mux disabled; in that case the heal info command output shows the status of the bricks as Connected.

Comment 3 Ravishankar N 2017-05-09 09:20:12 UTC
Hi Prasad, I was not able to reproduce the issue on a 2-node glusterfs 3.8.4-24 setup. When I tried to bring back a killed brick using the old command-line parameters, not all multiplexed bricks came back online and `gluster vol status` also showed them offline, which is when heal info gave ENOTCONN. But if I used `volume start force`, all multiplexed bricks came back up and heal info showed the directories in split-brain.

Can you give me a setup to debug where volume status shows all bricks up but heal info gives ENOTCONN?
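
For reference, a minimal sketch of the restart path that worked above (the volume name is a placeholder): under brick multiplexing, bring killed bricks back through the CLI rather than by re-running the old brick command line, then confirm the state:

gluster volume start <volname> force   # restarts/reattaches the killed bricks
gluster volume status <volname>        # all bricks should show Online "Y"
gluster volume heal <volname> info     # every brick should report "Status: Connected"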

Comment 5 Prasad Desala 2017-05-18 15:01:13 UTC
Repeated the same scenario on glusterfs version 3.8.4-25.el7rhgs.x86_64 and the issue was not reproduced.

Comment 6 Nag Pavan Chilakam 2017-05-20 07:36:40 UTC
I have hit this issue in my brick mux setup, which hosts about 50 volumes.
It appeared after soaking the testbed with different test activities.

[root@dhcp35-45 ~]# gluster v heal cross3-27 info
Brick 10.70.35.45:/rhs/brick27/cross3-27
Status: Transport endpoint is not connected
Number of entries: -

Brick 10.70.35.130:/rhs/brick27/cross3-27
Status: Connected
Number of entries: 0

Brick 10.70.35.122:/rhs/brick27/cross3-27
Status: Connected
Number of entries: 0



Brick b1 is still up and I am also able to do I/O, which is getting written to all 3 disks, yet heal info still reports the brick as not connected.
Hence reopening the BZ.
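
A sketch of the kind of checks used to confirm the above; the mount point is a placeholder, while the volume name and brick path are from the output above:

gluster volume status cross3-27               # brick shows Online "Y" with a valid PID
ps -p <brick-pid>                             # the brick process is alive
echo test > /mnt/cross3-27/healcheck          # I/O from the client succeeds
ls -l /rhs/brick27/cross3-27/healcheck        # and reaches the brick backend on each node
gluster volume heal cross3-27 info            # yet the first brick still reports ENOTCONN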

Comment 7 Nag Pavan Chilakam 2017-05-20 07:38:18 UTC
The glfsheal log shows the following:


[2017-05-20 07:37:49.672672] I [afr.c:94:fix_quorum_options] 0-cross3-27-replicate-0: reindeer: incoming qtype = none
[2017-05-20 07:37:49.672710] I [afr.c:116:fix_quorum_options] 0-cross3-27-replicate-0: reindeer: quorum_count = 2147483647
[2017-05-20 07:37:49.674088] W [MSGID: 101174] [graph.c:360:_log_if_unknown_option] 0-cross3-27-readdir-ahead: option 'parallel-readdir' is not recognized
[2017-05-20 07:37:49.674206] I [MSGID: 104045] [glfs-master.c:91:notify] 0-gfapi: New graph 64686370-3335-2d34-352e-6c61622e656e (0) coming up
[2017-05-20 07:37:49.674239] I [MSGID: 114020] [client.c:2356:notify] 0-cross3-27-client-0: parent translators are ready, attempting connect on transport
[2017-05-20 07:37:49.677415] I [MSGID: 114020] [client.c:2356:notify] 0-cross3-27-client-1: parent translators are ready, attempting connect on transport
[2017-05-20 07:37:49.677806] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-cross3-27-client-0: changing port to 49152 (from 0)
[2017-05-20 07:37:49.680125] I [MSGID: 114020] [client.c:2356:notify] 0-cross3-27-client-2: parent translators are ready, attempting connect on transport
[2017-05-20 07:37:49.682924] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-cross3-27-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-20 07:37:49.683235] W [MSGID: 114043] [client-handshake.c:1104:client_setvolume_cbk] 0-cross3-27-client-0: failed to set the volume [No such file or directory]
[2017-05-20 07:37:49.683268] W [MSGID: 114007] [client-handshake.c:1133:client_setvolume_cbk] 0-cross3-27-client-0: failed to get 'process-uuid' from reply dict [Invalid argument]
[2017-05-20 07:37:49.683283] E [MSGID: 114044] [client-handshake.c:1139:client_setvolume_cbk] 0-cross3-27-client-0: SETVOLUME on remote-host failed [No such file or directory]
[2017-05-20 07:37:49.683294] I [MSGID: 114051] [client-handshake.c:1259:client_setvolume_cbk] 0-cross3-27-client-0: sending CHILD_CONNECTING event
[2017-05-20 07:37:49.683374] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-cross3-27-client-1: changing port to 49152 (from 0)
[2017-05-20 07:37:49.683434] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-cross3-27-client-0: disconnected from cross3-27-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-05-20 07:37:49.686033] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-cross3-27-client-2: changing port to 49152 (from 0)
[2017-05-20 07:37:49.688512] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-cross3-27-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-20 07:37:49.692156] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-cross3-27-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-20 07:37:49.692207] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-cross3-27-client-1: Connected to cross3-27-client-1, attached to remote volume '/rhs/brick27/cross3-27'.
[2017-05-20 07:37:49.692243] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-cross3-27-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-20 07:37:49.692353] I [MSGID: 108005] [afr-common.c:4659:afr_notify] 0-cross3-27-replicate-0: Subvolume 'cross3-27-client-1' came back up; going online.
[2017-05-20 07:37:49.695021] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-cross3-27-client-1: Server lk version = 1
[2017-05-20 07:37:49.695137] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-cross3-27-client-2: Connected to cross3-27-client-2, attached to remote volume '/rhs/brick27/cross3-27'.
[2017-05-20 07:37:49.695195] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-cross3-27-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2017-05-20 07:37:49.696395] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-cross3-27-client-2: Server lk version = 1
[2017-05-20 07:37:49.698251] I [MSGID: 104041] [glfs-resolve.c:889:__glfs_active_subvol] 0-cross3-27: switched to graph 64686370-3335-2d34-352e-6c61622e656e (0)
[2017-05-20 07:37:49.698929] W [MSGID: 114031] [client-rpc-fops.c:1102:client3_3_getxattr_cbk] 0-cross3-27-client-0: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001). Key: glusterfs.xattrop_index_gfid [Transport endpoint is not connected]

Comment 8 Nag Pavan Chilakam 2017-05-20 07:40:28 UTC
glusterd log:

[2017-05-20 07:39:41.682664] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-cross3-27-client-0: changing port to 49152 (from 0)
The message "I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-cross3-27-client-0: disconnected from cross3-27-client-0. Client process will keep trying to connect to glusterd until brick's port is available" repeated 4 times between [2017-05-20 07:39:21.667931] and [2017-05-20 07:39:37.682932]
[2017-05-20 07:39:41.685469] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-cross3-27-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
The message "I [MSGID: 114051] [client-handshake.c:1259:client_setvolume_cbk] 0-cross3-27-client-0: sending CHILD_CONNECTING event" repeated 4 times between [2017-05-20 07:39:25.670950] and [2017-05-20 07:39:41.686123]
[2017-05-20 07:39:41.686180] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-cross3-27-client-0: disconnected from cross3-27-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-05-20 07:39:45.686373] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-cross3-27-client-0: changing port to 49152 (from 0)
The message "E [MSGID: 114044] [client-handshake.c:1139:client_setvolume_cbk] 0-cross3-27-client-0: SETVOLUME on remote-host failed [No such file or directory]" repeated 4 times between [2017-05-20 07:39:29.675657] and [2017-05-20 07:39:45.689947]
[2017-05-20 07:39:45.689948] I [MSGID: 114051] [client-handshake.c:1259:client_setvolume_cbk] 0-cross3-27-client-0: sending CHILD_CONNECTING event
[2017-05-20 07:39:49.689889] I [rpc-clnt.c:2001:rpc





n2:
-client-0: failed to set the volume [No such file or directory]
[2017-05-20 07:40:02.749448] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-cross3-27-client-0: changing port to 49152 (from 0)
The message "I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-cross3-27-client-0: disconnected from cross3-27-client-0. Client process will keep trying to connect to glusterd until brick's port is available" repeated 4 times between [2017-05-20 07:39:42.736270] and [2017-05-20 07:39:58.749905]
[2017-05-20 07:40:02.752539] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-cross3-27-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
The message "I [MSGID: 114051] [client-handshake.c:1259:client_setvolume_cbk] 0-cross3-27-client-0: sending CHILD_CONNECTING event" repeated 4 times between [2017-05-20 07:39:46.739084] and [2017-05-20 07:40:02.752841]
[2017-05-20 07:40:02.752887] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-cross3-27-client-0: disconnected from cross3-27-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-05-20 07:40:06.752460] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-cross3-27-client-0: changing port to 49152 (from 0)
The message "E [MSGID: 114044] [client-handshake.c:1139:client_setvolume_cbk] 0-cross3-27-client-0: SETVOLUME on remote-host failed [No such file or directory]" repeated 4 times between [2017-05-20 07:39:50.743592] and [2017-05-20 07:40:06.756281]
[2017-05-20 07:40:06.756282] I [MSGID: 114051] [client-handshake.c:1259:client_setvolume_cbk] 0-cross3-27-client-0: sending CHILD_CONNECTING event
[2017-05-20 07:40:10.756047] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-cross3-27-client-0: changing port to 49152 (from 0)
The message "W [MSGID: 114007] [client-handshake.c:1133:client_setvolume_cbk] 0-cross3-27-client-0: failed to get 'process-uuid' from reply dict [Invalid argument]" repeated 4 times between [2017-05-20 07:39:54.746353] and [2017-05-20 07:40:10.760008]
[2017-05-20 07:40:10.760012] E [MSGID: 114044] [client-handshake.c:1139:client_setvolume_cbk] 0-cross3-27-client-0: SETVOLUME on remote-host failed [No such file or directory]
[2017-05-20 07:40:14.759215] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-cross3-27-client-0: changing port to 49152 (from 0)
The message "W [MSGID: 114043] [client-handshake.c:1104:client_setvolume_cbk] 0-cross3-27-client-0: failed to set the volume [No such file or directory]" repeated 4 times between [2017-05-20 07:39:58.749772] and [2017-05-20 07:40:14.762507]
[2017-05-20 07:40:14.762512] W [MSGID: 114007] [client-handshake.c:1133:client_setvolume_cbk] 0-cross3-27-client-0: failed to get 'process-uuid' from reply dict [Invalid argument]

Comment 9 Mohammed Rafi KC 2017-05-23 08:48:21 UTC
RCA:

When we stop the volume, glusterd actually sends two terminate requests to the brick process: one during the brick-op phase and another during the commit phase. Without multiplexing this did not cause any problem, because the process was supposed to stop anyway. But with multiplexing it is just a detach, which therefore gets executed twice. Those two requests can be executed at the same time; if that happens, we may delete the graph entry twice, because we do not take any lock during the link modification of the graph in glusterfs_handle_detach.

So the linked list will be moved twice, which results in the deletion of an independent brick.
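
Under this RCA, one way the race would surface on a multiplexed brick process, as a rough sketch; the volume names and the exact trigger are assumptions, not taken from the affected setup:

gluster volume stop somevol            # glusterd sends two detach requests for somevol's brick
gluster volume status cross3-27        # the unrelated brick may still be listed as online
gluster volume heal cross3-27 info     # but glfsheal now gets ENOTCONN for that brick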

Comment 10 Atin Mukherjee 2017-05-23 15:37:51 UTC
upstream patch : https://review.gluster.org/#/c/17374/

Comment 13 Atin Mukherjee 2017-05-30 16:31:49 UTC
downstream patch : https://code.engineering.redhat.com/gerrit/107687

Comment 15 Prasad Desala 2017-06-14 09:51:24 UTC
Verified this BZ on glusterfs version 3.8.4-27.el7rhgs.x86_64. Followed the same steps as in the description; with the bricks up and running, issuing heal info shows the brick status as Connected.
Repeated the same test 3 times and did not see "Transport endpoint is not connected" as the status in the heal info output.
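
A minimal sketch of the repeated check, assuming the hypothetical volume name from the reproduction sketch above:

for i in 1 2 3; do
    # re-run the reproduction steps, then count ENOTCONN lines (expected: 0)
    gluster volume heal testvol info | grep -c "Transport endpoint is not connected"
done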

Moving this BZ to Verified.

Comment 17 errata-xmlrpc 2017-09-21 04:41:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

