Bug 1451756
Summary: | client fails to connect to the brick due to an incorrect port reported back by glusterd | | |
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | RamaKasturi <knarra> |
Component: | glusterd | Assignee: | Atin Mukherjee <amukherj> |
Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | | |
Version: | rhgs-3.3 | CC: | knarra, rgowdapp, rhinduja, rhs-bugs, rkavunga, storage-qa-internal, vbellur |
Target Milestone: | --- | | |
Target Release: | RHGS 3.3.0 | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | glusterfs-3.8.4-28 | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | | |
: | 1457981 (view as bug list) | Environment: | |
Last Closed: | 2017-09-21 04:43:23 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | 1457981 | | |
Bug Blocks: | 1417151 | | |
Description by RamaKasturi, 2017-05-17 13:02:51 UTC
sosreports can be found in the link below:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1451756/

Logs are found in the sos-report sosreport-yarrow.lab.eng.blr.redhat.com-20170517173214.tar.xz. However, it looks like the entire log is not present. The log file is a symbolic link pointing to a tailed log:

```
[raghu@unused glusterfs]$ ls -l rhev-data-center-mnt-glusterSD-10.70.36.78\:_data.log
lrwxrwxrwx. 1 raghu users 106 May 17 17:34 rhev-data-center-mnt-glusterSD-10.70.36.78:_data.log -> ../../../sos_strings/gluster/var.log.glusterfs.rhev-data-center-mnt-glusterSD-10.70.36.78:_data.log.tailed
```

The tailed log captures messages only after the disconnect. Full logs are needed to analyze the issue.

glusterd says the brick is listening on port 49155:

```
# gluster volume status data
Brick 10.70.36.77:/gluster_bricks/data/data    49155    0    Y    22436
```

The client log confirms that the client is connecting to port 49155:

```
[2017-06-01 08:56:05.149017] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-1: changing port to 49155 (from 0)
[2017-06-01 08:56:05.152941] E [socket.c:2318:socket_connect_finish] 0-data-client-1: connection to 10.70.36.77:49155 failed (Connection refused); disconnecting socket
[2017-06-01 08:56:09.153591] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-1: changing port to 49155 (from 0)
[2017-06-01 08:56:09.157781] E [socket.c:2318:socket_connect_finish] 0-data-client-1: connection to 10.70.36.77:49155 failed (Connection refused); disconnecting socket
[2017-06-01 08:56:13.158146] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-1: changing port to 49155 (from 0)
[2017-06-01 08:56:13.162099] E [socket.c:2318:socket_connect_finish] 0-data-client-1: connection to 10.70.36.77:49155 failed (Connection refused); disconnecting socket
[2017-06-01 08:56:17.162637] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-1: changing port to 49155 (from 0)
[2017-06-01 08:56:17.166604] E [socket.c:2318:socket_connect_finish] 0-data-client-1: connection to 10.70.36.77:49155 failed (Connection refused); disconnecting socket
[2017-06-01 08:56:21.167186] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-1: changing port to 49155 (from 0)
[2017-06-01 08:56:21.171153] E [socket.c:2318:socket_connect_finish] 0-data-client-1: connection to 10.70.36.77:49155 failed (Connection refused); disconnecting socket
[2017-06-01 08:56:25.172089] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-1: changing port to 49155 (from 0)
[2017-06-01 08:56:25.176367] E [socket.c:2318:socket_connect_finish] 0-data-client-1: connection to 10.70.36.77:49155 failed (Connection refused); disconnecting
```

But the brick is actually listening on 49152:

```
# ps ax | grep -i brick | grep data
root 22436 0.5 0.0 1217564 26164 ? Ssl 12:31 0:40 /usr/sbin/glusterfsd -s 10.70.36.77 --volfile-id data.10.70.36.77.gluster_bricks-data-data -p /var/lib/glusterd/vols/data/run/10.70.36.77-gluster_bricks-data-data.pid -S /var/run/gluster/77a68376a313c246358b3d10487b0e94.socket --brick-name /gluster_bricks/data/data -l /var/log/glusterfs/bricks/gluster_bricks-data-data.log --xlator-option *-posix.glusterd-uuid=f889a05f-648e-4890-b1e6-637e33b69803 --brick-port 49152 --xlator-option data-server.listen-port=49152
```

So this is an issue with the glusterd portmapper: it is returning the wrong port for the brick.
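For anyone reproducing this, a minimal cross-check of the two ports on the brick node could look like the following (a sketch, assuming the same volume and brick path as above; `netstat -tlnp` can stand in for `ss -tlnp` on older hosts):

```
# Port glusterd advertises for the brick:
gluster volume status data

# Ports the brick processes are actually listening on (requires root):
ss -tlnp | grep glusterfsd

# The intended listen port is also visible on the brick's command line:
ps ax | grep '[g]lusterfsd' | grep -o -- '--brick-port [0-9]*'
```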
From the statedump:

```
glusterd.peer2.rpc.msgs_sent=164
glusterd.pmap_port=49152
glusterd.pmap[49152].type=4
glusterd.pmap[49152].brickname=/gluster_bricks/engine/engine /gluster_bricks/data/data   <===== this is wrong: two bricks can never share one port unless brick multiplexing is enabled, and it is not enabled here
glusterd.pmap_port=49153
glusterd.pmap[49153].type=4
glusterd.pmap[49153].brickname=/gluster_bricks/engine/engine
glusterd.pmap_port=49154
glusterd.pmap[49154].type=4
glusterd.pmap[49154].brickname=/gluster_bricks/vmstore/vmstore /gluster_bricks/vmstore/vmstore   <===== same problem as above
glusterd.pmap_port=49155
glusterd.pmap[49155].type=4
glusterd.pmap[49155].brickname=/gluster_bricks/data/data   <=== this brick also has an entry for port 49152
glusterd.pmap_port=49156
glusterd.pmap[49156].type=0
glusterd.pmap[49156].brickname=(null)
```

From the brick logfile gluster_bricks-data-data.log:

```
[2017-05-31 09:48:40.595526] I [MSGID: 100030] [glusterfsd.c:2412:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.8.4 (args: /usr/sbin/glusterfsd -s 10.70.36.77 --volfile-id data.10.70.36.77.gluster_bricks-data-data -p /var/lib/glusterd/vols/data/run/10.70.36.77-gluster_bricks-data-data.pid -S /var/run/gluster/77a68376a313c246358b3d10487b0e94.socket --brick-name /gluster_bricks/data/data -l /var/log/glusterfs/bricks/gluster_bricks-data-data.log --xlator-option *-posix.glusterd-uuid=f889a05f-648e-4890-b1e6-637e33b69803 --brick-port 49152 --xlator-option data-server.listen-port=49152)
[2017-05-31 09:48:40.630861] I [MSGID: 100030] [glusterfsd.c:2412:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.8.4 (args: /usr/sbin/glusterfsd -s 10.70.36.77 --volfile-id data.10.70.36.77.gluster_bricks-data-data -p /var/lib/glusterd/vols/data/run/10.70.36.77-gluster_bricks-data-data.pid -S /var/run/gluster/77a68376a313c246358b3d10487b0e94.socket --brick-name /gluster_bricks/data/data -l /var/log/glusterfs/bricks/gluster_bricks-data-data.log --xlator-option *-posix.glusterd-uuid=f889a05f-648e-4890-b1e6-637e33b69803 --brick-port 49155 --xlator-option data-server.listen-port=49155)
```

The above clearly indicates that this brick was started twice: first with port 49152, then, in a second attempt, with port 49155. That double start is the actual problem here, and it is why the same brick appears under two different ports. In addition, one port carries two different brick entries, which is equally wrong. Mapping these timestamps against the glusterd log gives the following sequence:

```
[2017-05-31 09:48:40.563893] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-05-31 09:48:40.564975] I [MSGID: 106163] [glusterd-handshake.c:1274:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 31001
[2017-05-31 09:48:40.574124] I [MSGID: 106490] [glusterd-handler.c:2610:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 75ab54f6-c570-4eda-8d8a-ed37f96d43ca
[2017-05-31 09:48:40.577823] I [MSGID: 106493] [glusterd-handler.c:3865:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.36.76 (0), ret: 0, op_ret: 0
[2017-05-31 09:48:40.580988] C [MSGID: 106003] [glusterd-server-quorum.c:341:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume data. Starting local bricks.
[2017-05-31 09:48:40.585199] I [rpc-clnt.c:1058:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2017-05-31 09:48:40.586056] C [MSGID: 106003] [glusterd-server-quorum.c:341:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume engine. Starting local bricks.
[2017-05-31 09:48:40.588661] I [rpc-clnt.c:1058:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2017-05-31 09:48:40.588818] C [MSGID: 106003] [glusterd-server-quorum.c:341:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume vmstore. Starting local bricks.
[2017-05-31 09:48:40.591559] I [rpc-clnt.c:1058:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2017-05-31 09:48:40.592001] I [rpc-clnt.c:1058:rpc_clnt_connection_init] 0-nfs: setting frame-timeout to 600
```
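As a side note, the evidence above can be gathered on a live node roughly as follows (a sketch; it assumes the gluster daemons dump state on SIGUSR1 into /var/run/gluster/, and the brick log path shown above):

```
# Trigger a fresh statedump from the running glusterd.
kill -USR1 "$(pidof glusterd)"

# Inspect the portmap table for duplicate or stale brick entries.
grep -A2 'pmap_port' /var/run/gluster/glusterdump.*

# Check whether the brick was (re)started with different ports.
grep 'Started running' /var/log/glusterfs/bricks/gluster_bricks-data-data.log \
  | grep -o -- '--brick-port [0-9]*'
```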
Further analysis is in progress.

Upstream patch: https://review.gluster.org/17447 (the RCA is detailed in the commit message).

Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/108338

Verified, and it works fine with build glusterfs-3.8.4-29.el7rhgs.x86_64. The procedure below was followed to test (a rough scripted form of the per-node checks is sketched at the end of this report):

1. Had an HC installation with glusterfs 3.2 async bits.
2. Moved one node to maintenance.
3. Upgraded glusterfs to version 3.8.4-29.
4. Brought the node back up.
5. Saw that self-heal completed successfully with no issues, and that the port the brick is listening on matches the port returned by glusterd.

Repeated the process for all the other nodes and did not see any issues.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774
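For reference, a minimal sketch of the per-node port and heal checks used in the verification above (not part of the original report; it assumes the data volume and brick path used throughout this bug, and the `gluster volume status` output layout shown earlier):

```
#!/bin/bash
# Hypothetical post-upgrade check for this bug on a single node:
# the port glusterd advertises for the data brick must match the
# port the brick process was actually started with.

# Port advertised by glusterd (third field of the first matching brick
# line; on multi-node output, filter by the local host's IP as needed).
advertised=$(gluster volume status data \
  | awk '/gluster_bricks\/data\/data/ {print $3; exit}')

# Port the local glusterfsd was started with, read off its command line.
actual=$(ps ax | grep '[g]lusterfsd' | grep 'gluster_bricks/data/data' \
  | grep -o -- '--brick-port [0-9]*' | awk '{print $2}')

if [ "$advertised" = "$actual" ]; then
  echo "OK: brick advertised and listening on port $advertised"
else
  echo "MISMATCH: glusterd reports ${advertised:-?}, brick started with ${actual:-?}"
fi

# Confirm self-heal has completed before moving on to the next node.
gluster volume heal data info
```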