Description of problem:
Had an HC installation with three nodes on the 3.2 glusterfs build. I upgraded one of the nodes to the 3.3 glusterfs build and saw that the heal info count never becomes zero. The file gets healed, but the pending xattrs are set again. Another observation is that the private structure of the client xlator shows both "connecting" and "connected" as 1.

client 0:
===============
cat /rhev/data-center/mnt/glusterSD/10.70.36.78\:_data/.meta/graphs/active/data-client-1/private
[xlator.protocol.client.data-client-1.priv]
fd.0.remote_fd = -1
fd.1.remote_fd = -1
fd.2.remote_fd = -1
fd.3.remote_fd = -1
connecting = 1
connected = 1
total_bytes_read = 3889072373
ping_timeout = 30
total_bytes_written = 12075388236
ping_msgs_sent = 0
msgs_sent = 152360

client 1:
====================
cat /rhev/data-center/mnt/glusterSD/10.70.36.78\:_data/.meta/graphs/active/data-client-1/private
[xlator.protocol.client.data-client-1.priv]
fd.0.remote_fd = 1
connecting = 0
connected = 1
total_bytes_read = 1166694134
ping_timeout = 30
total_bytes_written = 37377716
ping_msgs_sent = 760
msgs_sent = 108352

cat /rhev/data-center/mnt/glusterSD/10.70.36.78\:_data/.meta/graphs/active/data-client-1/private
[xlator.protocol.client.data-client-1.priv]
fd.0.remote_fd = -1
fd.1.remote_fd = -1
connecting = 1
connected = 1
total_bytes_read = 794759321
ping_timeout = 30
total_bytes_written = 13378418443
ping_msgs_sent = 0
msgs_sent = 306251

Version-Release number of selected component (if applicable):

How reproducible:
Hit it once

Steps to Reproduce:
1. Installed HC with the 3.2 glusterfs build.
2. Upgraded one of the nodes (client-1) to the glusterfs 3.3 build.
3. Once the upgrade is done, heal info never becomes zero.
Actual results:
The file gets healed, but the pending xattrs are set again, and the private structure of the client xlator shows both "connecting" and "connected" as 1.

Expected results:
The file should get healed successfully, and the private structure should not show both "connecting" and "connected" as 1.

Additional info:
sosreports can be found at the link below:
=================================================
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1451756/
Logs are in the sos-report sosreport-yarrow.lab.eng.blr.redhat.com-20170517173214.tar.xz. However, it looks like the entire log is not present. The log file is a symbolic link pointing to a tailed log:

[raghu@unused glusterfs]$ ls -l rhev-data-center-mnt-glusterSD-10.70.36.78\:_data.log
lrwxrwxrwx. 1 raghu users 106 May 17 17:34 rhev-data-center-mnt-glusterSD-10.70.36.78:_data.log -> ../../../sos_strings/gluster/var.log.glusterfs.rhev-data-center-mnt-glusterSD-10.70.36.78:_data.log.tailed

The tailed log captures messages only after the disconnect. Full logs are needed to analyze the issue.
glusterd says the brick is listening on port 49155:

# gluster volume status data
Brick 10.70.36.77:/gluster_bricks/data/data    49155    0    Y    22436

The client logs say the client is connecting to port 49155:

[2017-06-01 08:56:05.149017] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-1: changing port to 49155 (from 0)
[2017-06-01 08:56:05.152941] E [socket.c:2318:socket_connect_finish] 0-data-client-1: connection to 10.70.36.77:49155 failed (Connection refused); disconnecting socket
[2017-06-01 08:56:09.153591] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-1: changing port to 49155 (from 0)
[2017-06-01 08:56:09.157781] E [socket.c:2318:socket_connect_finish] 0-data-client-1: connection to 10.70.36.77:49155 failed (Connection refused); disconnecting socket
[2017-06-01 08:56:13.158146] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-1: changing port to 49155 (from 0)
[2017-06-01 08:56:13.162099] E [socket.c:2318:socket_connect_finish] 0-data-client-1: connection to 10.70.36.77:49155 failed (Connection refused); disconnecting socket
[2017-06-01 08:56:17.162637] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-1: changing port to 49155 (from 0)
[2017-06-01 08:56:17.166604] E [socket.c:2318:socket_connect_finish] 0-data-client-1: connection to 10.70.36.77:49155 failed (Connection refused); disconnecting socket
[2017-06-01 08:56:21.167186] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-1: changing port to 49155 (from 0)
[2017-06-01 08:56:21.171153] E [socket.c:2318:socket_connect_finish] 0-data-client-1: connection to 10.70.36.77:49155 failed (Connection refused); disconnecting socket
[2017-06-01 08:56:25.172089] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-1: changing port to 49155 (from 0)
[2017-06-01 08:56:25.176367] E [socket.c:2318:socket_connect_finish] 0-data-client-1: connection to 10.70.36.77:49155 failed (Connection refused); disconnecting socket

But the brick is actually listening on 49152:

# ps ax | grep -i brick | grep data
root 22436 0.5 0.0 1217564 26164 ? Ssl 12:31 0:40 /usr/sbin/glusterfsd -s 10.70.36.77 --volfile-id data.10.70.36.77.gluster_bricks-data-data -p /var/lib/glusterd/vols/data/run/10.70.36.77-gluster_bricks-data-data.pid -S /var/run/gluster/77a68376a313c246358b3d10487b0e94.socket --brick-name /gluster_bricks/data/data -l /var/log/glusterfs/bricks/gluster_bricks-data-data.log --xlator-option *-posix.glusterd-uuid=f889a05f-648e-4890-b1e6-637e33b69803 --brick-port 49152 --xlator-option data-server.listen-port=49152

So this is an issue with the glusterd portmapper: it is handing out the wrong port for the brick.
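The mismatch above can be sketched as a quick cross-check. This is a minimal illustration, not gluster tooling: the helper names and regexes are invented here, and the two sample strings are taken directly from this report.

```python
# Hypothetical cross-check of the port glusterd's portmapper advertises
# (via `gluster volume status`) against the port the brick process was
# actually started with (its --brick-port argument).
import re

# Output line from `gluster volume status data` (portmapper's view).
status_line = "Brick 10.70.36.77:/gluster_bricks/data/data 49155 0 Y 22436"

# Brick process command line, abbreviated to the relevant options.
brick_cmdline = ("/usr/sbin/glusterfsd -s 10.70.36.77 "
                 "--brick-name /gluster_bricks/data/data "
                 "--brick-port 49152 "
                 "--xlator-option data-server.listen-port=49152")

def advertised_port(line):
    """First 5-digit field surrounded by whitespace is the TCP port."""
    return int(re.search(r"\s(\d{5})\s", line).group(1))

def actual_port(cmdline):
    """Port passed to the brick via --brick-port."""
    return int(re.search(r"--brick-port (\d+)", cmdline).group(1))

adv, act = advertised_port(status_line), actual_port(brick_cmdline)
print(adv, act, "MISMATCH" if adv != act else "ok")
```

With the values from this report, the check flags 49155 (advertised) vs 49152 (actual), which is exactly the condition the client keeps tripping over.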
From the statedump:

glusterd.peer2.rpc.msgs_sent=164
glusterd.pmap_port=49152
glusterd.pmap[49152].type=4
glusterd.pmap[49152].brickname=/gluster_bricks/engine/engine /gluster_bricks/data/data   <===== this is wrong; we can never have two bricks on the same port unless brick multiplexing is enabled, which it is not here
glusterd.pmap_port=49153
glusterd.pmap[49153].type=4
glusterd.pmap[49153].brickname=/gluster_bricks/engine/engine
glusterd.pmap_port=49154
glusterd.pmap[49154].type=4
glusterd.pmap[49154].brickname=/gluster_bricks/vmstore/vmstore /gluster_bricks/vmstore/vmstore   <===== same problem as above
glusterd.pmap_port=49155
glusterd.pmap[49155].type=4
glusterd.pmap[49155].brickname=/gluster_bricks/data/data   <=== this brick has an entry for port 49152 as well
glusterd.pmap_port=49156
glusterd.pmap[49156].type=0
glusterd.pmap[49156].brickname=(null)

From the brick log file gluster_bricks-data-data.log:

[2017-05-31 09:48:40.595526] I [MSGID: 100030] [glusterfsd.c:2412:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.8.4 (args: /usr/sbin/glusterfsd -s 10.70.36.77 --volfile-id data.10.70.36.77.gluster_bricks-data-data -p /var/lib/glusterd/vols/data/run/10.70.36.77-gluster_bricks-data-data.pid -S /var/run/gluster/77a68376a313c246358b3d10487b0e94.socket --brick-name /gluster_bricks/data/data -l /var/log/glusterfs/bricks/gluster_bricks-data-data.log --xlator-option *-posix.glusterd-uuid=f889a05f-648e-4890-b1e6-637e33b69803 --brick-port 49152 --xlator-option data-server.listen-port=49152)
[2017-05-31 09:48:40.630861] I [MSGID: 100030] [glusterfsd.c:2412:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.8.4 (args: /usr/sbin/glusterfsd -s 10.70.36.77 --volfile-id data.10.70.36.77.gluster_bricks-data-data -p /var/lib/glusterd/vols/data/run/10.70.36.77-gluster_bricks-data-data.pid -S /var/run/gluster/77a68376a313c246358b3d10487b0e94.socket --brick-name /gluster_bricks/data/data -l /var/log/glusterfs/bricks/gluster_bricks-data-data.log --xlator-option *-posix.glusterd-uuid=f889a05f-648e-4890-b1e6-637e33b69803 --brick-port 49155 --xlator-option data-server.listen-port=49155)

The above clearly indicates that this brick was started twice: once with port 49152, and in a second attempt with 49155. That is the actual problem here, and it is why the same brick has entries for two different ports. Additionally, one port carries two different brick entries, which is also wrong.

Mapping these timestamps against the glusterd log gives the following sequence:

[2017-05-31 09:48:40.563893] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-05-31 09:48:40.564975] I [MSGID: 106163] [glusterd-handshake.c:1274:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 31001
[2017-05-31 09:48:40.574124] I [MSGID: 106490] [glusterd-handler.c:2610:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 75ab54f6-c570-4eda-8d8a-ed37f96d43ca
[2017-05-31 09:48:40.577823] I [MSGID: 106493] [glusterd-handler.c:3865:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.36.76 (0), ret: 0, op_ret: 0
[2017-05-31 09:48:40.580988] C [MSGID: 106003] [glusterd-server-quorum.c:341:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume data. Starting local bricks.
[2017-05-31 09:48:40.585199] I [rpc-clnt.c:1058:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2017-05-31 09:48:40.586056] C [MSGID: 106003] [glusterd-server-quorum.c:341:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume engine. Starting local bricks.
[2017-05-31 09:48:40.588661] I [rpc-clnt.c:1058:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2017-05-31 09:48:40.588818] C [MSGID: 106003] [glusterd-server-quorum.c:341:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume vmstore. Starting local bricks.
[2017-05-31 09:48:40.591559] I [rpc-clnt.c:1058:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2017-05-31 09:48:40.592001] I [rpc-clnt.c:1058:rpc_clnt_connection_init] 0-nfs: setting frame-timeout to 600

Further analysis is in progress.
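The double-start scenario described above can be modeled with a toy port-map registry. This is NOT gluster's actual implementation (the class, method names, and `evict_stale` flag are invented for illustration; the real fix is in the patches referenced below): it only shows how registering the same brick twice without cleaning up the first entry produces exactly the statedump inconsistency seen here.

```python
# Toy model of a port-map registry, illustrating the inconsistency in the
# statedump above. Names and structure are invented for illustration.

class PortMap:
    def __init__(self):
        self.by_port = {}  # port -> list of brick paths

    def register(self, port, brick, evict_stale=False):
        if evict_stale:
            # Invariant a correct portmapper must keep: a brick appears
            # under at most one port, so drop any earlier entry first.
            for bricks in self.by_port.values():
                if brick in bricks:
                    bricks.remove(brick)
        self.by_port.setdefault(port, []).append(brick)

# Sequence from the report: the data brick is started twice, first with
# 49152 and then with 49155, and the first entry is never cleaned up.
pm = PortMap()
pm.register(49152, "/gluster_bricks/engine/engine")
pm.register(49152, "/gluster_bricks/data/data")   # first start attempt
pm.register(49155, "/gluster_bricks/data/data")   # second start, stale
                                                  # 49152 entry remains
# Result: port 49152 now lists two bricks, and the data brick appears
# under two ports -- the same shape as the broken statedump.
```

With `evict_stale=True` on the second registration, the stale 49152 entry for the data brick would be removed and the registry would stay consistent, which is the invariant the actual fix restores.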
Upstream patch: https://review.gluster.org/17447
The RCA is detailed in the commit message.
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/108338
Verified, and it works fine with build glusterfs-3.8.4-29.el7rhgs.x86_64. I followed the procedure below to test:
============================================
1) Had an HC installation with glusterfs 3.2 async bits.
2) Moved one node to maintenance.
3) Upgraded glusterfs to version 3.8.4-29.
4) Brought the node back up.
5) Self-heal completed successfully and no issues were seen. The port the brick is listening on and the port returned by glusterd are the same.

Repeated the process for all the other nodes and did not see any issues.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774