Description of problem: ======================= Modified volume options from peer node are not syncing once glusterd comes up in a node where one of the volume brick is down due to underlying filesystem(xfs) crash. Version-Release number of selected component (if applicable): ============================================================= glusterfs-3.7.9-4 How reproducible: ================= Always Steps to Reproduce: =================== 1. Create a simple distribute volume using two node cluster. 2. Crash one volume brick filesystem part of node-1. 3. stop glusterd on node-1 4. goto node-2 and modify some volume options. 5. start glusterd on node-1. 6. Check handshake has happened using volume get option on node1 (OR) using volume info. Actual results: =============== Handshake not happening. Expected results: ================= Handshake should happen. Additional info:
i will provide the logs
glusterd log from node-1: ========================= [2016-05-12 05:55:30.764115] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2016-05-12 05:55:30.766892] I [MSGID: 106163] [glusterd-handshake.c:1194:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30707 [2016-05-12 05:55:30.825808] I [MSGID: 106490] [glusterd-handler.c:2600:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: f6dea3b5-a249-4108-9ac8-0d0319e9ed5d [2016-05-12 05:55:30.827399] I [MSGID: 106009] [glusterd-utils.c:2824:glusterd_compare_friend_volume] 0-management: Version of volume Dis differ. local version = 2, remote version = 3 on peer 10.70.43.151 [2016-05-12 05:55:30.827750] C [MSGID: 106425] [glusterd-utils.c:3134:glusterd_import_new_brick] 0-management: realpath() failed for brick /bricks/brick0/h0. The underlying file system may be in bad state [Input/output error] [2016-05-12 05:55:30.827838] E [MSGID: 106376] [glusterd-sm.c:1401:glusterd_friend_sm] 0-glusterd: handler returned: -1 [2016-05-12 05:55:30.834463] I [MSGID: 106493] [glusterd-rpc-ops.c:481:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: f6dea3b5-a249-4108-9ac8-0d0319e9ed5d, host: 10.70.43.151, port: 0 [2016-05-12 05:55:30.890545] I [MSGID: 106492] [glusterd-handler.c:2776:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: f6dea3b5-a249-4108-9ac8-0d0319e9ed5d [2016-05-12 05:55:30.890595] I [MSGID: 106502] [glusterd-handler.c:2821:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend [2016-05-12 05:55:30.891211] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-nfs: setting frame-timeout to 600 [2016-05-12 05:55:31.892768] I [MSGID: 106540] [glusterd-utils.c:4384:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered MOUNTV3 successfully [2016-05-12 05:55:31.893132] I [MSGID: 106540] [glusterd-utils.c:4393:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered MOUNTV1 successfully [2016-05-12 05:55:31.893455] I [MSGID: 106540] [glusterd-utils.c:4402:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NFSV3 successfully [2016-05-12 05:55:31.893808] I [MSGID: 106540] [glusterd-utils.c:4411:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NLM v4 successfully [2016-05-12 05:55:31.894266] I [MSGID: 106540] [glusterd-utils.c:4420:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NLM v1 successfully [2016-05-12 05:55:31.894626] I [MSGID: 106540] [glusterd-utils.c:4429:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered ACL v3 successfully [2016-05-12 05:55:31.900965] W [socket.c:3133:socket_connect] 0-nfs: Ignore failed connection attempt on , (No such file or directory) [2016-05-12 05:55:31.901072] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600 [2016-05-12 05:55:31.901196] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: glustershd already stopped [2016-05-12 05:55:31.901274] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-quotad: setting frame-timeout to 600 [2016-05-12 05:55:31.901408] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: quotad already stopped [2016-05-12 05:55:31.901463] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-bitd: setting frame-timeout to 600 [2016-05-12 05:55:31.901589] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped [2016-05-12 05:55:31.901636] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-scrub: setting frame-timeout to 600 [2016-05-12 05:55:31.901745] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped [2016-05-12 05:55:31.909398] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2016-05-12 05:55:31.909624] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600 [2016-05-12 05:55:31.909847] I [MSGID: 106493] [glusterd-rpc-ops.c:696:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: f6dea3b5-a249-4108-9ac8-0d0319e9ed5d [2016-05-12 05:55:31.910016] W [socket.c:701:__socket_rwv] 0-nfs: readv on /var/run/gluster/b7f9049f47b3ab44cdbaf600d7f5c6f8.socket failed (Invalid argument) [2016-05-12 05:55:31.910034] I [MSGID: 106006] [glusterd-svc-mgmt.c:323:glusterd_svc_common_rpc_notify] 0-management: nfs has disconnected from glusterd. [2016-05-12 05:55:31.910296] W [socket.c:701:__socket_rwv] 0-management: readv on /var/run/gluster/c1eec530a1c811faf8e3d20e6c09c320.socket failed (Invalid argument) [2016-05-12 05:55:31.910527] I [MSGID: 106005] [glusterd-handler.c:5034:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.42.77:/bricks/brick0/h0 has disconnected from glusterd. [2016-05-12 05:55:34.993921] I [MSGID: 106499] [glusterd-handler.c:4330:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Dis [2016-05-12 05:55:35.742207] I [MSGID: 106005] [glusterd-handler.c:5034:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.42.77:/bricks/brick0/h0 has disconnected from glusterd. [2016-05-12 05:55:36.189912] I [MSGID: 106499] [glusterd-handler.c:4330:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Dis [2016-05-12 05:55:38.742298] I [MSGID: 106005] [glusterd-handler.c:5034:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.42.77:/bricks/brick0/h0 has disconnected from glusterd. [2016-05-12 05:55:41.742760] I [MSGID: 106005] [glusterd-handler.c:5034:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.42.77:/bricks/brick0/h0 has disconnected from glusterd. [2016-05-12 05:55:41.962352] I [MSGID: 106488] [glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req [2016-05-12 05:55:41.963934] I [MSGID: 106488] [glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req [2016-05-12 05:55:44.743559] I [MSGID: 106005] [glusterd-handler.c:5034:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.42.77:/bricks/brick0/h0 has disconnected from glusterd. (END) glusterd log from node-2 ======================== [2016-05-12 05:55:27.968405] W [socket.c:984:__socket_keepalive] 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 7, Invalid argument [2016-05-12 05:55:27.968450] E [socket.c:3089:socket_connect] 0-management: Failed to set keep-alive: Invalid argument The message "I [MSGID: 106004] [glusterd-handler.c:5192:__glusterd_peer_rpc_notify] 0-management: Peer <dhcp42-77.lab.eng.blr.redhat.com> (<f6f03f6a-dc88-4592-95b4-76482b1742d5>), in state <Peer in Cluster>, has disconnected from glusterd." repeated 11 times between [2016-05-12 05:54:44.009808] and [2016-05-12 05:55:24.962631] [2016-05-12 05:55:30.769240] I [MSGID: 106163] [glusterd-handshake.c:1194:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30707 [2016-05-12 05:55:30.828060] I [MSGID: 106490] [glusterd-handler.c:2600:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: f6f03f6a-dc88-4592-95b4-76482b1742d5 [2016-05-12 05:55:30.836792] I [MSGID: 106493] [glusterd-handler.c:3842:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to dhcp42-77.lab.eng.blr.redhat.com (0), ret: 0 [2016-05-12 05:55:30.894602] I [MSGID: 106493] [glusterd-rpc-ops.c:696:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: f6f03f6a-dc88-4592-95b4-76482b1742d5 [2016-05-12 05:55:30.894652] I [MSGID: 106492] [glusterd-handler.c:2776:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: f6f03f6a-dc88-4592-95b4-76482b1742d5 [2016-05-12 05:55:30.894683] I [MSGID: 106502] [glusterd-handler.c:2821:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend The message "I [MSGID: 106488] [glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req" repeated 3 times between [2016-05-12 05:54:20.746141] and [2016-05-12 05:55:18.238141] (END)
RCA : Since the volume configuration was changed on glusterd instance on Node 1 where glusterd was down on Node 2, when glusterd restarts on Node 2 importing the updated volume happens at Node 2. However the brickinfo objects are created freshly and here since the underlying the file system was crashed, realpath () call in glusterd_import_new_brick () fails resulting into importing the volume and volume data goes inconsistent.
An upstream patch http://review.gluster.org/14306 posted for review.
Downstream patch : https://code.engineering.redhat.com/gerrit/#/c/74663/ Upstream patches: mainline : http://review.gluster.org/#/c/14306 release-3.7 : http://review.gluster.org/#/c/14410 release-3.8 : http://review.gluster.org/#/c/14411
Verified this bug using the build "glusterfs-3.7.9-6" and found that fix is working good. Steps done: Repeated the reproducing steps mentioned in the description section. Moving to verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240