+++ This bug was initially created as a clone of Bug #1335531 +++
+++ This bug was initially created as a clone of Bug #1335357 +++

Description of problem:
=======================
Volume options modified on a peer node do not sync once glusterd comes back up on a node where one of the volume's bricks is down due to an underlying file system (XFS) crash.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.9-4

How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Create a simple distribute volume on a two-node cluster.
2. Crash the file system of one volume brick on node-1.
3. Stop glusterd on node-1.
4. Go to node-2 and modify some volume options.
5. Start glusterd on node-1.
6. Check that the handshake has happened on node-1 using "gluster volume get" (or "gluster volume info").

Actual results:
===============
The handshake does not happen.

Expected results:
=================
The handshake should happen.

Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-05-12 00:08:12 EDT ---

This bug is automatically being proposed for the current z-stream release of Red Hat Gluster Storage 3 by setting the release flag 'rhgs-3.1.z' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Byreddy on 2016-05-12 00:09:39 EDT ---

I will provide the logs.

--- Additional comment from Byreddy on 2016-05-12 00:15:27 EDT ---

Console logs from both the nodes.
On node-1:
==========
[root@dhcp42-77 ~]# gluster volume status
Status of volume: Dis
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.77:/bricks/brick0/af         N/A       N/A        N       N/A
Brick 10.70.42.77:/bricks/brick1/kl         49153     0          Y       1711
NFS Server on localhost                     2049      0          Y       3549
NFS Server on 10.70.43.151                  2049      0          Y       14653

Task Status of Volume Dis
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp42-77 ~]# gluster peer status
Number of Peers: 1

Hostname: 10.70.43.151
Uuid: f6dea3b5-a249-4108-9ac8-0d0319e9ed5d
State: Peer in Cluster (Connected)

[root@dhcp42-77 ~]# gluster volume info

Volume Name: Dis
Type: Distribute
Volume ID: 24485f18-5dcf-4483-af2e-2b6a41e6b965
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 10.70.42.77:/bricks/brick0/af
Brick2: 10.70.42.77:/bricks/brick1/kl
Options Reconfigured:
performance.readdir-ahead: on

[root@dhcp42-77 ~]# systemctl stop glusterd
[root@dhcp42-77 ~]# systemctl start glusterd
[root@dhcp42-77 ~]# gluster peer status
Number of Peers: 1

Hostname: 10.70.43.151
Uuid: f6dea3b5-a249-4108-9ac8-0d0319e9ed5d
State: Peer in Cluster (Connected)

[root@dhcp42-77 ~]# gluster volume status
Status of volume: Dis
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.42.77:/bricks/brick0/af         N/A       N/A        N       N/A
Brick 10.70.42.77:/bricks/brick1/kl         49153     0          Y       1711
NFS Server on localhost                     2049      0          Y       14187
NFS Server on 10.70.43.151                  2049      0          Y       14653

Task Status of Volume Dis
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp42-77 ~]# gluster volume info

Volume Name: Dis
Type: Distribute
Volume ID: 24485f18-5dcf-4483-af2e-2b6a41e6b965
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 10.70.42.77:/bricks/brick0/af
Brick2: 10.70.42.77:/bricks/brick1/kl
Options Reconfigured:
performance.readdir-ahead: on

[root@dhcp42-77 ~]# gluster volume get Dis performance.readdir-ahead
Option                                   Value
------                                   -----
performance.readdir-ahead                on

On node-2:
==========
[root@dhcp43-151 ~]# gluster peer status
Number of Peers: 1

Hostname: dhcp42-77.lab.eng.blr.redhat.com
Uuid: bc217a6c-d3de-40d7-be12-0cd51c98a877
State: Peer in Cluster (Connected)

[root@dhcp43-151 ~]# gluster volume info

Volume Name: Dis
Type: Distribute
Volume ID: 24485f18-5dcf-4483-af2e-2b6a41e6b965
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 10.70.42.77:/bricks/brick0/af
Brick2: 10.70.42.77:/bricks/brick1/kl
Options Reconfigured:
performance.readdir-ahead: on

[root@dhcp43-151 ~]# gluster volume set Dis performance.readdir-ahead off
volume set: success

[root@dhcp43-151 ~]# gluster volume info

Volume Name: Dis
Type: Distribute
Volume ID: 24485f18-5dcf-4483-af2e-2b6a41e6b965
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 10.70.42.77:/bricks/brick0/af
Brick2: 10.70.42.77:/bricks/brick1/kl
Options Reconfigured:
performance.readdir-ahead: off

[root@dhcp43-151 ~]# gluster volume get Dis performance.readdir-ahead
Option                                   Value
------                                   -----
performance.readdir-ahead                off

--- Additional comment from RHEL Product and Program Management on 2016-05-12 01:02:29 EDT ---

This bug
report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

--- Additional comment from Byreddy on 2016-05-12 02:02:30 EDT ---

glusterd log from node-1:
=========================
[2016-05-12 05:55:30.764115] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-05-12 05:55:30.766892] I [MSGID: 106163] [glusterd-handshake.c:1194:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30707
[2016-05-12 05:55:30.825808] I [MSGID: 106490] [glusterd-handler.c:2600:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: f6dea3b5-a249-4108-9ac8-0d0319e9ed5d
[2016-05-12 05:55:30.827399] I [MSGID: 106009] [glusterd-utils.c:2824:glusterd_compare_friend_volume] 0-management: Version of volume Dis differ. local version = 2, remote version = 3 on peer 10.70.43.151
[2016-05-12 05:55:30.827750] C [MSGID: 106425] [glusterd-utils.c:3134:glusterd_import_new_brick] 0-management: realpath() failed for brick /bricks/brick0/h0.
The underlying file system may be in bad state [Input/output error]
[2016-05-12 05:55:30.827838] E [MSGID: 106376] [glusterd-sm.c:1401:glusterd_friend_sm] 0-glusterd: handler returned: -1
[2016-05-12 05:55:30.834463] I [MSGID: 106493] [glusterd-rpc-ops.c:481:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: f6dea3b5-a249-4108-9ac8-0d0319e9ed5d, host: 10.70.43.151, port: 0
[2016-05-12 05:55:30.890545] I [MSGID: 106492] [glusterd-handler.c:2776:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: f6dea3b5-a249-4108-9ac8-0d0319e9ed5d
[2016-05-12 05:55:30.890595] I [MSGID: 106502] [glusterd-handler.c:2821:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2016-05-12 05:55:30.891211] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-nfs: setting frame-timeout to 600
[2016-05-12 05:55:31.892768] I [MSGID: 106540] [glusterd-utils.c:4384:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered MOUNTV3 successfully
[2016-05-12 05:55:31.893132] I [MSGID: 106540] [glusterd-utils.c:4393:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered MOUNTV1 successfully
[2016-05-12 05:55:31.893455] I [MSGID: 106540] [glusterd-utils.c:4402:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NFSV3 successfully
[2016-05-12 05:55:31.893808] I [MSGID: 106540] [glusterd-utils.c:4411:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NLM v4 successfully
[2016-05-12 05:55:31.894266] I [MSGID: 106540] [glusterd-utils.c:4420:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NLM v1 successfully
[2016-05-12 05:55:31.894626] I [MSGID: 106540] [glusterd-utils.c:4429:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered ACL v3 successfully
[2016-05-12 05:55:31.900965] W [socket.c:3133:socket_connect] 0-nfs: Ignore failed connection attempt on , (No such file or directory)
[2016-05-12 05:55:31.901072] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600
[2016-05-12 05:55:31.901196] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: glustershd already stopped
[2016-05-12 05:55:31.901274] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-quotad: setting frame-timeout to 600
[2016-05-12 05:55:31.901408] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: quotad already stopped
[2016-05-12 05:55:31.901463] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-bitd: setting frame-timeout to 600
[2016-05-12 05:55:31.901589] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
[2016-05-12 05:55:31.901636] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-scrub: setting frame-timeout to 600
[2016-05-12 05:55:31.901745] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped
[2016-05-12 05:55:31.909398] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2016-05-12 05:55:31.909624] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2016-05-12 05:55:31.909847] I [MSGID: 106493] [glusterd-rpc-ops.c:696:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: f6dea3b5-a249-4108-9ac8-0d0319e9ed5d
[2016-05-12 05:55:31.910016] W [socket.c:701:__socket_rwv] 0-nfs: readv on /var/run/gluster/b7f9049f47b3ab44cdbaf600d7f5c6f8.socket failed (Invalid argument)
[2016-05-12 05:55:31.910034] I [MSGID: 106006] [glusterd-svc-mgmt.c:323:glusterd_svc_common_rpc_notify] 0-management: nfs has disconnected from glusterd.
[2016-05-12 05:55:31.910296] W [socket.c:701:__socket_rwv] 0-management: readv on /var/run/gluster/c1eec530a1c811faf8e3d20e6c09c320.socket failed (Invalid argument)
[2016-05-12 05:55:31.910527] I [MSGID: 106005] [glusterd-handler.c:5034:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.42.77:/bricks/brick0/h0 has disconnected from glusterd.
[2016-05-12 05:55:34.993921] I [MSGID: 106499] [glusterd-handler.c:4330:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Dis
[2016-05-12 05:55:35.742207] I [MSGID: 106005] [glusterd-handler.c:5034:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.42.77:/bricks/brick0/h0 has disconnected from glusterd.
[2016-05-12 05:55:36.189912] I [MSGID: 106499] [glusterd-handler.c:4330:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Dis
[2016-05-12 05:55:38.742298] I [MSGID: 106005] [glusterd-handler.c:5034:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.42.77:/bricks/brick0/h0 has disconnected from glusterd.
[2016-05-12 05:55:41.742760] I [MSGID: 106005] [glusterd-handler.c:5034:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.42.77:/bricks/brick0/h0 has disconnected from glusterd.
[2016-05-12 05:55:41.962352] I [MSGID: 106488] [glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2016-05-12 05:55:41.963934] I [MSGID: 106488] [glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2016-05-12 05:55:44.743559] I [MSGID: 106005] [glusterd-handler.c:5034:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.42.77:/bricks/brick0/h0 has disconnected from glusterd.

glusterd log from node-2:
=========================
[2016-05-12 05:55:27.968405] W [socket.c:984:__socket_keepalive] 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 7, Invalid argument
[2016-05-12 05:55:27.968450] E [socket.c:3089:socket_connect] 0-management: Failed to set keep-alive: Invalid argument
The message "I [MSGID: 106004] [glusterd-handler.c:5192:__glusterd_peer_rpc_notify] 0-management: Peer <dhcp42-77.lab.eng.blr.redhat.com> (<f6f03f6a-dc88-4592-95b4-76482b1742d5>), in state <Peer in Cluster>, has disconnected from glusterd."
repeated 11 times between [2016-05-12 05:54:44.009808] and [2016-05-12 05:55:24.962631]
[2016-05-12 05:55:30.769240] I [MSGID: 106163] [glusterd-handshake.c:1194:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30707
[2016-05-12 05:55:30.828060] I [MSGID: 106490] [glusterd-handler.c:2600:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: f6f03f6a-dc88-4592-95b4-76482b1742d5
[2016-05-12 05:55:30.836792] I [MSGID: 106493] [glusterd-handler.c:3842:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to dhcp42-77.lab.eng.blr.redhat.com (0), ret: 0
[2016-05-12 05:55:30.894602] I [MSGID: 106493] [glusterd-rpc-ops.c:696:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: f6f03f6a-dc88-4592-95b4-76482b1742d5
[2016-05-12 05:55:30.894652] I [MSGID: 106492] [glusterd-handler.c:2776:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: f6f03f6a-dc88-4592-95b4-76482b1742d5
[2016-05-12 05:55:30.894683] I [MSGID: 106502] [glusterd-handler.c:2821:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
The message "I [MSGID: 106488] [glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req" repeated 3 times between [2016-05-12 05:54:20.746141] and [2016-05-12 05:55:18.238141]

--- Additional comment from Atin Mukherjee on 2016-05-12 03:21:47 EDT ---

RCA: The volume configuration was changed on node-2 while glusterd was down on node-1. When glusterd restarts on node-1, it imports the updated volume from its peer. During the import the brickinfo objects are created afresh, and because the underlying file system of one brick has crashed, the realpath() call in glusterd_import_new_brick() fails. The import is aborted, leaving the volume configuration data inconsistent between the nodes.
--- Additional comment from Vijay Bellur on 2016-05-12 08:35:11 EDT ---

REVIEW: http://review.gluster.org/14306 (glusterd: copy real_path from older brickinfo during brick import) posted (#1) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Vijay Bellur on 2016-05-12 12:09:16 EDT ---

REVIEW: http://review.gluster.org/14306 (glusterd: copy real_path from older brickinfo during brick import) posted (#2) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Vijay Bellur on 2016-05-13 12:03:40 EDT ---

REVIEW: http://review.gluster.org/14306 (glusterd: copy real_path from older brickinfo during brick import) posted (#3) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Vijay Bellur on 2016-05-16 02:27:57 EDT ---

REVIEW: http://review.gluster.org/14306 (glusterd: copy real_path from older brickinfo during brick import) posted (#4) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Vijay Bellur on 2016-05-16 02:34:18 EDT ---

REVIEW: http://review.gluster.org/14306 (glusterd: copy real_path from older brickinfo during brick import) posted (#5) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Vijay Bellur on 2016-05-18 01:53:34 EDT ---

REVIEW: http://review.gluster.org/14306 (glusterd: copy real_path from older brickinfo during brick import) posted (#6) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Vijay Bellur on 2016-05-18 05:06:11 EDT ---

COMMIT: http://review.gluster.org/14306 committed in master by Kaushal M (kaushal)
------
commit 5a4f4a945661a8bb24735524e152ccd5b1ba571a
Author: Atin Mukherjee <amukherj>
Date:   Wed May 11 18:24:40 2016 +0530

    glusterd: copy real_path from older brickinfo during brick import

    In glusterd_import_new_brick(), new_brickinfo->real_path will not be
    populated the first time, and hence, if the underlying file system is
    bad for the same brick, the import will fail, resulting in
    inconsistent configuration data.

    The fix is to populate real_path from the old brickinfo object.

    There were also many cases where realpath() was called unnecessarily,
    which may cause failures. For example, if remove-brick is executed for
    a brick whose underlying file system has crashed, remove-brick fails
    since the realpath() call fails. We don't need to call realpath() here
    as the value is of no use. Hence passing construct_realpath as
    _gf_false in glusterd_volume_brickinfo_get_by_brick() is a must in
    such cases.

    Change-Id: I7ec93871dc9e616f5d565ad5e540b2f1cacaf9dc
    BUG: 1335531
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: http://review.gluster.org/14306
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Kaushal M <kaushal>
REVIEW: http://review.gluster.org/14410 (glusterd: copy real_path from older brickinfo during brick import) posted (#1) for review on release-3.7 by Atin Mukherjee (amukherj)
COMMIT: http://review.gluster.org/14410 committed in release-3.7 by Kaushal M (kaushal)
------
commit 93b72a135a7793b5b71c0eb6498c5fe529827ca6
Author: Atin Mukherjee <amukherj>
Date:   Wed May 11 18:24:40 2016 +0530

    glusterd: copy real_path from older brickinfo during brick import

    Backport of http://review.gluster.org/14306

    In glusterd_import_new_brick(), new_brickinfo->real_path will not be
    populated the first time, and hence, if the underlying file system is
    bad for the same brick, the import will fail, resulting in
    inconsistent configuration data.

    The fix is to populate real_path from the old brickinfo object.

    There were also many cases where realpath() was called unnecessarily,
    which may cause failures. For example, if remove-brick is executed for
    a brick whose underlying file system has crashed, remove-brick fails
    since the realpath() call fails. We don't need to call realpath() here
    as the value is of no use. Hence passing construct_realpath as
    _gf_false in glusterd_volume_brickinfo_get_by_brick() is a must in
    such cases.

    Change-Id: I7ec93871dc9e616f5d565ad5e540b2f1cacaf9dc
    BUG: 1337113
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: http://review.gluster.org/14306
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Kaushal M <kaushal>
    Reviewed-on: http://review.gluster.org/14410
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.7.12, please open a new bug report.

glusterfs-3.7.12 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-devel/2016-June/049918.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user