Description of problem:
=======================
Adding an identical brick from a peer node fails if a brick with the same path, already part of the volume on another peer node, is down because its underlying filesystem has crashed.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.9-4

How reproducible:
=================
Always.

Steps to Reproduce:
===================
1. Create a simple distributed volume using a one-node (node-1) cluster.
2. Crash the underlying filesystem of brick0 (e.g. node1_ip:/bricks/brick0).
3. Probe a new node (node-2) from node-1.
4. Try to add the identical brick on node-2 (node2_ip:/bricks/brick0); it fails.
(A command-line sketch of these steps is given under Additional info below.)

Actual results:
===============
Adding an identical brick (with a different IP/hostname) from a peer node fails.

Expected results:
=================
Adding an identical brick from a peer node should work.

Additional info:
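For reference, a minimal command-line sketch of the reproduction steps, assuming illustrative hostnames (node-1, node-2), an XFS-backed brick mounted at /bricks/brick0 with brick directory j0, and xfs_io as one possible way to simulate the underlying filesystem crash; the actual commands, paths, and crash method used in the original run may differ:

On node-1:
# gluster volume create Dis node-1:/bricks/brick0/j0
# gluster volume start Dis
# xfs_io -x -c shutdown /bricks/brick0            <-- simulates the brick filesystem crash
# gluster peer probe node-2
# gluster volume add-brick Dis node-2:/bricks/brick0/j0

On the affected build, the last add-brick fails during pre-validation (see the log below).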
I will provide the logs
glusterd log from node where add-brick failed:
==============================================
[2016-05-12 06:01:49.703293] I [MSGID: 106499] [glusterd-handler.c:4330:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Dis
[2016-05-12 06:01:50.800424] W [socket.c:701:__socket_rwv] 0-management: readv on /var/run/gluster/c1eec530a1c811faf8e3d20e6c09c320.socket failed (Invalid argument)
[2016-05-12 06:02:46.167785] I [MSGID: 106482] [glusterd-brick-ops.c:443:__glusterd_handle_add_brick] 0-management: Received add brick req
[2016-05-12 06:02:46.170433] C [MSGID: 106425] [glusterd-utils.c:1125:glusterd_brickinfo_new_from_brick] 0-management: realpath () failed for brick /bricks/brick0/j0. The underlying filesystem may be in bad state [Input/output error]
[2016-05-12 06:02:46.170912] W [MSGID: 106050] [glusterd-store.c:176:glusterd_store_is_valid_brickpath] 0-management: Failed to create brick info for brick 10.70.43.151:/bricks/brick0/j0
[2016-05-12 06:02:46.170927] E [MSGID: 106257] [glusterd-brick-ops.c:1703:glusterd_op_stage_add_brick] 0-management: brick path 10.70.43.151:/bricks/brick0/j0 is too long
[2016-05-12 06:02:46.170940] W [MSGID: 106122] [glusterd-mgmt.c:188:gd_mgmt_v3_pre_validate_fn] 0-management: ADD-brick prevalidation failed.
[2016-05-12 06:02:46.170950] E [MSGID: 106122] [glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Add brick on local node
[2016-05-12 06:02:46.170958] E [MSGID: 106122] [glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre Validation Failed
The message "I [MSGID: 106005] [glusterd-handler.c:5034:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.42.77:/bricks/brick0/h0 has disconnected from glusterd." repeated 39 times between [2016-05-12 06:01:29.797544] and [2016-05-12 06:03:26.815524]
[2016-05-12 06:03:29.815965] I [MSGID: 106005] [glusterd-handler.c:5034:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.42.77:/bricks/brick0/h0 has disconnected from glusterd.
[2016-05-12 06:03:56.819667] W [socket.c:701:__socket_rwv] 0-management: readv on /var/run/gluster/c1eec530a1c811faf8e3d20e6c09c320.socket failed (Invalid argument)
RCA: While creating a new brickinfo object we issue a realpath() call irrespective of whether the brick belongs to the local node. We are still safe here because ENOENT is masked. But in this case, since the path of the new brick matches that of the old one (only the hostname differs) and the underlying filesystem is bad, realpath() fails with an errno other than ENOENT and hence causes add-brick to fail.
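This matches what is seen from the shell on the node whose brick filesystem crashed (node-1 in the sketch above): resolving the brick path fails with EIO rather than ENOENT. The path is illustrative and the output approximate:

# realpath /bricks/brick0/j0
realpath: /bricks/brick0/j0: Input/output error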
The fix for BZ 1335357 will take care of this issue as well, hence moving the state to POST.
Downstream patch:
https://code.engineering.redhat.com/gerrit/#/c/74663/

Upstream patches:
mainline    : http://review.gluster.org/#/c/14306
release-3.7 : http://review.gluster.org/#/c/14410
release-3.8 : http://review.gluster.org/#/c/14411
Verified this bug using the build glusterfs-3.7.9-6 and found that the fix works as expected. Steps done: repeated the reproduction steps mentioned in the Description section; the add-brick now succeeds (see the sketch below). Moving to VERIFIED.
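For completeness, a sketch of the verification flow on the fixed build, reusing the illustrative hostnames and paths from the reproduction sketch in the description (exact CLI output not reproduced here):

# gluster volume add-brick Dis node-2:/bricks/brick0/j0     <-- now succeeds even with node-1's /bricks/brick0 down
# gluster volume info Dis                                   <-- the new brick from node-2 is listed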
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240