Bug 1335359

Summary: Adding of identical brick (with diff IP/hostname) from peer node is failing.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Byreddy <bsrirama>
Component: glusterd
Assignee: Atin Mukherjee <amukherj>
Status: CLOSED ERRATA
QA Contact: Byreddy <bsrirama>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.1
CC: asrivast, rhinduja, rhs-bugs, storage-qa-internal, vbellur
Target Milestone: ---
Keywords: Regression, ZStream
Target Release: RHGS 3.1.3
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.7.9-6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-06-23 05:22:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1335357
Bug Blocks: 1311817

Description Byreddy 2016-05-12 04:33:34 UTC
Description of problem:
=======================
Adding a brick from a peer node fails if a brick with the same path, already part of the volume on another peer node, is down because its underlying filesystem has crashed.


Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.9-4


How reproducible:
=================
Always.


Steps to Reproduce:
===================
1. Create a simple distributed volume on a one-node cluster (node-1).
2. Crash the filesystem underlying brick0 (e.g. node1_ip:/bricks/brick0).
3. Probe a new node, node-2, from node-1.
4. Try to add the identical brick path from node-2 (node2_ip:/bricks/brick0). The add-brick fails.
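
The CLI side of these steps can be sketched as a short Python helper that only builds the gluster commands in order. This is an illustration, not the script used for this report: the function name and its parameters are hypothetical, the `volume start` step is an assumption (the steps above do not mention it), and step 2 is left manual because the report does not say how the filesystem was crashed.

```python
def reproduction_commands(volume, node1, node2, old_brick, new_brick):
    """Return the gluster CLI commands for steps 1, 3 and 4, in order.

    Hypothetical helper: it builds argument lists and runs nothing.
    """
    return [
        # Step 1: single-brick distributed volume on node-1.
        ["gluster", "volume", "create", volume, f"{node1}:{old_brick}"],
        # Assumption: the volume is started so that the brick process runs.
        ["gluster", "volume", "start", volume],
        # Step 2 (crash the filesystem under old_brick) is done by hand.
        # Step 3: probe node-2 from node-1.
        ["gluster", "peer", "probe", node2],
        # Step 4: add the brick from node-2; reported to fail on 3.7.9-4.
        ["gluster", "volume", "add-brick", volume, f"{node2}:{new_brick}"],
    ]


if __name__ == "__main__":
    # Placeholder host names and paths; substitute real values.
    for cmd in reproduction_commands(
        "Dis", "node1_ip", "node2_ip", "/bricks/brick0/h0", "/bricks/brick0/j0"
    ):
        print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run` on node-1 once the placeholders are replaced.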


Actual results:
===============
Adding an identical brick path (with a different IP/hostname) from a peer node fails.


Expected results:
=================
Adding an identical brick path from a peer node should work.


Additional info:

Comment 2 Byreddy 2016-05-12 04:35:30 UTC
I will provide the logs

Comment 5 Byreddy 2016-05-12 06:07:20 UTC
glusterd log from node where add-brick failed.
============

[2016-05-12 06:01:49.703293] I [MSGID: 106499] [glusterd-handler.c:4330:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Dis
[2016-05-12 06:01:50.800424] W [socket.c:701:__socket_rwv] 0-management: readv on /var/run/gluster/c1eec530a1c811faf8e3d20e6c09c320.socket failed (Invalid argument)
[2016-05-12 06:02:46.167785] I [MSGID: 106482] [glusterd-brick-ops.c:443:__glusterd_handle_add_brick] 0-management: Received add brick req
[2016-05-12 06:02:46.170433] C [MSGID: 106425] [glusterd-utils.c:1125:glusterd_brickinfo_new_from_brick] 0-management: realpath () failed for brick /bricks/brick0/j0. The underlying filesystem may be in bad state [Input/output error]
[2016-05-12 06:02:46.170912] W [MSGID: 106050] [glusterd-store.c:176:glusterd_store_is_valid_brickpath] 0-management: Failed to create brick info for brick 10.70.43.151:/bricks/brick0/j0
[2016-05-12 06:02:46.170927] E [MSGID: 106257] [glusterd-brick-ops.c:1703:glusterd_op_stage_add_brick] 0-management: brick path 10.70.43.151:/bricks/brick0/j0 is too long
[2016-05-12 06:02:46.170940] W [MSGID: 106122] [glusterd-mgmt.c:188:gd_mgmt_v3_pre_validate_fn] 0-management: ADD-brick prevalidation failed.
[2016-05-12 06:02:46.170950] E [MSGID: 106122] [glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Add brick on local node
[2016-05-12 06:02:46.170958] E [MSGID: 106122] [glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre Validation Failed
The message "I [MSGID: 106005] [glusterd-handler.c:5034:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.42.77:/bricks/brick0/h0 has disconnected from glusterd." repeated 39 times between [2016-05-12 06:01:29.797544] and [2016-05-12 06:03:26.815524]
[2016-05-12 06:03:29.815965] I [MSGID: 106005] [glusterd-handler.c:5034:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.42.77:/bricks/brick0/h0 has disconnected from glusterd.
[2016-05-12 06:03:56.819667] W [socket.c:701:__socket_rwv] 0-management: readv on /var/run/gluster/c1eec530a1c811faf8e3d20e6c09c320.socket failed (Invalid argument)

Comment 6 Atin Mukherjee 2016-05-12 08:29:52 UTC
RCA:

While creating a new brickinfo object, we issue a realpath() call irrespective of whether the brick belongs to the local node. This is normally safe because we mask ENOENT. In this case, however, the path of the new brick matches that of the old one (only the hostname differs) and the underlying filesystem is in a bad state, so realpath() fails with an errno other than ENOENT (here "Input/output error", as the glusterd log shows), which causes add-brick to fail.
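
The failure mode in this RCA can be illustrated with a minimal sketch, written in Python rather than glusterd's C. All names here (`resolve_brick_path`, `strict_realpath`, `local_hosts`) are hypothetical. The sketch resolves the path only for bricks hosted on the local node and masks only ENOENT. Whether the actual patch takes exactly this approach is an assumption.

```python
import errno
import pathlib


def strict_realpath(path):
    """Resolve a path like realpath(3); raises OSError if resolution fails."""
    return str(pathlib.Path(path).resolve(strict=True))


def resolve_brick_path(brick_host, brick_path, local_hosts, resolver=strict_realpath):
    """Return the resolved brick path, or None when it is not resolved.

    Hypothetical helper for illustration; this is not glusterd code.
    """
    if brick_host not in local_hosts:
        # The brick lives on a peer, so the state of the local filesystem
        # at the same path is irrelevant and is not consulted.
        return None
    try:
        return resolver(brick_path)
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            # The path does not exist yet; this error is masked.
            return None
        # Any other errno (for example EIO from a crashed filesystem)
        # is a real failure for a local brick.
        raise
```

Without the host check, a resolver failing with EIO on the local /bricks/brick0 would abort the request even though the brick being added belongs to another node, which matches the behaviour reported here.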

Comment 7 Atin Mukherjee 2016-05-12 12:38:03 UTC
The fix for BZ 1335357 will take care of this issue too, hence moving the state to Post.

Comment 9 Atin Mukherjee 2016-05-20 11:14:33 UTC
Downstream patch : https://code.engineering.redhat.com/gerrit/#/c/74663/

Upstream patches:

mainline : http://review.gluster.org/#/c/14306 
release-3.7 : http://review.gluster.org/#/c/14410 
release-3.8 : http://review.gluster.org/#/c/14411

Comment 11 Byreddy 2016-05-23 15:54:38 UTC
Verified this bug using the build "glusterfs-3.7.9-6" and found that the fix works as expected.

Steps done: repeated the reproduction steps given in the description.

Moving to verified state.

Comment 13 errata-xmlrpc 2016-06-23 05:22:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240