Bug 1104626

Summary: rebalance: peer probe fails to add a new peer when rebalance is in progress
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Saurabh <saujain>
Component: glusterfs
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED WORKSFORME
QA Contact: Matt Zywusko <mzywusko>
Severity: medium
Docs Contact:
Priority: unspecified
Version: rhgs-3.0
CC: amukherj, kparthas, mzywusko, nlevinki, nsathyan, saujain, vagarwal, vbellur
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-04-20 05:43:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                                    Flags
new-node sosreport                             none
sosreport from where peer probe was executed   none

Description Saurabh 2014-06-04 11:28:23 UTC
Created attachment 902156 [details]
new-node sosreport

Description of problem:
While a rebalance was in progress, a peer probe of a new RHS node was attempted; the probe was unsuccessful.


Version-Release number of selected component (if applicable):
glusterfs-3.6.0.12-1.el6rhs.x86_64

How reproducible:
Observed once so far; not reliably reproducible.

Steps to Reproduce:
Take a four-node cluster, then:
1. Create a volume of 6x2 type and start it.
2. Mount the volume over NFS.
3. Create some directories and files.
4. Once the data creation is finished, add bricks and start rebalance.
5. While rebalance is going on, try to probe a new RHS node.
6. Run gluster peer status (see the command sketch after this list).
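
A minimal command sketch of these steps, assuming hypothetical names (hosts server1..server4 and newnode, volume testvol, brick paths under /bricks, client mount point /mnt/testvol); none of these names come from this report:

# on server1: create and start a 6x2 (distribute-replicate) volume
gluster volume create testvol replica 2 \
  server1:/bricks/b1 server2:/bricks/b1 server3:/bricks/b1 server4:/bricks/b1 \
  server1:/bricks/b2 server2:/bricks/b2 server3:/bricks/b2 server4:/bricks/b2 \
  server1:/bricks/b3 server2:/bricks/b3 server3:/bricks/b3 server4:/bricks/b3
gluster volume start testvol

# on a client: mount over NFS (gluster NFS is v3) and create some data
mkdir -p /mnt/testvol
mount -t nfs -o vers=3 server1:/testvol /mnt/testvol
mkdir /mnt/testvol/dir{1..10}
for i in $(seq 1 100); do dd if=/dev/zero of=/mnt/testvol/dir1/f$i bs=1M count=10; done

# on server1: expand the volume and start rebalance
gluster volume add-brick testvol server1:/bricks/b4 server2:/bricks/b4
gluster volume rebalance testvol start

# while rebalance is running: probe the new node, then check status on all nodes
gluster peer probe newnode
gluster peer status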

Actual results:
Output of step 6 on the node where the peer probe command was executed:
[root@nfs1 ~]# gluster peer status
Number of Peers: 4

Hostname: 10.70.37.215
Uuid: 77f03019-30a1-4e81-b8df-6613159c8890
State: Peer in Cluster (Connected)

Hostname: 10.70.37.44
Uuid: ad14a2bb-d39c-4bdf-93e8-32c7568c6d05
State: Peer in Cluster (Connected)

Hostname: 10.70.37.201
Uuid: 3aaa0a5e-91d9-46c9-bb46-a46947ddaca5
State: Peer in Cluster (Connected)

Hostname: rhsauto049.lab.eng.blr.redhat.com
Uuid: 821e3f6f-5438-41fb-8a5d-f060704d0e8a
State: Probe Sent to Peer (Connected)


Peer status from a node already existing in the cluster:
[root@nfs2 ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.37.62
Uuid: cb4a3869-24e0-4817-be29-73621ff218cb
State: Peer in Cluster (Connected)

Hostname: 10.70.37.201
Uuid: 3aaa0a5e-91d9-46c9-bb46-a46947ddaca5
State: Peer in Cluster (Connected)

Hostname: 10.70.37.44
Uuid: ad14a2bb-d39c-4bdf-93e8-32c7568c6d05
State: Peer in Cluster (Connected)


gluster peer status from the new RHS node (returns nothing):
[root@rhsauto049 ~]# gluster peer status
[root@rhsauto049 ~]# 

On the new node, the glusterd logs show:
[2014-06-04 10:58:27.655442] I [glusterd-handler.c:1314:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-06-04 10:58:27.655567] I [socket.c:3148:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
[2014-06-04 10:58:27.655596] E [rpcsvc.c:1247:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x1, Program: GlusterD svc cli, ProgVers: 2, Proc: 3) to rpc-transport (socket.management)
[2014-06-04 10:58:27.655621] E [glusterd-utils.c:410:glusterd_submit_reply] 0-: Reply submission failed
[2014-06-04 10:58:27.658359] I [socket.c:2239:socket_event_handler] 0-transport: disconnecting now
[2014-06-04 10:58:27.658409] I [glusterd-handler.c:1314:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-06-04 10:58:27.658501] I [socket.c:3148:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
[2014-06-04 10:58:27.658550] E [rpcsvc.c:1247:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x1, Program: GlusterD svc cli, ProgVers: 2, Proc: 3) to rpc-transport (socket.management)
[2014-06-04 10:58:27.658583] E [glusterd-utils.c:410:glusterd_submit_reply] 0-: Reply submission failed
[2014-06-04 10:58:27.659642] I [socket.c:2239:socket_event_handler] 0-transport: disconnecting now
[2014-06-04 10:58:27.659699] I [glusterd-handler.c:1314:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-06-04 10:58:27.659838] I [socket.c:3148:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
[2014-06-04 10:58:27.659868] E [rpcsvc.c:1247:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x1, Program: GlusterD svc cli, ProgVers: 2, Proc: 3) to rpc-transport (socket.management)
[2014-06-04 10:58:27.659892] E [glusterd-utils.c:410:glusterd_submit_reply] 0-: Reply submission failed
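
The repeating "not connected" / "Reply submission failed" entries show glusterd could not submit its reply before the management connection dropped, consistent with the flaky-network theory in comment 5 below. A hedged first-pass connectivity check (standard tools; 24007 is glusterd's management port, and the IP is taken from the bail-out log in comment 5):

# on the new node: confirm glusterd is up and listening on the management port
service glusterd status
netstat -ntlp | grep 24007

# from the new node: check reachability of the peer for which the DUMP request bailed out
ping -c 3 10.70.37.62
telnet 10.70.37.62 24007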

Expected results:

The peer probe is expected to succeed, and the peer status should be the same on all nodes, which is not the case at this time.

Additional info:

Comment 1 Saurabh 2014-06-04 11:31:15 UTC
Created attachment 902157 [details]
sosreport from where peer probe was executed

Comment 5 Atin Mukherjee 2015-04-06 09:32:52 UTC
As per the logs, it looks like there was a flaky network around 10:38 UTC, because of which the peer probe command was bailed out:

[2014-06-04 10:38:26.033906] E [rpc-clnt.c:201:call_bail] 0-management: bailing out frame type(GLUSTERD-DUMP) op(DUMP(1)) xid = 0x1 sent = 2014-06-04 10:28:25.894138. timeout = 600 for 10.70.37.62:24007
[2014-06-04 10:38:26.034024] E [glusterd-handshake.c:1650:__glusterd_peer_dump_version_cbk] 0-: Error through RPC layer, retry again later

This doesn't look like an issue at the application layer. Could you retest and confirm the behaviour? I request that this bug be closed; kindly re-open it if the problem persists.
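
If a probe is ever left stuck in the "Probe Sent to Peer" state by such a bail-out, a hedged recovery sketch (the hostname is the new node from this report; whether the forced detach is actually needed depends on how far the handshake got before timing out):

# on the probing node: drop the half-completed peer entry, then retry the probe
gluster peer detach rhsauto049.lab.eng.blr.redhat.com force
gluster peer probe rhsauto049.lab.eng.blr.redhat.com
gluster peer status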

Comment 7 Vivek Agarwal 2015-04-20 05:43:01 UTC
Per discussion with Atin, this works. Please reopen if you see this issue again.

Comment 8 Red Hat Bugzilla 2023-09-14 02:09:17 UTC
The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 1000 days.