Created attachment 902156 [details]
new-node sosreport

Description of problem:
While a rebalance was in progress, I tried to peer probe an RHS node; the peer probe was unsuccessful.

Version-Release number of selected component (if applicable):
glusterfs-3.6.0.12-1.el6rhs.x86_64

How reproducible:
Seen once, on this attempt.

Steps to Reproduce:
Start with a four-node cluster.
1. Create a volume of 6x2 type and start it.
2. Mount the volume over NFS.
3. Create some directories and files.
4. Once the data creation is finished, run add-brick and start a rebalance.
5. While the rebalance is running, try to probe a new RHS node.
6. Run gluster peer status.

Actual results:
Step 6 result on the node where the peer probe command was executed:

[root@nfs1 ~]# gluster peer status
Number of Peers: 4

Hostname: 10.70.37.215
Uuid: 77f03019-30a1-4e81-b8df-6613159c8890
State: Peer in Cluster (Connected)

Hostname: 10.70.37.44
Uuid: ad14a2bb-d39c-4bdf-93e8-32c7568c6d05
State: Peer in Cluster (Connected)

Hostname: 10.70.37.201
Uuid: 3aaa0a5e-91d9-46c9-bb46-a46947ddaca5
State: Peer in Cluster (Connected)

Hostname: rhsauto049.lab.eng.blr.redhat.com
Uuid: 821e3f6f-5438-41fb-8a5d-f060704d0e8a
State: Probe Sent to Peer (Connected)

Peer status from a node that was already part of the cluster:

[root@nfs2 ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.37.62
Uuid: cb4a3869-24e0-4817-be29-73621ff218cb
State: Peer in Cluster (Connected)

Hostname: 10.70.37.201
Uuid: 3aaa0a5e-91d9-46c9-bb46-a46947ddaca5
State: Peer in Cluster (Connected)

Hostname: 10.70.37.44
Uuid: ad14a2bb-d39c-4bdf-93e8-32c7568c6d05
State: Peer in Cluster (Connected)

gluster peer status from the new RHS node (no output):

[root@rhsauto049 ~]# gluster peer status
[root@rhsauto049 ~]#

glusterd logs on the new node:

[2014-06-04 10:58:27.655442] I [glusterd-handler.c:1314:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-06-04 10:58:27.655567] I [socket.c:3148:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
[2014-06-04 10:58:27.655596] E [rpcsvc.c:1247:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x1, Program: GlusterD svc cli, ProgVers: 2, Proc: 3) to rpc-transport (socket.management)
[2014-06-04 10:58:27.655621] E [glusterd-utils.c:410:glusterd_submit_reply] 0-: Reply submission failed
[2014-06-04 10:58:27.658359] I [socket.c:2239:socket_event_handler] 0-transport: disconnecting now
[2014-06-04 10:58:27.658409] I [glusterd-handler.c:1314:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-06-04 10:58:27.658501] I [socket.c:3148:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
[2014-06-04 10:58:27.658550] E [rpcsvc.c:1247:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x1, Program: GlusterD svc cli, ProgVers: 2, Proc: 3) to rpc-transport (socket.management)
[2014-06-04 10:58:27.658583] E [glusterd-utils.c:410:glusterd_submit_reply] 0-: Reply submission failed
[2014-06-04 10:58:27.659642] I [socket.c:2239:socket_event_handler] 0-transport: disconnecting now
[2014-06-04 10:58:27.659699] I [glusterd-handler.c:1314:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2014-06-04 10:58:27.659838] I [socket.c:3148:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
[2014-06-04 10:58:27.659868] E [rpcsvc.c:1247:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x1, Program: GlusterD svc cli, ProgVers: 2, Proc: 3) to rpc-transport (socket.management)
[2014-06-04 10:58:27.659892] E [glusterd-utils.c:410:glusterd_submit_reply] 0-: Reply submission failed

Expected results:
The peer probe is expected to succeed, and the peer status should be the same on all nodes, which is not the case at this time.

Additional info:
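For anyone retesting, the peer state mismatch reported above can be checked mechanically instead of by eye. The following is only a rough sketch: the helper names parse_peer_status and pending_peers are hypothetical, and the parsing assumes the Hostname/Uuid/State layout shown in the gluster peer status output pasted in this report (other glusterfs versions may print additional fields).

```python
# Sketch: parse "gluster peer status" text and list peers that have not
# reached the "Peer in Cluster" state. The format is assumed from the
# output pasted in this report; function names are illustrative only.

def parse_peer_status(text):
    """Return a list of dicts with Hostname, Uuid and State per peer."""
    peers = []
    current = None
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # shell prompts and blank lines carry no "key: value"
        key, value = key.strip(), value.strip()
        if key == "Hostname":
            current = {"Hostname": value}
            peers.append(current)
        elif key in ("Uuid", "State") and current is not None:
            current[key] = value
    return peers


def pending_peers(peers):
    """Peers whose state is anything other than 'Peer in Cluster ...'."""
    return [p for p in peers
            if not p.get("State", "").startswith("Peer in Cluster")]


# Output captured on nfs1 in this report.
NFS1_OUTPUT = """\
[root@nfs1 ~]# gluster peer status
Number of Peers: 4

Hostname: 10.70.37.215
Uuid: 77f03019-30a1-4e81-b8df-6613159c8890
State: Peer in Cluster (Connected)

Hostname: 10.70.37.44
Uuid: ad14a2bb-d39c-4bdf-93e8-32c7568c6d05
State: Peer in Cluster (Connected)

Hostname: 10.70.37.201
Uuid: 3aaa0a5e-91d9-46c9-bb46-a46947ddaca5
State: Peer in Cluster (Connected)

Hostname: rhsauto049.lab.eng.blr.redhat.com
Uuid: 821e3f6f-5438-41fb-8a5d-f060704d0e8a
State: Probe Sent to Peer (Connected)
"""

if __name__ == "__main__":
    for peer in pending_peers(parse_peer_status(NFS1_OUTPUT)):
        print(peer["Hostname"], "->", peer["State"])
```

Running the same check against the output from each node would show whether the new node is still stuck in an intermediate probe state anywhere in the cluster.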
Created attachment 902157 [details] sosreport from where peer probe was executed
As per the logs, it looks like there was a flaky network around 11:38, because of which the peer probe command bailed out:

[2014-06-04 10:38:26.033906] E [rpc-clnt.c:201:call_bail] 0-management: bailing out frame type(GLUSTERD-DUMP) op(DUMP(1)) xid = 0x1 sent = 2014-06-04 10:28:25.894138. timeout = 600 for 10.70.37.62:24007
[2014-06-04 10:38:26.034024] E [glusterd-handshake.c:1650:__glusterd_peer_dump_version_cbk] 0-: Error through RPC layer, retry again later

This doesn't look like an issue at the application layer. Could you retest and confirm the behaviour? Please close this bug and re-open it if the problem persists.
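To make the bail-out reasoning easier to verify, here is a small sketch that pulls the timestamps out of a call_bail log line and compares the elapsed time with the reported timeout. The function name parse_call_bail is hypothetical, and the regular expression is written only against the line format quoted in this comment; it is not derived from the glusterfs source.

```python
# Sketch: extract "logged", "sent", "timeout" and peer from a call_bail
# log line and compute how long the frame waited before being bailed out.
import re
from datetime import datetime

_TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"
_BAIL_RE = re.compile(
    r"^\[(?P<logged>" + _TS + r")\] E \[rpc-clnt\.c:\d+:call_bail\].*?"
    r"sent = (?P<sent>" + _TS + r")\. "
    r"timeout = (?P<timeout>\d+) for (?P<peer>\S+)"
)
_FMT = "%Y-%m-%d %H:%M:%S.%f"


def parse_call_bail(line):
    """Return a dict describing the bail-out, or None if the line differs."""
    match = _BAIL_RE.search(line.strip())
    if match is None:
        return None
    logged = datetime.strptime(match.group("logged"), _FMT)
    sent = datetime.strptime(match.group("sent"), _FMT)
    return {
        "peer": match.group("peer"),
        "timeout": int(match.group("timeout")),
        "elapsed": (logged - sent).total_seconds(),
    }


# The line quoted in this comment.
BAIL_LINE = (
    "[2014-06-04 10:38:26.033906] E [rpc-clnt.c:201:call_bail] 0-management: "
    "bailing out frame type(GLUSTERD-DUMP) op(DUMP(1)) xid = 0x1 "
    "sent = 2014-06-04 10:28:25.894138. timeout = 600 for 10.70.37.62:24007"
)

if __name__ == "__main__":
    info = parse_call_bail(BAIL_LINE)
    print(info["peer"], "waited", info["elapsed"], "s; timeout", info["timeout"], "s")
```

For the quoted line, the frame was sent at 10:28:25 and bailed out just over 600 seconds later, which matches the 600 second timeout: the DUMP request never received a reply within the RPC timeout.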
Per discussion with Atin, this works. Please reopen if you see this.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days