Bug 1020421 - [RHSC] Console Fails To Sync With RHS2.1 U1 Peer Probed Node
Summary: [RHSC] Console Fails To Sync With RHS2.1 U1 Peer Probed Node
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Kaushal
QA Contact: Matt Mahoney
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-10-17 15:54 UTC by Matt Mahoney
Modified: 2015-05-15 18:36 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-10-24 09:30:40 UTC
Embargoed:


Attachments
Sync With Peer Probe Failure (2.18 MB, application/x-gzip)
2013-10-17 15:54 UTC, Matt Mahoney

Description Matt Mahoney 2013-10-17 15:54:58 UTC
Created attachment 813434 [details]
Sync With Peer Probe Failure

Description of problem:
RHS 2.1 U1 test:

After adding a single node to an RHSC cluster, peer probing a second node from the node that was added to the RHSC cluster does not sync with the console. The console reports a "Peer Status" error and does not provide the Import/Detach links under Cluster > General Actions.

Note: This issue was not reproducible when running against RHS 2.1 (Big Bend) nodes.

Version-Release number of selected component (if applicable):
RHSC 2.1
RHS 2.1 U1

How reproducible:
Almost every time.

Steps to Reproduce:
1. From RHSC: Add an RHS node-1
2. From RHS node-1, peer probe RHS node-2 (see the CLI sketch after these steps).
3. From RHSC: Click the cluster name and check whether the Import/Detach links are present for the probed node. Also note the console messages.
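
For reference, a minimal sketch of what step 2, plus the status check the console's sync relies on, amounts to on node-1's CLI. The hostname "rhs-node-2" is a placeholder, and this only approximates the behaviour, it is not the console's actual code path:

#!/usr/bin/env python3
# Rough reproduction aid (assumption: run as root on RHS node-1;
# "rhs-node-2" stands in for the second node's hostname/IP).
import subprocess

# Step 2: peer probe the second node from node-1.
subprocess.check_call(["gluster", "peer", "probe", "rhs-node-2"])

# The console's sync ultimately depends on 'peer status' working on the node;
# running it right after the probe is the call that was seen to fail.
status = subprocess.run(["gluster", "peer", "status"],
                        capture_output=True, text=True)
print("peer status rc =", status.returncode)
print(status.stdout or status.stderr)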

Actual results:
RHSC does not sync with the probed RHS node.

Expected results:
RHSC should properly sync with the probed RHS node.

Additional info:

Comment 2 Dusmant 2013-10-23 05:49:56 UTC
Following are the details of the servers on which it was tried:
10.70.37.123 - RHSC (root/redhat)
10.70.37.71 & 10.70.37.79 - RHS nodes (root/redhat)

Aravinda debugged this, and the following is his finding:

I was looking at the log files and found the line below in engine.log, along with a gluster peer status failure in one of the vdsm logs.

2013-10-17 06:10:32,791 ERROR [org.ovirt.engine.core.bll.gluster.GlusterSyncJob] (DefaultQuartzScheduler_Worker-99) [3c4ec3f7] gluster peer status command failed on all servers of the cluster automation_cluster1.Cant refresh its data at this point.

Thread-227::DEBUG::2013-10-17 10:15:01,657::BindingXMLRPC::920::vds::(wrapper) return hostsList with {'status': {'message': 'Done', 'code': 0}, 'hosts': [{'status': 'CONNECTED', 'hostname': '10.70.37.71', 'uuid': 'b62e4043-4e3d-4f53-967c-6a5331709857'}, {'status': 'CONNECTED', 'hostname': '10.70.37.79', 'uuid': '486976b7-8af2-4bc4-9648-b8a1603ec077'}]}

This looks like an RHS U1 issue, because in the same vdsm log file (10.70.37.71) peer status failed at 2013-10-17 10:14:10,253 and succeeded at 2013-10-17 10:15:01,657.

Comment 4 Kaushal 2013-10-24 09:30:40 UTC
The problem here is the timing of the peer status command relative to the peer probe.

First some information on how the 'peer probe' command works.

When glusterd receives a peer probe request from the cli for an unknown peer, it creates an empty peer object for the peer, establishes a connection with the peer, and begins the probe process, which performs some exchanges between the peers during which the empty peer object is filled in. Once the probe process begins, the cli returns with success.

The probe process generally takes only a short time to finish; most of the time it completes before the cli returns, but sometimes it can take longer. So there can be cases where the peer object is still empty when the cli returns. If, in such a case, a 'peer status' is issued immediately after the cli returns, the command will error out. That is what happened here.
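
A minimal sketch of the race from the probing node's CLI (assumptions: run as root, "rhs-node-2" is a placeholder, and the 3-second sleep is only an arbitrary illustration of "waiting long enough"):

#!/usr/bin/env python3
# Demonstrates the window described above: 'peer probe' returns once the
# probe process has started, so an immediate 'peer status' can reach glusterd
# while the new peer object is still empty and error out.
import subprocess
import time

subprocess.check_call(["gluster", "peer", "probe", "rhs-node-2"])

# Immediately after the probe returns: may fail (non-zero exit code)
# if the probe exchanges have not completed yet.
early = subprocess.run(["gluster", "peer", "status"])
print("immediate peer status rc =", early.returncode)

time.sleep(3)  # give the probe process time to fill in the peer object

# By now the probe has normally completed, so this should succeed.
late = subprocess.run(["gluster", "peer", "status"])
print("delayed peer status rc =", late.returncode)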

A snippet of the glusterd log from 10.70.37.71 (the times differ because gluster logs in UTC while vdsm logs in EDT, i.e. UTC-4):
<snip>
[2013-10-17 14:14:10.193165] I [glusterd-handler.c:821:__glusterd_handle_cli_probe] 0-glusterd: Received CLI probe req 10.70.37.79 24007
[2013-10-17 14:14:10.200041] I [glusterd-handler.c:2905:glusterd_probe_begin] 0-glusterd: Unable to find peerinfo for host: 10.70.37.79 (24007)
[2013-10-17 14:14:10.456991] I [rpc-clnt.c:974:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2013-10-17 14:14:10.457113] I [socket.c:3485:socket_init] 0-management: SSL support is NOT enabled
[2013-10-17 14:14:10.457131] I [socket.c:3500:socket_init] 0-management: using system polling thread
[2013-10-17 14:14:10.460211] I [glusterd-handler.c:2886:glusterd_friend_add] 0-management: connect returned 0
[2013-10-17 14:14:10.460260] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2013-10-17 14:14:10.460855] W [dict.c:778:str_to_data] (-->/usr/lib64/glusterfs/3.4.0.35rhs/xlator/mgmt/glusterd.so(__glusterd_handle_cli_list_friends+0xa2) [0x7f26f01701d2] (-->/usr/lib64/glusterfs/3.4.0.35rhs/xlator/mgmt/glusterd.so(glusterd_list_friends+0x114) [0x7f26f016fd84] (-->/usr/lib64/libglusterfs.so.0(dict_set_str+0x1c) [0x3acf417acc]))) 0-dict: value is NULL
[2013-10-17 14:14:11.224336] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-10-17 14:14:11.270804] I [glusterd-rpc-ops.c:235:__glusterd_probe_cbk] 0-glusterd: Received probe resp from uuid: 486976b7-8af2-4bc4-9648-b8a1603ec077, host: 10.70.37.79
[2013-10-17 14:14:11.677060] I [glusterd-rpc-ops.c:307:__glusterd_probe_cbk] 0-glusterd: Received resp to probe req
[2013-10-17 14:14:11.677123] I [glusterd-rpc-ops.c:357:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 486976b7-8af2-4bc4-9648-b8a1603ec077, host: 10.70.37.79, port: 0
[2013-10-17 14:14:12.244230] I [glusterd-handshake.c:556:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 2
[2013-10-17 14:14:12.594504] I [glusterd-handler.c:2324:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: 486976b7-8af2-4bc4-9648-b8a1603ec077
[2013-10-17 14:14:12.594635] I [glusterd-handler.c:2376:__glusterd_handle_probe_query] 0-glusterd: Responded to 10.70.37.79, op_ret: 0, op_errno: 0, ret: 0
[2013-10-17 14:14:12.728421] I [glusterd-handler.c:2028:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 486976b7-8af2-4bc4-9648-b8a1603ec077
[2013-10-17 14:14:12.728596] I [glusterd-handler.c:3059:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.37.79 (0), ret: 0
[2013-10-17 14:14:13.101467] I [glusterd-sm.c:495:glusterd_ac_send_friend_update] 0-: Added uuid: 486976b7-8af2-4bc4-9648-b8a1603ec077, host: 10.70.37.79
[2013-10-17 14:14:13.451990] I [glusterd-handler.c:2190:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 486976b7-8af2-4bc4-9648-b8a1603ec077
[2013-10-17 14:14:13.452034] I [glusterd-handler.c:2235:__glusterd_handle_friend_update] 0-: Received uuid: b62e4043-4e3d-4f53-967c-6a5331709857, hostname:10.70.37.71
[2013-10-17 14:14:13.452075] I [glusterd-handler.c:2244:__glusterd_handle_friend_update] 0-: Received my uuid as Friend
<snip>

In the log we can see that the peer probe command was received at 14:14:10.193165 and the peer status was received at 14:14:10.460260. The gap between the commands is so small (~0.27 s) that the probe command hadn't even returned yet and was yet to begin the probe process. Since the probe process hadn't finished when the status request was received, glusterd still had an empty peer object. It therefore failed to add the peer information to the cli response (timestamp 14:14:10.460855) and the peer status command failed.
Further peer status commands were issued later, by which time the probe had completed, and hence they were successful, as observed by Aravinda.

The automation script needs to be fixed to make sure that the 'peer status' command is sent only after the 'peer probe' command returns. It would be even better to wait a short while (~2-3 s) before sending the status command.
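
A hedged sketch of the kind of wait-and-retry the automation could do (assumptions: the script drives the gluster CLI directly; the helper name, retry count, and delay below are made up for illustration):

#!/usr/bin/env python3
# Illustrative fix for the automation script: issue 'peer status' only after
# 'peer probe' has returned, and retry briefly if it still fails.
import subprocess
import time

def wait_for_peer_status(retries=5, delay=2.0):
    """Poll 'gluster peer status' until it succeeds or the retries run out."""
    for attempt in range(1, retries + 1):
        result = subprocess.run(["gluster", "peer", "status"],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        time.sleep(delay)  # the probe exchanges may still be in progress
    raise RuntimeError("peer status still failing after %d attempts" % retries)

# Probe first; query status only after the probe command has returned.
subprocess.check_call(["gluster", "peer", "probe", "rhs-node-2"])
print(wait_for_peer_status())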

Closing this as NOTABUG.

