Created attachment 813434 [details]
Sync With Peer Probe Failure

Description of problem:
RHS 2.1 U1 test: after adding a single node to an RHSC cluster, doing a peer probe of a second node from the node that was added to the RHSC cluster does not sync with the console. The console reports a "Peer Status" error and does not provide the Cluster > General Actions Import/Detach links.

Note: this issue was not reproducible when running against RHS 2.1 (Big Bend) nodes.

Version-Release number of selected component (if applicable):
RHSC 2.1
RHS 2.1 U1

How reproducible:
Almost every time.

Steps to Reproduce:
1. From RHSC: add RHS node-1 to a cluster.
2. From RHS node-1, peer probe RHS node-2.
3. From RHSC: click the cluster name and check whether the Import/Detach links are present for the probed node. Also note the console messages.

Actual results:
RHSC does not sync with the probed RHS node.

Expected results:
RHSC should properly sync with the probed RHS node.

Additional info:
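For reference, the CLI portion of steps 2-3 comes down to the following commands run on RHS node-1 (the host address is the one used in this setup; the console performs the equivalent of 'gluster peer status' through vdsm):

  # on RHS node-1 (10.70.37.71): probe the second node
  gluster peer probe 10.70.37.79

  # immediately afterwards, query the peer list -- this is the same
  # information the console's sync job requests from vdsm
  gluster peer status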
Following are the details of the servers on which this was tried:

10.70.37.123 - RHSC (root/redhat)
10.70.37.71 & 10.70.37.79 - RHS nodes (root/redhat)

Aravinda debugged this; his findings follow.

I was looking at the log files and found the line below in engine.log, along with a 'gluster peer status' failure in one of the vdsm logs.

2013-10-17 06:10:32,791 ERROR [org.ovirt.engine.core.bll.gluster.GlusterSyncJob] (DefaultQuartzScheduler_Worker-99) [3c4ec3f7] gluster peer status command failed on all servers of the cluster automation_cluster1.Cant refresh its data at this point.

Thread-227::DEBUG::2013-10-17 10:15:01,657::BindingXMLRPC::920::vds::(wrapper) return hostsList with {'status': {'message': 'Done', 'code': 0}, 'hosts': [{'status': 'CONNECTED', 'hostname': '10.70.37.71', 'uuid': 'b62e4043-4e3d-4f53-967c-6a5331709857'}, {'status': 'CONNECTED', 'hostname': '10.70.37.79', 'uuid': '486976b7-8af2-4bc4-9648-b8a1603ec077'}]}

This looks like an RHS 2.1 U1 issue, because in the same vdsm log (10.70.37.71) peer status failed at 2013-10-17 10:14:10,253 and succeeded at 2013-10-17 10:15:01,657.
The problem here is with the timing of the 'peer status' command relative to the 'peer probe'.

First, some background on how the 'peer probe' command works. When glusterd receives a peer probe request from the cli for an unknown peer, it creates an empty peer object for that peer, establishes a connection with it, and begins the probe process, which involves several exchanges between the peers during which the empty peer object is filled in. Once the probe process begins, the cli returns with success. The probe process generally finishes quickly, most of the time before the cli returns, but it can occasionally take longer, so there are cases where the peer object is still empty after the cli has returned. If, in such a case, a 'peer status' is issued immediately after the cli returns, the command will error out. That is what happened here.

A snippet of the glusterd log from 10.70.37.71 (the timestamps differ from the vdsm log because glusterd logs in UTC while vdsm logs in EDT, i.e. UTC-4):

<snip>
[2013-10-17 14:14:10.193165] I [glusterd-handler.c:821:__glusterd_handle_cli_probe] 0-glusterd: Received CLI probe req 10.70.37.79 24007
[2013-10-17 14:14:10.200041] I [glusterd-handler.c:2905:glusterd_probe_begin] 0-glusterd: Unable to find peerinfo for host: 10.70.37.79 (24007)
[2013-10-17 14:14:10.456991] I [rpc-clnt.c:974:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2013-10-17 14:14:10.457113] I [socket.c:3485:socket_init] 0-management: SSL support is NOT enabled
[2013-10-17 14:14:10.457131] I [socket.c:3500:socket_init] 0-management: using system polling thread
[2013-10-17 14:14:10.460211] I [glusterd-handler.c:2886:glusterd_friend_add] 0-management: connect returned 0
[2013-10-17 14:14:10.460260] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2013-10-17 14:14:10.460855] W [dict.c:778:str_to_data] (-->/usr/lib64/glusterfs/3.4.0.35rhs/xlator/mgmt/glusterd.so(__glusterd_handle_cli_list_friends+0xa2) [0x7f26f01701d2] (-->/usr/lib64/glusterfs/3.4.0.35rhs/xlator/mgmt/glusterd.so(glusterd_list_friends+0x114) [0x7f26f016fd84] (-->/usr/lib64/libglusterfs.so.0(dict_set_str+0x1c) [0x3acf417acc]))) 0-dict: value is NULL
[2013-10-17 14:14:11.224336] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-10-17 14:14:11.270804] I [glusterd-rpc-ops.c:235:__glusterd_probe_cbk] 0-glusterd: Received probe resp from uuid: 486976b7-8af2-4bc4-9648-b8a1603ec077, host: 10.70.37.79
[2013-10-17 14:14:11.677060] I [glusterd-rpc-ops.c:307:__glusterd_probe_cbk] 0-glusterd: Received resp to probe req
[2013-10-17 14:14:11.677123] I [glusterd-rpc-ops.c:357:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 486976b7-8af2-4bc4-9648-b8a1603ec077, host: 10.70.37.79, port: 0
[2013-10-17 14:14:12.244230] I [glusterd-handshake.c:556:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 2
[2013-10-17 14:14:12.594504] I [glusterd-handler.c:2324:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: 486976b7-8af2-4bc4-9648-b8a1603ec077
[2013-10-17 14:14:12.594635] I [glusterd-handler.c:2376:__glusterd_handle_probe_query] 0-glusterd: Responded to 10.70.37.79, op_ret: 0, op_errno: 0, ret: 0
[2013-10-17 14:14:12.728421] I [glusterd-handler.c:2028:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 486976b7-8af2-4bc4-9648-b8a1603ec077
[2013-10-17 14:14:12.728596] I [glusterd-handler.c:3059:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.37.79 (0), ret: 0
[2013-10-17 14:14:13.101467] I [glusterd-sm.c:495:glusterd_ac_send_friend_update] 0-: Added uuid: 486976b7-8af2-4bc4-9648-b8a1603ec077, host: 10.70.37.79
[2013-10-17 14:14:13.451990] I [glusterd-handler.c:2190:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 486976b7-8af2-4bc4-9648-b8a1603ec077
[2013-10-17 14:14:13.452034] I [glusterd-handler.c:2235:__glusterd_handle_friend_update] 0-: Received uuid: b62e4043-4e3d-4f53-967c-6a5331709857, hostname:10.70.37.71
[2013-10-17 14:14:13.452075] I [glusterd-handler.c:2244:__glusterd_handle_friend_update] 0-: Received my uuid as Friend
<snip>

In the log we can see that the peer probe command was received at 14:14:10.193165 and the peer status request at 14:14:10.460260. The gap between the two commands is so small (~0.27 s) that the probe command had not even returned yet and the probe handshake with the peer had not begun. Since the probe process had not finished when the status request was received, glusterd still had an empty peer object, so it failed to add the peer information to the cli response (the dict warning at 14:14:10.460855) and the peer status command failed. Further peer status commands were issued later, by which time the probe had completed, so they succeeded, as Aravinda observed.

The automation script needs to be fixed so that the 'peer status' command is sent only after the 'peer probe' command has returned. It would be even better to wait a short while (~2-3 s) before sending the status command; a rough sketch follows below.

Closing this as NOTABUG.
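For illustration, a minimal sketch of the kind of change the automation could make, assuming it drives the gluster CLI directly from a shell script and that a failed 'gluster peer status' exits non-zero (the host address, retry count and sleep interval here are illustrative, not taken from the actual script):

  # probe the new peer; the cli returns as soon as the probe process has started
  gluster peer probe 10.70.37.79

  # give glusterd a moment, then retry 'peer status' until it succeeds,
  # instead of issuing it immediately after the probe returns
  for attempt in 1 2 3 4 5; do
      sleep 3
      if gluster peer status > /dev/null 2>&1; then
          break
      fi
  done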