Bug 1031613

Summary: [RHS-C] Adding 2nd RHS node failed with error "Failed to fetch gluster peer list from server Server1 on Cluster"
Product: Red Hat Gluster Storage
Reporter: Prasanth <pprakash>
Component: rhsc
Assignee: Bala.FA <barumuga>
Status: CLOSED ERRATA
QA Contact: Prasanth <pprakash>
Severity: urgent
Priority: high
Version: 2.1
CC: dpati, dtsang, knarra, mmahoney, pprakash, rhs-bugs, sdharane, ssampat
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 2.1.2
Hardware: Unspecified
OS: Unspecified
Fixed In Version: rhsc-2.1.2-0.25.master.el6_5
Doc Type: Bug Fix
Last Closed: 2014-02-25 08:04:17 UTC
Type: Bug
Bug Depends On: 972619
Attachments: screenshot of error (flags: none)

Description Prasanth 2013-11-18 11:48:56 UTC
Created attachment 825562 [details]
screenshot of error

Description of problem:

Adding the 2nd RHS node failed and the host went to Non-Operational.

Version-Release number of selected component (if applicable): cb8 build

---------------------
[root@vm10 ~]# rpm -qa |grep vdsm
vdsm-gluster-4.13.0-21.el6rhs.noarch
vdsm-cli-4.13.0-21.el6rhs.noarch
vdsm-python-4.13.0-21.el6rhs.x86_64
vdsm-python-cpopen-4.13.0-21.el6rhs.x86_64
vdsm-4.13.0-21.el6rhs.x86_64
vdsm-reg-4.13.0-21.el6rhs.noarch
vdsm-xmlrpc-4.13.0-21.el6rhs.noarch

[root@vm10 ~]# rpm -qa |grep glusterfs
glusterfs-server-3.4.0.43.1u2rhs-1.el6rhs.x86_64
glusterfs-libs-3.4.0.43.1u2rhs-1.el6rhs.x86_64
glusterfs-3.4.0.43.1u2rhs-1.el6rhs.x86_64
glusterfs-fuse-3.4.0.43.1u2rhs-1.el6rhs.x86_64
glusterfs-geo-replication-3.4.0.43.1u2rhs-1.el6rhs.x86_64
samba-glusterfs-3.6.9-160.7.el6rhs.x86_64
glusterfs-rdma-3.4.0.43.1u2rhs-1.el6rhs.x86_64
glusterfs-api-3.4.0.43.1u2rhs-1.el6rhs.x86_64
---------------------


How reproducible: Seen once in cb8 and a couple of times in older builds as well


Steps to Reproduce:

Note: both RHS nodes were freshly installed from http://download.lab.bos.redhat.com/composes/nightly/RHSS-2.1u2-20131117.n.0/2/RHS/x86_64/iso/

1. Install the latest cb8 build as per http://rhsm.pad.engineering.redhat.com/rhsc-build-cb8
2. Install 2 RHS nodes with http://download.lab.bos.redhat.com/composes/nightly/RHSS-2.1u2-20131117.n.0/2/RHS/x86_64/iso/
3. Create a new cluster and add 1 RHS node and wait till it's set to UP
4. Add a 2nd RHS node and watch the Events.

Actual results: the 2nd RHS node was set to "Non-Operational" and the following messages were seen in the Events:

------------
2013-Nov-18, 16:55 State was set to Up for host Server1.
2013-Nov-18, 16:55 Network virbr0 is not attached to any interface on host Server1.
2013-Nov-18, 16:55 Host Server1 was autorecovered.
2013-Nov-18, 16:55 Network virbr0 is not attached to any interface on host Server1.
2013-Nov-18, 16:51 Gluster command [gluster peer status] failed on server Server1.
2013-Nov-18, 16:51 Failed to fetch gluster peer list from server Server1 on Cluster Cluster-CB8.
2013-Nov-18, 16:51 State was set to Up for host Server2.
2013-Nov-18, 16:51 Network virbr0 is not attached to any interface on host Server2.
2013-Nov-18, 16:51 Network virbr0 is not attached to any interface on host Server2.
2013-Nov-18, 16:51 Host Server2 installed
2013-Nov-18, 16:51 Network changes were saved on host Server2
2013-Nov-18, 16:51 Network virbr0 is not attached to any interface on host Server2.
2013-Nov-18, 16:51 Network virbr0 is not attached to any interface on host Server2.
2013-Nov-18, 16:51 Installing Host Server2. Stage: Termination.
------------

Expected results: Adding hosts should always succeed without any issues.


Additional info: the log collector report will be attached soon.

Comment 1 Prasanth 2013-11-18 11:52:40 UTC
rhsc-log-collector reports can be downloaded from: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1031613/

Comment 2 Prasanth 2013-11-18 11:53:50 UTC
Ifconfig output before adding the host to cluster:

[root@vm10 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:1A:4A:46:26:2D  
          inet addr:10.70.36.82  Bcast:10.70.37.255  Mask:255.255.254.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:4910 errors:0 dropped:0 overruns:0 frame:0
          TX packets:95 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:307109 (299.9 KiB)  TX bytes:12543 (12.2 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:700 errors:0 dropped:0 overruns:0 frame:0
          TX packets:700 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:111428 (108.8 KiB)  TX bytes:111428 (108.8 KiB)

virbr0    Link encap:Ethernet  HWaddr 52:54:00:13:24:07  
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

Comment 3 Bala.FA 2013-11-20 10:59:02 UTC
I see the following traceback in the vdsm log on vm10.lab.eng.blr.redhat.com:

MainProcess|Thread-154::ERROR::2013-11-18 16:51:17,565::supervdsmServer::94::SuperVdsm.ServerCallback::(wrapper) Error in wrapper
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsmServer", line 92, in wrapper
    res = func(*args, **kwargs)
  File "/usr/share/vdsm/supervdsmServer", line 362, in wrapper
    return func(*args, **kwargs)
  File "/usr/share/vdsm/gluster/__init__.py", line 31, in wrapper
    return func(*args, **kwargs)
  File "/usr/share/vdsm/gluster/cli.py", line 876, in peerStatus
    raise ge.GlusterHostsListFailedException(rc=e.rc, err=e.err)
GlusterHostsListFailedException: Hosts list failed
error: (null)
return code: -22
MainProcess|Thread-155::DEBUG::2013-11-18 16:51:17,898::supervdsmServer::90::SuperVdsm.ServerCallback::(wrapper) call wrapper with (None,) {}


It looks like glusterfs is failing on 'gluster peer status'.
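For context, the failing path parses the XML output of `gluster peer status --xml` and raises when the command reports an error. A minimal sketch of that parsing, not vdsm's actual code; the XML schema shown is an approximation of the real CLI output, and `parse_peer_status` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Approximate shape of `gluster peer status --xml` output (assumed, not
# copied from the real CLI).
SAMPLE = """<cliOutput>
  <opRet>0</opRet>
  <peerStatus>
    <peer>
      <uuid>e05b29ce-8b8c-4b62-9d81-747eac6916da</uuid>
      <hostname>vm11.lab.eng.blr.redhat.com</hostname>
      <connected>1</connected>
    </peer>
  </peerStatus>
</cliOutput>"""

def parse_peer_status(xml_text):
    """Return a list of peer dicts, or raise if the CLI reported failure
    (mirroring the GlusterHostsListFailedException path in the traceback)."""
    root = ET.fromstring(xml_text)
    if root.findtext("opRet") != "0":
        raise RuntimeError("Hosts list failed")
    return [
        {
            "uuid": p.findtext("uuid"),
            "hostname": p.findtext("hostname"),
            "connected": p.findtext("connected") == "1",
        }
        for p in root.iter("peer")
    ]
```

With the `rc: -22` seen above, the real code takes the error branch and raises before any peers are returned.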

Comment 4 Bala.FA 2013-11-20 11:26:13 UTC
Below is the log from glusterfs. It looks like 'peer status' is called immediately after 'peer probe', before the probe has completed, which causes 'peer status' to return an empty list. Similar behaviour is also addressed in bz#972619.


[2013-11-18 11:21:17.141761] I [glusterd-handler.c:821:__glusterd_handle_cli_probe] 0-glusterd: Received CLI probe req vm11.lab.eng.blr.redhat.com 24007
[2013-11-18 11:21:17.156228] I [glusterd-handler.c:2905:glusterd_probe_begin] 0-glusterd: Unable to find peerinfo for host: vm11.lab.eng.blr.redhat.com (24007)
[2013-11-18 11:21:17.340987] I [rpc-clnt.c:976:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2013-11-18 11:21:17.341045] I [socket.c:3505:socket_init] 0-management: SSL support is NOT enabled
[2013-11-18 11:21:17.341055] I [socket.c:3520:socket_init] 0-management: using system polling thread
[2013-11-18 11:21:17.347291] I [glusterd-handler.c:2886:glusterd_friend_add] 0-management: connect returned 0
[2013-11-18 11:21:17.563649] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2013-11-18 11:21:17.563864] W [dict.c:778:str_to_data] (-->/usr/lib64/glusterfs/3.4.0.43.1u2rhs/xlator/mgmt/glusterd.so(__glusterd_handle_cli_list_friends+0xa2) [0x7f9adce28842] (-->/usr/lib64/glusterfs/3.4.0.43.1u2rhs/xlator/mgmt/glusterd.so(glusterd_list_friends+0x114) [0x7f9adce283f4] (-->/usr/lib64/libglusterfs.so.0(dict_set_str+0x1c) [0x7f9ae1652cfc]))) 0-dict: value is NULL
[2013-11-18 11:21:18.025020] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-11-18 11:21:18.860709] I [glusterd-rpc-ops.c:235:__glusterd_probe_cbk] 0-glusterd: Received probe resp from uuid: e05b29ce-8b8c-4b62-9d81-747eac6916da, host: vm11.lab.eng.blr.redhat.com
[2013-11-18 11:21:19.200455] I [glusterd-rpc-ops.c:307:__glusterd_probe_cbk] 0-glusterd: Received resp to probe req
[2013-11-18 11:21:19.200596] I [glusterd-rpc-ops.c:357:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: e05b29ce-8b8c-4b62-9d81-747eac6916da, host: vm11.lab.eng.blr.redhat.com, port: 0
[2013-11-18 11:21:19.728628] I [glusterd-handshake.c:556:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 2
[2013-11-18 11:21:20.136427] I [glusterd-handler.c:2324:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: e05b29ce-8b8c-4b62-9d81-747eac6916da
[2013-11-18 11:21:20.136523] I [glusterd-handler.c:2376:__glusterd_handle_probe_query] 0-glusterd: Responded to 10.70.36.83, op_ret: 0, op_errno: 0, ret: 0
[2013-11-18 11:21:20.428050] I [glusterd-handler.c:2028:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: e05b29ce-8b8c-4b62-9d81-747eac6916da
[2013-11-18 11:21:20.428118] I [glusterd-handler.c:3059:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to vm11.lab.eng.blr.redhat.com (0), ret: 0
[2013-11-18 11:21:20.818363] I [glusterd-sm.c:495:glusterd_ac_send_friend_update] 0-: Added uuid: e05b29ce-8b8c-4b62-9d81-747eac6916da, host: vm11.lab.eng.blr.redhat.com
[2013-11-18 11:21:21.194497] I [glusterd-handler.c:2190:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: e05b29ce-8b8c-4b62-9d81-747eac6916da
[2013-11-18 11:21:21.194544] I [glusterd-handler.c:2235:__glusterd_handle_friend_update] 0-: Received uuid: 1f65487d-8859-4e70-b518-682368aaf91c, hostname:10.70.36.82
[2013-11-18 11:21:21.194553] I [glusterd-handler.c:2244:__glusterd_handle_friend_update] 0-: Received my uuid as Friend
[2013-11-18 11:21:21.336185] I [glusterd-rpc-ops.c:554:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: e05b29ce-8b8c-4b62-9d81-747eac6916da
[2013-11-18 11:24:29.680426] I [glusterd-handler.c:1018:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
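The race described above can be mitigated by retrying the peer listing for a short window after a probe, instead of failing on the first empty result. This is only an illustrative sketch of that approach, not the actual fix shipped for bz#972619; the function and parameter names are hypothetical:

```python
import time

def fetch_peers_with_retry(fetch, retries=3, delay=2.0):
    """Call fetch() until it returns a non-empty peer list.

    Right after 'gluster peer probe', glusterd may briefly report an
    empty peer list, so transient empty results and errors are retried
    before the operation is declared failed.
    """
    last_err = None
    for _ in range(retries):
        try:
            peers = fetch()
            if peers:  # empty list right after probe => retry
                return peers
        except Exception as err:
            last_err = err
        time.sleep(delay)
    raise RuntimeError(
        "peer list still empty after %d attempts" % retries) from last_err
```

Usage would wrap whatever routine actually shells out to `gluster peer status`, e.g. `fetch_peers_with_retry(lambda: peer_status())`.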

Comment 5 Bala.FA 2013-11-25 08:55:42 UTC
Moving to ON_QA as per bz#972619

Comment 6 Prasanth 2013-11-27 12:24:01 UTC
Verified in cb9 (rhsc-2.1.2-0.25.master.el6_5.noarch)

I haven't seen this issue during my testing with the cb9 build, and I was able to add the 2nd node immediately after node 1 came up.

So marking it as verified. I will reopen the bug if it happens again.

Comment 8 Dusmant 2014-01-07 15:16:57 UTC
As this bug is similar to https://bugzilla.redhat.com/show_bug.cgi?id=972619, it is not necessary to add separate doc text to this bug.

Comment 10 errata-xmlrpc 2014-02-25 08:04:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html