Bug 840810

Summary: Peer probe of a new instance is unsuccessful after migration from old instances (GVSA) to new instances (RHS)
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Scott Haines <shaines>
Component: glusterd
Assignee: Kaushal <kaushal>
Status: CLOSED ERRATA
QA Contact: Rahul Hinduja <rhinduja>
Severity: medium
Priority: high
Version: 2.0
CC: enakai, gluster-bugs, kaushal, rhinduja, sdharane
Target Release: RHGS 2.1.0
Hardware: x86_64
OS: Linux
Fixed In Version: glusterfs-3.4.0qa5-1
Doc Type: Bug Fix
Clone Of: 839397
Last Closed: 2013-09-23 22:38:53 UTC
Type: Bug
Bug Depends On: 834229, 839397, 1000986

Description Scott Haines 2012-07-17 09:38:52 UTC
+++ This bug was initially created as a clone of Bug #839397 +++

+++ This bug was initially created as a clone of Bug #834229 +++

Created attachment 593397 [details]
glusterd logs

Description of problem:

Version-Release number of selected component (if applicable):

The behavior of probing a new machine after the migration is strange. The sequence is as follows:

1. Migration from GVSA instances to RHS instances completes successfully.
2. A peer probe is attempted from a migrated RHS instance to a new RHS instance.
3. The probe reports neither success nor failure; it simply returns to the prompt after some time.
4. Peer status on the old RHS machine shows the new peer in state "Establishing Connection (Connected)" with UUID "d30bee2c-4fd1-4662-a01b-ae2ac3fb1831".
5. Peer status on the new RHS machine, however, reports an all-zero UUID "00000000-0000-0000-0000-000000000000" and state "Connected to Peer (Connected)".
6. After restarting glusterd on the old RHS machine, the peer's UUID changes from the earlier "d30bee2c-4fd1-4662-a01b-ae2ac3fb1831" to all zeros "00000000-0000-0000-0000-000000000000".


How reproducible:


Steps to Reproduce:
1. Migrate from GVSA to RHS
2. Peer probe a new machine from a migrated instance

  
Actual results:

Console Output of peer status:
=============================
Old RHS Machine
===============
[root@ip-10-138-30-187 ~]# gluster peer status
Number of Peers: 4

Hostname: ec2-54-251-62-150.ap-southeast-1.compute.amazonaws.com
Uuid: 9bb0d8c4-538a-f4e8-db66-36a53b213da9
State: Peer in Cluster (Connected)

Hostname: ec2-54-251-62-152.ap-southeast-1.compute.amazonaws.com
Uuid: 2fad5c20-3d37-5d3a-cffc-b4355cc83ff1
State: Peer in Cluster (Connected)

Hostname: ec2-54-251-60-39.ap-southeast-1.compute.amazonaws.com
Uuid: 39effae6-1457-dfcd-8bd2-c7cf3d940dde
State: Peer in Cluster (Connected)

Hostname: ec2-46-137-231-143.ap-southeast-1.compute.amazonaws.com
Uuid: d30bee2c-4fd1-4662-a01b-ae2ac3fb1831
State: Establishing Connection (Connected)
[root@ip-10-138-30-187 ~]# 

New RHS Machine
===============

[root@ip-10-138-109-140 ~]# gluster peer status
Number of Peers: 1

Hostname: 10.138.30.187
Uuid: 00000000-0000-0000-0000-000000000000
State: Connected to Peer (Connected)
[root@ip-10-138-109-140 ~]# 


Snippet from "etc-glusterfs-glusterd.vol.log"
=============================================

[2012-06-21 08:32:04.744455] I [glusterd-handler.c:423:glusterd_friend_find] 0-glusterd: Unable to find hostname: ec2-46-137-231-143.ap-southeast-1.compute.amazonaws.com
[2012-06-21 08:32:04.744506] I [glusterd-handler.c:2222:glusterd_probe_begin] 0-glusterd: Unable to find peerinfo for host: ec2-46-137-231-143.ap-southeast-1.compute.amazonaws.com (24007)
[2012-06-21 08:32:04.748926] I [glusterd-handler.c:2204:glusterd_friend_add] 0-management: connect returned 0
[2012-06-21 08:32:04.749745] I [glusterd-handshake.c:397:glusterd_set_clnt_mgmt_program] 0-: Using Program glusterd mgmt, Num (1238433), Version (2)
[2012-06-21 08:32:04.749774] I [glusterd-handshake.c:403:glusterd_set_clnt_mgmt_program] 0-: Using Program Peer mgmt, Num (1238437), Version (2)
[2012-06-21 08:32:04.756568] I [glusterd-rpc-ops.c:218:glusterd3_1_probe_cbk] 0-glusterd: Received probe resp from uuid: d30bee2c-4fd1-4662-a01b-ae2ac3fb1831, host: ec2-46-137-231-143.ap-southeast-1.compute.amazonaws.com
[2012-06-21 08:32:04.756608] I [glusterd-handler.c:411:glusterd_friend_find] 0-glusterd: Unable to find peer by uuid
[2012-06-21 08:32:04.756667] E [glusterd-sm.c:1022:glusterd_friend_sm] 0-glusterd: handler returned: -1


Additional info:

--- Additional comment from kaushal on 2012-06-21 08:15:18 EDT ---

Rahul,
Can you provide the full logs from both servers? I'd like to look into this in more detail.

--- Additional comment from rhinduja on 2012-06-22 02:20:11 EDT ---

Hi Kaushal,

As discussed, please find the log snippet:

[2012-06-22 05:57:04.746347] I [glusterd-rpc-ops.c:218:glusterd3_1_probe_cbk] 0-glusterd: Received probe resp from uuid: d30bee2c-4fd1-4662-a01b-ae2ac3fb1831, host: ec2-46-137-231-143.ap-southeast-1.compute.amazonaws.com
[2012-06-22 05:57:04.746398] D [glusterd-utils.c:4063:glusterd_friend_find_by_uuid] 0-glusterd: Friend with uuid: d30bee2c-4fd1-4662-a01b-ae2ac3fb1831, not found
[2012-06-22 05:57:04.746448] I [glusterd-handler.c:411:glusterd_friend_find] 0-glusterd: Unable to find peer by uuid
[2012-06-22 05:57:04.746466] D [glusterd-utils.c:4100:glusterd_friend_find_by_hostname] 0-management: Friend ec2-46-137-231-143.ap-southeast-1.compute.amazonaws.com found.. state: 0
[2012-06-22 05:57:04.746482] D [glusterd-sm.c:949:glusterd_friend_sm_inject_event] 0-glusterd: Enqueue event: 'GD_FRIEND_EVENT_INIT_FRIEND_REQ'
[2012-06-22 05:57:04.746495] D [glusterd-sm.c:1004:glusterd_friend_sm] 0-: Dequeued event of type: 'GD_FRIEND_EVENT_INIT_FRIEND_REQ'
[2012-06-22 05:57:04.746564] D [glusterd-utils.c:1823:glusterd_add_volume_to_dict] 0-: Returning with -1
[2012-06-22 05:57:04.746586] D [glusterd-utils.c:1858:glusterd_build_volume_dict] 0-: Returning with -1
[2012-06-22 05:57:04.746602] D [glusterd-rpc-ops.c:1513:glusterd3_1_friend_add] 0-glusterd: Returning -1
[2012-06-22 05:57:04.746615] D [glusterd-sm.c:302:glusterd_ac_friend_add] 0-: Returning with -1
[2012-06-22 05:57:04.746627] E [glusterd-sm.c:1022:glusterd_friend_sm] 0-glusterd: handler returned: -1
[2012-06-22 05:57:04.746656] I [glusterd-rpc-ops.c:286:glusterd3_1_probe_cbk] 0-glusterd: Received resp to probe req
[2012-06-22 05:59:04.729093] D [socket.c:184:__socket_rwv] 0-socket.management: EOF from peer 127.0.0.1:1022
[2012-06-22 05:59:04.729138] D [socket.c:1798:socket_event_handler] 0-transport: disconnecting now
[2012-06-22 06:00:17.190236] I [glusterd-handler.c:813:glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2012-06-22 06:00:17.191502] D [socket.c:184:__socket_rwv] 0-socket.management: EOF from peer 127.0.0.1:1022
[2012-06-22 06:00:17.191530] D [socket.c:1798:socket_event_handler] 0-transport: disconnecting now
[2012-06-22 06:01:24.167230] W [glusterfsd.c:831:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3c33ae5ccd] (-->/lib64/libpthread.so.0() [0x3c342077f1] (-->glusterd(glusterfs_sigwaiter+0xdd) [0x405cfd]))) 0-: received signum (15), shutting down

--- Additional comment from kaushal on 2012-06-22 03:12:31 EDT ---

Okay, tracked this down to the changes made for username/password authentication in 3.3. This problem should also occur for other pre-3.3 to 3.3 migrations done using similar steps, which to my knowledge are as follows (a rough command equivalent is sketched after the list):
1) Bring down 3.2 cluster
2) Copy config dir (/etc/glusterd/*) of each peer to a safe place
3) Install 3.3
4) Copy back the saved config to /var/lib/glusterd
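
For illustration, a minimal command sketch of the steps above, run on each peer. The config paths come from the steps; the backup location and the package/service names are assumptions:

# 1) Bring down the 3.2 cluster
service glusterd stop
# 2) Copy the config dir of each peer to a safe place
cp -a /etc/glusterd /root/glusterd-config-backup
# 3) Install 3.3
yum install -y glusterfs-server
# 4) Copy back the saved config to the new location and start glusterd
cp -a /root/glusterd-config-backup/. /var/lib/glusterd/
service glusterd start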

The reason peer probe fails is that glusterd fails to build the volume dictionary, which needs to be sent to the new peer, when volinfo.auth.{username,password} are missing. An easy fix would be to not treat the missing values as a failure and just continue building the rest of the dictionary, but this wouldn't be correct.

The main problem is that, when the volinfos are created at glusterd startup, volinfo.auth.{username,password} are only filled in if they are present in the info file. Since these two values do not exist in pre-3.3 versions of gluster, they are absent from the info files, which leaves the fields empty for volumes migrated from pre-3.3 to 3.3.
The solution is to generate these values when they are not found, similar to the backward-compatibility measures used for other volinfo changes.
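
A minimal sketch of that approach, with illustrative (hypothetical) struct and helper names; the actual fix is the one posted at http://review.gluster.com/3619:

/* Hypothetical sketch: when a volume restored from a pre-3.3 info file
 * carries no auth credentials, generate them at load time instead of
 * leaving the fields empty. Names below are illustrative, not the
 * actual glusterd code. */
#include <stdio.h>
#include <uuid/uuid.h>          /* libuuid: uuid_generate(), uuid_unparse() */

struct volinfo_auth {
        char username[64];
        char password[64];
};

static void
generate_uuid_string (char *buf, size_t len)
{
        uuid_t raw;
        char   str[37];         /* 36 characters + NUL, per uuid_unparse() */

        uuid_generate (raw);
        uuid_unparse (raw, str);
        snprintf (buf, len, "%s", str);
}

/* Called for each volume while glusterd restores volinfos at startup. */
static void
volinfo_fill_auth_if_missing (struct volinfo_auth *auth)
{
        if (auth->username[0] == '\0')
                generate_uuid_string (auth->username, sizeof (auth->username));
        if (auth->password[0] == '\0')
                generate_uuid_string (auth->password, sizeof (auth->password));
}

Generating the credentials at load time, rather than skipping them while building the dictionary, means every volinfo sent during a probe carries auth values, exactly as 3.3 peers expect.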

--- Additional comment from kaushal on 2012-07-04 02:25:15 EDT ---

Review http://review.gluster.com/3619 (glusterd: Fix peer probe when username/password is missing) fixes this issue on master.

Comment 1 krishnan parthasarathi 2012-08-08 07:01:11 UTC
Assigning the bug to Kaushal as he has fixed this on master.

Comment 2 Kaushal 2012-10-15 10:34:53 UTC
Fix already accepted upstream in the master branch.
The commit-id is b583363 (glusterd: Fix peer probe when username/password is missing) reviewed at http://review.gluster.com/3619

Comment 5 Scott Haines 2013-09-23 22:38:53 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html
