Bug 1043032

Summary: [RHSC] Server removed from DB after being in Up state
Product: Red Hat Gluster Storage
Reporter: Matt Mahoney <mmahoney>
Component: rhsc
Assignee: Kanagaraj <kmayilsa>
Status: CLOSED ERRATA
QA Contact: Matt Mahoney <mmahoney>
Severity: high
Priority: high
Docs Contact:
Version: 2.1
CC: dpati, dtsang, grajaiya, kaushal, kmayilsa, knarra, mmahoney, pprakash, rhs-bugs, sankarshan, sharne, ssampat, vbellur
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 2.1.2
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: CB14
Doc Type: Bug Fix
Doc Text:
Previously, newly added servers were removed from the database by the sync job because the new server was not found in the cluster. This happened because 'gluster peer status' was invoked on a different server. With this update, servers added to an existing cluster are no longer removed by the sync job.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-02-25 08:09:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Matt Mahoney 2013-12-13 19:38:33 UTC
Description of problem:

Server was removed from the DB by the engine after it was already in the Up state.

2013-Dec-13, 03:07 Detected server 192.168.122.56 removed from Cluster automation_cluster1, and removed it from engine DB.	
2013-Dec-13, 03:07 State was set to Up for host 192.168.122.56.

Version-Release number of selected component (if applicable):
cb11

How reproducible:


Steps to Reproduce:
1. Add two servers (no issues were encountered).
2. Add a 3rd server (it reached the Up state, and then was removed).
3. Add a 4th server (just after the 3rd server is Up).

Actual results:
Server was removed from cluster/DB

Expected results:
No servers should be removed from the cluster/DB.

Additional info:

Comment 4 Dusmant 2013-12-16 09:10:14 UTC
2. Add a 3rd server (it reached the Up state, and then was removed).
[Dusmant]   Was it automatically removed, or did some user action trigger this? It's not clear from this step.
3. Add a 4th server (just after the 3rd server is Up).

Actual results:
Server was removed from cluster/DB
[Dusmant] Which server? The 3rd one? A hostname would make it easier to refer to the server and correlate it with the logs.

Expected results:
No servers should be removed from the cluster/DB.

Comment 5 Sahina Bose 2013-12-16 09:39:51 UTC
From the logs:

1. The peer probe command for 192.168.122.56 was sent to 192.168.122.211 and returned successfully:

2013-12-13 03:07:29,495 INFO  [org.ovirt.engine.core.vdsbroker.gluster.AddGlusterServerVDSCommand] (DefaultQuartzScheduler_Worker-74) [12a4733e] START, AddGlusterServerVDSCommand(HostName = 192.168.122.211, HostId = c34cda3b-fe7f-4152-a79f-15f45b806395), log id: 59c7d5e0
2013-12-13 03:07:31,492 INFO  [org.ovirt.engine.core.vdsbroker.gluster.AddGlusterServerVDSCommand] (DefaultQuartzScheduler_Worker-74) [12a4733e] FINISH, AddGlusterServerVDSCommand, log id: 59c7d5e0

2. The gluster peer list command was sent to 192.168.122.124, which returned [192.168.122.124:CONNECTED, 192.168.122.211:CONNECTED]:


2013-12-13 03:07:32,958 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler_Worker-62) [509af0bc] START, GlusterServersListVDSCommand(HostName = 192.168.122.124, HostId = f7594008-5540-4d66-8c2b-ed306ad387ff), log id: 4d46a229
2013-12-13 03:07:33,267 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler_Worker-62) [509af0bc] FINISH, GlusterServersListVDSCommand, return: [192.168.122.124:CONNECTED, 192.168.122.211:CONNECTED], log id: 4d46a229

3. Since 192.168.122.56 was not returned in the list, it was marked for removal.

The problem seems to be that after gluster peer probe executes, the newly probed peer does not yet appear in gluster peer list on other nodes. (Possibly related to the explanation of how gluster peer probe works in https://bugzilla.redhat.com/show_bug.cgi?id=1020421)

Vijay, Kaushal - any pointers on how to fix this?

Comment 6 Kaushal 2013-12-16 10:38:18 UTC
Most likely, the peer probe and peer list commands were issued on two different nodes. A peer which was probed will appear in the peer list of the node which issued the command immediately. The information regarding the new peer is propagated through the cluster after this. So there is a slight delay before other peers start listing the new peer. If a peer list was done on these peers during this delay, the new peer would be missing.
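Kaushal's explanation can be sketched as a small simulation (the `Node` class and `propagate` helper below are illustrative models, not gluster or engine code): the probed peer appears immediately in the view of the node that issued the probe, but only shows up on other nodes after propagation.

```python
class Node:
    """Toy model of a gluster node's local view of cluster membership."""
    def __init__(self, name):
        self.name = name
        self.peers = {name}  # a node always knows about itself

    def probe(self, other):
        # 'gluster peer probe' immediately updates the two nodes
        # directly involved; other peers learn of it asynchronously.
        self.peers.add(other.name)
        other.peers.add(self.name)

    def peer_list(self):
        # 'gluster peer status' reflects only this node's current view
        return sorted(self.peers)


def propagate(nodes):
    # Cluster-wide membership sync that happens some time after the probe
    union = set()
    for n in nodes:
        union |= n.peers
    for n in nodes:
        n.peers = set(union)


a = Node("192.168.122.211")  # node that received the probe command
b = Node("192.168.122.124")  # node the engine queried for the peer list
c = Node("192.168.122.56")   # newly added server

a.probe(b)                   # existing cluster: .211 and .124 connected
a.probe(c)                   # engine probes the new server via .211

# Queried during the propagation window, .124 does not list the new peer
assert "192.168.122.56" not in b.peer_list()

propagate([a, b, c])
# After propagation, every node lists the new peer
assert "192.168.122.56" in b.peer_list()
```

This is exactly the window in which the sync job's peer list, taken on a different node, misses the new server.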

Comment 7 Sahina Bose 2013-12-17 11:53:10 UTC
As per discussions, two changes were made:

1. After gluster peer probe, also check that gluster peer status returns the newly added host before changing the host state to Up.

2. For non-distributed commands like peer status and volume info, always issue the command on the same server.

Comment 8 Sahina Bose 2013-12-19 06:35:49 UTC
1. The host state is Initializing after bootstrap completes. At this stage gluster peer probe is issued, and the engine waits for gluster peer status (run on another host) to return the newly added host.
Once this happens, the host state is changed to Up. If after 2 retries of peer status the host is still not returned, the host state is moved to Non-Operational.
The number of retries is configurable.

2. For non-distributed commands like peer status and volume info, commands are always issued on the same server.
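The verification loop described in point 1 can be sketched as follows (hypothetical function and parameter names; the actual fix is Java code in the engine):

```python
def wait_for_peer(new_host, peer_status, max_retries=2):
    """Poll peer_status() until new_host appears, allowing up to
    max_retries additional attempts after the first check.  Returns the
    resulting host state: 'UP' if found, 'NON_OPERATIONAL' otherwise."""
    for _ in range(1 + max_retries):
        if new_host in peer_status():
            return "UP"
    return "NON_OPERATIONAL"


# Simulated peer lists: the new server only shows up on the second poll,
# mirroring the propagation delay after 'gluster peer probe'.
views = iter([
    ["192.168.122.124", "192.168.122.211"],
    ["192.168.122.124", "192.168.122.211", "192.168.122.56"],
])
print(wait_for_peer("192.168.122.56", lambda: next(views)))  # prints UP
```

Making `max_retries` configurable lets administrators accommodate slower propagation in larger clusters.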

Comment 9 Dusmant 2013-12-24 04:48:17 UTC
Moved this bug to the Modified state, because part of the fix could not make it into build CB13. It will be in CB14. Kanagaraj is going to take care of it in Sahina's absence...

Comment 10 Matt Mahoney 2014-01-03 14:14:31 UTC
Verified in CB14.

Comment 11 Shalaka 2014-01-07 05:40:23 UTC
Please review the edited DocText and sign off.

Comment 12 Kanagaraj 2014-01-16 06:43:00 UTC
doc_text looks good.

Comment 14 errata-xmlrpc 2014-02-25 08:09:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html