Bug 1043032 - [RHSC] Server removed from DB after being in Up state
Summary: [RHSC] Server removed from DB after being in Up state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rhsc
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 2.1.2
Assignee: Kanagaraj
QA Contact: Matt Mahoney
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-12-13 19:38 UTC by Matt Mahoney
Modified: 2016-04-18 10:06 UTC (History)
13 users (show)

Fixed In Version: CB14
Doc Type: Bug Fix
Doc Text:
Previously, newly added servers were removed from the database by the sync job because the new server was not found in the cluster; this was due to 'gluster peer status' being invoked on a different server. With this update, servers added to an existing cluster are no longer removed by the sync job.
Clone Of:
Environment:
Last Closed: 2014-02-25 08:09:03 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2014:0208 0 normal SHIPPED_LIVE Red Hat Storage 2.1 enhancement and bug fix update #2 2014-02-25 12:20:30 UTC
oVirt gerrit 22448 0 None None None Never

Description Matt Mahoney 2013-12-13 19:38:33 UTC
Description of problem:

Server removed from DB by Engine, after the server was already in Up state

2013-Dec-13, 03:07 Detected server 192.168.122.56 removed from Cluster automation_cluster1, and removed it from engine DB.	
2013-Dec-13, 03:07 State was set to Up for host 192.168.122.56.

Version-Release number of selected component (if applicable):
cb11

How reproducible:


Steps to Reproduce:
1. Add two servers (no issues were encountered)
2. Add 3rd server (it was in Up state, and then was removed)
3. Add 4th server (just after 3rd server is Up)

Actual results:
Server was removed from cluster/DB

Expected results:
No servers should be removed from cluster/DB

Additional info:

Comment 4 Dusmant 2013-12-16 09:10:14 UTC
2. Add 3rd server (it was in Up state, and then was removed)
[Dusmant]   Was it automatically removed, or did some user action trigger this? It's not clear from this step.
3. Add 4th server (just after 3rd server is Up)

Actual results:
Server was removed from cluster/DB 
[Dusmant] Which server? The 3rd one? Maybe a name would make it easier to refer to and correlate with the logs.

Expected results:
No servers should be removed from cluster/DB

Comment 5 Sahina Bose 2013-12-16 09:39:51 UTC
From the logs:

1. A peer probe command for 192.168.122.56 was sent to 192.168.122.211, and it returned successfully:

2013-12-13 03:07:29,495 INFO  [org.ovirt.engine.core.vdsbroker.gluster.AddGlusterServerVDSCommand] (DefaultQuartzScheduler_Worker-74) [12a4733e] START, AddGlusterServerVDSCommand(HostName = 192.168.122.211, HostId = c34cda3b-fe7f-4152-a79f-15f45b806395), log id: 59c7d5e0
2013-12-13 03:07:31,492 INFO  [org.ovirt.engine.core.vdsbroker.gluster.AddGlusterServerVDSCommand] (DefaultQuartzScheduler_Worker-74) [12a4733e] FINISH, AddGlusterServerVDSCommand, log id: 59c7d5e0

2. A gluster peer list command was sent to 192.168.122.124, which returned [192.168.122.124:CONNECTED, 192.168.122.211:CONNECTED]:


2013-12-13 03:07:32,958 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler_Worker-62) [509af0bc] START, GlusterServersListVDSCommand(HostName = 192.168.122.124, HostId = f7594008-5540-4d66-8c2b-ed306ad387ff), log id: 4d46a229
2013-12-13 03:07:33,267 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler_Worker-62) [509af0bc] FINISH, GlusterServersListVDSCommand, return: [192.168.122.124:CONNECTED, 192.168.122.211:CONNECTED], log id: 4d46a229

3. Since 192.168.122.56 was not returned in the list, this was marked to be removed.

The problem seems to be that after gluster peer probe executes, the new peer does not immediately appear in gluster peer list on other nodes. (Possibly related to the explanation of how gluster peer probe works in https://bugzilla.redhat.com/show_bug.cgi?id=1020421)
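The removal in step 3 amounts to a set difference between the engine DB and the peer list. A minimal Python sketch of that logic (illustrative only; function and variable names are hypothetical, not the actual engine code):

```python
def sync_servers(db_servers, peer_list):
    """Return the servers that the sync job would remove: present in the
    engine DB but absent from the 'gluster peer status' output."""
    peers = set(peer_list)
    return [s for s in db_servers if s not in peers]

db = ["192.168.122.124", "192.168.122.211", "192.168.122.56"]
# Peer list fetched from .124, which has not yet learned about .56:
peers = ["192.168.122.124", "192.168.122.211"]
print(sync_servers(db, peers))  # ['192.168.122.56'] -> wrongly removed
```

Because the peer list came from a node that had not yet seen the probe, the freshly added server looks like a stale DB entry and gets removed.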

Vijay, Kaushal - any pointers on how to fix this?

Comment 6 Kaushal 2013-12-16 10:38:18 UTC
Most likely, the peer probe and peer list commands were issued on two different nodes. A peer which was probed will appear in the peer list of the node which issued the command immediately. The information regarding the new peer is propagated through the cluster after this. So there is a slight delay before other peers start listing the new peer. If a peer list was done on these peers during this delay, the new peer would be missing.
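The timing window Kaushal describes can be modelled in a few lines (a toy simulation with hypothetical names, not GlusterFS source code):

```python
# Each node tracks only the peers it currently knows about.
class Node:
    def __init__(self, name):
        self.name = name
        self.peers = set()

def peer_probe(issuer, new_peer):
    # The node that ran 'gluster peer probe' lists the new peer immediately.
    issuer.peers.add(new_peer)

def propagate(nodes, new_peer):
    # Cluster-wide propagation happens only after a slight delay.
    for n in nodes:
        n.peers.add(new_peer)

a = Node("192.168.122.211")  # probe was issued here
b = Node("192.168.122.124")  # but the peer list was read from here
peer_probe(a, "192.168.122.56")

# Inside the propagation window, b does not list the new peer yet:
before = "192.168.122.56" in b.peers  # False

propagate([a, b], "192.168.122.56")   # after propagation, every node lists it
```

A peer list taken from `b` during that window is exactly the `[.124, .211]` result seen in the logs above.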

Comment 7 Sahina Bose 2013-12-17 11:53:10 UTC
As per discussions, made two changes:

1. After gluster peer probe, also check if gluster peer status returns the newly added host, before changing the host state to UP

2. For non-distributed commands like peer status and volume info, always issue commands on the same server.

Comment 8 Sahina Bose 2013-12-19 06:35:49 UTC
1. The host state is Initializing after bootstrap completes. At this stage, gluster peer probe is issued, and the engine waits for gluster peer status (run from another host) to return the newly added host.
Once this is done, the host state is changed to UP. If the host is still not returned after 2 retries of peer status, the host state is moved to Non-Operational.
The number of retries is configurable.

2. For non-distributed commands like peer status and volume info, commands are always issued on the same server.
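The retry behaviour in change 1 can be sketched as a small poll loop (hypothetical names and signature; the real fix lives in the oVirt engine, not this code):

```python
import time

def wait_for_peer(peer_status, new_host, retries=2, delay=1.0):
    """After 'gluster peer probe', poll peer_status() (standing in for
    'gluster peer status' run on another host) until new_host appears.
    'retries' models the configurable retry count from the fix."""
    for _ in range(retries + 1):
        if new_host in peer_status():
            return "UP"
        time.sleep(delay)
    return "NON_OPERATIONAL"

# Example: the peer shows up on the second poll, once propagation is done.
calls = iter([
    ["192.168.122.124", "192.168.122.211"],                    # not yet propagated
    ["192.168.122.124", "192.168.122.211", "192.168.122.56"],  # propagated
])
state = wait_for_peer(lambda: next(calls), "192.168.122.56", retries=2, delay=0)
print(state)  # UP
```

If the host never appears within the retry budget, the loop gives up and the host would be moved to Non-Operational instead of being silently removed.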

Comment 9 Dusmant 2013-12-24 04:48:17 UTC
Moved this bug to Modified state because a part of the fix could not make it into build CB13. It will be in CB14. Kanagaraj is going to take care of it in Sahina's absence.

Comment 10 Matt Mahoney 2014-01-03 14:14:31 UTC
Verified in CB14.

Comment 11 Shalaka 2014-01-07 05:40:23 UTC
Please review the edited DocText and signoff.

Comment 12 Kanagaraj 2014-01-16 06:43:00 UTC
doc_text looks good.

Comment 14 errata-xmlrpc 2014-02-25 08:09:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html

