Description of problem:
Server removed from DB by Engine, after the server was already in Up state

2013-Dec-13, 03:07 Detected server 192.168.122.56 removed from Cluster automation_cluster1, and removed it from engine DB.
2013-Dec-13, 03:07 State was set to Up for host 192.168.122.56.

Version-Release number of selected component (if applicable):
cb11

How reproducible:

Steps to Reproduce:
1. Add two servers (no issues were encountered)
2. Added 3rd server (was in Up state, and then was removed)
4. Add 4th server (just after 3rd server is Up)

Actual results:
Server was removed from cluster/DB

Expected results:
No servers should be removed from cluster/DB

Additional info:
2. Added 3rd server (was in Up state, and then was removed)
[Dusmant] Was it removed automatically, or did some user action trigger this? It's not clear from this step.

4. Add 4th server (just after 3rd server is Up)

Actual results:
Server was removed from cluster/DB
[Dusmant] Which server? 3rd one. A name would make it easier to refer to and correlate with the logs.

Expected results:
No servers should be removed from cluster/DB
From the logs:

1. The command to peer probe 192.168.122.56 was sent to 192.168.122.211, and it returned successfully.

2013-12-13 03:07:29,495 INFO [org.ovirt.engine.core.vdsbroker.gluster.AddGlusterServerVDSCommand] (DefaultQuartzScheduler_Worker-74) [12a4733e] START, AddGlusterServerVDSCommand(HostName = 192.168.122.211, HostId = c34cda3b-fe7f-4152-a79f-15f45b806395), log id: 59c7d5e0
2013-12-13 03:07:31,492 INFO [org.ovirt.engine.core.vdsbroker.gluster.AddGlusterServerVDSCommand] (DefaultQuartzScheduler_Worker-74) [12a4733e] FINISH, AddGlusterServerVDSCommand, log id: 59c7d5e0

2. The gluster peer list command was sent to 192.168.122.124, which returned [192.168.122.124:CONNECTED, 192.168.122.211:CONNECTED].

2013-12-13 03:07:32,958 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler_Worker-62) [509af0bc] START, GlusterServersListVDSCommand(HostName = 192.168.122.124, HostId = f7594008-5540-4d66-8c2b-ed306ad387ff), log id: 4d46a229
2013-12-13 03:07:33,267 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler_Worker-62) [509af0bc] FINISH, GlusterServersListVDSCommand, return: [192.168.122.124:CONNECTED, 192.168.122.211:CONNECTED], log id: 4d46a229

3. Since 192.168.122.56 was not returned in the list, it was marked to be removed.

The problem seems to be that after gluster peer probe executes, the new server does not appear in gluster peer list. (Possibly related to the explanation of how gluster peer probe works in https://bugzilla.redhat.com/show_bug.cgi?id=1020421)

Vijay, Kaushal - any pointers on how to fix this?
Most likely, the peer probe and peer list commands were issued on two different nodes. A peer that has been probed appears immediately in the peer list of the node that issued the probe. The information about the new peer is then propagated through the cluster, so there is a slight delay before the other peers start listing it. If a peer list is run on one of those peers during this delay, the new peer will be missing.
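To make the timing window concrete, here is a toy model of the behaviour described above. It is not gluster or engine code: the `Node` class and the `peer_probe`, `propagate` and `missing_from` helpers are hypothetical names invented for illustration, and propagation is reduced to a single explicit step. The addresses are the ones from the logs in this report.

```python
# Toy model only: each node keeps its own view of the cluster membership.
class Node:
    def __init__(self, address, peers):
        self.address = address
        self.peers = set(peers)  # this node's current view of the cluster


def peer_probe(issuer, new_address):
    """The node that issues the probe lists the new peer right away."""
    issuer.peers.add(new_address)


def propagate(issuer, others):
    """Some time later, the issuer's view reaches the other nodes."""
    for node in others:
        node.peers |= issuer.peers


def missing_from(db_hosts, node):
    """Hosts known to the engine DB that this node does not list."""
    return set(db_hosts) - node.peers


cluster = {"192.168.122.124", "192.168.122.211"}
node_124 = Node("192.168.122.124", cluster)
node_211 = Node("192.168.122.211", cluster)
db_hosts = cluster | {"192.168.122.56"}

peer_probe(node_211, "192.168.122.56")        # probe issued on .211
stale = missing_from(db_hosts, node_124)      # list taken on .124 during the delay
fresh = missing_from(db_hosts, node_211)      # list taken on the probing node
propagate(node_211, [node_124])
after = missing_from(db_hosts, node_124)      # list taken on .124 after the delay
```

In this model only the list taken on 192.168.122.124 during the delay reports 192.168.122.56 as missing, which matches the sequence seen in the logs.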
As per discussions, made 2 changes:
1. After gluster peer probe, also check that gluster peer status returns the newly added host before changing the host state to UP.
2. For non-distributed commands like peer status and volume info, always issue the commands on the same server.
1. The host state is Initializing after bootstrap completes. At this stage gluster peer probe is done, and the engine waits for gluster peer status, run on another host, to return the newly added host. Once it does, the host state is changed to UP. If the host is still not returned after 2 retries of peer status, the host state is moved to Non-Operational. The number of retries is configurable.
2. For non-distributed commands like peer status and volume info, commands are always issued on the same server.
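The first change can be sketched roughly as follows. This is an illustration under stated assumptions, not the engine's implementation: `wait_for_peer`, `get_peer_status` and `delay_seconds` are hypothetical names, "2 retries" is read here as two peer status checks in total, and the delay between checks is an assumption because the comment above does not mention one.

```python
import time


def wait_for_peer(new_host, get_peer_status, retries=2, delay_seconds=0.0):
    """Decide the state of a newly probed host.

    get_peer_status is a hypothetical callable that returns the peer
    addresses reported by one fixed, already-added server.
    Returns "UP" once new_host is listed, else "NON_OPERATIONAL".
    """
    for attempt in range(retries):
        if new_host in get_peer_status():
            return "UP"
        if attempt < retries - 1:
            time.sleep(delay_seconds)  # assumed pause before the next check
    return "NON_OPERATIONAL"


# First check misses the new host, second check sees it.
responses = iter([
    ["192.168.122.124", "192.168.122.211"],
    ["192.168.122.124", "192.168.122.211", "192.168.122.56"],
])
state_seen = wait_for_peer("192.168.122.56", lambda: next(responses))

# The new host never shows up within the allowed checks.
state_unseen = wait_for_peer(
    "192.168.122.56", lambda: ["192.168.122.124", "192.168.122.211"]
)
```

Passing the status lookup in as a callable keeps the sketch independent of how the command is actually sent, and it mirrors the second change: the same server is queried on every check.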
Moved this bug to Modified state, because a part of the fix for this could not make it into build CB13. It will be in CB14. Kanagaraj is going to take care of it in Sahina's absence.
Verified in CB14.
Please review the edited DocText and sign off.
doc_text looks good.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-0208.html