Bug 972619

Summary: [RHS-C] Detected server host3 removed from Cluster <UNKNOWN>, and removed it from engine DB.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Prasanth <pprakash>
Component: rhsc
Assignee: Sahina Bose <sabose>
Status: CLOSED ERRATA
QA Contact: Prasanth <pprakash>
Severity: high
Priority: high
Docs Contact:
Version: 2.1
CC: dpati, dtsang, knarra, miguel, mmahoney, pprakash, rhs-bugs, sabose, sdharane, sharne, ssampat, vagarwal
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 2.1.2
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: cb9
Doc Type: Bug Fix
Doc Text:
Previously, while adding hosts, the newly added hosts were removed from the database automatically. With this update, the user can add and move hosts successfully.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-02-25 07:30:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1031613
Attachments:
screenshot of error (none)
Adding RHS2.0+ and RHS2.1 Nodes To 3.1 Cluster (none)

Description Prasanth 2013-06-10 08:43:21 UTC
Created attachment 759110 [details]
screenshot of error

Description of problem:

After adding a host from the UI, the host came up and was then suddenly removed from the engine DB, with the following event messages:

---
2013-Jun-10, 01:51  Detected server host3 removed from Cluster <UNKNOWN>, and removed it from engine DB.
2013-Jun-10, 01:51  Detected new Host host3. Host state was set to Up.
---

However, the host is still part of the cluster.

Version-Release number of selected component (if applicable):  Red Hat Storage Console Version: 2.1.0-0.bb2.el6rhs 


How reproducible: Intermittent issue

Steps to Reproduce:
1. Create a Cluster and add a server from the UI
2. Add a 2nd server from the CLI and import it from the UI
3. Add a 3rd server from the UI and watch the event messages.

----
2013-Jun-10, 01:51  Detected server host3 removed from Cluster <UNKNOWN>, and removed it from engine DB.
2013-Jun-10, 01:51  Detected new Host host3. Host state was set to Up.
2013-Jun-10, 01:51  Host host3 installed
2013-Jun-10, 01:51  Installing Host host3. Stage: Termination.
2013-Jun-10, 01:51  Installing Host host3. Retrieving installation logs to: '/var/log/ovirt-engine/host-deploy/ovirt-20130610015132-qa-vm07.lab.eng.blr.redhat.com-f6e09c7.log'.
2013-Jun-10, 01:51  Installing Host host3. Stage: Pre-termination.
2013-Jun-10, 01:51  Installing Host host3. Starting vdsm.
2013-Jun-10, 01:51  Installing Host host3. Starting gluster.
2013-Jun-10, 01:51  Installing Host host3. Stage: Closing up.
2013-Jun-10, 01:51  Installing Host host3. Stage: Transaction commit.
2013-Jun-10, 01:51  Installing Host host3. Enrolling certificate.
2013-Jun-10, 01:51  Installing Host host3. Setting time.
2013-Jun-10, 01:51  Installing Host host3. Stage: Misc configuration.
2013-Jun-10, 01:51  Installing Host host3. Stage: Package installation.
2013-Jun-10, 01:51  Installing Host host3. Stage: Transaction setup.
2013-Jun-10, 01:51  Installing Host host3. Stage: Setup validation.
2013-Jun-10, 01:51  Installing Host host3. Logs at host located at: '/tmp/ovirt-host-deploy-20130609201900.log'.
2013-Jun-10, 01:51  Installing Host host3. Stage: Environment customization.
2013-Jun-10, 01:51  Installing Host host3. Stage: Programs detection.
2013-Jun-10, 01:50  Installing Host host3. Yum Downloading: rhel-x86_64-server-6.4.z/primary 9.9 M(73%).
2013-Jun-10, 01:50  Installing Host host3. Yum Downloading: rhel-x86_64-server-6.4.z/primary 6.2 M(45%).
2013-Jun-10, 01:49  Installing Host host3. Yum Downloading: rhel-x86_64-server-6.4.z/primary 2.8 M(20%).
2013-Jun-10, 01:49  Installing Host host3. Stage: Installation packages setup.
2013-Jun-10, 01:49  Installing Host host3. Stage: Environment setup.
2013-Jun-10, 01:49  Installing Host host3. Stage: Initializing.
2013-Jun-10, 01:49  Installing Host host3. Connected to host qa-vm07.lab.eng.blr.redhat.com with SSH key fingerprint: 87:9a:7f:07:3e:ae:c8:40:ac:e1:03:bc:73:8c:1d:3c.
2013-Jun-10, 01:49  Host host3 was added by admin@internal.
-----


Actual results: The host is removed from the engine DB as soon as the server comes up.


Expected results: The host should not be removed from the engine DB in a valid case.


Additional info: Screenshot attached.

engine and vdsm logs will be attached soon

Comment 1 Prasanth 2013-06-10 08:50:24 UTC
gluster peer status outputs from the 3 servers:

---------------
[root@qa-vm05 /]# hostname
qa-vm05.lab.eng.blr.redhat.com
[root@qa-vm05 /]# 
[root@qa-vm05 /]# gluster peer status
Number of Peers: 2

Hostname: 10.70.36.78
Port: 24007
Uuid: 5f057817-fa16-498c-b683-aefcf0863c3c
State: Peer in Cluster (Connected)

Hostname: qa-vm07.lab.eng.blr.redhat.com
Uuid: 3c4acfe2-36e8-47fd-b45f-5cb00d78b9d8
State: Peer in Cluster (Connected)



[root@qa-vm06 ~]# hostname
qa-vm06.lab.eng.blr.redhat.com
[root@qa-vm06 ~]# 
[root@qa-vm06 ~]# gluster peer status
Number of Peers: 2

Hostname: 10.70.36.77
Uuid: a719992e-0f15-419b-99be-999e1f786813
State: Peer in Cluster (Connected)

Hostname: qa-vm07.lab.eng.blr.redhat.com
Port: 24007
Uuid: 3c4acfe2-36e8-47fd-b45f-5cb00d78b9d8
State: Peer in Cluster (Connected)


[root@qa-vm07 /]# hostname
qa-vm07.lab.eng.blr.redhat.com
[root@qa-vm07 /]# 
[root@qa-vm07 /]# gluster peer status
Number of Peers: 2

Hostname: 10.70.36.78
Port: 24007
Uuid: 5f057817-fa16-498c-b683-aefcf0863c3c
State: Peer in Cluster (Connected)

Hostname: 10.70.36.77
Uuid: a719992e-0f15-419b-99be-999e1f786813
State: Peer in Cluster (Connected)
---------------

PS: The 2nd server was peer probed using the HOSTNAME only. I'm not sure why some of the peers are listed by IP and some by hostname.

Comment 6 Sahina Bose 2013-06-19 10:08:49 UTC
In 3.2 clusters (using RHS 2.1 nodes), the engine now uses the gluster host UUID to compare hosts instead of the server IP/hostname.
This should fix the issue where the engine cannot identify a gluster host correctly.
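The UUID-based comparison can be illustrated with a minimal sketch (all names here are hypothetical illustrations, not the engine's actual code). As the peer-status output in Comment 1 shows, gluster may report the same server by IP in one peer's view and by hostname in another, so string comparison of addresses misidentifies hosts, while the gluster host UUID is stable:

```python
def find_host_by_name(engine_hosts, peer):
    # Fragile: "10.70.36.77" and "qa-vm06.lab.eng.blr.redhat.com" can be the
    # same server, but a string comparison treats them as different hosts.
    return next((h for h in engine_hosts if h["address"] == peer["hostname"]), None)

def find_host_by_uuid(engine_hosts, peer):
    # Robust: the gluster host UUID identifies the peer regardless of how
    # gluster happens to render its address in "gluster peer status".
    return next((h for h in engine_hosts if h["gluster_uuid"] == peer["uuid"]), None)
```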

Comment 7 Matt Mahoney 2013-06-21 19:14:39 UTC
Tested against bb4. This issue appears to be partially fixed: when peer probing a server by its hostname, the resulting RHSC message is correct. However, when peer detaching the same server by its hostname, the RHSC message is incorrect:

2013-Jun-21, 04:02  Detected server rhs4 removed from Cluster <UNKNOWN>, and removed it from engine DB.


Note: The detach is successful.

Comment 9 Matt Mahoney 2013-06-27 19:30:48 UTC
Created attachment 766300 [details]
Adding RHS2.0+ and RHS2.1 Nodes To 3.1 Cluster

Comment 10 Prasanth 2013-10-17 12:28:53 UTC
This is seen in the latest build cb4 as well. See below:

----
Detected server Host3 removed from Cluster Cluster1, and removed it from engine DB.
----


engine.log shows the following:

----------------------------
2013-10-17 16:56:07,583 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-7) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host Host3.
2013-10-17 16:56:07,596 INFO  [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (DefaultQuartzScheduler_Worker-7) Running command: InitVdsOnUpCommand internal: true. Entities affected :  ID: 5849b030-626e-47cb-ad90-3ce782d831b3 Type: StoragePool
2013-10-17 16:56:07,598 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GetGlusterHostUUIDVDSCommand] (DefaultQuartzScheduler_Worker-7) START, GetGlusterHostUUIDVDSCommand(HostName = Host3, HostId = 1784c92a-1dfa-4db1-aa00-4e7c3b3c0ee1), log id: 6a65761b
2013-10-17 16:56:07,726 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GetGlusterHostUUIDVDSCommand] (DefaultQuartzScheduler_Worker-7) FINISH, GetGlusterHostUUIDVDSCommand, return: ee62b9db-3aa2-4121-9bc6-b1dbcd7581ee, log id: 6a65761b
2013-10-17 16:56:07,751 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler_Worker-7) START, GlusterServersListVDSCommand(HostName = Host2, HostId = 783e37c6-fba2-4115-8a97-875772426117), log id: 647dc785
2013-10-17 16:56:08,000 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler_Worker-7) FINISH, GlusterServersListVDSCommand, return: [10.70.36.85:CONNECTED, 10.70.36.84:CONNECTED], log id: 647dc785
2013-10-17 16:56:08,002 INFO  [org.ovirt.engine.core.vdsbroker.gluster.AddGlusterServerVDSCommand] (DefaultQuartzScheduler_Worker-7) START, AddGlusterServerVDSCommand(HostName = Host2, HostId = 783e37c6-fba2-4115-8a97-875772426117), log id: 45b3613d
2013-10-17 16:56:11,002 INFO  [org.ovirt.engine.core.vdsbroker.gluster.AddGlusterServerVDSCommand] (DefaultQuartzScheduler_Worker-7) FINISH, AddGlusterServerVDSCommand, log id: 45b3613d
2013-10-17 16:56:11,119 INFO  [org.ovirt.engine.core.bll.HandleVdsVersionCommand] (DefaultQuartzScheduler_Worker-7) Running command: HandleVdsVersionCommand internal: true. Entities affected :  ID: 1784c92a-1dfa-4db1-aa00-4e7c3b3c0ee1 Type: VDS
2013-10-17 16:56:11,631 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler_Worker-2) START, GlusterServersListVDSCommand(HostName = Host1, HostId = a325459f-24f2-4eb3-abf8-de36a460a8ab), log id: 4d713d87
2013-10-17 16:56:11,889 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler_Worker-2) FINISH, GlusterServersListVDSCommand, return: [10.70.36.84:CONNECTED, vm13.lab.eng.blr.redhat.com:CONNECTED], log id: 4d713d87
2013-10-17 16:56:11,889 INFO  [org.ovirt.engine.core.bll.gluster.GlusterSyncJob] (DefaultQuartzScheduler_Worker-2) Server Host3 has been removed directly using the gluster CLI. Removing it from engine as well.
2013-10-17 16:56:11,967 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-2) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Detected server Host3 removed from Cluster Cluster1, and removed it from engine DB.
2013-10-17 16:56:12,032 INFO  [org.ovirt.engine.core.vdsbroker.RemoveVdsVDSCommand] (DefaultQuartzScheduler_Worker-2) START, RemoveVdsVDSCommand( HostId = 1784c92a-1dfa-4db1-aa00-4e7c3b3c0ee1), log id: 6def688c
2013-10-17 16:56:12,032 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-2) vdsManager::disposing
2013-10-17 16:56:12,033 INFO  [org.ovirt.engine.core.vdsbroker.RemoveVdsVDSCommand] (DefaultQuartzScheduler_Worker-2) FINISH, RemoveVdsVDSCommand, log id: 6def688c
2013-10-17 16:56:12,033 INFO  [org.ovirt.engine.core.bll.gluster.GlusterSyncJob] (DefaultQuartzScheduler_Worker-2) Servers detached using gluster CLI  is removed from engine after inspecting the Gluster servers list [10.70.36.84:CONNECTED, vm13.lab.eng.blr.redhat.com:CONNECTED]
----------------------------

Looks like this is a serious race condition.

Steps to reproduce:
* Added 2 servers to a cluster from the UI using the hostname
* Later added 1 more server to the same cluster from the UI using the hostname
* The bootstrap was successful, but the server was immediately removed from the engine DB.
* However, if you look at "gluster peer status", you can see that the server was peer probed successfully and became part of the cluster.

-----------
[root@vm13 ]# gluster peer status
Number of Peers: 2

Hostname: 10.70.36.84
Uuid: 08e029e4-4c29-452d-8faa-2bb66fc63b19
State: Peer in Cluster (Connected)

Hostname: vm11.lab.eng.blr.redhat.com
Uuid: ee62b9db-3aa2-4121-9bc6-b1dbcd7581ee
State: Peer in Cluster (Connected)

[root@vm13 /]# gluster pool list
UUID                                    Hostname        State
08e029e4-4c29-452d-8faa-2bb66fc63b19    10.70.36.84     Connected 
ee62b9db-3aa2-4121-9bc6-b1dbcd7581ee    vm11.lab.eng.blr.redhat.com     Connected 
3156792a-be8a-40ae-83e5-3d12a2d7eb01    localhost       Connected 
-----------

Host details for your reference:

Host1   vm12.lab.eng.blr.redhat.com   10.70.36.84
Host2   vm13.lab.eng.blr.redhat.com   10.70.36.85
Host3   vm11.lab.eng.blr.redhat.com   10.70.36.83


sosreports of engine and 3 nodes can be downloaded from: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/rhsc/972619/

Let me know if you need any additional details to resolve this issue.

Comment 11 Shruti Sampat 2013-10-25 16:14:44 UTC
Issue observed in cb5 as well, while editing a server and moving it to a different cluster. Did the following - 

1. Had two clusters, one with 3 servers and another one with 1 server.
2. Moved the server belonging to the single-server cluster to Maintenance, and edited the name of the server and the cluster field (to move it to the other cluster)

After updating the server details, immediately the message "Detected server 10.70.37.44 removed from Cluster test1, and removed it from engine DB." was seen in the events log.

Comment 12 Shruti Sampat 2013-10-25 16:33:12 UTC
Find sosreports and engine logs at - 
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/972619/

Comment 13 Sahina Bose 2013-10-30 08:44:46 UTC
There are 2 issues here.

1) The issue in Comment 10 seems strange.
From the logs, it looks like the engine's gluster peer probe request returned successfully without the probe actually having been executed; the actual gluster peer probe happened a few minutes later.

So even though the engine removed the server from the DB, it should have been available for import later, as the peer list returns 3 peers.

engine.log ->
2013-10-17 16:56:08,002 INFO  [org.ovirt.engine.core.vdsbroker.gluster.AddGlusterServerVDSCommand] (DefaultQuartzScheduler_Worker-7) START, AddGlusterServerVDSCommand(HostName = Host2, HostId = 783e37c6-fba2-4115-8a97-875772426117), log id: 45b3613d
2013-10-17 16:56:11,002 INFO  [org.ovirt.engine.core.vdsbroker.gluster.AddGlusterServerVDSCommand] (DefaultQuartzScheduler_Worker-7) FINISH, AddGlusterServerVDSCommand, log id: 45b3613d

host2 vdsm.log ->
Thread-9323::DEBUG::2013-10-17 16:59:29,053::BindingXMLRPC::974::vds::(wrapper) client [10.70.36.79]::call hostAdd with ('vm11.lab.eng.blr.redhat.com',) {}
Thread-9323::DEBUG::2013-10-17 16:59:32,049::BindingXMLRPC::981::vds::(wrapper) return hostAdd with {'status': {'message': 'Done', 'code': 0}}

2) Issue with moving a host to a different cluster and the host getting removed. This is due to the 2 flows running in parallel, and will be fixed by http://gerrit.ovirt.org/#/c/20647/

Comment 14 Sahina Bose 2013-10-31 08:50:40 UTC
For issue 1, according to Bug 1020421#Comment 4, when "gluster peer probe" is done, the CLI returns success - though the peer probe has not yet actually happened. This would cause the behaviour we see in Comment 10 of this bug.
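Given that finding, a caller cannot trust the probe's exit status alone. One client-side mitigation is to poll the peer list until the probed host actually appears. A minimal sketch, assuming a hypothetical `list_peers` callable that returns the parsed output of `gluster peer status` as dicts:

```python
import time

def wait_for_peer(list_peers, hostname, timeout=300, interval=5):
    # "gluster peer probe" may return success before the peer has actually
    # joined (Bug 1020421), so poll the peer list instead of trusting the
    # probe's return code.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if any(p["hostname"] == hostname for p in list_peers()):
            return True
        time.sleep(interval)
    return False
```

The timeout and interval values are illustrative; a real caller would tune them to the environment.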

Comment 15 Sahina Bose 2013-11-11 10:37:24 UTC
Added a cluster-level lock in the engine to make sure that gluster peer status is not called at the same time as gluster peer probe.
http://gerrit.ovirt.org/#/c/21015/

Comment 17 Shalaka 2014-01-07 09:13:19 UTC
Please review the edited DocText and signoff.

Comment 18 Sahina Bose 2014-01-30 07:02:57 UTC
Looks good.

Comment 21 errata-xmlrpc 2014-02-25 07:30:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html