Bug 972619
Summary: | [RHS-C] Detected server host3 removed from Cluster <UNKNOWN>, and removed it from engine DB. | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Prasanth <pprakash> | ||||||
Component: | rhsc | Assignee: | Sahina Bose <sabose> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Prasanth <pprakash> | ||||||
Severity: | high | Docs Contact: | |||||||
Priority: | high | ||||||||
Version: | 2.1 | CC: | dpati, dtsang, knarra, miguel, mmahoney, pprakash, rhs-bugs, sabose, sdharane, sharne, ssampat, vagarwal | ||||||
Target Milestone: | --- | Keywords: | ZStream | ||||||
Target Release: | RHGS 2.1.2 | ||||||||
Hardware: | x86_64 | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | cb9 | Doc Type: | Bug Fix | ||||||
Doc Text: |
Previously, while adding hosts, the newly added hosts were getting removed from the database automatically. Now, with this update, the user can add/move hosts successfully.
|
Story Points: | --- | ||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2014-02-25 07:30:21 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 1031613 | ||||||||
Attachments: |
|
gluster peer status outputs from the 3 servers: --------------- [root@qa-vm05 /]# hostname qa-vm05.lab.eng.blr.redhat.com [root@qa-vm05 /]# [root@qa-vm05 /]# gluster peer status Number of Peers: 2 Hostname: 10.70.36.78 Port: 24007 Uuid: 5f057817-fa16-498c-b683-aefcf0863c3c State: Peer in Cluster (Connected) Hostname: qa-vm07.lab.eng.blr.redhat.com Uuid: 3c4acfe2-36e8-47fd-b45f-5cb00d78b9d8 State: Peer in Cluster (Connected) [root@qa-vm06 ~]# hostname qa-vm06.lab.eng.blr.redhat.com [root@qa-vm06 ~]# [root@qa-vm06 ~]# gluster peer status Number of Peers: 2 Hostname: 10.70.36.77 Uuid: a719992e-0f15-419b-99be-999e1f786813 State: Peer in Cluster (Connected) Hostname: qa-vm07.lab.eng.blr.redhat.com Port: 24007 Uuid: 3c4acfe2-36e8-47fd-b45f-5cb00d78b9d8 State: Peer in Cluster (Connected) [root@qa-vm07 /]# hostname qa-vm07.lab.eng.blr.redhat.com [root@qa-vm07 /]# [root@qa-vm07 /]# gluster peer status Number of Peers: 2 Hostname: 10.70.36.78 Port: 24007 Uuid: 5f057817-fa16-498c-b683-aefcf0863c3c State: Peer in Cluster (Connected) Hostname: 10.70.36.77 Uuid: a719992e-0f15-419b-99be-999e1f786813 State: Peer in Cluster (Connected) --------------- PS: The 2nd server was peer probed using the HOSTNAME only. I'm not sure why some of the peers are with IP and some are with hostname. In 3.2 clusters (using RHS 2.1 nodes) using gluster host UUID to compare hosts instead of server IP/ hostnames. This should fix the issue where engine cannot identify a gluster host correctly. Tested against bb4. This issue appears to be partially fixed in that when peer probing a server by its hostname the resulting RHSC message is correct. However, when peer detaching the same server by its hostname the RHSC message is incorrect: 2013-Jun-21, 04:02 Detected server rhs4 removed from Cluster <UNKNOWN>, and removed it from engine DB. Note: The detach is successful. Created attachment 766300 [details]
Adding RHS2.0+ and RHS2.1 Nodes To 3.1 Cluster
This is seen in the latest build cb4 as well. See below: ---- Detected server Host3 removed from Cluster Cluster1, and removed it from engine DB. ---- vdsm.log says the following: ---------------------------- 2013-10-17 16:56:07,583 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-7) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host Host3. 2013-10-17 16:56:07,596 INFO [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (DefaultQuartzScheduler_Worker-7) Running command: InitVdsOnUpCommand internal: true. Entities affected : ID: 5849b030-626e-47cb-ad90-3ce782d831b3 Type: StoragePool 2013-10-17 16:56:07,598 INFO [org.ovirt.engine.core.vdsbroker.gluster.GetGlusterHostUUIDVDSCommand] (DefaultQuartzScheduler_Worker-7) START, GetGlusterHostUUIDVDSCommand(HostName = Host3, HostId = 1784c92a-1dfa-4db1-aa00-4e7c3b3c0ee1), log id: 6a65761b 2013-10-17 16:56:07,726 INFO [org.ovirt.engine.core.vdsbroker.gluster.GetGlusterHostUUIDVDSCommand] (DefaultQuartzScheduler_Worker-7) FINISH, GetGlusterHostUUIDVDSCommand, return: ee62b9db-3aa2-4121-9bc6-b1dbcd7581ee, log id: 6a65761b 2013-10-17 16:56:07,751 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler_Worker-7) START, GlusterServersListVDSCommand(HostName = Host2, HostId = 783e37c6-fba2-4115-8a97-875772426117), log id: 647dc785 2013-10-17 16:56:08,000 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler_Worker-7) FINISH, GlusterServersListVDSCommand, return: [10.70.36.85:CONNECTED, 10.70.36.84:CONNECTED], log id: 647dc785 2013-10-17 16:56:08,002 INFO [org.ovirt.engine.core.vdsbroker.gluster.AddGlusterServerVDSCommand] (DefaultQuartzScheduler_Worker-7) START, AddGlusterServerVDSCommand(HostName = Host2, HostId = 783e37c6-fba2-4115-8a97-875772426117), log id: 45b3613d 2013-10-17 16:56:11,002 INFO [org.ovirt.engine.core.vdsbroker.gluster.AddGlusterServerVDSCommand] (DefaultQuartzScheduler_Worker-7) FINISH, AddGlusterServerVDSCommand, log id: 45b3613d 2013-10-17 16:56:11,119 INFO [org.ovirt.engine.core.bll.HandleVdsVersionCommand] (DefaultQuartzScheduler_Worker-7) Running command: HandleVdsVersionCommand internal: true. Entities affected : ID: 1784c92a-1dfa-4db1-aa00-4e7c3b3c0ee1 Type: VDS 2013-10-17 16:56:11,631 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler_Worker-2) START, GlusterServersListVDSCommand(HostName = Host1, HostId = a325459f-24f2-4eb3-abf8-de36a460a8ab), log id: 4d713d87 2013-10-17 16:56:11,889 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler_Worker-2) FINISH, GlusterServersListVDSCommand, return: [10.70.36.84:CONNECTED, vm13.lab.eng.blr.redhat.com:CONNECTED], log id: 4d713d87 2013-10-17 16:56:11,889 INFO [org.ovirt.engine.core.bll.gluster.GlusterSyncJob] (DefaultQuartzScheduler_Worker-2) Server Host3 has been removed directly using the gluster CLI. Removing it from engine as well. 2013-10-17 16:56:11,967 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-2) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Detected server Host3 removed from Cluster Cluster1, and removed it from engine DB. 2013-10-17 16:56:12,032 INFO [org.ovirt.engine.core.vdsbroker.RemoveVdsVDSCommand] (DefaultQuartzScheduler_Worker-2) START, RemoveVdsVDSCommand( HostId = 1784c92a-1dfa-4db1-aa00-4e7c3b3c0ee1), log id: 6def688c 2013-10-17 16:56:12,032 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-2) vdsManager::disposing 2013-10-17 16:56:12,033 INFO [org.ovirt.engine.core.vdsbroker.RemoveVdsVDSCommand] (DefaultQuartzScheduler_Worker-2) FINISH, RemoveVdsVDSCommand, log id: 6def688c 2013-10-17 16:56:12,033 INFO [org.ovirt.engine.core.bll.gluster.GlusterSyncJob] (DefaultQuartzScheduler_Worker-2) Servers detached using gluster CLI is removed from engine after inspecting the Gluster servers list [10.70.36.84:CONNECTED, vm13.lab.eng.blr.redhat.com:CONNECTED] ---------------------------- Looks like this is a serious race condition. Steps to reproduce: * Added 2 servers to a cluster from UI using the hostname * Added 1 more server to the same cluster from the UI using the hostname later * The bootstrap was successful but the server was immediately removed from the engine db. * However, if you look at the "gluster peer status", you can see that the server got peer probed successfully and got part of the cluster. ----------- [root@vm13 ]# gluster peer status Number of Peers: 2 Hostname: 10.70.36.84 Uuid: 08e029e4-4c29-452d-8faa-2bb66fc63b19 State: Peer in Cluster (Connected) Hostname: vm11.lab.eng.blr.redhat.com Uuid: ee62b9db-3aa2-4121-9bc6-b1dbcd7581ee State: Peer in Cluster (Connected) [root@vm13 /]# gluster pool list UUID Hostname State 08e029e4-4c29-452d-8faa-2bb66fc63b19 10.70.36.84 Connected ee62b9db-3aa2-4121-9bc6-b1dbcd7581ee vm11.lab.eng.blr.redhat.com Connected 3156792a-be8a-40ae-83e5-3d12a2d7eb01 localhost Connected ----------- Host details for your reference: Host1 vm12.lab.eng.blr.redhat.com 10.70.36.84 Host2 vm13.lab.eng.blr.redhat.com 10.70.36.85 Host3 vm11.lab.eng.blr.redhat.com 10.70.36.83 sosreports of engine and 3 nodes can be downloaded from: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/rhsc/972619/ Let me know if you need any additional details to resolve this issue. Issue observed in cb5 as well, while editing a server and moving it to a different cluster. Did the following - 1. Had two clusters, one with 3 servers and another one with 1 server. 2. Moved the server belonging to the single-server cluster to Maintenance and edited the name of the server, and the cluster field ( to move it to the other cluster ) After updating the server details, immediately the message "Detected server 10.70.37.44 removed from Cluster test1, and removed it from engine DB." was seen in the events log. Find sosreports and engine logs at - http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/972619/ There are 2 issues here. 1) Issue in Comment 10 seems strange. From the logs, it looks like the request to gluster peer probe from engine returned successfully without actually executing it. The gluster peer probe happened 2 minutes later. So though the engine removed the server from DB, it should have been available for import later, as peer list returns 3. engine.log -> 2013-10-17 16:56:08,002 INFO [org.ovirt.engine.core.vdsbroker.gluster.AddGlusterServerVDSCommand] (DefaultQuartzScheduler_Worker-7) START, AddGlusterServerVDSCommand(HostName = Host2, HostId = 783e37c6-fba2-4115-8a97-875772426117), log id: 45b3613d 2013-10-17 16:56:11,002 INFO [org.ovirt.engine.core.vdsbroker.gluster.AddGlusterServerVDSCommand] (DefaultQuartzScheduler_Worker-7) FINISH, AddGlusterServerVDSCommand, log id: 45b3613d host2 vdsm.log -> Thread-9323::DEBUG::2013-10-17 16:59:29,053::BindingXMLRPC::974::vds::(wrapper) client [10.70.36.79]::call hostAdd with ('vm11.lab.eng.blr.redhat.com',) {} Thread-9323::DEBUG::2013-10-17 16:59:32,049::BindingXMLRPC::981::vds::(wrapper) return hostAdd with {'status': {'message': 'Done', 'code': 0}} 2) Issue with moving host to different cluster and host getting removed. This is due to the 2 flows running in parallel. And will be fixed by http://gerrit.ovirt.org/#/c/20647/ For issue 1, according to Bug 1020421#Comment 4, when "gluster peer probe" is done, the CLI returns success - though the peer probe has not yet actually happened. This would cause the behaviour we see in Comment 10 of this bug. Added a lock at cluster level in engine to make sure that a gluster peer status is not called at same time when gluster peer probe is done. http://gerrit.ovirt.org/#/c/21015/ Please review the edited DocText and signoff. Looks good. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-0208.html |
Created attachment 759110 [details] screenshot of error Description of problem: After adding a host from the UI, it came UP and suddenly gets removed from the engine DB with the following event message: --- 2013-Jun-10, 01:51 Detected server host3 removed from Cluster <UNKNOWN>, and removed it from engine DB. 2013-Jun-10, 01:51 Detected new Host host3. Host state was set to Up. --- However, the host is still part of the cluster Version-Release number of selected component (if applicable): Red Hat Storage Console Version: 2.1.0-0.bb2.el6rhs How reproducible: Intermittent issue Steps to Reproduce: 1. Create a Cluster and add a server from the UI 2. Add a 2nd server from the CLI and import it from the UI 3. Add a 3rd server from the UI and watch the event messages. ---- 2013-Jun-10, 01:51 Detected server host3 removed from Cluster <UNKNOWN>, and removed it from engine DB. 2013-Jun-10, 01:51 Detected new Host host3. Host state was set to Up. 2013-Jun-10, 01:51 Host host3 installed 2013-Jun-10, 01:51 Installing Host host3. Stage: Termination. 2013-Jun-10, 01:51 Installing Host host3. Retrieving installation logs to: '/var/log/ovirt-engine/host-deploy/ovirt-20130610015132-qa-vm07.lab.eng.blr.redhat.com-f6e09c7.log'. 2013-Jun-10, 01:51 Installing Host host3. Stage: Pre-termination. 2013-Jun-10, 01:51 Installing Host host3. Starting vdsm. 2013-Jun-10, 01:51 Installing Host host3. Starting gluster. 2013-Jun-10, 01:51 Installing Host host3. Stage: Closing up. 2013-Jun-10, 01:51 Installing Host host3. Stage: Transaction commit. 2013-Jun-10, 01:51 Installing Host host3. Enrolling certificate. 2013-Jun-10, 01:51 Installing Host host3. Setting time. 2013-Jun-10, 01:51 Installing Host host3. Stage: Misc configuration. 2013-Jun-10, 01:51 Installing Host host3. Stage: Package installation. 2013-Jun-10, 01:51 Installing Host host3. Stage: Transaction setup. 2013-Jun-10, 01:51 Installing Host host3. Stage: Setup validation. 2013-Jun-10, 01:51 Installing Host host3. Logs at host located at: '/tmp/ovirt-host-deploy-20130609201900.log'. 2013-Jun-10, 01:51 Installing Host host3. Stage: Environment customization. 2013-Jun-10, 01:51 Installing Host host3. Stage: Programs detection. 2013-Jun-10, 01:50 Installing Host host3. Yum Downloading: rhel-x86_64-server-6.4.z/primary 9.9 M(73%). 2013-Jun-10, 01:50 Installing Host host3. Yum Downloading: rhel-x86_64-server-6.4.z/primary 6.2 M(45%). 2013-Jun-10, 01:49 Installing Host host3. Yum Downloading: rhel-x86_64-server-6.4.z/primary 2.8 M(20%). 2013-Jun-10, 01:49 Installing Host host3. Stage: Installation packages setup. 2013-Jun-10, 01:49 Installing Host host3. Stage: Environment setup. 2013-Jun-10, 01:49 Installing Host host3. Stage: Initializing. 2013-Jun-10, 01:49 Installing Host host3. Connected to host qa-vm07.lab.eng.blr.redhat.com with SSH key fingerprint: 87:9a:7f:07:3e:ae:c8:40:ac:e1:03:bc:73:8c:1d:3c. 2013-Jun-10, 01:49 Host host3 was added by admin@internal. ----- Actual results: Host is getting removed as soon as the server is UP Expected results: Host should not get removed from the engine db in a valid case Additional info: Screenshot attached. engine and vdsm logs will be attached soon