Bug 1331257

Summary: Mismatching host entries in hosted engine GUI and cli
Product: [oVirt] ovirt-engine
Reporter: Bhaskarakiran <byarlaga>
Component: BLL.HostedEngine
Assignee: bugs <bugs>
Status: CLOSED NOTABUG
QA Contact: meital avital <mavital>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.6.5
CC: bugs, dfediuck, mzywusko, nsednev, sabose, sbonazzo, ylavi
Target Milestone: ovirt-4.1.0-beta
Flags: dfediuck: ovirt-4.1?
       rule-engine: planning_ack?
       rule-engine: devel_ack?
       rule-engine: testing_ack?
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: PM-16
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-12-16 11:36:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1277939
Attachments:
  sosreport of hosted engine
  screenshot of GUI
  Screenshot from 2016-06-09 10:47:40.png

Description Bhaskarakiran 2016-04-28 06:52:23 UTC
Created attachment 1151740 [details]
sosreport of hosted engine

Description of problem:
----------------------

I tried to add a third machine to the hosted-engine cluster, and the attempt failed several times due to network issues. I then replaced that host with a new one; although its CPU type mismatched, it was activated. The web GUI now shows the machines rhsqa1, rhsqa4, and rhsqa13, while the CLI (hosted-engine --vm-status) shows rhsqa1, rhsqa13, and rhsqa5 (the host that failed earlier).

[root@rhsqa5 ~]# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : rhsqa1.lab.eng.blr.redhat.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 4d6ae8f7
Host timestamp                     : 256335


--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : rhsqa13.lab.eng.blr.redhat.com
Host ID                            : 3
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : 792ac5bf
Host timestamp                     : 66755


--== Host 4 status ==--

Status up-to-date                  : False
Hostname                           : rhsqa5.lab.eng.blr.redhat.com
Host ID                            : 4
Engine status                      : unknown stale-data
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : 9bdef367
Host timestamp                     : 184434
[root@rhsqa5 ~]#

Will attach the GUI screenshot.
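
A minimal sketch for comparing the two views side by side, assuming the ovirt-engine-cli package is installed; the engine URL and credentials below are placeholders:

# The CLI view: hosts as recorded in the shared HA metadata.
hosted-engine --vm-status | grep -i '^Hostname'

# The GUI view: hosts as registered in the engine, via ovirt-shell
# (URL and credentials are placeholders for this environment):
ovirt-shell -c -l 'https://engine.example.com/ovirt-engine/api' \
    -u 'admin@internal' -E 'list hosts' | grep -i '^name'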


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
3.6.5.3-0.1.el6

How reproducible:
-----------------
Tried once 

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
The sosreport of the hosted engine is attached.

Comment 1 Bhaskarakiran 2016-04-28 06:52:54 UTC
Created attachment 1151741 [details]
screenshot of GUI

Comment 2 Sahina Bose 2016-05-03 07:08:41 UTC
Is there a way to clean metadata if the node being removed is no longer available?

Comment 3 Yaniv Lavi 2016-05-09 10:53:32 UTC
Moving to the first RC, since things should not be targeted to the second one at this point.

Comment 4 Roy Golan 2016-06-08 19:03:46 UTC
(In reply to Sahina Bose from comment #2)
> Is there a way to clean metadata if the node being removed is no longer
> available?

The 4.0 feature for deploying/undeploying a hosted-engine host through the engine should fix that by calling clean-metadata after cleaning the configuration file.

Sandro, I guess this means adding a call to clean-metadata in hosted-engine/configureha.py?
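
For illustration, the manual equivalent of such an undeploy hook on the host being removed might look like the sketch below; this is an assumption about the intended flow, not the actual configureha.py change:

# Sketch of the intended undeploy cleanup on the host being removed.
# The agent must be stopped before --clean-metadata can run:
systemctl stop ovirt-ha-agent ovirt-ha-broker
hosted-engine --clean-metadata   # frees this host's slot in the shared metadata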

Comment 5 Roy Golan 2016-06-08 19:10:10 UTC
Nikolai, can you try to reproduce this on 4.0, but deploy and undeploy only through the engine?

Comment 6 Nikolai Sednev 2016-06-09 07:47:56 UTC
(In reply to Roy Golan from comment #5)
> Nikolai, can you try to reproduce this on 4.0, but deploy and undeploy only
> through the engine?

Failed to reproduce.
On 4.0 I see 2 hosts via the web UI and the same 2 hosts via the CLI.


[root@alma04 ~]# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : False
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : 83ff7751
Host timestamp                     : 19177
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=19177 (Tue Jun  7 20:03:51 2016)
        host-id=1
        score=0
        maintenance=False
        state=AgentStopped
        stopped=True


--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 63e035c9
Host timestamp                     : 150391
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=150391 (Thu Jun  9 10:45:08 2016)
        host-id=2
        score=3400
        maintenance=False
        state=EngineUp
        stopped=False


Hosts:
mom-0.5.4-1.el7ev.noarch
ovirt-vmconsole-1.0.3-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
libvirt-client-1.2.17-13.el7_2.5.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.15.x86_64
vdsm-4.18.1-11.gita92976e.el7ev.x86_64
ovirt-hosted-engine-setup-2.0.0-1.el7ev.noarch
ovirt-host-deploy-1.5.0-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-vmconsole-host-1.0.3-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.5.0-1.el7ev.noarch
Red Hat Enterprise Linux Server release 7.2 (Maipo)
Linux 3.10.0-327.22.1.el7.x86_64 #1 SMP Mon May 16 13:31:48 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Linux version 3.10.0-327.22.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon May 16 13:31:48 EDT 2016

Engine:
ovirt-engine-setup-plugin-ovirt-engine-4.0.0.2-0.1.el7ev.noarch
ovirt-vmconsole-1.0.3-1.el7ev.noarch
ovirt-engine-extension-aaa-jdbc-1.1.0-1.el7ev.noarch
rhevm-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-setup-base-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-websocket-proxy-4.0.0.2-0.1.el7ev.noarch
ovirt-image-uploader-4.0.0-1.el7ev.noarch
ovirt-engine-backend-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-tools-4.0.0.2-0.1.el7ev.noarch
rhevm-guest-agent-common-1.0.12-1.el7ev.noarch
ovirt-engine-lib-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-dwh-setup-4.0.0-2.el7ev.noarch
ovirt-log-collector-4.0.0-1.el7ev.noarch
rhevm-branding-rhev-4.0.0-0.0.master.20160531161414.el7ev.noarch
ovirt-engine-vmconsole-proxy-helper-4.0.0.2-0.1.el7ev.noarch
ovirt-host-deploy-java-1.5.0-1.el7ev.noarch
ovirt-engine-dbscripts-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-4.0.0.2-0.1.el7ev.noarch
rhev-guest-tools-iso-4.0-2.el7ev.noarch
ovirt-engine-setup-plugin-websocket-proxy-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-tools-backup-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-userportal-4.0.0.2-0.1.el7ev.noarch
rhev-release-4.0.0-12-001.noarch
ovirt-engine-setup-4.0.0.2-0.1.el7ev.noarch
ovirt-vmconsole-proxy-1.0.3-1.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
ovirt-engine-restapi-4.0.0.2-0.1.el7ev.noarch
rhevm-setup-plugins-4.0.0-1.el7ev.noarch
ovirt-engine-cli-3.6.2.0-1.el7ev.noarch                                                   
rhevm-doc-4.0.0-2.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-extensions-api-impl-4.0.0.2-0.1.el7ev.noarch
ovirt-iso-uploader-4.0.0-1.el7ev.noarch
ovirt-engine-webadmin-portal-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-dwh-4.0.0-2.el7ev.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.0.0.2-0.1.el7ev.noarch
ovirt-host-deploy-1.5.0-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.5.0-1.el7ev.noarch
Linux version 3.10.0-327.22.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon May 16 13:31:48 EDT 2016
Linux 3.10.0-327.22.1.el7.x86_64 #1 SMP Mon May 16 13:31:48 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Comment 7 Nikolai Sednev 2016-06-09 07:48:53 UTC
Created attachment 1166192 [details]
Screenshot from 2016-06-09 10:47:40.png

Comment 8 Nikolai Sednev 2016-06-09 08:03:33 UTC
Roy, I don't see the metadata being cleaned after deploying/undeploying an HE host through the engine; I still see both hosts via the CLI. Also, undeploying a host stops only the agent, while it should also stop the broker, but that is not what I see:
[root@alma03 ~]# systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; disabled; vendor preset: disabled)
   Active: failed (Result: signal) since Thu 2016-06-09 10:55:11 IDT; 7min ago
 Main PID: 48819 (code=killed, signal=KILL)

Jun 09 10:45:59 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[48819]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Found certificate common name: alma03.qa.lab.tlv.redhat.com
Jun 09 10:45:59 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[48819]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Initializing VDSM
Jun 09 10:46:07 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[48819]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Connecting the storage
Jun 09 10:46:07 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[48819]: INFO:ovirt_hosted_engine_ha.lib.storage_server.StorageServer:Connecting storage server
Jun 09 10:53:41 alma03.qa.lab.tlv.redhat.com systemd[1]: Stopping oVirt Hosted Engine High Availability Monitoring Agent...
Jun 09 10:55:11 alma03.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-agent.service stop-sigterm timed out. Killing.
Jun 09 10:55:11 alma03.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-agent.service: main process exited, code=killed, status=9/KILL
Jun 09 10:55:11 alma03.qa.lab.tlv.redhat.com systemd[1]: Stopped oVirt Hosted Engine High Availability Monitoring Agent.
Jun 09 10:55:11 alma03.qa.lab.tlv.redhat.com systemd[1]: Unit ovirt-ha-agent.service entered failed state.
Jun 09 10:55:11 alma03.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-agent.service failed.
[root@alma03 ~]# systemctl status ovirt-ha-broker -l
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; disabled; vendor preset: disabled)
   Active: active (running) since Thu 2016-06-09 10:45:55 IDT; 17min ago
 Main PID: 48585 (ovirt-ha-broker)
   CGroup: /system.slice/ovirt-ha-broker.service
           └─48585 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker --no-daemon

Jun 09 10:51:00 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[48585]: OSError: [Errno 2] No such file or directory: '/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__1/b3051ff3-9728-4ac8-a36d-4fd4c5d12869/ha_agent/hosted-engine.metadata'
Jun 09 10:51:00 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[48585]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Jun 09 10:51:03 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[48585]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Jun 09 10:51:03 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[48585]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker ERROR Failed to read metadata from /rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__1/b3051ff3-9728-4ac8-a36d-4fd4c5d12869/ha_agent/hosted-engine.metadata
                                                                     Traceback (most recent call last):
                                                                       File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 129, in get_raw_stats_for_service_type
                                                                         f = os.open(path, direct_flag | os.O_RDONLY | os.O_SYNC)
                                                                     OSError: [Errno 2] No such file or directory: '/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__1/b3051ff3-9728-4ac8-a36d-4fd4c5d12869/ha_agent/hosted-engine.metadata'
Jun 09 10:51:03 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[48585]: ERROR:ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker:Failed to read metadata from /rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__1/b3051ff3-9728-4ac8-a36d-4fd4c5d12869/ha_agent/hosted-engine.metadata
Jun 09 10:51:03 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[48585]: Traceback (most recent call last):
Jun 09 10:51:03 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[48585]: File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 129, in get_raw_stats_for_service_type
Jun 09 10:51:03 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[48585]: f = os.open(path, direct_flag | os.O_RDONLY | os.O_SYNC)
Jun 09 10:51:03 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[48585]: OSError: [Errno 2] No such file or directory: '/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__1/b3051ff3-9728-4ac8-a36d-4fd4c5d12869/ha_agent/hosted-engine.metadata'
Jun 09 10:51:03 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[48585]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
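
Until the undeploy flow stops the broker as well, the leftover broker can be shut down manually on the undeployed host, e.g.:

# Stop and disable the leftover broker so it stops polling the
# metadata file that no longer exists on the storage domain:
systemctl stop ovirt-ha-broker
systemctl disable ovirt-ha-broker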

Comment 10 Martin Sivák 2016-12-16 11:36:46 UTC
You can clean metadata using the hosted-engine --clean-metadata command.
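
For example, on the host being removed, or from any remaining HA host for a host that is no longer reachable; the --host-id and --force-clean flags are as described in the hosted-engine troubleshooting docs and worth verifying against your version:

# On the host being removed: stop the agent, then clean its own metadata.
systemctl stop ovirt-ha-agent
hosted-engine --clean-metadata

# For a host that is no longer available (e.g. host 4 / rhsqa5 above),
# run from any remaining HA host:
hosted-engine --clean-metadata --host-id=4 --force-clean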

The undeploy-on-remove-from-webadmin feature is currently in the planning stage, and we have bugs #1369827 and #1349460 tracking the UI aspect of this.

I am closing this since I do not see anything wrong: we do not have reproducer steps, and we were not able to reproduce this situation. You can reopen if you have more information about how to reproduce it, or if the newly added host never actually appeared in --vm-status (it usually takes a minute or so after the host is fully initialized before it shows up).