Created attachment 1151740 [details]
sosreport of hosted engine

Description of problem:
----------------------
I tried to add a third machine to the hosted-engine cluster, and it failed several times due to network issues. I replaced the host with a new one; although its CPU type did not match, it was activated.

The web GUI shows the machines rhsqa1, rhsqa4 and rhsqa13, while the CLI (hosted-engine --vm-status) shows rhsqa1, rhsqa13 and rhsqa5 (the host which failed earlier).

[root@rhsqa5 ~]# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : rhsqa1.lab.eng.blr.redhat.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 4d6ae8f7
Host timestamp                     : 256335

--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : rhsqa13.lab.eng.blr.redhat.com
Host ID                            : 3
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : 792ac5bf
Host timestamp                     : 66755

--== Host 4 status ==--

Status up-to-date                  : False
Hostname                           : rhsqa5.lab.eng.blr.redhat.com
Host ID                            : 4
Engine status                      : unknown stale-data
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : 9bdef367
Host timestamp                     : 184434
[root@rhsqa5 ~]#

Will attach the GUI screenshot.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
3.6.5.3-0.1.el6

How reproducible:
-----------------
Tried once

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
sosreports of the hosted engine are attached.
Created attachment 1151741 [details] screenshot of GUI
Is there a way to clean metadata if the node being removed is no longer available?
Moving to the first RC, since nothing should be targeted to the second one at this point.
(In reply to Sahina Bose from comment #2)
> Is there a way to clean metadata if the node being removed is no longer
> available?

The 4.0 feature for deploying/undeploying an HE host through the engine should fix that by calling clean-metadata after cleaning the configuration file.

Sandro, I guess this means adding a call to clean-metadata in hosted-engine/configureha.py?
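To make the idea concrete, a minimal sketch of what such a call could look like is below. This is only an illustration, not the actual configureha.py code: the function name is hypothetical, and the assumption that --clean-metadata accepts a --host-id option should be verified against the hosted-engine help.

    # Hypothetical sketch only -- not the real configureha.py code.
    # Assumes hosted-engine --clean-metadata accepts a --host-id option.
    import subprocess

    def clean_he_metadata(host_id):
        # After the HE configuration file has been removed, clear this
        # host's slot in the shared metadata so the remaining hosts stop
        # reporting it in --vm-status.
        subprocess.check_call([
            'hosted-engine',
            '--clean-metadata',
            '--host-id=%d' % host_id,
        ])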
Nikolai, can you try to reproduce this on 4.0, but only deploy and undeploy the host through the engine?
(In reply to Roy Golan from comment #5)
> Nikolai, can you try to reproduce this on 4.0, but only deploy and undeploy
> the host through the engine?

Failed to reproduce. I see the same 2 hosts in 4.0 via the web UI and via the CLI.

[root@alma04 ~]# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date                  : False
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : 83ff7751
Host timestamp                     : 19177
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=19177 (Tue Jun 7 20:03:51 2016)
        host-id=1
        score=0
        maintenance=False
        state=AgentStopped
        stopped=True

--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 63e035c9
Host timestamp                     : 150391
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=150391 (Thu Jun 9 10:45:08 2016)
        host-id=2
        score=3400
        maintenance=False
        state=EngineUp
        stopped=False

Hosts:
mom-0.5.4-1.el7ev.noarch
ovirt-vmconsole-1.0.3-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
libvirt-client-1.2.17-13.el7_2.5.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.15.x86_64
vdsm-4.18.1-11.gita92976e.el7ev.x86_64
ovirt-hosted-engine-setup-2.0.0-1.el7ev.noarch
ovirt-host-deploy-1.5.0-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-vmconsole-host-1.0.3-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.5.0-1.el7ev.noarch
Red Hat Enterprise Linux Server release 7.2 (Maipo)
Linux 3.10.0-327.22.1.el7.x86_64 #1 SMP Mon May 16 13:31:48 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Linux version 3.10.0-327.22.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon May 16 13:31:48 EDT 2016

Engine:
ovirt-engine-setup-plugin-ovirt-engine-4.0.0.2-0.1.el7ev.noarch
ovirt-vmconsole-1.0.3-1.el7ev.noarch
ovirt-engine-extension-aaa-jdbc-1.1.0-1.el7ev.noarch
rhevm-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-setup-base-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-websocket-proxy-4.0.0.2-0.1.el7ev.noarch
ovirt-image-uploader-4.0.0-1.el7ev.noarch
ovirt-engine-backend-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-tools-4.0.0.2-0.1.el7ev.noarch
rhevm-guest-agent-common-1.0.12-1.el7ev.noarch
ovirt-engine-lib-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-dwh-setup-4.0.0-2.el7ev.noarch
ovirt-log-collector-4.0.0-1.el7ev.noarch
rhevm-branding-rhev-4.0.0-0.0.master.20160531161414.el7ev.noarch
ovirt-engine-vmconsole-proxy-helper-4.0.0.2-0.1.el7ev.noarch
ovirt-host-deploy-java-1.5.0-1.el7ev.noarch
ovirt-engine-dbscripts-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-4.0.0.2-0.1.el7ev.noarch
rhev-guest-tools-iso-4.0-2.el7ev.noarch
ovirt-engine-setup-plugin-websocket-proxy-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-tools-backup-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-userportal-4.0.0.2-0.1.el7ev.noarch
rhev-release-4.0.0-12-001.noarch
ovirt-engine-setup-4.0.0.2-0.1.el7ev.noarch
ovirt-vmconsole-proxy-1.0.3-1.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
ovirt-engine-restapi-4.0.0.2-0.1.el7ev.noarch
rhevm-setup-plugins-4.0.0-1.el7ev.noarch
ovirt-engine-cli-3.6.2.0-1.el7ev.noarch
rhevm-doc-4.0.0-2.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-extensions-api-impl-4.0.0.2-0.1.el7ev.noarch
ovirt-iso-uploader-4.0.0-1.el7ev.noarch
ovirt-engine-webadmin-portal-4.0.0.2-0.1.el7ev.noarch
ovirt-engine-dwh-4.0.0-2.el7ev.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.0.0.2-0.1.el7ev.noarch
ovirt-host-deploy-1.5.0-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.5.0-1.el7ev.noarch
Linux version 3.10.0-327.22.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon May 16 13:31:48 EDT 2016
Linux 3.10.0-327.22.1.el7.x86_64 #1 SMP Mon May 16 13:31:48 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)
Created attachment 1166192 [details] Screenshot from 2016-06-09 10:47:40.png
Roy, I don't see the metadata being cleaned after deploying/undeploying an HE host via the engine; I still see both hosts via the CLI. Also, undeploying the host stops only the agent, while it should stop the broker as well, but that is not what I see (see the sketch after the broker status output below):

[root@alma03 ~]# systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; disabled; vendor preset: disabled)
   Active: failed (Result: signal) since Thu 2016-06-09 10:55:11 IDT; 7min ago
 Main PID: 48819 (code=killed, signal=KILL)

Jun 09 10:45:59 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[48819]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Found certificate common name: alma03.qa.lab.tlv.redhat.com
Jun 09 10:45:59 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[48819]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Initializing VDSM
Jun 09 10:46:07 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[48819]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Connecting the storage
Jun 09 10:46:07 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[48819]: INFO:ovirt_hosted_engine_ha.lib.storage_server.StorageServer:Connecting storage server
Jun 09 10:53:41 alma03.qa.lab.tlv.redhat.com systemd[1]: Stopping oVirt Hosted Engine High Availability Monitoring Agent...
Jun 09 10:55:11 alma03.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-agent.service stop-sigterm timed out. Killing.
Jun 09 10:55:11 alma03.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-agent.service: main process exited, code=killed, status=9/KILL
Jun 09 10:55:11 alma03.qa.lab.tlv.redhat.com systemd[1]: Stopped oVirt Hosted Engine High Availability Monitoring Agent.
Jun 09 10:55:11 alma03.qa.lab.tlv.redhat.com systemd[1]: Unit ovirt-ha-agent.service entered failed state.
Jun 09 10:55:11 alma03.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-agent.service failed.
[root@alma03 ~]# systemctl status ovirt-ha-broker -l
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; disabled; vendor preset: disabled)
   Active: active (running) since Thu 2016-06-09 10:45:55 IDT; 17min ago
 Main PID: 48585 (ovirt-ha-broker)
   CGroup: /system.slice/ovirt-ha-broker.service
           └─48585 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker --no-daemon

Jun 09 10:51:00 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[48585]: OSError: [Errno 2] No such file or directory: '/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__1/b3051ff3-9728-4ac8-a36d-4fd4c5d12869/ha_agent/hosted-engine.metadata'
Jun 09 10:51:00 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[48585]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Jun 09 10:51:03 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[48585]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Jun 09 10:51:03 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[48585]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker ERROR Failed to read metadata from /rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__1/b3051ff3-9728-4ac8-a36d-4fd4c5d12869/ha_agent/hosted-engine.metadata
                                                                     Traceback (most recent call last):
                                                                       File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 129, in get_raw_stats_for_service_type
                                                                         f = os.open(path, direct_flag | os.O_RDONLY | os.O_SYNC)
                                                                     OSError: [Errno 2] No such file or directory: '/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__1/b3051ff3-9728-4ac8-a36d-4fd4c5d12869/ha_agent/hosted-engine.metadata'
Jun 09 10:51:03 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[48585]: ERROR:ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker:Failed to read metadata from /rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__1/b3051ff3-9728-4ac8-a36d-4fd4c5d12869/ha_agent/hosted-engine.metadata
Jun 09 10:51:03 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[48585]: Traceback (most recent call last):
Jun 09 10:51:03 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[48585]:   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 129, in get_raw_stats_for_service_type
Jun 09 10:51:03 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[48585]:     f = os.open(path, direct_flag | os.O_RDONLY | os.O_SYNC)
Jun 09 10:51:03 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[48585]: OSError: [Errno 2] No such file or directory: '/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__1/b3051ff3-9728-4ac8-a36d-4fd4c5d12869/ha_agent/hosted-engine.metadata'
Jun 09 10:51:03 alma03.qa.lab.tlv.redhat.com ovirt-ha-broker[48585]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
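For clarity, the service cleanup I would expect on an undeployed host is roughly the following. This is only a sketch of the expected behaviour, not what host-deploy currently does; the service names are the ones shown in the status output above.

    # Sketch of the expected service cleanup on undeploy (an assumption,
    # not the current host-deploy behaviour): stop and disable both HA
    # services, not only ovirt-ha-agent.
    import subprocess

    def stop_he_ha_services():
        for service in ('ovirt-ha-agent', 'ovirt-ha-broker'):
            subprocess.check_call(['systemctl', 'stop', service])
            subprocess.check_call(['systemctl', 'disable', service])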
You can clean the metadata using the hosted-engine --clean-metadata command; an example is below.

The undeploy-on-remove-from-webadmin feature is currently in the planning stage, and we have bugs #1369827 and #1349460 tracking the UI aspect of this.

I am closing this since I do not see anything wrong here: we do not have reproducer steps and we were not able to reproduce this situation. You can reopen it if you have more information about how to reproduce this, or if the newly added host actually appeared in --vm-status after a while (it usually takes a minute or so after the host is fully initialized).
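For example, to clear the stale entry for a host that no longer exists (host 4 in the original report), something like the following should work when run from one of the remaining HE hosts; the --host-id option is how I recall the command being invoked for another host's slot, so please check hosted-engine --help for the exact option names on your version:

    [root@rhsqa1 ~]# hosted-engine --clean-metadata --host-id=4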