Description of problem:
Failed to add an NGN 4.0 host as an additional hosted-engine host via the WebUI. The host was added, but not as a HE host, although I chose the "Deploy" option in the WebUI.

Version-Release number of selected component (if applicable):

Engine:
rhevm-doc-4.0.0-2.el7ev.noarch
rhevm-setup-plugins-4.0.0.1-1.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-2.el7ev.noarch
rhevm-4.0.0.6-0.1.el7ev.noarch
rhev-release-4.0.0-19-001.noarch
rhevm-guest-agent-common-1.0.12-2.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-branding-rhev-4.0.0-2.el7ev.noarch
rhevm-spice-client-x86-msi-4.0-2.el7ev.noarch
rhev-guest-tools-iso-4.0-2.el7ev.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Host:
ovirt-setup-lib-1.0.2-1.el7ev.noarch
mom-0.5.4-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.0.2-1.el7ev.noarch
ovirt-vmconsole-1.0.3-1.el7ev.noarch
ovirt-host-deploy-1.5.0-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.7.0-1.el7ev.noarch
sanlock-3.2.4-1.el7.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.16.x86_64
ovirt-vmconsole-host-1.0.3-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.4.x86_64
vdsm-4.18.4-2.el7ev.x86_64
Linux version 3.10.0-327.18.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Fri Apr 8 05:09:53 EDT 2016
Linux 3.10.0-327.18.2.el7.x86_64 #1 SMP Fri Apr 8 05:09:53 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 7.2 Beta

How reproducible:
100%

Steps to Reproduce:
1. Deploy HE on an el7.2 host over NFS.
2. Try adding an additional NGN host via the WebUI as an additional HE host (choose "Deploy" in the hosted-engine section).

Actual results:
Addition of the NGN host as a hosted-engine host fails.

Expected results:
The NGN host should be added as a HE host successfully.

Additional info:
Logs from the NGN host and the engine are attached.
Created attachment 1171575 [details] sosreport from host alma03
Created attachment 1171576 [details] sosreport from engine
Martin, can you check if it's an infra bug?
Hmm, AFAIK new hosted-engine host can be added only using hosted-engine command line tool, right Simone?
(In reply to Martin Perina from comment #4)
> Hmm, AFAIK new hosted-engine host can be added only using hosted-engine
> command line tool, right Simone?

No, see bug 1167262.
(In reply to Martin Perina from comment #4)
> Hmm, AFAIK new hosted-engine host can be added only using hosted-engine
> command line tool, right Simone?

More than that, it's the only method to deploy in 4.0. Please address this urgently.
Is this a host deploy bug?
Yaniv -

This bug is about deploying an additional host from the engine, not about Node itself. Adding a host through the command-line tool (and through Cockpit) works.

This appears to be a failure somewhere in ovirt-hosted-engine-setup after being invoked from the engine. It may also be an environment problem.

From the host:

jsonrpc.Executor/6::ERROR::2016-06-23 18:40:39,583::api::195::root::(_getHaInfo) failed to retrieve Hosted Engine HA info
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 174, in _getHaInfo
    stats = instance.get_all_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 105, in get_all_stats
    stats = broker.get_stats_from_storage(service)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 232, in get_stats_from_storage
    result = self._checked_communicate(request)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 260, in _checked_communicate
    .format(message or response))
RequestError: Request failed: failed to read metadata: [Errno 2] No such file or directory: '/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__1/29d459ea-989d-4127-b996-248928adf543/ha_agent/hosted-engine.metadata'
Didi, please look into this.
The engine doesn't call ovirt-hosted-engine-setup at all. If hosted-engine deployment is enabled, it calls a specific host-deploy plugin; a rough sketch of that flow follows.
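To make the distinction concrete, here is a rough, hypothetical sketch in Python (illustrative names only, not the actual ovirt-host-deploy plugin code) of what the hosted-engine part of host-deploy amounts to on the new host: write hosted-engine.conf, then enable and start the HA services. As analysed later in this thread, the race lives in the fact that the services are started before the HE storage has been prepared and connected.

import subprocess

HE_CONF = "/etc/ovirt-hosted-engine/hosted-engine.conf"

def write_he_conf(settings):
    # Write key=value pairs (host_id, storage, metadata UUIDs, ...).
    with open(HE_CONF, "w") as f:
        for key, value in sorted(settings.items()):
            f.write("{0}={1}\n".format(key, value))

def start_ha_services():
    # Enable and start the broker first, then the agent that depends on it.
    for service in ("ovirt-ha-broker", "ovirt-ha-agent"):
        subprocess.check_call(["systemctl", "enable", service])
        subprocess.check_call(["systemctl", "start", service])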
So this is a SLA engine issue?
(In reply to Ryan Barry from comment #8)
> RequestError: Request failed: failed to read metadata: [Errno 2] No such
> file or directory:
> '/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__1/
> 29d459ea-989d-4127-b996-248928adf543/ha_agent/hosted-engine.metadata'

This is the fallback code that looks for file-based hosted-engine metadata as in 3.4, but it shouldn't trigger at all here since we have a specific metadata volume, and it seems correctly configured: in hosted-engine.conf from Nikolai's host I can see:

metadata_image_UUID=75887b27-408e-4a60-aeba-f5168d10dd01
metadata_volume_UUID=311123d6-1cb6-4e56-9b3c-354456caa1eb

The file-based fallback code should trigger only if the metadata UUIDs are missing from hosted-engine.conf. Maybe it's a race condition with the host-deploy plugin starting ovirt-ha-broker before writing hosted-engine.conf.
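For readers following along, a minimal sketch of the selection being described (hypothetical helper names; the real logic lives in ovirt-hosted-engine-ha and differs in detail): prefer the dedicated metadata volume when its UUIDs are present in hosted-engine.conf, and only fall back to the 3.4-style file on the storage domain when they are missing.

import os

def read_he_conf(path="/etc/ovirt-hosted-engine/hosted-engine.conf"):
    # Parse the key=value lines of hosted-engine.conf into a dict.
    conf = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, value = line.split("=", 1)
                conf[key] = value
    return conf

def metadata_source(conf, sd_uuid, domain_mount):
    # Preferred path: the dedicated metadata volume under /var/run/vdsm/storage.
    img = conf.get("metadata_image_UUID")
    vol = conf.get("metadata_volume_UUID")
    if img and vol:
        return os.path.join("/var/run/vdsm/storage", sd_uuid, img, vol)
    # Legacy 3.4-style fallback: the metadata file on the storage domain itself.
    return os.path.join(domain_mount, sd_uuid, "ha_agent", "hosted-engine.metadata")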
It fails every time according to Nikolai, so probably a really bad race. Why didn't we see it on RHEL-H?
(In reply to Yaniv Dary from comment #13)
> It fails every time according to Nikolai, so probably a really bad race.
> Why didn't we see it on RHEL-H?

I think it's also there.

If I understood it correctly, the issue is here: ovirt-ha-broker always tries to consume ha_agent/hosted-engine.metadata; if we have the metadata volume in the configuration file, it creates a link to the volume under /var/run/vdsm/storage, and it tries that here as well:

/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__1/29d459ea-989d-4127-b996-248928adf543/ha_agent/hosted-engine.metadata -> /var/run/vdsm/storage/29d459ea-989d-4127-b996-248928adf543/75887b27-408e-4a60-aeba-f5168d10dd01/311123d6-1cb6-4e56-9b3c-354456caa1eb

The difference is that when we run hosted-engine-setup on additional hosts, hosted-engine-setup prepares all the hosted-engine volumes before exiting, so everything is ready by construction. When we deploy from the engine instead, we just let host-deploy write the configuration files and start the services (ovirt-ha-agent and ovirt-ha-broker). The broker then tries to create the symlinks:

Thread-4::ERROR::2016-06-28 17:27:50,750::listener::192::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Error handling request, data: 'set-storage-domain FilesystemBackend dom_type=nfs3 sd_uuid=29d459ea-989d-4127-b996-248928adf543'
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 166, in handle
    data)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 299, in _dispatch
    .set_storage_domain(client, sd_type, **options)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 66, in set_storage_domain
    self._backends[client].connect()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 456, in connect
    self._dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 108, in get_domain_path
    " in {1}".format(sd_uuid, parent))
BackendFailureException: path to storage domain 29d459ea-989d-4127-b996-248928adf543 not found in /rhev/data-center/mnt

But it fails to create them since the storage is still not connected.
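A minimal sketch of the symlink described above and of why its creation fails here (hypothetical names, not the actual ovirt-ha-broker code): the storage-domain mount under /rhev/data-center/mnt only appears once the storage is connected, so creating the link right after host-deploy starts the services races with the agent connecting the storage.

import glob
import os

def find_domain_path(sd_uuid, parent="/rhev/data-center/mnt"):
    # Locate the storage domain's mount point; fails while it is not connected.
    matches = glob.glob(os.path.join(parent, "*", sd_uuid))
    if not matches:
        raise RuntimeError("path to storage domain {0} not found in {1}".format(sd_uuid, parent))
    return matches[0]

def link_metadata(sd_uuid, md_image_uuid, md_volume_uuid):
    # Create ha_agent/hosted-engine.metadata pointing at the metadata volume.
    dom_path = find_domain_path(sd_uuid)  # raises until the storage is connected
    target = os.path.join("/var/run/vdsm/storage", sd_uuid, md_image_uuid, md_volume_uuid)
    link_dir = os.path.join(dom_path, "ha_agent")
    if not os.path.isdir(link_dir):
        os.makedirs(link_dir)
    link = os.path.join(link_dir, "hosted-engine.metadata")
    if not os.path.islink(link):
        os.symlink(target, link)
    return link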
The broker was trying to access the metadata because VDSM was polling it:

jsonrpc.Executor/2::DEBUG::2016-06-28 17:27:50,740::task::995::Storage.TaskManager.Task::(_decref) Task=`d23f716d-e6cd-43e7-8c4c-f5652ce72536`::ref 0 aborting False
jsonrpc.Executor/2::ERROR::2016-06-28 17:27:50,751::api::195::root::(_getHaInfo) failed to retrieve Hosted Engine HA info
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 174, in _getHaInfo
    stats = instance.get_all_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 176, in set_storage_domain
    .format(sd_type, options, e))
RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': '29d459ea-989d-4127-b996-248928adf543'}: Request failed: <class 'ovirt_hosted_engine_ha.lib.storage_backends.BackendFailureException'>
jsonrpc.Executor/3::DEBUG::2016-06-28 17:27:51,242::__init__::522::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.ping' in bridge with {}

And VDSM was already polling it because the engine was asking for the host status. So, if the agent is not fast enough to connect the storage, we get a few failures in a row and...

Martin, what do you think?
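As an illustration only (a hypothetical helper, not the fix that actually shipped), one way a poller could tolerate the window in which ovirt-ha-broker is up but the HE storage domain is not yet connected is to retry the stats query with a delay instead of failing on the first RequestError:

import time

def get_ha_stats_with_retry(fetch_stats, attempts=6, delay=5.0):
    # fetch_stats is any callable that raises on failure, e.g. a wrapper
    # around HAClient().get_all_stats(); retry while the storage comes up.
    last_error = None
    for _ in range(attempts):
        try:
            return fetch_stats()
        except Exception as err:  # in real code, catch only the broker's RequestError
            last_error = err
            time.sleep(delay)
    raise last_error

# Usage sketch:
# stats = get_ha_stats_with_retry(lambda: ha_client.get_all_stats())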
Manually restarting ovirt-ha-agent and ovirt-ha-broker is enough to get the host up.
Any updates here?
The ovirt-ha-agent log ends here and the storage is not connected:

MainThread::INFO::2016-06-23 18:33:16,665::agent::78::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.0.0 started
MainThread::INFO::2016-06-23 18:33:16,774::hosted_engine::243::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: alma03.qa.lab.tlv.redhat.com
MainThread::INFO::2016-06-23 18:33:16,783::hosted_engine::612::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_vdsm) Initializing VDSM
MainThread::INFO::2016-06-23 18:33:23,991::hosted_engine::639::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Connecting the storage
MainThread::INFO::2016-06-23 18:33:23,992::storage_server::218::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server

On the VDSM side the connection seems to have died:

JsonRpc (StompReactor)::ERROR::2016-06-23 18:33:23,990::betterAsyncore::113::vds.dispatcher::(recv) SSL error during reading data: unexpected eof
Cross-checking with the engine logs: host-deploy starts VDSM at XX:33:27 (the logs are not in the same timezone...):

2016-06-23 11:33:27,018 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [7cfc9910] Correlation ID: 7cfc9910, Call Stack: null, Custom Event ID: -1, Message: Installing Host alma03.qa.lab.tlv.redhat.com. Starting vdsm.

And indeed VDSM stops here:

MainThread::DEBUG::2016-06-23 18:33:26,746::vdsm::72::vds::(sigtermHandler) Received signal 15

and restarts only at:

MainThread::INFO::2016-06-23 18:33:31,716::vdsm::139::vds::(run) (PID: 19269) I am the actual vdsm 4.18.4-2.el7ev alma03.qa.lab.tlv.redhat.com (3.10.0-327.18.2.el7.x86_64)

But at that point the agent is no longer trying to connect the storage server, and the broker is already polling; hence the issue.
*** Bug 1352286 has been marked as a duplicate of this bug. ***
Not yet ON_QA, I still have ovirt-host-deploy-1.5.0-1.el7ev.noarch.
Tried to verify on a regular el7.2 host over iSCSI first and failed; opened a new bug, https://bugzilla.redhat.com/show_bug.cgi?id=1357615, which might be related to this bug too.
Works for me on an NFS-deployed HE from rhevm-appliance-20160714.0-1.el7ev.noarch.rpm and a pair of hosts, which were provisioned from the RHVH-7.2-20160718.1-RHVH-x86_64-dvd1.iso image over virtual media.

Components on the hosts:
sanlock-3.2.4-2.el7_2.x86_64
ovirt-hosted-engine-ha-2.0.1-1.el7ev.noarch
ovirt-imageio-daemon-0.3.0-0.el7ev.noarch
ovirt-host-deploy-1.5.1-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.7.0-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.16.x86_64
mom-0.5.5-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.5.x86_64
vdsm-4.18.6-1.el7ev.x86_64
ovirt-hosted-engine-setup-2.0.1-1.el7ev.noarch
ovirt-imageio-common-0.3.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 7.2

Engine:
rhevm-doc-4.0.0-2.el7ev.noarch
rhev-guest-tools-iso-4.0-4.el7ev.noarch
rhevm-4.0.1.1-0.1.el7ev.noarch
rhevm-spice-client-x86-msi-4.0-2.el7ev.noarch
rhevm-branding-rhev-4.0.0-3.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-2.el7ev.noarch
rhevm-guest-agent-common-1.0.12-2.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-setup-plugins-4.0.0.1-1.el7ev.noarch
rhev-release-4.0.1-2-001.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Steps of reproduction:
1) Provisioned both hosts from Foreman to RHEL 7.2.
2) Provisioned both hosts from virtual media (aka CD-ROM) to RHVH-7.2-20160718.1-RHVH-x86_64-dvd1.iso.
3) Deployed HE on NFS from the first host via the Cockpit WebUI, using rhevm-appliance-20160714.0-1.el7ev.noarch.rpm. I had an issue with insufficient space for extracting the OVA, so I mounted an external NFS share as a temporary place for the appliance to be decompressed to.
4) Once HE deployment succeeded on the first host, I added the second host via the WebUI as a HE host.
5) A few minutes after the host became active in the engine's WebUI, it also received its HA score of 3400 and became a fully active HE host.