Bug 1349532 - Failed to add NGN 4.0 as additional hosted-engine-host via WEBUI
Summary: Failed to add NGN 4.0 as additional hosted-engine-host via WEBUI
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-host-deploy
Classification: oVirt
Component: Plugins.Hosted-Engine
Version: 1.5.0
Hardware: x86_64
OS: Linux
Severity: urgent
Priority: urgent
Target Milestone: ovirt-4.0.1
Target Release: 1.5.1
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Duplicates: 1352286
Depends On: 1354199
Blocks: 1200469 1306711
 
Reported: 2016-06-23 15:51 UTC by Nikolai Sednev
Modified: 2016-08-04 13:31 UTC
CC: 13 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2016-08-04 13:31:11 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-4.0.z+
rule-engine: blocker+
ylavi: planning_ack+
sbonazzo: devel_ack+
mavital: testing_ack+


Attachments
sosreport from host alma03 (5.97 MB, application/x-xz)
2016-06-23 15:52 UTC, Nikolai Sednev
sosreport from engine (7.02 MB, application/x-xz)
2016-06-23 15:54 UTC, Nikolai Sednev


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1263602 0 high CLOSED [RFE] Ability to set different mount options for hosted_engine nfs storage than the default 2021-06-10 11:06:51 UTC
Red Hat Bugzilla 1352286 0 unspecified CLOSED Addition of additional hosted-engine-host failed via REST. 2021-02-22 00:41:40 UTC
oVirt gerrit 60200 0 master MERGED hosted-engine: prevent a race cond. between ha-agent and vdsm 2016-07-06 12:59:19 UTC

Internal Links: 1263602 1352286

Description Nikolai Sednev 2016-06-23 15:51:32 UTC
Description of problem:
Failed to add NGN 4.0 as additional hosted-engine-host via WEBUI.

The host was added, but not as an HE host, although I chose the "Deploy" option via WEBUI.

Version-Release number of selected component (if applicable):
Engine:
rhevm-doc-4.0.0-2.el7ev.noarch
rhevm-setup-plugins-4.0.0.1-1.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-2.el7ev.noarch
rhevm-4.0.0.6-0.1.el7ev.noarch
rhev-release-4.0.0-19-001.noarch
rhevm-guest-agent-common-1.0.12-2.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-branding-rhev-4.0.0-2.el7ev.noarch
rhevm-spice-client-x86-msi-4.0-2.el7ev.noarch
rhev-guest-tools-iso-4.0-2.el7ev.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Host:
ovirt-setup-lib-1.0.2-1.el7ev.noarch
mom-0.5.4-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.0.2-1.el7ev.noarch
ovirt-vmconsole-1.0.3-1.el7ev.noarch
ovirt-host-deploy-1.5.0-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.7.0-1.el7ev.noarch
sanlock-3.2.4-1.el7.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.16.x86_64
ovirt-vmconsole-host-1.0.3-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.4.x86_64
vdsm-4.18.4-2.el7ev.x86_64
Linux version 3.10.0-327.18.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Fri Apr 8 05:09:53 EDT 2016
Linux 3.10.0-327.18.2.el7.x86_64 #1 SMP Fri Apr 8 05:09:53 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 7.2 Beta


How reproducible:
100%

Steps to Reproduce:
1. Deploy HE on an el7.2 host over NFS.
2. Try adding an NGN host via WEBUI as an additional HE host (choose "Deploy" in the hosted-engine section).


Actual results:
Addition of the NGN host as a hosted-engine host has failed.

Expected results:
The NGN host should be added successfully as an HE host.

Additional info:
Logs from NGN host and engine attached.

Comment 1 Nikolai Sednev 2016-06-23 15:52:54 UTC
Created attachment 1171575 [details]
sosreport from host alma03

Comment 2 Nikolai Sednev 2016-06-23 15:54:16 UTC
Created attachment 1171576 [details]
sosreport from engine

Comment 3 Moran Goldboim 2016-06-28 09:35:45 UTC
Martin, can you check if it's an infra bug?

Comment 4 Martin Perina 2016-06-28 13:54:05 UTC
Hmm, AFAIK new hosted-engine host can be added only using hosted-engine command line tool, right Simone?

Comment 5 Yedidyah Bar David 2016-06-28 14:02:01 UTC
(In reply to Martin Perina from comment #4)
> Hmm, AFAIK new hosted-engine host can be added only using hosted-engine
> command line tool, right Simone?

No, see bug 1167262.

Comment 6 Yaniv Lavi 2016-06-28 14:35:36 UTC
(In reply to Martin Perina from comment #4)
> Hmm, AFAIK new hosted-engine host can be added only using hosted-engine
> command line tool, right Simone?

More than that, it's the only method to deploy in 4.0. Please address this urgently.

Comment 7 Yaniv Lavi 2016-06-28 14:47:15 UTC
Is this a host deploy bug?

Comment 8 Ryan Barry 2016-06-28 14:58:30 UTC
Yaniv- 

This bug is about deploying an additional host from the engine, not from Node. Adding a host through the command line tool (and through Cockpit) works.

This appears to be a failure somewhere in ovirt-hosted-engine-setup after being invoked from the engine. It may also be an environment problem:

From the host:
jsonrpc.Executor/6::ERROR::2016-06-23 18:40:39,583::api::195::root::(_getHaInfo) failed to retrieve Hosted Engine HA info
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 174, in _getHaInfo
    stats = instance.get_all_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 105, in get_all_stats
    stats = broker.get_stats_from_storage(service)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 232, in get_stats_from_storage
    result = self._checked_communicate(request)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 260, in _checked_communicate
    .format(message or response))
RequestError: Request failed: failed to read metadata: [Errno 2] No such file or directory: '/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__1/29d459ea-989d-4127-b996-248928adf543/ha_agent/hosted-engine.metadata'

Comment 9 Yaniv Lavi 2016-06-28 15:04:41 UTC
Didi, please look into this.

Comment 10 Simone Tiraboschi 2016-06-28 15:06:09 UTC
The engine doesn't call ovirt-hosted-engine-setup at all.
If the host is enabled for hosted-engine, it will call a specific host-deploy plugin.

Comment 11 Yaniv Lavi 2016-06-28 15:10:10 UTC
So this is an SLA engine issue?

Comment 12 Simone Tiraboschi 2016-06-28 15:14:15 UTC
(In reply to Ryan Barry from comment #8)
> RequestError: Request failed: failed to read metadata: [Errno 2] No such
> file or directory:
> '/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__1/
> 29d459ea-989d-4127-b996-248928adf543/ha_agent/hosted-engine.metadata'

This is the fallback code that looks for file-based hosted-engine metadata as in 3.4, but it shouldn't trigger at all here since we have a specific metadata volume, and that seems correctly configured; in hosted-engine.conf from Nikolai's host I can see:

metadata_image_UUID=75887b27-408e-4a60-aeba-f5168d10dd01
metadata_volume_UUID=311123d6-1cb6-4e56-9b3c-354456caa1eb

The file-based fallback code should trigger only if the metadata UUIDs are missing from hosted-engine.conf.

Maybe it's a race condition, with the host-deploy plugin starting ovirt-ha-broker before writing hosted-engine.conf.
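
To illustrate the decision being described, here is a minimal sketch of the metadata-path selection (illustrative only; the function and key names are hypothetical, not the actual ovirt-hosted-engine-ha code):

import os

def metadata_path(conf, mnt_root="/rhev/data-center/mnt"):
    # Volume-backed metadata: hosted-engine.conf carries the UUIDs and
    # the broker links the volume under /var/run/vdsm/storage.
    img = conf.get("metadata_image_UUID")
    vol = conf.get("metadata_volume_UUID")
    if img and vol:
        return os.path.join("/var/run/vdsm/storage",
                            conf["sd_uuid"], img, vol)
    # Legacy 3.4-style file-based fallback, which should only be reached
    # when the UUIDs are missing from hosted-engine.conf.
    return os.path.join(mnt_root, conf["mnt_dir"], conf["sd_uuid"],
                        "ha_agent", "hosted-engine.metadata")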

Comment 13 Yaniv Lavi 2016-06-28 15:18:13 UTC
It fails every time according to Nikolai, so probably a really bad race.
Why didn't we see it on RHEL-H?

Comment 14 Simone Tiraboschi 2016-06-28 16:00:44 UTC
(In reply to Yaniv Dary from comment #13)
> It fails every time according to Nikolai, so probably a really bad race.
> Why didn't we see it on RHEL-H?

I think it's also there.

If I understood it correctly, the issue is here:

ovirt-ha-broker always tries to consume ha_agent/hosted-engine.metadata; if the metadata volume is in the configuration file, it creates a link to the volume under /var/run/vdsm/storage, and it tries that here as well:

/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__1/29d459ea-989d-4127-b996-248928adf543/ha_agent/hosted-engine.metadata -> /var/run/vdsm/storage/29d459ea-989d-4127-b996-248928adf543/75887b27-408e-4a60-aeba-f5168d10dd01/311123d6-1cb6-4e56-9b3c-354456caa1eb

The issue is that when we run hosted-engine-setup on additional hosts, it prepares all the hosted-engine volumes before exiting, so everything is ready by construction.

When we deploy from the engine instead, we just let host-deploy write the configuration files and start the services (ovirt-ha-agent and ovirt-ha-broker).

And the broker tries to create the symlinks:

Thread-4::ERROR::2016-06-28 17:27:50,750::listener::192::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Error handling request, data: 'set-storage-domain FilesystemBackend dom_type=nfs3 sd_uuid=29d459ea-989d-4127-b996-248928adf543'
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 166, in handle
    data)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 299, in _dispatch
    .set_storage_domain(client, sd_type, **options)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 66, in set_storage_domain
    self._backends[client].connect()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 456, in connect
    self._dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 108, in get_domain_path
    " in {1}".format(sd_uuid, parent))
BackendFailureException: path to storage domain 29d459ea-989d-4127-b996-248928adf543 not found in /rhev/data-center/mnt

But it fails to create them, since the storage is still not connected.

The broker was trying to access the metadata since VDSM was polling it:
jsonrpc.Executor/2::DEBUG::2016-06-28 17:27:50,740::task::995::Storage.TaskManager.Task::(_decref) Task=`d23f716d-e6cd-43e7-8c4c-f5652ce72536`::ref 0 aborting False
jsonrpc.Executor/2::ERROR::2016-06-28 17:27:50,751::api::195::root::(_getHaInfo) failed to retrieve Hosted Engine HA info
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 174, in _getHaInfo
    stats = instance.get_all_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 176, in set_storage_domain
    .format(sd_type, options, e))
RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': '29d459ea-989d-4127-b996-248928adf543'}: Request failed: <class 'ovirt_hosted_engine_ha.lib.storage_backends.BackendFailureException'>
jsonrpc.Executor/3::DEBUG::2016-06-28 17:27:51,242::__init__::522::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.ping' in bridge with {}

And VDSM was already polling it since the engine was asking for the host status.

So, if the agent is not fast enough to connect the storage, we get a few failures in a row and...

Martin, what do you think?
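
A minimal sketch of the kind of guard that would avoid this race, i.e. waiting for the storage domain to appear instead of failing on the first poll (illustrative only; this is not the patch merged in gerrit 60200, and the names are hypothetical):

import os
import time

def wait_for_domain_path(sd_uuid, mnt_root="/rhev/data-center/mnt",
                         timeout=300, interval=5):
    # Poll until the domain shows up under the mount root, so a broker
    # started before the storage is connected does not fail immediately.
    deadline = time.time() + timeout
    while time.time() < deadline:
        for parent in os.listdir(mnt_root):
            path = os.path.join(mnt_root, parent, sd_uuid)
            if os.path.isdir(path):
                return path
        time.sleep(interval)
    raise RuntimeError("path to storage domain %s not found in %s"
                       % (sd_uuid, mnt_root))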

Comment 15 Simone Tiraboschi 2016-06-28 16:08:23 UTC
Manually restarting ovirt-ha-agent and ovirt-ha-broker is enough to get the host up.
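
For reference, the workaround as a small script (a sketch assuming the standard systemd unit names for the two services):

import subprocess

# Restart the HA services so the broker re-reads hosted-engine.conf and
# reconnects now that the storage is available (workaround from comment 15).
for unit in ("ovirt-ha-broker", "ovirt-ha-agent"):
    subprocess.check_call(["systemctl", "restart", unit])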

Comment 16 Yaniv Lavi 2016-06-30 12:53:31 UTC
Any updates here?

Comment 17 Simone Tiraboschi 2016-07-05 08:10:54 UTC
The ovirt-ha-agent log ends here, and the storage is not connected.

MainThread::INFO::2016-06-23 18:33:16,665::agent::78::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.0.0 started
MainThread::INFO::2016-06-23 18:33:16,774::hosted_engine::243::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: alma03.qa.lab.tlv.redhat.com
MainThread::INFO::2016-06-23 18:33:16,783::hosted_engine::612::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_vdsm) Initializing VDSM
MainThread::INFO::2016-06-23 18:33:23,991::hosted_engine::639::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Connecting the storage
MainThread::INFO::2016-06-23 18:33:23,992::storage_server::218::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server

On the VDSM side, the connection seems to have died:
JsonRpc (StompReactor)::ERROR::2016-06-23 18:33:23,990::betterAsyncore::113::vds.dispatcher::(recv) SSL error during reading data: unexpected eof

Comment 18 Simone Tiraboschi 2016-07-05 09:13:00 UTC
Cross checking with engine logs:

host-deploy starts VDSM at XX:33:27 (the logs are not in the same timezone...)

2016-06-23 11:33:27,018 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [7cfc9910] Correlation ID: 7cfc9910, Call Stack: null, Custom Event ID: -1, Message: Installing Host alma03.qa.lab.tlv.redhat.com. Starting vdsm.

And indeed VDSM stops here:
MainThread::DEBUG::2016-06-23 18:33:26,746::vdsm::72::vds::(sigtermHandler) Received signal 15

and restarts only at:
MainThread::INFO::2016-06-23 18:33:31,716::vdsm::139::vds::(run) (PID: 19269) I am the actual vdsm 4.18.4-2.el7ev alma03.qa.lab.tlv.redhat.com (3.10.0-327.18.2.el7.x86_64)

but at that point the agent is no longer trying to connect the storage server, while the broker is already polling it; hence the issue.
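
A minimal sketch of an agent-side retry that would close this window, i.e. re-attempting the storage-server connection once VDSM comes back (illustrative only; not the actual fix from gerrit 60200):

import time

def connect_storage_with_retry(connect, vdsm_is_up, retries=10, delay=5):
    # 'connect' and 'vdsm_is_up' are hypothetical callables standing in
    # for the agent's storage-server connection and a VDSM liveness check.
    last_err = None
    for _ in range(retries):
        if not vdsm_is_up():
            time.sleep(delay)
            continue
        try:
            return connect()
        except Exception as err:  # e.g. connection dropped on VDSM restart
            last_err = err
            time.sleep(delay)
    raise RuntimeError("storage server connection failed: %s" % last_err)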

Comment 19 Roy Golan 2016-07-06 10:11:27 UTC
*** Bug 1352286 has been marked as a duplicate of this bug. ***

Comment 21 Nikolai Sednev 2016-07-13 11:45:38 UTC
Not yet ON_QA, I still have ovirt-host-deploy-1.5.0-1.el7ev.noarch.

Comment 22 Nikolai Sednev 2016-07-18 16:13:46 UTC
Tried to verify on regular el7.2 over iSCSI first and failed; I opened a new bug, https://bugzilla.redhat.com/show_bug.cgi?id=1357615, which might be related to this bug too.

Comment 23 Nikolai Sednev 2016-07-21 14:31:08 UTC
Works for me on NFS-deployed HE from rhevm-appliance-20160714.0-1.el7ev.noarch.rpm and a pair of hosts, which were provisioned from the RHVH-7.2-20160718.1-RHVH-x86_64-dvd1.iso image over virtual media.

Components on the hosts:
sanlock-3.2.4-2.el7_2.x86_64
ovirt-hosted-engine-ha-2.0.1-1.el7ev.noarch
ovirt-imageio-daemon-0.3.0-0.el7ev.noarch
ovirt-host-deploy-1.5.1-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.7.0-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.16.x86_64
mom-0.5.5-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.5.x86_64
vdsm-4.18.6-1.el7ev.x86_64
ovirt-hosted-engine-setup-2.0.1-1.el7ev.noarch
ovirt-imageio-common-0.3.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 7.2

Engine:
rhevm-doc-4.0.0-2.el7ev.noarch                  
rhev-guest-tools-iso-4.0-4.el7ev.noarch                             
rhevm-4.0.1.1-0.1.el7ev.noarch                    
rhevm-spice-client-x86-msi-4.0-2.el7ev.noarch
rhevm-branding-rhev-4.0.0-3.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-2.el7ev.noarch
rhevm-guest-agent-common-1.0.12-2.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-setup-plugins-4.0.0.1-1.el7ev.noarch
rhev-release-4.0.1-2-001.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Steps of reproduction:
1) Provisioned both hosts from Foreman to RHEL 7.2.
2) Provisioned both hosts from virtual media (aka CD-ROM) to RHVH-7.2-20160718.1-RHVH-x86_64-dvd1.iso.
3) Deployed HE from the first host on NFS via the Cockpit WEBUI, using rhevm-appliance-20160714.0-1.el7ev.noarch.rpm. I had an issue with insufficient space for extracting the OVA, so I mounted an external NFS share as a temporary place for the appliance to be decompressed to.
4) Once HE deployment succeeded on the first host, I added the second host via WEBUI as an HE host.
5) A few minutes after the host became active in the engine's WEBUI, it also received its HA score of 3400 and became a fully active HE host.

