Bug 1512534 - SHE deployment takes too much time and appears to be stuck.
Summary: SHE deployment takes too much time and appears to be stuck.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: General
Version: ---
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.2.0
Assignee: Piotr Kliczewski
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks: 1267807 1314377 1454308 1455169 1458711 1478848 1511094 1522641 1524119
 
Reported: 2017-11-13 12:59 UTC by Nikolai Sednev
Modified: 2019-04-28 14:02 UTC
CC List: 11 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-12-20 10:58:14 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-4.2+
rule-engine: blocker+


Attachments
sosreport from alma04 (9.64 MB, application/x-xz) - 2017-11-13 13:22 UTC, Nikolai Sednev
cockpit_he.png (112.53 KB, image/png) - 2017-12-08 05:08 UTC, Yihui Zhao
deploy.log (681.53 KB, text/plain) - 2017-12-08 05:11 UTC, Yihui Zhao
vdsm.log (7.10 MB, text/plain) - 2017-12-11 02:14 UTC, Yihui Zhao
logs from purple-vds1 (10.41 MB, application/x-xz) - 2017-12-13 16:30 UTC, Nikolai Sednev
logs from purple-vds2 (9.36 MB, application/x-xz) - 2017-12-13 17:43 UTC, Nikolai Sednev
HE_stuck.png (87.54 KB, image/png) - 2017-12-14 08:57 UTC, Yihui Zhao


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 84424 0 master MERGED jsonrpc: let gc to collect client object 2020-06-23 14:33:04 UTC
oVirt gerrit 84425 0 master ABANDONED client: make sure to load certs in sslsocket 2020-06-23 14:33:04 UTC
oVirt gerrit 84429 0 master ABANDONED jsonrpc: force client to reload certs on reconnect 2020-06-23 14:33:04 UTC
oVirt gerrit 84794 0 master MERGED timeout: fixing spelling mistake 2020-06-23 14:33:04 UTC
oVirt gerrit 85416 0 master MERGED json rpc: loop until ping2 is successful on the new connection 2020-06-23 14:33:04 UTC

Description Nikolai Sednev 2017-11-13 12:59:35 UTC
SHE deployment takes too much time and appears to be stuck.
Description of problem:
[ INFO  ] Connecting to Engine
[ INFO  ] Waiting for the host to become operational in the engine. This may take several minutes...
[ INFO  ] Still waiting for VDSM host to become operational...
[ INFO  ] Still waiting for VDSM host to become operational...
[ INFO  ] Still waiting for VDSM host to become operational...
[ INFO  ] The VDSM Host is now operational

According to Simone, it looks like a VDSM JSON-RPC client issue.
2017-11-13 14:45:40,614+0200 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 VDSM jsonrpc connection is not ready

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-2.2.0-0.0.master.20171110120732.git35143a6.el7.centos.noarch

How reproducible:
100%

Steps to Reproduce:
1.Deploy SHE on one host over NFS.

Actual results:
Deployment takes too much time and appears to get stuck at "The VDSM Host is now operational".

Expected results:
Deployment should run much faster.

Additional info:
Sosreport from host is attached.

Comment 1 Simone Tiraboschi 2017-11-13 13:05:11 UTC
It's a regression in the VDSM Python jsonrpc client, which no longer raises stomp.Disconnected when the connection is not valid:
https://gerrit.ovirt.org/#/c/83155/4/lib/yajsonrpc/stompreactor.py@484

2017-11-13 13:45:37,603+0200 DEBUG otopi.context context._executeMethod:128 Stage closeup METHOD otopi.plugins.gr_he_setup.system.vdsmenv.Plugin._closeup
2017-11-13 14:00:37,604+0200 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 VDSM jsonrpc connection is not ready
2017-11-13 14:15:38,608+0200 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 VDSM jsonrpc connection is not ready
2017-11-13 14:30:39,611+0200 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 VDSM jsonrpc connection is not ready
2017-11-13 14:45:40,614+0200 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 VDSM jsonrpc connection is not ready

Comment 2 Nikolai Sednev 2017-11-13 13:22:31 UTC
Created attachment 1351569 [details]
sosreport from alma04

Comment 3 Irit Goihman 2017-11-14 13:36:38 UTC
I'm not sure the problem is a regression in the VDSM jsonrpc client.
I agree that we should add the exception again, but it seems strange to me that connecting/reconnecting takes such a long time.
Can you give me more details on VDSM availability? When did vdsmd start to run compared to the first connection attempt? Was it up or down during the attempts to communicate with it?
Also, it would be great if you could give me steps to reproduce this issue.

Comment 4 Nikolai Sednev 2017-11-14 13:44:43 UTC
(In reply to Irit Goihman from comment #3)
> I'm not sure the problem is a regression in VDSM jsonrpc client.
> I agree that we should add the exception again but it seems strange to me
> that connecting/reconnecting takes such a long time.
> Can you give me more details on VDSM availability? When did vdsmd start to
> run compared to the first connection attempt? Was it up or down during the
> attempts to communicate with it?
> Also, it will be great if you give me steps to reproduce this issue.

https://bugzilla.redhat.com/show_bug.cgi?id=1512534#c0
Steps to Reproduce:
1.Deploy SHE on one host over NFS.

You should simply see the deployment get stuck at "[ INFO  ] The VDSM Host is now operational" forever.

Comment 5 Simone Tiraboschi 2017-11-14 14:07:45 UTC
Irit,
in the current flow hosted-engine-setup directly uses vdsm to create a storage domain, a management bridge and so on.
Since no engine exists yet at bootstrap, hosted-engine-setup creates temporary bootstrap certs for VDSM.
Once the engine is available, hosted-engine-setup calls host.add on the REST API; this triggers host-deploy, which replaces the vdsm certs with ones signed by the engine CA and restarts vdsmd to make them effective.

At the same time, hosted-engine-setup is using vdsm, through a json-rpc connection via the Python client, to monitor the host status.
So, when host-deploy restarts vdsmd, the connection gets dropped, but if we miss the disconnected exception, hosted-engine-setup doesn't realize it and keeps looping over the dead connection for a long time; by default we have a 900-second timeout.
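A rough sketch of this monitoring loop, just to illustrate the failure mode; every name here (client, reconnect, Host.getStats, Disconnected) is a simplified stand-in, not the actual hosted-engine-setup or vdsm code:

import time

COMMAND_TIMEOUT = 900   # the long per-command timeout discussed in this bug
POLL_DELAY = 5


class Disconnected(Exception):
    """Stand-in for the exception the stomp client used to raise."""


def monitor_host(client, reconnect):
    while True:
        try:
            # If host-deploy restarted vdsmd and no Disconnected is raised,
            # this call just sits on the dead connection until the whole
            # 900-second timeout expires.
            stats = client.call('Host.getStats', timeout=COMMAND_TIMEOUT)
            if stats.get('operational'):
                return stats
        except Disconnected:
            # With the exception in place, setup can rebuild the connection
            # right away instead of waiting out the timeout.
            client = reconnect()
        time.sleep(POLL_DELAY)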

Comment 6 Irit Goihman 2017-11-16 13:47:20 UTC
So, hosted-engine-setup loops over a command and gets an indication about VDSM availability from an exception raised in the code.
Now that we have added a reconnect mechanism, we don't need to raise the Disconnected exception anywhere, since we assume we can reconnect at any time during the timeout.

The situation described here is that VDSM is expected not to be available and you keep polling it with vdsm client commands. Is there a reason for such a long timeout? We're using 90 seconds for slow commands; what is the reason for setting 900 seconds?
I don't think the way to solve this issue is to add the exception back, but rather to get an indication about VDSM availability in other ways.

Comment 7 Simone Tiraboschi 2017-11-20 08:46:17 UTC
(In reply to Irit Goihman from comment #6)
> So, hosted-engine-setup loops over a command and gets indication about VDSM
> availability from an exception that raises in the code.
> Now, when we added reconnect mechanism, we don't need to raise Disconnected
> exception anywhere since we assume we can reconnect anytime during the
> timeout.

Reconnecting on the same command would also be fine, but in this case it is not even reconnecting.

> The situation described here is that VDSM is expected not to be available
> and you keep polling it with vdsm client commands. 

host-deploy will restart vdsm while hosted-engine-setup is using it to monitor the host, but vdsm is expected to be available again after host-deploy.

> Is there a reason for
> such a long timeout? We're using 90 seconds for slow commands, what is the
> reason for setting 900 seconds?

We set such a long timeout since we have long-running tasks on the storage side, but I agree that 900 seconds is really too long.

> I don't think the way of solving this issue is adding the exception back,
> but get an indication about VDSM availability in other ways.

Comment 8 Piotr Kliczewski 2017-11-20 09:52:22 UTC
(In reply to Simone Tiraboschi from comment #7)
> (In reply to Irit Goihman from comment #6)
> > So, hosted-engine-setup loops over a command and gets indication about VDSM
> > availability from an exception that raises in the code.
> > Now, when we added reconnect mechanism, we don't need to raise Disconnected
> > exception anywhere since we assume we can reconnect anytime during the
> > timeout.
> 
> Also reconnecting on the same command will be fine but in this case is
> neither reconnecting.

We use at-most-once delivery semantics, so there can be situations where we don't get an answer at all and need to retry.

> 
> > The situation described here is that VDSM is expected not to be available
> > and you keep polling it with vdsm client commands. 
> 
> host-deploy will restart vdsm while hosted-engine-setup is using it to
> monitor the host, but vdsm is expected to be available also after
> host-deploy.
> 
> > Is there a reason for
> > such a long timeout? We're using 90 seconds for slow commands, what is the
> > reason for setting 900 seconds?
> 
> We set such a long timeout since we have long tasks on storage side but I
> agree that 900 is really too long.

I think that we should set the timeout based on the verb being called and not use such a long timeout for everything.

> 
> > I don't think the way of solving this issue is adding the exception back,
> > but get an indication about VDSM availability in other ways.

A user does not need to know about the state of the connection; they should just use the client code.
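A minimal sketch of the per-verb timeout idea mentioned above; the verb names and values are only examples, not vdsm's actual configuration:

DEFAULT_TIMEOUT = 90          # the "slow command" timeout mentioned above
LONG_VERB_TIMEOUTS = {
    # only long-running storage verbs keep the long timeout
    'StorageDomain.create': 900,
    'StoragePool.connectStorageServer': 900,
}


def timeout_for(verb):
    return LONG_VERB_TIMEOUTS.get(verb, DEFAULT_TIMEOUT)


# e.g. timeout_for('Host.ping2') -> 90, timeout_for('StorageDomain.create') -> 900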

Comment 9 Martin Perina 2017-11-22 13:55:01 UTC
It's not possible to reload certificate content once those certificates are already in use in Python, so automatic client reconnection is not possible during HE installation when the VDSM certificates are changed by host-deploy. For this reason the change has to be done inside the HE code: remove the client and create it again.
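A sketch of that approach (helper names are illustrative): drop the old client so its SSL state can be garbage collected, then build a fresh one that loads the current certs from disk.

def recreate_vdsm_client(old_client, create_client):
    """Replace a client whose SSL context was built from the bootstrap certs."""
    try:
        old_client.close()
    except Exception:
        pass                    # the old connection may already be dead
    # Dropping the last reference lets the garbage collector release the
    # old SSL state; the new client reads the engine-signed certs from
    # disk when it is constructed.
    del old_client
    return create_client()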

Comment 10 Martin Perina 2017-11-24 08:37:53 UTC
Moving to the HE component as we are not able to fix the issue within the jsonrpc client.

Comment 11 Simone Tiraboschi 2017-11-27 21:28:12 UTC
Still not 100% safe:

Look here:
http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/1711/console

17:52:23 [ INFO  ] The VDSM Host is now operational
18:07:26 [ INFO  ] Saving hosted-engine configuration on the shared storage domain

If we are lucky and only ping2 got interrupted by host-deploy restarting vdsm, we pay just a 1-second delay, but otherwise we still pay the whole 900-second timeout.

Comment 12 Piotr Kliczewski 2017-11-28 09:25:24 UTC
The issue was fixed and now the client is able to connect after one attempt, as you can see in the log. The timeout you mentioned is 1 second of waiting for a response. The 900-second timeout you mentioned is still set here [1]. I think there is a broader issue with handling timeouts: we should set this timeout only for the verbs that require it.

[1] https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/lib/util.py#L378

Comment 13 Simone Tiraboschi 2017-11-28 11:09:21 UTC
(In reply to Piotr Kliczewski from comment #12)
> The issue was fixed and now the client is able to connect after one attempt
> as you can see in the log. The timeout you mentioned is 1 second waiting for
> a response. 

Yes, but in that case, for some strange reason, we still paid the whole 900-second timeout on reconnect although we were not sending any command.

From
http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/1711/artifact/exported-artifacts/test_logs/he-basic-suite-master/post-002_bootstrap.py/lago-he-basic-suite-master-host0/_var_log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20171127124214-93e1v7.log

we see that it spent 900 seconds between:
2017-11-27 12:52:24,647-0500 DEBUG otopi.context context.dumpEnvironment:835 ENVIRONMENT DUMP - END
2017-11-27 12:52:24,648-0500 DEBUG otopi.context context._executeMethod:128 Stage closeup METHOD otopi.plugins.gr_he_setup.system.vdsmenv.Plugin._closeup
2017-11-27 13:07:24,649-0500 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 VDSM jsonrpc connection is not ready
2017-11-27 13:07:24,651-0500 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 Creating a new json-rpc connection to VDSM

but in otopi.plugins.gr_he_setup.system.vdsmenv we have just:
https://github.com/oVirt/ovirt-hosted-engine-setup/blob/master/src/plugins/gr-he-setup/system/vdsmenv.py#L173

to have the vdsm client refresh the certs, without issuing any new command here.


So I think that we fixed the timeout for ping2 test on 
https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/lib/util.py#L413

but after that we are still probably going to pay the whole timeout on:
https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/lib/util.py#L442
calling
https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/lib/util.py#L385
when ping2 failed due to the 1-second timeout.

So I think this specific sub-issue occurs when ping2 fails due to the 1-second timeout but vdsm is still not ready when we retry the connection; in that case we still pay the whole 900-second timeout.

In my opinion, raising something when the json-rpc client fails due to old certs, instead of simply waiting out the timeout, would be a more robust fix.
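A minimal sketch of that alternative, assuming we could fail fast on the handshake error; the exception name is hypothetical and this is not the actual code:

import ssl


class VdsmConnectionRejected(Exception):
    """Hypothetical marker exception for a failed TLS handshake."""


def connect_or_raise(create_connection):
    try:
        return create_connection()
    except ssl.SSLError as e:
        # Fail immediately and let the caller rebuild the client with
        # fresh certs, instead of silently retrying until the 900-second
        # timeout expires.
        raise VdsmConnectionRejected(str(e))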

Comment 14 Piotr Kliczewski 2017-11-28 11:16:13 UTC
(In reply to Simone Tiraboschi from comment #13)
> (In reply to Piotr Kliczewski from comment #12)
> > The issue was fixed and now the client is able to connect after one attempt
> > as you can see in the log. The timeout you mentioned is 1 second waiting for
> > a response. 
> 
> Yes, but in that case for some strange reason we still payed the whole 900
> seconds timeout on reconnect although we were not sending any command.
> 
> From
> http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-
> tests_manual/1711/artifact/exported-artifacts/test_logs/he-basic-suite-
> master/post-002_bootstrap.py/lago-he-basic-suite-master-host0/_var_log/ovirt-
> hosted-engine-setup/ovirt-hosted-engine-setup-20171127124214-93e1v7.log
> 
> we see that it spent 900 seconds between:
> 2017-11-27 12:52:24,647-0500 DEBUG otopi.context context.dumpEnvironment:835
> ENVIRONMENT DUMP - END
> 2017-11-27 12:52:24,648-0500 DEBUG otopi.context context._executeMethod:128
> Stage closeup METHOD otopi.plugins.gr_he_setup.system.vdsmenv.Plugin._closeup
> 2017-11-27 13:07:24,649-0500 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv
> util.__log_debug:374 VDSM jsonrpc connection is not ready
> 2017-11-27 13:07:24,651-0500 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv
> util.__log_debug:374 Creating a new json-rpc connection to VDSM
> 
> but in otopi.plugins.gr_he_setup.system.vdsmenv we have just:
> https://github.com/oVirt/ovirt-hosted-engine-setup/blob/master/src/plugins/
> gr-he-setup/system/vdsmenv.py#L173
> 
> to have the vdsm client refreshing the certs without any new command here.
> 
> 
> So I think that we fixed the timeout for ping2 test on 
> https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/
> ovirt_hosted_engine_ha/lib/util.py#L413
> 
> but after that we are still probably going to pay the whole timeout on:
> https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/
> ovirt_hosted_engine_ha/lib/util.py#L442
> calling
> https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/
> ovirt_hosted_engine_ha/lib/util.py#L385
> when the 1 second timed out ping2 failed.
> 
> So I think that this specific sub issue occurs when ping2 fails due to the 1
> second timeout but vdsm is still not ready when we retry the connection and
> in that case we are still going to pay the whole 900 seconds timeout.

There is a spelling mistake in the timeout param. Please see here:
https://github.com/oVirt/vdsm/blob/master/lib/vdsm/client.py#L257

> 
> On my opinion raising something when the json rpc client fails due to old
> certs instead than simply waiting the timeout would be a more robust fix.

This is not possible because we only see a handshake failure and we do not know the reason. Overall, raising goes against the reconnect logic, so I do not think it is a good idea.
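To illustrate how a misspelled timeout parameter can go unnoticed when options travel through a kwargs dict (a simplified example, not the actual vdsm client code linked above):

DEFAULT_TIMEOUT = 900


def call(verb, **kwargs):
    # A typo such as 'timout' never matches this lookup, so the default
    # is used and the caller's shorter timeout has no effect.
    timeout = kwargs.get('timeout', DEFAULT_TIMEOUT)
    return verb, timeout


print(call('Host.ping2', timout=1))   # -> ('Host.ping2', 900)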

Comment 15 Nikolai Sednev 2017-11-28 16:36:17 UTC
Deployment over Gluster worked for me with these components:
ovirt-hosted-engine-setup-2.2.0-0.0.master.20171127231205.git7c7ab44.el7.centos.noarch
ovirt-hosted-engine-ha-2.2.0-0.0.master.20171122155227.20171122155225.gitbc3ec09.el7.centos.noarch
ovirt-engine-appliance-4.2-20171127.1.el7.centos.noarch

Comment 16 Nikolai Sednev 2017-11-30 15:04:36 UTC
Deployed using:
ovirt-hosted-engine-ha-2.2.0-0.0.master.20171128125909.20171128125907.gitfa5daa6.el7.centos.noarch
ovirt-hosted-engine-setup-2.2.0-0.0.master.20171129192644.git440040c.el7.centos.noarch
ovirt-engine-appliance-4.2-20171129.1.el7.centos.noarch

Works for me now; it might be moved to VERIFIED.

Comment 17 Simone Tiraboschi 2017-12-07 09:07:55 UTC
*** Bug 1522641 has been marked as a duplicate of this bug. ***

Comment 18 Yihui Zhao 2017-12-08 05:08:25 UTC
Created attachment 1364625 [details]
cockpit_he.png

Comment 19 Yihui Zhao 2017-12-08 05:11:42 UTC
Created attachment 1364627 [details]
deploy.log

Comment 20 Yihui Zhao 2017-12-08 05:13:09 UTC
Hit this issue again.

Test version:
ovirt-hosted-engine-setup-2.2.0-2.el7ev.noarch
ovirt-hosted-engine-ha-2.2.0-1.el7ev.noarch
cockpit-ovirt-dashboard-0.11.1-0.6.el7ev.noarch
rhvh-4.2.0.5-0.20171207.0+1

Actual results:
https://bugzilla.redhat.com/attachment.cgi?id=1364625

Deploy log:
https://bugzilla.redhat.com/attachment.cgi?id=1364627

Comment 21 cshao 2017-12-08 06:28:13 UTC
Change bug to ASSIGNED status due to QE still met this issue with RHVH 4.2 Beta build.

pkliczew,
Could you please help to check this issue ASAP?

Thanks.

Comment 22 Piotr Kliczewski 2017-12-08 09:57:53 UTC
I would not mix this issue with not being able to connect to vdsm at all. I looked at the deploy logs referenced in comment #20; as you can see, the timeouts are OK and the change is applied:

2017-12-08 11:27:30,699+0800 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 VDSM jsonrpc connection is not ready
2017-12-08 11:27:30,699+0800 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 Creating a new json-rpc connection to VDSM
2017-12-08 11:27:30,701+0800 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 Waiting for VDSM to connect
2017-12-08 11:27:31,704+0800 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 Waiting for VDSM to connect

In the end the client failed to connect, which is different behavior. I need the vdsm logs to understand why this occurred.

Comment 23 cshao 2017-12-08 11:21:39 UTC
@yzhao
Could you please upload vdsm log for debug?

Comment 24 Piotr Kliczewski 2017-12-08 15:46:42 UTC
From the provided logs the failure looks unrelated; moving back to ON_QA.

Comment 25 Nikolai Sednev 2017-12-10 09:55:31 UTC
Moving to VERIFIED per comments #15 and #16, as this is not reproducible anymore on regular RHEL 7.4 SHE deployments.
Please reopen if it still happens on RHEVH.

Comment 26 Yihui Zhao 2017-12-11 02:14:26 UTC
Created attachment 1365751 [details]
vdsm.log

Comment 27 Piotr Kliczewski 2017-12-11 09:07:41 UTC
Based on the provided log, vdsm was stopped at:

2017-12-08 11:17:20,836+0800 INFO  (MainThread) [vds] Received signal 15, shutting down (vdsmd:67)

and started at:

2017-12-08 11:27:48,714+0800 INFO  (MainThread) [vds] (PID: 30140) I am the actual vdsm 4.20.9-1.el7ev dhcp-8-176.nay.redhat.com (3.10.0-693.11.1.el7.x86_64) (vdsmd:148)

and `RuntimeError: Couldn't  connect to VDSM within 15 seconds` was at

2017-12-08 11:27:45,739+0800 DEBUG otopi.context context._executeMethod:143 method exception

It is important to note that vdsm entered recovery mode, so startup took much longer; recovery ended at:

2017-12-08 13:37:37,130+0800 INFO  (vmrecovery) [vds] recovery: completed in 7788s (clientIF:569)

The question is why vdsm entered recovery. In typical HE scenarios it should not happen.

Comment 28 Yihui Zhao 2017-12-11 09:57:39 UTC
(In reply to Piotr Kliczewski from comment #27)
> Based on provided log vdsm was stopped at:
> 
> 2017-12-08 11:17:20,836+0800 INFO  (MainThread) [vds] Received signal 15,
> shutting down (vdsmd:67)
> 
> and started at:
> 
> 2017-12-08 11:27:48,714+0800 INFO  (MainThread) [vds] (PID: 30140) I am the
> actual vdsm 4.20.9-1.el7ev dhcp-8-176.nay.redhat.com
> (3.10.0-693.11.1.el7.x86_64) (vdsmd:148)
> 
> and `RuntimeError: Couldn't  connect to VDSM within 15 seconds` was at
> 
> 2017-12-08 11:27:45,739+0800 DEBUG otopi.context context._executeMethod:143
> method exception
> 
> It is important to note that vdsm entered recovery mode so it was much
> longer and ended at:
> 
> 2017-12-08 13:37:37,130+0800 INFO  (vmrecovery) [vds] recovery: completed in
> 7788s (clientIF:569)
> 
> The question is why vdsm entered recovery. In typical HE scenarios it should
> not happen.

I deployed again and hit this issue as well; vdsm entered recovery.

From the vdsm.log

2017-12-11 04:00:06,170-0500 ERROR (periodic/0) [virt.periodic.Operation] <vdsm.virt.sampling.VMBulkstatsMonitor object at 0x2acd710> operation failed (periodic:215)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 213, in __call__
    self._func()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 522, in __call__
    self._send_metrics()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 538, in _send_metrics
    vm_sample.interval)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 45, in produce
    networks(vm, stats, first_sample, last_sample, interval)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 322, in networks
    if nic.name.startswith('hostdev'):

Comment 29 Nikolai Sednev 2017-12-11 10:02:21 UTC
This is RHEVH-specific now; it's working just fine on regular RHEL 7.4 hosts per https://bugzilla.redhat.com/show_bug.cgi?id=1512534#c25. I suggest opening a separate bug on this as RHEVH-specific.
RHEVH is also blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1516113.

Comment 33 Nikolai Sednev 2017-12-13 16:15:49 UTC
I've reproduced the original issue on purple-vds1; that host is a bit older and probably a bit slower. I got "[ ERROR ] Failed to execute stage 'Environment setup': Couldn't  connect to VDSM within 15 seconds" on it, although on the other two pairs of hosts this is not happening anymore.
Reopening this bug as reproduced.
Please see the attachment.
ovirt-hosted-engine-ha-2.2.1-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.1-1.el7ev.noarch
rhvm-appliance-4.2-20171207.0.el7.noarch

Comment 34 Nikolai Sednev 2017-12-13 16:30:39 UTC
Created attachment 1367494 [details]
logs from purple-vds1

Comment 35 Martin Perina 2017-12-13 17:38:41 UTC
Piotr, could you please take a look at latest logs?

Comment 36 Nikolai Sednev 2017-12-13 17:41:15 UTC
I've just tried the same deployment on purple-vds2, which was supposed to be the second ha-host, and it ended up with "[ ERROR ] Engine is still not reachable
[ ERROR ] Failed to execute stage 'Closing up': Engine is still not reachable [ ERROR ] Hosted Engine deployment failed: this system is not reliable, please check the issue,fix and redeploy
          Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20171213185521-2fr2ic.log"

I believe that its failure was caused by the same issue; adding logs for purple-vds2 as well.

Comment 37 Nikolai Sednev 2017-12-13 17:43:56 UTC
Created attachment 1367501 [details]
logs from purple-vds2

Comment 38 Piotr Kliczewski 2017-12-13 18:31:26 UTC
As stated in comment #27, the host entered recovery mode:

2017-12-13 17:47:15,562+0200 INFO  (jsonrpc/0) [jsonrpc.JsonRpcServer] In recovery, ignoring 'Host.ping2' in bridge with {} (__init__:585)

and vdsm was not operational at this time.

Due to this I see the following log entry:

2017-12-13 17:47:15,566+0200 ERROR otopi.context context._executeMethod:152 Failed to execute stage 'Environment setup': Couldn't  connect to VDSM within 15 seconds

This BZ was about a different issue, so please use a new BZ to track it.

Comment 39 Piotr Kliczewski 2017-12-13 18:32:25 UTC
Is your environment reused? This should not happen when HE is installed for the first time; please make sure to clean up your env and rerun the test.

Comment 40 Piotr Kliczewski 2017-12-13 18:48:58 UTC
Simone, why do the logs say "Couldn't  connect to VDSM within 15 seconds" when the operation started at 17:47:13,092+0200 and ended at 17:47:15,566+0200? The time spent waiting for vdsm is much shorter than stated in the logs.

Comment 41 Simone Tiraboschi 2017-12-13 22:28:17 UTC
(In reply to Piotr Kliczewski from comment #40)
> Simone, Why the logs says that "Couldn't  connect to VDSM within 15 seconds"
> whereas the operation started at 17:47:13,092+0200 and ended at
> 17:47:15,566+0200? The time taken to wait for vdsm is much shorted than
> specified in the logs.

Great point Piotr, I think that you hit the issue.

Check the code here:
https://github.com/oVirt/ovirt-hosted-engine-ha/blob/fa5daa6ba3a881625bd8029f488602749de589dd/ovirt_hosted_engine_ha/lib/util.py

And compare with failure logs:
 2017-12-13 17:47:13,092+0200 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 Creating a new json-rpc connection to VDSM
 2017-12-13 17:47:13,098+0200 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 Waiting for VDSM to connect
 2017-12-13 17:47:14,102+0200 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 Waiting for VDSM to connect
 2017-12-13 17:47:15,563+0200 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 VDSM jsonrpc connection is not ready
 2017-12-13 17:47:15,564+0200 DEBUG otopi.context context._executeMethod:143 method exception
 Traceback (most recent call last):
   File "/usr/lib/python2.7/site-packages/otopi/context.py", line 133, in _executeMethod
     method['method']()
   File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-setup/system/vdsmenv.py", line 95, in _late_setup
     timeout=ohostedcons.Const.VDSCLI_SSL_TIMEOUT,
   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 442, in connect_vdsm_json_rpc
     __vdsm_json_rpc_connect(logger, timeout)
   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 398, in __vdsm_json_rpc_connect
     timeout=VDSM_MAX_RETRY * VDSM_DELAY
 RuntimeError: Couldn't  connect to VDSM within 15 seconds


The while loop on line 382 should try up to 15 times with a 1-second delay between each try (hence the misleading 15-second figure), but in the logs we read 'Waiting for VDSM to connect' only two times, so after 2 seconds vdsm was successfully connected.

But...
on line 393 we are calling __vdsm_json_rpc_check to pedantically validate the connection, and this check fails: in the logs we see 'VDSM jsonrpc connection is not ready', which comes from line 418 within __vdsm_json_rpc_check.
At that point __vdsm_json_rpc_check sets _vdsm_json_rpc = None on line 429, the check on line 395 fails, and we raise with
"Couldn't  connect to VDSM within {timeout} seconds".format(timeout=VDSM_MAX_RETRY * VDSM_DELAY) with VDSM_MAX_RETRY=15 and VDSM_DELAY=1.

So, recapping, we probably have a race condition here:
vdsm got successfully connected after 2 seconds, but just after that (I'm assuming the connection couldn't succeed if vdsm were in recovery state) vdsm entered recovery state, so the ping2 check fails and we fail with a misleading message.
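A heavily simplified reconstruction of that flow (the real code is in ovirt_hosted_engine_ha/lib/util.py, linked above); it shows how a single failed ping2 check throws away a connection that actually came up after about 2 seconds:

import time

VDSM_MAX_RETRY = 15
VDSM_DELAY = 1


def jsonrpc_connect(create_client, check):
    client = None
    for _ in range(VDSM_MAX_RETRY):      # the 'Waiting for VDSM to connect' loop
        client = create_client()
        if client is not None:
            break
        time.sleep(VDSM_DELAY)
    # Pedantic validation: if the ping2 check fails (e.g. vdsm just entered
    # recovery), the freshly created connection is discarded...
    if client is not None and not check(client):
        client = None
    if client is None:
        # ...and we end up with this misleading message even though the
        # connection itself was established after only two retries.
        raise RuntimeError("Couldn't connect to VDSM within %s seconds"
                           % (VDSM_MAX_RETRY * VDSM_DELAY))
    return client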

Comment 42 Simone Tiraboschi 2017-12-13 22:35:13 UTC
I think that we broke it here:
https://gerrit.ovirt.org/#/c/84424/3/ovirt_hosted_engine_ha/lib/util.py@419

that break statement interrupts the loop on line 410 at the first client.Error exception, hence this side effect.
The loop on line 410 was probably almost long enough to wait for vdsm to exit the recovery process.
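A small sketch of the difference, as simplified pseudocode rather than the actual util.py: breaking on the first client.Error gives up immediately, while retrying gives vdsm time to finish recovering, which is roughly the behaviour the "loop until ping2 is successful" patch in the links above aims for.

import time


class ClientError(Exception):
    """Stand-in for vdsm's client.Error."""


def check_broken(ping2, retries=15, delay=1):
    for _ in range(retries):
        try:
            ping2()
            return True
        except ClientError:
            break                 # gives up after the very first failure
    return False


def check_patient(ping2, retries=15, delay=1):
    for _ in range(retries):
        try:
            ping2()
            return True
        except ClientError:
            time.sleep(delay)     # retry: vdsm may still be in recovery
    return False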

Comment 43 Martin Perina 2017-12-14 07:44:31 UTC
(In reply to Piotr Kliczewski from comment #39)
> Is your environment reused? This should not happen when HE is installed for
> the first time please make sure to cleanup your env and rerun the test.

Nikolai, I really don't see a reason why you put all those bugs as blockers for this one. Could you please reply to Piotr's question: do you really deploy HE on a clean host (meaning freshly installed, not somehow manually cleaned from previous installations)? Because this is the only reason we could think of why a freshly installed VDSM would start in recovery mode.

Comment 44 Yihui Zhao 2017-12-14 08:57:30 UTC
Created attachment 1367892 [details]
HE_stuck.png

Comment 48 Martin Perina 2017-12-14 10:15:13 UTC
(In reply to Nikolai Sednev from comment #36)
> I've just tried to make the same deployment on purple-vds2, it had to be the
> second ha-host and ended up with "[ ERROR ] Engine is still not reachable
> [ ERROR ] Failed to execute stage 'Closing up': Engine is still not
> reachable [ ERROR ] Hosted Engine deployment failed: this system is not
> reliable, please check the issue,fix and redeploy
>           Log file is located at
> /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20171213185521-
> 2fr2ic.log"
> 
> I believe that it's failure was caused by the same issue, adding logs also
> for purple-vds2.

Could you please create a new bug for that? Why VDSM is in recovery mode after a restart on a fresh installation needs to be investigated in a separate bug; it has nothing to do with automatic reconnection. Thanks

Comment 50 Nikolai Sednev 2017-12-14 11:22:12 UTC
To add more data: this is what happened on the first cleanly reprovisioned host, purple-vds1: http://pastebin.test.redhat.com/540761, and this is what happened on the second cleanly reprovisioned host, purple-vds2: http://pastebin.test.redhat.com/540764.

https://bugzilla.redhat.com/show_bug.cgi?id=1525907 was opened following https://bugzilla.redhat.com/show_bug.cgi?id=1512534#c48.

Comment 51 Nikolai Sednev 2017-12-18 16:22:04 UTC
Failed to reproduce while deploying over NFS, iSCSI and Gluster.
Moving to VERIFIED.
Please feel free to reopen if you still experience this issue.

Components on hosts:
ovirt-hosted-engine-ha-2.2.1-1.el7ev.noarch
qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64
sanlock-3.5.0-1.el7.x86_64
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
libvirt-client-3.2.0-14.el7_4.5.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
ovirt-host-dependencies-4.2.0-1.el7ev.x86_64
mom-0.5.11-1.el7ev.noarch
ovirt-setup-lib-1.1.4-1.el7ev.noarch
vdsm-4.20.9.2-1.el7ev.x86_64
ovirt-provider-ovn-driver-1.2.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.1-1.el7ev.noarch
ovirt-imageio-daemon-1.2.0-0.el7ev.noarch
ovirt-host-4.2.0-1.el7ev.x86_64
ovirt-host-deploy-1.7.0-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-imageio-common-1.2.0-0.el7ev.noarch
Linux version 3.10.0-693.15.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Thu Dec 14 05:13:32 EST 2017
Linux alma03.qa.lab.tlv.redhat.com 3.10.0-693.15.1.el7.x86_64 #1 SMP Thu Dec 14 05:13:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.4 (Maipo)

Comment 52 Sandro Bonazzola 2017-12-20 10:58:14 UTC
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be
resolved in oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

