SHE deployment takes too much time and looks like it is stuck.

Description of problem:
[ INFO ] Connecting to Engine
[ INFO ] Waiting for the host to become operational in the engine. This may take several minutes...
[ INFO ] Still waiting for VDSM host to become operational...
[ INFO ] Still waiting for VDSM host to become operational...
[ INFO ] Still waiting for VDSM host to become operational...
[ INFO ] The VDSM Host is now operational

According to Simone, it looks like a vdsm json-rpc client issue:
2017-11-13 14:45:40,614+0200 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 VDSM jsonrpc connection is not ready

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-2.2.0-0.0.master.20171110120732.git35143a6.el7.centos.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy SHE on one host over NFS.

Actual results:
Deployment takes too much time and seems to get stuck at "The VDSM Host is now operational".

Expected results:
Deployment should run much faster.

Additional info:
Sosreport from the host is attached.
It's a regression in the VDSM python jsonrpc client, which no longer raises stomp.Disconnected when the connection is not valid:
https://gerrit.ovirt.org/#/c/83155/4/lib/yajsonrpc/stompreactor.py@484

2017-11-13 13:45:37,603+0200 DEBUG otopi.context context._executeMethod:128 Stage closeup METHOD otopi.plugins.gr_he_setup.system.vdsmenv.Plugin._closeup
2017-11-13 14:00:37,604+0200 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 VDSM jsonrpc connection is not ready
2017-11-13 14:15:38,608+0200 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 VDSM jsonrpc connection is not ready
2017-11-13 14:30:39,611+0200 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 VDSM jsonrpc connection is not ready
2017-11-13 14:45:40,614+0200 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 VDSM jsonrpc connection is not ready
Created attachment 1351569 [details] sosreport from alma04
I'm not sure the problem is a regression in the VDSM jsonrpc client. I agree that we should add the exception again, but it seems strange to me that connecting/reconnecting takes such a long time. Can you give me more details on VDSM availability? When did vdsmd start to run compared to the first connection attempt? Was it up or down during the attempts to communicate with it? Also, it would be great if you could give me steps to reproduce this issue.
(In reply to Irit Goihman from comment #3)
> I'm not sure the problem is a regression in the VDSM jsonrpc client.
> I agree that we should add the exception again, but it seems strange to me
> that connecting/reconnecting takes such a long time.
> Can you give me more details on VDSM availability? When did vdsmd start to
> run compared to the first connection attempt? Was it up or down during the
> attempts to communicate with it?
> Also, it would be great if you could give me steps to reproduce this issue.

https://bugzilla.redhat.com/show_bug.cgi?id=1512534#c0

Steps to Reproduce:
1. Deploy SHE on one host over NFS.

You should simply get the deployment stuck at "[ INFO ] The VDSM Host is now operational" forever.
Irit, in the current flow hosted-engine-setup directly uses vdsm to create a storage domain, a management bridge and so on. Since no engine is there at bootstrap, hosted-engine-setup creates temporary bootstrap certs for VDSM. Once the engine is available, hosted-engine-setup calls host.add on the REST API, and this triggers host-deploy, which replaces the vdsm certs with ones signed by the engine CA and restarts vdsmd to make them effective. At the same time hosted-engine-setup is using vdsm, through a json-rpc connection via the python client, to monitor the host status. So, when host-deploy restarts vdsmd, the connection gets disconnected, but if we miss the Disconnected exception, hosted-engine-setup doesn't realize it and keeps looping over the dead connection for a long time; by default we have a 900 second timeout.
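A minimal sketch of the race described above, with hypothetical helper names (make_client, is_host_up) rather than the actual hosted-engine-setup code: the monitoring loop holds a json-rpc connection that host-deploy invalidates when it restarts vdsmd, and without a Disconnected-style exception the loop only notices after the full response timeout.

```python
# Hedged sketch of the race described above; names and structure are
# illustrative, not taken from the real hosted-engine-setup sources.
import time

RESPONSE_TIMEOUT = 900  # default hosted-engine-setup timeout, in seconds
POLL_DELAY = 5


class Disconnected(Exception):
    """Stand-in for the stomp.Disconnected the client used to raise."""


def monitor_host_status(make_client, is_host_up):
    """Poll vdsm until the host is reported as operational.

    `make_client` opens a fresh json-rpc connection; `is_host_up` inspects
    the returned stats. Both are hypothetical callables for this sketch.
    """
    client = make_client()
    while True:
        try:
            # If host-deploy restarted vdsmd, this connection is dead.
            # When the client raises Disconnected we fail fast and reconnect;
            # if it silently waits instead, each iteration can burn the whole
            # 900 second response timeout before we notice anything.
            status = client.call('Host.getStats', timeout=RESPONSE_TIMEOUT)
        except Disconnected:
            client = make_client()  # rebuild the connection and retry
            continue
        if is_host_up(status):
            return status
        time.sleep(POLL_DELAY)
```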
So, hosted-engine-setup loops over a command and gets an indication of VDSM availability from an exception raised in the code. Now that we have added the reconnect mechanism, we don't need to raise the Disconnected exception anywhere, since we assume we can reconnect at any time during the timeout. The situation described here is that VDSM is expected not to be available and you keep polling it with vdsm client commands. Is there a reason for such a long timeout? We're using 90 seconds for slow commands; what is the reason for setting 900 seconds? I don't think the way to solve this issue is adding the exception back, but rather getting an indication of VDSM availability in other ways.
(In reply to Irit Goihman from comment #6)
> So, hosted-engine-setup loops over a command and gets an indication of VDSM
> availability from an exception raised in the code.
> Now that we have added the reconnect mechanism, we don't need to raise the
> Disconnected exception anywhere, since we assume we can reconnect at any
> time during the timeout.

Reconnecting on the same command would also be fine, but in this case it is not even reconnecting.

> The situation described here is that VDSM is expected not to be available
> and you keep polling it with vdsm client commands.

host-deploy will restart vdsm while hosted-engine-setup is using it to monitor the host, but vdsm is expected to be available again after host-deploy.

> Is there a reason for such a long timeout? We're using 90 seconds for slow
> commands; what is the reason for setting 900 seconds?

We set such a long timeout because we have long tasks on the storage side, but I agree that 900 seconds is really too long.

> I don't think the way to solve this issue is adding the exception back, but
> rather getting an indication of VDSM availability in other ways.
(In reply to Simone Tiraboschi from comment #7)
> (In reply to Irit Goihman from comment #6)
> > So, hosted-engine-setup loops over a command and gets an indication of VDSM
> > availability from an exception raised in the code.
> > Now that we have added the reconnect mechanism, we don't need to raise the
> > Disconnected exception anywhere, since we assume we can reconnect at any
> > time during the timeout.
>
> Reconnecting on the same command would also be fine, but in this case it is
> not even reconnecting.

We use at-most-once delivery semantics, and there can be situations where we don't get the answer at all and need to retry.

> > The situation described here is that VDSM is expected not to be available
> > and you keep polling it with vdsm client commands.
>
> host-deploy will restart vdsm while hosted-engine-setup is using it to
> monitor the host, but vdsm is expected to be available again after
> host-deploy.
>
> > Is there a reason for such a long timeout? We're using 90 seconds for slow
> > commands; what is the reason for setting 900 seconds?
>
> We set such a long timeout because we have long tasks on the storage side,
> but I agree that 900 seconds is really too long.

I think that we should set the timeout based on the verb being called and not use such a long timeout for everything.

> > I don't think the way to solve this issue is adding the exception back, but
> > rather getting an indication of VDSM availability in other ways.

A user does not need to know about the state of the connection; they should just use the client code.
It's not possible to reload certificate content when those certificates are already in use in Python, so automatic client reconnection is not possible for the HE installation when the VDSM certificates are changed during host-deploy. For this reason the change has to be done inside the HE code: remove the client and create it again.
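A minimal sketch of the approach described here, under the assumption that the client object cannot reload certificates in place; the helper names are illustrative, not the actual ovirt-hosted-engine-ha code:

```python
# Illustrative sketch: dropping and recreating the json-rpc client so that
# the certificates installed by host-deploy are read again from disk.
# `create_vdsm_client` is a hypothetical factory, not a real vdsm API call.

def refresh_vdsm_client(create_vdsm_client, old_client=None):
    """Return a brand new client; the old TLS context (and its certs) is discarded."""
    if old_client is not None:
        try:
            old_client.close()      # release the stale connection, best effort
        except Exception:
            pass                    # it may already be gone after the vdsmd restart
    # A new client builds a new SSL context, which loads the new certificates.
    return create_vdsm_client()
```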
Moving to the HE component, as we are not able to fix the issue within the jsonrpc client.
Still not 100% safe. Look here:
http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/1711/console

17:52:23 [ INFO ] The VDSM Host is now operational
18:07:26 [ INFO ] Saving hosted-engine configuration on the shared storage domain

If we are lucky and only ping2 gets interrupted by host-deploy restarting vdsm, we pay just a 1 second delay, but otherwise we still pay the whole 900 second timeout.
The issue was fixed and now the client is able to connect after one attempt, as you can see in the log. The timeout you mentioned is 1 second waiting for a response. The timeout is still set here [1]. I think that there is a broader issue with handling timeouts: we should set this timeout only for the verbs which require it.

[1] https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/lib/util.py#L378
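One way to implement the per-verb timeouts suggested here (a sketch with example verbs and values, not the timeouts actually used by hosted-engine-setup or ovirt-hosted-engine-ha):

```python
# Hedged sketch of per-verb timeouts; the verbs and values below are examples,
# not the configuration actually shipped by the HE code.

DEFAULT_TIMEOUT = 90          # the "slow command" timeout mentioned in the discussion
VERB_TIMEOUTS = {
    'Host.ping2': 5,                       # cheap liveness check, fail fast
    'StorageDomain.create': 900,           # long-running storage operation
    'StoragePool.connectStorageServer': 900,
}


def call_with_verb_timeout(client, verb, **params):
    """Call a vdsm verb using a timeout appropriate for that verb."""
    timeout = VERB_TIMEOUTS.get(verb, DEFAULT_TIMEOUT)
    # `client.call(verb, params, timeout=...)` is assumed here for the sketch;
    # the real client API may expose the timeout differently.
    return client.call(verb, params, timeout=timeout)
```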
(In reply to Piotr Kliczewski from comment #12)
> The issue was fixed and now the client is able to connect after one attempt,
> as you can see in the log. The timeout you mentioned is 1 second waiting for
> a response.

Yes, but in that case, for some strange reason, we still paid the whole 900 second timeout on reconnect although we were not sending any command.

From
http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/1711/artifact/exported-artifacts/test_logs/he-basic-suite-master/post-002_bootstrap.py/lago-he-basic-suite-master-host0/_var_log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20171127124214-93e1v7.log

we see that it spent 900 seconds between:
2017-11-27 12:52:24,647-0500 DEBUG otopi.context context.dumpEnvironment:835 ENVIRONMENT DUMP - END
2017-11-27 12:52:24,648-0500 DEBUG otopi.context context._executeMethod:128 Stage closeup METHOD otopi.plugins.gr_he_setup.system.vdsmenv.Plugin._closeup
2017-11-27 13:07:24,649-0500 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 VDSM jsonrpc connection is not ready
2017-11-27 13:07:24,651-0500 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 Creating a new json-rpc connection to VDSM

but in otopi.plugins.gr_he_setup.system.vdsmenv we have just:
https://github.com/oVirt/ovirt-hosted-engine-setup/blob/master/src/plugins/gr-he-setup/system/vdsmenv.py#L173
to have the vdsm client refresh the certs, without any new command here.

So I think that we fixed the timeout for the ping2 test at
https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/lib/util.py#L413
but after that we are still probably going to pay the whole timeout at
https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/lib/util.py#L442
calling
https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/lib/util.py#L385
when the 1-second-timed-out ping2 failed.

So I think that this specific sub-issue occurs when ping2 fails due to the 1 second timeout but vdsm is still not ready when we retry the connection; in that case we are still going to pay the whole 900 second timeout.

In my opinion, raising something when the json-rpc client fails due to old certs, instead of simply waiting for the timeout, would be a more robust fix.
(In reply to Simone Tiraboschi from comment #13)
> (In reply to Piotr Kliczewski from comment #12)
> > The issue was fixed and now the client is able to connect after one attempt,
> > as you can see in the log. The timeout you mentioned is 1 second waiting for
> > a response.
>
> Yes, but in that case, for some strange reason, we still paid the whole 900
> second timeout on reconnect although we were not sending any command.
>
> From
> http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/1711/artifact/exported-artifacts/test_logs/he-basic-suite-master/post-002_bootstrap.py/lago-he-basic-suite-master-host0/_var_log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20171127124214-93e1v7.log
>
> we see that it spent 900 seconds between:
> 2017-11-27 12:52:24,647-0500 DEBUG otopi.context context.dumpEnvironment:835 ENVIRONMENT DUMP - END
> 2017-11-27 12:52:24,648-0500 DEBUG otopi.context context._executeMethod:128 Stage closeup METHOD otopi.plugins.gr_he_setup.system.vdsmenv.Plugin._closeup
> 2017-11-27 13:07:24,649-0500 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 VDSM jsonrpc connection is not ready
> 2017-11-27 13:07:24,651-0500 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 Creating a new json-rpc connection to VDSM
>
> but in otopi.plugins.gr_he_setup.system.vdsmenv we have just:
> https://github.com/oVirt/ovirt-hosted-engine-setup/blob/master/src/plugins/gr-he-setup/system/vdsmenv.py#L173
> to have the vdsm client refresh the certs, without any new command here.
>
> So I think that we fixed the timeout for the ping2 test at
> https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/lib/util.py#L413
> but after that we are still probably going to pay the whole timeout at
> https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/lib/util.py#L442
> calling
> https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/lib/util.py#L385
> when the 1-second-timed-out ping2 failed.
>
> So I think that this specific sub-issue occurs when ping2 fails due to the 1
> second timeout but vdsm is still not ready when we retry the connection; in
> that case we are still going to pay the whole 900 second timeout.

There is a spelling mistake in the timeout param. Please see here:
https://github.com/oVirt/vdsm/blob/master/lib/vdsm/client.py#L257

> In my opinion, raising something when the json-rpc client fails due to old
> certs, instead of simply waiting for the timeout, would be a more robust fix.

This is not possible because we only see a handshake failure and we do not know the reason. Overall, raising is against the reconnect logic, so I do not think it is a good idea.
Deployment over Gluster worked for me with these components:
ovirt-hosted-engine-setup-2.2.0-0.0.master.20171127231205.git7c7ab44.el7.centos.noarch
ovirt-hosted-engine-ha-2.2.0-0.0.master.20171122155227.20171122155225.gitbc3ec09.el7.centos.noarch
ovirt-engine-appliance-4.2-20171127.1.el7.centos.noarch
Deployed using:
ovirt-hosted-engine-ha-2.2.0-0.0.master.20171128125909.20171128125907.gitfa5daa6.el7.centos.noarch
ovirt-hosted-engine-setup-2.2.0-0.0.master.20171129192644.git440040c.el7.centos.noarch
ovirt-engine-appliance-4.2-20171129.1.el7.centos.noarch

Works for me now; might be moved to verified.
*** Bug 1522641 has been marked as a duplicate of this bug. ***
Created attachment 1364625 [details] cockpit_he.png
Created attachment 1364627 [details] deploy.log
Met this issue again.

Test version:
ovirt-hosted-engine-setup-2.2.0-2.el7ev.noarch
ovirt-hosted-engine-ha-2.2.0-1.el7ev.noarch
cockpit-ovirt-dashboard-0.11.1-0.6.el7ev.noarch
rhvh-4.2.0.5-0.20171207.0+1

Actual results:
https://bugzilla.redhat.com/attachment.cgi?id=1364625

Deploy log:
https://bugzilla.redhat.com/attachment.cgi?id=1364627
Changing the bug to ASSIGNED status since QE still hit this issue with the RHVH 4.2 Beta build. pkliczew, could you please help check this issue ASAP? Thanks.
I would not mix this issue with not being able to connect to vdsm at all, which is what I see in the deploy log referenced in comment #20. As you can see, the timeouts are OK and the change is applied:

2017-12-08 11:27:30,699+0800 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 VDSM jsonrpc connection is not ready
2017-12-08 11:27:30,699+0800 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 Creating a new json-rpc connection to VDSM
2017-12-08 11:27:30,701+0800 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 Waiting for VDSM to connect
2017-12-08 11:27:31,704+0800 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 Waiting for VDSM to connect

At the end the client failed to connect, which is different behavior. I need the vdsm logs to understand why this occurred.
@yzhao Could you please upload the vdsm log for debugging?
From the provided logs, the failure looks unrelated; moving back to ON_QA.
Moving to verified per comments #15 and #16, as this is not reproduced anymore on regular RHEL 7.4 SHE deployments. Please reopen if it still happens on RHEVH.
Created attachment 1365751 [details] vdsm.log
Based on the provided log, vdsm was stopped at:

2017-12-08 11:17:20,836+0800 INFO (MainThread) [vds] Received signal 15, shutting down (vdsmd:67)

and started at:

2017-12-08 11:27:48,714+0800 INFO (MainThread) [vds] (PID: 30140) I am the actual vdsm 4.20.9-1.el7ev dhcp-8-176.nay.redhat.com (3.10.0-693.11.1.el7.x86_64) (vdsmd:148)

and `RuntimeError: Couldn't connect to VDSM within 15 seconds` was at:

2017-12-08 11:27:45,739+0800 DEBUG otopi.context context._executeMethod:143 method exception

It is important to note that vdsm entered recovery mode, so startup took much longer and only ended at:

2017-12-08 13:37:37,130+0800 INFO (vmrecovery) [vds] recovery: completed in 7788s (clientIF:569)

The question is why vdsm entered recovery. In typical HE scenarios it should not happen.
(In reply to Piotr Kliczewski from comment #27)
> Based on the provided log, vdsm was stopped at:
>
> 2017-12-08 11:17:20,836+0800 INFO (MainThread) [vds] Received signal 15,
> shutting down (vdsmd:67)
>
> and started at:
>
> 2017-12-08 11:27:48,714+0800 INFO (MainThread) [vds] (PID: 30140) I am the
> actual vdsm 4.20.9-1.el7ev dhcp-8-176.nay.redhat.com
> (3.10.0-693.11.1.el7.x86_64) (vdsmd:148)
>
> and `RuntimeError: Couldn't connect to VDSM within 15 seconds` was at:
>
> 2017-12-08 11:27:45,739+0800 DEBUG otopi.context context._executeMethod:143
> method exception
>
> It is important to note that vdsm entered recovery mode, so startup took
> much longer and only ended at:
>
> 2017-12-08 13:37:37,130+0800 INFO (vmrecovery) [vds] recovery: completed in
> 7788s (clientIF:569)
>
> The question is why vdsm entered recovery. In typical HE scenarios it should
> not happen.

I deployed again and also met this issue; vdsm entered recovery. From the vdsm.log:

2017-12-11 04:00:06,170-0500 ERROR (periodic/0) [virt.periodic.Operation] <vdsm.virt.sampling.VMBulkstatsMonitor object at 0x2acd710> operation failed (periodic:215)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 213, in __call__
    self._func()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 522, in __call__
    self._send_metrics()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 538, in _send_metrics
    vm_sample.interval)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 45, in produce
    networks(vm, stats, first_sample, last_sample, interval)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 322, in networks
    if nic.name.startswith('hostdev'):
This is RHEVH specific now; it's working just fine on regular RHEL7.4 hosts per https://bugzilla.redhat.com/show_bug.cgi?id=1512534#c25. I suggest opening a separate, RHEVH-specific bug on this. RHEVH is also blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1516113.
I've reproduced the original issue on purple-vds1; that host is a bit older and probably a bit slower. I got "[ ERROR ] Failed to execute stage 'Environment setup': Couldn't connect to VDSM within 15 seconds" on it, although on the other two pairs of hosts this is not happening anymore. Reopening this bug as reproduced. Please see the attachment.

ovirt-hosted-engine-ha-2.2.1-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.1-1.el7ev.noarch
rhvm-appliance-4.2-20171207.0.el7.noarch
Created attachment 1367494 [details] logs from purple-vds1
Piotr, could you please take a look at the latest logs?
I've just tried to make the same deployment on purple-vds2; it was supposed to be the second ha-host and ended up with:

[ ERROR ] Engine is still not reachable
[ ERROR ] Failed to execute stage 'Closing up': Engine is still not reachable
[ ERROR ] Hosted Engine deployment failed: this system is not reliable, please check the issue,fix and redeploy
          Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20171213185521-2fr2ic.log

I believe that its failure was caused by the same issue; adding logs also for purple-vds2.
Created attachment 1367501 [details] logs from purple-vds2
As stated in comment #27, the host entered recovery mode:

2017-12-13 17:47:15,562+0200 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] In recovery, ignoring 'Host.ping2' in bridge with {} (__init__:585)

and vdsm was not operational at this time. Due to this I see the following log entry:

2017-12-13 17:47:15,566+0200 ERROR otopi.context context._executeMethod:152 Failed to execute stage 'Environment setup': Couldn't connect to VDSM within 15 seconds

This BZ was about a different issue, so please use a new BZ to track it.
Is your environment reused? This should not happen when HE is installed for the first time. Please make sure to clean up your env and rerun the test.
Simone, why do the logs say "Couldn't connect to VDSM within 15 seconds" whereas the operation started at 17:47:13,092+0200 and ended at 17:47:15,566+0200? The time spent waiting for vdsm is much shorter than specified in the logs.
(In reply to Piotr Kliczewski from comment #40)
> Simone, why do the logs say "Couldn't connect to VDSM within 15 seconds"
> whereas the operation started at 17:47:13,092+0200 and ended at
> 17:47:15,566+0200? The time spent waiting for vdsm is much shorter than
> specified in the logs.

Great point Piotr, I think that you hit the issue.

Check the code here:
https://github.com/oVirt/ovirt-hosted-engine-ha/blob/fa5daa6ba3a881625bd8029f488602749de589dd/ovirt_hosted_engine_ha/lib/util.py

And compare with the failure logs:

2017-12-13 17:47:13,092+0200 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 Creating a new json-rpc connection to VDSM
2017-12-13 17:47:13,098+0200 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 Waiting for VDSM to connect
2017-12-13 17:47:14,102+0200 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 Waiting for VDSM to connect
2017-12-13 17:47:15,563+0200 DEBUG otopi.plugins.gr_he_setup.system.vdsmenv util.__log_debug:374 VDSM jsonrpc connection is not ready
2017-12-13 17:47:15,564+0200 DEBUG otopi.context context._executeMethod:143 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 133, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-setup/system/vdsmenv.py", line 95, in _late_setup
    timeout=ohostedcons.Const.VDSCLI_SSL_TIMEOUT,
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 442, in connect_vdsm_json_rpc
    __vdsm_json_rpc_connect(logger, timeout)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 398, in __vdsm_json_rpc_connect
    timeout=VDSM_MAX_RETRY * VDSM_DELAY
RuntimeError: Couldn't connect to VDSM within 15 seconds

The while loop on line 382 should try up to 15 times with a 1 second delay between each try (hence the misleading "15 seconds" in the message), but in the logs we read 'Waiting for VDSM to connect' only two times, so after 2 seconds vdsm was successfully connected. But on line 393 we are calling __vdsm_json_rpc_check to pedantically validate the connection, and this check fails: in the logs we see 'VDSM jsonrpc connection is not ready', which comes from line 418 within __vdsm_json_rpc_check. At that point __vdsm_json_rpc_check sets _vdsm_json_rpc = None on line 429, the check on line 395 fails, and we raise "Couldn't connect to VDSM within {timeout} seconds".format(timeout=VDSM_MAX_RETRY * VDSM_DELAY) with VDSM_MAX_RETRY=15 and VDSM_DELAY=1.

So, recapping: we probably have a race condition here. vdsm got successfully connected after 2 seconds, but just after that (I'm assuming the connection couldn't succeed if vdsm is in recovering state) vdsm entered recovery state, so the ping2 check fails and we fail with a misleading message.
I think that we broke it here:
https://gerrit.ovirt.org/#/c/84424/3/ovirt_hosted_engine_ha/lib/util.py@419

That break statement interrupts the loop on line 410 at the first client.Error exception, hence this side effect. Probably the loop on line 410 was previously almost long enough to wait for vdsm to exit the recovery process.
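For comparison, a hedged sketch of a check loop that keeps retrying on client errors for its whole retry budget instead of breaking out on the first failure. The retry constants mirror the VDSM_MAX_RETRY/VDSM_DELAY values quoted above; everything else is illustrative and not the actual ovirt_hosted_engine_ha/lib/util.py code or the merged fix.

```python
# Illustrative only -- not the real ovirt_hosted_engine_ha/lib/util.py.
import time

VDSM_MAX_RETRY = 15
VDSM_DELAY = 1


class ClientError(Exception):
    """Stand-in for vdsm.client.Error."""


def json_rpc_check(client, logger):
    """Validate the connection with ping2, retrying while vdsm recovers.

    Returns True if vdsm answered within the retry budget, False otherwise.
    Breaking out of the loop on the first ClientError (as in the change
    referenced above) would instead report failure after a single attempt,
    even though vdsm might leave recovery a few seconds later.
    """
    for _ in range(VDSM_MAX_RETRY):
        try:
            client.call('Host.ping2', timeout=VDSM_DELAY)
            return True
        except ClientError:
            logger.debug('VDSM jsonrpc connection is not ready')
            time.sleep(VDSM_DELAY)
    return False
```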
(In reply to Piotr Kliczewski from comment #39)
> Is your environment reused? This should not happen when HE is installed for
> the first time. Please make sure to clean up your env and rerun the test.

Nikolai, I really don't see a reason why you put all those bugs as blockers for this one. Could you please reply to Piotr's question: do you really deploy HE on a clean host (meaning freshly installed, not somehow manually cleaned from previous installations)? Because this is the only reason we could think of why a freshly installed VDSM would start in recovery mode.
Created attachment 1367892 [details] HE_stuck.png
(In reply to Nikolai Sednev from comment #36)
> I've just tried to make the same deployment on purple-vds2; it was supposed
> to be the second ha-host and ended up with:
>
> [ ERROR ] Engine is still not reachable
> [ ERROR ] Failed to execute stage 'Closing up': Engine is still not reachable
> [ ERROR ] Hosted Engine deployment failed: this system is not reliable, please check the issue,fix and redeploy
>           Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20171213185521-2fr2ic.log
>
> I believe that its failure was caused by the same issue; adding logs also
> for purple-vds2.

Could you please create a new bug for that? Why VDSM is in recovery mode after a restart in a fresh installation needs to be investigated in a separate bug; it has nothing to do with automatic reconnection. Thanks
To add more data: this is what happened on the first cleanly reprovisioned host, purple-vds1: http://pastebin.test.redhat.com/540761, and this is what happened on the second cleanly reprovisioned host, purple-vds2: http://pastebin.test.redhat.com/540764. https://bugzilla.redhat.com/show_bug.cgi?id=1525907 was opened per https://bugzilla.redhat.com/show_bug.cgi?id=1512534#c48.
Failed to reproduce while deploying over NFS, iSCSI and Gluster. Moving to verified. Please feel free to reopen if you still experience this issue.

Components on hosts:
ovirt-hosted-engine-ha-2.2.1-1.el7ev.noarch
qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64
sanlock-3.5.0-1.el7.x86_64
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
libvirt-client-3.2.0-14.el7_4.5.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
ovirt-host-dependencies-4.2.0-1.el7ev.x86_64
mom-0.5.11-1.el7ev.noarch
ovirt-setup-lib-1.1.4-1.el7ev.noarch
vdsm-4.20.9.2-1.el7ev.x86_64
ovirt-provider-ovn-driver-1.2.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.1-1.el7ev.noarch
ovirt-imageio-daemon-1.2.0-0.el7ev.noarch
ovirt-host-4.2.0-1.el7ev.x86_64
ovirt-host-deploy-1.7.0-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
ovirt-imageio-common-1.2.0-0.el7ev.noarch
Linux version 3.10.0-693.15.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Thu Dec 14 05:13:32 EST 2017
Linux alma03.qa.lab.tlv.redhat.com 3.10.0-693.15.1.el7.x86_64 #1 SMP Thu Dec 14 05:13:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.4 (Maipo)
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017. Since the problem described in this bug report should be resolved in oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.