Bug 1522641 - Hosted Engine deployment looks like stuck, but become up with one more hours.
Summary: Hosted Engine deployment looks like stuck, but become up with one more hours.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: cockpit-ovirt
Classification: oVirt
Component: Hosted Engine
Version: 0.11.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.2.0
Target Release: ---
Assignee: Phillip Bailey
QA Contact: Yihui Zhao
URL:
Whiteboard:
Depends On: 1512534
Blocks: 1483586
 
Reported: 2017-12-06 07:19 UTC by Yihui Zhao
Modified: 2017-12-22 06:50 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-22 06:50:31 UTC
oVirt Team: Node
rule-engine: ovirt-4.2+
rule-engine: blocker+
cshao: testing_ack+


Attachments
he.png (60.59 KB, image/png), 2017-12-06 07:19 UTC, Yihui Zhao
he_deploy.log (704.82 KB, text/plain), 2017-12-07 07:00 UTC, Yihui Zhao
engine.log (357.44 KB, text/plain), 2017-12-07 07:01 UTC, Yihui Zhao
HE_stuck.png (87.54 KB, image/png), 2017-12-14 09:09 UTC, Yihui Zhao
he_deploy_log (939.62 KB, text/plain), 2017-12-14 09:15 UTC, Yihui Zhao
For centos7 (54.20 KB, image/png), 2017-12-15 03:59 UTC, Wei Wang
For centos7 (1.04 MB, application/x-gzip), 2017-12-15 04:01 UTC, Wei Wang
cockpit_iptables.png (72.89 KB, image/png), 2017-12-15 12:55 UTC, Simone Tiraboschi
no_route_deploy_log (674.48 KB, text/plain), 2017-12-18 16:21 UTC, Yihui Zhao
test_he_cli_1218.log (427.17 KB, application/x-bzip), 2017-12-19 05:10 UTC, Yihui Zhao
test_he_cockpit_1218.log (366.26 KB, application/x-bzip), 2017-12-19 05:11 UTC, Yihui Zhao
sosreport+engine.log+vdsm.log+deploy.log (9.96 MB, application/x-bzip), 2017-12-19 07:05 UTC, Yihui Zhao
CLI_deploy+CLI_redeploy+CLI_redeploy_engine.log (102.73 KB, application/x-bzip), 2017-12-19 13:46 UTC, Yihui Zhao


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 85488 'None' MERGED wizard: Correct answer file type/value for firewallManager 2020-04-02 08:50:35 UTC
oVirt gerrit 85489 'None' MERGED wizard: Correct answer file type/value for firewallManager 2020-04-02 08:50:35 UTC

Description Yihui Zhao 2017-12-06 07:19:08 UTC
Created attachment 1363529 [details]
he.png

Description of problem:
During deployment, cockpit stays at the status waiting for the VDSM host to become operational and cannot proceed to the later steps.


Version-Release number of selected component (if applicable):
rhvh-4.2.0.5-0.20171123.0+1
rhvm-appliance-4.2-20171102.0.el7.noarch
cockpit-ovirt-dashboard-0.11.1-0.6.el7ev.noarch


How reproducible:
100%

Steps to Reproduce:
1. Install the latest RHVH 4.2
2. Upgrade the cockpit-ovirt-dashboard pkg, restart cockpit
3. Deploy HostedEngine

Actual results:
After step 3, cockpit stays at the status waiting for the VDSM host to become operational and cannot proceed to the later steps.


Expected results:
After step 3, HostedEngine deploys successfully.


Additional info:
Engine setup on the HE appliance succeeded.

Comment 1 Simone Tiraboschi 2017-12-06 13:07:52 UTC

*** This bug has been marked as a duplicate of bug 1517881 ***

Comment 2 Yihui Zhao 2017-12-07 02:25:17 UTC
Updated result:

Actual result:

After step 3, cockpit stays at the status waiting for the VDSM host to become operational.

After a long wait (more than four hours), the Hosted Engine setup completed successfully.

Comment 3 Yihui Zhao 2017-12-07 07:00:53 UTC
Created attachment 1364060 [details]
he_deploy.log

Comment 4 Yihui Zhao 2017-12-07 07:01:35 UTC
Created attachment 1364061 [details]
engine.log

Comment 5 Simone Tiraboschi 2017-12-07 09:07:55 UTC

*** This bug has been marked as a duplicate of bug 1512534 ***

Comment 6 Yihui Zhao 2017-12-07 09:56:23 UTC
(In reply to Yihui Zhao from comment #0)
> Created attachment 1363529 [details]
> he.png
> 
> Description of problem:
> In the deploying process, cockpit keep the status to wait the vdsm host
> become operational, then, can not execute the later operation.
> 
> 
> Version-Release number of selected component (if applicable):
> rhvh-4.2.0.5-0.20171123.0+1
> rhvm-appliance-4.2-20171102.0.el7.noarch
> cockpit-ovirt-dashboard-0.11.1-0.6.el7ev.noarch
> 
> 
> How reproducible:
> 100%
> 
> Steps to Reproduce:
> 1. Install the latest RHVH 4.2
> 2. Upgrade the cockpit-ovirt-dashboard pkg, restart cockpit
> 3. Deploy HostedEngine
> 
> Actual results:
> After step3,  cockpit keep the status to wait the vdsm host become
> operational, then, can not execute the later operation.
> 
> 
> Expected results:
> After step3, deploy HostedEngine successfully
> 
> 
> Additional info:
> HE appliance engine setup success.

Add the ovirt-hosted-engine-ha version:
  ovirt-hosted-engine-ha-2.2.0-0.2.master.gitcbe3c76.el7ev.noarch

Comment 7 cshao 2017-12-09 11:09:39 UTC
Reopening this bug based on comments #c22 and #c24 of bug 1512534; the behavior is different.

Comment 8 Yihui Zhao 2017-12-11 10:09:01 UTC
Update:

Test version:
cockpit-ovirt-dashboard-0.11.1-0.6.el7ev.noarch
vdsm-4.20.9-1.el7ev.x86_64
ovirt-hosted-engine-setup-2.2.0-2.el7ev.noarch
ovirt-hosted-engine-ha-2.2.0-1.el7ev.noarch
rhvh-4.2.0.5-0.20171207.0+1
rhvm-appliance-4.2-20171207.0.el7.noarch

Test steps:
1. Deploy HostedEngine via cockpit


Test result:

VDSM goes into recovery, followed by "Couldn't connect to VDSM":
"""
 Timed out while waiting for host to start. Please check the logs.
 Unable to add hp-bl460cg9-01.lab.eng.pek2.redhat.com to the manager
 Failed to execute stage 'Closing up': Couldn't connect to VDSM within 15 seconds
 Failed to execute stage 'Clean up': Request Host.stopMonitoringDomain with args {'sdUUID': '089c1971-ea47-457b-9651-11e87a945a48'} timed out after 900 seconds
 Hosted Engine deployment failed: this system is not reliable, please check the issue,fix and redeploy
"""

Comment 9 Phillip Bailey 2017-12-11 13:06:21 UTC
Yihui,

To make sure that this is actually related to the wizard, would it be possible for you to attempt running the CLI version of the installer and not use any answer files generated by the wizard?

Comment 10 Yihui Zhao 2017-12-11 14:40:59 UTC
(In reply to Phillip Bailey from comment #9)
> Yihui,
> 
> To make sure that this is actually related to the wizard, would it be
> possible for you to attempt running the CLI version of the installer and not
> use any answer files generated by the wizard?

HE deploys successfully with the CLI.

Comment 11 Red Hat Bugzilla Rules Engine 2017-12-12 09:57:00 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 18 Simone Tiraboschi 2017-12-13 13:04:09 UTC
Sorry for the confusion between 1517881 and 1512534; that was my fault.

In 1517881, engine-setup running on the appliance was stuck due to an SELinux issue there, and we should get:
 [ ERROR ] Engine setup got stuck on the appliance
 [ ERROR ] Failed to execute stage 'Closing up': Engine setup is stalled on the appliance since 1800 seconds ago.
but this is not our case.

In 1512534, instead, the issue is about reconnecting to vdsm to check the deployment status after host-deploy reconfigured vdsm.
Since the vdsm certificate got renewed by host-deploy, the reconnect mechanism in the JSON-RPC client silently fails in a loop until a timeout.
We had a workaround that sets a short timeout to mitigate the issue: http://gerrit.ovirt.org/84794

According to https://bugzilla.redhat.com/show_bug.cgi?id=1512534#c29 this is fine on RHEL7.4 but it's still an issue on RHEV-H.

I don't think that this is UI specific.
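
The failure mode described above (a reconnect loop that fails silently until an overall timeout) can be illustrated with a minimal sketch. This is not the vdsm JSON-RPC client code and not the change in the gerrit patch; `connect_with_deadline`, its `connect` argument, and the parameter names are hypothetical. It only shows why a bounded deadline that reports the last error turns a silent hang into a fast, diagnosable failure.

```python
import time


def connect_with_deadline(connect, timeout=15.0, interval=1.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Call connect() until it succeeds or `timeout` seconds elapse.

    `connect` is any zero-argument callable that raises OSError on failure
    (a hypothetical stand-in for a real client reconnect).
    """
    deadline = clock() + timeout
    last_error = None
    attempts = 0
    while clock() < deadline:
        attempts += 1
        try:
            return connect()
        except OSError as exc:
            # Remember the last error so the caller can see why the
            # reconnect failed instead of looping silently.
            last_error = exc
            sleep(interval)
    raise TimeoutError(
        "Couldn't connect within %s seconds after %d attempts: %r"
        % (timeout, attempts, last_error)
    )
```

The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.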

Comment 19 Yihui Zhao 2017-12-14 09:09:15 UTC
Created attachment 1367894 [details]
HE_stuck.png

Comment 21 Red Hat Bugzilla Rules Engine 2017-12-14 09:10:21 UTC
Target release should be placed once a package build is known to fix a issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for a oVirt release.

Comment 22 Yihui Zhao 2017-12-14 09:15:02 UTC
Created attachment 1367895 [details]
he_deploy_log

Comment 24 Yihui Zhao 2017-12-14 11:44:47 UTC
Please ignore comment 20; here is the summary:

HostedEngine deployment appears stuck, but after waiting one or two hours HostedEngine is up.

Test version:
rhvh-4.2.0.6-0.20171213.0+1
cockpit-ovirt-dashboard-0.11.2-0.1.el7ev.noarch
rhvm-appliance-4.2-20171207.0.el7.noarch
ovirt-hosted-engine-ha-2.2.1-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.1-1.el7ev.noarch


Test steps:
1. Deploy HostedEngine via cockpit

Additional info:
log : attachment 1367895 [details] -- he_deploy_log

Comment 25 Phillip Bailey 2017-12-14 12:58:57 UTC
What has been done to distinguish the delay as a cockpit UI problem? Once the deployment process has been started, the UI is only acting as a passthrough for the output from the CLI version of ovirt-hosted-engine-setup. The only bearing the UI actually has on the success/failure of the deployment is the answer file it generates (/tmp/he-setup-answerfile.conf).

In order to draw a correlation between the UI and the deployment delay, you need to also:

1. Perform an installation from the CLI without using the cockpit-generated answer file.

2. Perform an installation from the CLI using the cockpit-generated answer file.

If there is no delay in 1, but there is in 2, then there would appear to be some relationship between the use of the UI and the delay.
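
One way to narrow down such a comparison is to diff the answer file generated by the wizard against the one written by a CLI run. The sketch below assumes the `KEY=type:value` line format of otopi answer files and skips section headers such as `[environment:default]`; the function names `parse_answer_file` and `diff_answers` are hypothetical helpers, not part of any oVirt tool.

```python
def parse_answer_file(text):
    """Parse otopi-style 'KEY=type:value' lines into {key: (type, value)}."""
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and section headers.
        if not line or line.startswith("#") or line.startswith("["):
            continue
        key, sep, typed = line.partition("=")
        if not sep:
            continue
        vtype, _, value = typed.partition(":")
        entries[key.strip()] = (vtype, value)
    return entries


def diff_answers(cli_text, cockpit_text):
    """Return {key: (cli_entry, cockpit_entry)} for keys whose entries differ."""
    cli = parse_answer_file(cli_text)
    cockpit = parse_answer_file(cockpit_text)
    return {
        key: (cli.get(key), cockpit.get(key))
        for key in sorted(set(cli) | set(cockpit))
        if cli.get(key) != cockpit.get(key)
    }
```

Feeding it the contents of /tmp/he-setup-answerfile.conf and of an answer file saved by a CLI deployment would list every key whose type or value differs, including keys present in only one of the two files.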

Comment 26 Ryan Barry 2017-12-14 13:02:04 UTC
Is this only on rhvh?

Comment 27 Yihui Zhao 2017-12-14 13:04:47 UTC
(In reply to Ryan Barry from comment #26)
> Is this only on rhvh?

Yes.

Comment 28 Piotr Kliczewski 2017-12-14 13:19:38 UTC
I see in the logs that at 2017-12-14 16:05:37,362+0800 there was a failure to connect to vdsm due to 'Connection refused'. At 2017-12-14 16:06:15,898+0800 the message changed to 'Operation now in progress' and later (2017-12-14 16:06:44,924+0800) to 'No route to host'. The setup ended with:

2017-12-14 17:02:10,567+0800 ERROR otopi.plugins.gr_he_setup.engine.add_host add_host._wait_host_ready:122 Timed out while waiting for host to start. Please check the logs.

In vdsm logs I see that vdsm was killed (possibly OS restart):

2017-12-14 16:05:26,144+0800 INFO  (MainThread) [vds] Exiting (vdsmd:170)

and started at:

2017-12-14 16:07:15,387+0800 INFO  (MainThread) [vds] (PID: 20185) I am the actual vdsm 4.20.9.2-1.el7ev hp-z620-04.qe.lab.eng.nay.redhat.com (3.10.0-693.11.1.el7.x86_64) (vdsmd:148)

Is the network properly configured?
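
The timeline above can be checked with a few lines of Python. This is a small sketch that assumes the `YYYY-MM-DD HH:MM:SS,mmm+ZZZZ` timestamp layout of the quoted lines; `parse_log_ts` and `gap_seconds` are hypothetical helper names.

```python
from datetime import datetime

# Layout of timestamps such as '2017-12-14 16:05:26,144+0800'.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f%z"


def parse_log_ts(stamp):
    """Parse one log timestamp into a timezone-aware datetime."""
    return datetime.strptime(stamp, LOG_TS_FORMAT)


def gap_seconds(start, end):
    """Return the number of seconds between two log timestamps."""
    return (parse_log_ts(end) - parse_log_ts(start)).total_seconds()


# vdsm exit to vdsm start, then first connection error to the setup timeout.
vdsm_down = gap_seconds("2017-12-14 16:05:26,144+0800",
                        "2017-12-14 16:07:15,387+0800")
until_timeout = gap_seconds("2017-12-14 16:05:37,362+0800",
                            "2017-12-14 17:02:10,567+0800")
```

With the timestamps quoted in this comment, vdsm was down for about 109 seconds, while setup kept waiting for roughly 56.5 minutes after the first connection error before timing out.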

Comment 30 Yihui Zhao 2017-12-14 16:27:45 UTC
(In reply to Piotr Kliczewski from comment #28)
> I see that in the logs at 2017-12-14 16:05:37,362+0800 there was issue to
> connect to vdsm due to 'Connection refused'. At 2017-12-14 16:06:15,898+0800
> message changed to 'Operation now in progress' and later (2017-12-14
> 16:06:44,924+0800) to 'No route to host'. The setup ended with:
> 
> 2017-12-14 17:02:10,567+0800 ERROR otopi.plugins.gr_he_setup.engine.add_host
> add_host._wait_host_ready:122 Timed out while waiting for host to start.
> Please check the logs.
> 
> In vdsm logs I see that vdsm was killed (possibly OS restart):
> 
> 2017-12-14 16:05:26,144+0800 INFO  (MainThread) [vds] Exiting (vdsmd:170)
> 
> and started at:
> 
> 2017-12-14 16:07:15,387+0800 INFO  (MainThread) [vds] (PID: 20185) I am the
> actual vdsm 4.20.9.2-1.el7ev hp-z620-04.qe.lab.eng.nay.redhat.com
> (3.10.0-693.11.1.el7.x86_64) (vdsmd:148)
> 
> Is the network properly configure?

It seems that the VM went down while setup was waiting for vdsm to become operational.

Comment 31 Wei Wang 2017-12-15 03:58:40 UTC
Test Version:
CentOS-7-x86_64-DVD-1708.iso
cockpit-ovirt-dashboard-0.11.2-0.1.el7.centos.noarch
cockpit-system-155-1.el7.centos.noarch
cockpit-dashboard-155-1.el7.centos.x86_64
cockpit-storaged-155-1.el7.centos.noarch
cockpit-networkmanager-155-1.el7.centos.noarch
cockpit-ws-155-1.el7.centos.x86_64
cockpit-155-1.el7.centos.x86_64
cockpit-bridge-155-1.el7.centos.x86_64
ovirt-hosted-engine-setup-2.2.1-1.el7.centos.noarch
ovirt-hosted-engine-ha-2.2.1-1.el7.centos.noarch

Test steps:
1. Clean install CentOS-7-x86_64-DVD-1708.iso
2. Install cockpit and cockpit-ovirt-dashboard with yum
3. Deploy HostedEngine via cockpit UI 

Result:
HostedEngine deployment failed with error "Couldn't  connect to VDSM within 15 seconds"

The bug also occurs on a host installed from RHEL-7.4-20170711.0-Server-x86_64-dvd1.iso.

Comment 32 Wei Wang 2017-12-15 03:59:39 UTC
Created attachment 1368267 [details]
For centos7

Comment 33 Wei Wang 2017-12-15 04:01:07 UTC
Created attachment 1368268 [details]
For centos7

Comment 34 Martin Sivák 2017-12-15 08:52:37 UTC
Wei and Yihui: please also add vdsm versions

The 15s timeout issue is related to https://gerrit.ovirt.org/#/c/85416/

Comment 35 Martin Sivák 2017-12-15 08:57:26 UTC
Piotr: > In vdsm logs I see that vdsm was killed (possibly OS restart):

This looks like a manifestation of https://bugzilla.redhat.com/show_bug.cgi?id=1522878

Comment 36 Martin Sivák 2017-12-15 09:39:35 UTC
Please retest with fixes for the referenced bugs (new vdsm, new hosted engine). I believe this is now a duplicate of the original bug 1512534 as well, because of comment #31 (it happens on RHEL as well).

Comment 37 Simone Tiraboschi 2017-12-15 12:49:37 UTC
Here we are probably mixing several different but related issues, but at least one is certainly specific to the cockpit plugin:
by default "Firewall: configure IPTables:" is off in the cockpit plugin (see attached screenshot) and, more importantly, the cockpit plugin writes in the answer file for hosted-engine-setup

OVEHOSTED_NETWORK/firewallManager=bool:true or
OVEHOSTED_NETWORK/firewallManager=bool:false

while otopi expects 
OVEHOSTED_NETWORK/firewallManager=str:iptables or 
OVEHOSTED_NETWORK/firewallManager=none:None

with the first one it will open all the needed ports on iptables while with the second option it should print something like
[ INFO  ] Stage: Closing up
          The following network ports should be opened:
              tcp:5900
              tcp:5901
              tcp:9090
...

Please note that the request for manual firewall configuration is also printed only if OVEHOSTED_NETWORK/firewallManager is None, not if it is false as configured by the cockpit plugin.

https://github.com/oVirt/ovirt-hosted-engine-setup/blob/master/src/plugins/gr-he-setup/network/firewall_manager.py#L239
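
The type/value mismatch described above can be sketched as a small mapping from the wizard's boolean toggle to the typed answer-file values. This is an illustrative sketch only, not the merged cockpit-ovirt change; the function name `firewall_manager_answer` is hypothetical, and only the key and the two typed values are taken from this comment.

```python
def firewall_manager_answer(configure_iptables):
    """Map a boolean UI toggle to the typed value otopi expects.

    True  -> 'str:iptables' (setup opens the needed ports on iptables)
    False -> 'none:None'    (setup prints the ports to open manually)
    """
    key = "OVEHOSTED_NETWORK/firewallManager"
    if configure_iptables:
        return key + "=str:iptables"
    return key + "=none:None"
```

Writing `bool:true` or `bool:false` instead, as the plugin did, matches neither of the two values hosted-engine-setup checks for.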

Comment 38 Simone Tiraboschi 2017-12-15 12:55:59 UTC
Created attachment 1368464 [details]
cockpit_iptables.png

Comment 39 Yihui Zhao 2017-12-18 15:03:33 UTC
Update :

Test version:
ovirt-hosted-engine-setup-2.2.1-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.2-1.el7ev.noarch
cockpit-ovirt-dashboard-0.11.2-0.1.el7ev.noarch
rhvm-appliance-4.2-20171207.0.el7.noarch
vdsm-4.20.9.2-1.el7ev.x86_64

Test steps:
1. Update the ovirt-hosted-engine-ha pkg
2. Deploy HostedEngine via cockpit


Actual results:
1. After step 2, the deploy log shows the errors below, and setup stays at "waiting for VDSM host to become operational" for a long time.

2017-12-18 23:01:28,888+0800 DEBUG otopi.plugins.gr_he_setup.engine.add_host add_host._wait_host_ready:92 VDSM host in  state
2017-12-18 23:01:31,894+0800 DEBUG otopi.plugins.gr_he_setup.engine.add_host add_host._wait_host_ready:86 Error fetching host state: [ERROR]::oVirt API connection failure, (7, 'Failed connect to rhevh-hostedengine-vm-03.qe.lab.eng.nay.**FILTERED**.com:443; No route to host')
2017-12-18 23:01:31,894+0800 DEBUG otopi.plugins.gr_he_setup.engine.add_host add_host._wait_host_ready:92 VDSM host in  state
2017-12-18 23:01:34,900+0800 DEBUG otopi.plugins.gr_he_setup.engine.add_host add_host._wait_host_ready:86 Error fetching host state: [ERROR]::oVirt API connection failure, (7, 'Failed connect to rhevh-hostedengine-vm-03.qe.lab.eng.nay.**FILTERED**.com:443; No route to host')
2017-12-18 23:01:34,900+0800 DEBUG otopi.plugins.gr_he_setup.engine.add_host add_host._wait_host_ready:92 VDSM host in  state

Comment 40 Simone Tiraboschi 2017-12-18 15:37:04 UTC
Thanks Yihui,
was rhevh-hostedengine-vm-03.qe.lab.eng.nay.redhat.com correctly resolvable there?
Could you please attach hosted-engine-setup log file for your latest attempt?
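
A quick way to tell a name resolution problem apart from a routing or firewall problem is to attempt the same TCP connection by hand. The sketch below uses only the Python standard library; `probe` and `classify_connect_error` are hypothetical helper names, and port 443 is taken from the log lines in comment 39.

```python
import errno
import socket


def classify_connect_error(exc):
    """Turn a connection exception into a short diagnosis string."""
    if isinstance(exc, socket.gaierror):
        return "name resolution failed"
    if isinstance(exc, socket.timeout):
        return "timed out"
    if exc.errno == errno.EHOSTUNREACH:
        return "no route to host"
    if exc.errno == errno.ECONNREFUSED:
        return "connection refused"
    return "connect failed: %s" % exc


def probe(host, port=443, timeout=5.0):
    """Try a TCP connection and report what happened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"
    except OSError as exc:
        return classify_connect_error(exc)
```

Running `probe("<engine VM FQDN>")` from the host would distinguish "name resolution failed" from "no route to host"; the latter points at routing, a firewall, or a VM that is down rather than at DNS.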

Comment 41 Yihui Zhao 2017-12-18 16:21:12 UTC
Created attachment 1369591 [details]
no_route_deploy_log

Comment 42 Yihui Zhao 2017-12-18 16:22:13 UTC
(In reply to Simone Tiraboschi from comment #40)
> Thanks Yihui,
> was rhevh-hostedengine-vm-03.qe.lab.eng.nay.redhat.com corerctly resolvable
> there?
> Could you please attach hosted-engine-setup log file for your latest attempt?

See the attachment 1369591 [details].

Comment 43 Yihui Zhao 2017-12-18 16:25:00 UTC
(In reply to Simone Tiraboschi from comment #40)
> Thanks Yihui,
> was rhevh-hostedengine-vm-03.qe.lab.eng.nay.redhat.com corerctly resolvable
> there?
> Could you please attach hosted-engine-setup log file for your latest attempt?

The HE VM may have gone down while setup was waiting for the VDSM host to become operational.

Comment 44 Simone Tiraboschi 2017-12-18 17:03:50 UTC
(In reply to Yihui Zhao from comment #43)
> (In reply to Simone Tiraboschi from comment #40)
> > Thanks Yihui,
> > was rhevh-hostedengine-vm-03.qe.lab.eng.nay.redhat.com corerctly resolvable
> > there?
> > Could you please attach hosted-engine-setup log file for your latest attempt?
> 
> The HE-VM may be down while waiting for VDSM Host become operational.

Are you able to bring it up and extract engine.log from there?

Comment 45 Simone Tiraboschi 2017-12-18 17:06:38 UTC
(In reply to Simone Tiraboschi from comment #44)
> Are you able to bring it up and extract engine.log from there?

Nikolai reported something similar here: https://bugzilla.redhat.com/show_bug.cgi?id=1525907#c22

Comment 46 Yihui Zhao 2017-12-19 05:10:31 UTC
Created attachment 1369773 [details]
test_he_cli_1218.log

Comment 47 Yihui Zhao 2017-12-19 05:11:28 UTC
Created attachment 1369774 [details]
test_he_cockpit_1218.log

Comment 48 Yihui Zhao 2017-12-19 05:13:22 UTC
Update:

Test version:
rhvh-4.2.0.6-0.20171218.0+1
ovirt-hosted-engine-ha-2.2.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.2-1.el7ev.noarch
cockpit-ovirt-dashboard-0.11.3-0.1.el7ev.noarch
rhvm-appliance-4.2-20171207.0.el7.noarch
vdsm-4.20.9.3-1.el7ev.x86_64

Test steps:
1. Deploy HostedEngine via cockpit or CLI



Actual results:
1. After step 1, HostedEngine deployment failed with both cockpit and the CLI.


Additional info:
Error messages from CLI:

[ INFO  ] Still waiting for VDSM host to become operational...
[ INFO  ] Still waiting for VDSM host to become operational...
[ ERROR ] Timed out while waiting for host to start. Please check the logs.
[ ERROR ] Unable to add dhcp-8-176.nay.redhat.com to the manager
[ ERROR ] Failed to execute stage 'Closing up': Couldn't  connect to VDSM within 20 seconds
[ INFO  ] Stage: Clean up
[ ERROR ] Failed to execute stage 'Clean up': Request Host.stopMonitoringDomain with args {'sdUUID': 'dea166f4-7109-47e4-8baa-15302e6eb1bf'} timed out after 900 seconds
[ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20171218231100.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
[ ERROR ] Hosted Engine deployment failed: this system is not reliable, please check the issue,fix and redeploy
          Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20171218223123-o6lcez.log


Details for attachment 1369773 [details] : test_he_cli_1218.log
            attachment 1369774 [details] : test_he_cockpit_1218.log

Comment 49 Yihui Zhao 2017-12-19 07:05:36 UTC
Created attachment 1369781 [details]
sosreport+engine.log+vdsm.log+deploy.log

Comment 51 Simone Tiraboschi 2017-12-19 07:57:22 UTC
We have engine.log only for the CLI case and not for the cockpit one, but on the hosted-engine-setup side the symptoms are almost the same: the engine is not able to deploy the host correctly, and hosted-engine-setup fails after monitoring it for some time.

Focusing on the CLI case, where we have engine.log:

The engine fails deploying the host due to:

2017-12-18 23:44:25,295-05 ERROR [org.ovirt.engine.core.bll.AddUnmanagedVmsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-77) [c1a14d6] Command 'org.ovirt.engine.core.bll.AddUnmanagedVmsCommand' failed: No enum constant org.ovirt.engine.core.common.businessentities.network.VmInterfaceType.virtio
2017-12-18 23:44:25,295-05 ERROR [org.ovirt.engine.core.bll.AddUnmanagedVmsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-77) [c1a14d6] Exception: java.lang.IllegalArgumentException: No enum constant org.ovirt.engine.core.common.businessentities.network.VmInterfaceType.virtio
	at java.lang.Enum.valueOf(Enum.java:238) [rt.jar:1.8.0_151]
	at org.ovirt.engine.core.common.businessentities.network.VmInterfaceType.valueOf(VmInterfaceType.java:6) [common.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerObjectsBuilder.buildVmNetworkInterfacesFromDevices(VdsBrokerObjectsBuilder.java:232) [vdsbroker.jar:]
	at org.ovirt.engine.core.bll.AddUnmanagedVmsCommand.importHostedEngineVm(AddUnmanagedVmsCommand.java:181) [bll.jar:]
	at org.ovirt.engine.core.bll.AddUnmanagedVmsCommand.convertVm(AddUnmanagedVmsCommand.java:111) [bll.jar:]

This is repeated 238 times in engine.log
I'm going to open a separate bug on it.
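
Counting how often an error repeats in engine.log can be scripted. The sketch below groups ERROR lines by the logger name in the first pair of square brackets, assuming the line layout shown in the excerpt above; `count_errors_by_logger` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches lines such as:
# "2017-12-18 23:44:25,295-05 ERROR [org.ovirt...SomeCommand] (thread) ..."
ERROR_LINE = re.compile(r"^\S+ \S+ ERROR \[([^\]]+)\]")


def count_errors_by_logger(lines):
    """Count ERROR lines per logger; stack trace lines are ignored."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open engine.log file object to `count_errors_by_logger` and calling `.most_common(5)` on the result lists the loggers that report the most errors.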

Yihui, could you please repeat the test on cockpit and, when you see 
 Adding the host to the cluster
 Waiting for the host to become operational in the engine. This may take several minutes...

connect to the engine VM to download engine.log, just to be sure that the issue is really the same?

Comment 52 Simone Tiraboschi 2017-12-19 10:52:03 UTC
Ok, this starts making a bit of sense now.

On rhevh-hostedengine-vm-03 we see a flood of
2017-12-19 04:14:55,517-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-11) [] Command 'GetCapabilitiesVDSCommand(HostName = hp-z620-04.qe.lab.eng.nay.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{hostId='7ef5a936-a31b-4f4f-aed2-a9531a23f3c8', vds='Host[hp-z620-04.qe.lab.eng.nay.redhat.com,7ef5a936-a31b-4f4f-aed2-a9531a23f3c8]'})' execution failed: java.net.NoRouteToHostException: No route to host
2017-12-19 04:14:55,517-05 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-11) [] Failure to refresh host 'hp-z620-04.qe.lab.eng.nay.redhat.com' runtime info: java.net.NoRouteToHostException: No route to host

So technically the engine VM never triggered host-deploy on the host and indeed the host was in Non Responsive state in the engine.

In the past, some manual steps could be required, especially on the network setup side, to bring the host up, and hosted-engine-setup could also conclude successfully with the host in Non Responsive state.
https://github.com/oVirt/ovirt-hosted-engine-setup/blob/master/src/plugins/gr-he-setup/engine/add_host.py#L193

So indeed on the host we see:
2017-12-19 03:49:50,345-0500 ERROR otopi.plugins.gr_he_setup.engine.add_host add_host._wait_host_ready:122 Timed out while waiting for host to start. Please check the logs.
2017-12-19 03:49:50,346-0500 ERROR otopi.plugins.gr_he_setup.engine.add_host add_host._closeup:662 Unable to add hp-z620-04.qe.lab.eng.nay.redhat.com to the manager

but then:
2017-12-19 03:53:15,394-0500 INFO otopi.plugins.gr_he_common.core.misc misc._terminate:251 Hosted Engine successfully deployed

Comment 53 Ryan Barry 2017-12-19 11:11:41 UTC
Do you think this is a failure in HE setup, or the test environment/network?

Comment 55 Yihui Zhao 2017-12-19 13:46:16 UTC
Created attachment 1370037 [details]
CLI_deploy+CLI_redeploy+CLI_redeploy_engine.log

Comment 58 Yihui Zhao 2017-12-20 12:03:54 UTC
Test version:
cockpit-ws-155-1.el7.x86_64
cockpit-bridge-155-1.el7.x86_64
cockpit-system-155-1.el7.noarch
cockpit-storaged-155-1.el7.noarch
cockpit-ovirt-dashboard-0.11.3-0.1.el7ev.noarch
cockpit-dashboard-155-1.el7.x86_64
cockpit-155-1.el7.x86_64
vdsm-4.20.9.3-1.el7ev.x86_64
rhvm-appliance-4.2-20171219.0.el7.noarch
ovirt-hosted-engine-ha-2.2.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.2-1.el7ev.noarch

Test steps:
1. Clean install latest RHVH4.2
2. Deploy HostedEngine via cockpit

Test result:
After step 2, HostedEngine deploys successfully.
[root@hp-dl385pg8-11 ~]# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : hp-dl385pg8-11.lab.eng.pek2.redhat.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : c7321e13
local_conf_timestamp               : 9732
Host timestamp                     : 9731
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=9731 (Wed Dec 20 06:51:59 2017)
	host-id=1
	score=3400
	vm_conf_refresh_time=9732 (Wed Dec 20 06:52:01 2017)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUp
	stopped=False

[oVirt shell (connected)]# list hosts --show-all |grep "status-state"
WARNING: yacc table file version is out of date
external_status-state                                       : ok
spm-status-state                                            : none
status-state                                                : up


So there is no dependency on bug 1522878 or bug 1527318; I am removing the dependency.
I am changing the bug's status to VERIFIED.

If an issue with SHE redeployment exists, I will report a new bug to track it.
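
The check performed by eye on the `hosted-engine --vm-status` output above can also be scripted. This is a minimal sketch that assumes single-host output in the `key : value` layout shown in this comment, with the "Engine status" value being JSON; `parse_vm_status` and `engine_is_up` are hypothetical helper names.

```python
import json


def parse_vm_status(text):
    """Parse 'key : value' lines from `hosted-engine --vm-status` output."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:
            status[key.strip()] = value.strip()
    return status


def engine_is_up(text):
    """Return True if the engine VM is up and reports good health."""
    engine = json.loads(parse_vm_status(text)["Engine status"])
    return engine.get("vm") == "up" and engine.get("health") == "good"
```

With several hosts in the output, later hosts would overwrite earlier ones in this sketch, so the text would need to be split per "Host N status" block first.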

Comment 59 Sandro Bonazzola 2017-12-22 06:50:31 UTC
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be
resolved in oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

