Created attachment 1363529 [details]
he.png

Description of problem:
During deployment, cockpit stays in the "waiting for the VDSM host to become operational" state, so the later steps cannot be executed.

Version-Release number of selected component (if applicable):
rhvh-4.2.0.5-0.20171123.0+1
rhvm-appliance-4.2-20171102.0.el7.noarch
cockpit-ovirt-dashboard-0.11.1-0.6.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install the latest RHVH 4.2
2. Upgrade the cockpit-ovirt-dashboard pkg and restart cockpit
3. Deploy HostedEngine

Actual results:
After step 3, cockpit stays in the "waiting for the VDSM host to become operational" state, so the later steps cannot be executed.

Expected results:
After step 3, HostedEngine deploys successfully.

Additional info:
engine-setup succeeds on the HE appliance.
*** This bug has been marked as a duplicate of bug 1517881 ***
Updating the result:

Actual result:
After step 3, cockpit stays waiting for the VDSM host to become operational; after a long wait (more than four hours), the Hosted Engine setup completed successfully.
Created attachment 1364060 [details] he_deploy.log
Created attachment 1364061 [details] engine.log
*** This bug has been marked as a duplicate of bug 1512534 ***
(In reply to Yihui Zhao from comment #0)
> Description of problem:
> In the deploying process, cockpit keep the status to wait the vdsm host
> become operational, then, can not execute the later operation.

Adding the ovirt-hosted-engine-ha version:
ovirt-hosted-engine-ha-2.2.0-0.2.master.gitcbe3c76.el7ev.noarch
Re-opening this bug per comments #22 and #24 of bug 1512534; the behaviors are different.
Update:

Test version:
cockpit-ovirt-dashboard-0.11.1-0.6.el7ev.noarch
vdsm-4.20.9-1.el7ev.x86_64
ovirt-hosted-engine-setup-2.2.0-2.el7ev.noarch
ovirt-hosted-engine-ha-2.2.0-1.el7ev.noarch
rhvh-4.2.0.5-0.20171207.0+1
rhvm-appliance-4.2-20171207.0.el7.noarch

Test steps:
1. Deploy HostedEngine via cockpit

Test result:
VDSM recovers, but the setup cannot connect to VDSM:
"""
Timed out while waiting for host to start. Please check the logs.
Unable to add hp-bl460cg9-01.lab.eng.pek2.redhat.com to the manager
Failed to execute stage 'Closing up': Couldn't connect to VDSM within 15 seconds
Failed to execute stage 'Clean up': Request Host.stopMonitoringDomain with args {'sdUUID': '089c1971-ea47-457b-9651-11e87a945a48'} timed out after 900 seconds
Hosted Engine deployment failed: this system is not reliable, please check the issue, fix and redeploy
"""
Yihui, To make sure that this is actually related to the wizard, would it be possible for you to attempt running the CLI version of the installer and not use any answer files generated by the wizard?
(In reply to Phillip Bailey from comment #9) > Yihui, > > To make sure that this is actually related to the wizard, would it be > possible for you to attempt running the CLI version of the installer and not > use any answer files generated by the wizard? Deploy HE successfully with CLI.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Sorry for the confusion between 1517881 and 1512534; my fault.

In 1517881, engine-setup running on the appliance was stuck due to an SELinux issue there, and we should get:
 [ ERROR ] Engine setup got stuck on the appliance
 [ ERROR ] Failed to execute stage 'Closing up': Engine setup is stalled on the appliance since 1800 seconds ago.
but this is not our case.

In 1512534, instead, the issue is about reconnecting to vdsm to check the deployment status after host-deploy reconfigured vdsm. Since the vdsm cert got renewed by host-deploy, the reconnect mechanism in the JSON-RPC client silently fails in a loop until a timeout. We have a workaround setting a short timeout to mitigate the issue: http://gerrit.ovirt.org/84794

According to https://bugzilla.redhat.com/show_bug.cgi?id=1512534#c29 this is fine on RHEL 7.4, but it's still an issue on RHEV-H. I don't think this is UI specific.
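The mitigation amounts to bounding the reconnect loop with an overall deadline instead of letting it silently retry forever. A minimal Python sketch of the idea (illustrative only — names and structure are assumptions, not the actual json-rpc client code):

```python
import socket
import time


def connect_with_deadline(host, port, deadline_s=15.0, retry_s=1.0):
    """Retry a TCP connect until an overall deadline expires.

    Each failed attempt (e.g. connection refused while vdsm restarts
    with a renewed cert) is retried, but the whole loop gives up after
    deadline_s seconds instead of looping indefinitely.
    """
    start = time.monotonic()
    while True:
        try:
            return socket.create_connection((host, port), timeout=retry_s)
        except OSError:
            if time.monotonic() - start >= deadline_s:
                raise TimeoutError(
                    "Couldn't connect to %s:%s within %.0f seconds"
                    % (host, port, deadline_s))
            time.sleep(retry_s)
```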
Created attachment 1367894 [details] HE_stuck.png
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Created attachment 1367895 [details] he_deploy_log
Please ignore comment 20; here is the summary:

Deploying HostedEngine appears to be stuck, but after one or two hours the HostedEngine is up.

Test version:
rhvh-4.2.0.6-0.20171213.0+1
cockpit-ovirt-dashboard-0.11.2-0.1.el7ev.noarch
rhvm-appliance-4.2-20171207.0.el7.noarch
ovirt-hosted-engine-ha-2.2.1-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.1-1.el7ev.noarch

Test steps:
1. Deploy HostedEngine via cockpit

Additional info:
log: attachment 1367895 [details] -- he_deploy_log
What has been done to distinguish the delay as a cockpit UI problem? Once the deployment process has been started, the UI acts only as a passthrough for the output of the CLI version of ovirt-hosted-engine-setup. The only bearing the UI actually has on the success/failure of the deployment is the answer file it generates (/tmp/he-setup-answerfile.conf).

In order to draw a correlation between the UI and the deployment delay, you need to also:

1. Perform an installation from the CLI without using the cockpit-generated answer file.
2. Perform an installation from the CLI using the cockpit-generated answer file.

If there is no delay in 1, but there is in 2, then there would appear to be some relationship between the use of the UI and the delay.
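For reference, the two CLI runs could look roughly like this (a sketch: `--config-append` is the usual way to feed an answer file to the CLI installer, and the path is the cockpit-generated file mentioned above):

```shell
# 1. CLI installation without the cockpit-generated answer file
hosted-engine --deploy

# 2. CLI installation reusing the cockpit-generated answer file
hosted-engine --deploy --config-append=/tmp/he-setup-answerfile.conf
```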
Is this only on rhvh?
(In reply to Ryan Barry from comment #26) > Is this only on rhvh? Yes.
I see in the logs that at 2017-12-14 16:05:37,362+0800 there was an issue connecting to vdsm due to 'Connection refused'. At 2017-12-14 16:06:15,898+0800 the message changed to 'Operation now in progress' and later (2017-12-14 16:06:44,924+0800) to 'No route to host'. The setup ended with:

2017-12-14 17:02:10,567+0800 ERROR otopi.plugins.gr_he_setup.engine.add_host add_host._wait_host_ready:122 Timed out while waiting for host to start. Please check the logs.

In the vdsm logs I see that vdsm was killed (possibly an OS restart):

2017-12-14 16:05:26,144+0800 INFO (MainThread) [vds] Exiting (vdsmd:170)

and started at:

2017-12-14 16:07:15,387+0800 INFO (MainThread) [vds] (PID: 20185) I am the actual vdsm 4.20.9.2-1.el7ev hp-z620-04.qe.lab.eng.nay.redhat.com (3.10.0-693.11.1.el7.x86_64) (vdsmd:148)

Is the network properly configured?
(In reply to Piotr Kliczewski from comment #28)
> I see that in the logs at 2017-12-14 16:05:37,362+0800 there was issue to
> connect to vdsm due to 'Connection refused'. At 2017-12-14 16:06:15,898+0800
> message changed to 'Operation now in progress' and later (2017-12-14
> 16:06:44,924+0800) to 'No route to host'.
>
> Is the network properly configured?

It seems that the engine VM is down while the setup is waiting for the VDSM host to become operational.
Test Version:
CentOS-7-x86_64-DVD-1708.iso
cockpit-ovirt-dashboard-0.11.2-0.1.el7.centos.noarch
cockpit-system-155-1.el7.centos.noarch
cockpit-dashboard-155-1.el7.centos.x86_64
cockpit-storaged-155-1.el7.centos.noarch
cockpit-networkmanager-155-1.el7.centos.noarch
cockpit-ws-155-1.el7.centos.x86_64
cockpit-155-1.el7.centos.x86_64
cockpit-bridge-155-1.el7.centos.x86_64
ovirt-hosted-engine-setup-2.2.1-1.el7.centos.noarch
ovirt-hosted-engine-ha-2.2.1-1.el7.centos.noarch

Test steps:
1. Clean install CentOS-7-x86_64-DVD-1708.iso
2. Yum install cockpit and cockpit-ovirt-dashboard
3. Deploy HostedEngine via the cockpit UI

Result:
HostedEngine deployment fails with the error "Couldn't connect to VDSM within 15 seconds".

The bug is also reproducible with a RHEL-7.4-20170711.0-Server-x86_64-dvd1.iso host.
Created attachment 1368267 [details] For centos7
Created attachment 1368268 [details] For centos7
Wei and Yihui: please also add vdsm versions.

The 15s timeout issue is related to https://gerrit.ovirt.org/#/c/85416/
Piotr: > In vdsm logs I see that vdsm was killed (possibly OS restart): This looks like a manifestation of https://bugzilla.redhat.com/show_bug.cgi?id=1522878
Please retest with the fixes for the referenced bugs (new vdsm, new hosted-engine). I believe this is now a duplicate of the original bug 1512534 as well, because of comment #31 (it happens on RHEL too).
Here we are probably overlapping a lot of different related issues, but at least one is for sure specific to the cockpit plugin: by default "Firewall: configure IPTables:" is off in the cockpit plugin (see attached screenshot) but, more than that, the cockpit plugin writes in the answer file for hosted-engine-setup
 OVEHOSTED_NETWORK/firewallManager=bool:true
or
 OVEHOSTED_NETWORK/firewallManager=bool:false
while otopi expects
 OVEHOSTED_NETWORK/firewallManager=str:iptables
or
 OVEHOSTED_NETWORK/firewallManager=none:None

With the first one it will open all the needed ports on iptables, while with the second option it should print something like
 [ INFO ] Stage: Closing up
 The following network ports should be opened:
 tcp:5900
 tcp:5901
 tcp:9090
 ...

Please note that the request for manual firewall configuration is also printed only if OVEHOSTED_NETWORK/firewallManager is None, not if it is false as configured by the cockpit plugin.
https://github.com/oVirt/ovirt-hosted-engine-setup/blob/master/src/plugins/gr-he-setup/network/firewall_manager.py#L239
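The mismatch above is a typed-value problem: otopi answer files encode values as `KEY=type:value`, and the consumer checks the decoded value, not the raw string. A simplified Python sketch of that decoding (assumed simplification of otopi's format, for illustration only) shows why `bool:true` never matches what hosted-engine-setup looks for:

```python
def parse_answer(line):
    """Parse an otopi-style answer-file line 'KEY=type:value'.

    Simplified sketch: only the three types relevant here are handled.
    """
    key, _, typed = line.partition("=")
    vtype, _, raw = typed.partition(":")
    if vtype == "str":
        value = raw
    elif vtype == "bool":
        value = raw.lower() == "true"
    elif vtype == "none":
        value = None
    else:
        raise ValueError("unknown type %r" % vtype)
    return key, value


def firewall_manager_ok(value):
    # hosted-engine-setup effectively expects the string 'iptables'
    # (configure iptables) or None (print the manual-port list);
    # a boolean True/False matches neither branch.
    return value == "iptables" or value is None
```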
Created attachment 1368464 [details] cockpit_iptables.png
Update:

Test version:
ovirt-hosted-engine-setup-2.2.1-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.2-1.el7ev.noarch
cockpit-ovirt-dashboard-0.11.2-0.1.el7ev.noarch
rhvm-appliance-4.2-20171207.0.el7.noarch
vdsm-4.20.9.2-1.el7ev.x86_64

Test steps:
1. Update the ovirt-hosted-engine-ha pkg
2. Deploy HostedEngine via cockpit

Actual results:
1. After step 2, error entries appear in the deploy log, and "waiting for the VDSM host to become operational" persists for a long time.

2017-12-18 23:01:28,888+0800 DEBUG otopi.plugins.gr_he_setup.engine.add_host add_host._wait_host_ready:92 VDSM host in state
2017-12-18 23:01:31,894+0800 DEBUG otopi.plugins.gr_he_setup.engine.add_host add_host._wait_host_ready:86 Error fetching host state: [ERROR]::oVirt API connection failure, (7, 'Failed connect to rhevh-hostedengine-vm-03.qe.lab.eng.nay.**FILTERED**.com:443; No route to host')
2017-12-18 23:01:31,894+0800 DEBUG otopi.plugins.gr_he_setup.engine.add_host add_host._wait_host_ready:92 VDSM host in state
2017-12-18 23:01:34,900+0800 DEBUG otopi.plugins.gr_he_setup.engine.add_host add_host._wait_host_ready:86 Error fetching host state: [ERROR]::oVirt API connection failure, (7, 'Failed connect to rhevh-hostedengine-vm-03.qe.lab.eng.nay.**FILTERED**.com:443; No route to host')
2017-12-18 23:01:34,900+0800 DEBUG otopi.plugins.gr_he_setup.engine.add_host add_host._wait_host_ready:92 VDSM host in state
Thanks Yihui,
was rhevh-hostedengine-vm-03.qe.lab.eng.nay.redhat.com correctly resolvable there?
Could you please attach the hosted-engine-setup log file for your latest attempt?
Created attachment 1369591 [details] no_route_deploy_log
(In reply to Simone Tiraboschi from comment #40)
> Thanks Yihui,
> was rhevh-hostedengine-vm-03.qe.lab.eng.nay.redhat.com correctly resolvable
> there?
> Could you please attach the hosted-engine-setup log file for your latest attempt?

See attachment 1369591 [details].
(In reply to Simone Tiraboschi from comment #40)
> was rhevh-hostedengine-vm-03.qe.lab.eng.nay.redhat.com correctly resolvable
> there?

The HE VM may be down while the setup is waiting for the VDSM host to become operational.
(In reply to Yihui Zhao from comment #43)
> The HE-VM may be down while waiting for VDSM Host become operational.

Are you able to bring it up and extract engine.log from there?
(In reply to Simone Tiraboschi from comment #44)
> Are you able to bring it up and extract engine.log from there?

Nikolai reported something similar here:
https://bugzilla.redhat.com/show_bug.cgi?id=1525907#c22
Created attachment 1369773 [details] test_he_cli_1218.log
Created attachment 1369774 [details] test_he_cockpit_1218.log
Update:

Test version:
rhvh-4.2.0.6-0.20171218.0+1
ovirt-hosted-engine-ha-2.2.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.2-1.el7ev.noarch
cockpit-ovirt-dashboard-0.11.3-0.1.el7ev.noarch
rhvm-appliance-4.2-20171207.0.el7.noarch
vdsm-4.20.9.3-1.el7ev.x86_64

Test steps:
1. Deploy HostedEngine via cockpit or CLI

Actual results:
1. After step 1, deploying HostedEngine fails from both cockpit and the CLI.

Additional info:
Error messages from the CLI:

[ INFO ] Still waiting for VDSM host to become operational...
[ INFO ] Still waiting for VDSM host to become operational...
[ ERROR ] Timed out while waiting for host to start. Please check the logs.
[ ERROR ] Unable to add dhcp-8-176.nay.redhat.com to the manager
[ ERROR ] Failed to execute stage 'Closing up': Couldn't connect to VDSM within 20 seconds
[ INFO ] Stage: Clean up
[ ERROR ] Failed to execute stage 'Clean up': Request Host.stopMonitoringDomain with args {'sdUUID': 'dea166f4-7109-47e4-8baa-15302e6eb1bf'} timed out after 900 seconds
[ INFO ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20171218231100.conf'
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination
[ ERROR ] Hosted Engine deployment failed: this system is not reliable, please check the issue, fix and redeploy
Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20171218223123-o6lcez.log

Details:
attachment 1369773 [details]: test_he_cli_1218.log
attachment 1369774 [details]: test_he_cockpit_1218.log
Created attachment 1369781 [details] sosreport+engine.log+vdsm.log+deploy.log
We have engine.log only for the CLI case and not for the cockpit one, but on the hosted-engine-setup side the symptoms are almost the same: the engine is not able to correctly deploy the host, and hosted-engine-setup fails after monitoring it for some time.

Focusing on the CLI case, where we have engine.log:
the engine fails deploying the host due to:

2017-12-18 23:44:25,295-05 ERROR [org.ovirt.engine.core.bll.AddUnmanagedVmsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-77) [c1a14d6] Command 'org.ovirt.engine.core.bll.AddUnmanagedVmsCommand' failed: No enum constant org.ovirt.engine.core.common.businessentities.network.VmInterfaceType.virtio
2017-12-18 23:44:25,295-05 ERROR [org.ovirt.engine.core.bll.AddUnmanagedVmsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-77) [c1a14d6] Exception: java.lang.IllegalArgumentException: No enum constant org.ovirt.engine.core.common.businessentities.network.VmInterfaceType.virtio
	at java.lang.Enum.valueOf(Enum.java:238) [rt.jar:1.8.0_151]
	at org.ovirt.engine.core.common.businessentities.network.VmInterfaceType.valueOf(VmInterfaceType.java:6) [common.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerObjectsBuilder.buildVmNetworkInterfacesFromDevices(VdsBrokerObjectsBuilder.java:232) [vdsbroker.jar:]
	at org.ovirt.engine.core.bll.AddUnmanagedVmsCommand.importHostedEngineVm(AddUnmanagedVmsCommand.java:181) [bll.jar:]
	at org.ovirt.engine.core.bll.AddUnmanagedVmsCommand.convertVm(AddUnmanagedVmsCommand.java:111) [bll.jar:]

This is repeated 238 times in engine.log. I'm going to open a separate bug on it.

Yihui, could you please repeat the test on cockpit and, when you see

 Adding the host to the cluster
 Waiting for the host to become operational in the engine. This may take several minutes...

connect to the engine VM and download engine.log, just to be sure that the issue is really the same?
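The failure mode above is a strict enum lookup: vdsm reports the NIC model as "virtio", and Java's Enum.valueOf() throws because the engine-side enum has no constant with that exact name. The same pattern, sketched in Python (the constant list here is invented for illustration and is not the real engine enum):

```python
from enum import Enum


class VmInterfaceType(Enum):
    # Illustrative subset only -- not the actual engine-side constants.
    rtl8139 = 1
    e1000 = 2
    pv = 3


def build_nic_type(reported_model: str) -> VmInterfaceType:
    # Strict name lookup, analogous to Java's Enum.valueOf(): an
    # unrecognized constant such as "virtio" raises instead of
    # mapping to a default value.
    return VmInterfaceType[reported_model]
```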
OK, this starts making a bit of sense now.
On rhevh-hostedengine-vm-03 we see a flood of

2017-12-19 04:14:55,517-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-11) [] Command 'GetCapabilitiesVDSCommand(HostName = hp-z620-04.qe.lab.eng.nay.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{hostId='7ef5a936-a31b-4f4f-aed2-a9531a23f3c8', vds='Host[hp-z620-04.qe.lab.eng.nay.redhat.com,7ef5a936-a31b-4f4f-aed2-a9531a23f3c8]'})' execution failed: java.net.NoRouteToHostException: No route to host
2017-12-19 04:14:55,517-05 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-11) [] Failure to refresh host 'hp-z620-04.qe.lab.eng.nay.redhat.com' runtime info: java.net.NoRouteToHostException: No route to host

So technically the engine VM never triggered host-deploy on the host, and indeed the host was in the Non Responsive state in the engine.

In the past, some manual steps could potentially be required, especially on the network-setup side, to bring the host up, and hosted-engine-setup could also successfully conclude with the host in the Non Responsive state.
https://github.com/oVirt/ovirt-hosted-engine-setup/blob/master/src/plugins/gr-he-setup/engine/add_host.py#L193

So indeed on the host we see:

2017-12-19 03:49:50,345-0500 ERROR otopi.plugins.gr_he_setup.engine.add_host add_host._wait_host_ready:122 Timed out while waiting for host to start. Please check the logs.
2017-12-19 03:49:50,346-0500 ERROR otopi.plugins.gr_he_setup.engine.add_host add_host._closeup:662 Unable to add hp-z620-04.qe.lab.eng.nay.redhat.com to the manager

but then:

2017-12-19 03:53:15,394-0500 INFO otopi.plugins.gr_he_common.core.misc misc._terminate:251 Hosted Engine successfully deployed
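The _wait_host_ready monitoring referenced above boils down to a poll-until-deadline loop in which API failures (such as 'No route to host' while the engine VM is unreachable) are treated the same as "not ready yet". A rough sketch with assumed names, not the actual add_host code:

```python
import time


def wait_host_ready(get_state, timeout_s=900, poll_s=3):
    """Poll the engine for the host state until it is 'up' or a timeout hits.

    get_state is a caller-supplied callable that returns the host state
    string and may raise on API failures; failures are swallowed and the
    poll simply retries until the deadline.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if get_state() == "up":
                return True
        except Exception:
            pass  # e.g. 'No route to host' -- retry until the deadline
        time.sleep(poll_s)
    return False  # 'Timed out while waiting for host to start'
```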
Do you think this is a failure in HE setup, or the test environment/network?
Created attachment 1370037 [details] CLI_deploy+CLI_redeploy+CLI_redeploy_engine.log
Test version:
cockpit-ws-155-1.el7.x86_64
cockpit-bridge-155-1.el7.x86_64
cockpit-system-155-1.el7.noarch
cockpit-storaged-155-1.el7.noarch
cockpit-ovirt-dashboard-0.11.3-0.1.el7ev.noarch
cockpit-dashboard-155-1.el7.x86_64
cockpit-155-1.el7.x86_64
vdsm-4.20.9.3-1.el7ev.x86_64
rhvm-appliance-4.2-20171219.0.el7.noarch
ovirt-hosted-engine-ha-2.2.2-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.2-1.el7ev.noarch

Test steps:
1. Clean install the latest RHVH 4.2
2. Deploy HostedEngine via cockpit

Test result:
After step 2, HostedEngine deploys successfully.

[root@hp-dl385pg8-11 ~]# hosted-engine --vm-status

--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : hp-dl385pg8-11.lab.eng.pek2.redhat.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : c7321e13
local_conf_timestamp               : 9732
Host timestamp                     : 9731
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=9731 (Wed Dec 20 06:51:59 2017)
	host-id=1
	score=3400
	vm_conf_refresh_time=9732 (Wed Dec 20 06:52:01 2017)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUp
	stopped=False

[oVirt shell (connected)]# list hosts --show-all | grep "status-state"
WARNING: yacc table file version is out of date
external_status-state              : ok
spm-status-state                   : none
status-state                       : up

So there is no dependency on bug 1522878 and bug 1527318; removing the dependencies, and changing the bug's status to VERIFIED.

If an issue with SHE redeployment shows up, I will report a new bug to track it.
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017. Since the problem described in this bug report should be resolved in oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.