Bug 1817402 - Host up timeout during deploying hosted engine via cockpit.
Summary: Host up timeout during deploying hosted engine via cockpit.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-ansible-collection
Classification: oVirt
Component: hosted-engine-setup
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.4.0
Target Release: 1.1.2
Assignee: Yedidyah Bar David
QA Contact: Wei Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-26 09:56 UTC by Wei Wang
Modified: 2020-05-20 20:01 UTC
CC: 15 users

Fixed In Version: ovirt-ansible-hosted-engine-setup-1.1.2
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-20 20:01:02 UTC
oVirt Team: Integration
Embargoed:
sbonazzo: ovirt-4.4?
sbonazzo: planning_ack?
sbonazzo: devel_ack+
weiwang: testing_ack+


Attachments
var log files (2.49 MB, application/gzip), 2020-03-26 09:59 UTC, Wei Wang
picture (103.11 KB, image/png), 2020-03-26 10:00 UTC, Wei Wang


Links
Github oVirt ovirt-ansible-hosted-engine-setup pull 310 (closed): "DEBUG: Update 03_engine_initial_tasks.yml", last updated 2020-04-16 05:50:33 UTC

Description Wei Wang 2020-03-26 09:56:35 UTC
Description of problem:
This bug was detected with
    RHVH-UNSIGNED-ISO-4.4-RHEL-8-20200318.0-RHVH-x86_64-dvd1 
    rhvm-appliance-4.4-20200123.0.el8ev.x86_64

https://bugzilla.redhat.com/show_bug.cgi?id=1814940#c0

With the latest 4.4 build:
    RHVH-4.4-20200325.0-RHVH-x86_64-dvd1.iso
    rhvm-appliance-4.4-20200325.0.el8ev.x86_64
the same issue, "Host up timeout during deploying hosted engine via cockpit", is detected. The deployment waits for the host to come up for 10 minutes, then fails.

[ INFO ] TASK [ovirt.hosted_engine_setup : Wait for the host to be up]
[ ERROR ] fatal: [localhost]: FAILED! => {"ansible_facts": {"ovirt_hosts": [{"address": "hp-dl388g9-04.lab.eng.pek2.redhat.com", "affinity_labels": [], "auto_numa_status": "unknown", "certificate": {"organization": "lab.eng.pek2.redhat.com", "subject": "O=lab.eng.pek2.redhat.com,CN=hp-dl388g9-04.lab.eng.pek2.redhat.com"}, "cluster": {"href": "/ovirt-engine/api/clusters/0dbc162c-6f43-11ea-93bd-5254005d2164", "id": "0dbc162c-6f43-11ea-93bd-5254005d2164"}, "comment": "", "cpu": {"speed": 0.0, "topology": {}}, "device_passthrough": {"enabled": false}, "devices": [], "external_network_provider_configurations": [], "external_status": "ok", "hardware_information": {"supported_rng_sources": []}, "hooks": [], "href": "/ovirt-engine/api/hosts/94bc9af5-8c39-47d3-bded-a3775cdb01b2", "id": "94bc9af5-8c39-47d3-bded-a3775cdb01b2", "katello_errata": [], "kdump_status": "unknown", "ksm": {"enabled": false}, "max_scheduling_memory": 0, "memory": 0, "name": "hp-dl388g9-04.lab.eng.pek2.redhat.com", "network_attachments": [], "nics": [], "numa_nodes": [], "numa_supported": false, "os": {"custom_kernel_cmdline": ""}, "permissions": [], "port": 54321, "power_management": {"automatic_pm_enabled": true, "enabled": false, "kdump_detection": true, "pm_proxies": []}, "protocol": "stomp", "se_linux": {}, "spm": {"priority": 5, "status": "none"}, "ssh": {"fingerprint": "SHA256:8sEFgGYDwAmrZA0xt+r8MeE1ltWapw42HvRF811+ZLo", "port": 22}, "statistics": [], "status": "install_failed", "storage_connection_extensions": [], "summary": {"total": 0}, "tags": [], "transparent_huge_pages": {"enabled": false}, "type": "rhel", "unmanaged_networks": [], "update_available": false, "vgpu_placement": "consolidated"}]}, "attempts": 120, "changed": false, "deprecations": [{"msg": "The 'ovirt_host_facts' module has been renamed to 'ovirt_host_info', and the renamed one no longer returns ansible_facts", "version": "2.13"}]}
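
For context on the failure above: the "Wait for the host to be up" task is a polling loop that repeatedly asks the engine for the host object and gives up after a fixed number of attempts (the "attempts": 120 field in the error). Below is a minimal sketch of that pattern, using the ovirt_host_info module that the deprecation warning points to; the pattern string, the host-name variable, the 5-second delay, and the exact condition are illustrative assumptions, not the role's actual code:

    # Illustrative sketch only, not the role's actual task code.
    # Poll the engine for the host until it reports status "up"; a host stuck in
    # "install_failed" (as in the error above) never satisfies the condition,
    # so the loop exhausts its retries and the task fails.
    - name: Wait for the host to be up
      ovirt_host_info:
        pattern: "name={{ he_host_name }}"   # hypothetical variable for the host name
        auth: "{{ ovirt_auth }}"
      register: host_result
      until: >-
        host_result.ovirt_hosts | length >= 1 and
        host_result.ovirt_hosts[0].status == 'up'
      retries: 120   # matches the "attempts": 120 field in the error above
      delay: 5       # assumed value; 120 attempts x 5 s gives roughly the 10-minute window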

Version-Release number of selected component (if applicable):
RHVH-4.4-20200325.0-RHVH-x86_64-dvd1.iso
cockpit-system-211.3-1.el8.noarch
cockpit-ws-211.3-1.el8.x86_64
cockpit-ovirt-dashboard-0.14.3-1.el8ev.noarch
cockpit-211.3-1.el8.x86_64
cockpit-bridge-211.3-1.el8.x86_64
cockpit-dashboard-211.3-1.el8.noarch
cockpit-storaged-211.3-1.el8.noarch
ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.3-2.el8ev.noarch
rhvm-appliance-4.4-20200325.0.el8ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Deploy hosted engine via cockpit.

Actual results:
The wait for the host to be up times out while deploying the hosted engine via cockpit, and the hosted engine deployment fails.

Expected results:
The host comes up in time and the hosted engine deployment succeeds.

Additional info:
Refer to the analysis in https://bugzilla.redhat.com/show_bug.cgi?id=1814940#c2

Comment 1 Wei Wang 2020-03-26 09:59:19 UTC
Created attachment 1673739 [details]
var log files

Comment 2 Wei Wang 2020-03-26 10:00:25 UTC
Created attachment 1673740 [details]
picture

Comment 3 Martin Perina 2020-03-26 13:21:21 UTC
Isn't it a duplicate of BZ1814940?

Comment 4 Wei Wang 2020-03-26 15:04:15 UTC
(In reply to Martin Perina from comment #3)
> Isn't it a duplicate of BZ1814940?

Yes. Since BZ1814940 now tracks another bug (reported in its comment #3), the host up timeout bug is reported here as a new bug. BZ1814940 is now only for its comment #3. Please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1814940#c11

Comment 5 Yedidyah Bar David 2020-03-29 05:57:09 UTC
The current bug is only on the hosted-engine side, and is only about making it wait longer for the host to become up.

Comment 6 Lukas Svaty 2020-03-30 08:21:02 UTC
@Didi increasing the timeout does not seem like the right solution.

The problem was that the rdma service was not enabled, which increased boot time by a lot.
We already have a workaround, and are waiting for the gluster/RHEL fix.

IMHO this timeout increase should not be accepted, WDYT?
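
The comment above does not spell out the workaround. Purely as an illustration, assuming the mitigation is simply to enable the rdma service before deployment (the unit name is taken from the comment, and this is not confirmed to be the actual WA), an Ansible sketch could look like:

    # Hypothetical illustration only; not the actual workaround referenced above.
    - name: Ensure the rdma service is enabled and started
      systemd:
        name: rdma
        enabled: true
        state: started
      become: true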

Comment 7 Yedidyah Bar David 2020-03-30 09:19:48 UTC
(In reply to Lukas Svaty from comment #6)
> @Didi increasing the timeout does not seem like the right solution.
> 
> The problem was that the rdma service was not enabled, which increased
> boot time by a lot.
> We already have a workaround, and are waiting for the gluster/RHEL fix.

Not sure what you mean. We already saw several ansible-host-deploy logs
that took, from first to last line (all ansible code, no reboots or anything),
more than 10 minutes.

> 
> IMHO this timeout increase should not be accepted, WDYT?

If you mean to say that 10 minutes should be enough and we should make our ansible
code not take more than 10 minutes, then I agree with you, and mperina tells me we
are working on it. The current bug is a workaround, yes, for the time being (and I
have no problem keeping it also later, for slow setups or whatever).

Comment 8 Michal Skrivanek 2020-03-30 12:45:40 UTC
(In reply to Yedidyah Bar David from comment #7)

> I have no problem keeping it also later, for slow setups or whatever).

TBH I would go even higher. While the RHV host should generally be up to date, you can easily be installing an outdated version and then have plenty of packages to update, slow machines, etc. I would personally use 30 minutes.
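
To put numbers on the suggestion: with the 5-second poll interval assumed in the sketch under the description above, the observed 120 retries give a 10-minute window, and a 30-minute window would need 360 retries. Below is a hedged sketch of deriving the retry count from a configurable timeout; the variable name he_host_up_timeout and the delay are hypothetical, not the role's actual parameters:

    # Hypothetical sketch: derive the retry count from a configurable timeout
    # (in seconds) instead of hardcoding 120 attempts.
    #   600 s / 5 s = 120 retries  (roughly the current 10-minute window)
    #  1800 s / 5 s = 360 retries  (the 30-minute window suggested here)
    - name: Wait for the host to be up
      ovirt_host_info:
        pattern: "name={{ he_host_name }}"
        auth: "{{ ovirt_auth }}"
      register: host_result
      until: >-
        host_result.ovirt_hosts | length >= 1 and
        host_result.ovirt_hosts[0].status == 'up'
      retries: "{{ (he_host_up_timeout | int / 5) | round(0, 'ceil') | int }}"
      delay: 5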

Comment 9 Wei Wang 2020-04-07 10:36:04 UTC
Tested with rhvh-4.4.0.16-0.20200401.0 and rhvm-appliance-4.4-20200403.0.el8ev.x86_64; the hosted engine deployment succeeded, so the bug is fixed.

QE will move the status to "VERIFIED" once dev moves the status to "ON_QA".

Comment 10 Sandro Bonazzola 2020-05-20 20:01:02 UTC
This bug is included in the oVirt 4.4.0 release, published on May 20th 2020.

Since the problem described in this bug report should be
resolved in the oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

