Bug 1817402 - Host up timeout during deploying hosted engine via cockpit.
Summary: Host up timeout during deploying hosted engine via cockpit.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-ansible-collection
Classification: oVirt
Component: hosted-engine-setup
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.4.0
Target Release: 1.1.2
Assignee: Yedidyah Bar David
QA Contact: Wei Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-26 09:56 UTC by Wei Wang
Modified: 2020-05-20 20:01 UTC
CC: 15 users

Fixed In Version: ovirt-ansible-hosted-engine-setup-1.1.2
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-20 20:01:02 UTC
oVirt Team: Integration
Embargoed:
sbonazzo: ovirt-4.4?
sbonazzo: planning_ack?
sbonazzo: devel_ack+
weiwang: testing_ack+


Attachments
var log files (2.49 MB, application/gzip), 2020-03-26 09:59 UTC, Wei Wang
picture (103.11 KB, image/png), 2020-03-26 10:00 UTC, Wei Wang


Links
Github oVirt ovirt-ansible-hosted-engine-setup pull 310 (closed): "DEBUG: Update 03_engine_initial_tasks.yml", last updated 2020-04-16 05:50:33 UTC

Description Wei Wang 2020-03-26 09:56:35 UTC
Description of problem:
This bug was detected with
    RHVH-UNSIGNED-ISO-4.4-RHEL-8-20200318.0-RHVH-x86_64-dvd1 
    rhvm-appliance-4.4-20200123.0.el8ev.x86_64

https://bugzilla.redhat.com/show_bug.cgi?id=1814940#c0

With the latest 4.4 build:
    RHVH-4.4-20200325.0-RHVH-x86_64-dvd1.iso
    rhvm-appliance-4.4-20200325.0.el8ev.x86_64
the same issue, "Host up timeout during deploying hosted engine via cockpit", is detected. The deployment waits for the host to come up for 10 minutes, then fails.

[ INFO ] TASK [ovirt.hosted_engine_setup : Wait for the host to be up]
[ ERROR ] fatal: [localhost]: FAILED! => {"ansible_facts": {"ovirt_hosts": [{"address": "hp-dl388g9-04.lab.eng.pek2.redhat.com", "affinity_labels": [], "auto_numa_status": "unknown", "certificate": {"organization": "lab.eng.pek2.redhat.com", "subject": "O=lab.eng.pek2.redhat.com,CN=hp-dl388g9-04.lab.eng.pek2.redhat.com"}, "cluster": {"href": "/ovirt-engine/api/clusters/0dbc162c-6f43-11ea-93bd-5254005d2164", "id": "0dbc162c-6f43-11ea-93bd-5254005d2164"}, "comment": "", "cpu": {"speed": 0.0, "topology": {}}, "device_passthrough": {"enabled": false}, "devices": [], "external_network_provider_configurations": [], "external_status": "ok", "hardware_information": {"supported_rng_sources": []}, "hooks": [], "href": "/ovirt-engine/api/hosts/94bc9af5-8c39-47d3-bded-a3775cdb01b2", "id": "94bc9af5-8c39-47d3-bded-a3775cdb01b2", "katello_errata": [], "kdump_status": "unknown", "ksm": {"enabled": false}, "max_scheduling_memory": 0, "memory": 0, "name": "hp-dl388g9-04.lab.eng.pek2.redhat.com", "network_attachments": [], "nics": [], "numa_nodes": [], "numa_supported": false, "os": {"custom_kernel_cmdline": ""}, "permissions": [], "port": 54321, "power_management": {"automatic_pm_enabled": true, "enabled": false, "kdump_detection": true, "pm_proxies": []}, "protocol": "stomp", "se_linux": {}, "spm": {"priority": 5, "status": "none"}, "ssh": {"fingerprint": "SHA256:8sEFgGYDwAmrZA0xt+r8MeE1ltWapw42HvRF811+ZLo", "port": 22}, "statistics": [], "status": "install_failed", "storage_connection_extensions": [], "summary": {"total": 0}, "tags": [], "transparent_huge_pages": {"enabled": false}, "type": "rhel", "unmanaged_networks": [], "update_available": false, "vgpu_placement": "consolidated"}]}, "attempts": 120, "changed": false, "deprecations": [{"msg": "The 'ovirt_host_facts' module has been renamed to 'ovirt_host_info', and the renamed one no longer returns ansible_facts", "version": "2.13"}]}
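
For context on the failure above: the "Wait for the host to be up" task is a polling loop that repeatedly asks the engine for the host object and gives up after a fixed number of attempts (the "attempts": 120 field in the error). Below is a minimal sketch of that pattern, using the ovirt_host_info module that the deprecation warning points to; the pattern string, the host-name variable, the 5-second delay, and the exact condition are illustrative assumptions, not the role's actual code:

    # Illustrative sketch only, not the role's actual task code.
    # Poll the engine for the host until it reports status "up"; a host stuck in
    # "install_failed" (as in the error above) never satisfies the condition,
    # so the loop exhausts its retries and the task fails.
    - name: Wait for the host to be up
      ovirt_host_info:
        pattern: "name={{ he_host_name }}"   # hypothetical variable for the host name
        auth: "{{ ovirt_auth }}"
      register: host_result
      until: >-
        host_result.ovirt_hosts | length >= 1 and
        host_result.ovirt_hosts[0].status == 'up'
      retries: 120   # matches the "attempts": 120 field in the error above
      delay: 5       # assumed value; 120 attempts x 5 s gives roughly the 10-minute window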

Version-Release number of selected component (if applicable):
RHVH-4.4-20200325.0-RHVH-x86_64-dvd1.iso
cockpit-system-211.3-1.el8.noarch
cockpit-ws-211.3-1.el8.x86_64
cockpit-ovirt-dashboard-0.14.3-1.el8ev.noarch
cockpit-211.3-1.el8.x86_64
cockpit-bridge-211.3-1.el8.x86_64
cockpit-dashboard-211.3-1.el8.noarch
cockpit-storaged-211.3-1.el8.noarch
ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.3-2.el8ev.noarch
rhvm-appliance-4.4-20200325.0.el8ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Deploy hosted engine via cockpit.

Actual results:
The wait for the host to be up times out while deploying the hosted engine via cockpit, and the hosted engine deployment fails.

Expected results:
The host comes up in time and the hosted engine deployment succeeds.

Additional info:
Refer to the analysis in https://bugzilla.redhat.com/show_bug.cgi?id=1814940#c2

Comment 1 Wei Wang 2020-03-26 09:59:19 UTC
Created attachment 1673739 [details]
var log files

Comment 2 Wei Wang 2020-03-26 10:00:25 UTC
Created attachment 1673740 [details]
picture

Comment 3 Martin Perina 2020-03-26 13:21:21 UTC
Isn't it a duplicate of BZ1814940?

Comment 4 Wei Wang 2020-03-26 15:04:15 UTC
(In reply to Martin Perina from comment #3)
> Isn't it a duplicate of BZ1814940?

Yes. Since BZ1814940 now tracks another bug (reported in its comment #3), the host up timeout bug is reported here as a new bug. BZ1814940 is now only for its comment #3. Please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1814940#c11

Comment 5 Yedidyah Bar David 2020-03-29 05:57:09 UTC
The current bug is only on the hosted-engine side, and is only about making it wait longer for the host to become up.

Comment 6 Lukas Svaty 2020-03-30 08:21:02 UTC
@Didi increasing the timeout does not seem like the right solution.

The problem was that the rdma service was not enabled, which increased boot time by a lot.
We already have a workaround, and are waiting for the gluster/RHEL fix.

IMHO this timeout increase should not be accepted, WDYT?
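
The comment above does not spell out the workaround. Purely as an illustration, assuming the mitigation is simply to enable the rdma service before deployment (the unit name is taken from the comment, and this is not confirmed to be the actual WA), an Ansible sketch could look like:

    # Hypothetical illustration only; not the actual workaround referenced above.
    - name: Ensure the rdma service is enabled and started
      systemd:
        name: rdma
        enabled: true
        state: started
      become: true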

Comment 7 Yedidyah Bar David 2020-03-30 09:19:48 UTC
(In reply to Lukas Svaty from comment #6)
> @Didi increasing the timeout does not seem like the right solution.
> 
> The problem was that the rdma service was not enabled, which increased
> boot time by a lot.
> We already have a workaround, and are waiting for the gluster/RHEL fix.

Not sure what you mean. We already saw several ansible-host-deploy logs
that took, from first to last line (all ansible code, no reboots or anything),
more than 10 minutes.

> 
> IMHO this timeout increase should not be accepted, WDYT?

If you mean to say that 10 minutes should be enough and we should make our ansible
code not take more than 10 minutes, then I agree with you, and mperina tells me we
are working on it. The current bug is a workaround, yes, for the time being (and I
have no problem keeping it also later, for slow setups or whatever).

Comment 8 Michal Skrivanek 2020-03-30 12:45:40 UTC
(In reply to Yedidyah Bar David from comment #7)

> I have no problem keeping it also later, for slow setups or whatever).

TBH I would go even higher. While the RHV host should generally be up to date, you can easily be installing an outdated version and then have plenty of packages to update, slow machines, etc. I would personally use 30 minutes.
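
To put numbers on the suggestion: with the 5-second poll interval assumed in the sketch under the description above, the observed 120 retries give a 10-minute window, and a 30-minute window would need 360 retries. Below is a hedged sketch of deriving the retry count from a configurable timeout; the variable name he_host_up_timeout and the delay are hypothetical, not the role's actual parameters:

    # Hypothetical sketch: derive the retry count from a configurable timeout
    # (in seconds) instead of hardcoding 120 attempts.
    #   600 s / 5 s = 120 retries  (roughly the current 10-minute window)
    #  1800 s / 5 s = 360 retries  (the 30-minute window suggested here)
    - name: Wait for the host to be up
      ovirt_host_info:
        pattern: "name={{ he_host_name }}"
        auth: "{{ ovirt_auth }}"
      register: host_result
      until: >-
        host_result.ovirt_hosts | length >= 1 and
        host_result.ovirt_hosts[0].status == 'up'
      retries: "{{ (he_host_up_timeout | int / 5) | round(0, 'ceil') | int }}"
      delay: 5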

Comment 9 Wei Wang 2020-04-07 10:36:04 UTC
Tested with rhvh-4.4.0.16-0.20200401.0 and rhvm-appliance-4.4-20200403.0.el8ev.x86_64; the hosted engine deployment succeeded, so the bug is fixed.

QE will move the status to "VERIFIED" once dev moves the status to "ON_QA".

Comment 10 Sandro Bonazzola 2020-05-20 20:01:02 UTC
This bug is included in the oVirt 4.4.0 release, published on May 20th 2020.

Since the problem described in this bug report should be
resolved in the oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

