Description of problem:

This bug was first detected with:
RHVH-UNSIGNED-ISO-4.4-RHEL-8-20200318.0-RHVH-x86_64-dvd1
rhvm-appliance-4.4-20200123.0.el8ev.x86_64
https://bugzilla.redhat.com/show_bug.cgi?id=1814940#c0

With the latest 4.4 build:
RHVH-4.4-20200325.0-RHVH-x86_64-dvd1.iso
rhvm-appliance-4.4-20200325.0.el8ev.x86_64
the same issue is detected: "Host up timeout during deploying hosted engine via cockpit". The host is still coming up after 10 minutes, then the deployment fails.

[ INFO ] TASK [ovirt.hosted_engine_setup : Wait for the host to be up]
[ ERROR ] fatal: [localhost]: FAILED! => {"ansible_facts": {"ovirt_hosts": [{"address": "hp-dl388g9-04.lab.eng.pek2.redhat.com", "affinity_labels": [], "auto_numa_status": "unknown", "certificate": {"organization": "lab.eng.pek2.redhat.com", "subject": "O=lab.eng.pek2.redhat.com,CN=hp-dl388g9-04.lab.eng.pek2.redhat.com"}, "cluster": {"href": "/ovirt-engine/api/clusters/0dbc162c-6f43-11ea-93bd-5254005d2164", "id": "0dbc162c-6f43-11ea-93bd-5254005d2164"}, "comment": "", "cpu": {"speed": 0.0, "topology": {}}, "device_passthrough": {"enabled": false}, "devices": [], "external_network_provider_configurations": [], "external_status": "ok", "hardware_information": {"supported_rng_sources": []}, "hooks": [], "href": "/ovirt-engine/api/hosts/94bc9af5-8c39-47d3-bded-a3775cdb01b2", "id": "94bc9af5-8c39-47d3-bded-a3775cdb01b2", "katello_errata": [], "kdump_status": "unknown", "ksm": {"enabled": false}, "max_scheduling_memory": 0, "memory": 0, "name": "hp-dl388g9-04.lab.eng.pek2.redhat.com", "network_attachments": [], "nics": [], "numa_nodes": [], "numa_supported": false, "os": {"custom_kernel_cmdline": ""}, "permissions": [], "port": 54321, "power_management": {"automatic_pm_enabled": true, "enabled": false, "kdump_detection": true, "pm_proxies": []}, "protocol": "stomp", "se_linux": {}, "spm": {"priority": 5, "status": "none"}, "ssh": {"fingerprint": "SHA256:8sEFgGYDwAmrZA0xt+r8MeE1ltWapw42HvRF811+ZLo", "port": 22}, "statistics": [], "status": "install_failed", "storage_connection_extensions": [], "summary": {"total": 0}, "tags": [], "transparent_huge_pages": {"enabled": false}, "type": "rhel", "unmanaged_networks": [], "update_available": false, "vgpu_placement": "consolidated"}]}, "attempts": 120, "changed": false, "deprecations": [{"msg": "The 'ovirt_host_facts' module has been renamed to 'ovirt_host_info', and the renamed one no longer returns ansible_facts", "version": "2.13"}]}

Version-Release number of selected component (if applicable):
RHVH-4.4-20200325.0-RHVH-x86_64-dvd1.iso
cockpit-system-211.3-1.el8.noarch
cockpit-ws-211.3-1.el8.x86_64
cockpit-ovirt-dashboard-0.14.3-1.el8ev.noarch
cockpit-211.3-1.el8.x86_64
cockpit-bridge-211.3-1.el8.x86_64
cockpit-dashboard-211.3-1.el8.noarch
cockpit-storaged-211.3-1.el8.noarch
ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.3-2.el8ev.noarch
rhvm-appliance-4.4-20200325.0.el8ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Deploy hosted engine via cockpit.

Actual results:
Host up timeout while deploying the hosted engine via cockpit; the hosted-engine deployment then fails.

Expected results:
The host comes up in time and the hosted-engine deployment succeeds.

Additional info:
Refer to the analysis at https://bugzilla.redhat.com/show_bug.cgi?id=1814940#c2
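The failing task gives up after exhausting a fixed retry budget ("attempts": 120 in the error above). A minimal sketch of that wait-for-up pattern, in Python rather than the actual `ovirt.hosted_engine_setup` Ansible role (function and parameter names here are hypothetical, and the real defaults for attempts/delay may differ):

```python
import time

def wait_for_host_up(get_status, attempts=120, delay=5):
    """Poll get_status() until it returns 'up', mirroring the
    retry/delay pattern of the 'Wait for the host to be up' task:
    a fixed number of attempts with a fixed delay between them.
    Returns the number of polls it took, or raises on timeout."""
    for attempt in range(1, attempts + 1):
        if get_status() == "up":
            return attempt
        if attempt < attempts:
            time.sleep(delay)
    raise TimeoutError(f"host not 'up' after {attempts} attempts")
```

With these (assumed) defaults, 120 attempts spaced 5 seconds apart give roughly the 10-minute window observed before the failure; the fix under discussion is simply enlarging that budget.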
Created attachment 1673739 [details] var log files
Created attachment 1673740 [details] picture
Isn't it a duplicate of BZ1814940?
(In reply to Martin Perina from comment #3)
> Isn't it a duplicate of BZ1814940?

Yes, but since BZ1814940 now tracks another bug (the one in its comment #3), the host-up-timeout bug is reported here as a new bug. BZ1814940 is now only for its comment #3. Please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1814940#c11
The current bug is only on the hosted-engine side, and is only about making it wait longer for the host to become up.
@Didi, increasing the timeout does not seem like the right solution.

The problem was that the rdma service was not enabled, which expanded the boot time by a lot. We already have a workaround, and are waiting for the gluster/RHEL fix. IMHO this timeout change should not be accepted, WDYT?
(In reply to Lukas Svaty from comment #6)
> @Didi increasing timeout does not seem like the right solution.
>
> The problem was that rdma service was not enabled, and boot time was
> expanded by a lot.
> We already have WA, and waiting for gluster/rhel fixed.

Not sure what you mean. We already saw several ansible-host-deploy logs that took more than 10 minutes from first line to last (all ansible code, no reboots or anything).

> IMHO this timeout should not be accepted, WDYT?

If you mean that 10 minutes should be enough, and that we should make our ansible code not take more than 10 minutes, then I agree with you, and mperina tells me we are working on it. The current bug is a workaround, yes, for the time being (and I have no problem keeping it later as well, for slow setups or whatever).
(In reply to Yedidyah Bar David from comment #7)
> I have no problem keeping it also later, for slow setups or whatever).

TBH I would go even higher. While the RHV host should generally be up to date, you can easily be installing an outdated version and then have plenty of packages to update, slow machines, etc. I would personally use 30 minutes.
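For a sense of scale, the total wait window is simply attempts × delay. The per-attempt delay below is an assumption (the role's actual setting may differ); a 5-second delay would match the observed 10-minute window given the 120 attempts reported in the error:

```python
def wait_window_minutes(attempts, delay_seconds):
    """Total time the retry loop can wait before giving up."""
    return attempts * delay_seconds / 60

# 120 attempts with an assumed 5-second delay: 10 minutes (the observed window)
ten = wait_window_minutes(120, 5)
# the 30-minute window suggested above would need 360 such attempts
thirty = wait_window_minutes(360, 5)
```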
Tested with rhvh-4.4.0.16-0.20200401.0 and rhvm-appliance-4.4-20200403.0.el8ev.x86_64: the hosted-engine deployment succeeded, so the bug is fixed. QE will move the status to "VERIFIED" once dev moves the status to "ON_QA".
This bugzilla is included in the oVirt 4.4.0 release, published on May 20th 2020. Since the problem described in this bug report should be resolved in the oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.