Bug 1810043 - [vdsmd] Hosted Engine deployment failed when trying to restart vdsmd (Tracker for RHVH bug 1810882)
Summary: [vdsmd] Hosted Engine deployment failed when trying to restart vdsmd (Tracker...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhhi
Version: rhhiv-1.8
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHHI-V 1.8
Assignee: Gobinda Das
QA Contact: milind
URL:
Whiteboard:
Depends On: 1810882
Blocks: RHHI-V-1.8-Engineering-Inflight-BZs
 
Reported: 2020-03-04 13:29 UTC by milind
Modified: 2020-08-04 14:51 UTC (History)
2 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1810882
Environment:
rhhiv, rhel8
Last Closed: 2020-08-04 14:51:33 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2020:3314 0 None None None 2020-08-04 14:51:55 UTC

Description milind 2020-03-04 13:29:44 UTC
Description of problem:
HE deployment is failing on RHVH 4.4.
It appears that the host is unable to start vdsmd.service.
-------------------------------------------
Version-Release number of selected component (if applicable):

vdsm-4.40.5-1.el8ev.x86_64
vdsm-client-4.40.5-1.el8ev.noarch
vdsm-api-4.40.5-1.el8ev.noarch
vdsm-hook-vhostmd-4.40.5-1.el8ev.noarch
vdsm-http-4.40.5-1.el8ev.noarch
vdsm-hook-ethtool-options-4.40.5-1.el8ev.noarch
vdsm-hook-openstacknet-4.40.5-1.el8ev.noarch
vdsm-common-4.40.5-1.el8ev.noarch
vdsm-network-4.40.5-1.el8ev.x86_64
vdsm-jsonrpc-4.40.5-1.el8ev.noarch
vdsm-hook-vmfex-dev-4.40.5-1.el8ev.noarch
vdsm-hook-fcoe-4.40.5-1.el8ev.noarch
vdsm-gluster-4.40.5-1.el8ev.x86_64
vdsm-yajsonrpc-4.40.5-1.el8ev.noarch
vdsm-python-4.40.5-1.el8ev.noarch

-----------------------------------------
How reproducible:
Always
------------------------------------------
Steps to Reproduce:
1. From Cockpit, click Hyperconverged and deploy Gluster; this completes successfully.
2. Deploy HE; this step fails (see the check sketched below).
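
A minimal check, sketched here under the assumption of shell access to the failed host, to confirm that vdsmd is the piece that does not start after step 2:

<snip>
# Check whether vdsmd and its helper service are running; when this bug is hit
# they are expected to be inactive or failed
systemctl status vdsmd supervdsmd

# Inspect the systemd journal for the dependency failure reported against vdsmd
journalctl -u vdsmd --no-pager | tail -n 50
</snip>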

--------------------------------------------------------------------------------------


Additional info:
 ERROR ] fatal: [localhost]: FAILED! => {"ansible_facts": {"ovirt_hosts": [{"address": "rhsqa-grafton1.lab.eng.blr.redhat.com", "affinity_labels": [], "auto_numa_status": "unknown", "certificate": {"organization": "lab.eng.blr.redhat.com", "subject": "O=lab.eng.blr.redhat.com,CN=rhsqa-grafton1.lab.eng.blr.redhat.com"}, "cluster": {"href": "/ovirt-engine/api/clusters/56a6240c-5e01-11ea-a454-004554194801", "id": "56a6240c-5e01-11ea-a454-004554194801"}, "comment": "", "cpu": {"speed": 0.0, "topology": {}}, "device_passthrough": {"enabled": false}, "devices": [], "external_network_provider_configurations": [], "external_status": "ok", "hardware_information": {"supported_rng_sources": []}, "hooks": [], "href": "/ovirt-engine/api/hosts/362dd036-aa2e-403b-9aee-f47ff7fa7496", "id": "362dd036-aa2e-403b-9aee-f47ff7fa7496", "katello_errata": [], "kdump_status": "unknown", "ksm": {"enabled": false}, "max_scheduling_memory": 0, "memory": 0, "name": "rhsqa-grafton1.lab.eng.blr.redhat.com", "network_attachments": [], "nics": [], "numa_nodes": [], "numa_supported": false, "os": {"custom_kernel_cmdline": ""}, "permissions": [], "port": 54321, "power_management": {"automatic_pm_enabled": true, "enabled": false, "kdump_detection": true, "pm_proxies": []}, "protocol": "stomp", "se_linux": {}, "spm": {"priority": 5, "status": "none"}, "ssh": {"fingerprint": "SHA256:afIfjlqbi4e9fzOARDkN0wfg2IVI3qI/Dejc3kTUHPo", "port": 22}, "statistics": [], "status": "install_failed", "storage_connection_extensions": [], "summary": {"total": 0}, "tags": [], "transparent_huge_pages": {"enabled": false}, "type": "rhel", "unmanaged_networks": [], "update_available": false, "vgpu_placement": "consolidated"}]}, "attempts": 120, "changed": false, "deprecations": [{"msg": "The 'ovirt_host_facts' module has been renamed to 'ovirt_host_info', and the renamed one no longer returns ansible_facts", "version": "2.13"}]}
[ INFO ] TASK [ovirt.hosted_engine_setup : Fetch logs from the engine VM]
[ INFO ] ok: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Set destination directory path]
[ INFO ] ok: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Create destination directory]
[ INFO ] changed: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : include_tasks]
[ INFO ] ok: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Find the local appliance image]
[ INFO ] ok: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Set local_vm_disk_path]
[ INFO ] ok: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Give the vm time to flush dirty buffers]
[ INFO ] ok: [localhost -> localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Copy engine logs]
[ INFO ] changed: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : include_tasks]
[ INFO ] ok: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Remove local vm dir]
[ INFO ] changed: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Remove temporary entry in /etc/hosts for the local VM]
[ INFO ] changed: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Notify the user about a failure]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}
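
The deployment above ends with the host in 'install_failed' after 120 attempts. A sketch of where the relevant logs usually live (paths are the stock oVirt/VDSM defaults and may differ on a given setup):

<snip>
# Hosted Engine deployment logs written by ovirt-hosted-engine-setup
ls -l /var/log/ovirt-hosted-engine-setup/

# VDSM logs on the host being added to the cluster
grep -i error /var/log/vdsm/vdsm.log | tail -n 50
journalctl -u vdsmd -u supervdsmd --no-pager | tail -n 100
</snip>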

Comment 3 SATHEESARAN 2020-03-06 05:57:02 UTC
We were able to debug this to some extent.

When HE deployment is initiated from the web console, the 'HostedEngineLocal' VM is created and comes up,
engine-setup completes, but when that HE host is added to the cluster, it never becomes operational and
moves into the non-operational state.

While these events happen, the engine tries to restart vdsmd, and vdsmd can no longer start because of a
failed dependency:

<snip>
[root@rhsqa-grafton1 ~]# systemctl status vdsmd
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; disabled; vendor preset: enabled)
   Active: inactive (dead)

Mar 05 09:43:14 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: Dependency failed for Virtual Desktop Server Manager.
Mar 05 09:43:14 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: vdsmd.service: Job vdsmd.service/start failed with result 'dependency'.
[root@rhsqa-grafton1 ~]# systemctl status supervdsmd
● supervdsmd.service - Auxiliary vdsm service for running helper functions as root
   Loaded: loaded (/usr/lib/systemd/system/supervdsmd.service; static; vendor preset: enabled)
   Active: inactive (dead) since Thu 2020-03-05 11:25:18 UTC; 18h ago
 Main PID: 1577 (code=exited, status=0/SUCCESS)

Mar 05 09:43:00 localhost.localdomain systemd[1]: Started Auxiliary vdsm service for running helper functions as root.
Mar 05 09:43:03 localhost.localdomain supervdsmd[1577]: failed to load module lvm: libbd_lvm.so.2: cannot open shared object file: No such file or directory
Mar 05 09:43:03 localhost.localdomain supervdsmd[1577]: failed to load module mpath: libbd_mpath.so.2: cannot open shared object file: No such file or directory
Mar 05 09:43:03 localhost.localdomain supervdsmd[1577]: failed to load module dm: libbd_dm.so.2: cannot open shared object file: No such file or directory
Mar 05 09:43:03 localhost.localdomain supervdsmd[1577]: failed to load module nvdimm: libbd_nvdimm.so.2: cannot open shared object file: No such file or directory
Mar 05 11:25:18 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: Stopping Auxiliary vdsm service for running helper functions as root...
Mar 05 11:25:18 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: Stopped Auxiliary vdsm service for running helper functions as root.
Mar 05 11:25:24 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: Dependency failed for Auxiliary vdsm service for running helper functions as root.
Mar 05 11:25:24 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: supervdsmd.service: Job supervdsmd.service/start failed with result 'dependency'.
[root@rhsqa-grafton1 ~]# systemctl status libvirtd-tls.socket
● libvirtd-tls.socket - Libvirt TLS IP socket
   Loaded: loaded (/usr/lib/systemd/system/libvirtd-tls.socket; enabled; vendor preset: disabled)
   Active: failed (Result: service-start-limit-hit) since Thu 2020-03-05 09:43:16 UTC; 20h ago
   Listen: [::]:16514 (Stream)

Mar 05 09:42:54 localhost.localdomain systemd[1]: Listening on Libvirt TLS IP socket.
Mar 05 09:43:16 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: libvirtd-tls.socket: Failed with result 'service-start-limit-hit'.
Mar 05 11:25:24 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: libvirtd-tls.socket: Socket service libvirtd.service already active, refusing.
Mar 05 11:25:24 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: Failed to listen on Libvirt TLS IP socket.
</snip>
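
A sketch of follow-up diagnostics for the failures above, using standard systemd and RPM tooling (the libblockdev package names below are an assumption about what provides the missing libbd_*.so.2 plugins):

<snip>
# List all failed units to see what broke vdsmd's dependency chain
systemctl --failed

# Show which units vdsmd.service pulls in as dependencies
systemctl list-dependencies vdsmd

# libvirtd-tls.socket hit its start rate limit; clear the failed state and retry
systemctl reset-failed libvirtd-tls.socket
systemctl restart libvirtd-tls.socket

# supervdsmd could not load the libbd_* plugins; check whether the libblockdev
# plugin packages are installed (package names assumed, not verified on this host)
rpm -q libblockdev-lvm libblockdev-mpath libblockdev-dm libblockdev-nvdimm
</snip>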

When checked with Marcin Sobczyk, he explained that this issue is the result of improper host configuration and
suggested getting help from Martin Perina.

Martin Perina is currently investigating the issue. Access to the setup has been made available to him to debug and find the root cause.

This is not only an RHHI problem, but also an HE deployment problem in RHV 4.4.

Comment 6 SATHEESARAN 2020-05-05 02:27:50 UTC
RHHI-V 1.8 deployment with 3 nodes works well with the workaround from Bug 1823423.
The particular issue described in this bug is not seen.

The builds used for the verification are:
RHVH-4.4-20200417.0-RHVH-x86_64-dvd1.iso 
rhvm-appliance-4.4-20200417.0.el8ev.x86_64.rpm 

@Milind, could you also verify this bug with a single-node RHHI-V 1.8 deployment?

Comment 7 milind 2020-05-05 11:38:54 UTC
As vdsmd is successfully restarted and HE deployment completes successfully,
marking this bug as verified.

[root@rhsqa-grafton1 vdsm]# imgbase w
You are on rhvh-4.4.0.18-0.20200417.0+1
[root@rhsqa-grafton1 vdsm]# systemctl status vdsmd
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2020-05-05 09:49:44 UTC; 1h 46min ago
 Main PID: 278336 (vdsmd)
    Tasks: 58 (limit: 1648316)
   Memory: 115.6M
   CGroup: /system.slice/vdsmd.service
           ├─278336 /usr/bin/python3 /usr/share/vdsm/vdsmd
           └─328561 /usr/libexec/ioprocess --read-pipe-fd 59 --write-pipe-fd 58 --max-threads 10 --max-queued-requests 10

May 05 09:49:45 rhsqa-grafton1.lab.eng.blr.redhat.com vdsm[278336]: WARN Not ready yet, ignoring event '|virt|VM_status|0a3a305e-0d89-4ce4-93f9-bb3441e27874' args={'0a3a305e-0d89-4ce4-93f9->
May 05 11:05:52 rhsqa-grafton1.lab.eng.blr.redhat.com vdsm[278336]: WARN Failed to retrieve Hosted Engine HA info, is Hosted Engine setup finished?
May 05 11:06:07 rhsqa-grafton1.lab.eng.blr.redhat.com vdsm[278336]: WARN Failed to retrieve Hosted Engine HA info, is Hosted Engine setup finished?
May 05 11:06:22 rhsqa-grafton1.lab.eng.blr.redhat.com vdsm[278336]: WARN Failed to retrieve Hosted Engine HA info, is Hosted Engine setup finished?
May 05 11:06:37 rhsqa-grafton1.lab.eng.blr.redhat.com vdsm[278336]: WARN Failed to retrieve Hosted Engine HA info, is Hosted Engine setup finished?
May 05 11:06:43 rhsqa-grafton1.lab.eng.blr.redhat.com vdsm[278336]: WARN unhandled write event
May 05 11:07:12 rhsqa-grafton1.lab.eng.blr.redhat.com vdsm[278336]: WARN Failed to retrieve Hosted Engine HA info: timed out
May 05 11:07:14 rhsqa-grafton1.lab.eng.blr.redhat.com vdsm[278336]: WARN Failed to retrieve Hosted Engine HA info: timed out
May 05 11:07:14 rhsqa-grafton1.lab.eng.blr.redhat.com vdsm[278336]: WARN Failed to retrieve Hosted Engine HA info: timed out
May 05 11:08:24 rhsqa-grafton1.lab.eng.blr.redhat.com vdsm[278336]: WARN Attempting to add an existing net user: ovirtmgmt/6d150246-22bc-48ac-8a6b-98abea28d4d3

Comment 10 errata-xmlrpc 2020-08-04 14:51:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHHI for Virtualization 1.8 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3314

