Bug 1589544 - [HE] host-deploy fails to start vdsmd on node/rhel-h
Summary: [HE] host-deploy fails to start vdsmd on node/rhel-h
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Services
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.2.4
Assignee: Piotr Kliczewski
QA Contact: Pavel Stehlik
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-10 15:46 UTC by Kobi Hakimi
Modified: 2018-06-26 08:41 UTC
CC List: 17 users

Fixed In Version: imgbased-1.0.19
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-26 08:41:25 UTC
oVirt Team: Infra
rule-engine: ovirt-4.2+
ykaul: blocker+


Attachments
zip with host log files (1.42 MB, application/x-gzip)
2018-06-10 15:46 UTC, Kobi Hakimi


Links
System ID Status Summary Last Updated
oVirt gerrit 92151 MERGED Only configure vdsm once at startup 2020-08-25 08:40:19 UTC
oVirt gerrit 92211 MERGED Only configure vdsm once at startup 2020-08-25 08:40:18 UTC

Description Kobi Hakimi 2018-06-10 15:46:58 UTC
Created attachment 1449699 [details]
zip with host log files

Description of problem:
[HE] Failed to deploy hosted engine on RHVH with error: local_vm_disk_path is undefined

Version-Release number of selected component (if applicable):
rhvh-4.2.4.2-0.20180608.0
ovirt-hosted-engine-setup-2.2.22-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.13-1.el7ev.noarch


How reproducible:
100%

Steps to Reproduce:
1. Deploy the hosted engine on RHVH with the automation deploy job.

Actual results:
Failed to deploy the hosted engine on the new RHVH with the following error:

ovirt-hosted-engine-setup-ansible-final_clean-20180610172659-9lwtac.log
2018-06-10 17:27:25,095+0300 DEBUG ansible on_any args TASK: Copy engine logs kwargs is_conditional:False 
2018-06-10 17:27:25,679+0300 ERROR ansible failed {'status': 'FAILED', 'ansible_type': 'task', 'ansible_task': u'Copy engine logs', 'ansible_result': u'type: <type \'dict\'>\nstr: {\'msg\': u"The task includes an option with an undefined variable. The error was: \'local_vm_disk_path\' is undefined\\n\\nThe error appears to have been in \'/usr/share/ovirt-hosted-engine-setup/ansible/fetch_engine_logs.yml\': line 16, column 3, but may\\nbe elsewhere in the file depending on the exact sy\nrepr: {\'msg\': u"The task includes an option with an undefined variable. The error was: \'local_vm_disk_path\' is undefined\\n\\nThe error appears to have been in \'/usr/share/ovirt-hosted-engine-setup/ansible/fetch_engine_logs.yml\': line 16, column 3, but may\\nbe elsewhere in the file depending on the exact sy\ndir: [\'__class__\', \'__cmp__\', \'__contains__\', \'__delattr__\', \'__delitem__\', \'__doc__\', \'__eq__\', \'__format__\', \'__ge__\', \'__getattribute__\', \'__getitem__\', \'__gt__\', \'__hash__\', \'__init__\', \'__iter__\', \'__le__\', \'__len__\', \'__lt__\', \'__ne__\', \'__new__\', \'__reduce__\', \'__reduce_ex__\', \'__repr__\', \'__setattr__\', \'__setitem__\', \'__sizeof__\', \'__str__\', \'__subclasshook__\', \'clear\', \'copy\', \'fromkeys\', \'get\', \'has_key\', \'items\', \'iteritems\', \'iterkeys\', \'itervalues\', \'keys\', \'pop\', \'popitem\', \'setdefault\', \'update\', \'values\', \'viewitems\', \'viewkeys\', \'viewvalues\']\npprint: {\'msg\': u"The task includes an option with an undefined variable. The error was: \'local_vm_disk_path\' is undefined\\n\\nThe error appears to have been in \'/usr/share/ovirt-hosted-engine-setup/ansible/fetch_engine_logs.yml\': line 16, column 3, but may\\nbe elsewhere in the file depending on the exact sy\n{\'msg\': u"The task includes an option with an undefined variable. The error was: \'local_vm_disk_path.__doc__: "dict() -> new empty dictionary\\ndict(mapping) -> new dictionary initialized from a mapping object\'s\\n    (key, value) pairs\\ndict(iterable) -> new dictionary initialized as if via:\\n    d = {}\\n    for k, v in iterable:\\n        d[k] = v\\ndict(**kwargs) -> new dictionary initialized with the name=value pairs\\n    in the keyword argument list.  For example:  dict(one=1, two=2)"\n{\'msg\': u"The task includes an option with an undefined variable. The error was: \'local_vm_disk_path.__hash__: None', 'ansible_host': u'localhost', 'ansible_playbook': u'/usr/share/ovirt-hosted-engine-setup/ansible/final_clean.yml'}


Expected results:
The deployment completes successfully.

Additional info:
Maybe it's related to the following patches:
https://gerrit.ovirt.org/#/c/90404/
https://gerrit.ovirt.org/#/c/90405/


The environment is still up and running:
host: lynx01.lab.eng.tlv2.redhat.com

For more info, you can also see the failed job:
https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/rhv-ge-deploy-4.2/1566/consoleFull
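
Note: the 'local_vm_disk_path is undefined' error comes from the final cleanup play, so it is probably just a late symptom; the first real failure should show up earlier in the setup log. A minimal way to pull it out (a sketch, assuming the default log location under /var/log/ovirt-hosted-engine-setup/):

# Find the first ERROR in the most recent hosted-engine setup log
latest_log=$(ls -t /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-*.log | head -n1)
grep -n -m1 'ERROR' "$latest_log"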

Comment 1 Simone Tiraboschi 2018-06-11 09:01:14 UTC
The real issue happened earlier than that, and it's here:

[ INFO  ] TASK [Wait for the host to be up]
[ ERROR ] fatal: [localhost]: FAILED! => {\"ansible_facts\": {\"ovirt_hosts\": [{\"address\": \"lynx01.lab.eng.tlv2.redhat.com\", \"affinity_labels\": [], \"auto_numa_status\": \"unknown\", \"certificate\": {\"organization\": \"localdomain\", \"subject\": \"O=localdomain,CN=lynx01.lab.eng.tlv2.redhat.com\"}, \"cluster\": {\"href\": \"/ovirt-engine/api/clusters/2471d47c-6cb8-11e8-9660-00163e7be006\", \"id\": \"2471d47c-6cb8-11e8-9660-00163e7be006\"}, \"comment\": \"\", \"cpu\": {\"speed\": 0.0, \"topology\": {}}, \"device_passthrough\": {\"enabled\": false}, \"devices\": [], \"external_network_provider_configurations\": [], \"external_status\": \"ok\", \"hardware_information\": {\"supported_rng_sources\": []}, \"hooks\": [], \"href\": \"/ovirt-engine/api/hosts/1c641819-6f86-4a0a-a6c6-5266d4daa917\", \"id\": \"1c641819-6f86-4a0a-a6c6-5266d4daa917\", \"katello_errata\": [], \"kdump_status\": \"unknown\", \"ksm\": {\"enabled\": false}, \"max_scheduling_memory\": 0, \"memory\": 0, \"name\": \"host_mixed_1\", \"network_attachments\": [], \"nics\": [], \"numa_nodes\": [], \"numa_supported\": false, \"os\": {\"custom_kernel_cmdline\": \"\"}, \"permissions\": [], \"port\": 54321, \"power_management\": {\"automatic_pm_enabled\": true, \"enabled\": false, \"kdump_detection\": true, \"pm_proxies\": []}, \"protocol\": \"stomp\", \"se_linux\": {}, \"spm\": {\"priority\": 5, \"status\": \"none\"}, \"ssh\": {\"fingerprint\": \"SHA256:WaskLez/Yfv0OLnR+Lz1rxwzeajKUCmLbQD5Vy1JgZM\", \"port\": 22}, \"statistics\": [], \"status\": \"install_failed\", \"storage_connection_extensions\": [], \"summary\": {\"total\": 0}, \"tags\": [], \"transparent_huge_pages\": {\"enabled\": false}, \"type\": \"ovirt_node\", \"unmanaged_networks\": [], \"update_available\": false}]}, \"attempts\": 120, \"changed\": false}

due to "status": "install_failed"

We have to double-check the host-deploy logs to see why it failed.
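
(A quick way to locate them, just a sketch assuming the default layout: host-deploy logs are written on the engine side under /var/log/ovirt-engine/host-deploy/, and are also included in the attached zip.)

# On the engine (or inside the attached zip), list the newest host-deploy logs
ls -lt /var/log/ovirt-engine/host-deploy/ | head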

Comment 2 Simone Tiraboschi 2018-06-11 09:03:42 UTC
where we see:

2018-06-10 17:14:42,892+0300 DEBUG otopi.plugins.otopi.services.systemd systemd.state:130 starting service vdsmd
2018-06-10 17:14:42,892+0300 DEBUG otopi.plugins.otopi.services.systemd plugin.executeRaw:813 execute: ('/usr/bin/systemctl', 'start', 'vdsmd.service'), executable='None', cwd='None', env=None
2018-06-10 17:14:46,685+0300 DEBUG otopi.plugins.otopi.services.systemd plugin.executeRaw:863 execute-result: ('/usr/bin/systemctl', 'start', 'vdsmd.service'), rc=1
2018-06-10 17:14:46,686+0300 DEBUG otopi.plugins.otopi.services.systemd plugin.execute:921 execute-output: ('/usr/bin/systemctl', 'start', 'vdsmd.service') stdout:


2018-06-10 17:14:46,687+0300 DEBUG otopi.plugins.otopi.services.systemd plugin.execute:926 execute-output: ('/usr/bin/systemctl', 'start', 'vdsmd.service') stderr:
Job for vdsmd.service canceled.

2018-06-10 17:14:46,687+0300 DEBUG otopi.context context._executeMethod:143 method exception
Traceback (most recent call last):
  File "/tmp/ovirt-hk028Vsxow/pythonlib/otopi/context.py", line 133, in _executeMethod
    method['method']()
  File "/tmp/ovirt-hk028Vsxow/otopi-plugins/ovirt-host-deploy/vdsm/packages.py", line 216, in _start
    self.services.state('vdsmd', True)
  File "/tmp/ovirt-hk028Vsxow/otopi-plugins/otopi/services/systemd.py", line 141, in state
    service=name,
RuntimeError: Failed to start service 'vdsmd'
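
To see why the start job was canceled, something like the following on the host should show the surrounding systemd/vdsm messages (a sketch using the standard unit names):

# Retry the start and inspect what systemd recorded for vdsmd and supervdsmd
systemctl start vdsmd.service
journalctl -u vdsmd.service -u supervdsmd.service --since "10 minutes ago" --no-pager
systemctl status vdsmd.service supervdsmd.service --no-pager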

Comment 3 Simone Tiraboschi 2018-06-11 09:07:26 UTC
and vdsm is down, with nothing in its log file:

[root@lynx01 ~]# systemctl status vdsmd
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since dom 2018-06-10 17:14:46 IDT; 18h ago
  Process: 48985 ExecStopPost=/usr/libexec/vdsm/vdsmd_init_common.sh --post-stop (code=exited, status=0/SUCCESS)
  Process: 48829 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=killed, signal=TERM)

giu 10 17:14:44 lynx01.lab.eng.tlv2.redhat.com systemd[1]: Starting Virtual Desktop Server Manager...
giu 10 17:14:44 lynx01.lab.eng.tlv2.redhat.com vdsmd_init_common.sh[48829]: vdsm: Running mkdirs
giu 10 17:14:44 lynx01.lab.eng.tlv2.redhat.com vdsmd_init_common.sh[48829]: vdsm: Running configure_coredump
giu 10 17:14:44 lynx01.lab.eng.tlv2.redhat.com vdsmd_init_common.sh[48829]: vdsm: Running configure_vdsm_logs
giu 10 17:14:44 lynx01.lab.eng.tlv2.redhat.com vdsmd_init_common.sh[48829]: vdsm: Running wait_for_network
giu 10 17:14:44 lynx01.lab.eng.tlv2.redhat.com vdsmd_init_common.sh[48829]: vdsm: Running run_init_hooks
giu 10 17:14:44 lynx01.lab.eng.tlv2.redhat.com vdsmd_init_common.sh[48829]: vdsm: Running ngn_configure
giu 10 17:14:46 lynx01.lab.eng.tlv2.redhat.com vdsmd_init_common.sh[48985]: vdsm: Running run_final_hooks
giu 10 17:14:46 lynx01.lab.eng.tlv2.redhat.com systemd[1]: Stopped Virtual Desktop Server Manager.
[root@lynx01 ~]# ls -l /var/log/vdsm/vdsm.log 
-rw-r--r--. 1 vdsm kvm 0  8 giu 23.19 /var/log/vdsm/vdsm.log

Comment 4 Simone Tiraboschi 2018-06-11 09:16:45 UTC
supervdsm failed as well.

[root@lynx01 ~]# systemctl status supervdsmd
● supervdsmd.service - Auxiliary vdsm service for running helper functions as root
   Loaded: loaded (/usr/lib/systemd/system/supervdsmd.service; static; vendor preset: enabled)
   Active: inactive (dead) since dom 2018-06-10 17:14:46 IDT; 19h ago
  Process: 48737 ExecStart=/usr/share/vdsm/daemonAdapter /usr/share/vdsm/supervdsmd --sockfile /var/run/vdsm/svdsm.sock (code=exited, status=0/SUCCESS)
 Main PID: 48737 (code=exited, status=0/SUCCESS)

giu 10 17:14:42 lynx01.lab.eng.tlv2.redhat.com systemd[1]: Started Auxiliary vdsm service for running helper functions as root.
giu 10 17:14:42 lynx01.lab.eng.tlv2.redhat.com systemd[1]: Starting Auxiliary vdsm service for running helper functions as root...
giu 10 17:14:46 lynx01.lab.eng.tlv2.redhat.com systemd[1]: Stopping Auxiliary vdsm service for running helper functions as root...
giu 10 17:14:46 lynx01.lab.eng.tlv2.redhat.com daemonAdapter[48737]: Traceback (most recent call last):
giu 10 17:14:46 lynx01.lab.eng.tlv2.redhat.com daemonAdapter[48737]: File "/usr/lib64/python2.7/multiprocessing/util.py", line 268, in _run_finalizers
giu 10 17:14:46 lynx01.lab.eng.tlv2.redhat.com daemonAdapter[48737]: finalizer()
giu 10 17:14:46 lynx01.lab.eng.tlv2.redhat.com daemonAdapter[48737]: File "/usr/lib64/python2.7/multiprocessing/util.py", line 201, in __call__
giu 10 17:14:46 lynx01.lab.eng.tlv2.redhat.com daemonAdapter[48737]: res = self._callback(*self._args, **self._kwargs)
giu 10 17:14:46 lynx01.lab.eng.tlv2.redhat.com daemonAdapter[48737]: OSError: [Errno 2] No such file or directory: '/var/run/vdsm/svdsm.sock'
giu 10 17:14:46 lynx01.lab.eng.tlv2.redhat.com systemd[1]: Stopped Auxiliary vdsm service for running helper functions as root.

Comment 5 Yaniv Kaul 2018-06-11 09:35:49 UTC
Severity?

Comment 6 SATHEESARAN 2018-06-11 11:01:56 UTC
(In reply to Yaniv Kaul from comment #5)
> Severity?

This severely impacts RHHI deployment. For RHHI this issue is 'urgent', as the HE deployment cannot be completed.

Comment 7 Dan Kenigsberg 2018-06-11 12:32:05 UTC
Can you attach {super,}vdsm.log and the host's /var/log/messages with hints on why vdsmd failed to start?

Comment 8 Kobi Hakimi 2018-06-11 12:39:33 UTC
Just open the attached zip file and you'll have all the logs :)
Also, if you want, I reproduced this issue and the machine is still up and running.

Comment 9 Martin Perina 2018-06-11 12:42:11 UTC
Piotr, could you please take a look? According to the logs it seems to me that vdsm-tool configure --force was executed, but afterwards supervdsm was not able to start:

Jun 10 16:23:31 lynx01 vdsmd_init_common.sh: vdsm: Running mkdirs
Jun 10 16:23:31 lynx01 vdsmd_init_common.sh: vdsm: Running configure_coredump
Jun 10 16:23:31 lynx01 vdsmd_init_common.sh: vdsm: Running configure_vdsm_logs
Jun 10 16:23:31 lynx01 vdsmd_init_common.sh: vdsm: Running wait_for_network
Jun 10 16:23:32 lynx01 vdsmd_init_common.sh: vdsm: Running run_init_hooks
Jun 10 16:23:32 lynx01 systemd: Started GlusterFS, a clustered file-system server.
Jun 10 16:23:32 lynx01 systemd: Reached target Network is Online.
Jun 10 16:23:32 lynx01 systemd: Starting Network is Online.
Jun 10 16:23:32 lynx01 systemd: Starting Crash recovery kernel arming...
Jun 10 16:23:32 lynx01 systemd: Starting System Logging Service...
Jun 10 16:23:32 lynx01 systemd: Starting Notify NFS peers of a restart...
Jun 10 16:23:32 lynx01 systemd: Starting LSB: Starts the Spacewalk Daemon...
Jun 10 16:23:32 lynx01 sm-notify[2501]: Version 1.3.0 starting
Jun 10 16:23:32 lynx01 vdsmd_init_common.sh: vdsm: Running ngn_configure

^^^ Here configure should be executed

Jun 10 16:23:32 lynx01 systemd: Started Notify NFS peers of a restart.
Jun 10 16:23:32 lynx01 rhnsd: Starting Spacewalk Daemon: [  OK  ]
Jun 10 16:23:32 lynx01 rhnsd[2530]: Spacewalk Services Daemon starting up, check in interval 240 minutes.
Jun 10 16:23:32 lynx01 systemd: Started LSB: Starts the Spacewalk Daemon.
Jun 10 16:23:32 lynx01 rsyslogd: [origin software="rsyslogd" swVersion="8.24.0" x-pid="2500" x-info="http://www.rsyslog.com"] start
Jun 10 16:23:32 lynx01 systemd: Started System Logging Service.
Jun 10 16:23:32 lynx01 kdumpctl: No kdump initial ramdisk found.
Jun 10 16:23:32 lynx01 kdumpctl: Rebuilding /boot/initramfs-3.10.0-862.3.2.el7.x86_64kdump.img
Jun 10 16:23:33 lynx01 sasldblistusers2: _sasldb_getkeyhandle has failed
Jun 10 16:23:34 lynx01 vdsmd_init_common.sh: vdsm: Running run_final_hooks
Jun 10 16:23:35 lynx01 systemd: Stopped Virtual Desktop Server Manager.
Jun 10 16:23:35 lynx01 systemd: Starting Suspend/Resume Running libvirt Guests...
Jun 10 16:23:35 lynx01 systemd: Reached target Multi-User System.
Jun 10 16:23:35 lynx01 systemd: Starting Multi-User System.
Jun 10 16:23:35 lynx01 systemd: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Jun 10 16:23:35 lynx01 systemd: Starting Stop Read-Ahead Data Collection 10s After Completed Startup.
Jun 10 16:23:35 lynx01 systemd: Starting Update UTMP about System Runlevel Changes...
Jun 10 16:23:35 lynx01 systemd: Stopping Auxiliary vdsm service for running helper functions as root...
Jun 10 16:23:35 lynx01 daemonAdapter: Traceback (most recent call last):
Jun 10 16:23:35 lynx01 daemonAdapter: File "/usr/lib64/python2.7/multiprocessing/util.py", line 268, in _run_finalizers
Jun 10 16:23:35 lynx01 daemonAdapter: finalizer()
Jun 10 16:23:35 lynx01 daemonAdapter: File "/usr/lib64/python2.7/multiprocessing/util.py", line 201, in __call__
Jun 10 16:23:35 lynx01 systemd: Started Update UTMP about System Runlevel Changes.
Jun 10 16:23:35 lynx01 daemonAdapter: res = self._callback(*self._args, **self._kwargs)
Jun 10 16:23:35 lynx01 daemonAdapter: OSError: [Errno 2] No such file or directory: '/var/run/vdsm/svdsm.sock'
Jun 10 16:23:35 lynx01 systemd: Started Suspend/Resume Running libvirt Guests.
Jun 10 16:23:35 lynx01 systemd: Stopped Auxiliary vdsm service for running helper functions as root.
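
If that's the case, ngn_configure re-runs the vdsm configuration while supervdsmd is already running, and the resulting stop cancels the in-flight vdsmd start job. As a manual workaround on an affected host (just a sketch, not the actual fix, which landed in imgbased), running the configuration once by hand and then starting the services should work:

# Re-run the vdsm configuration explicitly, then bring the daemons up again
vdsm-tool configure --force
systemctl restart supervdsmd.service
systemctl start vdsmd.service
systemctl is-active supervdsmd.service vdsmd.service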

Comment 10 Piotr Kliczewski 2018-06-12 07:27:09 UTC
Martin, it looks like Ryan already pushed a patch to fix the issue.

Comment 11 Yaniv Kaul 2018-06-13 12:18:46 UTC
Ryan, can we build and deliver a fixed RHVH? It blocks RHHI testing, for example.

Comment 12 Ryan Barry 2018-06-13 12:26:10 UTC
The workaround on the RHVH side was posted on Monday and merged this morning (it passed code review late yesterday).

I'll put out a RHVH build for RHHI QE to use today, though they should expect another build later this week (Friday/Sunday) after the normal RHV compose.

Comment 14 Pavol Brilla 2018-06-20 12:28:34 UTC
Version for verification:

# rpm -qa | grep ovirt-hosted-engine
ovirt-hosted-engine-ha-2.2.14-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.22-1.el7ev.noarch
[root@lynx21 ~]# imgbase w
You are on rhvh-4.2.4.3-0.20180615.0+1
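
A quick sanity check that the original symptom is gone (a sketch, on top of re-running the hosted-engine deployment):

# Both vdsm daemons should come up cleanly after host-deploy
systemctl is-active vdsmd.service supervdsmd.service
systemctl status vdsmd.service --no-pager --lines=5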

Comment 15 Sandro Bonazzola 2018-06-26 08:41:25 UTC
This bug is included in the oVirt 4.2.4 release, published on June 26th 2018.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

