Bug 1459481 - Can't shutdown/reboot host with hosted engine
Summary: Can't shutdown/reboot host with hosted engine
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: ovirt-hosted-engine-setup
Classification: oVirt
Component: General
Version: 2.1.0.6
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ovirt-4.2.0
Target Release: ---
Assignee: Simone Tiraboschi
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-06-07 09:22 UTC by shyningcrow
Modified: 2017-10-04 12:38 UTC
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-12 09:33:24 UTC
oVirt Team: Integration
Embargoed:
ylavi: ovirt-4.2+


Attachments (Terms of Use)
Journald log file (187.35 KB, text/x-vhdl)
2017-06-07 09:22 UTC, shyningcrow

Description shyningcrow 2017-06-07 09:22:41 UTC
Created attachment 1285723 [details]
Journald log file

Description of problem:
In a test environment, I have one node (CentOS 7.3) on which the hosted engine has been deployed. The machine could be rebooted normally before the installation of oVirt. After the installation of oVirt HE, it refuses to shut down or reboot, refuses any form of connection (SSH: connection refused), and requires a manual reset using the physical power button.

Version-Release number of selected component (if applicable):
ovirt-engine-sdk-python-3.6.9.1-1.el7.centos.noarch
ovirt-imageio-common-1.0.0-1.el7.noarch
ovirt-hosted-engine-ha-2.1.0.6-1.el7.centos.noarch
ovirt-hosted-engine-setup-2.1.0.6-1.el7.centos.noarch
ovirt-vmconsole-1.0.4-1.el7.centos.noarch
ovirt-host-deploy-1.6.5-1.el7.centos.noarch
cockpit-ovirt-dashboard-0.10.7-0.0.18.el7.centos.noarch
ovirt-vmconsole-host-1.0.4-1.el7.centos.noarch
ovirt-setup-lib-1.1.0-1.el7.centos.noarch
ovirt-engine-appliance-4.1-20170523.1.el7.centos.noarch
ovirt-imageio-daemon-1.0.0-1.el7.noarch
vdsm-client-4.19.15-1.el7.centos.noarch
vdsm-xmlrpc-4.19.15-1.el7.centos.noarch
vdsm-jsonrpc-4.19.15-1.el7.centos.noarch
vdsm-hook-vmfex-dev-4.19.15-1.el7.centos.noarch
vdsm-api-4.19.15-1.el7.centos.noarch
vdsm-yajsonrpc-4.19.15-1.el7.centos.noarch
vdsm-4.19.15-1.el7.centos.x86_64
vdsm-python-4.19.15-1.el7.centos.noarch
vdsm-cli-4.19.15-1.el7.centos.noarch
libvirt-daemon-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-driver-lxc-2.0.0-10.el7_3.9.x86_64
libvirt-client-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-driver-secret-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-config-network-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-driver-nwfilter-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-config-nwfilter-2.0.0-10.el7_3.9.x86_64
libvirt-2.0.0-10.el7_3.9.x86_64
libvirt-python-2.0.0-2.el7.x86_64
libvirt-daemon-driver-network-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-driver-nodedev-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-driver-qemu-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-kvm-2.0.0-10.el7_3.9.x86_64
libvirt-lock-sanlock-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-driver-interface-2.0.0-10.el7_3.9.x86_64
libvirt-daemon-driver-storage-2.0.0-10.el7_3.9.x86_64
libnfsidmap-0.25-15.el7.x86_64
nfs-utils-1.3.0-0.33.el7_3.x86_64


How reproducible:
Install CentOS, oVirt repositories, perform hosted-engine --deploy or use the cockpit plugin. Try to restart the host.

Steps to Reproduce:
1. Install CentOS 7
2. Install oVirt repositories
3. Setup NFSv3 shares on the host.
4. Disable/Configure SELinux.
5. Install oVirt HE (hosted-engine --deploy; alternatively use the cockpit plugin)
6. Perform a shutdown/reboot by any means (systemctl poweroff; shutdown; cockpit; HE)
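The steps above can be condensed into the following sketch (not runnable outside a fresh CentOS 7 host; the release RPM URL is the standard oVirt 4.1 one and may differ for other releases):

```shell
# Hedged sketch of the reproduction steps above (oVirt 4.1 on CentOS 7).
yum install -y http://resources.ovirt.org/pub/yum-repo/ovirt-release41.rpm
yum install -y ovirt-hosted-engine-setup
hosted-engine --deploy        # or use the Cockpit oVirt plugin
systemctl poweroff            # host hangs here instead of powering off
```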

Actual results:
The system hangs and refuses to shut down, reboot, or accept any form of connection (SSH: connection refused) for a long (3+ hours), possibly indefinite, time.

Expected results:
The system should shut down/reboot normally even if it isn't able to migrate the HE (although I suspect that's not the reason), or at least it should try to kill the offending processes to allow a shutdown.

Additional info:
- SELinux is currently disabled for testing purposes and was disabled during the installation.
- The problem arises both before the system is fully configured (HE requires additional storage to be set up after installation) and afterwards.
- My current configuration uses NFSv3 shares on the same host which I'm trying to shutdown.
- Vdsm reports an error (BackendFailureException).
- Sanlock probably prevents the system from shutting down, since unmounting times out. (My best bet at the moment.)
- Stopping Sanlock/Vdsm or both has no effect on the current behaviour.
- Stopping the HE virtual machine before shutting down has no effect.
- Setting the global maintenance mode has no effect and will prevent host shutdown anyway.
- The Cockpit oVirt plugin doesn't seem to be able to connect to Vdsm, although querying the Vdsm API manually via the script works correctly.

Attaching a persistent journald log file in which the problem can be observed; notice the pages near the end of the log. In this log, the machine was shut down manually.

Probably related: http://lists.ovirt.org/pipermail/users/2016-March/071649.html .

Comment 1 shyningcrow 2017-06-09 14:46:48 UTC
A few updates:
 - I was able to access vdsmd through the Cockpit oVirt plugin as root. Logging into Cockpit as a regular user probably lacks the required permissions, so I don't think that issue is related to this bug.
 - Stopping ovirt-ha-agent, ovirt-ha-broker, vdsmd and sanlock (without global maintenance) causes a system reboot (wdmd probably concludes the node has failed).
 - Stopping sanlock is unsuccessful: the process isn't stopped by systemd and the unit enters a failed state. The process keeps running and prevents unmounting the /rhev/... mountpoint.
 - Stopping wdmd (alongside the agents, vdsmd and sanlock) still produces a reboot (it shouldn't behave like this).

Comment 2 Sandro Bonazzola 2017-06-12 09:33:24 UTC
Looks like you're running a hosted engine on a single host with local storage (nfs export?).

To shut it down, you need to put the host into global maintenance, then shut down the hosted engine VM, then disconnect the storage from vdsm before powering off.
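For reference, the sequence described above might look roughly like this on the host (a sketch only, not runnable outside a hosted-engine host; the exact storage-disconnect step depends on the deployment and may require an explicit vdsm-client call):

```shell
# Hedged sketch of the suggested shutdown sequence.
hosted-engine --set-maintenance --mode=global   # 1. enter global maintenance
hosted-engine --vm-shutdown                     # 2. stop the engine VM
systemctl stop ovirt-ha-agent ovirt-ha-broker   # 3. stop the HA services
systemctl stop vdsmd                            # 4. stop vdsm
# 5. disconnect/unmount the hosted-engine storage, then:
systemctl poweroff
```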

This is not a supported deployment anyway; hosted engine requires at least 2 hosts.

Closing as not a bug.

Comment 3 shyningcrow 2017-10-04 12:38:40 UTC
I know the configuration I'm trying to use is not supported, but I do think it would be a great addition for PoC cases and small deployments. For instance, vSphere doesn't require two hosts, and it makes me sad that oVirt HE isn't able to do the same.

Although the procedure you described SHOULD theoretically work, it doesn't.

I think I've finally been able to identify the root cause of the problem. The devices are unmounted BEFORE sanlock can properly shut down. Since the devices aren't available anymore, sanlock tries to write and sync its state, but it isn't able to do so because there's nothing behind those NFS mounts anymore. Because of this, the task hangs.

I am currently trying to avoid this situation by using systemd mounts in order to correctly specify the shutdown order.
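One way to express that ordering with systemd (a sketch under my assumptions: the storage is mounted under /rhev/data-center, and the path in the drop-in is an example) is a drop-in for sanlock.service using RequiresMountsFor, so that at shutdown sanlock is stopped, and can sync its state, before the mount is torn down:

```
# /etc/systemd/system/sanlock.service.d/storage-order.conf
# Hypothetical drop-in: order sanlock after the mount under /rhev so
# that systemd stops sanlock before unmounting it at shutdown.
[Unit]
RequiresMountsFor=/rhev/data-center
```

After adding the drop-in, `systemctl daemon-reload` would be needed for it to take effect.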

