Red Hat Bugzilla – Bug 1459481
Can't shutdown/reboot host with hosted engine
Last modified: 2017-10-04 08:38:40 EDT
Created attachment 1285723 [details]
Journald log file
Description of problem:
In a test environment, I have one node (CentOS 7.3) on which the hosted engine has been deployed. The machine could be rebooted normally before the installation of oVirt. After the installation of oVirt HE, it refuses to shut down or reboot, refuses any connection (SSH connection refused), and requires a manual reset using the physical button.
Version-Release number of selected component (if applicable):
How reproducible:
Install CentOS and the oVirt repositories, then run hosted-engine --deploy or use the Cockpit plugin. Try to restart the host.
Steps to Reproduce:
1. Install CentOS 7
2. Install oVirt repositories
3. Setup NFSv3 shares on the host.
4. Disable/Configure SELinux.
5. Install oVirt HE (hosted-engine --deploy; alternatively use the Cockpit plugin)
6. Perform a shutdown/reboot by any means (systemctl poweroff; shutdown; Cockpit; HE)
Actual results:
The system hangs and refuses to shut down, reboot, or accept any connection (SSH connection refused) for a long time (3+ hours), possibly indefinitely.
Expected results:
The system should shut down or reboot normally, even if it isn't able to migrate the HE, although I suspect that's not the cause. At the very least, it should try to kill the blocking processes to allow a shutdown.
Additional info:
- SELinux is currently disabled for testing purposes and was disabled during the installation.
- The problem occurs both before the system is fully configured (HE requires an additional storage to be set after installation) and afterwards.
- My current configuration uses NFSv3 shares exported from the same host that I'm trying to shut down.
- Vdsm reports an error (BackendFailureException).
- Sanlock probably prevents the system from shutting down because unmounting times out (my best guess at the moment).
- Stopping Sanlock, Vdsm, or both has no effect on this behaviour.
- Stopping the HE virtual machine before shutting down has no effect.
- Setting global maintenance mode has no effect; the host still fails to shut down.
- The Cockpit oVirt plugin doesn't seem to be able to connect to Vdsm, although manually running the script that queries the Vdsm API works correctly.
I'm attaching a persistent journald log file in which the problem can be observed; see the entries near the end of the log. In this log, the machine was shut down manually.
Probably related: http://lists.ovirt.org/pipermail/users/2016-March/071649.html .
A few updates:
- I was able to access vdsmd through the Cockpit oVirt plugin as root. Logging in to Cockpit as a regular user probably lacks the required permissions, so I don't think that issue is related to this bug.
- Stopping ovirt-ha-agent, ovirt-ha-broker, vdsmd and sanlock (without global maintenance) triggers a system reboot (wdmd probably thinks the node has failed).
- Stopping sanlock doesn't succeed: systemd fails to stop the process and the unit enters the failed state. The process keeps running and prevents umount of the /rhev/... mountpoint.
- Stopping wdmd (alongside the agents, vdsmd and sanlock) triggers a reboot anyway (it shouldn't behave like this).
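For anyone trying to confirm the observation above that sanlock keeps the mount busy, here is a minimal diagnostic sketch. The function name and the DRY_RUN switch are hypothetical helpers introduced for illustration, and the mountpoint must be passed in by the caller because the full /rhev/... path is not given in this report. It assumes the sanlock client tool and fuser (from psmisc) are installed.

```shell
#!/bin/sh
# Diagnostic sketch: show what sanlock currently holds and which processes
# keep a given mountpoint busy. Set DRY_RUN=1 to only print the commands.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# show_mount_holders MOUNTPOINT  (hypothetical helper name)
show_mount_holders() {
    mountpoint="$1"
    if [ -z "$mountpoint" ]; then
        echo "usage: show_mount_holders MOUNTPOINT" >&2
        return 1
    fi
    # Lockspaces and resources currently held by the sanlock daemon.
    run sanlock client status
    # State of the unit as systemd sees it (e.g. failed after a stop attempt).
    run systemctl status sanlock.service
    # Processes with open files on the filesystem mounted at MOUNTPOINT.
    # Note: this may itself block if the NFS server is already gone.
    run fuser -vm "$mountpoint"
}
```

Running it with DRY_RUN=1 first shows the commands without touching the system.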
Looks like you're running a hosted engine on a single host with local storage (nfs export?).
To shut it down, you need to put the host in global maintenance, then shut down the hosted engine VM, then disconnect the storage from vdsm before shutting down the host.
This is not a supported deployment anyway; for hosted engine, at least 2 hosts should be used.
Closing as not a bug.
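The shutdown order suggested in the comment above can be sketched as a small script. This is only an illustration under stated assumptions: the function name and DRY_RUN switch are hypothetical, and the --disconnect-storage option is assumed to be available in the installed hosted-engine version (check hosted-engine --help; it may not exist in older releases).

```shell
#!/bin/sh
# Sketch of the suggested shutdown order for a single hosted-engine host.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# he_shutdown_sequence  (hypothetical helper name)
he_shutdown_sequence() {
    # 1. Global maintenance: the HA agents stop managing the engine VM.
    run hosted-engine --set-maintenance --mode=global || return 1
    # 2. Ask the engine VM to shut down. This returns before the VM is
    #    actually down; check `hosted-engine --vm-status` before going on.
    run hosted-engine --vm-shutdown || return 1
    # 3. ASSUMPTION: this option exists in the installed version.
    run hosted-engine --disconnect-storage || return 1
    # 4. Only then power off the host.
    run systemctl poweroff
}
```

With DRY_RUN=1 the function only prints the four commands in order, which is a safe way to review the sequence before running it on a real host.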
I know the configuration I'm trying to use is not supported, but I do think this would be a great addition for PoC cases and small deployments. For instance, vSphere doesn't require two hosts, and it makes me sad that oVirt HE isn't able to do the same.
Although the procedure you described SHOULD theoretically work, it doesn't.
I think I've finally been able to detect the root cause of the problem. Raw devices are unmounted BEFORE Sanlock can properly shut down. Sanlock then tries to write and sync its state, but it is unable to do so because there is nothing behind those NFS mounts anymore. Because of this, the task hangs.
I am currently trying to avoid this situation by using systemd mounts in order to specify the shutdown order correctly.
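As an illustration of the ordering idea (not a verified fix for this bug), the sketch below writes a systemd drop-in that orders sanlock.service after a given mount unit. systemd stops units in the reverse of their start order, so an After= on the mount means sanlock is stopped before the filesystem is unmounted. The function name, the drop-in file name, and the mount unit passed in are all hypothetical; the real unit name has to be derived from the actual mountpoint, for example with systemd-escape --path --suffix=mount.

```shell
#!/bin/sh
# Sketch: order sanlock.service after a storage mount unit so that, at
# shutdown, sanlock is stopped before that mount is unmounted.

# write_sanlock_dropin MOUNT_UNIT [DROPIN_DIR]  (hypothetical helper name)
write_sanlock_dropin() {
    mount_unit="$1"
    dropin_dir="${2:-/etc/systemd/system/sanlock.service.d}"
    if [ -z "$mount_unit" ]; then
        echo "usage: write_sanlock_dropin MOUNT_UNIT [DROPIN_DIR]" >&2
        return 1
    fi
    mkdir -p "$dropin_dir" || return 1
    # The file name 10-after-storage.conf is an arbitrary choice.
    cat > "$dropin_dir/10-after-storage.conf" <<EOF
[Unit]
# Stop order is the reverse of start order: sanlock stops first,
# then the mount is unmounted.
After=$mount_unit
EOF
}
```

After writing the drop-in to the default directory, systemctl daemon-reload is needed for systemd to pick it up. Since the NFS export lives on the same host, the mount itself would presumably also need to be ordered after nfs-server.service so that it is unmounted before the server stops.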