Created attachment 1661969 [details]
logs

Description of problem:
If the host is rebooted (or powered off) by running the poweroff or reboot command, the HA VMs are not restarted.

Version-Release number of selected component (if applicable):
vdsm-4.40.2-1.el8ev.x86_64
ovirt-engine-4.4.0-0.19.master.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Run HA VMs on the host.
2. ssh to the host and run 'poweroff' (the same behavior occurs for reboot).

Actual results:
HA VMs are not restarted when the host is back.

Expected results:
HA VMs must be restarted.

Additional info:
engine, vdsm, qemu logs are attached
The mechanism of the reboot should not matter here. If the host was not put into maintenance first, there will be an existing lease on the disk, and we must wait at least 5 minutes for it to expire, even after the host is back up in Vdsm.
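As a side note, whether the host currently holds any sanlock lockspaces or resource leases can be checked directly on the host once it is back up (purely a diagnostic aid; VM leases are sanlock resources on the storage domain):

# sanlock client status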
clean "poweroff" should trigger libvirt-guests service which should shut down the VMs, I can see in the logs(even though their are just partial and engine log is missing) that it did just that. The only thing is that those shutdowns are clean, so no HA is started. Bug 1334982 should have changed exit reason so HA VMs are then restarted, but that was later rewritten(I952d8e590c5d40009b8ea119c584b268cb266adb) and it seems it may no longer work. Need correct logs from engine as well as debug level in vdsm to see why it didn't set the right shutdown reason
Created attachment 1662121 [details]
logs

Attached new logs (engine, vdsm (with debug), qemu, libvirt). In engine.log the test starts at:

2020-02-10 12:43:33,194+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [] EVENT_ID: USER_RUN_VM(32), VM high_4 started on Host host_mixed_1
2020-02-10 12:43:33,211+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [] EVENT_ID: USER_RUN_VM(32), VM high_2 started on Host host_mixed_1
2020-02-10 12:43:33,233+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [] EVENT_ID: USER_RUN_VM(32), VM high_1 started on Host host_mixed_1
2020-02-10 12:43:33,251+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [] EVENT_ID: USER_RUN_VM(32), VM high_3 started on Host host_mixed_1
The problem is clearly on the host. There is a message "host in shutdown waiting" in vdsm.log, so Vdsm is aware of the shutdown. However, the VMs exit with the wrong exit reason: "'exitMessage': 'User shut down from within the guest', 'exitReason': 7", with no apparent cause. When I try it on my setup, it works as expected and I get the right exit reason: "'exitMessage': 'VM has been terminated on the host', 'exitReason': 11". Identifying the right exit reason involves systemd and D-Bus. I suspect it could be a platform problem that might have been fixed in the meantime. I'd suggest trying again with an up-to-date system. If it still doesn't work and you have at least systemd-239-18 and dbus-1.12.8-9, we will have to inspect it on a live system.
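As a quick sanity check, the package versions mentioned above and any shutdown inhibitor locks registered on the host can be listed directly (purely diagnostic; whether Vdsm registers such a lock is not visible in the attached logs):

# rpm -q systemd dbus
# systemd-inhibit --list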
Polina, could you please check whether it reproduces on the latest version?
Created attachment 1711117 [details]
logs

Yes, I'm reproducing it on vdsm-4.40.22-1.el8ev.x86_64, ovirt-engine-4.4.1.10-0.1.el8ev.noarch, libvirt-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64. Attaching logs, but I can also provide the environment if required.

The logs are for the following scenario:
1. Run two HA VMs, golden_env_mixed_virtio_0 and golden_env_mixed_virtio_1, on host1 at Aug 11, 2020, 9:40:42 PM.
2. ssh to host 1 and reboot. Host host_mixed_1 is non-responsive at Aug 11, 2020, 9:46:55 PM.

When the host is back, the VMs are not started. Waited an hour.
I can reproduce it now, with:
- qemu-kvm-4.2.0-29.module+el8.2.1+7297+a825794d.x86_64
- systemd-239-31.el8_2.2.x86_64
- dbus-1.12.8-10.el8_2.x86_64
- kernel-4.18.0-193.14.3.el8_2.x86_64
- libvirt-daemon-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64

libvirt reports it as a guest shutdown rather than a host shutdown. Looking into the libvirt sources, this information comes directly from QEMU. There is nothing interesting in the QEMU log, just:

2020-08-20 09:32:55.013+0000: shutting down, reason=shutdown

Martin, do you have any idea why a guest shutdown can be reported for a VM when running poweroff on the host? It didn't happen before and there was no related change on our side AFAIK. Or any idea what to look at or who to ask for help?
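One way to capture the SHUTDOWN event exactly as QEMU reports it (including the guest/reason fields) during the next reproduction is to enable libvirt debug logging beforehand. A minimal example; the filter and output values are just one reasonable choice, not a required configuration. Add to /etc/libvirt/libvirtd.conf:

log_filters="1:qemu 1:libvirt"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"

and restart the daemon:

# systemctl restart libvirtd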
As Michal pointed out, check whether you have the libvirt-guests service activated. That service is intended to let users keep their machines (save their states, although that is configurable) across a shutdown of the host. I would suspect oVirt/RHV disables this service if it is managing the host. I am guessing machined would not be the one suspending the guests, since it looks like the shutdown is done from inside the guest (probably using qemu-ga).
libvirt-guests seems to be working normally:

# journalctl -b -1 -u libvirt-guests
-- Logs begin at Thu 2020-08-20 11:38:01 CEST, end at Thu 2020-08-20 16:25:07 CEST. --
Aug 20 16:02:39 rhel-8-host-1 systemd[1]: Starting Suspend/Resume Running libvirt Guests...
Aug 20 16:02:39 rhel-8-host-1 systemd[1]: Started Suspend/Resume Running libvirt Guests.
Aug 20 16:14:19 rhel-8-host-1 systemd[1]: Stopping Suspend/Resume Running libvirt Guests...
Aug 20 16:14:20 rhel-8-host-1 libvirt-guests.sh[3835]: Running guests on qemu+tls://rhel-8-host-1/system URI: centos
Aug 20 16:14:20 rhel-8-host-1 libvirt-guests.sh[3835]: Shutting down guests on qemu+tls://rhel-8-host-1/system URI...
Aug 20 16:14:20 rhel-8-host-1 libvirt-guests.sh[3835]: Starting shutdown on guest: centos
Aug 20 16:14:22 rhel-8-host-1 libvirt-guests.sh[3835]: Waiting for guest centos to shut down, 600 seconds left
Aug 20 16:14:28 rhel-8-host-1 libvirt-guests.sh[3835]: Shutdown of guest centos complete.
Aug 20 16:14:28 rhel-8-host-1 systemd[1]: Stopped Suspend/Resume Running libvirt Guests.

The changes to /etc/sysconfig/libvirt-guests made by Vdsm also look innocent to me:

URIS=qemu+tls://rhel-8-host-1/system
ON_BOOT=ignore
ON_SHUTDOWN=shutdown
PARALLEL_SHUTDOWN=0
SHUTDOWN_TIMEOUT=600

libvirt reports some errors during the shutdown process, but I guess they are normal, aren't they? (qemu-ga is running in the VM.)

# journalctl -b -1 -u 'libvirt*' --since='2020-08-20 16:14:20' --until='2020-08-20 16:14:29'
-- Logs begin at Thu 2020-08-20 11:38:01 CEST, end at Thu 2020-08-20 16:41:13 CEST. --
Aug 20 16:14:20 rhel-8-host-1 libvirt-guests.sh[3835]: Running guests on qemu+tls://rhel-8-host-1/system URI: centos
Aug 20 16:14:20 rhel-8-host-1 libvirt-guests.sh[3835]: Shutting down guests on qemu+tls://rhel-8-host-1/system URI...
Aug 20 16:14:20 rhel-8-host-1 libvirt-guests.sh[3835]: Starting shutdown on guest: centos
Aug 20 16:14:22 rhel-8-host-1 libvirt-guests.sh[3835]: Waiting for guest centos to shut down, 600 seconds left
Aug 20 16:14:23 rhel-8-host-1 libvirtd[1619]: Guest agent is not responding: QEMU guest agent is not connected
Aug 20 16:14:27 rhel-8-host-1 libvirtd[1619]: internal error: End of file from qemu monitor
Aug 20 16:14:28 rhel-8-host-1 libvirtd[1619]: internal error: child reported (status=125): internal error: End of file from qemu monitor
Aug 20 16:14:28 rhel-8-host-1 libvirtd[1619]: internal error: child reported (status=125): internal error: child reported (status=125): internal error: End of file from qemu monitor
Aug 20 16:14:28 rhel-8-host-1 libvirtd[1619]: Unable to remove disk metadata on vm centos from /rhev/data-center/mnt/centos-engine-44.standard:_export_data/d18ee680-fe75-42e3-b9c5-ea7a380ed371/images/9db5f863-b569-45b1-89d9-20cffc9642ff/66f6df4a-8837-4fd3-a283-6e2c1a9e6929 (disk target sda)
Aug 20 16:14:28 rhel-8-host-1 libvirtd[1619]: internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: End of file from qemu monitor
Aug 20 16:14:28 rhel-8-host-1 libvirtd[1619]: internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: End of file from qemu monitor
Aug 20 16:14:28 rhel-8-host-1 libvirtd[1619]: Unable to remove disk metadata on vm centos from /rhev/data-center/mnt/centos-engine-44.standard:_export_data/d18ee680-fe75-42e3-b9c5-ea7a380ed371/images/9db5f863-b569-45b1-89d9-20cffc9642ff/139dcd4d-2297-48a9-b7c2-3dcf50b9b01d (disk target sda)
Aug 20 16:14:28 rhel-8-host-1 libvirtd[1619]: internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: End of file from qemu monitor
Aug 20 16:14:28 rhel-8-host-1 libvirtd[1619]: internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: End of file from qemu monitor
Aug 20 16:14:28 rhel-8-host-1 libvirtd[1619]: Unable to remove disk metadata on vm centos from /rhev/data-center/mnt/centos-engine-44.standard:_export_data/d18ee680-fe75-42e3-b9c5-ea7a380ed371/images/9db5f863-b569-45b1-89d9-20cffc9642ff/fe055930-bd4c-4e46-82e6-66b574a01267 (disk target sda)
Aug 20 16:14:28 rhel-8-host-1 libvirt-guests.sh[3835]: Shutdown of guest centos complete.
Aug 20 16:14:28 rhel-8-host-1 systemd[1]: Stopped Suspend/Resume Running libvirt Guests.

I'm also attaching a libvirt debug log, from another run:

Aug 20 16:52:19 rhel-8-host-1 systemd[1]: Stopping Suspend/Resume Running libvirt Guests...
Aug 20 16:52:20 rhel-8-host-1 libvirt-guests.sh[7774]: Running guests on qemu+tls://rhel-8-host-1/system URI: centos
Aug 20 16:52:20 rhel-8-host-1 libvirt-guests.sh[7774]: Shutting down guests on qemu+tls://rhel-8-host-1/system URI...
Aug 20 16:52:20 rhel-8-host-1 libvirt-guests.sh[7774]: Starting shutdown on guest: centos
Aug 20 16:52:22 rhel-8-host-1 libvirt-guests.sh[7774]: Waiting for guest centos to shut down, 600 seconds left
Aug 20 16:52:27 rhel-8-host-1 libvirt-guests.sh[7774]: Waiting for guest centos to shut down, 595 seconds left
Aug 20 16:52:30 rhel-8-host-1 libvirt-guests.sh[7774]: Shutdown of guest centos complete.
Aug 20 16:52:30 rhel-8-host-1 systemd[1]: Stopped Suspend/Resume Running libvirt Guests.

The debug log looks more interesting; I'm not sure everything is all right there:

2020-08-20 14:52:20.861+0000: 7169: debug : qemuAgentCommand:1131 : Send command '{"execute":"guest-shutdown","arguments":{"mode":"powerdown"}}' for write, seconds = 60
...
2020-08-20 14:52:27.688+0000: 7164: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1597935147, "microseconds": 687708}, "event": "SHUTDOWN", "data": {"guest": true, "reason": "guest-shutdown"}}]
2020-08-20 14:52:27.688+0000: 7164: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7fc0b0002c50 event={"timestamp": {"seconds": 1597935147, "microseconds": 687708}, "event": "SHUTDOWN", "data": {"guest": true, "reason": "guest-shutdown"}}
...
2020-08-20 14:52:27.712+0000: 7164: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1597935147, "microseconds": 712137}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}}]
2020-08-20 14:52:27.712+0000: 7164: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7fc0b0002c50 event={"timestamp": {"seconds": 1597935147, "microseconds": 712137}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}}
...
2020-08-20 14:52:27.712+0000: 7164: debug : qemuProcessHandleShutdown:582 : Ignoring repeated SHUTDOWN event from domain centos

Is it OK that QEMU reports a guest shutdown first and then a host shutdown? libvirt reports only the first event (guest shutdown) to Vdsm, followed by a Stopped event:

2020-08-20 16:52:27,690+0200 DEBUG (libvirt/events) [virt.vm] (vmId='6f6b8826-eb89-41b8-ac1b-c0a93672cacc') event Shutdown detail 1 opaque None (vm:4914)
2020-08-20 16:52:30,348+0200 DEBUG (libvirt/events) [virt.vm] (vmId='6f6b8826-eb89-41b8-ac1b-c0a93672cacc') event Stopped detail 0 opaque None (vm:4914)

Martin, can you see anything wrong here?
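To double-check which lifecycle event and detail libvirt actually delivers, independently of Vdsm, the events can also be watched with virsh during another manual run of the shutdown (e.g. stopping libvirt-guests by hand); the domain name is the one from the logs above:

# virsh event centos --event lifecycle --loop --timestamp

Each lifecycle event is printed with its detail as it arrives, so the Shutdown detail reported for the guest can be compared with what Vdsm logs.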
Created attachment 1712044 [details]
libvirtd.log.xz
Well, clearly libvirt-guests is set up to shut down the machines on host shutdown, so it does that, and the default is to use the guest agent if possible and just an ACPI signal if it is not. That's why you see a clean shutdown of the machine from inside the guest. I would guess the errors are there because the guest does not reply before the shutdown takes place and we get disconnected from the socket; hopefully there's no more to it than that.
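For each running guest this is roughly equivalent to the following virsh call (an approximation of what libvirt-guests.sh does, not its exact invocation):

# virsh shutdown centos --mode agent,acpi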
(In reply to Martin Kletzander from comment #11)
> Well, clearly libvirt-guests is set up to shut down the machines on host
> shutdown, so it does that,

Yes, this is what should happen to the VMs running on the host.

> and the default is to use the guest agent if possible and just an ACPI
> signal if it is not. That's why you see a clean shutdown of the machine
> from inside the guest.

So libvirt-guests.sh calls `virsh shutdown', and from the libvirt point of view it's a shutdown from inside the guest, which libvirt reports as such to Vdsm (with or without qemu-ga). I wonder why it worked differently before. Under which circumstances would libvirt report a host shutdown then? And does libvirt provide a way to distinguish between a host-initiated shutdown and a (real) shutdown from within the guest, while still ensuring a clean VM shutdown on host shutdown?
(In reply to Milan Zamazal from comment #12)

Not sure, to be honest. I have spent some time away from the project, so I will have missed some changes, but it would be nice to know which version worked the way you expected. Could you also describe how the older behaviour was different? How did libvirt report the host shutdown? Was it in the "reason" field of the domain status?
Unfortunately, I can't remember the details. I think the libvirt VIR_DOMAIN_EVENT_SHUTDOWN event detail (VIR_DOMAIN_EVENT_SHUTDOWN_HOST vs. VIR_DOMAIN_EVENT_SHUTDOWN_GUEST) was used and worked correctly for me when I wrote Comment 4. I also think I had seen some systemd bug that might have caused it not to work correctly and that was handled around that time; this is why I wrote there "Identifying the right exit reason involves systemd and dbus." But I can't find it now. Looking at how it works, I wonder how it could have worked for me. Maybe there was something wrong in my environment that made it work by side effect, hard to say now. I'm not sure how it is supposed to work exactly; I'll ask on the users list for clarification.
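For reference, the numeric values of those event details can be checked with the libvirt Python bindings, assuming libvirt-python is installed on the host:

# python3 -c 'import libvirt; print(libvirt.VIR_DOMAIN_EVENT_SHUTDOWN_GUEST, libvirt.VIR_DOMAIN_EVENT_SHUTDOWN_HOST)'

This should print "1 2", and the guest value matches the "event Shutdown detail 1" line in the vdsm.log excerpt above.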
(In reply to Milan Zamazal from comment #14)

Unfortunately, I think you are right about how it works, as I have the same understanding of it. Maybe someone will be able to explain the differences that were introduced when switching to the systemd service. But it would be most beneficial to know under what circumstances RHV restarts the HA VMs, so that we can try to find out what exactly it requires from libvirt (or maybe from something else, in case it was monitoring the host shutdown itself, for example).
Thank you, Martin, for the clarifications, confirmed also by Daniel on the users list (https://www.redhat.com/archives/libvirt-users/2020-August/msg00095.html). The regression in oVirt was apparently introduced when fixing bug 1783815 and switching to the information provided by libvirt instead of our former guesswork. I posted a patch that brings back our own check for host shutdown when libvirt reports a guest shutdown.

AFAIK oVirt restarts HA VMs by default, unless:
- the VM has been shut down from Engine explicitly (we can track that, it's under our control);
- or the VM has been shut down solely from within the guest, typically by a sysadmin action (we need external information to know whether the guest shutdown was initiated from within the guest or by something on the host).
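The idea behind the check can be illustrated in shell; this is only a sketch of the logic, not the actual Vdsm patch (which lives in Vdsm's Python code), and the systemctl-based test is just one possible way to detect a host shutdown in progress:

# illustration only: if libvirt reports a guest shutdown while the host
# itself is shutting down or rebooting, treat it as a host shutdown so
# that the HA VM is restarted later
if systemctl list-jobs | grep -qE 'shutdown\.target|reboot\.target'; then
    echo "host shutdown in progress -> report host shutdown exit reason"
else
    echo "genuine guest-initiated shutdown -> do not restart the HA VM"
fi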
It's not in any build yet.
Verified on:
ovirt-engine-4.4.3.3-0.19.el8ev.noarch
vdsm-4.40.30-1.el8ev.x86_64

Played with HA VMs with and without a lease; they were restarted after sending poweroff to the host.
This bugzilla is included in oVirt 4.4.3 release, published on November 10th 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.3 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.