Created attachment 1661969 [details]
logs

Description of problem:
If the host is rebooted (or powered off) by running the poweroff or reboot command, the HA VMs are not restarted.

Version-Release number of selected component (if applicable):
vdsm-4.40.2-1.el8ev.x86_64
ovirt-engine-4.4.0-0.19.master.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Run HA VMs on the host.
2. ssh to the host and run 'poweroff' (the same behavior occurs for reboot).

Actual results:
HA VMs are not restarted when the host is back.

Expected results:
HA VMs must be restarted.

Additional info:
engine, vdsm, qemu logs are attached
The mechanism of the reboot should not matter here. If the host was not put into maintenance first, there will be an existing lease on the disk, and we must wait at least 5 minutes for it to expire, even after the host is back up in Vdsm.
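As a side note, whether the host currently holds any sanlock lockspaces or resource leases can be checked directly on the host once it is back up (purely a diagnostic aid; VM leases are sanlock resources on the storage domain):

# sanlock client status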
clean "poweroff" should trigger libvirt-guests service which should shut down the VMs, I can see in the logs(even though their are just partial and engine log is missing) that it did just that. The only thing is that those shutdowns are clean, so no HA is started. Bug 1334982 should have changed exit reason so HA VMs are then restarted, but that was later rewritten(I952d8e590c5d40009b8ea119c584b268cb266adb) and it seems it may no longer work. Need correct logs from engine as well as debug level in vdsm to see why it didn't set the right shutdown reason
Created attachment 1662121 [details]
logs

Attached new logs (engine, vdsm (with debug), qemu, libvirt). In engine.log the test starts at:

2020-02-10 12:43:33,194+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [] EVENT_ID: USER_RUN_VM(32), VM high_4 started on Host host_mixed_1
2020-02-10 12:43:33,211+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [] EVENT_ID: USER_RUN_VM(32), VM high_2 started on Host host_mixed_1
2020-02-10 12:43:33,233+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [] EVENT_ID: USER_RUN_VM(32), VM high_1 started on Host host_mixed_1
2020-02-10 12:43:33,251+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [] EVENT_ID: USER_RUN_VM(32), VM high_3 started on Host host_mixed_1
The problem is clearly on the host. There is a message "host in shutdown waiting" in vdsm.log, so Vdsm is aware of the shutdown. However, the VMs exit with the wrong exit reason: "'exitMessage': 'User shut down from within the guest', 'exitReason': 7", with no apparent cause. When I try it on my setup, it works as expected and I get the right exit reason: "'exitMessage': 'VM has been terminated on the host', 'exitReason': 11". Identifying the right exit reason involves systemd and D-Bus. I suspect it could be a platform problem that might have been fixed in the meantime. I'd suggest trying again with an up-to-date system. If it still doesn't work and you have at least systemd-239-18 and dbus-1.12.8-9, we will have to inspect it on a live system.
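As a quick sanity check, the package versions mentioned above and any shutdown inhibitor locks registered on the host can be listed directly (purely diagnostic; whether Vdsm registers such a lock is not visible in the attached logs):

# rpm -q systemd dbus
# systemd-inhibit --list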
Polina, could you please check whether it reproduces on the latest version?
Created attachment 1711117 [details]
logs

Yes, I'm reproducing it on vdsm-4.40.22-1.el8ev.x86_64, ovirt-engine-4.4.1.10-0.1.el8ev.noarch, libvirt-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64. Attaching logs, but I can also provide the environment if required.

The logs are for the following scenario:
1. Run two HA VMs, golden_env_mixed_virtio_0 and golden_env_mixed_virtio_1, on host1 at Aug 11, 2020, 9:40:42 PM.
2. ssh to host 1 and reboot. Host host_mixed_1 is non-responsive at Aug 11, 2020, 9:46:55 PM.

When the host is back, the VMs are not started. Waited an hour.
I can reproduce it now, with:
- qemu-kvm-4.2.0-29.module+el8.2.1+7297+a825794d.x86_64
- systemd-239-31.el8_2.2.x86_64
- dbus-1.12.8-10.el8_2.x86_64
- kernel-4.18.0-193.14.3.el8_2.x86_64
- libvirt-daemon-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64

libvirt reports it as a guest shutdown rather than a host shutdown. Looking into the libvirt sources, this information comes directly from QEMU. There is nothing interesting in the QEMU log, just:

2020-08-20 09:32:55.013+0000: shutting down, reason=shutdown

Martin, do you have any idea why a guest shutdown can be reported for a VM when running poweroff on the host? It didn't happen before and there was no related change on our side AFAIK. Or any idea what to look at or who to ask for help?
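One way to capture the SHUTDOWN event exactly as QEMU reports it (including the guest/reason fields) during the next reproduction is to enable libvirt debug logging beforehand. A minimal example; the filter and output values are just one reasonable choice, not a required configuration. Add to /etc/libvirt/libvirtd.conf:

log_filters="1:qemu 1:libvirt"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"

and restart the daemon:

# systemctl restart libvirtd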
As Michal pointed out, check whether you have the libvirt-guests service activated. That service is intended to let users keep their machines (save their states, although that is configurable) across a shutdown of the host. I would suspect oVirt/RHV disables this service if it is managing the host. I am guessing machined would not be the one suspending the guests, since it looks like the shutdown is done from inside the guest (probably using qemu-ga).
libvirt-guests seems to be working normally:

# journalctl -b -1 -u libvirt-guests
-- Logs begin at Thu 2020-08-20 11:38:01 CEST, end at Thu 2020-08-20 16:25:07 CEST. --
Aug 20 16:02:39 rhel-8-host-1 systemd[1]: Starting Suspend/Resume Running libvirt Guests...
Aug 20 16:02:39 rhel-8-host-1 systemd[1]: Started Suspend/Resume Running libvirt Guests.
Aug 20 16:14:19 rhel-8-host-1 systemd[1]: Stopping Suspend/Resume Running libvirt Guests...
Aug 20 16:14:20 rhel-8-host-1 libvirt-guests.sh[3835]: Running guests on qemu+tls://rhel-8-host-1/system URI: centos
Aug 20 16:14:20 rhel-8-host-1 libvirt-guests.sh[3835]: Shutting down guests on qemu+tls://rhel-8-host-1/system URI...
Aug 20 16:14:20 rhel-8-host-1 libvirt-guests.sh[3835]: Starting shutdown on guest: centos
Aug 20 16:14:22 rhel-8-host-1 libvirt-guests.sh[3835]: Waiting for guest centos to shut down, 600 seconds left
Aug 20 16:14:28 rhel-8-host-1 libvirt-guests.sh[3835]: Shutdown of guest centos complete.
Aug 20 16:14:28 rhel-8-host-1 systemd[1]: Stopped Suspend/Resume Running libvirt Guests.

The changes to /etc/sysconfig/libvirt-guests made by Vdsm also look innocent to me:

URIS=qemu+tls://rhel-8-host-1/system
ON_BOOT=ignore
ON_SHUTDOWN=shutdown
PARALLEL_SHUTDOWN=0
SHUTDOWN_TIMEOUT=600

libvirt reports some errors during the shutdown process, but I guess they are normal, aren't they? (qemu-ga is running in the VM.)

# journalctl -b -1 -u 'libvirt*' --since='2020-08-20 16:14:20' --until='2020-08-20 16:14:29'
-- Logs begin at Thu 2020-08-20 11:38:01 CEST, end at Thu 2020-08-20 16:41:13 CEST. --
Aug 20 16:14:20 rhel-8-host-1 libvirt-guests.sh[3835]: Running guests on qemu+tls://rhel-8-host-1/system URI: centos
Aug 20 16:14:20 rhel-8-host-1 libvirt-guests.sh[3835]: Shutting down guests on qemu+tls://rhel-8-host-1/system URI...
Aug 20 16:14:20 rhel-8-host-1 libvirt-guests.sh[3835]: Starting shutdown on guest: centos
Aug 20 16:14:22 rhel-8-host-1 libvirt-guests.sh[3835]: Waiting for guest centos to shut down, 600 seconds left
Aug 20 16:14:23 rhel-8-host-1 libvirtd[1619]: Guest agent is not responding: QEMU guest agent is not connected
Aug 20 16:14:27 rhel-8-host-1 libvirtd[1619]: internal error: End of file from qemu monitor
Aug 20 16:14:28 rhel-8-host-1 libvirtd[1619]: internal error: child reported (status=125): internal error: End of file from qemu monitor
Aug 20 16:14:28 rhel-8-host-1 libvirtd[1619]: internal error: child reported (status=125): internal error: child reported (status=125): internal error: End of file from qemu monitor
Aug 20 16:14:28 rhel-8-host-1 libvirtd[1619]: Unable to remove disk metadata on vm centos from /rhev/data-center/mnt/centos-engine-44.standard:_export_data/d18ee680-fe75-42e3-b9c5-ea7a380ed371/images/9db5f863-b569-45b1-89d9-20cffc9642ff/66f6df4a-8837-4fd3-a283-6e2c1a9e6929 (disk target sda)
Aug 20 16:14:28 rhel-8-host-1 libvirtd[1619]: internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: End of file from qemu monitor
Aug 20 16:14:28 rhel-8-host-1 libvirtd[1619]: internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: End of file from qemu monitor
Aug 20 16:14:28 rhel-8-host-1 libvirtd[1619]: Unable to remove disk metadata on vm centos from /rhev/data-center/mnt/centos-engine-44.standard:_export_data/d18ee680-fe75-42e3-b9c5-ea7a380ed371/images/9db5f863-b569-45b1-89d9-20cffc9642ff/139dcd4d-2297-48a9-b7c2-3dcf50b9b01d (disk target sda)
Aug 20 16:14:28 rhel-8-host-1 libvirtd[1619]: internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: End of file from qemu monitor
Aug 20 16:14:28 rhel-8-host-1 libvirtd[1619]: internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: child reported (status=125): internal error: End of file from qemu monitor
Aug 20 16:14:28 rhel-8-host-1 libvirtd[1619]: Unable to remove disk metadata on vm centos from /rhev/data-center/mnt/centos-engine-44.standard:_export_data/d18ee680-fe75-42e3-b9c5-ea7a380ed371/images/9db5f863-b569-45b1-89d9-20cffc9642ff/fe055930-bd4c-4e46-82e6-66b574a01267 (disk target sda)
Aug 20 16:14:28 rhel-8-host-1 libvirt-guests.sh[3835]: Shutdown of guest centos complete.
Aug 20 16:14:28 rhel-8-host-1 systemd[1]: Stopped Suspend/Resume Running libvirt Guests.

I'm also attaching a libvirt debug log, from another run:

Aug 20 16:52:19 rhel-8-host-1 systemd[1]: Stopping Suspend/Resume Running libvirt Guests...
Aug 20 16:52:20 rhel-8-host-1 libvirt-guests.sh[7774]: Running guests on qemu+tls://rhel-8-host-1/system URI: centos
Aug 20 16:52:20 rhel-8-host-1 libvirt-guests.sh[7774]: Shutting down guests on qemu+tls://rhel-8-host-1/system URI...
Aug 20 16:52:20 rhel-8-host-1 libvirt-guests.sh[7774]: Starting shutdown on guest: centos
Aug 20 16:52:22 rhel-8-host-1 libvirt-guests.sh[7774]: Waiting for guest centos to shut down, 600 seconds left
Aug 20 16:52:27 rhel-8-host-1 libvirt-guests.sh[7774]: Waiting for guest centos to shut down, 595 seconds left
Aug 20 16:52:30 rhel-8-host-1 libvirt-guests.sh[7774]: Shutdown of guest centos complete.
Aug 20 16:52:30 rhel-8-host-1 systemd[1]: Stopped Suspend/Resume Running libvirt Guests.

The debug log looks more interesting; I'm not sure everything is all right there:

2020-08-20 14:52:20.861+0000: 7169: debug : qemuAgentCommand:1131 : Send command '{"execute":"guest-shutdown","arguments":{"mode":"powerdown"}}' for write, seconds = 60
...
2020-08-20 14:52:27.688+0000: 7164: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1597935147, "microseconds": 687708}, "event": "SHUTDOWN", "data": {"guest": true, "reason": "guest-shutdown"}}]
2020-08-20 14:52:27.688+0000: 7164: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7fc0b0002c50 event={"timestamp": {"seconds": 1597935147, "microseconds": 687708}, "event": "SHUTDOWN", "data": {"guest": true, "reason": "guest-shutdown"}}
...
2020-08-20 14:52:27.712+0000: 7164: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1597935147, "microseconds": 712137}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}}]
2020-08-20 14:52:27.712+0000: 7164: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7fc0b0002c50 event={"timestamp": {"seconds": 1597935147, "microseconds": 712137}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}}
...
2020-08-20 14:52:27.712+0000: 7164: debug : qemuProcessHandleShutdown:582 : Ignoring repeated SHUTDOWN event from domain centos

Is it OK that QEMU reports a guest shutdown first and then a host shutdown? libvirt reports only the first event (guest shutdown) to Vdsm, followed by a Stopped event:

2020-08-20 16:52:27,690+0200 DEBUG (libvirt/events) [virt.vm] (vmId='6f6b8826-eb89-41b8-ac1b-c0a93672cacc') event Shutdown detail 1 opaque None (vm:4914)
2020-08-20 16:52:30,348+0200 DEBUG (libvirt/events) [virt.vm] (vmId='6f6b8826-eb89-41b8-ac1b-c0a93672cacc') event Stopped detail 0 opaque None (vm:4914)

Martin, can you see anything wrong here?
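To double-check which lifecycle event and detail libvirt actually delivers, independently of Vdsm, the events can also be watched with virsh during another manual run of the shutdown (e.g. stopping libvirt-guests by hand); the domain name is the one from the logs above:

# virsh event centos --event lifecycle --loop --timestamp

Each lifecycle event is printed with its detail as it arrives, so the Shutdown detail reported for the guest can be compared with what Vdsm logs.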
Created attachment 1712044 [details]
libvirtd.log.xz
Well, clearly libvirt-guests is set up to shut down the machines on host shutdown, so it does that, and the default is to use the guest agent if possible and just an ACPI signal if it is not. That's why you see a clean shutdown of the machine from inside the guest. I would guess the errors are there because the guest does not reply before the shutdown takes place and we get disconnected from the socket; hopefully there's no more to it than that.
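For each running guest this is roughly equivalent to the following virsh call (an approximation of what libvirt-guests.sh does, not its exact invocation):

# virsh shutdown centos --mode agent,acpi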
(In reply to Martin Kletzander from comment #11)
> Well, clearly libvirt-guests is set up to shut down the machines on host
> shutdown, so it does that,

Yes, this is what should happen to the VMs running on the host.

> and the default is to use the guest agent if possible and just an ACPI
> signal if it is not. That's why you see a clean shutdown of the machine
> from inside the guest.

So libvirt-guests.sh calls `virsh shutdown', and from the libvirt point of view it's a shutdown from inside the guest, which libvirt reports as such to Vdsm (with or without qemu-ga). I wonder why it worked differently before. Under which circumstances would libvirt report a host shutdown then? And does libvirt provide a way to distinguish between a host-initiated shutdown and a (real) shutdown from within the guest, while still ensuring a clean VM shutdown on host shutdown?
(In reply to Milan Zamazal from comment #12)

Not sure, to be honest. I have spent some time away from the project, so I will have missed some changes, but it would be nice to know which version worked the way you expected. Could you also describe how the older behaviour was different? How did libvirt report the host shutdown? Was it in the "reason" field of the domain status?
Unfortunately, I can't remember the details. I think the libvirt VIR_DOMAIN_EVENT_SHUTDOWN event detail (VIR_DOMAIN_EVENT_SHUTDOWN_HOST vs. VIR_DOMAIN_EVENT_SHUTDOWN_GUEST) was used and worked correctly for me when I wrote Comment 4. I also think I had seen some systemd bug that might have caused it not to work correctly and that was handled around that time; this is why I wrote there "Identifying the right exit reason involves systemd and dbus." But I can't find it now. Looking at how it works, I wonder how it could have worked for me. Maybe there was something wrong in my environment that made it work by side effect, hard to say now. I'm not sure how it is supposed to work exactly; I'll ask on the users list for clarification.
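For reference, the numeric values of those event details can be checked with the libvirt Python bindings, assuming libvirt-python is installed on the host:

# python3 -c 'import libvirt; print(libvirt.VIR_DOMAIN_EVENT_SHUTDOWN_GUEST, libvirt.VIR_DOMAIN_EVENT_SHUTDOWN_HOST)'

This should print "1 2", and the guest value matches the "event Shutdown detail 1" line in the vdsm.log excerpt above.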
(In reply to Milan Zamazal from comment #14)

Unfortunately, I think you are right about how it works, as I have the same understanding of it. Maybe someone will be able to explain the differences that were introduced when switching to the systemd service. But it would be most beneficial to know under what circumstances RHV restarts the HA VMs, so that we can try to find out what exactly it requires from libvirt (or maybe from something else, in case it was monitoring the host shutdown itself, for example).
Thank you, Martin, for the clarifications, confirmed also by Daniel on the users list (https://www.redhat.com/archives/libvirt-users/2020-August/msg00095.html). The regression in oVirt was apparently introduced when fixing bug 1783815 and switching to the information provided by libvirt instead of our former guesswork. I posted a patch that brings back our own check for host shutdown when libvirt reports a guest shutdown.

AFAIK oVirt restarts HA VMs by default, unless:
- the VM has been shut down from Engine explicitly (we can track that, it's under our control);
- or the VM has been shut down solely from within the guest, typically by a sysadmin action (we need external information to know whether the guest shutdown was initiated from within the guest or by something on the host).
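The idea behind the check can be illustrated in shell; this is only a sketch of the logic, not the actual Vdsm patch (which lives in Vdsm's Python code), and the systemctl-based test is just one possible way to detect a host shutdown in progress:

# illustration only: if libvirt reports a guest shutdown while the host
# itself is shutting down or rebooting, treat it as a host shutdown so
# that the HA VM is restarted later
if systemctl list-jobs | grep -qE 'shutdown\.target|reboot\.target'; then
    echo "host shutdown in progress -> report host shutdown exit reason"
else
    echo "genuine guest-initiated shutdown -> do not restart the HA VM"
fi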
It's not in any build yet.
Verified on:
ovirt-engine-4.4.3.3-0.19.el8ev.noarch
vdsm-4.40.30-1.el8ev.x86_64

Played with HA VMs with and without a lease; they were restarted after sending poweroff to the host.
This bugzilla is included in oVirt 4.4.3 release, published on November 10th 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.3 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.