Bug 1783815 - HA did not work, vm killed by sanlock recovery was reported as shutdown from within the guest
Summary: HA did not work, vm killed by sanlock recovery was reported as shutdown from ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.3.5
Hardware: x86_64
OS: Linux
high
high
Target Milestone: ovirt-4.4.1
: ---
Assignee: Andrej Krejcir
QA Contact: Polina
URL:
Whiteboard:
Depends On:
Blocks: 1789090
TreeView+ depends on / blocked
 
Reported: 2019-12-16 01:07 UTC by Germano Veit Michel
Modified: 2020-08-04 13:27 UTC (History)
14 users (show)

Fixed In Version: rhv-4.4.0-29
Doc Type: Bug Fix
Doc Text:
Previously, if a virtual machine (VM) was forcibly shut down by SIGTERM, in some cases the VDSM did not handle the libvirt shutdown event that contained information about why the VM was shut down and evaluated it as if the guest had initiated a clean shutdown. The current release fixes this issue: VDSM handles the shutdown event, and the Manager restarts the high-availability VMs as expected.
Clone Of:
Environment:
Last Closed: 2020-08-04 13:27:22 UTC
oVirt Team: Virt
Target Upstream Version:
lsvaty: testing_plan_complete-


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4758291 0 None None None 2020-01-20 23:13:11 UTC
Red Hat Product Errata RHEA-2020:3246 0 None None None 2020-08-04 13:27:50 UTC
oVirt gerrit 106310 0 master MERGED vm: Extract methods for handling libvirt events 2021-02-15 18:01:39 UTC
oVirt gerrit 106311 0 master MERGED vm: Handle VIR_DOMAIN_EVENT_SHUTDOWN libvirt event 2021-02-15 18:01:39 UTC

Description Germano Veit Michel 2019-12-16 01:07:46 UTC
Description of problem:

1. Sanlock enters recovery, sends SIGTERM to VDSM (SDM) and some VMs.

2019-12-07 01:26:57 3318744 [15583]: s1 kill 4493 sig 15 count 1          (PID 4493 = VM 1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a)
2019-12-07 01:26:57 3318744 [15583]: s1 kill 39377 sig 15 count 1         (PID 39377 = VDSM)

2. VDSM is restarted:
2019-12-07 01:26:57,359+0100 INFO  (MainThread) [vds] Received signal 15, shutting down (vdsmd:70)

3. VM is down:
2019-12-07 01:26:58 3318745 [15583]: dead 4493 ci 6 count 2

4. VDSM recovering upon restart:
2019-12-07 01:27:51,119+0100 INFO  (vmrecovery) [vds] recovery [1:6/6]: recovered domain 1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a (recovery:123)
2019-12-07 01:27:55,799+0100 INFO  (vm/1b1a4aa9) [virt.vm] (vmId='1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a') VM wrapper has started (vm:2771)
2019-12-07 01:27:55,821+0100 INFO  (vm/1b1a4aa9) [virt.vm] (vmId='1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a') Starting connection (guestagent:256)
2019-12-07 01:27:55,822+0100 INFO  (vm/1b1a4aa9) [virt.vm] (vmId='1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a') CPU stopped: domain initialization (vm:6047)
2019-12-07 01:27:55,824+0100 ERROR (vm/1b1a4aa9) [virt.vm] (vmId='1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a') Unexpected VM state: 4 (reason 0) (vm:994)

VIR_DOMAIN_SHUTDOWN= 4
SUCCESS = 0

2019-12-07 01:28:02,471+0100 INFO  (libvirt/events) [virt.vm] (vmId='1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a') underlying process disconnected (vm:1066)
2019-12-07 01:28:02,471+0100 INFO  (libvirt/events) [virt.vm] (vmId='1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a') Release VM resources (vm:5168)
2019-12-07 01:28:02,471+0100 INFO  (libvirt/events) [virt.vm] (vmId='1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a') Stopping connection (guestagent:455)
2019-12-07 01:28:03,141+0100 INFO  (libvirt/events) [virt.vm] (vmId='1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a') Changed state to Down: User shut down from within the guest (code=7) (vm:1690)
2019-12-07 01:28:03,144+0100 INFO  (libvirt/events) [virt.vm] (vmId='1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a') Stopping connection (guestagent:455)
2019-12-07 01:28:03,152+0100 INFO  (jsonrpc/6) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:removed,38246, vmId=1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a (api:48)
2019-12-07 01:28:03,153+0100 INFO  (jsonrpc/6) [api.virt] FINISH destroy return={'status': {'message': 'Machine destroyed', 'code': 0}} from=::ffff:removed,38246, vmId=1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a (api:54)

5. Engine sees this as a graceful shutdown from within the Guest and does not trigger HA.

2019-12-07 01:28:03,151+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-14) [] VM '1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a' was reported as Down on VDS 'e6e9e8d3-8f96-41cf-a2ad-0eb607382a04'(vds_removed)
2019-12-07 01:28:03,154+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-14) [] VM '1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a'(vm_removed) moved from 'Up' --> 'Down'
2019-12-07 01:28:03,207+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-14) [] EVENT_ID: VM_DOWN(61), VM vm_removed is down. Exit message: User shut down from within the guest

6. HA VM is down, not restarted.

This happened to 2 VMs out of 6. The one above was dead after SIGTERM, and the other (80e622be-b7df-47c5-b15e-28398ed0c65f) didn't shut down in time, it was killed with a SIGKILL by sanlock a little later (PID 83444).


Version-Release number of selected component (if applicable):
vdsm-4.30.24-2.el7ev.x86_64
rhvm-4.3.5.5-0.1.el7.noarch

How reproducible:
Tried on the following a couple times but did not reproduce:
vdsm-4.30.38-1.el7ev.x86_64
ovirt-engine-4.3.7.2-1.el7.noarch
Looking at the code I'm not sure this was fixed, I think this requires specific timing to reproduce.

In theory one needs:
1. HA VM running on the SPM
2. Storage outage on Master (lose SDM resource)

Actual results:
HA VM down

Expected results:
HA VM restarted

Comment 5 Tal Nisan 2019-12-18 14:30:31 UTC
Nir, does this falls within Storage realm or Virt (fencing)?

Comment 6 Nir Soffer 2019-12-18 16:11:27 UTC
(In reply to Tal Nisan from comment #5)
> Nir, does this falls within Storage realm or Virt (fencing)?

This is vm life-cycle, so virt.

Storage give the fencing capability killing the vm. Libvirt should report
this vdsm as vm being killed, and vdsm should report the event to engine,
so it can restart the vm on another host.

Because vdsm was also terminated by sanlock, it may miss events from libvirt.
in this case it needs to be able to get the status of vms after vdsm starts up,
and report the status to engine.

Comment 7 Ryan Barry 2020-01-07 05:40:28 UTC
I'd ask why the host wasn't fenced when it lost a connection to storage (either fencing from infra or storage fencing), but for the HA part, we should behave better.

Andrej, any thoughts?

Comment 8 Andrej Krejcir 2020-01-08 10:30:23 UTC
When the qemu process receives SIGTERM, it shuts down cleanly, but the guest OS does not. Vdsm and the engine interpret this as shutdown from the guest, so HA VM is not restarted.

I think it would be better if HA VMs were restarted in this case.
I will look if there is a way to distinguish between a shutdown from the guset OS and qemu receiving SIGTERM.

Comment 9 Andrej Krejcir 2020-01-14 12:46:42 UTC
The 'Shutdown' event from libvirt has a parameter that specifies if the VM was shut down from guest or host.

https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainEventShutdownDetailType

It was implemented in Bug 1384007.


Currently, VDSM ignores this event.

Comment 11 Polina 2020-04-22 12:03:59 UTC
verified on http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-31

1. Create four HA VMs: Server with/without lease, Desktop with/without lease and run them on the same host.

2. enter the host and send kill to vdsm PID and to four qemu IDs.
[root@ocelot02 ~]# systemctl status vdsmd.service 
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) since Wed 2020-04-22 14:48:38 IDT; 905ms ago
  Process: 1023401 ExecStart=/usr/share/vdsm/daemonAdapter -0 /dev/null -1 /dev/null -2 /dev/null /usr/share/vdsm/vdsmd (code=exited, status=0/SUCCESS)
  Process: 1023333 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=exited, status=0/SUCCESS)
 Main PID: 1023401 (code=exited, status=0/SUCCESS)

Result: all are restarted

Comment 20 errata-xmlrpc 2020-08-04 13:27:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3246


Note You need to log in before you can comment on or make changes to this bug.