Bug 1783815

| Field | Value |
|---|---|
| Summary | HA did not work, vm killed by sanlock recovery was reported as shutdown from within the guest |
| Product | Red Hat Enterprise Virtualization Manager |
| Reporter | Germano Veit Michel <gveitmic> |
| Component | vdsm |
| Assignee | Andrej Krejcir <akrejcir> |
| Status | CLOSED ERRATA |
| QA Contact | Polina <pagranat> |
| Severity | high |
| Priority | high |
| Docs Contact | |
| Version | 4.3.5 |
| CC | ahadas, akrejcir, aperotti, dconsoli, jortialc, lsurette, mavital, mkalinin, nsoffer, pelauter, rdlugyhe, srevivo, tnisan, ycui |
| Target Milestone | ovirt-4.4.1 |
| Flags | lsvaty: testing_plan_complete- |
| Target Release | --- |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | |
| Fixed In Version | rhv-4.4.0-29 |
| Doc Type | Bug Fix |
| Doc Text | Previously, if a virtual machine (VM) was forcibly shut down by SIGTERM, in some cases VDSM did not handle the libvirt shutdown event that contained information about why the VM was shut down and evaluated it as if the guest had initiated a clean shutdown. The current release fixes this issue: VDSM handles the shutdown event, and the Manager restarts the high-availability VMs as expected. |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2020-08-04 13:27:22 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | Virt |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1789090 |
Description
Germano Veit Michel, 2019-12-16 01:07:46 UTC
Nir, does this fall within the Storage realm or Virt (fencing)?

(In reply to Tal Nisan from comment #5)
> Nir, does this fall within the Storage realm or Virt (fencing)?

This is VM life-cycle, so Virt. Storage provides the fencing capability by killing the VM. Libvirt should report to VDSM that the VM was killed, and VDSM should report the event to the engine, so it can restart the VM on another host. Because VDSM was also terminated by sanlock, it may miss events from libvirt; in that case it needs to be able to get the status of the VMs after VDSM starts up, and report the status to the engine.

I'd ask why the host wasn't fenced when it lost its connection to storage (either fencing from infra or storage fencing), but for the HA part, we should behave better. Andrej, any thoughts?

When the qemu process receives SIGTERM, it shuts down cleanly, but the guest OS does not. VDSM and the engine interpret this as a shutdown from the guest, so the HA VM is not restarted. I think it would be better if HA VMs were restarted in this case. I will check whether there is a way to distinguish between a shutdown from the guest OS and qemu receiving SIGTERM.

The 'Shutdown' event from libvirt has a parameter that specifies whether the VM was shut down from the guest or from the host:
https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainEventShutdownDetailType
It was implemented in Bug 1384007. Currently, VDSM ignores this event. (An illustrative sketch of handling this event detail appears at the end of this report.)

Verified on http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-31

1. Create four HA VMs (Server with/without lease, Desktop with/without lease) and run them on the same host.
2. Enter the host and send kill to the vdsm PID and to the four qemu PIDs.

```
[root@ocelot02 ~]# systemctl status vdsmd.service
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) since Wed 2020-04-22 14:48:38 IDT; 905ms ago
  Process: 1023401 ExecStart=/usr/share/vdsm/daemonAdapter -0 /dev/null -1 /dev/null -2 /dev/null /usr/share/vdsm/vdsmd (code=exited, status=0/SUCCESS)
  Process: 1023333 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=exited, status=0/SUCCESS)
 Main PID: 1023401 (code=exited, status=0/SUCCESS)
```

Result: all four VMs are restarted.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3246
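As background for the fix discussed above, here is a minimal illustrative sketch (not the actual VDSM code) of how a libvirt lifecycle-event callback can use the detail value of the Shutdown event (virDomainEventShutdownDetailType) to tell a guest-initiated shutdown apart from a host-initiated one, such as qemu receiving SIGTERM during sanlock recovery. The callback structure and connection handling are assumptions for demonstration only.

```python
# Illustrative sketch only: listen for libvirt lifecycle events and inspect the
# SHUTDOWN detail to distinguish a guest-initiated shutdown from a host-initiated
# one (e.g. qemu receiving SIGTERM). This is not VDSM's implementation.
import libvirt


def lifecycle_cb(conn, dom, event, detail, opaque):
    if event != libvirt.VIR_DOMAIN_EVENT_SHUTDOWN:
        return
    if detail == libvirt.VIR_DOMAIN_EVENT_SHUTDOWN_GUEST:
        # Clean shutdown initiated from inside the guest OS.
        print("%s: shutdown initiated by the guest" % dom.name())
    elif detail == libvirt.VIR_DOMAIN_EVENT_SHUTDOWN_HOST:
        # Shutdown requested from the host side; for an HA VM this is the case
        # where the engine should restart the VM instead of treating it as a
        # user-initiated shutdown.
        print("%s: shutdown requested from the host" % dom.name())


def main():
    # Register the default event loop implementation before opening the connection.
    libvirt.virEventRegisterDefaultImpl()
    conn = libvirt.open("qemu:///system")
    conn.domainEventRegisterAny(
        None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE, lifecycle_cb, None)
    while True:
        libvirt.virEventRunDefaultImpl()


if __name__ == "__main__":
    main()
```

Reacting to VIR_DOMAIN_EVENT_SHUTDOWN_HOST rather than treating every shutdown as guest-initiated is what allows the Manager to restart high-availability VMs on another host, as described in the Doc Text above.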