Description of problem:

1. Sanlock enters recovery and sends SIGTERM to VDSM (SDM) and some VMs:

2019-12-07 01:26:57 3318744 [15583]: s1 kill 4493 sig 15 count 1
2019-12-07 01:26:57 3318744 [15583]: s1 kill 39377 sig 15 count 1

(PID 4493 = VM 1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a, PID 39377 = VDSM)

2. VDSM is restarted:

2019-12-07 01:26:57,359+0100 INFO (MainThread) [vds] Received signal 15, shutting down (vdsmd:70)

3. VM is down:

2019-12-07 01:26:58 3318745 [15583]: dead 4493 ci 6 count 2

4. VDSM recovers upon restart:

2019-12-07 01:27:51,119+0100 INFO (vmrecovery) [vds] recovery [1:6/6]: recovered domain 1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a (recovery:123)
2019-12-07 01:27:55,799+0100 INFO (vm/1b1a4aa9) [virt.vm] (vmId='1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a') VM wrapper has started (vm:2771)
2019-12-07 01:27:55,821+0100 INFO (vm/1b1a4aa9) [virt.vm] (vmId='1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a') Starting connection (guestagent:256)
2019-12-07 01:27:55,822+0100 INFO (vm/1b1a4aa9) [virt.vm] (vmId='1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a') CPU stopped: domain initialization (vm:6047)
2019-12-07 01:27:55,824+0100 ERROR (vm/1b1a4aa9) [virt.vm] (vmId='1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a') Unexpected VM state: 4 (reason 0) (vm:994)

(VIR_DOMAIN_SHUTDOWN = 4, SUCCESS = 0)

2019-12-07 01:28:02,471+0100 INFO (libvirt/events) [virt.vm] (vmId='1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a') underlying process disconnected (vm:1066)
2019-12-07 01:28:02,471+0100 INFO (libvirt/events) [virt.vm] (vmId='1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a') Release VM resources (vm:5168)
2019-12-07 01:28:02,471+0100 INFO (libvirt/events) [virt.vm] (vmId='1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a') Stopping connection (guestagent:455)
2019-12-07 01:28:03,141+0100 INFO (libvirt/events) [virt.vm] (vmId='1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a') Changed state to Down: User shut down from within the guest (code=7) (vm:1690)
2019-12-07 01:28:03,144+0100 INFO (libvirt/events) [virt.vm] (vmId='1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a') Stopping connection (guestagent:455)
2019-12-07 01:28:03,152+0100 INFO (jsonrpc/6) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:removed,38246, vmId=1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a (api:48)
2019-12-07 01:28:03,153+0100 INFO (jsonrpc/6) [api.virt] FINISH destroy return={'status': {'message': 'Machine destroyed', 'code': 0}} from=::ffff:removed,38246, vmId=1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a (api:54)

5. The engine sees this as a graceful shutdown from within the guest and does not trigger HA:

2019-12-07 01:28:03,151+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-14) [] VM '1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a' was reported as Down on VDS 'e6e9e8d3-8f96-41cf-a2ad-0eb607382a04'(vds_removed)
2019-12-07 01:28:03,154+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-14) [] VM '1b1a4aa9-09aa-4f04-a27f-b15c82e0b32a'(vm_removed) moved from 'Up' --> 'Down'
2019-12-07 01:28:03,207+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-14) [] EVENT_ID: VM_DOWN(61), VM vm_removed is down. Exit message: User shut down from within the guest

6. HA VM is down, not restarted.

This happened to 2 VMs out of 6. The one above was dead after SIGTERM; the other (80e622be-b7df-47c5-b15e-28398ed0c65f) didn't shut down in time and was killed with SIGKILL by sanlock a little later (PID 83444).

Version-Release number of selected component (if applicable):
vdsm-4.30.24-2.el7ev.x86_64
rhvm-4.3.5.5-0.1.el7.noarch

How reproducible:
Tried a couple of times on the following but did not reproduce:
vdsm-4.30.38-1.el7ev.x86_64
ovirt-engine-4.3.7.2-1.el7.noarch

Looking at the code I'm not sure this was fixed; I think it requires specific timing to reproduce. In theory one needs:
1. HA VM running on the SPM
2. Storage outage on the Master domain (lose the SDM resource)

Actual results:
HA VM down

Expected results:
HA VM restarted
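When auditing an incident like this, the sanlock "kill" lines can be parsed mechanically to recover which PIDs sanlock signalled and with which signal. A minimal sketch (the regex and helper are illustrative, not part of sanlock or VDSM):

```python
import re

# Matches sanlock log lines such as:
#   "2019-12-07 01:26:57 3318744 [15583]: s1 kill 4493 sig 15 count 1"
# Capture groups: lockspace name, target PID, signal number.
KILL_RE = re.compile(r"(?P<lockspace>s\d+) kill (?P<pid>\d+) sig (?P<sig>\d+)")

def parse_sanlock_kills(lines):
    """Return a (lockspace, pid, signal) tuple for every kill line found."""
    kills = []
    for line in lines:
        m = KILL_RE.search(line)
        if m:
            kills.append((m.group("lockspace"),
                          int(m.group("pid")),
                          int(m.group("sig"))))
    return kills
```

Running this over the excerpt above yields the two SIGTERM targets (4493 and 39377); signal 9 entries would likewise surface the later SIGKILL case.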
Nir, does this fall within the Storage realm or Virt (fencing)?
(In reply to Tal Nisan from comment #5) > Nir, does this fall within the Storage realm or Virt (fencing)? This is VM life-cycle, so virt. Storage provides the fencing capability by killing the VM. Libvirt should report to VDSM that the VM was killed, and VDSM should report the event to the engine so it can restart the VM on another host. Because VDSM was also terminated by sanlock, it may miss events from libvirt. In that case it needs to be able to get the status of the VMs after VDSM starts up, and report that status to the engine.
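The recovery path described above could, as a sketch, re-check each domain's state after VDSM restarts and report anything already shut down to the engine. The constants below mirror libvirt's virDomainState enum (VIR_DOMAIN_SHUTDOWN = 4 is the "Unexpected VM state: 4" seen in the logs); the helper is hypothetical, not actual VDSM code:

```python
# Values mirror libvirt's virDomainState enum; with libvirt-python installed
# one would use libvirt.VIR_DOMAIN_SHUTDOWN etc. instead of redefining them.
VIR_DOMAIN_RUNNING = 1
VIR_DOMAIN_SHUTDOWN = 4   # domain is being shut down
VIR_DOMAIN_SHUTOFF = 5    # domain is fully off

def needs_down_report(state):
    """True if a recovered domain should be reported as Down to the engine.

    After a restart VDSM may have missed the libvirt lifecycle events, so
    it must derive the VM status from the current domain state instead of
    relying on events it never received.
    """
    return state in (VIR_DOMAIN_SHUTDOWN, VIR_DOMAIN_SHUTOFF)
```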
I'd ask why the host wasn't fenced when it lost its connection to storage (either fencing from infra or storage fencing), but on the HA side we should behave better. Andrej, any thoughts?
When the qemu process receives SIGTERM, it shuts down cleanly, but the guest OS does not. VDSM and the engine interpret this as a shutdown from the guest, so the HA VM is not restarted. I think it would be better if HA VMs were restarted in this case. I will look into whether there is a way to distinguish between a shutdown from the guest OS and qemu receiving SIGTERM.
The 'Shutdown' event from libvirt has a parameter that specifies if the VM was shut down from guest or host. https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainEventShutdownDetailType It was implemented in Bug 1384007. Currently, VDSM ignores this event.
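A minimal sketch of how that detail parameter could drive the HA decision. The constant values mirror libvirt's virDomainEventShutdownDetailType, and should_restart_ha_vm is a hypothetical helper, not existing VDSM code:

```python
# Values mirror libvirt's virDomainEventShutdownDetailType; with
# libvirt-python installed, use libvirt.VIR_DOMAIN_EVENT_SHUTDOWN_* instead.
VIR_DOMAIN_EVENT_SHUTDOWN_FINISHED = 0  # shutdown finished, origin unspecified
VIR_DOMAIN_EVENT_SHUTDOWN_GUEST = 1     # initiated from within the guest OS
VIR_DOMAIN_EVENT_SHUTDOWN_HOST = 2      # initiated host-side (e.g. SIGTERM to qemu)

def should_restart_ha_vm(detail):
    """True if an HA VM that emitted this Shutdown event should be restarted.

    Only a shutdown the guest itself initiated is a deliberate user action;
    a host-side shutdown (such as qemu being killed by sanlock) is an
    external failure, so HA policy should bring the VM back up.
    """
    return detail != VIR_DOMAIN_EVENT_SHUTDOWN_GUEST
```

With this distinction, the sanlock SIGTERM case in this bug would arrive as a host-detail Shutdown event rather than being conflated with "User shut down from within the guest".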
verified on http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-31

1. Create four HA VMs (Server with/without lease, Desktop with/without lease) and run them on the same host.
2. Enter the host and send kill to the VDSM PID and to the four qemu PIDs.

[root@ocelot02 ~]# systemctl status vdsmd.service
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) since Wed 2020-04-22 14:48:38 IDT; 905ms ago
  Process: 1023401 ExecStart=/usr/share/vdsm/daemonAdapter -0 /dev/null -1 /dev/null -2 /dev/null /usr/share/vdsm/vdsmd (code=exited, status=0/SUCCESS)
  Process: 1023333 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=exited, status=0/SUCCESS)
 Main PID: 1023401 (code=exited, status=0/SUCCESS)

Result: all VMs are restarted.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:3246