Bug 1275351 - [RFE][vdsm] [hosted-engine] In case the qemu process is terminated ungracefully by libvirt during VM migration, vdsm doesn't detect that correctly and it causes ovirt-ha-agent to think the VM had crashed during the migration
Status: NEW
Product: vdsm
Classification: oVirt
Component: RFEs
4.17.10
x86_64 Unspecified
high Severity high (vote)
: ---
: ---
Assigned To: bugs@ovirt.org
meital avital
: FutureFeature
Depends On: 1363728
Blocks:
Reported: 2015-10-26 11:44 EDT by Elad
Modified: 2017-01-18 05:23 EST (History)
10 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
mgoldboi: ovirt‑future?
ebenahar: planning_ack?
ebenahar: devel_ack?
ebenahar: testing_ack?


Attachments: None
Description Elad 2015-10-26 11:44:13 EDT
Description of problem:
Tried to migrate the hosted-engine VM and during the migration, the qemu process got terminated on signal 15 for some reason. 
This wasn't detected correctly by vdsm. This caused the hosted-engine HA agent to think that the VM had crashed during its migration and to report the host with score 0.

Version-Release number of selected component (if applicable):
vdsm-4.17.10-5.el7ev.noarch
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.1-1.el7ev.noarch


How reproducible:
Always

Steps to Reproduce:
On hosted-engine setup with more than 1 hosted-engine host:
1. Migrate the hosted-engine VM


Actual results:

The VM is terminated with signal 15:

2015-10-26 13:32:05.302+0000: shutting down
qemu: terminating on signal 15 from pid 25015

vdsm:

Thread-6431::INFO::2015-10-26 15:32:05,801::API::332::vds::(destroy) vmContainerLock acquired by vm 97a65117-7f21-4b8c-9bec-07e37c07e55c
Thread-6431::DEBUG::2015-10-26 15:32:05,801::vm::3789::virt.vm::(destroy) vmId=`97a65117-7f21-4b8c-9bec-07e37c07e55c`::destroy Called
Thread-6431::INFO::2015-10-26 15:32:05,801::vm::3721::virt.vm::(releaseVm) vmId=`97a65117-7f21-4b8c-9bec-07e37c07e55c`::Release VM resources
Thread-6431::WARNING::2015-10-26 15:32:05,802::vm::343::virt.vm::(_set_lastStatus) vmId=`97a65117-7f21-4b8c-9bec-07e37c07e55c`::trying to set state to Powering down when already Down
Thread-6431::INFO::2015-10-26 15:32:05,802::vm::3759::virt.vm::(_destroyVmGraceful) vmId=`97a65117-7f21-4b8c-9bec-07e37c07e55c`::VM '97a65117-7f21-4b8c-9bec-07e37c07e55c' already down and destroyed

agent.log:

MainThread::ERROR::2015-10-26 15:32:15,495::states::397::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm unexpectedly running on host 2


Hosted engine migration finishes successfully but the SRC host is now reported with score 0.

[root@green-vdsc mnt]# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : green-vdsa.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : 7ba319fb
Host timestamp                     : 94575


--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : green-vdsc.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : aefbc740
Host timestamp                     : 24637


Expected results:
vdsm should correctly detect that the VM went down as part of a successful migration when the qemu process is terminated during the migration, rather than reporting a crash.

Additional info:
/var/log/ from both hosts and engine.log:
http://file.tlv.redhat.com/ebenahar/bug4.tar.gz
Comment 1 Martin Sivák 2015-10-26 11:52:43 EDT
Who was pid 25015?
Comment 2 Elad 2015-10-26 11:56:35 EDT
(In reply to Martin Sivák from comment #1)
> Who was pid 25015?

root     25015  0.0  0.1 1225632 21628 ?       Ssl  Oct25   1:02 /usr/sbin/libvirtd --listen
Comment 3 Francesco Romani 2015-10-26 12:06:16 EDT
From the logs of host1, we can see the VM was successfully destroyed

Thread-6431::INFO::2015-10-26 15:32:05,809::logUtils::48::dispatcher::(wrapper) Run and protect: inappropriateDevices(thiefId=u'97a65117-7f21-4b8c-9bec-07e37c07e55c')
Thread-6431::INFO::2015-10-26 15:32:05,811::logUtils::51::dispatcher::(wrapper) Run and protect: inappropriateDevices, Return response: None
Thread-6431::DEBUG::2015-10-26 15:32:05,811::task::1191::Storage.TaskManager.Task::(prepare) Task=`92d99247-9284-4446-96f6-9cc1116bd4e3`::finished: None
Thread-6431::DEBUG::2015-10-26 15:32:05,811::task::595::Storage.TaskManager.Task::(_updateState) Task=`92d99247-9284-4446-96f6-9cc1116bd4e3`::moving from state preparing -> state finished
Thread-6431::DEBUG::2015-10-26 15:32:05,811::resourceManager::940::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-6431::DEBUG::2015-10-26 15:32:05,811::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-6431::DEBUG::2015-10-26 15:32:05,812::task::993::Storage.TaskManager.Task::(_decref) Task=`92d99247-9284-4446-96f6-9cc1116bd4e3`::ref 0 aborting False
Thread-6431::DEBUG::2015-10-26 15:32:05,812::vm::3784::virt.vm::(deleteVm) vmId=`97a65117-7f21-4b8c-9bec-07e37c07e55c`::Total desktops after destroy of 97a65117-7f21-4b8c-9bec-07e37c07e55c is 0
Thread-6431::DEBUG::2015-10-26 15:32:05,812::__init__::533::jsonrpc.JsonRpcServer::(_serveRequest) Return 'VM.destroy' in bridge with True

So the problem is elsewhere, because from now on the VM is *not* running anymore on host1. Further proof:

Thread-6433::DEBUG::2015-10-26 15:32:07,176::bindingxmlrpc::1257::vds::(wrapper) client [127.0.0.1]::call vmGetStats with ('97a65117-7f21-4b8c-9bec-07e37c07e55c',) {}
Thread-6433::DEBUG::2015-10-26 15:32:07,177::bindingxmlrpc::1264::vds::(wrapper) return vmGetStats with {'status': {'message': 'Virtual machine does not exist', 'code': 1}}
Comment 4 Martin Sivák 2015-10-26 12:13:19 EDT
Unfortunately there is no way to distinguish this from a real crash and restart scenario.

Is it possible to filter this report on the vdsm side? Because it is visible both to the engine and to hosted engine - audit log surely reports that and hosted engine penalizes the source host.
Comment 5 Doron Fediuck 2015-10-27 07:57:16 EDT
I'm still unsure there's a /real/ problem here.
If the qemu process is terminated ungracefully for whatever reason,
this means there's an issue on the source host. And if there's an
issue then the host should get a lower score, including 0 if needed.

Was the HE VM up at the end of the process?
Comment 6 Simone Tiraboschi 2015-10-27 08:17:40 EDT
The VM was correctly migrated from the source host (host 1) to the destination host (host 2), and ovirt-ha-agent correctly notices that the engine VM is running on the other host.

The issue is that the end of the VM migration on the source host (host 1) was not correctly detected (even though qemu was terminated gracefully): VDSM and/or ovirt-ha-agent reported that the VM had crashed on that host, so host 1 got a zero score even though the migration was actually successful and host 1 should still be at 3400 points.
Comment 7 Martin Sivák 2015-10-27 11:14:42 EDT
This is a race between the engine and the HE agent. It happens when the engine calls destroy before the agent gets data from the source VDSM: VDSM deletes the information about the source VM, and hosted engine does not get the migration-finished status properly.

There is currently no way to handle this; we might be able to leverage VDSM's "migration finished" event once the python library for jsonrpc is published.
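To illustrate the distinction the agent needs, here is a minimal sketch: when a VM reaches Down, a status record carrying a migration-finished exit message should not be treated as a crash. The field names ("status", "exitMessage") and the message text are illustrative assumptions for this sketch, not the exact vdsm stats schema.

```python
# Hypothetical exit message for a completed migration; the real vdsm
# string may differ -- this value is an assumption for the sketch.
MIGRATION_DONE_MSG = "Migration succeeded"


def classify_down_vm(vm_stats):
    """Classify a VM status record as running, migrated away, or crashed.

    vm_stats is a dict shaped loosely like a vdsm getStats response;
    the field names are illustrative, not the real schema.
    """
    if vm_stats.get("status") != "Down":
        return "running"
    if vm_stats.get("exitMessage") == MIGRATION_DONE_MSG:
        return "migrated"
    return "crashed"


# A Down status carrying the migration exit message is not a crash,
# so the HA agent would have no reason to zero the host score.
print(classify_down_vm({"status": "Down",
                        "exitMessage": "Migration succeeded"}))
```

With a rule like this, the source host in the report above would keep its 3400 score after a clean migration instead of being penalized to 0.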
Comment 8 Roy Golan 2015-11-22 03:43:55 EST
There is a "migration finished" event - it's the status transition from MigrationSrc -> Down. The engine uses the exact same transition.

The best fix would be to trigger the migration through the engine, but we are unsure how to store the engine password for API usage. So until we solve this, we can subscribe to the status change events emitted by vdsm. Naturally, what I don't like is that we are gradually copying the engine's VM monitoring logic.

Martin, Francesco: any estimate on subscribing to vdsm events?
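The MigrationSrc -> Down transition described above could be tracked on the agent side with a small watcher over successive status samples. This is a sketch only: the status strings and the per-VM polling shape are assumptions modeled on vdsm's lifecycle states, not vdsm's actual API.

```python
class MigrationFinishWatcher:
    """Tracks per-VM status transitions and flags a clean migration finish.

    The status strings ("Migration Source", "Down") are assumptions for
    this sketch; they are not taken verbatim from the vdsm schema.
    """

    def __init__(self):
        self._last = {}  # vm_id -> last observed status

    def observe(self, vm_id, status):
        """Record one status sample; return True exactly when the VM
        moves Migration Source -> Down, i.e. a migration finished."""
        prev = self._last.get(vm_id)
        self._last[vm_id] = status
        return prev == "Migration Source" and status == "Down"


watcher = MigrationFinishWatcher()
watcher.observe("vm-1", "Up")
watcher.observe("vm-1", "Migration Source")
print(watcher.observe("vm-1", "Down"))  # True: migration finished cleanly
```

A watcher like this only works if the agent sees the Migration Source sample before the engine's destroy wipes the VM record, which is exactly the race described in comment 7; event subscription avoids that window.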
Comment 9 Martin Sivák 2015-11-23 06:32:36 EST
We can't trigger the migration through the engine as the whole point of hosted engine is that it can recover a dead engine.

But we need the event.
Comment 10 Francesco Romani 2015-11-25 06:53:14 EST
We are considering adding migration events anyway to improve reporting when a migration aborts or times out (bz1154397)

On VDSM side the change is both pretty simple and pretty safe.

However, we need to set up a documentation schema for events, perhaps like QEMU does.

Will get in touch with infra people and report back. Can't provide an ETA at the moment.
Comment 11 Sandro Bonazzola 2016-05-02 06:01:42 EDT
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has already been released and the bug is not ON_QA.
Comment 12 Yaniv Lavi (Dary) 2016-05-23 09:17:23 EDT
oVirt 4.0 beta has been released, moving to RC milestone.
Comment 13 Yaniv Lavi (Dary) 2016-05-23 09:21:18 EDT
oVirt 4.0 beta has been released, moving to RC milestone.
