Bug 1275351

Summary: [RFE][vdsm] [hosted-engine] In case the qemu process is terminated ungracefully by libvirt during VM migration, vdsm doesn't detect that correctly and it causes ovirt-ha-agent to think the VM had crashed during the migration
Product: [oVirt] vdsm
Reporter: Elad <ebenahar>
Component: RFEs
Assignee: bugs <bugs>
Status: CLOSED WONTFIX
QA Contact: meital avital <mavital>
Severity: high
Docs Contact:
Priority: high
Version: 4.17.10
CC: bugs, dfediuck, ebenahar, fromani, mgoldboi, michal.skrivanek, msivak, rgolan, sbonazzo, stirabos, ylavi
Target Milestone: ---
Keywords: FutureFeature
Target Release: ---
Flags: mgoldboi: ovirt-future?, ebenahar: planning_ack?, ebenahar: devel_ack?, ebenahar: testing_ack?
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-06-06 12:28:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1363728
Bug Blocks:

Description Elad 2015-10-26 15:44:13 UTC
Description of problem:
Tried to migrate the hosted-engine VM; during the migration, the qemu process was terminated with signal 15 for some reason.
vdsm did not detect this correctly, which caused the hosted-engine HA agent to think that the VM had crashed during the migration and to report the host with a score of 0.

Version-Release number of selected component (if applicable):
vdsm-4.17.10-5.el7ev.noarch
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.1-1.el7ev.noarch


How reproducible:
Always

Steps to Reproduce:
On a hosted-engine setup with more than one hosted-engine host:
1. Migrate the hosted-engine VM


Actual results:

The VM is terminated with signal 15:

2015-10-26 13:32:05.302+0000: shutting down
qemu: terminating on signal 15 from pid 25015

vdsm:

Thread-6431::INFO::2015-10-26 15:32:05,801::API::332::vds::(destroy) vmContainerLock acquired by vm 97a65117-7f21-4b8c-9bec-07e37c07e55c
Thread-6431::DEBUG::2015-10-26 15:32:05,801::vm::3789::virt.vm::(destroy) vmId=`97a65117-7f21-4b8c-9bec-07e37c07e55c`::destroy Called
Thread-6431::INFO::2015-10-26 15:32:05,801::vm::3721::virt.vm::(releaseVm) vmId=`97a65117-7f21-4b8c-9bec-07e37c07e55c`::Release VM resources
Thread-6431::WARNING::2015-10-26 15:32:05,802::vm::343::virt.vm::(_set_lastStatus) vmId=`97a65117-7f21-4b8c-9bec-07e37c07e55c`::trying to set state to Powering down when already Down
Thread-6431::INFO::2015-10-26 15:32:05,802::vm::3759::virt.vm::(_destroyVmGraceful) vmId=`97a65117-7f21-4b8c-9bec-07e37c07e55c`::VM '97a65117-7f21-4b8c-9bec-07e37c07e55c' already down and destroyed

agent.log:

MainThread::ERROR::2015-10-26 15:32:15,495::states::397::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm unexpectedly running on host 2


The hosted-engine migration finishes successfully, but the source host is now reported with a score of 0.

[root@green-vdsc mnt]# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : green-vdsa.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : 7ba319fb
Host timestamp                     : 94575


--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : green-vdsc.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : aefbc740
Host timestamp                     : 24637


Expected results:
vdsm should correctly detect that the VM is down when it is terminated during the migration.

Additional info:
/var/log/ from both hosts and engine.log:
http://file.tlv.redhat.com/ebenahar/bug4.tar.gz

Comment 1 Martin Sivák 2015-10-26 15:52:43 UTC
Who was pid 25015?

Comment 2 Elad 2015-10-26 15:56:35 UTC
(In reply to Martin Sivák from comment #1)
> Who was pid 25015?

root     25015  0.0  0.1 1225632 21628 ?       Ssl  Oct25   1:02 /usr/sbin/libvirtd --listen

Comment 3 Francesco Romani 2015-10-26 16:06:16 UTC
From the logs of host 1, we can see the VM was successfully destroyed:

Thread-6431::INFO::2015-10-26 15:32:05,809::logUtils::48::dispatcher::(wrapper) Run and protect: inappropriateDevices(thiefId=u'97a65117-7f21-4b8c-9bec-07e37c07e55c')
Thread-6431::INFO::2015-10-26 15:32:05,811::logUtils::51::dispatcher::(wrapper) Run and protect: inappropriateDevices, Return response: None
Thread-6431::DEBUG::2015-10-26 15:32:05,811::task::1191::Storage.TaskManager.Task::(prepare) Task=`92d99247-9284-4446-96f6-9cc1116bd4e3`::finished: None
Thread-6431::DEBUG::2015-10-26 15:32:05,811::task::595::Storage.TaskManager.Task::(_updateState) Task=`92d99247-9284-4446-96f6-9cc1116bd4e3`::moving from state preparing -> state finished
Thread-6431::DEBUG::2015-10-26 15:32:05,811::resourceManager::940::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-6431::DEBUG::2015-10-26 15:32:05,811::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-6431::DEBUG::2015-10-26 15:32:05,812::task::993::Storage.TaskManager.Task::(_decref) Task=`92d99247-9284-4446-96f6-9cc1116bd4e3`::ref 0 aborting False
Thread-6431::DEBUG::2015-10-26 15:32:05,812::vm::3784::virt.vm::(deleteVm) vmId=`97a65117-7f21-4b8c-9bec-07e37c07e55c`::Total desktops after destroy of 97a65117-7f21-4b8c-9bec-07e37c07e55c is 0
Thread-6431::DEBUG::2015-10-26 15:32:05,812::__init__::533::jsonrpc.JsonRpcServer::(_serveRequest) Return 'VM.destroy' in bridge with True

So the problem is elsewhere: from this point on, the VM is *not* running on host 1 anymore. Further proof:

Thread-6433::DEBUG::2015-10-26 15:32:07,176::bindingxmlrpc::1257::vds::(wrapper) client [127.0.0.1]::call vmGetStats with ('97a65117-7f21-4b8c-9bec-07e37c07e55c',) {}
Thread-6433::DEBUG::2015-10-26 15:32:07,177::bindingxmlrpc::1264::vds::(wrapper) return vmGetStats with {'status': {'message': 'Virtual machine does not exist', 'code': 1}}

Comment 4 Martin Sivák 2015-10-26 16:13:19 UTC
Unfortunately there is no way to distinguish this from a real crash-and-restart scenario.

Is it possible to filter this report on the vdsm side? It is visible both to the engine and to hosted engine: the audit log surely reports it, and hosted engine penalizes the source host.

Comment 5 Doron Fediuck 2015-10-27 11:57:16 UTC
I'm still unsure there's a /real/ problem here.
If the qemu process is terminated ungracefully for whatever reason,
this means there's an issue on the source host. And if there's an
issue then the host should get a lower score, including 0 if needed.

Was the HE VM up at the end of the process?

Comment 6 Simone Tiraboschi 2015-10-27 12:17:40 UTC
The VM was correctly migrated from the source host (host 1) to the destination host (host 2), and ovirt-ha-agent correctly notices that the engine VM is running on the other host.

The issue is that the end of the VM migration was not correctly detected on the source host (host 1): although qemu was terminated gracefully, VDSM and/or ovirt-ha-agent reported that the VM had crashed on that host, so host 1 gets a score of zero even though the migration was actually successful and host 1 should still be at 3400 points.

Comment 7 Martin Sivák 2015-10-27 15:14:42 UTC
This is a race between the engine and the HE agent. It happens when the engine calls destroy before the agent gets data from the source VDSM: VDSM deletes the information about the source VM, and hosted engine never receives the migration-finished status properly.

There is currently no way to handle this; we might be able to leverage VDSM's "migration finished" event once the python library for jsonrpc is published.
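
A minimal sketch of that race, with made-up helper names (engine_destroy, agent_poll) rather than the real vdsm/ovirt-ha-agent code; the "Virtual machine does not exist" reply is the one vmGetStats returns in comment 3:

vdsm_vms = {
    "97a65117-7f21-4b8c-9bec-07e37c07e55c": {"status": "MigrationSrc"},
}

def engine_destroy(vm_id):
    # The engine calls VM.destroy on the source once the migration completes,
    # which removes the VM record from vdsm.
    vdsm_vms.pop(vm_id, None)

def agent_poll(vm_id):
    # The HA agent polls vmGetStats; if the destroy already happened it only
    # sees "Virtual machine does not exist" and cannot tell a clean hand-off
    # from a crash, so the source host gets penalized.
    if vm_id not in vdsm_vms:
        return {"status": {"code": 1, "message": "Virtual machine does not exist"}}
    return {"status": {"code": 0}, "vmStatus": vdsm_vms[vm_id]["status"]}

vm_id = "97a65117-7f21-4b8c-9bec-07e37c07e55c"
engine_destroy(vm_id)     # the engine wins the race
print(agent_poll(vm_id))  # the agent only sees that the VM is gone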

Comment 8 Roy Golan 2015-11-22 08:43:55 UTC
There is a "migration finished" event - its the status transition from MigrationSrc -> Down. the engine uses the exact same transition. 

The best fix will be to trigger migration through the engine but we unsure how to store the engine password for api usages. So till we solve this we can subscribe to the status change events sourced by vdsm. Naturally what I don't like is that we gradually copying the engine logic of Vms Monitoring.

Martin, Francesco any estimation on subscribing to vdsm events?
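
A minimal sketch of that transition check, under the assumption that the agent could keep the last status it saw for the engine VM and receive the new status from a vdsm status-change event; classify_transition is a hypothetical helper, not existing agent code:

def classify_transition(previous_status, current_status):
    # Decide how the source host should treat a VM status change.
    if current_status != "Down":
        return "no-action"
    if previous_status == "MigrationSrc":
        # The same transition the engine treats as "migration finished":
        # the hand-off was clean, so keep the host score at 3400.
        return "migration-finished"
    # Going Down without passing through MigrationSrc looks like a crash.
    return "vm-crashed"

print(classify_transition("MigrationSrc", "Down"))  # migration-finished
print(classify_transition("Up", "Down"))            # vm-crashed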

Comment 9 Martin Sivák 2015-11-23 11:32:36 UTC
We can't trigger the migration through the engine as the whole point of hosted engine is that it can recover a dead engine.

But we need the event.

Comment 10 Francesco Romani 2015-11-25 11:53:14 UTC
We are considering adding migration events anyway, to improve reporting when a migration aborts or times out (bz1154397).

On the VDSM side the change is both pretty simple and pretty safe.

However, we need to set up some documentation schema for events, maybe like QEMU does.

Will get in touch with the infra people and report back. I can't provide an ETA at the moment.
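
As a rough illustration of what such a documentation schema could look like (loosely modelled on how QEMU documents its QMP events), the event names and fields below are invented for the example and are not an agreed vdsm schema:

MIGRATION_EVENT_DOCS = {
    "VM_MIGRATION_COMPLETED": {
        "description": "Emitted by the source host when a live migration "
                       "finishes and the VM is handed over to the destination.",
        "data": {"vmId": "str", "destinationHost": "str"},
    },
    "VM_MIGRATION_FAILED": {
        "description": "Emitted when a migration aborts or times out "
                       "(the bz1154397 cases); the VM keeps running on the source.",
        "data": {"vmId": "str", "error": "str"},
    },
}

for name, spec in MIGRATION_EVENT_DOCS.items():
    print(name + ": " + spec["description"])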

Comment 11 Sandro Bonazzola 2016-05-02 10:01:42 UTC
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has already been released and the bug is not ON_QA.

Comment 12 Yaniv Lavi 2016-05-23 13:17:23 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 13 Yaniv Lavi 2016-05-23 13:21:18 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 14 Yaniv Lavi 2018-06-06 12:28:27 UTC
Closing old RFEs. Please reopen if needed.
Patches are welcome.