Bug 1275351
Summary: | [RFE][vdsm] [hosted-engine] In case the qemu process is terminated ungracefully by libvirt during VM migration, vdsm doesn't detect that correctly and it causes ovirt-ha-agent to think the VM had crashed during the migration | |
---|---|---|---
Product: | [oVirt] vdsm | Reporter: | Elad <ebenahar>
Component: | RFEs | Assignee: | bugs <bugs>
Status: | CLOSED WONTFIX | QA Contact: | meital avital <mavital>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.17.10 | CC: | bugs, dfediuck, ebenahar, fromani, mgoldboi, michal.skrivanek, msivak, rgolan, sbonazzo, stirabos, ylavi
Target Milestone: | --- | Keywords: | FutureFeature
Target Release: | --- | Flags: | mgoldboi: ovirt-future? ebenahar: planning_ack? ebenahar: devel_ack? ebenahar: testing_ack?
Hardware: | x86_64 | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-06-06 12:28:27 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | SLA | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1363728 | |
Bug Blocks: | | |
Description
Elad 2015-10-26 15:44:13 UTC
Who was pid 25015?

(In reply to Martin Sivák from comment #1)
> Who was pid 25015?

    root 25015 0.0 0.1 1225632 21628 ? Ssl Oct25 1:02 /usr/sbin/libvirtd --listen

From the logs of host1, we can see the VM was successfully destroyed:

    Thread-6431::INFO::2015-10-26 15:32:05,809::logUtils::48::dispatcher::(wrapper) Run and protect: inappropriateDevices(thiefId=u'97a65117-7f21-4b8c-9bec-07e37c07e55c')
    Thread-6431::INFO::2015-10-26 15:32:05,811::logUtils::51::dispatcher::(wrapper) Run and protect: inappropriateDevices, Return response: None
    Thread-6431::DEBUG::2015-10-26 15:32:05,811::task::1191::Storage.TaskManager.Task::(prepare) Task=`92d99247-9284-4446-96f6-9cc1116bd4e3`::finished: None
    Thread-6431::DEBUG::2015-10-26 15:32:05,811::task::595::Storage.TaskManager.Task::(_updateState) Task=`92d99247-9284-4446-96f6-9cc1116bd4e3`::moving from state preparing -> state finished
    Thread-6431::DEBUG::2015-10-26 15:32:05,811::resourceManager::940::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
    Thread-6431::DEBUG::2015-10-26 15:32:05,811::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
    Thread-6431::DEBUG::2015-10-26 15:32:05,812::task::993::Storage.TaskManager.Task::(_decref) Task=`92d99247-9284-4446-96f6-9cc1116bd4e3`::ref 0 aborting False
    Thread-6431::DEBUG::2015-10-26 15:32:05,812::vm::3784::virt.vm::(deleteVm) vmId=`97a65117-7f21-4b8c-9bec-07e37c07e55c`::Total desktops after destroy of 97a65117-7f21-4b8c-9bec-07e37c07e55c is 0
    Thread-6431::DEBUG::2015-10-26 15:32:05,812::__init__::533::jsonrpc.JsonRpcServer::(_serveRequest) Return 'VM.destroy' in bridge with True

So the problem is elsewhere, because from now on the VM is *not* running anymore on host1. Further proof:

    Thread-6433::DEBUG::2015-10-26 15:32:07,176::bindingxmlrpc::1257::vds::(wrapper) client [127.0.0.1]::call vmGetStats with ('97a65117-7f21-4b8c-9bec-07e37c07e55c',) {}
    Thread-6433::DEBUG::2015-10-26 15:32:07,177::bindingxmlrpc::1264::vds::(wrapper) return vmGetStats with {'status': {'message': 'Virtual machine does not exist', 'code': 1}}

Unfortunately there is no way to distinguish this from a real crash-and-restart scenario. Is it possible to filter this report on the vdsm side? It is visible both to the engine and to hosted engine: the audit log surely reports it and hosted engine penalizes the source host.

I'm still unsure there's a /real/ problem here. If the qemu process is terminated ungracefully for whatever reason, this means there's an issue on the source host. And if there's an issue, then the host should get a lower score, including 0 if needed. Was the HE VM up at the end of the process?

The VM was correctly migrated from the source host (host 1) to the destination host (host 2), and ovirt-ha-agent correctly notices that the engine VM is running on the other host. The issue is that the end of the VM migration from the source host (host 1) was not correctly detected (when qemu was gracefully terminated), and VDSM and/or ovirt-ha-agent reported that the VM crashed on that host. Host 1 therefore gets a zero score even though the migration was actually successful, so host 1 should still be at 3400 points.

This is a race between the engine and the HE agent. It happens when the engine calls destroy before the agent gets data from the source VDSM: VDSM deletes the information about the source VM, and hosted engine does not get the migration-finished status properly.
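To make the ambiguity concrete, here is a minimal sketch of how the agent could, in principle, tell a completed migration apart from a crash if the source VM's final stats were still available when it polls. It is not the actual ovirt-ha-agent code: the field names follow what VDSM reports for a Down VM ('status', 'exitCode', 'exitReason', 'exitMessage'), but the numeric exit-reason values used here are assumptions.

```python
# Hypothetical classification of a VM that is no longer running on the
# source host. If the engine already destroyed the VM record (stats is
# None), the agent cannot distinguish a finished migration from a crash,
# which is exactly the race described above.

MIGRATION_SUCCEEDED = 4   # assumed exit-reason value for a clean migration
EXIT_CODE_NORMAL = 0      # assumed exit code for a normal shutdown


def classify_vm_exit(stats):
    """Return 'migrated', 'crashed' or 'unknown' for a stopped VM."""
    if stats is None:
        # Record already cleaned up on the source host: no way to tell.
        return 'unknown'
    if stats.get('status') != 'Down':
        return 'unknown'
    if (stats.get('exitCode') == EXIT_CODE_NORMAL
            and stats.get('exitReason') == MIGRATION_SUCCEEDED):
        return 'migrated'
    return 'crashed'
```

Under this sketch the source host would only be penalized when the exit reason does not indicate a completed migration, instead of whenever the VM simply disappears.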
There is currently no way to handle this; we might be able to leverage VDSM's "migration finished" event once the python library for jsonrpc is published.

There is a "migration finished" event: it's the status transition from MigrationSrc -> Down. The engine uses the exact same transition. The best fix would be to trigger the migration through the engine, but we are unsure how to store the engine password for API usage. Until we solve this, we can subscribe to the status-change events sourced by vdsm. Naturally, what I don't like is that we are gradually copying the engine's VM monitoring logic. Martin, Francesco, any estimation on subscribing to vdsm events?

We can't trigger the migration through the engine, as the whole point of hosted engine is that it can recover a dead engine. But we need the event.

We are considering adding migration events anyway, to improve reporting when a migration aborts or times out (bz1154397). On the VDSM side the change is both pretty simple and pretty safe. However, we need to set up some documentation schema about events, maybe like QEMU does. Will get in touch with infra people and report back. Can't provide an ETA at the moment.

Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has already been released and the bug is not ON_QA.

oVirt 4.0 beta has been released, moving to RC milestone.

oVirt 4.0 beta has been released, moving to RC milestone.

Closing old RFEs. Please reopen if needed. Patches are welcome.
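For illustration, the event-subscription idea discussed above could look roughly like the sketch below. The subscription mechanism shown (an EventClient, the "|virt|VM_status|<vmUUID>" event name, and the payload layout) is an assumption made for this example; the real VDSM jsonrpc event client API may differ.

```python
# Minimal sketch of reacting to VDSM's VM status-change events instead of
# polling. Names marked as assumptions do not come from this bug report.

HE_VM_UUID = '97a65117-7f21-4b8c-9bec-07e37c07e55c'


def on_vm_status_event(event_name, params):
    # Assumption: the event payload maps the VM UUID to its new state dict.
    vm = params.get(HE_VM_UUID, {})
    if vm.get('status') == 'Down':
        # A Down event following MigrationSource would mean the migration
        # completed, so the source host should keep its normal score.
        handle_source_vm_down(vm)


def handle_source_vm_down(vm):
    print('source VM is down, exit reason: %s' % vm.get('exitReason'))


# Hypothetical registration with an event client subscribed to VDSM's
# event queue (the client class and subscribe() call are assumptions):
# client = EventClient('localhost', 54321)
# client.subscribe('|virt|VM_status|' + HE_VM_UUID, on_vm_status_event)
```

An event-driven approach like this would remove the polling race, since the agent would receive the final source-side state transition even if the engine destroys the VM record shortly afterwards.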