Bug 1382578 - Periodic functions may continue running after VM is down.
Summary: Periodic functions may continue running after VM is down.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.6.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.1.0-alpha
Assignee: Francesco Romani
QA Contact: sefi litmanovich
URL:
Whiteboard:
Depends On:
Blocks: 1391506
 
Reported: 2016-10-07 05:17 UTC by Germano Veit Michel
Modified: 2020-03-11 15:19 UTC
15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, if a VM shutdown was too slow, the state of the VM could be misreported as unresponsive, even though the VM was operating correctly, albeit too slowly. This was caused by a too-aggressive check on startup and shutdown. This patch takes into account slowdowns in startup and shutdown, avoiding false positive reports.
Clone Of:
: 1391505 1391506
Environment:
Last Closed: 2017-04-25 00:48:49 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2749551 0 None None None 2016-11-04 05:47:24 UTC
Red Hat Product Errata RHEA-2017:0998 0 normal SHIPPED_LIVE VDSM bug fix and enhancement update 4.1 GA 2017-04-18 20:11:39 UTC
oVirt gerrit 65814 0 None MERGED vm: introduce a `monitorable' attribute 2020-03-30 09:50:36 UTC

Description Germano Veit Michel 2016-10-07 05:17:46 UTC
Description of problem:

After VM.shutdown(), periodic tasks may continue running after the VM is actually down (qemu process gone). The stats data goes stale and the VM may be reported as Unresponsive by the engine.

In the Engine, instead of:
'Powering Down' -> 'Down'

it is seen as:
'Powering Down' -> 'Not responding' -> 'Down'

In VDSM, what happens is:
'Powering Down' -> 'qemu dead' -> 'monitor unresponsive|cannot run periodic' -> 'Down'

I regularly see cases (in my reproducer and in the customer's logs) where the periodic tasks and getRunningVmStats keep running for a short time after qemu has exited, but they usually stop within 1-2 seconds. However, in a few cases in the customer's logs the tasks continue to run for much longer, until the destroy is issued, so the problem can become quite noticeable.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Virtualization Hypervisor release 7.2 (20160920.1.el7ev)
rhevm-3.6.8.1-0.1.el6.noarch

How reproducible: sometimes.

In my reproducer the VM goes 'Powering Down' -> 'Unresponsive' -> 'Down' in about 25% of the shutdowns. It always transitions to 'Down' quite quickly, remaining in the 'Not responding' state only for a very short time. However, in high-load environments this may look worse.

* I think it gets easier to reproduce when I increase the number of workers, but that might just be an impression.

Steps to Reproduce:
1. Start VM
2. Graceful Shutdown from the Engine (agent installed - desktopShutdown)

Actual results:
The VM/qemu is actually down, but the periodic tasks continue running and the VM is reported as unresponsive for a while.

Expected results:
The VM shuts down cleanly and the periodic functions stop (or are ignored). The VM goes to the Down state without passing through 'Not Responding'.

Additional info:

VM XXX = f46242aa-63d5-41ad-92d6-290341cc7592

2016-10-03 12:00:04.531+08 | VM shutdown initiated by admin@internal on VM XXX (Host: YYY) (Reason: No reason was returned for this operation failure. See logs for further details.).

2016-10-03 12:00:44.173+08 | VM XXX is not responding.

VDSM side:

jsonrpc.Executor/6::DEBUG::2016-10-03 04:00:04,486::__init__::532::jsonrpc.JsonRpcServer::(_handle_request) Calling 'VM.shutdown' in bridge with {u'delay': u'30', u'message': u'System Administrator has initiated shutdown of this Virtual Machine. Virtual Machine is shutting down.', u'vmID': u'f46242aa-63d5-41ad-92d6-290341cc7592'}

libvirtEventLoop::DEBUG::2016-10-03 04:00:43,409::vm::4648::virt.vm::(onLibvirtLifecycleEvent) vmId=`f46242aa-63d5-41ad-92d6-290341cc7592`::event Stopped detail 0 opaque None

libvirtEventLoop::INFO::2016-10-03 04:00:43,409::vm::791::virt.vm::(_onQemuDeath) vmId=`f46242aa-63d5-41ad-92d6-290341cc7592`::underlying process disconnected

libvirtEventLoop::INFO::2016-10-03 04:00:43,410::vm::3882::virt.vm::(releaseVm) vmId=`f46242aa-63d5-41ad-92d6-290341cc7592`::Release VM resources

At this point any vmstats/vm_sample call is useless, but the calls are still being made.

Two seconds later:

periodic/3889::WARNING::2016-10-03 04:00:45,002::periodic::268::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.DriveWatermarkMonitor'> on [u'f46242aa-63d5-41ad-92d6-290341cc7592']

Thread-1472923::WARNING::2016-10-03 04:00:45,843::vm::5177::virt.vm::(_setUnresponsiveIfTimeout) vmId=`f46242aa-63d5-41ad-92d6-290341cc7592`::monitor become unresponsive (command timeout, age=4513373.21)

periodic/3889::WARNING::2016-10-03 04:00:46,646::periodic::302::virt.vm::(__call__) vmId=`f46242aa-63d5-41ad-92d6-290341cc7592`::could not run on f46242aa-63d5-41ad-92d6-290341cc7592: domain not connected

So the VM may look 'Not Responding' to the engine until the Down event is sent.
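
For context, a minimal sketch of the kind of stats-age check behind this transition (illustrative only, not the actual vdsm code; the constant and attribute names here are assumptions):

import time

STATS_AGE_TIMEOUT = 60  # seconds of stale stats tolerated (assumed value)

class VmStub:
    def __init__(self, last_stats_timestamp):
        self.last_stats_timestamp = last_stats_timestamp
        self.monitor_responsive = True

def set_unresponsive_if_timeout(vm, now=None):
    # Once qemu is gone the periodic sampling stops refreshing the
    # timestamp, so the age keeps growing until the threshold is crossed
    # and the VM gets reported as 'Not responding'.
    now = time.time() if now is None else now
    age = now - vm.last_stats_timestamp
    if age > STATS_AGE_TIMEOUT:
        vm.monitor_responsive = False
    return age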

Comment 2 Germano Veit Michel 2016-10-07 07:19:33 UTC
This doesn't look right, does it?

age=4513373.21

Could the _vm_last_timestamp value have been lost/zeroed at this point?

Comment 3 Dan Kenigsberg 2016-10-07 10:35:07 UTC
Michael, don't you already have a very similar bz for this (during startup)?
https://bugzilla.redhat.com/show_bug.cgi?id=1314371#c21

Comment 4 Michael Burman 2016-10-09 05:17:05 UTC
(In reply to Dan Kenigsberg from comment #3)
> Michael, don't you already have a very similar bz for this (during startup)?
> https://bugzilla.redhat.com/show_bug.cgi?id=1314371#c21

No, I didn't end up reporting such a bug. But this is definitely happening from time to time on startup.

Comment 6 Tomas Jelinek 2016-10-11 07:08:18 UTC
@Milan, any thoughts?

Comment 7 Michal Skrivanek 2016-10-12 12:54:11 UTC
(In reply to Germano Veit Michel from comment #2)
> This doesn't look right, does it?
> 
> age=4513373.21
> 
> Could _vm_last_timestamp value been lost/zeroed at this time?

Indeed it doesn't look right. Normally during shutdown such delayed queries are tolerated for up to 60s; even here, in this loaded environment, it's pretty quick, and statsAge should be a few seconds at most. This large value indicates that a particular stats call has not been working for 52 days. But then I don't understand from the code how it could have worked before the shutdown; it should have been Not Responding the whole time. Unless we indeed mess up the samples on shutdown.

adding Francesco as well

Comment 8 Michal Skrivanek 2016-10-12 12:57:56 UTC
If those ~52 days correspond to the vdsm uptime, then it would indeed mean the samples have been cleared out (as they are on shutdown), returning a statsAge of "current time - start time" by mistake.
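
A minimal sketch of that suspected arithmetic (hypothetical names, not the actual sampling code): once the sample cache is emptied, the age computation degrades to "now - start time", which for a vdsm running ~52 days gives exactly this kind of value:

import time

def stats_age(last_sample_timestamp, process_start_time, now=None):
    # Hypothetical fallback: with the sample cache cleared there is no
    # last sample, so the age degrades to "now - start time".
    now = time.time() if now is None else now
    if last_sample_timestamp is None:
        return now - process_start_time
    return now - last_sample_timestamp

# 4513373.21 seconds is roughly 52 days:
print(4513373.21 / 86400.0)  # ~52.2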

Comment 9 Francesco Romani 2016-10-12 13:41:34 UTC
(In reply to Michal Skrivanek from comment #7)
> (In reply to Germano Veit Michel from comment #2)
> > This doesn't look right, does it?
> > 
> > age=4513373.21
> > 
> > Could _vm_last_timestamp value been lost/zeroed at this time?
> 
> indeed it doesn't look right. Normally during shutdown such delayed queries
> are tolerated up to 60s, even here in this loaded environment it's pretty
> quick, and statsAge would be few secs at most...this large value indicates
> that a particular stat call is not working for 52 days. But then I don't
> understand form the code how could it work before shutdown...it should have
> been Not Responding for the whole time. Unless we indeed mess up the samples
> on shutdown.
> 
> adding Francesco as well

It is correct. There is a race on shutdown (and also on startup), because those three flows are all asynchronous with each other:
1. vm creation/powerup
2. vm powerdown
3. periodic monitoring

Here, we clear the stats cache when the VM is powering down, but the VM will stay around until the Engine destroys it. If this takes time, the VM will be reported as (Powering) Down and Unresponsive, which is indeed a bug.

The fix is to not consider the stats age for powering-down and down VMs.
I'll figure out the proper way to implement that.

This fix should end up on 4.0.z (z==6?) as well.
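
The merged patch (gerrit 65814 above) introduces a `monitorable' attribute. As a rough sketch of the idea only (illustrative names, not the actual vdsm implementation), the unresponsive check is simply skipped for VMs that are powering down or already down:

from enum import Enum

class LifeCycle(Enum):
    UP = "Up"
    POWERING_DOWN = "Powering down"
    DOWN = "Down"

class Vm:
    def __init__(self, lifecycle=LifeCycle.UP):
        self.lifecycle = lifecycle
        self.monitor_responsive = True

    @property
    def monitorable(self):
        # Stop considering stats age once the VM is on its way down:
        # stale samples are expected there, not a sign of trouble.
        return self.lifecycle not in (LifeCycle.POWERING_DOWN, LifeCycle.DOWN)

def maybe_set_unresponsive(vm, stats_age, timeout=60):
    if not vm.monitorable:
        return
    if stats_age > timeout:
        vm.monitor_responsive = False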

Comment 10 Francesco Romani 2016-10-12 13:42:25 UTC
Not sure this is worth fixing for 3.6.$NEXT.
I will make sure the fix is backportable, however.

Comment 15 Francesco Romani 2016-10-25 16:15:23 UTC
no POST yet.

Comment 16 Francesco Romani 2016-10-26 15:38:05 UTC
Same question as per https://bugzilla.redhat.com/show_bug.cgi?id=1382583#c6
The root causes are the same; the patches are expected to fix both bugs.

Comment 17 Francesco Romani 2016-10-27 16:05:26 UTC
nope, still POST

Comment 18 Francesco Romani 2016-10-27 16:45:36 UTC
nope, still POST

Comment 19 Germano Veit Michel 2016-10-31 05:30:09 UTC
(In reply to Francesco Romani from comment #16)
> Same question as per https://bugzilla.redhat.com/show_bug.cgi?id=1382583#c6
> the root causes are the same, the patches are expected to fix both bugs.

Sure. I can try to reproduce both at once. Let me know when you have packages available.

Thanks!

Comment 23 sefi litmanovich 2017-02-06 15:00:00 UTC
Verified using a host with vdsm-4.19.4-1.el7ev.x86_64 on rhevm-4.1.0.4-0.1.el7.

After monitoring vdsm.log for several hours while repeatedly starting and stopping 10 VMs, I did not get the 'monitor become unresponsive' WARNING that, with the older vdsm versions prior to the fix, appeared every few minutes right after a VM was shut down.

