Description of problem:

After VM.Shutdown(), periodic tasks may continue running after the VM is actually down (qemu down). The data gets stale and the VM may be seen as Unresponsive by the Engine.

In the Engine, instead of 'Powering Down' -> 'Down', it is seen as:
'Powering Down' -> 'Not responding' -> 'Down'

In VDSM, what happens is:
'Powering Down' -> 'qemu dead' -> 'monitor unresponsive|cannot run periodic' -> 'Down'

I regularly see cases (in my reproducer and in the customer's logs) where the periodic tasks and getRunningVmStats run for a short time after qemu has ended, but they usually stop within 1-2 seconds. However, in a few cases in the customer's logs the tasks continue to run for longer, until the destroy is issued, so it can become quite noticeable.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Virtualization Hypervisor release 7.2 (20160920.1.el7ev)
rhevm-3.6.8.1-0.1.el6.noarch

How reproducible:
Sometimes. In my reproducer I can get the VM to do 'Powering Down' -> 'Unresponsive' -> 'Down' in about 25% of the shutdowns. It always transitions to 'Down' quite quickly, remaining only a very short time in the 'Not responding' state. However, in some high-load environments this may look worse.
* It seems easier to reproduce when I increase the number of workers, but that might be just an impression.

Steps to Reproduce:
1. Start a VM
2. Graceful shutdown from the Engine (agent installed - desktopShutdown)

Actual results:
VM/qemu is actually down, periodic tasks continue running, and the VM is reported as unresponsive for a while.

Expected results:
The VM shuts down cleanly and the periodic functions stop (or are ignored). The VM goes to the Down state without passing through 'Not Responding'.

Additional info:
VM XXX = f46242aa-63d5-41ad-92d6-290341cc7592

2016-10-03 12:00:04.531+08 | VM shutdown initiated by admin@internal on VM XXX (Host: YYY) (Reason: No reason was returned for this operation failure. See logs for further details.).
2016-10-03 12:00:44.173+08 | VM XXX is not responding.

VDSM side:

jsonrpc.Executor/6::DEBUG::2016-10-03 04:00:04,486::__init__::532::jsonrpc.JsonRpcServer::(_handle_request) Calling 'VM.shutdown' in bridge with {u'delay': u'30', u'message': u'System Administrator has initiated shutdown of this Virtual Machine. Virtual Machine is shutting down.', u'vmID': u'f46242aa-63d5-41ad-92d6-290341cc7592'}
libvirtEventLoop::DEBUG::2016-10-03 04:00:43,409::vm::4648::virt.vm::(onLibvirtLifecycleEvent) vmId=`f46242aa-63d5-41ad-92d6-290341cc7592`::event Stopped detail 0 opaque None
libvirtEventLoop::INFO::2016-10-03 04:00:43,409::vm::791::virt.vm::(_onQemuDeath) vmId=`f46242aa-63d5-41ad-92d6-290341cc7592`::underlying process disconnected
libvirtEventLoop::INFO::2016-10-03 04:00:43,410::vm::3882::virt.vm::(releaseVm) vmId=`f46242aa-63d5-41ad-92d6-290341cc7592`::Release VM resources

At this point any vmstats/vm_sample call is useless, but it is still running.
Two seconds later:

periodic/3889::WARNING::2016-10-03 04:00:45,002::periodic::268::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.DriveWatermarkMonitor'> on [u'f46242aa-63d5-41ad-92d6-290341cc7592']
Thread-1472923::WARNING::2016-10-03 04:00:45,843::vm::5177::virt.vm::(_setUnresponsiveIfTimeout) vmId=`f46242aa-63d5-41ad-92d6-290341cc7592`::monitor become unresponsive (command timeout, age=4513373.21)
periodic/3889::WARNING::2016-10-03 04:00:46,646::periodic::302::virt.vm::(__call__) vmId=`f46242aa-63d5-41ad-92d6-290341cc7592`::could not run on f46242aa-63d5-41ad-92d6-290341cc7592: domain not connected

So the VM may look 'Not Responding' to the Engine until the Down event is sent.
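To illustrate why the monitoring code can still fire after qemu is gone, here is a toy Python sketch (not VDSM code; all names are made up) of the race between the libvirt lifecycle event, delivered on one thread, and a periodic monitoring cycle already scheduled on another:

```python
# Toy illustration of the race -- not VDSM code. The libvirt 'Stopped' event
# and the periodic dispatcher run on independent threads, so a monitoring
# cycle that was already scheduled can still run after qemu has exited.
import threading
import time

domain_alive = True


def on_qemu_death():
    # Delivered asynchronously on the (simulated) libvirt event loop thread.
    global domain_alive
    domain_alive = False
    print("event: underlying process disconnected")


def periodic_monitor():
    # Runs on an executor thread every 2s, regardless of lifecycle events.
    for _ in range(3):
        if domain_alive:
            print("collecting stats")
        else:
            print("WARNING: could not run, domain not connected")
        time.sleep(2)


worker = threading.Thread(target=periodic_monitor)
worker.start()
time.sleep(3)
on_qemu_death()   # qemu dies mid-cycle; the next tick still runs and warns
worker.join()
```

The last cycle in the output corresponds to the "could not run ... domain not connected" warnings above.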
This doesn't look right, does it?

age=4513373.21

Could the _vm_last_timestamp value have been lost/zeroed at this point?
Michael, don't you already have a very similar bz for this (during startup)? https://bugzilla.redhat.com/show_bug.cgi?id=1314371#c21
(In reply to Dan Kenigsberg from comment #3)
> Michael, don't you already have a very similar bz for this (during startup)?
> https://bugzilla.redhat.com/show_bug.cgi?id=1314371#c21

No, I didn't end up reporting such a bug. But this is definitely happening from time to time on startup.
@Milan, any thoughts?
(In reply to Germano Veit Michel from comment #2)
> This doesn't look right, does it?
>
> age=4513373.21
>
> Could the _vm_last_timestamp value have been lost/zeroed at this point?

Indeed, it doesn't look right. Normally during shutdown such delayed queries are tolerated for up to 60s, and even here in this loaded environment it's pretty quick, so statsAge should be a few seconds at most. This large value indicates that a particular stats call has not been working for 52 days. But then I don't understand from the code how it could have worked before shutdown; it should have been Not Responding the whole time. Unless we indeed mess up the samples on shutdown.

Adding Francesco as well.
If those ~52 days correspond to the VDSM uptime, then it would indeed mean the samples have been cleared out (as they are on shutdown), returning a statsAge of "current time - start time" by mistake.
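To make the suspected mistake concrete, a minimal sketch (hypothetical names, not the actual VDSM code) of how a cleared sample cache could turn the reported stats age into "current time - start time":

```python
# Hypothetical illustration, not VDSM code. Pretend vdsm has been up ~52
# days, which roughly matches the age=4513373.21 seconds seen in the log.
import time

VDSM_START_TIME = time.time() - 52 * 24 * 3600

# vmId -> timestamp of the last successful sample
_vm_last_timestamp = {'f46242aa-63d5-41ad-92d6-290341cc7592': time.time()}


def stats_age(vm_id):
    # Buggy fallback: if the VM has no samples (e.g. the cache was cleared
    # on shutdown), we compare against the process start time, so the age
    # jumps from a few seconds to the whole uptime.
    last = _vm_last_timestamp.get(vm_id, VDSM_START_TIME)
    return time.time() - last


vm_id = 'f46242aa-63d5-41ad-92d6-290341cc7592'
print(stats_age(vm_id))       # ~0s while the VM is still being sampled

_vm_last_timestamp.clear()    # samples dropped on VM shutdown
print(stats_age(vm_id))       # ~4.5 million seconds -> looks "unresponsive"
```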
(In reply to Michal Skrivanek from comment #7)
> (In reply to Germano Veit Michel from comment #2)
> > This doesn't look right, does it?
> >
> > age=4513373.21
> >
> > Could the _vm_last_timestamp value have been lost/zeroed at this point?
>
> Indeed, it doesn't look right. Normally during shutdown such delayed queries
> are tolerated for up to 60s, and even here in this loaded environment it's
> pretty quick, so statsAge should be a few seconds at most. This large value
> indicates that a particular stats call has not been working for 52 days.
> But then I don't understand from the code how it could have worked before
> shutdown; it should have been Not Responding the whole time. Unless we
> indeed mess up the samples on shutdown.
>
> Adding Francesco as well.

That is correct. There is a race on shutdown (and also on startup), because these three flows are all asynchronous with each other:

1. VM creation/powerup
2. VM powerdown
3. periodic monitoring

Here we clear the stats cache when the VM is powering down, but the VM object stays around until the Engine destroys it. If that takes time, the VM is reported as both (Powering) Down and Unresponsive, which is indeed a bug.

The fix is to not consider the stats age at all for powering-down and down VMs. I'll look into the proper way to implement that. This fix should end up in 4.0.z (z==6?) as well.
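For context, here is a minimal sketch of that idea with hypothetical names (the real check lives around Vm._setUnresponsiveIfTimeout, but this is not the actual patch): once the VM is powering down or down, the responsiveness check is skipped entirely, regardless of how stale the stats are.

```python
# Sketch of the proposed behaviour, hypothetical names -- not the VDSM patch.
# A VM that is already (powering) down is never flagged 'Not responding'.
POWERING_DOWN = 'Powering down'
DOWN = 'Down'
TIMEOUT = 60  # seconds a delayed monitor response is normally tolerated


def set_unresponsive_if_timeout(vm, stats_age, timeout=TIMEOUT):
    if vm.status in (POWERING_DOWN, DOWN):
        # qemu is gone or going away: stale or missing samples are expected
        # here, so do not report the VM as unresponsive in these states.
        return
    if stats_age > timeout:
        vm.monitor_responsive = False
```

The 60s value simply mirrors the tolerance mentioned in comment #7.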
Not sure this is worth fixing for 3.6.$NEXT. I will make sure the fix is backportable, however.
no POST yet.
Same question as in https://bugzilla.redhat.com/show_bug.cgi?id=1382583#c6: the root causes are the same, and the patches are expected to fix both bugs.
nope, still POST
(In reply to Francesco Romani from comment #16)
> Same question as in https://bugzilla.redhat.com/show_bug.cgi?id=1382583#c6:
> the root causes are the same, and the patches are expected to fix both bugs.

Sure, I can try to reproduce both at once. Let me know when you have packages available. Thanks!
Verified using a host with vdsm-4.19.4-1.el7ev.x86_64 on rhevm-4.1.0.4-0.1.el7. After monitoring vdsm.log for several hours while starting and stopping 10 VMs on and off, I did not get the 'monitor become unresponsive' WARNING that, with the older vdsm builds prior to the fix, used to appear every few minutes right after a VM was shut down.