Description of problem: When VMs get non responding. It can happen that in some cases the executor tasks queue get full and exception TooManyTasks is raised. This causes the operation not being scheduled any more. Version-Release number of selected component (if applicable): vdsm-4.17.31-0.el7ev.noarch How reproducible: Under heavy load and isue with qemu responsivness Steps to Reproduce: Not completely clear Actual results: All VMs are marked as non-responding Expected results: The task is scheduled as soon as there is some space in the tasks queue is Additional info: vdsm.Scheduler::ERROR::2016-07-16 16:15:28,745::schedule::213::Scheduler::(_execute) Unhandled exception in <bound method Operation._try_to_dispatch of <virt.periodic.Operation object at 0x45ea390>> Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/schedule.py", line 211, in _execute self._callable() File "/usr/share/vdsm/virt/periodic.py", line 190, in _try_to_dispatch self._dispatch() File "/usr/share/vdsm/virt/periodic.py", line 197, in _dispatch self._executor.dispatch(self, self._timeout) File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 101, in dispatch self._tasks.put((callable, timeout)) File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 256, in put raise TooManyTasks() TooManyTasks vdsm/virt/periodic.py: ... class Operation(object): ... def _dispatch(self): """ Send `func' to Executor to be run as soon as possible. """ self._call = None self._executor.dispatch(self, self._timeout) self._step() The exception comes from self._executor.dispatch(self, self._timeout) So self._step() is not executed and the operation is not scheduled
(In reply to Roman Hodain from comment #0) > Description of problem: > When VMs get non responding. It can happen that in some cases the > executor tasks queue get full and exception TooManyTasks is raised. This > causes the operation not being scheduled any more. > > Version-Release number of selected component (if applicable): > vdsm-4.17.31-0.el7ev.noarch > > How reproducible: > Under heavy load and isue with qemu responsivness > > Steps to Reproduce: > Not completely clear We probably need slow, but not stuck, QEMU. > Actual results: > All VMs are marked as non-responding Yes, because sampled data is too old, in turn because we skip sampling. > Expected results: > The task is scheduled as soon as there is some space in the tasks queue > is > > Additional info: [...] > So self._step() is not executed and the operation is not scheduled Ooops. Good catch, thanks!
taking care of this bug. The missing rescheduling bug is obvious and easily fixable (see patch 61685). The backport for 3.6.z is simple and feasible. However there may be other issue underlying which caused trouble (stuck or slow QEMU). These issue could be hard to debug, so let's address them separately. I recommend making sure the customer is running at least libvirt-1.2.17-13.el7_2.5 , because this package contains one important fix in this area.
master branch has all the needed patch. Please clone to 4.0.z and/or 3.6.z if needed.
Cloned to z-stream bug 1364924
Verified with rhevm-4.1.0.4-0.1.el7. HOST: vdsm-4.19.4-1.el7ev.x86_64 1. Modified /etc/vdsm/vdsm.conf and added the lines: [sampling] periodic_workers = 1 periodic_task_per_worker = 1 2. restarted vdsm. 3. Created a vm pool with 8 vms. 4. Started all 8 vms. result: in vdsm log the new warning message wrapping the TooManyTask exception is issues in vdsm.log: 2017-02-06 14:20:13,790 WARN (vdsm.Scheduler) [virt.periodic.Operation] could not run <VmDispatcher operation=<class 'vdsm.virt.periodic.BlockjobMonitor'> at 0x2914390>, executor queue full (periodic:218) 2017-02-06 14:20:13,791 WARN (vdsm.Scheduler) [virt.periodic.Operation] could not run <function <lambda> at 0x1531e60>, executor queue full (periodic:218) Did not see and vms in the engine set to 'not responding' state. When previous tasks are done the vms are started as well until eventually all are up as expected.