1364925 – VMs flip to non-responsive state for ever.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	3.6.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	ovirt-3.6.9
Target Release:	---
Assignee:	Francesco Romani
QA Contact:	sefi litmanovich
Docs Contact:
URL:
Whiteboard:
Depends On:	1361028 1364924
Blocks:

Reported:	2016-08-08 08:28 UTC by Michal Skrivanek
Modified:	2020-03-11 15:11 UTC
CC List:	18 users
Fixed In Version:	vdsm-4.17.34-1.el7ev.noarch
Doc Type:	If docs needed, set a value
Doc Text:	This update fixes an issue in the monitoring code that caused VDSM to fail to detect that a stuck QEMU process had become responsive again.
Clone Of:	1364924
Environment:
Last Closed:	2016-09-21 18:07:37 UTC
oVirt Team:	Virt
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2016:1925	normal	SHIPPED_LIVE	vdsm 3.6.9 bug fix and enhancement update	2016-09-21 21:58:32 UTC
oVirt gerrit	61685	None	None	None	2016-08-08 08:28:12 UTC
oVirt gerrit	61769	ovirt-4.0	MERGED	virt: Limit the number of workers in executor	2016-08-17 07:10:04 UTC
oVirt gerrit	61770	ovirt-4.0	MERGED	virt: Fix of Executor._active_workers crash on modification	2016-08-17 07:22:11 UTC
oVirt gerrit	61889	ovirt-4.0	MERGED	periodic: always re-schedule operations	2016-08-17 07:22:16 UTC
oVirt gerrit	62401	ovirt-3.6	MERGED	periodic: always re-schedule operations	2016-08-17 12:19:26 UTC
oVirt gerrit	62402	ovirt-3.6	MERGED	virt: Limit the number of workers in executor	2016-08-17 12:39:11 UTC
oVirt gerrit	62403	ovirt-3.6	MERGED	virt: Fix of Executor._active_workers crash on modification	2016-08-17 12:39:41 UTC

Description Michal Skrivanek 2016-08-08 08:28:13 UTC

clone job doesn't work, so doing that manually

+++ This bug was initially created as a clone of Bug #1364924 +++

clone job doesn't work, so doing that manually

+++ This bug was initially created as a clone of Bug #1361028 +++

Description of problem:
   When VMs become non-responsive, the executor's task queue can in some cases fill up, and a TooManyTasks exception is raised. After that, the operation is never scheduled again.

Version-Release number of selected component (if applicable):
     vdsm-4.17.31-0.el7ev.noarch

How reproducible:
     Under heavy load combined with QEMU responsiveness issues

Steps to Reproduce:
     Not completely clear

Actual results:
     All VMs are marked as non-responding

Expected results:
     The task is scheduled as soon as there is free space in the task queue.

Additional info:

vdsm.Scheduler::ERROR::2016-07-16 16:15:28,745::schedule::213::Scheduler::(_execute) Unhandled exception in <bound method Operation._try_to_dispatch of <virt.periodic.Operation object at 0x45ea390>>
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/schedule.py", line 211, in _execute
    self._callable()
  File "/usr/share/vdsm/virt/periodic.py", line 190, in _try_to_dispatch
    self._dispatch()
  File "/usr/share/vdsm/virt/periodic.py", line 197, in _dispatch
    self._executor.dispatch(self, self._timeout)
  File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 101, in dispatch
    self._tasks.put((callable, timeout))
  File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 256, in put
    raise TooManyTasks()
TooManyTasks



vdsm/virt/periodic.py:
...
class Operation(object):
...
     def _dispatch(self):
         """
         Send `func' to Executor to be run as soon as possible.
         """
         self._call = None
         self._executor.dispatch(self, self._timeout)
         self._step()

The exception is raised by

    self._executor.dispatch(self, self._timeout)

so self._step() is never executed and the operation is not rescheduled.
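
A minimal sketch of the failure mode and one way to avoid it, in the spirit of the linked gerrit change "periodic: always re-schedule operations". This is not the actual vdsm patch: TinyExecutor, the local TooManyTasks class, and the steps counter are hypothetical stand-ins, and only the shape of _dispatch() follows the excerpt above. The warning text mirrors the log line quoted in comment 2.

```python
import logging

log = logging.getLogger("virt.periodic.Operation")


class TooManyTasks(Exception):
    """Stand-in for the executor's queue-full exception."""


class TinyExecutor(object):
    """Hypothetical executor with a bounded task queue."""

    def __init__(self, max_tasks):
        self._max_tasks = max_tasks
        self.tasks = []

    def dispatch(self, task, timeout=None):
        # Refuse new work once the queue is full.
        if len(self.tasks) >= self._max_tasks:
            raise TooManyTasks()
        self.tasks.append((task, timeout))


class Operation(object):
    def __init__(self, executor, timeout=None):
        self._executor = executor
        self._timeout = timeout
        self._call = None
        self.steps = 0  # counts how often the operation was re-armed

    def _step(self):
        # The real code re-arms the scheduler here; the sketch only counts.
        self.steps += 1

    def _dispatch(self):
        self._call = None
        try:
            self._executor.dispatch(self, self._timeout)
        except TooManyTasks:
            # Skip this cycle, but do not let the exception escape.
            log.warning("could not run %s, executor queue full", self)
        finally:
            # Always re-schedule, whether or not the dispatch succeeded.
            self._step()
```

With a queue of size one, a second _dispatch() call hits the full queue, logs a warning, and still reaches _step(), so the operation stays scheduled for the next cycle.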

Comment 1 Francesco Romani 2016-08-29 08:42:15 UTC

To reproduce:

1. modify /etc/vdsm/vdsm.conf and add these lines:

[sampling]
periodic_workers = 1
periodic_task_per_worker = 1

2. restart vdsm

With these settings vdsm is very likely to raise the TooManyTasks error, which triggers the condition behind this bug.

Now we can use vdsClient (or engine or anything else) to see if stats are still updated or not.
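
A rough sketch of why these settings make the error so likely. It assumes (not confirmed in this report) that the periodic executor's queue capacity is the product of the two options; queue_capacity is a hypothetical helper, and the snippet uses Python 3's configparser rather than vdsm's own config module.

```python
import configparser

# The fragment from step 1 of the reproducer.
VDSM_CONF_FRAGMENT = """
[sampling]
periodic_workers = 1
periodic_task_per_worker = 1
"""


def queue_capacity(conf_text):
    """Return workers * tasks-per-worker (assumed capacity formula)."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    workers = parser.getint("sampling", "periodic_workers")
    per_worker = parser.getint("sampling", "periodic_task_per_worker")
    return workers * per_worker


capacity = queue_capacity(VDSM_CONF_FRAGMENT)
print(capacity)
```

Under that assumption a single pending periodic operation fills the queue, so any further dispatch attempt is rejected.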

Comment 2 sefi litmanovich 2016-08-31 13:10:29 UTC

Verified with rhevm-3.6.9-0.1.el6.noarch.
HOST: vdsm-4.17.34-1.el7ev.noarch

Followed comment 1:
1. Modified /etc/vdsm/vdsm.conf and added the lines:

[sampling]
periodic_workers = 1
periodic_task_per_worker = 1

2. restarted vdsm.
3. Created a vm pool with 8 vms.
4. Started all 8 vms.

result:

In vdsm.log, the new warning message that wraps the TooManyTasks exception is issued:

vdsm.Scheduler::WARNING::2016-08-29 15:27:12,161::periodic::203::virt.periodic.Operation::(_dispatch) could not run <VmDispatcher operation=<class 'virt.periodic.BlockjobMonitor'> at 0x3be9590>, executor queue full

In the engine, some of the VMs are set to the 'not responding' state for a short time; once the previous tasks are done, these VMs start as well, until eventually all are up as expected.

Comment 6 errata-xmlrpc 2016-09-21 18:07:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1925.html