Description of problem:
Virtual machines are being flagged as non-responsive and vdsm.log reports the following:

BindingXMLRPC::INFO::2016-11-11 19:01:01,552::xmlrpc::73::vds.XMLRPCServer::(handle_request) Starting request handler for 127.0.0.1:46252
Thread-375258::INFO::2016-11-11 19:01:01,553::xmlrpc::84::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:46252 started
Thread-375258::WARNING::2016-11-11 19:01:01,554::vm::5177::virt.vm::(_setUnresponsiveIfTimeout) vmId=`xxxxxxxxxxxx`::monitor become unresponsive (command timeout, age=78.6499999994)
...
Thread-375258::INFO::2016-11-11 19:01:01,995::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:46252 stopped
Reactor thread::INFO::2016-11-11 19:01:01,995::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.0.0.1:46254
Reactor thread::DEBUG::2016-11-11 19:01:02,350::protocoldetector::82::ProtocolDetector.Detector::(__init__) Using required_size=11

There are NO relevant messages in the engine.log apart from those logged by [org.ovirt.engine.core.vdsbroker.VmsStatisticsFetcher]. The VMs were flagged as non-responsive during a phase of high load (CPU/Memory/Network/Storage). Unfortunately there are no sar logs available from that time.

Version-Release number of selected component (if applicable):
RHEL-H-20160920
vdsm-4.17.33-1.el7ev.noarch
libvirt-1.2.17-13.el7_2.5.x86_64
kernel-3.10.0-327.22.2.el7.x86_64

How reproducible:
As we do not have any monitoring data from that time, I can only suspect that the monitoring agent did not get enough CPU cycles - which on the other hand seems very unlikely on a system with 32 CPU threads.
Moving to Virt for further investigation.
Could you please provide the engine, vdsm and libvirt logs from the time when it happened?
(In reply to Andreas Bleischwitz from comment #0)
> How reproducible:
> As we do not have any monitoring data from that time, I can only suspect
> that the monitoring agent did not get enough CPU cycles - which on the other
> hand seems very unlikely on a system with 32 CPU threads.

Is CPU affinity enabled? In 3.6.z we have support for it, but it is disabled by default (https://gerrit.ovirt.org/#/c/49463/).

If it is enabled, you should spot something like

MainThread::INFO::2017-02-08 15:01:28,891::vdsm::135::vds::(run) (PID: 17204) I am the actual vdsm 4.18.22-4.gite90f5e7.el7.centos kenji.rokugan.lan (3.10.0-514.6.1.el7.x86_64)
MainThread::INFO::2017-02-08 15:01:28,891::vdsm::235::vds::(__set_cpu_affinity) VDSM will run with cpu affinity: frozenset([1])

in the vdsm logs.

If it was disabled, please enable it and report whether this helps.
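For illustration only, a minimal sketch of what pinning a process to a single CPU looks like at the OS level (assumptions: Python 3 on Linux; the CPU number is arbitrary; this is not vdsm's actual implementation, which applies the affinity itself at startup):

import os

# Pin the current process (pid 0 means "self") to CPU 1 only.
# Illustrative sketch of the underlying OS facility, not vdsm code.
os.sched_setaffinity(0, {1})

# The returned set should now contain only CPU 1.
print("allowed CPUs:", os.sched_getaffinity(0))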
I searched the log files for "set_cpu_affinity" but could not find any occurrence, so I have to assume that it is still disabled.

As I currently have no way to check whether the issue still exists - even with CPU affinity disabled - we should ensure that vdsm always gets some CPU cycles, in case the issue is identified as CPU related.

Also, if CPU affinity had been enabled - does vdsm/libvirt ensure that this CPU is reserved for vdsmd only? Could cgroups provide such a feature? Is vdsm already covered by a cgroup?
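For reference, a minimal sketch of how one could check both points on a running host by reading /proc (assumptions: Linux /proc layout; the vdsm PID is passed on the command line, e.g. taken from `systemctl status vdsmd`; this is not a vdsm tool):

import sys

# Usage: python inspect_proc.py <pid>
pid = sys.argv[1]

# Cpus_allowed_list shows the effective CPU affinity mask of the process.
with open("/proc/%s/status" % pid) as f:
    for line in f:
        if line.startswith(("Name", "Cpus_allowed_list")):
            print(line.strip())

# /proc/<pid>/cgroup lists the control groups the process belongs to;
# systemd places each service in its own cgroup.
with open("/proc/%s/cgroup" % pid) as f:
    print(f.read())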
(In reply to Andreas Bleischwitz from comment #7)
> I searched the log files for "set_cpu_affinity" but could not find any
> occurrence, so I have to assume that it is still disabled.
>
> As I currently have no way to check whether the issue still exists - even
> with CPU affinity disabled - we should ensure that vdsm always gets some
> CPU cycles, in case the issue is identified as CPU related.
>
> Also, if CPU affinity had been enabled - does vdsm/libvirt ensure that this
> CPU is reserved for vdsmd only? Could cgroups provide such a feature? Is
> vdsm already covered by a cgroup?

The lack of affinity and the fact that the VMs were only *temporarily* reported unresponsive during a phase of high load provide a good enough explanation. So, if the VMs were reported responsive again afterwards (i.e. the system recovered), I can only recommend enabling the affinity for improved performance; there is no evidence of a new bug.

It is not enough for Vdsm to get enough CPU cycles (cgroup enforcement), it must also use them well. Unfortunately the core issue is a well-known limitation of the Python runtime, the global interpreter lock (GIL). With a heavily multithreaded application, performance is known to scale *worse* as hardware parallelism increases[1]. This is the key reason why Vdsm has the affinity support and recommends it[2]. There is no other real fix besides enabling affinity, short of changing the architecture of Vdsm to reduce the number of threads.

+++

[1] There are of course technical reasons for that. Long story short: with many threads and enough CPU cores, the threads fight each other to acquire the GIL. The failed acquisitions and the bouncing between threads cause a serious slowdown in the interpreter.

[2] Enabled by default in 4.0+. Disabled by default in 3.6.z because it was introduced very late in the release cycle and we wanted to avoid potentially invasive changes.
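To make the GIL effect above concrete, here is a toy benchmark (purely illustrative, not vdsm code; the thread count and the amount of work are arbitrary). The total work is held constant and only the number of threads changes; because CPU-bound Python threads must hold the GIL to run, the multithreaded case is typically no faster on a many-core machine and can be slower due to GIL contention:

import threading
import time

def burn(n):
    # Pure-Python CPU-bound loop; a thread must hold the GIL to make progress here.
    while n:
        n -= 1

def run(num_threads, total=8000000):
    # Split the same total amount of work across num_threads threads.
    per_thread = total // num_threads
    threads = [threading.Thread(target=burn, args=(per_thread,))
               for _ in range(num_threads)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start

if __name__ == "__main__":
    print("1 thread :", run(1))
    print("8 threads:", run(8))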
Thanks a lot for the explanation. From my point of view the BZ could be closed then.