Description of problem:
Virtual machines are being flagged as non-responsive and vdsm.log reports the following:

BindingXMLRPC::INFO::2016-11-11 19:01:01,552::xmlrpc::73::vds.XMLRPCServer::(handle_request) Starting request handler for 127.0.0.1:46252
Thread-375258::INFO::2016-11-11 19:01:01,553::xmlrpc::84::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:46252 started
Thread-375258::WARNING::2016-11-11 19:01:01,554::vm::5177::virt.vm::(_setUnresponsiveIfTimeout) vmId=`xxxxxxxxxxxx`::monitor become unresponsive (command timeout, age=78.6499999994)
...
Thread-375258::INFO::2016-11-11 19:01:01,995::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:46252 stopped
Reactor thread::INFO::2016-11-11 19:01:01,995::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.0.0.1:46254
Reactor thread::DEBUG::2016-11-11 19:01:02,350::protocoldetector::82::ProtocolDetector.Detector::(__init__) Using required_size=11

There are NO relevant messages in the engine.log apart from those logged by [org.ovirt.engine.core.vdsbroker.VmsStatisticsFetcher]. The VMs were flagged as non-responsive during a phase of high load (CPU/Memory/Network/Storage). Unfortunately there are no sar logs available from that time.

Version-Release number of selected component (if applicable):
RHEL-H-20160920
vdsm-4.17.33-1.el7ev.noarch
libvirt-1.2.17-13.el7_2.5.x86_64
kernel-3.10.0-327.22.2.el7.x86_64

How reproducible:
As we do not have any monitoring data from that time, I can only suspect that the monitoring agent did not get enough CPU cycles - which on the other hand seems very unlikely on a system with 32 CPU threads.
Moving to Virt for further investigation.
Could you please provide the engine, vdsm and libvirt logs from the time when it happened?
(In reply to Andreas Bleischwitz from comment #0)
> How reproducible:
> As we do not have any monitoring data from that time, I can only suspect
> that the monitoring agent did not get enough CPU cycles - which on the other
> hand seems very unlikely on a system with 32 CPU threads.

Is CPU affinity enabled? In 3.6.z we have support for it, but it is disabled by default (https://gerrit.ovirt.org/#/c/49463/).

If it is enabled, you should spot something like

MainThread::INFO::2017-02-08 15:01:28,891::vdsm::135::vds::(run) (PID: 17204) I am the actual vdsm 4.18.22-4.gite90f5e7.el7.centos kenji.rokugan.lan (3.10.0-514.6.1.el7.x86_64)
MainThread::INFO::2017-02-08 15:01:28,891::vdsm::235::vds::(__set_cpu_affinity) VDSM will run with cpu affinity: frozenset([1])

in the vdsm logs.

If it was disabled, please enable it and report whether this helps.
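For illustration only, a minimal sketch of what pinning a process to a single CPU looks like at the OS level (assumptions: Python 3 on Linux; the CPU number is arbitrary; this is not vdsm's actual implementation, which applies the affinity itself at startup):

import os

# Pin the current process (pid 0 means "self") to CPU 1 only.
# Illustrative sketch of the underlying OS facility, not vdsm code.
os.sched_setaffinity(0, {1})

# The returned set should now contain only CPU 1.
print("allowed CPUs:", os.sched_getaffinity(0))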
I searched the log files for "set_cpu_affinity" but could not find any occurrence, so I have to assume that it is still disabled.

As I currently have no way to check whether the issue still exists - even with CPU affinity disabled - we should ensure that vdsm always gets some CPU cycles, in case the issue is identified as CPU related.

Also, if CPU affinity had been enabled - does vdsm/libvirt ensure that this CPU is reserved for vdsmd only? Could cgroups provide such a feature? Is vdsm already covered by a cgroup?
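For reference, a minimal sketch of how one could check both points on a running host by reading /proc (assumptions: Linux /proc layout; the vdsm PID is passed on the command line, e.g. taken from `systemctl status vdsmd`; this is not a vdsm tool):

import sys

# Usage: python inspect_proc.py <pid>
pid = sys.argv[1]

# Cpus_allowed_list shows the effective CPU affinity mask of the process.
with open("/proc/%s/status" % pid) as f:
    for line in f:
        if line.startswith(("Name", "Cpus_allowed_list")):
            print(line.strip())

# /proc/<pid>/cgroup lists the control groups the process belongs to;
# systemd places each service in its own cgroup.
with open("/proc/%s/cgroup" % pid) as f:
    print(f.read())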
(In reply to Andreas Bleischwitz from comment #7)
> I searched the log files for "set_cpu_affinity" but could not find any
> occurrence, so I have to assume that it is still disabled.
>
> As I currently have no way to check whether the issue still exists - even
> with CPU affinity disabled - we should ensure that vdsm always gets some
> CPU cycles, in case the issue is identified as CPU related.
>
> Also, if CPU affinity had been enabled - does vdsm/libvirt ensure that this
> CPU is reserved for vdsmd only? Could cgroups provide such a feature? Is
> vdsm already covered by a cgroup?

The lack of affinity and the fact that the VMs were only *temporarily* reported unresponsive during a phase of high load provide a good enough explanation. So, if the VMs were reported responsive again afterwards (i.e. the system recovered), I can only recommend enabling the affinity for improved performance; there is no evidence of a new bug.

It is not enough for Vdsm to get enough CPU cycles (cgroup enforcement), it must also use them well. Unfortunately the core issue is a well-known limitation of the Python runtime, the global interpreter lock (GIL). With a heavily multithreaded application, performance is known to scale *worse* as hardware parallelism increases[1]. This is the key reason why Vdsm has the affinity support and recommends it[2]. There is no other real fix besides enabling affinity, short of changing the architecture of Vdsm to reduce the number of threads.

+++

[1] There are of course technical reasons for that. Long story short: with many threads and enough CPU cores, the threads fight each other to acquire the GIL. The failed acquisitions and the bouncing between threads cause a serious slowdown in the interpreter.

[2] Enabled by default in 4.0+. Disabled by default in 3.6.z because it was introduced very late in the release cycle and we wanted to avoid potentially invasive changes.
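To make the GIL effect above concrete, here is a toy benchmark (purely illustrative, not vdsm code; the thread count and the amount of work are arbitrary). The total work is held constant and only the number of threads changes; because CPU-bound Python threads must hold the GIL to run, the multithreaded case is typically no faster on a many-core machine and can be slower due to GIL contention:

import threading
import time

def burn(n):
    # Pure-Python CPU-bound loop; a thread must hold the GIL to make progress here.
    while n:
        n -= 1

def run(num_threads, total=8000000):
    # Split the same total amount of work across num_threads threads.
    per_thread = total // num_threads
    threads = [threading.Thread(target=burn, args=(per_thread,))
               for _ in range(num_threads)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start

if __name__ == "__main__":
    print("1 thread :", run(1))
    print("8 threads:", run(8))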
Thanks a lot for the explanation. From my point of view the BZ could be closed then.