Bug 1435207 - [scale] - vm monitor became unresponsive
Summary: [scale] - vm monitor became unresponsive
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.19.9
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Francesco Romani
QA Contact: eberman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-03-23 11:54 UTC by Eldad Marciano
Modified: 2019-04-28 14:18 UTC (History)
15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-07-30 09:22:30 UTC
oVirt Team: Virt
rule-engine: ovirt-4.1+


Attachments (Terms of Use)
vdsm-4.19.12-1.el7ev.x86_64 vdsm logs (1.11 MB, application/zip)
2017-05-23 15:10 UTC, Eldad Marciano

Description Eldad Marciano 2017-03-23 11:54:26 UTC
Description of problem:

As part of the scale-up effort, when running ~500 VMs:

vdsm reports many warning messages:
2017-03-22 13:14:27,499+0200 WARN  (jsonrpc/5) [virt.vm] (vmId='c6a196a0-23ac-483d-ba6e-29e66393e03c') monitor became unresponsive (command timeout, age=80.3999999994) (vm:5013)

The default timeout in config.py is 60 seconds, so why does the reported age exceed it by 20 seconds?
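For context, the "age" in the warning is the time since the last successful monitor sample; a minimal sketch of the check (illustrative, not vdsm's actual code; only the 60-second default is taken from the report above):

```python
import time

TIMEOUT = 60  # seconds, mirroring the default timeout mentioned above

def monitor_age(last_sample_ts, now=None):
    """Seconds since the last successful libvirt monitor sample."""
    if now is None:
        now = time.time()
    return now - last_sample_ts

def is_unresponsive(last_sample_ts, now=None):
    """True once the age exceeds the timeout.

    An age of 80.4 s means the last successful sample was 80.4 s ago:
    the 60 s timeout plus ~20 s that elapsed before the next periodic
    check noticed the stall and logged the warning.
    """
    return monitor_age(last_sample_ts, now) > TIMEOUT

print(is_unresponsive(last_sample_ts=0, now=80.4))  # → True (age 80.4 > 60)
```

So an age greater than 60 is expected whenever the periodic check runs some time after the timeout was first crossed.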

Is there further information we can gather from the libvirt/qemu logs?

host spec:
40 CPUs, 1 TB RAM, connected to NFS storage

This bug is pretty critical since, when the engine sees VMs in an unknown state, it makes them unusable as a defensive policy.

Version-Release number of selected component (if applicable):
4.1.1
vdsm-4.19.9-1.el7ev.x86_64
libvirt-client-2.0.0-10.el7_3.4.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. running ~500 vms


Actual results:
vdsm and the engine report VMs in an unknown state,
and in this case those VMs are unusable by the engine.

Expected results:
vdsm/libvirtd should be able to handle large-scale monitoring.

Additional info:

Comment 1 Piotr Kliczewski 2017-03-23 12:48:45 UTC
It seems to be a libvirt issue. Are we aware of any known libvirt issue about this?

Comment 2 Michal Skrivanek 2017-03-24 07:02:17 UTC
I do not think there is any issue other than simply hitting the performance limit. The question is whether we want to aim higher at the expense of less frequent stats.

Comment 3 Michal Skrivanek 2017-03-24 09:50:54 UTC
Also, we should test collectd performance while we have the large-scale setup.

Comment 6 Yaniv Kaul 2017-03-27 10:36:51 UTC
Eldad, would you mind providing both VDSM and libvirt logs (might need to enable it) ?

Comment 7 Eldad Marciano 2017-03-30 13:04:27 UTC
(In reply to Yaniv Kaul from comment #6)
> Eldad, would you mind providing both VDSM and libvirt logs (might need to
> enable it) ?
adding logs

Comment 13 Eldad Marciano 2017-05-17 11:59:24 UTC
This bug reproduced with 500 real VMs.
Each VM has 512 MB RAM and a single CPU.
The bare-metal host has 40 cores and 1 TB RAM.

The following warnings can be seen in the libvirt logs:

2017-05-17T11:50:28.133540Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2017-05-17T11:50:28.134423Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config


attaching the full logs as well.

Comment 15 Pei Zhang 2017-05-18 07:48:18 UTC
This warning message is expected.

Since the configuration of your VM is:
..... 
-smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -numa node,nodeid=0,cpus=0,mem=512 
......

If you change it to use all the vCPUs (change to cpus=0-15), you will not hit the warning again:
......
-smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -numa node,nodeid=0,cpus=0-15,mem=512 
......
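The qemu warning above can be checked mechanically: it fires whenever some CPU index below maxcpus is not listed in any NUMA cell. A minimal sketch of that coverage check (illustrative, not qemu's actual code; cpu-set parsing is simplified):

```python
def parse_cpu_set(spec):
    """Parse a qemu-style cpu list like '0', '0-15', or '0,2-3' into a set."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def cpus_missing_from_numa(maxcpus, numa_cell_specs):
    """CPU indices below maxcpus that no NUMA cell describes."""
    described = set()
    for spec in numa_cell_specs:
        described |= parse_cpu_set(spec)
    return sorted(set(range(maxcpus)) - described)

# The reported config: maxcpus=16 with a single NUMA cell cpus=0,
# so CPUs 1..15 are undescribed and the warning fires.
print(cpus_missing_from_numa(16, ["0"]))
# The suggested fix, cpus=0-15, covers everything.
print(cpus_missing_from_numa(16, ["0-15"]))  # → []
```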

Comment 16 chhu 2017-05-18 08:19:14 UTC
Hi, Tomas

There is a bug fix in libvirt-2.0.0-10.el7_3.9 for bug 1442043:
Bug 1442043 - Default FD limit prevents running more than ~470 VMs

I'm not sure whether you've modified LimitNOFILE in /usr/lib/systemd/system/libvirtd.service manually; if not, you may hit the same problem: VMs failing to start at around 500 VMs.
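The FD-limit connection is easy to sanity-check: with a roughly fixed per-VM file-descriptor cost, the ceiling on concurrent VMs is about LimitNOFILE divided by that cost. A rough sketch (the per-VM FD count and reserved overhead here are illustrative assumptions, not measured values):

```python
def max_vms_for_fd_limit(limit_nofile, fds_per_vm, reserved=100):
    """Rough ceiling on concurrent VMs libvirtd can monitor.

    `reserved` approximates descriptors libvirtd needs for itself
    (sockets, logs, event fds); both it and `fds_per_vm` are
    illustrative assumptions, not measured values.
    """
    if fds_per_vm <= 0:
        raise ValueError("fds_per_vm must be positive")
    return max(0, (limit_nofile - reserved) // fds_per_vm)

# With a hypothetical limit of 8192 and ~17 FDs per VM, the ceiling
# lands near the ~470 VMs mentioned in the bug title.
print(max_vms_for_fd_limit(8192, 17))  # → 476
```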

Maybe you can update libvirt to the latest version and collect the libvirtd log with these settings:
1) In /etc/libvirt/libvirtd.conf
log_level = 1
log_filters = "1:qemu 1:libvirt 1:json 1:monitor 3:event 1:util 1:security"
log_outputs = "1:file:/var/log/libvirt/libvirtd.log"

2) Restart libvirtd service.
#service libvirtd restart

Please upload the libvirtd.log if you still hit this problem with the latest libvirt.

Thank you!


Regards,
chhu

Comment 17 Eldad Marciano 2017-05-18 08:59:10 UTC
Hi Chhu
From where can I get the latest version?

Comment 18 chhu 2017-05-18 09:36:03 UTC
Hi, Tomas

Sent you an email with the repo for the latest version.

Regards,
chhu

Comment 19 Eldad Marciano 2017-05-18 11:44:35 UTC
(In reply to chhu from comment #18)
> Hi, Tomas
> 
> Sent email to you for the latest version repo.
> 
> Regards,
> chhu

Hey, we have set up the latest version of libvirt and extended the debug log levels.

Currently the log size is 34 GB.
Maybe we should consider using just ERROR as the log level?

Comment 20 Michal Skrivanek 2017-05-18 12:58:50 UTC
Eldad, rather than the size of the logs, the more interesting question is whether you still experience the problem with the latest libvirt (which one is that?).
As for comment #13 - that is bug 1437559.

Comment 21 Eldad Marciano 2017-05-18 20:23:49 UTC
(In reply to Michal Skrivanek from comment #20)
> Eldad, so rather than the size of the logs, more interesting question is
> whether  you you still experience the problem in latest libvirt (which one
> is that?)
> as for comment #13 - that is bug 1437559

Seems like yes:

scale_-136.log:2017-05-18T20:22:32.072283Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
scale_-136.log:2017-05-18T20:22:32.072936Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config
scale_-13.log:2017-05-18T20:22:24.907961Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
scale_-13.log:2017-05-18T20:22:24.908585Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config



libvirt version:
libvirt-client-2.0.0-10.el7_3.9.x86_64

Comment 22 Michal Skrivanek 2017-05-19 11:05:44 UTC
Right, so again, the interesting question to answer is whether all your problems have been solved by upgrading to libvirt 2.0.0-10.el7_3.9.

Comment 23 Eldad Marciano 2017-05-21 09:13:00 UTC
(In reply to Michal Skrivanek from comment #22)
> right, so again, the interesting question to answer is whether all your
> problems have been solved be upgrade to libvirt 2.0.0-10.el7_3.9

Hi Michal, they have not, as mentioned in comment #21.

Comment 24 Michal Skrivanek 2017-05-21 16:59:22 UTC
Eldad, in comment #21 you only mention the NUMA config, which is explained in comment #20. So what issues do you still experience?

Comment 25 Eldad Marciano 2017-05-21 17:10:38 UTC
Hi, we are experiencing non-responsive VMs,
and:
scale_-136.log:2017-05-18T20:22:32.072283Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
scale_-136.log:2017-05-18T20:22:32.072936Z qemu-kvm: warning: All CPU(s) up to

Comment 26 Michal Skrivanek 2017-05-22 08:18:23 UTC
(In reply to Eldad Marciano from comment #25)
> Hi, for we experiencing non responsive vms
> and 
> scale_-136.log:2017-05-18T20:22:32.072283Z qemu-kvm: warning: CPU(s) not
> present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> scale_-136.log:2017-05-18T20:22:32.072936Z qemu-kvm: warning: All CPU(s) up
> to

OK. So let's focus on the non responsive VMs. As noted before it may be the bug 1419856 - do you still see the problem when running with this fix included?

Comment 27 Eldad Marciano 2017-05-22 15:50:06 UTC
(In reply to Michal Skrivanek from comment #26)
> (In reply to Eldad Marciano from comment #25)
> > Hi, for we experiencing non responsive vms
> > and 
> > scale_-136.log:2017-05-18T20:22:32.072283Z qemu-kvm: warning: CPU(s) not
> > present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > scale_-136.log:2017-05-18T20:22:32.072936Z qemu-kvm: warning: All CPU(s) up
> > to
> 
> OK. So let's focus on the non responsive VMs. As noted before it may be the
> bug 1419856 - do you still see the problem when running with this fix
> included?

It still reproduces on vdsm-4.19.12-1.el7ev.x86_64.
Was bug #1419856 fixed in 4.19.12?

Comment 28 Michal Skrivanek 2017-05-22 16:14:14 UTC
Ok. It should be included, yes.
So please attach new logs from vdsm 4.19.12 and libvirt 2.0.0-10.el7_3.9.x86_64
or newer. Thanks

Comment 29 Eldad Marciano 2017-05-23 13:25:59 UTC
(In reply to Michal Skrivanek from comment #28)
> Ok. It should be oncluded, yes. 
> So please attach new logs from vdsm 4.19.12 and libvirt
> 2.0.0-10.el7_3.9.x86_64
> or newer. Thanks

Due to a HW change we need to reproduce it; will update as soon as possible.

Comment 30 Eldad Marciano 2017-05-23 15:09:51 UTC
Hi Michal
We reproduced the bug with 300 VMs.

We spun up VMs in stages of 100; after each cycle we stressed vdsm with getAllVmIoTunePolicies.

results:
0-100 + stress - passed.
100-200 + stress - passed.
250-295 - failed on idle (without the stress stage, i.e. without stressing mom).

All the VMs run an idle OS.

Attaching the vdsm log. Currently libvirt is not logging; I'll set up libvirt logging as well, but meanwhile we can progress with vdsm.
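The staged ramp described above can be modeled as a small plan generator (names and stage boundaries are illustrative; this only sketches the procedure, it does not talk to vdsm):

```python
def ramp_plan(total_vms, stage_size=100, stress_until=200):
    """Build the staged spin-up plan: start VMs in batches of
    stage_size, stressing vdsm after each batch until stress_until
    VMs are running, then continue without the stress stage."""
    plan = []
    started = 0
    while started < total_vms:
        batch = min(stage_size, total_vms - started)
        started += batch
        plan.append((started, started <= stress_until))  # (vms running, stress?)
    return plan

print(ramp_plan(300))  # → [(100, True), (200, True), (300, False)]
```

Notably, in the run above the failure appeared in the final, unstressed stage, which points away from the stress load as the trigger.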

Comment 31 Eldad Marciano 2017-05-23 15:10:51 UTC
Created attachment 1281622 [details]
vdsm-4.19.12-1.el7ev.x86_64 vdsm logs

Comment 32 Eldad Marciano 2017-05-25 10:11:07 UTC
Michal, see the following libvirt errors for 250 VMs.

I see plenty of:
2017-05-24 15:28:19.794+0000: 41159: error : virNetlinkEventCallback:634 : nl_recv returned with error: No buffer space available

2017-05-24 15:28:19.794+0000: 41159: error : udevEventHandleCallback:1295 : internal error: udev_monitor_receive_device returned NULL

2017-05-24 15:28:20.519+0000: 41159: error : qemuMonitorIO:695 : internal error: End of file from monitor

2017-05-24 15:28:20.520+0000: 41159: error : qemuAgentIO:636 : internal error: End of file from monitor

2017-05-24 15:31:22.464+0000: 41162: error : virCgroupGetValueStr:826 : Unable to read from '/sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d374\x2dper374.scope/cpuacct.stat': No such file or directory

er374.scope/cpuacct.usage': No such file or directory
2017-05-24 15:31:22.464+0000: 41162: error : virFileReadAll:1360 : Failed to open file '/sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d374\x2dper374.scope/cpuacct.stat': No such file or directory

2017-05-24 15:31:22.464+0000: 41162: error : virNetInterfaceStats:119 : internal error: /proc/net/dev: Interface not found

2017-05-24 15:28:46.002+0000: 41159: error : qemuMonitorIORead:586 : Unable to read from monitor: Connection reset by peer
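Given the log volume mentioned in comment #19 (34 GB), a first triage step is tallying which error signatures dominate. A minimal hedged sketch over lines shaped like those quoted above (the regex assumes the libvirt log format shown; it is not an official parser):

```python
import re
from collections import Counter

# Matches libvirt error lines like:
# 2017-05-24 15:28:20.519+0000: 41159: error : qemuMonitorIO:695 : internal error: End of file from monitor
ERROR_RE = re.compile(r"error : (\w+):\d+ : (.*)")

def count_error_signatures(lines):
    """Tally (function, message) pairs from libvirt error lines."""
    counts = Counter()
    for line in lines:
        m = ERROR_RE.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts

sample = [
    "2017-05-24 15:28:20.519+0000: 41159: error : qemuMonitorIO:695 : internal error: End of file from monitor",
    "2017-05-24 15:28:20.520+0000: 41159: error : qemuAgentIO:636 : internal error: End of file from monitor",
    "2017-05-24 15:28:20.600+0000: 41159: error : qemuMonitorIO:695 : internal error: End of file from monitor",
]
for (func, msg), n in count_error_signatures(sample).most_common():
    print(n, func, msg)
```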

Comment 33 Michal Skrivanek 2017-05-29 06:54:26 UTC
Eldad,
Why did you stress vdsm with getIoTunePolicy? Those bugs should be fixed, and IMO you should be testing regular conditions.

Is it the same setup you ran 500 VMs on before? And now it can't run 250-295?

The logs you've attached are from the wrong period; please either attach the right ones or provide access to the system under test. Thanks

Comment 34 Eldad Marciano 2017-06-04 11:38:55 UTC
(In reply to Michal Skrivanek from comment #33)
> Eldad,
> why did you stress vdsm with getIoTunePolicy? Those bugs should be fixed and
> IMO you should be testing regular conditions.
> 
> Is it the same setup you've run 500 VMs before? And now it can't run with
> 250-295?
> 
> The log you're attached are from wrong period, please either attach the
> right one or provide access to the system under test. Thanks

Why do you consider it the wrong period? I grabbed it exactly when the problem occurred.

Comment 35 Michal Skrivanek 2017-06-04 11:43:24 UTC
comment #31 attachment contains a single log file with period
2017-05-22 13:01:01,651+0300 - 2017-05-23 18:01:45,678+0300
Your comment #32 talks about 2017-05-24

Please either attach the right one or provide access to the system under test. Thanks

Comment 36 Eldad Marciano 2017-06-04 14:02:50 UTC
(In reply to Michal Skrivanek from comment #35)
> comment #31 attachment contains a single log file with period
> 2017-05-22 13:01:01,651+0300 - 2017-05-23 18:01:45,678+0300
> Your comment #32 talks about 2017-05-24
> 
> Please either attach the right one or provide access to the system under
> test. Thanks

Yes Michal,
It is a different log, since we were asked to test with a higher log level and a newer version.

It looks like what interests us is the libvirt logs, since the problems start there.

If that's not enough, we need to reproduce it again and give you access to the env.

Currently I'm not sure what the priority/severity of this bug is.
Bottom line: VMs become unresponsive at high scale (~290 VMs on a single host with 64 cores).

Please set priority/severity.

Comment 37 Michal Skrivanek 2017-06-06 11:28:53 UTC
Eldad,
You quote errors which are not in the logs; how do you suppose we can analyze anything? There are a few (different) issues in the log attached in comment #31, but nothing serious.
You say it's not able to handle ~290 VMs, yet the original comment #0 says you ran into issues only with 500 VMs. How do you explain this difference?
Again, access would be best; otherwise please attach the relevant logs, including libvirt's.

Comment 38 Eldad Marciano 2017-06-20 21:10:09 UTC
(In reply to Michal Skrivanek from comment #37)
> Eldad,
> you quote errors which are not in the logs, how do you suppose we can
> analyze anything? There are few (different) issues in the log attached in
> comment #31 but nothing serious.
> You say it's not able to handle ~290 VMs, yet original comment #0 says
> you've run into issues only with 500 VMs. How do you explain this difference?
> Again, access would be best, otherwise please attach relevant logs including
> libvirt

Michal,
Per comment #37: as I stated, we were asked to test on top of a newer version at a higher scale, but we faced the issue at ~290 VMs,
so probably we have a regression.
Either way, the current state is that the issue appears around ~290 VMs, so let's progress from that number.
The last logs I've added are for 250 and 290 VMs.

In case you still need us to reproduce it and/or give you access,
please set target release and severity so Eyal can prepare accordingly.

Comment 39 Yaniv Kaul 2017-07-30 09:22:30 UTC
Closing. We'll need a re-test with the watermark patches (4.1.5, or first in the 4.2.0 upstream master branch) with ~250-300 VMs and see where we are at.
This assumes best-practices storage as well.

The issue in the last relevant comment (https://bugzilla.redhat.com/show_bug.cgi?id=1435207#c30) was, I believe, fixed in 4.1.3.

