Description of problem:
As part of the scale-up effort, when running ~500 VMs, vdsm reports plenty of warning messages:

2017-03-22 13:14:27,499+0200 WARN (jsonrpc/5) [virt.vm] (vmId='c6a196a0-23ac-483d-ba6e-29e66393e03c') monitor became unresponsive (command timeout, age=80.3999999994) (vm:5013)

The default timeout is 60 sec per config.py, so why are we 20 sec past it? Is there further information we can gather from the libvirt/qemu logs?

Host spec: 40 CPUs, 1TB RAM, connected to NFS storage.

This bug is pretty critical since, when the engine sees unknown VMs, it makes them unusable per its defensive policies.

Version-Release number of selected component (if applicable):
4.1.1
vdsm-4.19.9-1.el7ev.x86_64
libvirt-client-2.0.0-10.el7_3.4.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run ~500 VMs.

Actual results:
vdsm and the engine report unknown VMs, which in this state are unusable by the engine.

Expected results:
vdsm/libvirtd should be able to handle large-scale monitoring.

Additional info:
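For triage, the unresponsive-monitor warnings can be pulled out of vdsm.log with a short shell sketch. The sample line below is copied from this report; the /tmp path is only for illustration, on a real host point the pipeline at /var/log/vdsm/vdsm.log.

```shell
# Reproduce one warning line in the vdsm.log format (illustrative path).
cat > /tmp/vdsm-sample.log <<'EOF'
2017-03-22 13:14:27,499+0200 WARN (jsonrpc/5) [virt.vm] (vmId='c6a196a0-23ac-483d-ba6e-29e66393e03c') monitor became unresponsive (command timeout, age=80.3999999994) (vm:5013)
EOF

# List vmId and monitor age for every unresponsive-monitor warning,
# so the affected VMs and how far past the timeout they are stand out.
grep 'monitor became unresponsive' /tmp/vdsm-sample.log \
  | sed -E "s/.*vmId='([^']+)'.*age=([0-9.]+).*/\1 \2/"
```

Sorting the second column then shows whether the ages cluster just past 60 s (slow stats collection) or keep growing (stuck monitors).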
This seems to be a libvirt issue. Are we aware of any known issue in libvirt about it?
I do not think there is any issue other than simply hitting the performance limit. The question is whether we want to aim higher at the expense of less frequent stats.
Also, we should test collectd performance while we have the large-scale setup.
Eldad, would you mind providing both VDSM and libvirt logs (might need to enable it) ?
(In reply to Yaniv Kaul from comment #6)
> Eldad, would you mind providing both VDSM and libvirt logs (might need to
> enable it)?

Adding logs.
This bug reproduced with 500 real VMs, each with 512MB RAM and a single CPU. The bare-metal host has 40 cores and 1TB RAM.

In the libvirt/qemu logs the following warnings can be seen:

2017-05-17T11:50:28.133540Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2017-05-17T11:50:28.134423Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config

Attaching the full logs as well.
This warning message is expected, since the configuration of your VM is:

..... -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -numa node,nodeid=0,cpus=0,mem=512 .....

If you change it to use all the vCPUs (change to cpus=0-15), you will not hit the warning again:

..... -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -numa node,nodeid=0,cpus=0-15,mem=512 .....
Hi Tomas,

There is a bug fix in libvirt-2.0.0-10.el7_3.9 for bug 1442043:

Bug 1442043 - Default FD limit prevents running more than ~470 VMs

I'm not sure whether you've modified LimitNOFILE in /usr/lib/systemd/system/libvirtd.service manually; otherwise you may hit the same problem: failing to start VMs at around 500 VMs.

Maybe you can update libvirt to the latest version and collect the libvirtd log with these settings:

1) In /etc/libvirt/libvirtd.conf:
log_level = 1
log_filters = "1:qemu 1:libvirt 1:json 1:monitor 3:event 1:util 1:security"
log_outputs = "1:file:/var/log/libvirt/libvirtd.log"

2) Restart the libvirtd service:
# service libvirtd restart

Please upload libvirtd.log if you still hit this problem with the latest libvirt. Thank you!

Regards,
chhu
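As a rough sanity check against the FD-limit bug above, one can estimate the descriptors libvirtd needs for N VMs and compare that against the service limit. The per-VM figure and the 8192 limit below are illustrative assumptions, not numbers taken from bug 1442043:

```shell
# Illustrative estimate: assume ~3 open descriptors per running VM
# (monitor socket, agent socket, log file) plus fixed daemon overhead.
FDS_PER_VM=3
OVERHEAD=100
VMS=500
NEEDED=$(( VMS * FDS_PER_VM + OVERHEAD ))
echo "estimated FDs for $VMS VMs: $NEEDED"

# On a real host, compare against the effective limit and raise it with a
# systemd drop-in rather than editing the unit file under /usr/lib:
#   systemctl show libvirtd -p LimitNOFILE
#   mkdir -p /etc/systemd/system/libvirtd.service.d
#   printf '[Service]\nLimitNOFILE=8192\n' \
#     > /etc/systemd/system/libvirtd.service.d/limits.conf
#   systemctl daemon-reload && systemctl restart libvirtd
```

The drop-in approach survives package updates, which is why it is preferable to modifying libvirtd.service in place.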
Hi chhu, where can I get the latest version?
Hi Tomas,

Sent email to you with the latest version repo.

Regards,
chhu
(In reply to chhu from comment #18)
> Hi, Tomas
>
> Sent email to you for the latest version repo.
>
> Regards,
> chhu

Hey, we have set up the latest version of libvirt and extended the debug log levels. Currently the log size is 34GB. Maybe we should consider using just ERROR as the log level?
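A sketch of a less verbose /etc/libvirt/libvirtd.conf for this case (libvirt log levels are 1=debug, 2=info, 3=warning, 4=error; keeping debug only for the qemu driver is an assumption about what this bug needs, not an agreed setting):

```
# Global level 4 = errors only, so the log stays small at scale;
# the filter keeps full debug for the qemu driver, where the
# unresponsive-monitor problem lives.
log_level = 4
log_filters = "1:qemu"
log_outputs = "1:file:/var/log/libvirt/libvirtd.log"
```

Restart libvirtd after changing the file for the settings to take effect.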
Eldad, rather than the size of the logs, the more interesting question is whether you still experience the problem with the latest libvirt (which version is that?).

As for comment #13 - that is bug 1437559.
(In reply to Michal Skrivanek from comment #20)
> Eldad, so rather than the size of the logs, more interesting question is
> whether you still experience the problem in latest libvirt (which one is
> that?)
> as for comment #13 - that is bug 1437559

Seems like yes:

scale_-136.log:2017-05-18T20:22:32.072283Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
scale_-136.log:2017-05-18T20:22:32.072936Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config
scale_-13.log:2017-05-18T20:22:24.907961Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
scale_-13.log:2017-05-18T20:22:24.908585Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config

libvirt version: libvirt-client-2.0.0-10.el7_3.9.x86_64
Right, so again, the interesting question to answer is whether all your problems have been solved by the upgrade to libvirt 2.0.0-10.el7_3.9.
(In reply to Michal Skrivanek from comment #22)
> right, so again, the interesting question to answer is whether all your
> problems have been solved be upgrade to libvirt 2.0.0-10.el7_3.9

Hi Michal, they have not, as mentioned in comment #21.
Eldad, in comment #21 you only mention the NUMA config, which is explained in comment #20. So what issues do you still experience?
Hi, we are still experiencing non-responsive VMs, and:

scale_-136.log:2017-05-18T20:22:32.072283Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
scale_-136.log:2017-05-18T20:22:32.072936Z qemu-kvm: warning: All CPU(s) up to
(In reply to Eldad Marciano from comment #25)
> Hi, for we experiencing non responsive vms
> and
> scale_-136.log:2017-05-18T20:22:32.072283Z qemu-kvm: warning: CPU(s) not
> present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> scale_-136.log:2017-05-18T20:22:32.072936Z qemu-kvm: warning: All CPU(s) up
> to

OK. So let's focus on the non-responsive VMs. As noted before, it may be bug 1419856 - do you still see the problem when running with that fix included?
(In reply to Michal Skrivanek from comment #26)
> OK. So let's focus on the non responsive VMs. As noted before it may be the
> bug 1419856 - do you still see the problem when running with this fix
> included?

It still reproduces on vdsm-4.19.12-1.el7ev.x86_64. Is the fix for bug 1419856 included in 4.19.12?
OK. It should be included, yes. So please attach new logs from vdsm 4.19.12 and libvirt 2.0.0-10.el7_3.9.x86_64 or newer. Thanks.
(In reply to Michal Skrivanek from comment #28)
> Ok. It should be oncluded, yes.
> So please attach new logs from vdsm 4.19.12 and libvirt
> 2.0.0-10.el7_3.9.x86_64 or newer. Thanks

Due to a HW change we need to reproduce it; will update as soon as possible.
Hi Michal, we reproduced the bug with 300 VMs.

We spun up VMs in stages of 100, and on each cycle we stressed vdsm with getAllVmIoTunePolicies.

Results:
0-100 + stress - passed.
100-200 + stress - passed.
250-295 - failed on idle, without the stress stage (i.e. without stressing MOM).

All the VMs are running an idle OS. Attaching the vdsm log. Currently libvirt is not logging; I'll set up the libvirt logs as well, but meanwhile we can progress with vdsm.
Created attachment 1281622 [details]
vdsm-4.19.12-1.el7ev.x86_64 vdsm logs
Michal, see the following libvirt errors for 250 VMs. I see plenty of:

2017-05-24 15:28:19.794+0000: 41159: error : virNetlinkEventCallback:634 : nl_recv returned with error: No buffer space available
2017-05-24 15:28:19.794+0000: 41159: error : udevEventHandleCallback:1295 : internal error: udev_monitor_receive_device returned NULL
2017-05-24 15:28:20.519+0000: 41159: error : qemuMonitorIO:695 : internal error: End of file from monitor
2017-05-24 15:28:20.520+0000: 41159: error : qemuAgentIO:636 : internal error: End of file from monitor
2017-05-24 15:31:22.464+0000: 41162: error : virCgroupGetValueStr:826 : Unable to read from '/sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d374\x2dper374.scope/cpuacct.stat': No such file or directory
er374.scope/cpuacct.usage': No such file or directory
2017-05-24 15:31:22.464+0000: 41162: error : virFileReadAll:1360 : Failed to open file '/sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d374\x2dper374.scope/cpuacct.stat': No such file or directory
2017-05-24 15:31:22.464+0000: 41162: error : virNetInterfaceStats:119 : internal error: /proc/net/dev: Interface not found
2017-05-24 15:28:46.002+0000: 41159: error : qemuMonitorIORead:586 : Unable to read from monitor: Connection reset by peer
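To see which subsystem dominates, errors like the above can be tallied by the libvirt function that reported them. A minimal sketch (the sample lines are copied from this comment; the /tmp path is illustrative, on a real host use /var/log/libvirt/libvirtd.log):

```shell
# A few error lines in the libvirtd.log format (copied from this report).
cat > /tmp/libvirtd-sample.log <<'EOF'
2017-05-24 15:28:19.794+0000: 41159: error : virNetlinkEventCallback:634 : nl_recv returned with error: No buffer space available
2017-05-24 15:28:20.519+0000: 41159: error : qemuMonitorIO:695 : internal error: End of file from monitor
2017-05-24 15:28:46.002+0000: 41159: error : qemuMonitorIORead:586 : Unable to read from monitor: Connection reset by peer
EOF

# Tally error lines by reporting function: a spike in the qemuMonitor*
# counts points at monitor trouble, virNetlink* at udev event overload.
grep ' error : ' /tmp/libvirtd-sample.log \
  | sed -E 's/.* error : ([A-Za-z]+):[0-9]+ .*/\1/' \
  | sort | uniq -c
```

On this sample each function appears once; on a full log from the 250-VM run the counts would show which error class is actually flooding the daemon.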
Eldad,
why did you stress vdsm with getIoTunePolicy? Those bugs should be fixed, and IMO you should be testing regular conditions.

Is it the same setup on which you've run 500 VMs before? And now it can't run with 250-295?

The logs you attached are from the wrong period; please either attach the right ones or provide access to the system under test. Thanks.
(In reply to Michal Skrivanek from comment #33)
> Eldad,
> why did you stress vdsm with getIoTunePolicy? Those bugs should be fixed and
> IMO you should be testing regular conditions.
>
> Is it the same setup you've run 500 VMs before? And now it can't run with
> 250-295?
>
> The log you're attached are from wrong period, please either attach the
> right one or provide access to the system under test. Thanks

Why do you consider it the wrong period? I grabbed it exactly when the problem occurred.
The comment #31 attachment contains a single log file covering the period 2017-05-22 13:01:01,651+0300 - 2017-05-23 18:01:45,678+0300, while your comment #32 talks about 2017-05-24.

Please either attach the right one or provide access to the system under test. Thanks.
(In reply to Michal Skrivanek from comment #35)
> comment #31 attachment contains a single log file with period
> 2017-05-22 13:01:01,651+0300 - 2017-05-23 18:01:45,678+0300
> Your comment #32 talks about 2017-05-24
>
> Please either attach the right one or provide access to the system under
> test. Thanks

Yes Michal, it is a different log, since we were asked to test with a higher log level and a newer version.

It looks like what interests us is the libvirt logs, since the problems start there. If that's not enough, we need to reproduce it again and give you access to the env.

Currently I'm not sure what the priority/severity of this bug is. Bottom line: VMs become unresponsive at high scale (~290 VMs on a single host with 64 cores). Please set the priority and severity.
Eldad,
you quote errors which are not in the logs - how do you suppose we can analyze anything? There are a few (different) issues in the log attached in comment #31, but nothing serious.

You say it's not able to handle ~290 VMs, yet the original comment #0 says you ran into issues only with 500 VMs. How do you explain this difference?

Again, access would be best; otherwise please attach the relevant logs, including libvirt's.
(In reply to Michal Skrivanek from comment #37)
> Eldad,
> you quote errors which are not in the logs, how do you suppose we can
> analyze anything? There are few (different) issues in the log attached in
> comment #31 but nothing serious.
> You say it's not able to handle ~290 VMs, yet original comment #0 says
> you've run into issues only with 500 VMs. How do you explain this difference?
> Again, access would be best, otherwise please attach relevant logs including
> libvirt

Michal, per comment #37: as I stated, we were asked to test on top of a newer version and at higher scale, but we hit the issue at ~290 VMs, so we probably have a regression. Either way, the current state is that the issue appears around ~290 VMs, so let's progress from that number. The last logs I added are for 250 and 290 VMs.

In case you still need us to reproduce it and/or give you access, please set the target release and severity so that Eyal can prepare accordingly.
Closing. We'll need a re-test with the watermark patches (4.1.5, or first in the 4.2.0 upstream master branch) at ~250-300 VMs to see where we are. This assumes best-practice storage as well. The last relevant comment (https://bugzilla.redhat.com/show_bug.cgi?id=1435207#c30) describes an issue that I believe was fixed in 4.1.3.