Description of problem:
Many VMs are up and running, yet the engine reports 0% memory utilization for them. When I connected to the host, I found the qemu (VM) process running with ~400 MB RES memory, and I also successfully connected to the VM via SSH. There is no evidence of any such issue in the vdsm logs, and the engine does not log anything related either. Further investigation is required.

VMs were created from a template. VM configuration: 1 CPU, 256 MB RAM, 10 GB thin-provisioned disk.

Currently reproduced on a scale setup of ~3600 VMs.

Version-Release number of selected component (if applicable):
3.5 VT13.1

How reproducible:
Unknown

Steps to Reproduce:
1. Scale setup.
2. Ramp up VMs via the API.

Actual results:
Many VMs are in Up status with a reported memory utilization of 0%, which is not true: when connecting to the host, the VMs are running with ~400 MB RES memory, and I was able to connect to such a VM via SSH. Both the VMs and the hosts appear to be running fine.

Expected results:
Memory utilization reported for VMs matches actual usage.

Additional info:
[root@host04-rack06 ~]# vdsClient -s 0 getVmStats 6e4c731c-958e-4ee2-9a45-1aa763a5f238

6e4c731c-958e-4ee2-9a45-1aa763a5f238
	Status = Up
	displayInfo = [{'tlsPort': '5913', 'ipAddress': '10.1.64.24', 'port': '-1', 'type': 'spice'}]
	memUsage = 0
	acpiEnable = true

vdsm reports 0, so it is either a vdsm bug or a libvirt bug.
Adding some relevant outputs:

vdsClient -s 0 getVmStats afb0bd15-bd7b-469b-92f0-a59cea1b8161 reports this:

afb0bd15-bd7b-469b-92f0-a59cea1b8161
	Status = Up
	displayInfo = [{'tlsPort': '5900', 'ipAddress': '10.1.64.24', 'port': '-1', 'type': 'spice'}]
	memUsage = 0

======================================

At the same time, top -p `ps -ef | grep 'afb0bd15-bd7b-469b-92f0-a59cea1b8161'` reports a RES memory of 405 MB:

Swap:        0k total,        0k used,        0k free,   700376k cached

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
21528 qemu  20  0 2248m 405m 6296 S  1.6  0.6 127:49.92 /usr/libexec/qemu-kvm -name vm_real_1_1 -S -M rhel6.5.0 -cpu SandyBridge -enable-kvm -m 256
It looks like the guest agent stopped reporting some things (networks, memory usage, ...). We don't know why yet.
What we have figured out so far: the issue seems to be caused by running a larger number of VMs on the affected host. The connections to the virtio serial channels stay established, but it appears that no data coming from the guests is being handled anymore. Reducing the number of VMs on that particular host from 100 to 10 restored the communication to normal levels. VDSM did not show any errors in the logs at any time, nor did we encounter any other reported issues. One theory, which has yet to be proven, is that the guest agent listener thread starves on CPU shares and does not get time to handle the incoming data, due to the huge number of threads for statistics etc.
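The starvation theory above can be illustrated with a toy queueing model (hypothetical, not vdsm code): a single consumer thread drains events that arrive at fixed intervals, and once handling a single event takes longer than the gap between arrivals, the backlog grows without bound:

```python
def max_backlog(n_events, arrival_interval, handle_cost):
    """Toy model of one listener thread serving events FIFO.

    Events arrive every `arrival_interval` seconds; handling one takes
    `handle_cost` seconds. Returns the largest queue length observed.
    """
    finish_times = []
    prev_finish = 0.0
    max_q = 0
    for i in range(n_events):
        t = i * arrival_interval          # arrival time of event i
        start = max(t, prev_finish)       # consumer is busy until prev_finish
        prev_finish = start + handle_cost
        finish_times.append(prev_finish)
        # events that have arrived but are not yet fully handled at time t
        q = sum(1 for f in finish_times if f > t)
        max_q = max(max_q, q)
    return max_q

# A fast handler keeps the queue flat; a slow one lets it snowball.
print(max_backlog(100, 1.0, 0.5))   # stays at 1
print(max_backlog(100, 1.0, 2.0))   # grows with the number of events
```

The numbers (arrival interval, per-event cost) are made up; the point is only the qualitative behavior: once per-event cost crosses the arrival rate, more load means a longer queue, which matches the "less work done per timeslice" snowball described later in this thread.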
Eldad agreed to run a script which gradually adds more VMs to the host until the reported issue starts to emerge again, so we can see at what point (how many VMs running) the issue starts to show. Eldad, please post the conclusion of this test here. Thanks.
Findings: I have tested two different hosts whose VMs are up with memUsage 0. VM (guest) capacity is probably not a factor in this defect.

Test case:
- Shut down VMs (in bulks of 10 per iteration).
- Restart vdsm.
- Expect all VMs to report statistics.

Host A, with 110 VMs: when I shut down 10 VMs and restarted vdsm, all the VMs on the host started reporting statistics.

Host B, with 95 VMs: when I shut down 10 VMs, the VMs still reported wrong statistics.

Still debugging; suspecting virtual memory leaks / wrong locking.
(In reply to Eldad Marciano from comment #6)
> findings:
> I have tested 2 different hosts which their vms up with memusage 0.
> vms (guest) capacity, probably not a factor at this defect.
>
> test case:
> - shutdown vms (10 vms bulk each iteration).
> - restart vdsm.
> - expecting all vms reports statistics.
>
> host A with 110 vms.
> when i shut down 10 vms and restart vdsm all the vms at the host start
> report statistics.
>
> host B with 95 vms.
> when i shutdown 10 vms. vms still reports wrong statistics.
>
> keep debugging
> suspecting virtual memory leaks \ wrong locking.

This is very interesting information. One quick question: is 'memUsage' the only misreported field? Are the other guest-agent-related fields correct?

Thanks for the investigation!
(In reply to Francesco Romani from comment #7)
> This is very interesting information.
> One quick question. Is 'memUsage' the only misreported fields? Other
> guest-agent-related fields are correct?
>
> Thanks for the investigation!

There are more missing fields.

There are no network interfaces reported, no disksUsage field, the FQDN seems to be missing, and there are no memory statistics, plus probably some others which have not yet come to my attention. It's basically just partial information.
(In reply to Vinzenz Feenstra [evilissimo] from comment #8)
> There are more missing fields.
>
> There are no network interfaces reported, no disksUsage field, the FQDN
> seems to be missing, no memory statistics, and probably some other ones
> which did not yet pop up to my attention so far. It's basically just partial
> information.

+1
While this issue might reflect on scheduling, I'm setting this BZ as a blocker and moving it back to 3.5.0.
Testing some threading logic in the '_wait_for_events()' method, since some event handling is getting stuck:

    if events:
        for (fileno, event) in events:
            __wait_for_events = threading.Thread(
                name='wait_for_events-%s' % fileno,
                target=self._handle_event,
                args=(fileno, event, blockts))
            __wait_for_events.start()

Results: faster event handling and a smaller event queue per VM channel. The issue is still not resolved; some VMs are still reporting memUsage 0.
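For reference, a self-contained sketch of that thread-per-event idea (the function name and handler signature here are hypothetical; the real code lives in vdsm's vmchannels.py):

```python
import threading

def handle_events_threaded(events, handler):
    """Spawn one short-lived thread per (fileno, event) pair so that a slow
    handler no longer blocks the events that arrive after it."""
    threads = []
    for fileno, event in events:
        t = threading.Thread(
            name='wait_for_events-%s' % fileno,
            target=handler,
            args=(fileno, event))
        t.start()
        threads.append(t)
    # wait for all handlers to finish before returning
    for t in threads:
        t.join()
```

This trades queueing delay for thread-creation overhead, which matters when hundreds of events arrive per polling cycle; the discussion further down in this thread covers why that trade-off is questionable at scale.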
It looks like using threads for 'handle_event' resolves the problem: all VMs are reporting data, and events are processed much faster.
Thanks Eldad! Then this is not a regression, but existing behavior on highly loaded hosts. This is a scaling issue. Suggesting we remove the 3.5 blocker.
Yuri, we are trying to determine whether this issue is a regression from a previous build. Did you see this problem with an older 3.5 build, when testing ~100 VMs?
Note there hasn't been a change in this code for several releases. It is also related to the performance of that box, so it may happen with a different number of VMs on different hardware.
I didn't encounter that problem with 200 VMs using RHEV-M 3.5.0.22 (vt12)
Michal, while we never saw it in 3.4 or on older 3.5 builds, there is a good chance this problem was caused by some change made in one of the latest builds.

I suggest we keep it as a blocker for now, because IIUC, not fixing this for GA means we won't be able to use memory-based policies as soon as this issue appears.
Created attachment 972031 [details] gerrit http://gerrit.ovirt.org/#/c/36342/
(In reply to Gil Klein from comment #18)
> Michal, while we never saw it in 3.4 and not on older 3.5 builds, there is a
> good chance this problem was caused by some change made on one of the latest
> build.
>
> I suggest we keep it as blocker for now, cause IIUC, not fixing this for GA
> means we won't be able to use memory based polices at soon as this issue
> appears.

IMO this is specific to Eldad's host; there has been no change in a very long time, and Yuri confirms he hasn't seen the problem on a different scaling setup. Also, results vary across Eldad's hosts. It should be easy to retry on the same hosts with vt12 to verify it's not a regression. Can you please downgrade just vdsm?
Also, please attach the host specs and the actual load while the test is running. It would be interesting to compare.
Another interesting input:

I have patched all of my hosts, expecting all VMs to report data. Once the patch was applied, all vdsm instances were restarted. Once the hosts were aligned, I found some hosts whose VMs kept reporting memUsage 0.

So I tried to restart again and double-checked the vmchannel patch (restarting does not help).

For some reason the supervdsm process was a zombie. Once I killed it and restarted vdsm again, all VMs reported data as expected.

=====================
Host specs:
24 cores
64 GB RAM
no swap
RAID 1, 1 TB disk space
(In reply to Eldad Marciano from comment #22)
> another interesting input:
>
> i have patched all of my hosts, expecting all vm reporting data.
> once patch applied all vdsm restarted.
> once hosts has been aligned i found some hosts which keep reporting memusage
> zero for their vms.
>
> so i tired to restart again and double check the vmchannel patch (restart
> does not help).
>
> from some reason supervdsm process was a zombie. once i killed him and
> restart again vdsm
> all vms report back data as expected.
>
> =====================
> hosts specs:
> 24 cores
> 64 RAM
> no swap.
> RAID 1 1TB disk space.

The zombie process failure is related to a different bug: https://bugzilla.redhat.com/show_bug.cgi?id=1176964. Please ignore it here. With the patch applied on all hosts, all VMs report data as expected.
Michal, based on comment #23 it seems the fix suggested in comment #12 solves this problem. Would it be possible to submit a patch so we have it fixed for 3.5?
That's good news. However, I wouldn't do that without further testing and review by Vinzenz. This is not a regression but most likely existing behavior; I would wait for 3.5.z.
(In reply to Eldad Marciano from comment #13)
> looks like using threads for 'handle_event' resolve the problem.
> all vms reporting data.
>
> and event processed extremely faster.

Just to discuss this matter a bit and to share thoughts (since I see the patch was abandoned): I don't think this is the right approach. Let me elaborate. What we saw happen is, roughly:

- Until VDSM load (CPU usage) is under a given, yet unknown, threshold, everything seems to work and events are processed in a timely manner.
- When VDSM becomes too overloaded, the thread in charge of fetching and processing events slows down, i.e. the processing of a single event starts to take too long.
- Thus, when the events thread (vmchannels.py, Listener) gets a chance to run, it has more events to process (they accumulated in the meantime) and processing each one takes longer. Net result: in a given, fixed-length timeslice, LESS work gets done and MORE CPU load is put on VDSM.
- The process described above snowballs, getting worse and worse as time passes and load increases (more VMs).

Now, Eldad's patch *is* beneficial in a way, because it prevents the events from accumulating. Since the processing is done on separate threads, events are grabbed quickly enough, so they don't pile up. This is consistent with what Eldad observed, and can give the impression of faster processing.

But I don't see how this could possibly improve the other half of the problem, i.e. making the processing of a *single* event faster. Quite the opposite: now more and more threads are spawned, so the actual load on the VDSM process *increases*. The default Python VM is notoriously not that brilliant (to be extremely kind) at handling so many threads, and we already have too many of them.

For this reason I don't really like that patch (and/or I need more data to change my mind).
Just to brainstorm, I believe a better approach, both short and long term, could be to spawn that code in a separate process, which the main VDSM process would poll periodically. HTH
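A minimal sketch of that separate-process idea, assuming a pipe between the main process and a listener child (all names are hypothetical; this is not the actual vdsm design):

```python
import multiprocessing as mp

def agent_listener(conn):
    """Child process: in a real design this would poll the virtio serial
    channels and parse guest agent messages; here it just sends one canned
    stats sample to demonstrate the plumbing."""
    conn.send({'memUsage': 37, 'netIfaces': []})
    conn.close()

def poll_guest_stats():
    """Main-process side: spawn the listener and collect its results,
    keeping the heavy message handling out of the main process."""
    parent_end, child_end = mp.Pipe()
    worker = mp.Process(target=agent_listener, args=(child_end,))
    worker.start()
    stats = parent_end.recv()   # the main loop would poll this periodically
    worker.join()
    return stats
```

The win would be isolation: a starved or GIL-bound listener no longer competes with the statistics threads inside one interpreter, and the parent only pays for a cheap recv() per poll.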
There seems to be too high CPU usage by the vdsm process, which looks suspicious. We used to have no issues with even more VMs (i.e. more threads), so it may be related to that. Let's investigate why it is so hard on the CPU. In parallel, Francesco is working on thread reduction, but that is a long-term change.
Francesco / Michal, I think Francesco's suggestion is a good idea. Separate processes for the main loop and for polling would probably yield better results, especially at scale.
One thing we found is that we spend quite a lot of calls on the filtering of strings for not-supported/allowed unicode characters. This is a recursive process going deep into the message objects after they have been decoded. We should try to optimize this part of the guest data processing, which might help increase the responsiveness and event-handling performance of the guest agent communication.
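To make the cost concrete, here is a simplified stand-in for such a recursive filter (the actual banned character set and traversal live in vdsm's guest agent code; this version is only illustrative). Every string in every nested dict and list gets rebuilt character by character, which is where the call count comes from:

```python
def filter_chars(obj, banned=frozenset('\x00\x0b\x0c')):
    """Recursively walk a decoded guest agent message and drop characters
    from `banned` (a hypothetical set) out of every string it contains."""
    if isinstance(obj, str):
        return ''.join(c for c in obj if c not in banned)
    if isinstance(obj, dict):
        return dict((filter_chars(k, banned), filter_chars(v, banned))
                    for k, v in obj.items())
    if isinstance(obj, list):
        return [filter_chars(item, banned) for item in obj]
    return obj                     # numbers, None, etc. pass through
```

One optimization along the lines suggested above would be to skip the rebuild entirely when no banned character is present (a cheap membership pre-check per string), since the vast majority of guest messages are clean.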
Please note that this is not the only performance-related problem we have found here. Please see https://bugzilla.redhat.com/show_bug.cgi?id=1177634 for more information.
Since this is related to an overloaded host and is not a new issue, I'm going to close this as WONTFIX: there is no real issue with the code other than that it starts to break under extreme conditions. Currently the reason for these conditions is bug 1177634, and we believe that once it is fixed we will no longer see problems in guest agent reporting.

A separate process might be a solution, but we won't invest time into that until it is inevitable. For now, addressing bug 1177634 should be enough.