Description of problem:
Many VMs are up and running, yet the engine reports 0% memory utilization for them. When I connected to the host, I found the qemu (VM) process running with ~400 MB RES memory, and I also successfully connected to the VM via SSH. There is no evidence of any such issue in the vdsm logs, and the engine does not log anything related either. Further investigation is required.

VMs were created from a template. VM configuration: 1 CPU, 256 MB RAM, 10 GB thin-provisioned disk.

Currently reproduced on a scale setup of ~3600 VMs.

Version-Release number of selected component (if applicable):
3.5 VT13.1

How reproducible:
Unknown

Steps to Reproduce:
1. Scale setup.
2. Ramp up VMs via the API.

Actual results:
Many VMs are in Up status with a reported memory utilization of 0%, which is not true: when connecting to the host, the VMs are running with ~400 MB RES memory, and I was able to connect to such a VM via SSH. Both the VMs and the hosts appear to be running fine.

Expected results:
Memory utilization reported for VMs matches actual usage.

Additional info:
[root@host04-rack06 ~]# vdsClient -s 0 getVmStats 6e4c731c-958e-4ee2-9a45-1aa763a5f238

6e4c731c-958e-4ee2-9a45-1aa763a5f238
	Status = Up
	displayInfo = [{'tlsPort': '5913', 'ipAddress': '10.1.64.24', 'port': '-1', 'type': 'spice'}]
	memUsage = 0
	acpiEnable = true

vdsm reports 0, so it is either a vdsm bug or a libvirt bug.
Adding some relevant outputs:

vdsClient -s 0 getVmStats afb0bd15-bd7b-469b-92f0-a59cea1b8161 reports this:

afb0bd15-bd7b-469b-92f0-a59cea1b8161
	Status = Up
	displayInfo = [{'tlsPort': '5900', 'ipAddress': '10.1.64.24', 'port': '-1', 'type': 'spice'}]
	memUsage = 0

======================================

At the same time, top -p `ps -ef | grep 'afb0bd15-bd7b-469b-92f0-a59cea1b8161'` reports a RES memory of 405 MB:

Swap:        0k total,        0k used,        0k free,   700376k cached

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
21528 qemu  20  0 2248m 405m 6296 S  1.6  0.6 127:49.92 /usr/libexec/qemu-kvm -name vm_real_1_1 -S -M rhel6.5.0 -cpu SandyBridge -enable-kvm -m 256
It looks like the guest agent stopped reporting some things (networks, memory usage, ...). We don't know why yet.
What we have figured out so far: the issue seems to be caused by running a larger number of VMs on the affected host. The connections to the virtio serial channels stay established, but it appears that no data coming from the guests is being handled anymore. Reducing the number of VMs on that particular host from 100 to 10 restored the communication to normal levels. VDSM did not show any errors in the logs at any time, nor did we encounter any other reported issues. One theory, which has yet to be proven, is that the guest agent listener thread starves on CPU shares and does not get time to handle the incoming data, due to the huge number of threads for statistics etc.
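The starvation theory above can be illustrated with a toy queueing model (hypothetical, not vdsm code): a single consumer thread drains events that arrive at fixed intervals, and once handling a single event takes longer than the gap between arrivals, the backlog grows without bound:

```python
def max_backlog(n_events, arrival_interval, handle_cost):
    """Toy model of one listener thread serving events FIFO.

    Events arrive every `arrival_interval` seconds; handling one takes
    `handle_cost` seconds. Returns the largest queue length observed.
    """
    finish_times = []
    prev_finish = 0.0
    max_q = 0
    for i in range(n_events):
        t = i * arrival_interval          # arrival time of event i
        start = max(t, prev_finish)       # consumer is busy until prev_finish
        prev_finish = start + handle_cost
        finish_times.append(prev_finish)
        # events that have arrived but are not yet fully handled at time t
        q = sum(1 for f in finish_times if f > t)
        max_q = max(max_q, q)
    return max_q

# A fast handler keeps the queue flat; a slow one lets it snowball.
print(max_backlog(100, 1.0, 0.5))   # stays at 1
print(max_backlog(100, 1.0, 2.0))   # grows with the number of events
```

The numbers (arrival interval, per-event cost) are made up; the point is only the qualitative behavior: once per-event cost crosses the arrival rate, more load means a longer queue, which matches the "less work done per timeslice" snowball described later in this thread.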
Eldad agreed to run a script which gradually adds more VMs to the host until the reported issue starts to emerge again, so we can see at what point (how many VMs running) the issue starts to show. Eldad, please post the conclusion of this test here. Thanks.
Findings: I have tested two different hosts whose VMs are up with memUsage 0. VM (guest) capacity is probably not a factor in this defect.

Test case:
- Shut down VMs (in bulks of 10 per iteration).
- Restart vdsm.
- Expect all VMs to report statistics.

Host A, with 110 VMs: when I shut down 10 VMs and restarted vdsm, all the VMs on the host started reporting statistics.

Host B, with 95 VMs: when I shut down 10 VMs, the VMs still reported wrong statistics.

Still debugging; suspecting virtual memory leaks / wrong locking.
(In reply to Eldad Marciano from comment #6)
> findings:
> I have tested 2 different hosts which their vms up with memusage 0.
> vms (guest) capacity, probably not a factor at this defect.
>
> test case:
> - shutdown vms (10 vms bulk each iteration).
> - restart vdsm.
> - expecting all vms reports statistics.
>
> host A with 110 vms.
> when i shut down 10 vms and restart vdsm all the vms at the host start
> report statistics.
>
> host B with 95 vms.
> when i shutdown 10 vms. vms still reports wrong statistics.
>
> keep debugging
> suspecting virtual memory leaks \ wrong locking.

This is very interesting information. One quick question: is 'memUsage' the only misreported field? Are the other guest-agent-related fields correct?

Thanks for the investigation!
(In reply to Francesco Romani from comment #7)
> This is very interesting information.
> One quick question. Is 'memUsage' the only misreported fields? Other
> guest-agent-related fields are correct?
>
> Thanks for the investigation!

There are more missing fields.

There are no network interfaces reported, no disksUsage field, the FQDN seems to be missing, and there are no memory statistics, plus probably some others which have not yet come to my attention. It's basically just partial information.
(In reply to Vinzenz Feenstra [evilissimo] from comment #8)
> There are more missing fields.
>
> There are no network interfaces reported, no disksUsage field, the FQDN
> seems to be missing, no memory statistics, and probably some other ones
> which did not yet pop up to my attention so far. It's basically just partial
> information.

+1
While this issue might reflect on scheduling, I'm setting this BZ as a blocker and moving it back to 3.5.0.
Testing some threading logic in the '_wait_for_events()' method, since some event handling is getting stuck:

    if events:
        for (fileno, event) in events:
            __wait_for_events = threading.Thread(
                name='wait_for_events-%s' % fileno,
                target=self._handle_event,
                args=(fileno, event, blockts))
            __wait_for_events.start()

Results: faster event handling and a smaller event queue per VM channel. The issue is still not resolved; some VMs are still reporting memUsage 0.
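For reference, a self-contained sketch of that thread-per-event idea (the function name and handler signature here are hypothetical; the real code lives in vdsm's vmchannels.py):

```python
import threading

def handle_events_threaded(events, handler):
    """Spawn one short-lived thread per (fileno, event) pair so that a slow
    handler no longer blocks the events that arrive after it."""
    threads = []
    for fileno, event in events:
        t = threading.Thread(
            name='wait_for_events-%s' % fileno,
            target=handler,
            args=(fileno, event))
        t.start()
        threads.append(t)
    # wait for all handlers to finish before returning
    for t in threads:
        t.join()
```

This trades queueing delay for thread-creation overhead, which matters when hundreds of events arrive per polling cycle; the discussion further down in this thread covers why that trade-off is questionable at scale.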
It looks like using threads for 'handle_event' resolves the problem: all VMs are reporting data, and events are processed much faster.
Thanks Eldad! Then this is not a regression, but existing behavior on highly loaded hosts. This is a scaling issue. Suggesting we remove the 3.5 blocker.
Yuri, we are trying to determine whether this issue is a regression from a previous build. Did you see this problem with an older 3.5 build, when testing ~100 VMs?
Note there hasn't been a change in this code for several releases. It is also related to the performance of that box, so it may happen with a different number of VMs on different hardware.
I didn't encounter that problem with 200 VMs using RHEV-M 3.5.0.22 (vt12)
Michal, while we never saw it in 3.4 or on older 3.5 builds, there is a good chance this problem was caused by some change made in one of the latest builds.

I suggest we keep it as a blocker for now, because IIUC, not fixing this for GA means we won't be able to use memory-based policies as soon as this issue appears.
Created attachment 972031 [details] gerrit http://gerrit.ovirt.org/#/c/36342/
(In reply to Gil Klein from comment #18)
> Michal, while we never saw it in 3.4 and not on older 3.5 builds, there is a
> good chance this problem was caused by some change made on one of the latest
> build.
>
> I suggest we keep it as blocker for now, cause IIUC, not fixing this for GA
> means we won't be able to use memory based polices at soon as this issue
> appears.

IMO this is specific to Eldad's host; there has been no change in a very long time, and Yuri confirms he hasn't seen the problem on a different scaling setup. Also, results vary across Eldad's hosts. It should be easy to retry on the same hosts with vt12 to verify it's not a regression. Can you please downgrade just vdsm?
Also, please attach the host specs and the actual load while the test is running. It would be interesting to compare.
Another interesting input:

I have patched all of my hosts, expecting all VMs to report data. Once the patch was applied, all vdsm instances were restarted. Once the hosts were aligned, I found some hosts whose VMs kept reporting memUsage 0.

So I tried to restart again and double-checked the vmchannel patch (restarting does not help).

For some reason the supervdsm process was a zombie. Once I killed it and restarted vdsm again, all VMs reported data as expected.

=====================
Host specs:
24 cores
64 GB RAM
no swap
RAID 1, 1 TB disk space
(In reply to Eldad Marciano from comment #22)
> another interesting input:
>
> i have patched all of my hosts, expecting all vm reporting data.
> once patch applied all vdsm restarted.
> once hosts has been aligned i found some hosts which keep reporting memusage
> zero for their vms.
>
> so i tired to restart again and double check the vmchannel patch (restart
> does not help).
>
> from some reason supervdsm process was a zombie. once i killed him and
> restart again vdsm
> all vms report back data as expected.
>
> =====================
> hosts specs:
> 24 cores
> 64 RAM
> no swap.
> RAID 1 1TB disk space.

The zombie process failure is related to a different bug: https://bugzilla.redhat.com/show_bug.cgi?id=1176964. Please ignore it here. With the patch applied on all hosts, all VMs report data as expected.
Michal, based on comment #23 it seems the fix suggested in comment #12 solves this problem. Would it be possible to submit a patch so we have it fixed for 3.5?
That's good news. However, I wouldn't do that without further testing and review by Vinzenz. This is not a regression but most likely existing behavior; I would wait for 3.5.z.
(In reply to Eldad Marciano from comment #13)
> looks like using threads for 'handle_event' resolve the problem.
> all vms reporting data.
>
> and event processed extremely faster.

Just to discuss this matter a bit and to share thoughts (since I see the patch was abandoned): I don't think this is the right approach. Let me elaborate. What we saw happen is, roughly:

- Until VDSM load (CPU usage) is under a given, yet unknown, threshold, everything seems to work and events are processed in a timely manner.
- When VDSM becomes too overloaded, the thread in charge of fetching and processing events slows down, i.e. the processing of a single event starts to take too long.
- Thus, when the events thread (vmchannels.py, Listener) gets a chance to run, it has more events to process (they accumulated in the meantime) and processing each one takes longer. Net result: in a given, fixed-length timeslice, LESS work gets done and MORE CPU load is put on VDSM.
- The process described above snowballs, getting worse and worse as time passes and load increases (more VMs).

Now, Eldad's patch *is* beneficial in a way, because it prevents the events from accumulating. Since the processing is done on separate threads, events are grabbed quickly enough, so they don't pile up. This is consistent with what Eldad observed, and can give the impression of faster processing.

But I don't see how this could possibly improve the other half of the problem, i.e. making the processing of a *single* event faster. Quite the opposite: now more and more threads are spawned, so the actual load on the VDSM process *increases*. The default Python VM is notoriously not that brilliant (to be extremely kind) at handling so many threads, and we already have too many of them.

For this reason I don't really like that patch (and/or I need more data to change my mind).
Just to brainstorm, I believe a better approach, both short and long term, could be to spawn that code in a separate process, which the main VDSM process would poll periodically. HTH
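A minimal sketch of that separate-process idea, assuming a pipe between the main process and a listener child (all names are hypothetical; this is not the actual vdsm design):

```python
import multiprocessing as mp

def agent_listener(conn):
    """Child process: in a real design this would poll the virtio serial
    channels and parse guest agent messages; here it just sends one canned
    stats sample to demonstrate the plumbing."""
    conn.send({'memUsage': 37, 'netIfaces': []})
    conn.close()

def poll_guest_stats():
    """Main-process side: spawn the listener and collect its results,
    keeping the heavy message handling out of the main process."""
    parent_end, child_end = mp.Pipe()
    worker = mp.Process(target=agent_listener, args=(child_end,))
    worker.start()
    stats = parent_end.recv()   # the main loop would poll this periodically
    worker.join()
    return stats
```

The win would be isolation: a starved or GIL-bound listener no longer competes with the statistics threads inside one interpreter, and the parent only pays for a cheap recv() per poll.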
There seems to be too high CPU usage by the vdsm process, which looks suspicious. We used to have no issues with even more VMs (i.e. more threads), so it may be related to that. Let's investigate why it is so hard on the CPU. In parallel, Francesco is working on thread reduction, but that is a long-term change.
Francesco / Michal, I think Francesco's suggestion is a good idea. Separate processes for the main loop and for polling would probably yield better results, especially at scale.
One thing we found is that we spend quite a lot of calls on the filtering of strings for not-supported/allowed unicode characters. This is a recursive process going deep into the message objects after they have been decoded. We should try to optimize this part of the guest data processing, which might help increase the responsiveness and event-handling performance of the guest agent communication.
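To make the cost concrete, here is a simplified stand-in for such a recursive filter (the actual banned character set and traversal live in vdsm's guest agent code; this version is only illustrative). Every string in every nested dict and list gets rebuilt character by character, which is where the call count comes from:

```python
def filter_chars(obj, banned=frozenset('\x00\x0b\x0c')):
    """Recursively walk a decoded guest agent message and drop characters
    from `banned` (a hypothetical set) out of every string it contains."""
    if isinstance(obj, str):
        return ''.join(c for c in obj if c not in banned)
    if isinstance(obj, dict):
        return dict((filter_chars(k, banned), filter_chars(v, banned))
                    for k, v in obj.items())
    if isinstance(obj, list):
        return [filter_chars(item, banned) for item in obj]
    return obj                     # numbers, None, etc. pass through
```

One optimization along the lines suggested above would be to skip the rebuild entirely when no banned character is present (a cheap membership pre-check per string), since the vast majority of guest messages are clean.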
Please note that this is not the only performance-related problem we have found here. Please see https://bugzilla.redhat.com/show_bug.cgi?id=1177634 for more information.
Since this is related to an overloaded host and is not a new issue, I'm going to close this as WONTFIX: there is no real issue with the code other than that it starts to break under extreme conditions. Currently the reason for these conditions is bug 1177634, and we believe that once it is fixed we will no longer see problems in guest agent reporting.

A separate process might be a solution, but we won't invest time into that until it is inevitable. For now, addressing bug 1177634 should be enough.