Bug 1185279 - [scale][performance] vdsm Host Monitoring issues
Summary: [scale][performance] vdsm Host Monitoring issues
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Shmuel Melamud
QA Contact: Eldad Marciano
URL:
Whiteboard:
Depends On: 1182094
Blocks: 1177634
 
Reported: 2015-01-23 11:23 UTC by Michal Skrivanek
Modified: 2016-04-20 01:10 UTC
CC List: 15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 1177634
Environment:
Last Closed: 2016-04-20 01:10:12 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:


Attachments
protocol detector patch results (764.49 KB, application/octet-stream)
2015-01-27 12:41 UTC, Eldad Marciano


Links
oVirt gerrit 39191 (master, MERGED): virt: configurable sampling interval for HostStatsThread

Description Michal Skrivanek 2015-01-23 11:23:01 UTC
+++ This bug was initially created as a clone of Bug #1177634 +++

--- Additional comment from Michal Skrivanek on 2015-01-23 12:19:51 CET ---

(In reply to Eldad Marciano from comment #91)
> Created attachment 982953 [details]
> vdsmd_25vms_json_withnoMOM_1800sec_cpu.prof

Unfortunately this is a different thing than advertised. It seems to have been running for 259 seconds and includes VM creation.
Still, very helpful; I'm spinning off a new bug for host monitoring flaws ;-)

Comment 1 Michal Skrivanek 2015-01-23 11:28:53 UTC
Analyzing the profile data we found out:

1.
HostStatsThread's sample() is creating a HostSample object again and again. Over the time period of 260 seconds we see 870 calls like that. It should have been fewer, as the interval is 2s.

2.
The sampling interval is 2s whereas the query comes from the engine only every 15s, so the extra samples are wasted. The value "2" is hardcoded and not part of the configurable sampling intervals like all the VM-related stats.

3.
In case NUMA is enabled, NumaNodeMemorySample does an expensive libvirt call (one per NUMA cell), whereas all the other monitoring code parses /proc information instead. We should avoid that since we already have high contention on libvirt access.
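
For illustration only, a minimal sketch of the per-cell sampling pattern described in point 3, assuming an open libvirt connection; the real NumaNodeMemorySample code in vdsm may look different:

# Illustrative sketch only (not the actual vdsm NumaNodeMemorySample code).
# Every NUMA cell triggers a separate libvirt RPC (virNodeGetMemoryStats),
# which competes with all other libvirt users on the host.
import libvirt

def sample_numa_memory(uri='qemu:///system'):
    conn = libvirt.openReadOnly(uri)
    try:
        cells = conn.getInfo()[4]  # number of NUMA cells on this host
        stats = {}
        for cell in range(cells):
            # one libvirt call per NUMA cell
            stats[cell] = conn.getMemoryStats(cell, 0)
        return stats
    finally:
        conn.close()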

Comment 2 Nir Soffer 2015-01-23 12:41:07 UTC
(In reply to Michal Skrivanek from comment #1)
> analyzing the profile data we found out:
> 
> 1.
> HostStatsThread's sample() is creating a HostSample object again and again.
> Over the time period of 260 seconds we see 870 calls like that. It should
> have been fewer, as the interval is 2s.

48442690 function calls (49377242 primitive calls) in 259.687 seconds

This means 259 cpu seconds, not wall clock seconds, so this does not show
a problem with the number of calls.

> 
> 2.
> The sampling interval is 2s whereas the query comes from the engine only
> every 15s, so the extra samples are wasted. The value "2" is hardcoded and
> not part of the configurable sampling intervals like all the VM-related stats.

Let's make it configurable like the other sample types?

Looks like we are running with low cpu usage (259/1800 -> 14%) and, except for
the protocol detector noise (59 seconds), nothing takes much cpu time
(assuming that this profile did run for 1800 seconds).

Fri Jan 23 14:22:10 2015    vdsmd_25vms_json_withnoMOM_1800sec_cpu.prof

         48442690 function calls (49377242 primitive calls) in 259.687 seconds

   Ordered by: internal time
   List reduced from 1978 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  4585540   30.187    0.000   59.127    0.000 protocoldetector.py:94(MultiProtocolAcceptor._process_events)
       37   19.794    1.799    3.463    0.094 threading.py:481(Thread.run)
  4589683   16.482    0.000   16.482    0.000 __builtin__:0(poll)
    75429   10.983    0.000   10.983    0.000 libvirtmod:0(virEventRunDefaultImpl)
   182701    8.879    0.000   11.962    0.000 migration.py:380(MonitorThread)
  1149205    7.616    0.000   12.013    0.000 guestagent.py:68(filterXmlChar)
    46120    7.490    0.000   14.677    0.000 expatbuilder.py:743(ExpatBuilderNS.start_element_handler)
  9380128    7.251    0.000    7.251    0.000 time:0(time)
  4585669    5.725    0.000    5.725    0.000 __builtin__:0(max)
       25    5.218    5.218    5.224    0.209 sampling.py:433(VmStatsThread.run)
   188403    4.549    0.002    5.137    0.000 minidom.py:305(_get_elements_by_tagName_helper)
   236685    4.496    0.000    4.496    0.000 __builtin__:0(<method 'read' of 'file' objects>)
     2975    3.769    0.001    3.769    0.001 libvirtmod:0(virDomainGetCPUStats)
    20585    3.336    0.000   10.519    0.001 decoder.py:162(JSONObject)
     2975    3.293    0.001    3.293    0.001 libvirtmod:0(virDomainGetVcpus)
    78238    3.181    0.000    3.275    0.000 minidom.py:349(Attr.__init__)
   111164    3.100    0.000    3.219    0.000 minidom.py:281(Document._append_child)
  1287046    3.060    0.000   15.073    0.000 guestagent.py:80(<genexpr>)
  4597821    2.967    0.000    2.967    0.000 __builtin__:0(ord)
    65044    2.941    0.000    4.672    0.000 expatbuilder.py:274(ExpatBuilderNS.character_data_handler_cdata)

I think we should repeat this profile with the patch fixing the protocoldetector,
or with a version including this fix, so we don't waste time on
insignificant optimizations.
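
For reference, the listing above can be reproduced from the saved profile with the standard pstats module; this is just the usual cProfile/pstats workflow, not vdsm-specific code:

# Load the saved cProfile data and print the top 20 functions by internal
# time ('time' sorts by tottime; 'cumulative' would sort by cumtime).
import pstats

stats = pstats.Stats('vdsmd_25vms_json_withnoMOM_1800sec_cpu.prof')
stats.sort_stats('time').print_stats(20)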

Comment 3 Eldad Marciano 2015-01-26 09:19:05 UTC
Nir / Martin,
can you please add the protocoldetector BZ / patch?

Comment 4 Nir Soffer 2015-01-26 11:43:48 UTC
(In reply to Eldad Marciano from comment #3)
> Nir / Martin,
> can you please add the protocoldetector BZ / patch?

bug 1177533 Excessive cpu usage due to wrong timeout value
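
The actual defect is described in bug 1177533; purely as an illustration of this class of bug (and not the vdsm protocoldetector code itself), a poll timeout expressed in the wrong unit can turn an event loop into a near-busy loop, since select.poll().poll() takes milliseconds:

# Illustration only: select.poll().poll() expects a timeout in milliseconds,
# so passing a value computed in seconds makes each wait ~1000x shorter than
# intended and the loop wakes up almost continuously, burning CPU.
import select
import socket

sock = socket.socket()
sock.bind(('127.0.0.1', 0))
sock.listen(5)

poller = select.poll()
poller.register(sock.fileno(), select.POLLIN)

timeout = 2  # intended wait, in seconds
events = poller.poll(timeout)         # wrong: waits at most 2 milliseconds
events = poller.poll(timeout * 1000)  # right: convert seconds to milliseconds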

Comment 5 Michal Skrivanek 2015-01-27 07:41:16 UTC
(In reply to Nir Soffer from comment #2)
> 
> This means 259 cpu seconds, not wall clock seconds, so this does not show
> a problem with the number of calls.

aah, great. that's good news. So it matches the 2s interval
 
> > 2.
> > The sampling interval is 2s whereas the query comes from the engine only
> > every 15s, so the extra samples are wasted. The value "2" is hardcoded and
> > not part of the configurable sampling intervals like all the VM-related stats.
> 
> Let's make it configurable like the other sample types?
> 
Yes. And make it less frequent; it's gathering information we report to the engine only once every 15s.

This is the only issue here - make it configurable and change it to 15s (please don't mix it up with the protocoldetector, that should be another bug).
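
A minimal sketch of what such a change could look like; the class name follows the bug, but the interval constant and overall structure are illustrative assumptions, not necessarily what the merged vdsm patch does:

# Hypothetical sketch of a host stats thread whose sampling interval is
# configurable instead of a hardcoded 2 seconds; names are illustrative.
import threading

HOST_SAMPLE_STATS_INTERVAL = 15  # assumed config value, matching the engine's 15s query

class HostStatsThread(threading.Thread):
    def __init__(self, interval=HOST_SAMPLE_STATS_INTERVAL):
        super(HostStatsThread, self).__init__()
        self.daemon = True
        self._interval = interval
        self._stop = threading.Event()

    def run(self):
        while not self._stop.is_set():
            self._sample()
            # wait the configured interval instead of a fixed 2s
            self._stop.wait(self._interval)

    def _sample(self):
        # placeholder: the real code collects a HostSample (cpu, memory,
        # network statistics) here
        pass

    def stop(self):
        self._stop.set()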

Comment 6 Michal Skrivanek 2015-01-27 07:42:34 UTC
(In reply to Michal Skrivanek from comment #5)
And point 3 from comment #1 is still open - that's valid too.

Comment 7 Eldad Marciano 2015-01-27 10:12:16 UTC
(In reply to Michal Skrivanek from comment #5)
> (In reply to Nir Soffer from comment #2)
> > 
> > This means 259 cpu seconds, not wall clock seconds, so this does not show
> > a problem with the number of calls.
> 
> aah, great. that's good news. So it matches the 2s interval
>  
> > > 2.
> > > The sampling interval is 2s whereas the query comes from the engine only
> > > every 15s, so the extra samples are wasted. The value "2" is hardcoded and
> > > not part of the configurable sampling intervals like all the VM-related stats.
> > 
> > Let's make it configurable like the other sample types?
> > 
> Yes. And make it less frequent; it's gathering information we report to
> the engine only once every 15s.
> 
> This is the only issue here - make it configurable and change it to 15s
> (please don't mix it up with the protocoldetector, that should be another bug).

Michal,
I am testing the patch for the protocol detector.
As said, is there some configuration we should test too?

Comment 8 Eldad Marciano 2015-01-27 12:40:15 UTC
Looks like using seconds reduces the protocol detector cpu usage.
Adding profile results.

Comment 9 Eldad Marciano 2015-01-27 12:41:11 UTC
Created attachment 984620 [details]
protocol detector patch results

Comment 10 Michal Skrivanek 2015-03-04 15:58:30 UTC
Shmuel, can you please do #2 from comment #1?

Comment 11 Michal Skrivanek 2015-05-11 11:34:36 UTC
I think #3 from comment #1 is not done. correct? Martin, any plans?

Comment 12 Martin Sivák 2015-05-11 12:07:19 UTC
There is no cheaper way to get the information and I made it even less time consuming. /proc does not contain the data afaik.

Btw, tracking three issues in one bug is really bad practice.

Comment 14 Eldad Marciano 2016-02-18 10:55:40 UTC
Profiled a large host on top of 3.6.2 with 144 VMs on 144 processors.
It seems the host stats code is no longer listed in the profile; moving to verified.

https://bugzilla.redhat.com/attachment.cgi?id=1128028

