+++ This bug was initially created as a clone of Bug #1177634 +++

--- Additional comment from Michal Skrivanek on 2015-01-23 12:19:51 CET ---

(In reply to Eldad Marciano from comment #91)
> Created attachment 982953 [details]
> vdsmd_25vms_json_withnoMOM_1800sec_cpu.prof

Unfortunately this is a different thing than advertised: it seems to have run for 259 seconds and includes VM creation. Still, it is very helpful; I'm spinning off a new bug for the host monitoring flaws ;-)
Analyzing the profile data, we found:

1. HostStatsThread's sample() is creating a HostSample object again and again. Over the time period of 260 seconds we see 870 calls like that; it should have been fewer, as the interval is 2s.

2. The sampling interval is 2s, whereas the query comes from the engine only every 15s, so the extra samples are useless. The value "2" is hardcoded and not part of the configurable sampling intervals like all the VM-related stats.

3. In case NUMA is enabled, NumaNodeMemorySample does an expensive libvirt call (one per NUMA cell), whereas all the other monitoring code rather parses /proc information. We should avoid that, since we already have high contention on libvirt access.
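For illustration, a minimal sketch of the sampling pattern described in points 1 and 2 above. The names are simplified stand-ins for what VDSM's sampling.py does, not the actual code:

    import threading
    import time

    # Hardcoded, unlike the configurable VM stats intervals (point 2).
    HOST_SAMPLE_INTERVAL = 2

    class HostSample(object):
        def __init__(self, timestamp):
            # The real sample reads /proc (and, with NUMA, libvirt) here.
            self.timestamp = timestamp

    class HostStatsThread(threading.Thread):
        def __init__(self):
            super(HostStatsThread, self).__init__()
            self.daemon = True
            self._stop = threading.Event()
            self._samples = []

        def run(self):
            # A fresh HostSample is built every 2s, even though the
            # engine polls the result only every 15s (point 1).
            while not self._stop.is_set():
                self._samples.append(HostSample(time.time()))
                self._stop.wait(HOST_SAMPLE_INTERVAL)

        def stop(self):
            self._stop.set()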
(In reply to Michal Skrivanek from comment #1)
> 1. HostStatsThread's sample() is creating a HostSample object again and
> again. Over the time period of 260 seconds we see 870 calls like that;
> it should have been fewer, as the interval is 2s.

    48442690 function calls (49377242 primitive calls) in 259.687 seconds

This means 259 CPU seconds, not wall clock seconds, so it does not show a problem with the number of calls.

> 2. The sampling interval is 2s, whereas the query comes from the engine
> only every 15s, so the extra samples are useless. The value "2" is
> hardcoded and not part of the configurable sampling intervals like all
> the VM-related stats.

Let's make it configurable like the other sample types?

It looks like we are running with low CPU usage (259/1800 -> 14%), and apart from the protocol detector noise (59 seconds), nothing takes much CPU time (assuming that this profile did run for 1800 seconds).

    Fri Jan 23 14:22:10 2015    vdsmd_25vms_json_withnoMOM_1800sec_cpu.prof

    48442690 function calls (49377242 primitive calls) in 259.687 seconds

    Ordered by: internal time
    List reduced from 1978 to 20 due to restriction <20>

     ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
    4585540   30.187    0.000   59.127    0.000  protocoldetector.py:94(MultiProtocolAcceptor._process_events)
         37   19.794    1.799    3.463    0.094  threading.py:481(Thread.run)
    4589683   16.482    0.000   16.482    0.000  __builtin__:0(poll)
      75429   10.983    0.000   10.983    0.000  libvirtmod:0(virEventRunDefaultImpl)
     182701    8.879    0.000   11.962    0.000  migration.py:380(MonitorThread)
    1149205    7.616    0.000   12.013    0.000  guestagent.py:68(filterXmlChar)
      46120    7.490    0.000   14.677    0.000  expatbuilder.py:743(ExpatBuilderNS.start_element_handler)
    9380128    7.251    0.000    7.251    0.000  time:0(time)
    4585669    5.725    0.000    5.725    0.000  __builtin__:0(max)
         25    5.218    5.218    5.224    0.209  sampling.py:433(VmStatsThread.run)
     188403    4.549    0.002    5.137    0.000  minidom.py:305(_get_elements_by_tagName_helper)
     236685    4.496    0.000    4.496    0.000  __builtin__:0(<method 'read' of 'file' objects>)
       2975    3.769    0.001    3.769    0.001  libvirtmod:0(virDomainGetCPUStats)
      20585    3.336    0.000   10.519    0.001  decoder.py:162(JSONObject)
       2975    3.293    0.001    3.293    0.001  libvirtmod:0(virDomainGetVcpus)
      78238    3.181    0.000    3.275    0.000  minidom.py:349(Attr.__init__)
     111164    3.100    0.000    3.219    0.000  minidom.py:281(Document._append_child)
    1287046    3.060    0.000   15.073    0.000  guestagent.py:80(<genexpr>)
    4597821    2.967    0.000    2.967    0.000  __builtin__:0(ord)
      65044    2.941    0.000    4.672    0.000  expatbuilder.py:274(ExpatBuilderNS.character_data_handler_cdata)

I think we should repeat this profile with the patch fixing the protocol detector, or with a version including that fix, so we don't waste time on insignificant optimizations.
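For reference, a listing like the one above can be reproduced from the attached .prof file with the standard pstats module (a sketch, assuming the file is in pstats format):

    import pstats

    # Load the saved profile and print the 20 costliest functions by
    # internal time -- this matches the "Ordered by: internal time"
    # / "restriction <20>" listing above.
    p = pstats.Stats('vdsmd_25vms_json_withnoMOM_1800sec_cpu.prof')
    p.sort_stats('tottime').print_stats(20)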
Nir / Martin, can you please add the protocoldetector BZ / patch?
(In reply to Eldad Marciano from comment #3)
> Nir / Martin,
> can you please add the protocoldetector BZ / patch?

Bug 1177533 - Excessive cpu usage due to wrong timeout value
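For context, a "wrong timeout value" in a poll loop is typically a seconds-vs-milliseconds mixup. A hedged illustration of that class of bug, not the actual VDSM patch:

    import select

    poller = select.poll()

    # select.poll.poll() takes its timeout in *milliseconds*.
    # Passing a sub-second value expressed in seconds truncates to 0,
    # so poll() returns immediately and the loop spins, burning CPU:
    events = poller.poll(0.05)         # effectively poll(0)

    # Correct: convert seconds to milliseconds first.
    events = poller.poll(0.05 * 1000)  # wait up to 50ms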
(In reply to Nir Soffer from comment #2)
> This means 259 CPU seconds, not wall clock seconds, so it does not show
> a problem with the number of calls.

Aah, great, that's good news. So it matches the 2s interval.

> > 2. The sampling interval is 2s, whereas the query comes from the engine
> > only every 15s, so the extra samples are useless. The value "2" is
> > hardcoded and not part of the configurable sampling intervals like all
> > the VM-related stats.
>
> Let's make it configurable like the other sample types?

Yes, and make it less frequent: it is gathering information we report to the engine only once every 15s.

This is the only issue here - make it configurable and change it to 15s. (Please don't mix it up with the protocol detector; that should be another bug.)
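A sketch of what "make it configurable" could look like in VDSM's config-driven style. The option name and section below are assumptions for illustration; the actual patch decides the final names and the default:

    import time

    from vdsm.config import config

    # Hypothetical option name -- an assumption, not a confirmed key.
    # The default would be 15, matching the engine's 15s polling period.
    _HOST_STATS_INTERVAL = config.getint('vars', 'host_sample_stats_interval')

    def sampling_loop(stop_event, samples):
        # Same loop as the earlier sketch, but the wait period now comes
        # from config instead of the hardcoded 2.
        while not stop_event.is_set():
            samples.append(time.time())  # stand-in for building a HostSample
            stop_event.wait(_HOST_STATS_INTERVAL)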
(In reply to Michal Skrivanek from comment #5)
And point #3 from comment #1 is still valid too.
(In reply to Michal Skrivanek from comment #5)
> This is the only issue here - make it configurable and change it to 15s.
> (Please don't mix it up with the protocol detector; that should be
> another bug.)

Michal, I am testing the patch for the protocol detector. As said, is there some configuration we should test too?
Looks like using seconds reduces the protocol detector CPU usage. Adding profile results.
Created attachment 984620 [details]
protocol detector patch results
Shmuel, can you please do #2 from comment #1?
I think #3 from comment #1 is not done, correct? Martin, any plans?
There is no cheaper way to get the information, and I made it even less time-consuming; /proc does not contain the data AFAIK. BTW, tracking three issues in one bug is really bad practice.
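For reference, a sketch of the per-cell pattern being discussed: one virNodeGetMemoryStats round trip per NUMA cell, which is why it is expensive compared with a single /proc read. This is a simplification assuming libvirt-python's getMemoryStats binding, not the actual VDSM code:

    import libvirt

    conn = libvirt.openReadOnly(None)
    num_cells = conn.getInfo()[4]  # getInfo()[4] is the NUMA node count

    per_cell = {}
    for cell in range(num_cells):
        # Each iteration is a separate libvirt call, contending with
        # the rest of the monitoring code for libvirt access.
        per_cell[cell] = conn.getMemoryStats(cell, 0)

    conn.close()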
Profiling a large host on top of 3.6.2 with 144 VMs on 144 processors shows that the host stats thread is no longer listed in the profile, so moving to verified.
https://bugzilla.redhat.com/attachment.cgi?id=1128028