Bug 1395608 - DWH sampling is too high
Summary: DWH sampling is too high
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine-dwh
Classification: oVirt
Component: General
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: ovirt-4.2.0
: ---
Assignee: Shirly Radco
QA Contact: Lukas Svaty
URL:
Whiteboard:
Depends On:
Blocks: 1398553 1478859
 
Reported: 2016-11-16 09:50 UTC by Roy Golan
Modified: 2017-12-20 11:38 UTC (History)
9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: DWH sampling rate was 20 seconds. Consequence: This created load on PostgreSQL, specifically if the history db is hosted with the engine db, and produced warning messages in the DWH when the engine heartbeat did not update within the required interval. Fix: Moved back to a 60-second interval. Result: The warning messages are now gone, and there is less stress on the database.
Clone Of:
: 1398553 1478859 (view as bug list)
Environment:
Last Closed: 2017-12-20 11:38:16 UTC
oVirt Team: Metrics
Embargoed:
ykaul: needinfo+
rule-engine: ovirt-4.2+
rule-engine: planning_ack+
rule-engine: devel_ack+
pstehlik: testing_ack+



Description Roy Golan 2016-11-16 09:50:53 UTC
Description of problem:
DWH sampling rate is currently 20 seconds, which incurs a large amount of work on PostgreSQL, specifically if the history db is hosted with the engine db.

This leads to overhead on PostgreSQL performance, which affects engine performance:
- dwh takes almost 50% of the queries on the db
- increased io overhead
- contention on autovacuum workers
- little effect on the dashboard [1], which is not core functionality


[1] The CPU dashboard is a 24-hour overview. VDSM already samples an average over the last 15s and the engine collects that every 15s; a 20s sample by DWH will miss some engine samplings anyway.


actual: sample interval of 20s
expected: interval of 60s, if not higher

Comment 1 Michal Skrivanek 2016-11-22 10:01:44 UTC
requesting a 4.0.z fix since it is a performance regression and basically just a revert of an earlier patch to set it back to 60s

Comment 2 Shirly Radco 2016-11-22 11:48:59 UTC
(In reply to Roy Golan from comment #0)
> Description of problem:
> DHW sampling rate is currently 20 seconds and that incures a big amount of
> work and the postgres specifically if the history db is hosted with the
> engine db.
> 
> This leads to overheads on the postgres prefomance which affects engine
> performance:
> - dwh takes almost 50% of the queries on the db

We are sampling a small amount of data with simple non-joining selects every 20 seconds, which is fine as long as it doesn't impact engine usability.

> - increased io overhead

iowait can be increased and yet everything is fine, or it can be 0% and nothing works.

> - contention on autovacuum workers

Since we are not locking any engine table, this has zero impact on the engine DB.

> - little effect on the dashboard [1] which is not a core functionality 
> 
> 
> [1] The cpu dashboad is a 24 hours overview. VDSM already samples an average
> of the last 15s and engine collects that every 15s, a sample of 20s by dhw
> will miss some engine samplings anyway.


The DWH is not meant to be in sync with the VDSM rate.
We are collecting 5 samples out of the 6 VDSM reports, which gives quite good average and max calculations compared with the 1 out of 6 that we had with the previous interval.
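The sample-coverage arithmetic in this exchange can be sanity-checked with a quick simulation. This is only a sketch: the 15s engine/VDSM collection period comes from the discussion above, and the model assumes DWH reads just the latest engine sample at each poll, which is an assumption rather than a confirmed implementation detail.

```python
# Sketch: how many distinct engine samples (collected every 15s) does DWH
# capture at a 20s vs. a 60s polling interval? Timings are assumptions
# taken from this bug's discussion, not measured values.

def samples_seen(engine_period, dwh_period, horizon):
    """Count distinct engine samples captured by DWH polls over `horizon` seconds."""
    seen = set()
    t = 0
    while t < horizon:
        # index of the most recent engine sample available at poll time t
        seen.add(t // engine_period)
        t += dwh_period
    return len(seen), horizon // engine_period

# Over a 3-minute window this model gives 9 of 12 samples at 20s polling
# and 3 of 12 at 60s polling.
for dwh in (20, 60):
    captured, total = samples_seen(15, dwh, 180)
    print(f"DWH every {dwh}s: captures {captured} of {total} engine samples")
```

Under these assumptions the 20s interval does capture roughly three times as many samples as 60s, so both sides of the argument hold: accuracy is higher, at the cost of proportionally more queries.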

> 
> 
> actual: sample interval of 20s
> expected: interval of 60s, if not higher

This requires further testing.
I have asked to check the effect on engine performance against the postgres db before and after the change from 20 back to 60 seconds.

Also, follow Juan's suggestion to test limiting the Java heap size.

Comment 3 Michal Skrivanek 2016-11-25 07:40:37 UTC
Please go through the email threads and the results from Roy and me again.

(In reply to Shirly Radco from comment #2)
> (In reply to Roy Golan from comment #0)
> > Description of problem:
> > DHW sampling rate is currently 20 seconds and that incures a big amount of
> > work and the postgres specifically if the history db is hosted with the
> > engine db.
> > 
> > This leads to overheads on the postgres prefomance which affects engine
> > performance:
> > - dwh takes almost 50% of the queries on the db
> 
> We are sampling small amount of data with simple non joining selects every
> 20 seconds which is fine as long as it doesn't impact the engine usability

The previous statement clearly says it does impact engine usability.

> 
> > - increased io overhead
> 
> iowait can be increased and yet everything is fine or it can be 0% and
> nothing works.

increased io overhead does impact the engine usability. It is *not* "fine".

> 
> > - contention on autovacuum workers
> 
> Since not locking any engine table, this has 0 impact on the engine DB.

testing proves it does
 
> 
> > - little effect on the dashboard [1] which is not a core functionality 
> > 
> > 
> > [1] The cpu dashboad is a 24 hours overview. VDSM already samples an average
> > of the last 15s and engine collects that every 15s, a sample of 20s by dhw
> > will miss some engine samplings anyway.
> 
> 
> The dwh is not meant to be in sink with the vdsm rate. 
> We are collecting 5 samples out of the 6 VDSM reports which is quite good
> average and max calculations compared with the 1 out of 6 that we have from
> in the previous interval.

How is this related to the argument above? What Roy is saying is that there is no reason to sample so often, as it doesn't improve the accuracy of any data; it just increases load.

> 
> > 
> > 
> > actual: sample interval of 20s
> > expected: interval of 60s, if not higher
> 
> This requires further testing.
> I have asked to check the affect on the engine performance against the
> postgres db before and after the change from 20 back to 60 seconds.
> 
> Also, follow test Juan suggestion to limit the size of the java heap size.

yes, we can continue exploring that. But first move back to sane default interval please.
And in comment #1 I requested to do that in 4.0.z

Comment 4 Shirly Radco 2016-11-27 10:14:20 UTC
(In reply to Michal Skrivanek from comment #3)
> Please go through the email threads, results from Roy, me, again.
> 
> (In reply to Shirly Radco from comment #2)
> > (In reply to Roy Golan from comment #0)
> > > Description of problem:
> > > DHW sampling rate is currently 20 seconds and that incures a big amount of
> > > work and the postgres specifically if the history db is hosted with the
> > > engine db.
> > > 
> > > This leads to overheads on the postgres prefomance which affects engine
> > > performance:
> > > - dwh takes almost 50% of the queries on the db
> > 
> > We are sampling small amount of data with simple non joining selects every
> > 20 seconds which is fine as long as it doesn't impact the engine usability
> 
> The previous statement clearly says does impact the engine usability

What I'm saying is that the tests done are not enough to conclude this.
Further tests should be done before reaching this decision.

> 
> > 
> > > - increased io overhead
> > 
> > iowait can be increased and yet everything is fine or it can be 0% and
> > nothing works.
> 
> increased io overhead does impact the engine usability. It is *not* "fine".
> 
> > 
> > > - contention on autovacuum workers
> > 
> > Since not locking any engine table, this has 0 impact on the engine DB.
> 
> testing proves it does

Please attach the test results we already have to the bug.


>  
> > 
> > > - little effect on the dashboard [1] which is not a core functionality 
> > > 
> > > 
> > > [1] The cpu dashboad is a 24 hours overview. VDSM already samples an average
> > > of the last 15s and engine collects that every 15s, a sample of 20s by dhw
> > > will miss some engine samplings anyway.
> > 
> > 
> > The dwh is not meant to be in sink with the vdsm rate. 
> > We are collecting 5 samples out of the 6 VDSM reports which is quite good
> > average and max calculations compared with the 1 out of 6 that we have from
> > in the previous interval.
> 
> how is this related to the argument above? What Roy is saying is that there
> is no reason to sample such often as it doesn't improve accuracy or any
> data, it's just increases load

Very relevant. It does increase the accuracy of the data presented in the dashboard.

> 
> > 
> > > 
> > > 
> > > actual: sample interval of 20s
> > > expected: interval of 60s, if not higher
> > 
> > This requires further testing.
> > I have asked to check the affect on the engine performance against the
> > postgres db before and after the change from 20 back to 60 seconds.
> > 
> > Also, follow test Juan suggestion to limit the size of the java heap size.
> 
> yes, we can continue exploring that. But first move back to sane default
> interval please.
> And in comment #1 I requested to do that in 4.0.z

Will not move back to previous until further testing.

1. Changing the heap size to 1g before / after changing the sampling interval.
2. Testing engine response time before / after changing the sampling interval.

Comment 5 Yaniv Kaul 2016-11-27 10:27:17 UTC
(In reply to Shirly Radco from comment #4)
> (In reply to Michal Skrivanek from comment #3)
> > Please go through the email threads, results from Roy, me, again.
> > 
> > (In reply to Shirly Radco from comment #2)
> > > (In reply to Roy Golan from comment #0)
> > > > Description of problem:
> > > > DHW sampling rate is currently 20 seconds and that incures a big amount of
> > > > work and the postgres specifically if the history db is hosted with the
> > > > engine db.
> > > > 
> > > > This leads to overheads on the postgres prefomance which affects engine
> > > > performance:
> > > > - dwh takes almost 50% of the queries on the db
> > > 
> > > We are sampling small amount of data with simple non joining selects every
> > > 20 seconds which is fine as long as it doesn't impact the engine usability
> > 
> > The previous statement clearly says does impact the engine usability
> 
> What I'm saying is that the tests done are not enough to conclude this.
> Further tests should be done before reaching this decision.

Shirly, I'm setting NEEDINFO on you to provide the test results. 
Specifically, work with Roy to get the needed numbers.
I'd be happy to see them by next week. I think most of the work has been done already, but we are lacking some conclusions and possibly the overall effect.

I'd do it after you implement the small changes Juan has mentioned in the way we run the DWH process, of course (which you've specified below).


> 1.Changing the heap size to 1g before / after changing the sampling interval.
> 2.Testing engine response time before / after changing the sampling interval.

Comment 6 Eldad Marciano 2017-01-29 16:59:06 UTC
After syncing with rgolan,
we found that ovirt_engine_history workloads significantly affect other queries' response times; please see:
https://bugzilla.redhat.com/show_bug.cgi?id=1417471

Comment 7 Eldad Marciano 2017-01-29 21:45:58 UTC
(In reply to Eldad Marciano from comment #6)
> after syncing with rgolan,
> we found that ovirt_engine_history workloads afftects a lot on other queries
> response time please see:
> https://bugzilla.redhat.com/show_bug.cgi?id=1417471

When shutting down ovirt-engine-dwhd,
we see a faster response time for the getvmdisksguid query and less CPU utilization on the DB machine.

Comment 8 Shirly Radco 2017-01-30 08:20:23 UTC
Shutting down ovirt-engine-dwhd is not the way to test the ovirt_engine_history workload's effect on the engine. This should be tested with a 20s vs. 60s sampling interval.

Of course stopping it completely will have an effect, but it is required for the dashboards.

Comment 9 Roy Golan 2017-01-30 08:39:36 UTC
(In reply to Shirly Radco from comment #8)
> Shutting down ovirt-engine-dwhd is not the way to test the
> ovirt_engine_history workload's effect on the engine. This should be tested
> with a 20s vs. 60s sampling interval.
> 

> Of course stopping it completely will have an effect, but it is
> required for the dashboards.

We can't ignore the impact and the contention, and we should work in parallel to make the behaviour better: both improve the engine queries and relax the interval.
Changing the interval is quicker and cheaper than fixing all of the engine queries. Fixing the queries will take time, but we will get there.

Eldad, please add an fio read/write test on /var/lib/pgsql/data/testfile to estimate the disk speed.
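For hosts where fio is not installed, the requested random-read test can be roughly approximated in Python. This is a sketch only: the path and file size below are illustrative assumptions (the actual request targeted /var/lib/pgsql/data/testfile, and the fio run below used a 128MB file with 4K blocks).

```python
# Rough stand-in for an fio random-read test: lay out a file, time a batch
# of random 4K reads, report ops/sec. Path and sizes are illustrative.
import os
import random
import time

def random_read_bench(path, file_size=4 * 1024 * 1024, block=4096, reads=1000):
    """Write a test file, time `reads` random block-sized reads, return ops/sec."""
    with open(path, "wb") as f:
        f.write(os.urandom(file_size))
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        for _ in range(reads):
            os.pread(fd, block, random.randrange(0, file_size - block))
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    return reads / elapsed

print(f"~{random_read_bench('/tmp/dwh_disk_test'):.0f} random reads/sec")
```

Note that a file this small is largely served from the page cache, so the number overstates raw disk IOPS; fio with direct I/O and a larger file, as in the run below, is the more faithful measurement.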

Comment 10 Eldad Marciano 2017-01-30 08:58:42 UTC
Adding fio disk speed results:

random-read: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.10
Starting 1 process
random-read: Laying out IO file(s) (1 file(s) / 128MB)
Jobs: 1 (f=1): [r] [-.-% done] [60371KB/0KB/0KB /s] [15.1K/0/0 iops] [eta 00m:00s]
random-read: (groupid=0, jobs=1): err= 0: pid=41185: Sun Jan 29 16:16:12 2017
  read : io=131072KB, bw=60681KB/s, iops=15170, runt=  2160msec
    clat (usec): min=53, max=18245, avg=61.92, stdev=102.75
     lat (usec): min=54, max=18245, avg=62.17, stdev=102.75
    clat percentiles (usec):
     |  1.00th=[   55],  5.00th=[   56], 10.00th=[   56], 20.00th=[   56],
     | 30.00th=[   57], 40.00th=[   57], 50.00th=[   58], 60.00th=[   59],
     | 70.00th=[   60], 80.00th=[   63], 90.00th=[   73], 95.00th=[   78],
     | 99.00th=[  100], 99.50th=[  106], 99.90th=[  117], 99.95th=[  129],
     | 99.99th=[  233]
    bw (KB  /s): min=58944, max=61680, per=100.00%, avg=60790.00, stdev=1276.45
    lat (usec) : 100=98.94%, 250=1.05%, 500=0.01%
    lat (msec) : 4=0.01%, 20=0.01%
  cpu          : usr=9.08%, sys=38.68%, ctx=32769, majf=0, minf=36
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=32768/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=131072KB, aggrb=60681KB/s, minb=60681KB/s, maxb=60681KB/s, mint=2160msec, maxt=2160msec

Disk stats (read/write):
    dm-0: ios=30980/98, merge=0/0, ticks=1216/10, in_queue=1226, util=57.02%, aggrios=32768/102, aggrmerge=0/0, aggrticks=1251/11, aggrin_queue=1239, aggrutil=54.13%
  sda: ios=32768/102, merge=0/0, ticks=1251/11, in_queue=1239, util=54.13%

Comment 11 Yaniv Kaul 2017-01-30 09:13:36 UTC
fio shows it's a regular disk, nothing exciting.
There's no argument that 20s instead of 60s takes more of everything: CPU, memory, disk and possibly network.
That being said, I'm still looking for an apples-to-apples comparison of the effect.

I'm also looking for the reasoning for the value it adds. The fact that it is (slightly) more accurate is not very convincing right now.

Comment 12 Roy Golan 2017-01-30 10:11:01 UTC
Eldad, please execute the following 'explain' with and without the dwh:

'explain analyze select * from all_disks_for_vms'

This is the view behind the slow query.
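To compare the two runs side by side, the timing summary can be pulled out of the EXPLAIN ANALYZE output with a small helper. The sample plan text below is an illustrative fabrication, not an actual result from this test.

```python
# Extract the 'Execution time' reported by PostgreSQL's EXPLAIN ANALYZE so
# runs with and without ovirt-engine-dwhd can be compared numerically.
import re

def execution_time_ms(plan_text):
    """Return the 'Execution time' in ms from EXPLAIN ANALYZE output, or None."""
    m = re.search(r"Execution time: ([\d.]+) ms", plan_text)
    return float(m.group(1)) if m else None

# Hypothetical plan text, for illustration only.
sample_plan = """\
Seq Scan on all_disks_for_vms  (cost=0.00..431.00 rows=1000 width=64) (actual time=0.021..4.105 rows=1000 loops=1)
Planning time: 0.120 ms
Execution time: 4.512 ms
"""
print(execution_time_ms(sample_plan))  # 4.512
```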

Comment 14 Sandro Bonazzola 2017-02-01 16:01:55 UTC
oVirt 4.1.0 GA has been released, re-targeting to 4.1.1.
Please check if this issue is correctly targeted or already included in 4.1.0.

Comment 24 Lukas Svaty 2017-09-14 15:43:54 UTC
verified in auto

Comment 25 Sandro Bonazzola 2017-12-20 11:38:16 UTC
This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be
resolved in that release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

