Description of problem:
DWH sampling rate is currently 20 seconds, and that incurs a large amount of work on postgres, especially if the history DB is hosted together with the engine DB.

This leads to overheads in postgres performance which affect engine performance:
- DWH accounts for almost 50% of the queries on the DB
- increased I/O overhead
- contention on autovacuum workers
- little effect on the dashboard [1], which is not core functionality

[1] The CPU dashboard is a 24-hour overview. VDSM already samples an average of the last 15s and the engine collects that every 15s, so a 20s sample by DWH will miss some engine samplings anyway.

actual: sample interval of 20s
expected: interval of 60s, if not higher
Requesting a 4.0.z fix since this is a performance regression, and the fix is basically just a revert of an earlier patch to set the interval back to 60s.
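For reference, reverting to a 60s interval should only require a configuration override for the DWH service rather than a code change. A minimal sketch, assuming the interval is exposed as DWH_SAMPLING (in seconds); the exact variable name and conf.d path may differ between ovirt-engine-dwh versions, so verify against the ovirt-engine-dwhd.conf shipped with your build:

  # hypothetical override file; variable name assumed, check your version's defaults first
  echo 'DWH_SAMPLING=60' > /etc/ovirt-engine-dwh/ovirt-engine-dwhd.conf.d/99-sampling.conf
  systemctl restart ovirt-engine-dwhd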
(In reply to Roy Golan from comment #0)
> Description of problem:
> DWH sampling rate is currently 20 seconds, and that incurs a large amount
> of work on postgres, especially if the history DB is hosted together with
> the engine DB.
>
> This leads to overheads in postgres performance which affect engine
> performance:
> - DWH accounts for almost 50% of the queries on the DB

We are sampling a small amount of data with simple, non-joining selects every 20 seconds, which is fine as long as it doesn't impact engine usability.

> - increased I/O overhead

iowait can be increased and yet everything is fine, or it can be 0% and nothing works.

> - contention on autovacuum workers

Since we are not locking any engine table, this has zero impact on the engine DB.

> - little effect on the dashboard [1], which is not core functionality
>
> [1] The CPU dashboard is a 24-hour overview. VDSM already samples an
> average of the last 15s and the engine collects that every 15s, so a 20s
> sample by DWH will miss some engine samplings anyway.

The DWH is not meant to be in sync with the VDSM rate. We are collecting 5 samples out of the 6 VDSM reports, which gives quite good average and max calculations compared with the 1 out of 6 that we had with the previous interval.

> actual: sample interval of 20s
> expected: interval of 60s, if not higher

This requires further testing. I have asked to check the effect on engine performance against the postgres DB before and after changing the interval from 20 back to 60 seconds.

Also, test Juan's suggestion to limit the size of the Java heap.
Please go through the email threads and results from Roy and me again.

(In reply to Shirly Radco from comment #2)
> (In reply to Roy Golan from comment #0)
> > Description of problem:
> > DWH sampling rate is currently 20 seconds, and that incurs a large amount
> > of work on postgres, especially if the history DB is hosted together with
> > the engine DB.
> >
> > This leads to overheads in postgres performance which affect engine
> > performance:
> > - DWH accounts for almost 50% of the queries on the DB
>
> We are sampling a small amount of data with simple, non-joining selects
> every 20 seconds, which is fine as long as it doesn't impact engine
> usability

The previous statement clearly says it does impact engine usability.

> > - increased I/O overhead
>
> iowait can be increased and yet everything is fine, or it can be 0% and
> nothing works.

Increased I/O overhead does impact engine usability. It is *not* "fine".

> > - contention on autovacuum workers
>
> Since we are not locking any engine table, this has zero impact on the
> engine DB.

Testing proves it does.

> > - little effect on the dashboard [1], which is not core functionality
> >
> > [1] The CPU dashboard is a 24-hour overview. VDSM already samples an
> > average of the last 15s and the engine collects that every 15s, so a 20s
> > sample by DWH will miss some engine samplings anyway.
>
> The DWH is not meant to be in sync with the VDSM rate.
> We are collecting 5 samples out of the 6 VDSM reports, which gives quite
> good average and max calculations compared with the 1 out of 6 that we had
> with the previous interval.

How is this related to the argument above? What Roy is saying is that there is no reason to sample so often: it doesn't improve accuracy or the data, it just increases load.

> > actual: sample interval of 20s
> > expected: interval of 60s, if not higher
>
> This requires further testing.
> I have asked to check the effect on engine performance against the
> postgres DB before and after changing the interval from 20 back to 60
> seconds.
>
> Also, test Juan's suggestion to limit the size of the Java heap.

Yes, we can continue exploring that. But first please move back to a sane default interval. And in comment #1 I requested to do that in 4.0.z.
(In reply to Michal Skrivanek from comment #3)
> Please go through the email threads and results from Roy and me again.
>
> (In reply to Shirly Radco from comment #2)
> > (In reply to Roy Golan from comment #0)
> > > Description of problem:
> > > DWH sampling rate is currently 20 seconds, and that incurs a large
> > > amount of work on postgres, especially if the history DB is hosted
> > > together with the engine DB.
> > >
> > > This leads to overheads in postgres performance which affect engine
> > > performance:
> > > - DWH accounts for almost 50% of the queries on the DB
> >
> > We are sampling a small amount of data with simple, non-joining selects
> > every 20 seconds, which is fine as long as it doesn't impact engine
> > usability
>
> The previous statement clearly says it does impact engine usability.

What I'm saying is that the tests done so far are not enough to conclude this. Further tests should be done before reaching this decision.

> > > - increased I/O overhead
> >
> > iowait can be increased and yet everything is fine, or it can be 0% and
> > nothing works.
>
> Increased I/O overhead does impact engine usability. It is *not* "fine".
>
> > > - contention on autovacuum workers
> >
> > Since we are not locking any engine table, this has zero impact on the
> > engine DB.
>
> Testing proves it does.

Please attach the test results we already have to the bug.

> > > - little effect on the dashboard [1], which is not core functionality
> > >
> > > [1] The CPU dashboard is a 24-hour overview. VDSM already samples an
> > > average of the last 15s and the engine collects that every 15s, so a
> > > 20s sample by DWH will miss some engine samplings anyway.
> >
> > The DWH is not meant to be in sync with the VDSM rate.
> > We are collecting 5 samples out of the 6 VDSM reports, which gives quite
> > good average and max calculations compared with the 1 out of 6 that we
> > had with the previous interval.
>
> How is this related to the argument above? What Roy is saying is that
> there is no reason to sample so often: it doesn't improve accuracy or the
> data, it just increases load.

Very relevant. It does increase the accuracy of the data presented in the dashboard.

> > > actual: sample interval of 20s
> > > expected: interval of 60s, if not higher
> >
> > This requires further testing.
> > I have asked to check the effect on engine performance against the
> > postgres DB before and after changing the interval from 20 back to 60
> > seconds.
> >
> > Also, test Juan's suggestion to limit the size of the Java heap.
>
> Yes, we can continue exploring that. But first please move back to a sane
> default interval. And in comment #1 I requested to do that in 4.0.z.

I will not move back to the previous interval until further testing:

1. Changing the heap size to 1g before / after changing the sampling interval.
2. Testing engine response time before / after changing the sampling interval.
(In reply to Shirly Radco from comment #4)
> (In reply to Michal Skrivanek from comment #3)
> > Please go through the email threads and results from Roy and me again.
> >
> > (In reply to Shirly Radco from comment #2)
> > > (In reply to Roy Golan from comment #0)
> > > > Description of problem:
> > > > DWH sampling rate is currently 20 seconds, and that incurs a large
> > > > amount of work on postgres, especially if the history DB is hosted
> > > > together with the engine DB.
> > > >
> > > > This leads to overheads in postgres performance which affect engine
> > > > performance:
> > > > - DWH accounts for almost 50% of the queries on the DB
> > >
> > > We are sampling a small amount of data with simple, non-joining selects
> > > every 20 seconds, which is fine as long as it doesn't impact engine
> > > usability
> >
> > The previous statement clearly says it does impact engine usability.
>
> What I'm saying is that the tests done so far are not enough to conclude
> this. Further tests should be done before reaching this decision.

Shirly, I'm setting NEEDINFO on you to provide the test results. Specifically, work with Roy to get the needed numbers. I'd be happy to see them by next week - I think most of the work has been done already - but we are lacking some conclusions and possibly the overall effect. I'd do it after you implement the small changes Juan has mentioned in the way we run the DWH process, of course (which you've specified below).

> > > > - increased I/O overhead
> > >
> > > iowait can be increased and yet everything is fine, or it can be 0% and
> > > nothing works.
> >
> > Increased I/O overhead does impact engine usability. It is *not* "fine".
> >
> > > > - contention on autovacuum workers
> > >
> > > Since we are not locking any engine table, this has zero impact on the
> > > engine DB.
> >
> > Testing proves it does.
>
> Please attach the test results we already have to the bug.
>
> > > > - little effect on the dashboard [1], which is not core functionality
> > > >
> > > > [1] The CPU dashboard is a 24-hour overview. VDSM already samples an
> > > > average of the last 15s and the engine collects that every 15s, so a
> > > > 20s sample by DWH will miss some engine samplings anyway.
> > >
> > > The DWH is not meant to be in sync with the VDSM rate.
> > > We are collecting 5 samples out of the 6 VDSM reports, which gives
> > > quite good average and max calculations compared with the 1 out of 6
> > > that we had with the previous interval.
> >
> > How is this related to the argument above? What Roy is saying is that
> > there is no reason to sample so often: it doesn't improve accuracy or the
> > data, it just increases load.
>
> Very relevant. It does increase the accuracy of the data presented in the
> dashboard.
>
> > > > actual: sample interval of 20s
> > > > expected: interval of 60s, if not higher
> > >
> > > This requires further testing.
> > > I have asked to check the effect on engine performance against the
> > > postgres DB before and after changing the interval from 20 back to 60
> > > seconds.
> > >
> > > Also, test Juan's suggestion to limit the size of the Java heap.
> >
> > Yes, we can continue exploring that. But first please move back to a sane
> > default interval. And in comment #1 I requested to do that in 4.0.z.
>
> I will not move back to the previous interval until further testing:
>
> 1. Changing the heap size to 1g before / after changing the sampling
> interval.
> 2. Testing engine response time before / after changing the sampling
> interval.
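For reference, limiting the DWH Java heap to 1g as proposed above would typically be done through the dwhd service configuration. A hypothetical sketch, with variable names assumed by analogy with the engine's ENGINE_HEAP_* settings; they may differ in your ovirt-engine-dwh version, so verify against the ovirt-engine-dwhd service defaults first:

  # assumed variable names; check your version's service defaults before applying
  printf 'DWH_HEAP_MIN=1g\nDWH_HEAP_MAX=1g\n' > /etc/ovirt-engine-dwh/ovirt-engine-dwhd.conf.d/99-heap.conf
  systemctl restart ovirt-engine-dwhd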
After syncing with rgolan, we found that the ovirt_engine_history workload significantly affects the response time of other queries. Please see:
https://bugzilla.redhat.com/show_bug.cgi?id=1417471
(In reply to Eldad Marciano from comment #6)
> After syncing with rgolan, we found that the ovirt_engine_history workload
> significantly affects the response time of other queries. Please see:
> https://bugzilla.redhat.com/show_bug.cgi?id=1417471

When shutting down ovirt-engine-dwhd we see faster response times for the getvmdisksguid query and lower CPU utilization on the DB machine.
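For reference, one way to quantify how much of the DB load the history workload accounts for is pg_stat_statements, which records per-query call counts and cumulative execution time. A minimal sketch, assuming the extension is enabled on the DB host and the engine database is named "engine" (both are assumptions; enabling pg_stat_statements requires shared_preload_libraries and a server restart, and column names differ between PostgreSQL versions):

  # top 10 queries by cumulative execution time (total_time is in ms on PostgreSQL <= 12)
  su - postgres -c "psql engine -c 'SELECT calls, total_time, left(query, 60) FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;'"

Running this with dwhd started vs. stopped (or at 20s vs. 60s) gives a concrete share of total DB time attributable to the history collection.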
Shutting down ovirt-engine-dwhd is not the way to test how the ovirt_engine_history workload affects the engine. This should be tested as 20s vs. 60s sampling interval.

Of course stopping it completely will have an effect, but it is required for the dashboards.
(In reply to Shirly Radco from comment #8)
> Shutting down ovirt-engine-dwhd is not the way to test how the
> ovirt_engine_history workload affects the engine. This should be tested as
> 20s vs. 60s sampling interval.
>
> Of course stopping it completely will have an effect, but it is required
> for the dashboards.

We can't ignore the impact and the contention, and we should work in parallel to make the behaviour better - both improve the engine queries and relax the interval. Changing the interval is quicker and cheaper than fixing all of the engine queries; that will take us time, but we will get there.

Eldad, please run an fio read/write test on /var/lib/pgsql/data/testfile to estimate the disk speed.
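For reference, a command along these lines would produce output like the run in the next comment; this is a guess reconstructed from that output (4k random reads, sync engine, 128MB file), not necessarily the exact invocation used:

  # random-read baseline on the postgres data filesystem; adjust size as needed
  fio --name=random-read --rw=randread --bs=4k --size=128m --ioengine=sync --iodepth=1 --filename=/var/lib/pgsql/data/testfile

A matching randwrite run (and removal of the test file afterwards) would cover the write side of the request.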
Adding disk speed status from fio:

random-read: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.10
Starting 1 process
random-read: Laying out IO file(s) (1 file(s) / 128MB)
Jobs: 1 (f=1): [r] [-.-% done] [60371KB/0KB/0KB /s] [15.1K/0/0 iops] [eta 00m:00s]
random-read: (groupid=0, jobs=1): err= 0: pid=41185: Sun Jan 29 16:16:12 2017
  read : io=131072KB, bw=60681KB/s, iops=15170, runt=  2160msec
    clat (usec): min=53, max=18245, avg=61.92, stdev=102.75
     lat (usec): min=54, max=18245, avg=62.17, stdev=102.75
    clat percentiles (usec):
     |  1.00th=[   55],  5.00th=[   56], 10.00th=[   56], 20.00th=[   56],
     | 30.00th=[   57], 40.00th=[   57], 50.00th=[   58], 60.00th=[   59],
     | 70.00th=[   60], 80.00th=[   63], 90.00th=[   73], 95.00th=[   78],
     | 99.00th=[  100], 99.50th=[  106], 99.90th=[  117], 99.95th=[  129],
     | 99.99th=[  233]
    bw (KB  /s): min=58944, max=61680, per=100.00%, avg=60790.00, stdev=1276.45
    lat (usec) : 100=98.94%, 250=1.05%, 500=0.01%
    lat (msec) : 4=0.01%, 20=0.01%
  cpu          : usr=9.08%, sys=38.68%, ctx=32769, majf=0, minf=36
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=32768/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=131072KB, aggrb=60681KB/s, minb=60681KB/s, maxb=60681KB/s, mint=2160msec, maxt=2160msec

Disk stats (read/write):
    dm-0: ios=30980/98, merge=0/0, ticks=1216/10, in_queue=1226, util=57.02%, aggrios=32768/102, aggrmerge=0/0, aggrticks=1251/11, aggrin_queue=1239, aggrutil=54.13%
  sda: ios=32768/102, merge=0/0, ticks=1251/11, in_queue=1239, util=54.13%
fio shows it's a regular disk, nothing exciting. There's no argument that 20s instead of 60s takes more of everything: CPU, memory, disk, and possibly network. That being said, I'm still looking for an apples-to-apples comparison of the effect, and for reasoning about the value the shorter interval adds. The fact that it is (slightly) more accurate is not very convincing right now.
Elad, please execute the following 'explain' with and without the DWH running:

explain analyze select * from all_disks_for_vms;

This is the view behind the slow query.
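For reference, a minimal sketch of that comparison, assuming the engine database is named "engine" and is accessible as the postgres user (both assumptions; adjust connection details to the actual setup):

  # baseline with the DWH collector running
  su - postgres -c "psql engine -c 'EXPLAIN ANALYZE SELECT * FROM all_disks_for_vms;'"
  # repeat with the collector stopped, then restore it
  systemctl stop ovirt-engine-dwhd
  su - postgres -c "psql engine -c 'EXPLAIN ANALYZE SELECT * FROM all_disks_for_vms;'"
  systemctl start ovirt-engine-dwhd

Comparing the reported execution times (and buffer figures, if run as EXPLAIN (ANALYZE, BUFFERS)) shows how much of the slowdown is attributable to the concurrent DWH load.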
oVirt 4.1.0 GA has been released, re-targeting to 4.1.1. Please check if this issue is correctly targeted or already included in 4.1.0.
Verified in automation.
This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017. Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.