Bug 1478859

Summary: [downstream clone] DWH sampling is too high - switch back to 60s
Product: Red Hat Enterprise Virtualization Manager Reporter: Shirly Radco <sradco>
Component: ovirt-engine-dwh Assignee: Shirly Radco <sradco>
Status: CLOSED CURRENTRELEASE QA Contact: Lukas Svaty <lsvaty>
Severity: high Docs Contact:
Priority: low    
Version: 4.1.5 CC: amarchuk, bugs, eberman, emarcian, guchen, lsurette, lsvaty, lveyde, michal.skrivanek, mwest, pstehlik, rbalakri, rgolan, Rhev-m-bugs, sradco, srevivo, ykaul, ylavi
Target Milestone: ovirt-4.1.6 Keywords: Performance, Regression, Reopened, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovirt-engine-dwh-4.1.7 Doc Type: Bug Fix
Doc Text:
Cause: The DWH sampling rate was 20 seconds. Consequence: This created load on postgres, particularly when the history db is hosted with the engine db, and produced warning messages in the dwh when the engine heartbeat was not updated within the required interval. Fix: The sampling interval was moved back to 60 seconds. Result: The warning messages are gone and there is less stress on the database.
Story Points: ---
Clone Of: 1395608 Environment:
Last Closed: 2017-10-16 10:10:40 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Metrics RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1395608, 1490272    
Bug Blocks: 1398553    

Comment 2 Yaniv Lavi 2017-08-07 15:39:25 UTC
We will not change the DWH sampling mid-stream unless it causes performance regression even when on a remote host. Nacking this for now.

Comment 3 RHEL Program Management 2017-08-07 15:42:42 UTC
Product Management has reviewed and declined this request.
You may appeal this decision by reopening this request.

Comment 6 Lukas Svaty 2017-09-11 10:15:58 UTC
[root@pm-rh40 ~]# engine-config -g DwhHeartBeatInterval
DwhHeartBeatInterval: 15 version: general
[root@pm-rh40 ~]# rpm -q ovirt-engine-dwh
ovirt-engine-dwh-4.1.6.1-2.el7ev.noarch

Comment 7 Yaniv Kaul 2017-09-11 10:30:31 UTC
(In reply to Lukas Svaty from comment #6)
> [root@pm-rh40 ~]# engine-config -g DwhHeartBeatInterval
> DwhHeartBeatInterval: 15 version: general
> [root@pm-rh40 ~]# rpm -q ovirt-engine-dwh
> ovirt-engine-dwh-4.1.6.1-2.el7ev.noarch

Is that the right parameter? I thought it was DWH_SAMPLING in ovirt-engine-dwhd.conf

However, it's a good question why we need the heartbeat every 15 secs, if we move to 60secs collection interval.

Comment 8 Lukas Svaty 2017-09-11 10:38:15 UTC
Ah, my mistake, I did not read the bug correctly.

[root@pm-rh40 ~]# grep SAMPLING /usr/share/ovirt-engine-dwh/services/ovirt-engine-dwhd/ovirt-engine-dwhd.conf
DWH_SAMPLING=60

Moving to ON_QA, as I would like to check the service as well once BZ#1490272 is unblocked. Shirly, please check as well.

AFAIK we don't have any engine-config values for this. 

Moving needinfo to Shirly if we wanna change the heartbeat as well.
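
The check in this comment can be scripted. Below is a minimal sketch, assuming the conf file uses plain KEY=VALUE lines as in the grep output above; the helper name read_conf_value and the handling of comments and quotes are assumptions for illustration, not part of ovirt-engine-dwh.

```python
def read_conf_value(text, key):
    """Return the last value assigned to `key` in KEY=VALUE text, or None."""
    value = None
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and '#' comment lines carry no settings.
        if not line or line.startswith("#"):
            continue
        name, sep, raw = line.partition("=")
        if sep and name.strip() == key:
            # Assumption: values may optionally be wrapped in double quotes.
            value = raw.strip().strip('"')
    return value


# Sample content copied from the grep output in this comment.
sample = "DWH_SAMPLING=60\n"
print(read_conf_value(sample, "DWH_SAMPLING"))
```

In practice the text would be read from the conf file shown in the grep command; any override files in a conf.d directory would need the same treatment.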

Comment 9 Lukas Svaty 2017-09-12 15:42:59 UTC
verified in ovirt-engine-dwh-4.1.7-1.el7ev.noarch

[root@pm-rh40 ~]# vim /etc/ovirt-engine-dwh/ovirt-engine-dwhd.conf.d/logging.conf
[root@pm-rh40 ~]# service ovirt-engine-dwhd restart && tail -f /var/log/ovirt-
2017-09-12 17:38:14|ETL Service Stopped
2017-09-12 17:38:16|ETL Service Started... omitted output
2017-09-12 17:39:00|ZltQkz|IgH59r|MDVNSt|1257|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|4.1|Default||begin||
2017-09-12 17:39:00 Statistics sync ended. Duration: 847 milliseconds 
2017-09-12 17:40:00|ZltQkz|IgH59r|MDVNSt|1257|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|4.1|Default||end|success|60001
2017-09-12 17:40:00|jgIWHe|IgH59r|MDVNSt|1257|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|4.1|Default||begin||
2017-09-12 17:40:00 Statistics sync ended. Duration: 356 milliseconds 
2017-09-12 17:41:00|jgIWHe|IgH59r|MDVNSt|1257|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|4.1|Default||end|success|60002
2017-09-12 17:41:00|ovCTrq|IgH59r|MDVNSt|1257|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|4.1|Default||begin||
2017-09-12 17:41:00 Statistics sync ended. Duration: 283 milliseconds
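
The 60-second cadence can also be read off the log mechanically. The following is a rough sketch, assuming the pipe-delimited lines always have 14 fields with the job name in the 7th, the begin/end marker in the 12th, and an elapsed time in milliseconds in the last field of "end" lines; these field meanings are inferred from the excerpt above, and the function name sampling_gaps is hypothetical.

```python
from datetime import datetime


def sampling_gaps(lines, job="SampleTimeKeepingJob"):
    """Return (seconds between consecutive 'begin' events, 'end' durations in ms)."""
    begins = []
    durations_ms = []
    for line in lines:
        fields = line.strip().split("|")
        # Lines such as "Statistics sync ended" have no pipes and are skipped.
        if len(fields) != 14 or fields[6] != job:
            continue
        if fields[11] == "begin":
            begins.append(datetime.strptime(fields[0], "%Y-%m-%d %H:%M:%S"))
        elif fields[11] == "end" and fields[13].isdigit():
            durations_ms.append(int(fields[13]))
    gaps = [int((b - a).total_seconds()) for a, b in zip(begins, begins[1:])]
    return gaps, durations_ms


# Lines copied from the log excerpt above.
log = """\
2017-09-12 17:39:00|ZltQkz|IgH59r|MDVNSt|1257|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|4.1|Default||begin||
2017-09-12 17:39:00 Statistics sync ended. Duration: 847 milliseconds
2017-09-12 17:40:00|ZltQkz|IgH59r|MDVNSt|1257|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|4.1|Default||end|success|60001
2017-09-12 17:40:00|jgIWHe|IgH59r|MDVNSt|1257|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|4.1|Default||begin||
2017-09-12 17:41:00|jgIWHe|IgH59r|MDVNSt|1257|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|4.1|Default||end|success|60002
2017-09-12 17:41:00|ovCTrq|IgH59r|MDVNSt|1257|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|4.1|Default||begin||
"""

gaps, durations = sampling_gaps(log.splitlines())
print(gaps, durations)
```

Gaps of 60 seconds between consecutive "begin" events, together with durations of roughly 60000 ms, match the DWH_SAMPLING=60 setting.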

Comment 10 Lukas Svaty 2017-09-12 15:43:39 UTC
re-adding the needinfo as Shirly removed it during setting of Fixed in version.

Comment 11 Shirly Radco 2017-09-12 17:49:33 UTC
If we change the engine heartbeat we can't support 20 seconds interval.
Not sure if we want it dynamic when setting the dwh interval. Yaniv?

Comment 12 Lukas Svaty 2017-09-13 06:19:00 UTC
Per bug the sampling interval was changed to 60 seconds and per comment#6 heartbeat is on 15 seconds. Where does 20 seconds come from?

[root@pm-rh40 ~]# engine-config -g DwhHeartBeatInterval
DwhHeartBeatInterval: 15 version: general

or am I missing something?

Comment 13 Yaniv Lavi 2017-09-18 17:20:30 UTC
(In reply to Shirly Radco from comment #11)
> If we change the engine heartbeat we can't support 20 seconds interval.
> Not sure if we want it dynamic when setting the dwh interval. Yaniv?

The heartbeat is on the engine side to know that the metrics are current. 
As long as it is lower than the collection interval, we should be ok.

Comment 14 Shirly Radco 2017-09-26 07:35:32 UTC
The dwh checks that the heartbeat timestamp is later than the last sampling/error timestamp in dwh_history_timekeeping in the engine db.

The default now is 60 sec.
If a user sets the sampling interval to 20 sec and we change DwhHeartBeatInterval back to 30 sec, then the dwh will not collect the data, since the heartbeat interval is not lower than 20 sec.
Should we move DwhHeartBeatInterval back to 30 seconds?
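
The constraint discussed here and in comment 13 can be sketched as a simple comparison. This is only an illustration of the rule as stated in this thread (the heartbeat interval must be lower than the sampling interval); the function name is hypothetical and the real check in the dwh compares timestamps in dwh_history_timekeeping.

```python
def heartbeat_supports_sampling(heartbeat_interval_s, sampling_interval_s):
    """True if the engine heartbeat is refreshed more often than the dwh samples."""
    return heartbeat_interval_s < sampling_interval_s


# Current values from this bug: heartbeat 15 sec, sampling 60 sec.
print(heartbeat_supports_sampling(15, 60))
# A user lowers sampling to 20 sec while the heartbeat stays at 15 sec.
print(heartbeat_supports_sampling(15, 20))
# The scenario raised above: heartbeat back at 30 sec, sampling at 20 sec.
print(heartbeat_supports_sampling(30, 20))
```

Only the last case fails the rule, which is the situation in which the dwh would stop collecting data.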

Comment 16 Yaniv Lavi 2017-10-29 17:05:29 UTC
(In reply to Shirly Radco from comment #14)
> The dwh checks that the heartbeat timestamp is later than the last
> sampling/error timestamp in dwh_history_timekeeping in the engine db.
> 
> The default now is 60 sec.
> If a user sets the sampling interval to 20 sec and we change
> DwhHeartBeatInterval back to 30 sec, then the dwh will not collect the data,
> since the heartbeat interval is not lower than 20 sec.
> Should we move DwhHeartBeatInterval back to 30 seconds?

no