Bug 1285788

Summary: [RFE] Enable logging of dwh ETL process in debug mode
Product: [oVirt] ovirt-engine-dwh Reporter: Lukas Svaty <lsvaty>
Component: ETLAssignee: Shirly Radco <sradco>
Status: CLOSED CURRENTRELEASE QA Contact: Lukas Svaty <lsvaty>
Severity: low Docs Contact:
Priority: unspecified    
Version: 3.6.1CC: bgraveno, bugs, lsvaty, mkalinin, pnovotny, sbonazzo, sradco, ykaul, ylavi
Target Milestone: ovirt-4.0.0-alphaKeywords: FutureFeature
Target Release: 4.0.0Flags: rule-engine: ovirt-4.0.0+
gklein: testing_plan_complete-
ylavi: planning_ack+
sradco: devel_ack+
gklein: testing_ack+
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: Enhancement
Doc Text:
Added a DEBUG mode to log sampling, hourly, and daily job times in the ovirt-engine-dwhd.log file. To enable DEBUG mode, add a configuration file, for example /etc/ovirt-engine-dwhd/logging.conf, and add the value DWH_AGGREGATION_DEBUG=true in the file.
Story Points: ---
Clone Of:
: 1332915 (view as bug list) Environment:
Last Closed: 2016-07-05 07:49:07 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Metrics RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1332915, 1428245    

Description Lukas Svaty 2015-11-26 13:26:45 UTC
Description of problem:
For QA process automation it is necessary to check consistency of dwh database, and if the ETL process is working correctly within dwh. For this purpse it is required to check consistency of database snapshot before and after dwh ETL process. 

At the moment we are unable to track when the ETL process of dwh starts as it is running aggregation every minute/hour/day and only way to check this is by higher cpu consumption of dwh.

This should be included within logs. It was noted that adding this message every minute to logs would spam them, so I propose putting either all of them as DEBUG message which can be enabled by admin or just the minute/hour ETL start/stop to DEBUG and daily aggregation messages to INFO.


What is required:
Log messages like as DEBUG/INFO messages of dwh logs:
Minutely dwh ETL process aggregation started.
Minutely dwh ETL process aggregation stopped.

Hourly dwh ETL process aggregation started.
Hourly dwh ETL process aggregation stopped.

Daily dwh ETL process aggregation started.
Daily dwh ETL process aggregation stopped.

Comment 1 Yaniv Kaul 2015-11-26 13:33:20 UTC
shirly - makes sense to me to add a DEBUG level log for this actions, so QE can (in their functionality tests) look for it.

Comment 2 Shirly Radco 2015-11-26 14:07:10 UTC
(In reply to Yaniv Kaul from comment #1)
> shirly - makes sense to me to add a DEBUG level log for this actions, so QE
> can (in their functionality tests) look for it.

Yes. Already checking how to implement this requirement.

Comment 3 Shirly Radco 2015-12-13 09:14:19 UTC
How whould you like the log files? 

Each log file for each aggregation (samples/hourly/daily) or
All in same file separated from the ovirt-engine-dwh.log file or
All in ovirt-engine-dwh.log?

Comment 4 Lukas Svaty 2015-12-14 07:07:55 UTC
This seems to be like a suitable approach also for customer:
I would say added to the ovirt-engine-dwh.log as DEBUG message disabled by default. With config in /etc/ovirt-engine-dwhd/ containing information and configuration about logging.

Easier solution would be just to add these messages to ovirt-engine-dwhd.log when DEBUG is enabled on dwhd.

Both should be ok to handle within qa automation.

Comment 5 Shirly Radco 2015-12-14 12:16:44 UTC
I believe that different aggregations that happen on different threads simultaneously will not be clear if sent all to the same log file.

Not sure I understand what you mean in the first option you suggested.

Comment 6 Lukas Svaty 2015-12-14 12:43:28 UTC
We need to be able to enable DEBUG messages -> thus some config file for DWH should have this. I would put it to some config file for example 

/etc/ovirt-engine-dwhd/logging.conf

with values:
logging = DEBUG/WARNING/INFO/ERROR

Putting sample/hour/day logs to different files will create confusion. No need to create that many log files, putting this to ovirt-engine-dwhd.log is sufficient. These messages won't be visible unless the admin will specify that he wants to see DEBUG messages (what we will do in QA automation).

Regarding multithreaded aggregation and logs. Aggregation should not run anyway while other aggregation on the same database is running, so because of this we will avoid the timing problem. I guess enough fflush will do the job. To verify the order pls add time of the log messages.

Example flow:
1. change logging variable in /etc/ovirt-engine-dwhd/logging.conf from INFO to DEBUG
2. see logged messages in format, where INFO messages are already used, and debug will be implemented by this fix

2015-12-14 12:01:07 INFO ETL Service Stopped
2015-12-14 13:01:02 INFO ETL Service Started
2015-12-14 13:01:03 INFO ETL Service will now stop because a flag was raised to disconnect DWH on Engine DB
2015-12-14 00:00:00 DEBUG Aggregation on minutely tables started
2015-12-14 00:00:10 DEBUG Aggregation on minutely tables started
2015-12-14 00:00:12 DEBUG Aggregation on hourly tables started
2015-12-14 00:01:10 DEBUG Aggregation on hourly tables started
2015-12-14 00:00:12 DEBUG Aggregation on daily tables started
2015-12-14 00:01:10 DEBUG Aggregation on daily tables started

Comment 7 Shirly Radco 2016-01-24 08:51:36 UTC
Any update on this?

Comment 8 Pavel Novotny 2016-01-25 11:59:08 UTC
(In reply to Shirly Radco from comment #7)
> Any update on this?

We are working on it. But as we have to use an unofficial build to do a pre-testing again, it takes longer time than usual verification.

Comment 9 Lukas Svaty 2016-01-25 13:40:57 UTC
Testing environment: Fedora 22 with fresh new engine and dwh (no reports installed). Tested on clean environment and on setup with 2hosts, 1nfs, 2 VMs up, 2VMs down.

1. Performance
Performence-wise this patch does not generate any cpu/memory consumption. 
However some tests with bigger environment might be necessary when this feature is disaled.

2. Log message
Format contains all the necessary information which is expected and is readable/parsable pretty easily.

3. Log frequency.
Start of aggregation log message is logged every 1 minute/hour/day.
What I see a problem in is the length of aggregation.

Does really the 1 hour aggregation lasts whole hour or is it done only at end of each hour?

See cut out log messages of hourly aggregation:

2016-01-12|18:00:00|hsoDyY|VlFHdx|KjUDQ7|16354|OVIRT_ENGINE_DWH|HourlyTimeKeepingJob|_Q9_uQNI6EeCOk4qSIFFDtw|3.6|Default||begin||
../omitted output...
2016-01-12|19:00:00|hsoDyY|VlFHdx|KjUDQ7|16354|OVIRT_ENGINE_DWH|HourlyTimeKeepingJob|_Q9_uQNI6EeCOk4qSIFFDtw|3.6|Default||end|success|3599964

By the IDs the log is suggesting that this aggregation started @6pm and ended @7pm. I believe it might have aggregated data from clean engine a bit quicker than exact 1 hour.

How can I track how much time did only the aggregation to tables take?
Does that means we are aggregating (running aggregation process) whole day to get the daily tables (which seems to me like resource wasting)?

Comment 10 Shirly Radco 2016-03-10 08:26:55 UTC
Example for the log in DEBUG mode:

2016-03-09 22:00:00|3VRs7n|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|HourlyTimeKeepingJob|_Q9_uQNI6EeCOk4qSIFFDtw|3.6|Default||begin||
2016-03-09 22:01:28 Sampeling jobs ended. Duration: 115345 milliseconds
2016-03-09 22:01:28|g4WTw7|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|3.6|Default||end|success|115346
2016-03-09 22:01:28|Xre7kb|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|3.6|Default||begin||
2016-03-09 22:01:45 Aggregation to Hourly ended.
2016-03-09 22:03:17 Sampeling jobs ended. Duration: 108699 milliseconds
2016-03-09 22:03:17|Xre7kb|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|3.6|Default||end|success|108700
2016-03-09 22:03:17|5mO6o0|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|3.6|Default||begin||
2016-03-09 22:05:04 Sampeling jobs ended. Duration: 107059 milliseconds
2016-03-09 22:05:04|5mO6o0|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|3.6|Default||end|success|107060
2016-03-09

.....


2016-03-09 23:00:00|3VRs7n|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|HourlyTimeKeepingJob|_Q9_uQNI6EeCOk4qSIFFDtw|3.6|Default||end|success|3599996
2016-03-09 23:00:00|fQ4i9B|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|HourlyTimeKeepingJob|_Q9_uQNI6EeCOk4qSIFFDtw|3.6|Default||begin||

...

Comment 11 Lukas Svaty 2016-03-10 13:47:03 UTC
2016-03-09 22:01:45 Aggregation to Hourly ended.
Thank you I'll be able to work with that. However few notes:
Can we have the same format as the rest of the log? 
Also do we have this for daily and minutely aggregation?
ID of which hourly aggregation ended?

I suggest something like this:
2016-03-09 22:00:00|3VRs7n|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|HourlyTimeKeepingJob|_Q9_uQNI6EeCOk4qSIFFDtw|3.6|Default||end_of_aggregation||

Next new log message is 

2016-03-09 22:03:17 Sampeling jobs ended. Duration: 108699 milliseconds

Which sampling job? Hourly, Daily, Minutely? Again format consistency. Id of aggregation since we have them.
Sampeling typo => Sampling.
What is this message trying to tell me, it seems like a number that I can see in next log message:
22:05:04|5mO6o0|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|3.6|Default||end|success|107060
So I believe it's not necessary.

Comment 12 Shirly Radco 2016-03-10 14:07:20 UTC
(In reply to Lukas Svaty from comment #11)
> 2016-03-09 22:01:45 Aggregation to Hourly ended.
> Thank you I'll be able to work with that. However few notes:
> Can we have the same format as the rest of the log? 
> Also do we have this for daily and minutely aggregation?
> ID of which hourly aggregation ended?
> 
> I suggest something like this:
> 2016-03-09
> 22:00:
> 00|3VRs7n|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|HourlyTimeKeepingJob|_Q9_uQNI6E
> eCOk4qSIFFDtw|3.6|Default||end_of_aggregation||
> 

I don't have all they details for this point.

> Next new log message is 
> 
> 2016-03-09 22:03:17 Sampeling jobs ended. Duration: 108699 milliseconds
> 
> Which sampling job? Hourly, Daily, Minutely? Again format consistency. Id of
> aggregation since we have them.
> Sampeling typo => Sampling.

Sampling is for minutes, but can be change according to user settings to be less.
This is why the job is called SampleTimeKeepingJob.
The duration is not the same.

I'm currently investigating why it is taking much longer than it should.


> What is this message trying to tell me, it seems like a number that I can
> see in next log message:
> 22:05:
> 04|5mO6o0|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqE
> eCaj-T1n0SCFw|3.6|Default||end|success|107060
> So I believe it's not necessary.

Comment 13 Mike McCune 2016-03-28 23:36:13 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions

Comment 14 Lukas Svaty 2016-05-13 10:14:17 UTC
verified in ovirt-engine-dwh-4.0.0-1.el7ev.noarch

Comment 15 Sandro Bonazzola 2016-07-05 07:49:07 UTC
oVirt 4.0.0 has been released, closing current release.