Bug 1285788 - [RFE] Enable logging of dwh ETL process in debug mode
[RFE] Enable logging of dwh ETL process in debug mode
Status: CLOSED CURRENTRELEASE
Product: ovirt-engine-dwh
Classification: oVirt
Component: ETL (Show other bugs)
3.6.1
All All
unspecified Severity low (vote)
: ovirt-4.0.0-alpha
: 4.0.0
Assigned To: Shirly Radco
Lukas Svaty
: FutureFeature
Depends On:
Blocks: 1428245 1332915
  Show dependency treegraph
 
Reported: 2015-11-26 08:26 EST by Lukas Svaty
Modified: 2017-03-02 01:37 EST (History)
9 users (show)

See Also:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Added a DEBUG mode to log sampling, hourly, and daily job times in the ovirt-engine-dwhd.log file. To enable DEBUG mode, add a configuration file, for example /etc/ovirt-engine-dwhd/logging.conf, and add the value DWH_AGGREGATION_DEBUG=true in the file.
Story Points: ---
Clone Of:
: 1332915 (view as bug list)
Environment:
Last Closed: 2016-07-05 03:49:07 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Metrics
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt‑4.0.0+
gklein: testing_plan_complete-
ylavi: planning_ack+
sradco: devel_ack+
gklein: testing_ack+


Attachments (Terms of Use)


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 51645 master MERGED history: enabled logging of dwh ETL process in debug mode 2016-04-13 03:50 EDT
oVirt gerrit 54502 ovirt-engine-dwh-3.6 MERGED history: enabled logging of dwh ETL process in debug mode 2016-04-12 07:23 EDT

  None (edit)
Description Lukas Svaty 2015-11-26 08:26:45 EST
Description of problem:
For QA process automation it is necessary to check consistency of dwh database, and if the ETL process is working correctly within dwh. For this purpse it is required to check consistency of database snapshot before and after dwh ETL process. 

At the moment we are unable to track when the ETL process of dwh starts as it is running aggregation every minute/hour/day and only way to check this is by higher cpu consumption of dwh.

This should be included within logs. It was noted that adding this message every minute to logs would spam them, so I propose putting either all of them as DEBUG message which can be enabled by admin or just the minute/hour ETL start/stop to DEBUG and daily aggregation messages to INFO.


What is required:
Log messages like as DEBUG/INFO messages of dwh logs:
Minutely dwh ETL process aggregation started.
Minutely dwh ETL process aggregation stopped.

Hourly dwh ETL process aggregation started.
Hourly dwh ETL process aggregation stopped.

Daily dwh ETL process aggregation started.
Daily dwh ETL process aggregation stopped.
Comment 1 Yaniv Kaul 2015-11-26 08:33:20 EST
shirly - makes sense to me to add a DEBUG level log for this actions, so QE can (in their functionality tests) look for it.
Comment 2 Shirly Radco 2015-11-26 09:07:10 EST
(In reply to Yaniv Kaul from comment #1)
> shirly - makes sense to me to add a DEBUG level log for this actions, so QE
> can (in their functionality tests) look for it.

Yes. Already checking how to implement this requirement.
Comment 3 Shirly Radco 2015-12-13 04:14:19 EST
How whould you like the log files? 

Each log file for each aggregation (samples/hourly/daily) or
All in same file separated from the ovirt-engine-dwh.log file or
All in ovirt-engine-dwh.log?
Comment 4 Lukas Svaty 2015-12-14 02:07:55 EST
This seems to be like a suitable approach also for customer:
I would say added to the ovirt-engine-dwh.log as DEBUG message disabled by default. With config in /etc/ovirt-engine-dwhd/ containing information and configuration about logging.

Easier solution would be just to add these messages to ovirt-engine-dwhd.log when DEBUG is enabled on dwhd.

Both should be ok to handle within qa automation.
Comment 5 Shirly Radco 2015-12-14 07:16:44 EST
I believe that different aggregations that happen on different threads simultaneously will not be clear if sent all to the same log file.

Not sure I understand what you mean in the first option you suggested.
Comment 6 Lukas Svaty 2015-12-14 07:43:28 EST
We need to be able to enable DEBUG messages -> thus some config file for DWH should have this. I would put it to some config file for example 

/etc/ovirt-engine-dwhd/logging.conf

with values:
logging = DEBUG/WARNING/INFO/ERROR

Putting sample/hour/day logs to different files will create confusion. No need to create that many log files, putting this to ovirt-engine-dwhd.log is sufficient. These messages won't be visible unless the admin will specify that he wants to see DEBUG messages (what we will do in QA automation).

Regarding multithreaded aggregation and logs. Aggregation should not run anyway while other aggregation on the same database is running, so because of this we will avoid the timing problem. I guess enough fflush will do the job. To verify the order pls add time of the log messages.

Example flow:
1. change logging variable in /etc/ovirt-engine-dwhd/logging.conf from INFO to DEBUG
2. see logged messages in format, where INFO messages are already used, and debug will be implemented by this fix

2015-12-14 12:01:07 INFO ETL Service Stopped
2015-12-14 13:01:02 INFO ETL Service Started
2015-12-14 13:01:03 INFO ETL Service will now stop because a flag was raised to disconnect DWH on Engine DB
2015-12-14 00:00:00 DEBUG Aggregation on minutely tables started
2015-12-14 00:00:10 DEBUG Aggregation on minutely tables started
2015-12-14 00:00:12 DEBUG Aggregation on hourly tables started
2015-12-14 00:01:10 DEBUG Aggregation on hourly tables started
2015-12-14 00:00:12 DEBUG Aggregation on daily tables started
2015-12-14 00:01:10 DEBUG Aggregation on daily tables started
Comment 7 Shirly Radco 2016-01-24 03:51:36 EST
Any update on this?
Comment 8 Pavel Novotny 2016-01-25 06:59:08 EST
(In reply to Shirly Radco from comment #7)
> Any update on this?

We are working on it. But as we have to use an unofficial build to do a pre-testing again, it takes longer time than usual verification.
Comment 9 Lukas Svaty 2016-01-25 08:40:57 EST
Testing environment: Fedora 22 with fresh new engine and dwh (no reports installed). Tested on clean environment and on setup with 2hosts, 1nfs, 2 VMs up, 2VMs down.

1. Performance
Performence-wise this patch does not generate any cpu/memory consumption. 
However some tests with bigger environment might be necessary when this feature is disaled.

2. Log message
Format contains all the necessary information which is expected and is readable/parsable pretty easily.

3. Log frequency.
Start of aggregation log message is logged every 1 minute/hour/day.
What I see a problem in is the length of aggregation.

Does really the 1 hour aggregation lasts whole hour or is it done only at end of each hour?

See cut out log messages of hourly aggregation:

2016-01-12|18:00:00|hsoDyY|VlFHdx|KjUDQ7|16354|OVIRT_ENGINE_DWH|HourlyTimeKeepingJob|_Q9_uQNI6EeCOk4qSIFFDtw|3.6|Default||begin||
../omitted output...
2016-01-12|19:00:00|hsoDyY|VlFHdx|KjUDQ7|16354|OVIRT_ENGINE_DWH|HourlyTimeKeepingJob|_Q9_uQNI6EeCOk4qSIFFDtw|3.6|Default||end|success|3599964

By the IDs the log is suggesting that this aggregation started @6pm and ended @7pm. I believe it might have aggregated data from clean engine a bit quicker than exact 1 hour.

How can I track how much time did only the aggregation to tables take?
Does that means we are aggregating (running aggregation process) whole day to get the daily tables (which seems to me like resource wasting)?
Comment 10 Shirly Radco 2016-03-10 03:26:55 EST
Example for the log in DEBUG mode:

2016-03-09 22:00:00|3VRs7n|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|HourlyTimeKeepingJob|_Q9_uQNI6EeCOk4qSIFFDtw|3.6|Default||begin||
2016-03-09 22:01:28 Sampeling jobs ended. Duration: 115345 milliseconds
2016-03-09 22:01:28|g4WTw7|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|3.6|Default||end|success|115346
2016-03-09 22:01:28|Xre7kb|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|3.6|Default||begin||
2016-03-09 22:01:45 Aggregation to Hourly ended.
2016-03-09 22:03:17 Sampeling jobs ended. Duration: 108699 milliseconds
2016-03-09 22:03:17|Xre7kb|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|3.6|Default||end|success|108700
2016-03-09 22:03:17|5mO6o0|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|3.6|Default||begin||
2016-03-09 22:05:04 Sampeling jobs ended. Duration: 107059 milliseconds
2016-03-09 22:05:04|5mO6o0|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|3.6|Default||end|success|107060
2016-03-09

.....


2016-03-09 23:00:00|3VRs7n|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|HourlyTimeKeepingJob|_Q9_uQNI6EeCOk4qSIFFDtw|3.6|Default||end|success|3599996
2016-03-09 23:00:00|fQ4i9B|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|HourlyTimeKeepingJob|_Q9_uQNI6EeCOk4qSIFFDtw|3.6|Default||begin||

...
Comment 11 Lukas Svaty 2016-03-10 08:47:03 EST
2016-03-09 22:01:45 Aggregation to Hourly ended.
Thank you I'll be able to work with that. However few notes:
Can we have the same format as the rest of the log? 
Also do we have this for daily and minutely aggregation?
ID of which hourly aggregation ended?

I suggest something like this:
2016-03-09 22:00:00|3VRs7n|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|HourlyTimeKeepingJob|_Q9_uQNI6EeCOk4qSIFFDtw|3.6|Default||end_of_aggregation||

Next new log message is 

2016-03-09 22:03:17 Sampeling jobs ended. Duration: 108699 milliseconds

Which sampling job? Hourly, Daily, Minutely? Again format consistency. Id of aggregation since we have them.
Sampeling typo => Sampling.
What is this message trying to tell me, it seems like a number that I can see in next log message:
22:05:04|5mO6o0|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqEeCaj-T1n0SCFw|3.6|Default||end|success|107060
So I believe it's not necessary.
Comment 12 Shirly Radco 2016-03-10 09:07:20 EST
(In reply to Lukas Svaty from comment #11)
> 2016-03-09 22:01:45 Aggregation to Hourly ended.
> Thank you I'll be able to work with that. However few notes:
> Can we have the same format as the rest of the log? 
> Also do we have this for daily and minutely aggregation?
> ID of which hourly aggregation ended?
> 
> I suggest something like this:
> 2016-03-09
> 22:00:
> 00|3VRs7n|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|HourlyTimeKeepingJob|_Q9_uQNI6E
> eCOk4qSIFFDtw|3.6|Default||end_of_aggregation||
> 

I don't have all they details for this point.

> Next new log message is 
> 
> 2016-03-09 22:03:17 Sampeling jobs ended. Duration: 108699 milliseconds
> 
> Which sampling job? Hourly, Daily, Minutely? Again format consistency. Id of
> aggregation since we have them.
> Sampeling typo => Sampling.

Sampling is for minutes, but can be change according to user settings to be less.
This is why the job is called SampleTimeKeepingJob.
The duration is not the same.

I'm currently investigating why it is taking much longer than it should.


> What is this message trying to tell me, it seems like a number that I can
> see in next log message:
> 22:05:
> 04|5mO6o0|BdYeaY|AaeemT|7393|OVIRT_ENGINE_DWH|SampleTimeKeepingJob|_FvEy8LzqE
> eCaj-T1n0SCFw|3.6|Default||end|success|107060
> So I believe it's not necessary.
Comment 13 Mike McCune 2016-03-28 19:36:13 EDT
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune@redhat.com with any questions
Comment 14 Lukas Svaty 2016-05-13 06:14:17 EDT
verified in ovirt-engine-dwh-4.0.0-1.el7ev.noarch
Comment 15 Sandro Bonazzola 2016-07-05 03:49:07 EDT
oVirt 4.0.0 has been released, closing current release.

Note You need to log in before you can comment on or make changes to this bug.