Bug 1371111

Summary: update dwh heartbeat error message to alert only after it did not update for a minute
Product: [oVirt] ovirt-engine-dwh Reporter: Shirly Radco <sradco>
Component: ETL Assignee: Shirly Radco <sradco>
Status: CLOSED CURRENTRELEASE QA Contact: Lukas Svaty <lsvaty>
Severity: high Docs Contact:
Priority: medium    
Version: --- CC: bugs, jbryant, kshukla, mburman, mgoldboi, mperina, mtessun, shipatil, sradco, trefex, usurse, ylavi
Target Milestone: ovirt-4.1.1-1 Flags: rule-engine: ovirt-4.1+, rule-engine: exception+
Target Release: 4.1.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: In some environments, the audit log filled up with dwh errors because the dwh heartbeat was not updating within the required interval. Consequence: Many errors appeared in the admin portal audit log. Fix: The error is now sent to the audit log only if the dwh was unable to sample data for at least a minute due to the dwh heartbeat not updating. Result:
Story Points: ---
Clone Of:
: 1373456 1430666 Environment:
Last Closed: 2017-04-21 09:37:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Metrics RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1373456, 1390631, 1430666    
Attachments:
Description Flags
engine.log none

Description Shirly Radco 2016-08-29 11:03:57 UTC
Description of problem:
The engine heartbeat should update every 15 seconds, but in some cases it may take longer.
If it takes longer than 20 seconds, the dwh alerts with
"Can not sample data, oVirt Engine is not updating the statistics".

Version-Release number of selected component (if applicable):
4.0.2

How reproducible:


Steps to Reproduce:
1. Put the engine machine, with dwh installed, under load.

Actual results:
Multiple "Can not sample data, oVirt Engine is not updating the statistics" errors appear in the log.
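
A minimal sketch of why a loaded engine produces repeated errors under the current check. It only uses the 20-second limit from the description; the function name `count_alerts` and the sample heartbeat ages are hypothetical and are not taken from the dwh code.

```python
from datetime import timedelta

# Limit taken from the description: dwh alerts once the heartbeat is older than this.
STALE_AFTER = timedelta(seconds=20)


def count_alerts(heartbeat_ages):
    """Count how many sampling attempts would raise the error.

    heartbeat_ages: age of the engine heartbeat as seen at each sampling
    attempt (hypothetical input, for illustration only).
    """
    return sum(1 for age in heartbeat_ages if age > STALE_AFTER)


# Hypothetical ages observed while the engine machine is under load.
ages = [timedelta(seconds=s) for s in (10, 25, 30, 18, 22)]
print(count_alerts(ages))  # every attempt over 20 seconds alerts separately
```

Each sampling attempt that finds the heartbeat older than 20 seconds raises its own error, so short delays under load are enough to produce many of them.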

Expected results:
The dwh should not alert each time. It should wait for about a minute before alerting,
to allow the connection to recover and to avoid flooding the user with errors.

Additional info:

Comment 5 Eli Mesika 2016-11-09 10:54:54 UTC
Can we have an engine log with DEBUG messages attached, so we can check which part of the code is responsible for this?
I added DEBUG messages in patch https://gerrit.ovirt.org/#/c/64139/ to figure out what's going on.

Comment 6 Shirly Radco 2016-11-29 10:55:18 UTC
Created attachment 1225785
engine.log

Comment 7 Shirly Radco 2016-11-29 10:59:46 UTC
Please use this link due to file size.
engine.log: https://drive.google.com/open?id=0B8qzHycX6vljVlg5dVYzMHVGMkk

Comment 8 Shirly Radco 2017-03-08 13:22:04 UTC
*** Bug 1425868 has been marked as a duplicate of this bug. ***

Comment 12 Oved Ourfali 2017-03-09 09:44:22 UTC
I have changed the title to reflect the upcoming changes.

Comment 13 Shirly Radco 2017-03-09 14:24:33 UTC
With this fix, the error message is sent to the audit log only if the heartbeat has not updated for at least a minute since the last sampling.

The error message is still written to the dwh log every time, since each occurrence means that a sampling was missed.
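
A rough sketch of the behavior described in this comment, not the actual patch. It assumes the 20-second limit from the bug description still decides when a sampling counts as missed, and that "at least a minute" is measured as the age of the heartbeat; the names `destinations_for`, `MISSED_AFTER` and `AUDIT_AFTER` are hypothetical.

```python
from datetime import timedelta

MISSED_AFTER = timedelta(seconds=20)  # assumption: limit from the bug description
AUDIT_AFTER = timedelta(minutes=1)    # "at least a minute", per this comment


def destinations_for(heartbeat_age):
    """Return where the 'Can not sample data' error would be reported.

    heartbeat_age: time since the engine heartbeat last updated
    (hypothetical input, for illustration only).
    """
    destinations = []
    if heartbeat_age > MISSED_AFTER:
        # A sampling was missed: always recorded in the dwh log.
        destinations.append("dwh_log")
        if heartbeat_age >= AUDIT_AFTER:
            # Only a delay of a minute or more reaches the audit log.
            destinations.append("audit_log")
    return destinations


for seconds in (15, 25, 60):
    print(seconds, destinations_for(timedelta(seconds=seconds)))
```

Under these assumptions, a 25-second delay is still recorded in the dwh log but no longer reaches the audit log.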

Comment 14 Shirly Radco 2017-03-09 14:28:36 UTC
*** Bug 1425868 has been marked as a duplicate of this bug. ***

Comment 15 Shirly Radco 2017-03-19 15:37:20 UTC
*** Bug 1433101 has been marked as a duplicate of this bug. ***

Comment 16 Lukas Svaty 2017-03-30 12:25:40 UTC
Verified in ovirt-engine-dwh-4.1.1-1.el7ev.noarch

I was not able to reproduce this message on numerous setups. If you encounter it again, please reopen this bug, and we should consider either extending the timeout or adjusting it based on the environment.