Bug 1371111 - update dwh heartbeat error message to alert only after it did not update for a minute
Summary: update dwh heartbeat error message to alert only after it did not update for ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine-dwh
Classification: oVirt
Component: ETL
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ovirt-4.1.1-1
Target Release: 4.1.1
Assignee: Shirly Radco
QA Contact: Lukas Svaty
URL:
Whiteboard:
Duplicates: 1425868 1433101
Depends On:
Blocks: 1373456 1390631 1430666
 
Reported: 2016-08-29 11:03 UTC by Shirly Radco
Modified: 2021-03-11 15:05 UTC (History)

Fixed In Version:
Clone Of:
Clones: 1373456 1430666
Environment:
Last Closed: 2017-04-21 09:37:37 UTC
oVirt Team: Metrics
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: exception+


Attachments
engine.log (61 bytes, text/plain)
2016-11-29 10:55 UTC, Shirly Radco


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3024741 0 None None None 2017-08-02 06:09:52 UTC
oVirt gerrit 73827 0 'None' MERGED history: heartbeat error message interval 2021-01-07 20:03:28 UTC
oVirt gerrit 74046 0 'None' MERGED history: heartbeat error message interval 2021-01-07 20:02:51 UTC

Description Shirly Radco 2016-08-29 11:03:57 UTC
Description of problem:
The engine heartbeat should update every 15 seconds, but in some cases it may take longer.
If it takes longer than 20 seconds, the dwh will alert
"Can not sample data, oVirt Engine is not updating the statistics".

Version-Release number of selected component (if applicable):
4.0.2

How reproducible:


Steps to Reproduce:
1. Put the engine machine, with dwh installed, under heavy load.

Actual results:
Multiple "Can not sample data, oVirt Engine is not updating the statistics" errors appear in the log.

Expected results:
The dwh should not alert each time. It should wait about a minute before alerting,
to allow the connection to recover and to avoid flooding the user with errors.

Additional info:

Comment 5 Eli Mesika 2016-11-09 10:54:54 UTC
Can we have the engine log with DEBUG messages attached, so we can check which part of the code is responsible for this?
I added DEBUG messages to figure out what's going on in patch https://gerrit.ovirt.org/#/c/64139/

Comment 6 Shirly Radco 2016-11-29 10:55:18 UTC
Created attachment 1225785
engine.log

Comment 7 Shirly Radco 2016-11-29 10:59:46 UTC
Please use this link due to file size.
engine.log: https://drive.google.com/open?id=0B8qzHycX6vljVlg5dVYzMHVGMkk

Comment 8 Shirly Radco 2017-03-08 13:22:04 UTC
*** Bug 1425868 has been marked as a duplicate of this bug. ***

Comment 12 Oved Ourfali 2017-03-09 09:44:22 UTC
I have changed the title to reflect the upcoming changes.

Comment 13 Shirly Radco 2017-03-09 14:24:33 UTC
This fix changes the error message sent to the audit log so that it is sent only if the heartbeat has not updated for at least a minute since the last sampling.

The error messages are still written to the dwh log each time, since each one means that a sampling was missed.
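
For readers following along, here is a minimal Python sketch of the behavior described in this comment. It is not the actual dwh ETL code: the class name, the two constants, and the in-memory log lists are all hypothetical, and it assumes that "a minute" is measured as the age of the last heartbeat at sampling time. It also assumes the audit log message repeats on every sampling once the heartbeat is a minute stale, which the comment does not specify.

```python
from dataclasses import dataclass, field
from typing import List

# Thresholds taken from this bug's description; the names are hypothetical.
SAMPLING_STALE_AFTER = 20.0  # seconds before a sampling counts as missed
AUDIT_ALERT_AFTER = 60.0     # seconds before the audit log is alerted

MESSAGE = "Can not sample data, oVirt Engine is not updating the statistics"


@dataclass
class HeartbeatMonitor:
    """Illustrative stand-in for the dwh sampling check (not the real code)."""

    dwh_log: List[str] = field(default_factory=list)
    audit_log: List[str] = field(default_factory=list)

    def check(self, now: float, last_heartbeat: float) -> None:
        """Record errors according to how stale the engine heartbeat is."""
        age = now - last_heartbeat
        if age <= SAMPLING_STALE_AFTER:
            return  # heartbeat is fresh enough; sampling proceeds normally
        # Every missed sampling is still written to the dwh log.
        self.dwh_log.append(MESSAGE)
        # The audit log is alerted only once the heartbeat is a minute stale.
        if age >= AUDIT_ALERT_AFTER:
            self.audit_log.append(MESSAGE)
```

With this split, a heartbeat that is 25 seconds old adds an entry to the dwh log only, while one that is 70 seconds old adds an entry to both logs.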

Comment 14 Shirly Radco 2017-03-09 14:28:36 UTC
*** Bug 1425868 has been marked as a duplicate of this bug. ***

Comment 15 Shirly Radco 2017-03-19 15:37:20 UTC
*** Bug 1433101 has been marked as a duplicate of this bug. ***

Comment 16 Lukas Svaty 2017-03-30 12:25:40 UTC
Verified in ovirt-engine-dwh-4.1.1-1.el7ev.noarch.

I was not able to see this message in numerous setups. If you encounter this message again, please reopen this bug, and we should consider either extending the timeout or adjusting it based on the environment.

