Bug 1744752

Summary: MachineMAOMetricsDown is not firing when MAO is down
Product: OpenShift Container Platform
Reporter: Pawel Krupa <pkrupa>
Component: Cloud Compute
Assignee: Alberto <agarcial>
Status: CLOSED ERRATA
QA Contact: Jianwei Hou <jhou>
Severity: medium
Priority: unspecified
Version: 4.2.0
CC: agarcial
Target Milestone: ---   
Target Release: 4.2.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2019-10-16 06:37:16 UTC
Type: Bug
Attachments:
Description: screenshot
Flags: none

Description Pawel Krupa 2019-08-22 19:22:55 UTC
Description of problem:

The alerting rule responsible for checking whether MAO is up doesn't fire when MAO is down.


Version-Release number of selected component (if applicable): 
OpenShift 4.2.0-0.ci-2019-08-22-144620


How reproducible:

Start a cluster and scale MAO to 0


Steps to Reproduce:
1. Scale CVO to 0 or add an exception for MAO (otherwise CVO reverts the change)
2. Scale MAO and MAC to 0
3. Check the Prometheus UI to see if the alert is firing
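
Step 3 can also be checked without the UI, since Prometheus lists active alerts in JSON at `/api/v1/alerts`. The following is only a sketch of that check: the helper name `is_alert_firing` and the sample payload are made up for illustration, and fetching the payload from a real cluster (route, token) is left out.

```python
import json
from typing import Any, Dict


def is_alert_firing(payload: Dict[str, Any], alert_name: str) -> bool:
    """Return True if the parsed /api/v1/alerts response contains the
    named alert in state "firing" (hypothetical helper)."""
    alerts = payload.get("data", {}).get("alerts", [])
    return any(
        alert.get("labels", {}).get("alertname") == alert_name
        and alert.get("state") == "firing"
        for alert in alerts
    )


# Made-up sample response, shaped like the Prometheus alerts API output.
SAMPLE_RESPONSE = json.loads("""
{
  "status": "success",
  "data": {
    "alerts": [
      {
        "labels": {"alertname": "MachineMAOMetricsDown", "severity": "critical"},
        "annotations": {},
        "state": "firing",
        "activeAt": "2019-09-03T07:00:00Z",
        "value": "1"
      }
    ]
  }
}
""")

if __name__ == "__main__":
    print(is_alert_firing(SAMPLE_RESPONSE, "MachineMAOMetricsDown"))
```

An alert in state "pending" (still inside its `for` window) is deliberately not counted as firing here.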

Actual results:

MachineMAOMetricsDown is not firing

Expected results:

MachineMAOMetricsDown is firing



Additional info:

Prometheus cannot alert on a metric that has no samples. When MAO is down, `mapi_mao_collector_up` is not exposed at all, so an expression like `mapi_mao_collector_up == 0` returns an empty result and the alert never fires. To check whether a target is down, you need to use the `absent()` function.

Changing the alerting rule from `mapi_mao_collector_up == 0` to `absent(up{job="machine-api-operator"} == 1)` should solve this issue. This change also makes the `mapi_mao_collector_up` metric unnecessary.
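
To illustrate why the two expressions behave differently, here is a toy Python model of the relevant PromQL semantics. It is a sketch, not Prometheus code: the function names are made up, labels on the `absent()` result are ignored, and it assumes that with MAO scaled to 0 neither `mapi_mao_collector_up` nor `up{job="machine-api-operator"}` has any series.

```python
from typing import Dict, List, Tuple

# Toy "instant vector": a list of (labels, value) samples.
Sample = Tuple[Dict[str, str], float]
Vector = List[Sample]


def eq_filter(vector: Vector, scalar: float) -> Vector:
    """Mimic `vector == scalar`: keep only samples whose value equals scalar."""
    return [(labels, value) for labels, value in vector if value == scalar]


def absent(vector: Vector) -> Vector:
    """Mimic `absent(v)`: one sample with value 1 if v is empty, else empty."""
    return [({}, 1.0)] if not vector else []


def fires(vector: Vector) -> bool:
    """An alert is active only if its expression returns at least one sample."""
    return len(vector) > 0


JOB = {"job": "machine-api-operator"}

# MAO scaled to 0: no series exist for either metric (assumption, see above).
mao_down: Vector = []
# MAO running and scraped successfully.
mao_up: Vector = [(JOB, 1.0)]
# Target still discovered, but the scrape fails.
scrape_failing: Vector = [(JOB, 0.0)]

if __name__ == "__main__":
    # Old rule: mapi_mao_collector_up == 0
    print("old rule, MAO down:", fires(eq_filter(mao_down, 0.0)))
    # New rule: absent(up{job="machine-api-operator"} == 1)
    print("new rule, MAO down:", fires(absent(eq_filter(mao_down, 1.0))))
    print("new rule, MAO up:", fires(absent(eq_filter(mao_up, 1.0))))
    print("new rule, scrape failing:", fires(absent(eq_filter(scrape_failing, 1.0))))
```

In this model the old rule stays silent when the series disappears, while the new rule fires both when the series is missing and when the target is present but reports `up == 0`.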

Comment 2 Jianwei Hou 2019-09-03 07:30:02 UTC
Created attachment 1611034
screenshot

Verified in 4.2.0-0.nightly-2019-09-02-172410: the alert fires when machine-api-operator is down.

Comment 4 errata-xmlrpc 2019-10-16 06:37:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922