Bug 1941592 - HAProxyDown not Firing
Summary: HAProxyDown not Firing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Stephen Greene
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-03-22 12:56 UTC by Apurva Nisal
Modified: 2021-07-27 22:54 UTC (History)
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The HAProxyDown alert message was vague.
Consequence: End users thought the HAProxyDown alert meant that the router pods were not available, when it specifically means that HAProxy is down.
Fix: Make the HAProxyDown alert message more detailed.
Clone Of:
Environment:
Last Closed: 2021-07-27 22:54:36 UTC
Target Upstream Version:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-ingress-operator pull 597 | 0 | None | open | Bug 1941592: Alerts: Fix up HAProxyDown Alert Message | 2021-04-12 13:49:11 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:54:57 UTC

Description Apurva Nisal 2021-03-22 12:56:44 UTC
Description of problem:

The HAProxyDown alert does not fire when all router pods (-n openshift-ingress) are down, or when all nodes on which router pods are scheduled are down.

Version-Release number of selected component (if applicable):
RHOCP 4.6

Actual results:
HAProxyDown not Firing 

Expected results:
HAProxyDown should be Firing

Comment 2 Andrew McDermott 2021-03-23 18:04:44 UTC
The HAProxyDown alert fires when haproxy is down, not when there are no openshift router pods running.
We will fix the message so that it reports that "haproxy is down" to avoid confusion.

ClusterOperatorDegraded and ClusterOperatorDown alerts should fire if no router pods are scheduled or running.

For example:

https://github.com/openshift/cluster-version-operator/blob/master/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L73-L88
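
The distinction above can be illustrated with a toy model. This is a minimal sketch, not Prometheus code: it assumes the alert expression is `haproxy_up == 0` (as shown in the PrometheusRule output later in this bug) and relies on the fact that a PromQL comparison filters existing series, so an empty input yields an empty result. The helper names and the pod name are hypothetical, and the `for: 5m` hold period is ignored.

```python
from typing import Dict, Tuple

# A series is identified by its label set; the value is the latest sample.
Labels = Tuple[Tuple[str, str], ...]


def filter_eq(samples: Dict[Labels, float], value: float) -> Dict[Labels, float]:
    """Mimic `metric == value`: keep only the series whose sample equals value."""
    return {labels: v for labels, v in samples.items() if v == value}


def alert_fires(samples: Dict[Labels, float], value: float = 0.0) -> bool:
    """An alert has something to fire on only if the filtered result is non-empty."""
    return len(filter_eq(samples, value)) > 0


# Hypothetical router pod name, used only for illustration.
router = (("namespace", "openshift-ingress"), ("pod", "router-default-example"))

haproxy_down = {router: 0.0}  # pod is scraped, HAProxy reports down
healthy = {router: 1.0}       # pod is scraped, HAProxy reports up
no_pods: Dict[Labels, float] = {}  # no router pods, so no haproxy_up series at all

print(alert_fires(haproxy_down))  # True
print(alert_fires(healthy))       # False
print(alert_fires(no_pods))       # False
```

In the third case there is no series for the comparison to match, which is consistent with the reported behavior: with every router pod gone, HAProxyDown has nothing to fire on.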

Comment 3 Stephen Greene 2021-03-31 19:31:05 UTC
I will work on this bug during the 4.8 bug fix phase.

Comment 5 jechen 2021-04-21 19:32:28 UTC
Attempted to verify in 4.8.0-0.nightly-2021-04-21-084059. Pull #597 is listed in the release status for this build, but the Prometheus rule definition still has the old description: "HAProxy metrics are reporting that the router is down". I suspect pull #597 is not in this build, so I will wait for the next build to verify.

Comment 6 jechen 2021-04-21 23:46:43 UTC
Verified https://github.com/openshift/cluster-ingress-operator/pull/597 in 4.8.0-0.nightly-2021-04-21-172405.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-21-172405   True        False         42m     Cluster version is 4.8.0-0.nightly-2021-04-21-172405



$ oc -n openshift-ingress-operator get PrometheusRule -oyaml
<--snip-->
      rules:
      - alert: HAProxyReloadFail
        annotations:
          message: HAProxy reloads are failing on {{ $labels.pod }}. Router is not respecting recently created or modified routes
        expr: template_router_reload_failure == 1
        for: 5m
        labels:
          severity: warning
      - alert: HAProxyDown
        annotations:
          message: HAProxy metrics are reporting that HAProxy is down on pod {{ $labels.namespace }} / {{ $labels.pod }}    <--verified https://github.com/openshift/cluster-ingress-operator/pull/597/
        expr: haproxy_up == 0
        for: 5m
        labels:
          severity: critical
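
A check like the one above could also be scripted. The following is a minimal sketch, assuming the rules are exported as JSON (for example with `-o json` instead of `-oyaml`) and follow the PrometheusRule layout of `spec.groups[].rules[]`. The helper names, the group name, and the embedded sample document are hypothetical. The sample mirrors the rule shown above.

```python
import json
from typing import Any, Dict, Iterator, Optional


def iter_rules(doc: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every rule from a PrometheusRule object or a List of them."""
    # A List wraps its objects in "items"; a single object is treated as a list of one.
    for item in doc.get("items", [doc]):
        for group in item.get("spec", {}).get("groups", []):
            for rule in group.get("rules", []):
                yield rule


def find_alert(doc: Dict[str, Any], name: str) -> Optional[Dict[str, Any]]:
    """Return the first alerting rule with the given name, or None."""
    for rule in iter_rules(doc):
        if rule.get("alert") == name:
            return rule
    return None


# Hypothetical sample; the group name "example.rules" is made up.
sample = json.loads("""
{
  "kind": "List",
  "items": [
    {
      "kind": "PrometheusRule",
      "spec": {
        "groups": [
          {
            "name": "example.rules",
            "rules": [
              {
                "alert": "HAProxyDown",
                "annotations": {
                  "message": "HAProxy metrics are reporting that HAProxy is down on pod {{ $labels.namespace }} / {{ $labels.pod }}"
                },
                "expr": "haproxy_up == 0",
                "for": "5m",
                "labels": {"severity": "critical"}
              }
            ]
          }
        ]
      }
    }
  ]
}
""")

rule = find_alert(sample, "HAProxyDown")
if rule is not None:
    print(rule["annotations"]["message"])
```

To run it against a cluster, replace the embedded sample with the JSON read from standard input or a file. A missing alert returns `None` rather than raising, so a build that lacks the rule is easy to tell apart from one that has the old message.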

Comment 9 errata-xmlrpc 2021-07-27 22:54:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

