Bug 1941592

Summary: HAProxyDown not Firing
Product: OpenShift Container Platform Reporter: Apurva Nisal <anisal>
Component: Networking Assignee: Stephen Greene <sgreene>
Networking sub component: router QA Contact: Hongan Li <hongli>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: alegrand, amcdermo, anpicker, aos-bugs, erooth, jechen, juzhao, kakkoyun, lcosic, mjoseph, pkrupa, sgreene, surbania
Version: 4.6   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The HAProxyDown alert message was vague.
Consequence: End users thought the HAProxyDown alert meant that the router pods were not available, when it refers specifically to HAProxy being down.
Fix: The HAProxyDown alert message is now more detailed.
Last Closed: 2021-07-27 22:54:36 UTC Type: Bug

Description Apurva Nisal 2021-03-22 12:56:44 UTC
Description of problem:

The HAProxyDown alert does not fire when all router pods (-n openshift-ingress) are down, or when all nodes on which router pods are scheduled are down.

Version-Release number of selected component (if applicable):
RHOCP 4.6

Actual results:
HAProxyDown is not firing.

Expected results:
HAProxyDown should be firing.

Comment 2 Andrew McDermott 2021-03-23 18:04:44 UTC
The HAProxyDown alert fires when haproxy is down, not when there are no openshift router pods running.
We will fix the message so that it reports that "haproxy is down" to avoid confusion.

ClusterOperatorDegraded and ClusterOperatorDown alerts should fire if no router pods are scheduled or running.

For example:

https://github.com/openshift/cluster-version-operator/blob/master/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L73-L88
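
To illustrate the distinction above: an expression such as `haproxy_up == 0` only matches series that exist and report the value 0. When no router pods are running, no `haproxy_up` series is scraped at all, so the comparison returns nothing and the alert stays silent. The rule below is a minimal sketch of how that case could be detected with the PromQL `absent()` function. It is not part of cluster-ingress-operator; the alert name, message, duration, and severity are hypothetical and chosen only for illustration.

```yaml
      rules:
      # Hypothetical rule for illustration only; not shipped by
      # cluster-ingress-operator. Name, duration, and severity are assumptions.
      - alert: HAProxyMetricsAbsent
        annotations:
          message: No haproxy_up series are being scraped, so router pods may not be running
        # haproxy_up == 0 matches only existing series whose value is 0.
        # With no router pods there are no series, so that expression is empty.
        # absent() returns a single sample with value 1 in exactly that case.
        expr: absent(haproxy_up)
        for: 5m
        labels:
          severity: warning
```

Because `absent()` has no per-pod series to draw labels from, a rule like this cannot name a specific pod. That is one reason operator-level alerts such as ClusterOperatorDown are better suited to the "no router pods" case.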

Comment 3 Stephen Greene 2021-03-31 19:31:05 UTC
I will work on this bug during the 4.8 bug fix phase.

Comment 5 jechen 2021-04-21 19:32:28 UTC
Attempted to verify in 4.8.0-0.nightly-2021-04-21-084059. Pull #597 is listed in the release status for this build, but the Prometheus rule definition still uses the old description: "HAProxy metrics are reporting that the router is down". I suspect pull #597 is not in this build and will wait for the next build to verify.

Comment 6 jechen 2021-04-21 23:46:43 UTC
Verified https://github.com/openshift/cluster-ingress-operator/pull/597 in 4.8.0-0.nightly-2021-04-21-172405.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-21-172405   True        False         42m     Cluster version is 4.8.0-0.nightly-2021-04-21-172405



$ oc -n openshift-ingress-operator get PrometheusRule -oyaml
<--snip-->
      rules:
      - alert: HAProxyReloadFail
        annotations:
          message: HAProxy reloads are failing on {{ $labels.pod }}. Router is not respecting recently created or modified routes
        expr: template_router_reload_failure == 1
        for: 5m
        labels:
          severity: warning
      - alert: HAProxyDown
        annotations:
          message: HAProxy metrics are reporting that HAProxy is down on pod {{ $labels.namespace }} / {{ $labels.pod }}    # <-- verified https://github.com/openshift/cluster-ingress-operator/pull/597/
        expr: haproxy_up == 0
        for: 5m
        labels:
          severity: critical

Comment 9 errata-xmlrpc 2021-07-27 22:54:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438