Bug 1888546 - Haproxy template reloading errors trigger prometheus alert
Summary: Haproxy template reloading errors trigger prometheus alert
Keywords:
Status: CLOSED DUPLICATE of bug 1885688
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Andrew McDermott
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-10-15 07:35 UTC by Felipe M
Modified: 2024-03-25 16:43 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-15 13:09:47 UTC
Target Upstream Version:
Embargoed:


Attachments
Prometheus graph for the metric on the default alert (239.31 KB, image/png)
2020-10-15 07:35 UTC, Felipe M

Description Felipe M 2020-10-15 07:35:27 UTC
Created attachment 1721757
Prometheus graph for the metric on the default alert

Description of problem:
A known router template reload error triggers Prometheus alerts.

```
E1008 10:38:15.642166       1 limiter.go:165] error reloading router: wait: no child processes
```

I know this is a known router error, but the only related report I found is a 3.6 Bugzilla, which states that it won't be fixed because it is a complex race condition that usually has no impact on the cluster.
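
For context, `no child processes` is the text of the ECHILD errno, which a wait call returns when the process it is asked to wait for has already been reaped (or is not a child of the caller). The minimal Python sketch below reproduces that general condition by waiting twice on the same child. It is an illustration only, not the router's code: the idea that the router reaches ECHILD through a similar double reap is an assumption that this bug does not confirm, and the function name `wait_twice` is hypothetical.

```python
import errno
import os


def wait_twice():
    """Fork a child, reap it, then try to reap it again.

    Returns the errno raised by the second wait, or None if it succeeded.
    Hypothetical helper for illustration; not taken from the router.
    """
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately without running cleanup handlers.
        os._exit(0)

    # First wait succeeds and reaps the child.
    os.waitpid(pid, 0)

    try:
        # Second wait finds no child left to reap.
        os.waitpid(pid, 0)
    except ChildProcessError as exc:
        return exc.errno
    return None


if __name__ == "__main__":
    code = wait_twice()
    # Compare against ECHILD and show the platform's message for it.
    print(code == errno.ECHILD, os.strerror(code))
```

If a race of this kind is what happens, the reload itself may well have completed even though the wait reported an error, which would be consistent with the 3.6 Bugzilla's note that the error usually has no impact.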

The customer case I'm handling, however, has a default alert firing because these errors are too frequent, with about 90 routes. [attached screenshot]

I'm unsure how common this is or whether there is any work underway to fix it (the 3.6 Bugzilla is quite old). More information is welcome.

As a workaround, I've recommended that the customer silence the default alert and create a new one tuned to how often the error currently occurs in their cluster.

Version-Release number of selected component (if applicable):
SOURCE_GIT_TAG=4.0.0-143-ge3b9390
BUILD_VERSION=v4.5.0

How reproducible:
See description and attached screenshot.


Additional info:
- 3.6 bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1442904
- Prometheus alert resource: https://github.com/openshift/cluster-ingress-operator/blob/8aa1ce2f3fc2384f7c1688b8cc16d599f9ac89ea/manifests/0000_90_ingress-operator_03_prometheusrules.yaml

Comment 1 Miciah Dashiel Butler Masters 2020-10-15 13:09:47 UTC
We have fixed the issue in 4.6 as bug 1859134 and are in the process of backporting the fix to 4.5 as bug 1885688, so I am marking this bug as a duplicate of the latter.

*** This bug has been marked as a duplicate of bug 1885688 ***

