Bug 1670380 - Alertmanager triggers error when updating the image
Summary: Alertmanager triggers error when updating the image
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 3.11.z
Assignee: Pawel Krupa
QA Contact: Junqi Zhao
URL:
Whiteboard: groom
Depends On:
Blocks:
 
Reported: 2019-01-29 13:05 UTC by Vladislav Walek
Modified: 2020-04-13 09:23 UTC
CC List: 19 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-19 19:53:43 UTC
Target Upstream Version:




Links
GitHub kubernetes-monitoring/kubernetes-mixin pull 304 (closed): apply simple anomaly detection to KubeAPILatencyHigh alert (last updated 2021-02-16 04:31:49 UTC)
GitHub openshift/cluster-monitoring-operator pull 607 (closed): Port KubeAPILatencyHigh alert to 3.11 (last updated 2021-02-16 04:31:49 UTC)
Red Hat Knowledge Base (Solution) 4981991 (last updated 2020-04-13 09:23:47 UTC)
Red Hat Product Errata RHBA-2020:0402 (last updated 2020-02-19 19:53:57 UTC)

Internal Links: 1771342

Description Vladislav Walek 2019-01-29 13:05:31 UTC
Description of problem:

Alertmanager is triggering an alert when the imagestream is automatically synchronized with a new version of the image, showing an error like:
[FIRING:1] KubeAPILatencyHigh <node> kubernetes (https apiserver default openshift-monitoring/k8s 0.99 imagestreamimports namespace warning POST)

The alert is shown every 15 minutes.
How can it be disabled?

Version-Release number of selected component (if applicable):
OpenShift Container Platform

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 minden 2019-01-29 14:23:27 UTC
> the imagestream is automatically synchronized with a new version of the image

I am guessing that in this case the image stream updated the Kubernetes API Server image?

> The alert is shown every 15 minutes.

Once the image update process is done, the alert should resolve and thereby not fire anymore. For how long are you seeing the alert fire?


In general, even though I don't think it is applicable for this scenario, you can silence an alert in the Alertmanager UI for a specific period of time.
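
For example, a silence can also be created from the command line with amtool. This is only a sketch: the duration, comment, and Alertmanager URL are placeholders, and additional matchers can be copied from the alert's own labels to narrow the silence to the imagestreamimports case:

# Silence KubeAPILatencyHigh for 24 hours; add more matchers taken from the
# alert's labels to avoid silencing unrelated instances of the alert.
amtool silence add alertname=KubeAPILatencyHigh \
  --comment="imagestreamimports latency spikes during automatic image syncs" \
  --duration=24h \
  --alertmanager.url=http://localhost:9093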

Comment 2 Vladislav Walek 2019-02-04 12:21:17 UTC
(In reply to minden from comment #1)
> > the imagestream is automatically synchronized with a new version of the image
> I am guessing that in this case the image stream updated the Kubernetes API
> Server image?

Yes, the image is updated automatically by the cluster.

> > The alert is shown every 15 minutes.
> 
> Once the image update process is done, the alert should resolve and thereby
> not fire anymore. For how long are you seeing the alert fire?
> 
> In general, even though I don't think it is applicable for this scenario,
> you can silence an alert in the Alertmanager UI for a specific period of
> time.

Will check it.

Comment 3 Vladislav Walek 2019-02-12 10:48:20 UTC
Hello,

The customer has a good point about the alert: it fires every time the threshold is exceeded.
In their case, the log message shows:

I0204 12:04:59.840639       1 trace.go:76] Trace[209327862]: "Create /apis/image.openshift.io/v1/namespaces/<namespace>/imagestreamimports" (started: <date> 12:04:54.523708746 +0000 UTC m=+449966.882803763) (total time: 5.31691254s):
Trace[209327862]: [5.31667951s] [5.316515341s] Object stored in database

This shows that whenever the request takes longer than some limit, the alert pops up, indicating only that the call took longer.
However, the time can depend on many different variables.
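
(For illustration, one way to see how far the imagestreamimports latency actually sits above that limit is to query Prometheus directly. This is only a sketch: the Prometheus route is a placeholder, and the recording-rule and label names are assumptions based on the kubernetes-mixin rules of that era, so they may differ on a given 3.11 cluster.)

# Query the 99th percentile API request latency for imagestreamimports;
# metric and label names are assumed, not verified against this cluster.
curl -sG "https://<prometheus-route>/api/v1/query" \
  -H "Authorization: Bearer $(oc whoami -t)" \
  --data-urlencode 'query=cluster_quantile:apiserver_request_latencies:histogram_quantile{quantile="0.99",resource="imagestreamimports"}'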

Is the trace above actually a problem?
Isn't the limit too strict for the image import?

Thx

Comment 4 minden 2019-02-14 11:31:12 UTC
> I0204 12:04:59.840639       1 trace.go:76] Trace[209327862]: "Create /apis/image.openshift.io/v1/namespaces/<namespace>/imagestreamimports" (started: <date> 12:04:54.523708746 +0000 UTC m=+449966.882803763) (total time: 5.31691254s):
> Trace[209327862]: [5.31667951s] [5.316515341s] Object stored in database

I am a bit confused. This is not an Alertmanager log line. Where is this from?

> The customer has a good point about the alert: it fires every time the threshold is exceeded.

How often are they updating their API server images?

Comment 5 Frederic Branczyk 2019-02-19 16:46:29 UTC
Upfront, Prometheus generates/triggers alerts, Alertmanager just routes alerts.

I believe in this case we should simply exclude imagestreamimports from the general latency alert, as these calls are often expected to take much longer than 4s, which is the point at which we already page.

I can't say when we can get to this; in the meantime I recommend silencing the alert in Alertmanager, so you will not get notifications for it.
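
As a rough sketch of what that exclusion could look like (the metric name, labels, and threshold below approximate the upstream kubernetes-mixin rule of that era, not the exact change that was eventually merged via the linked pull requests):

# Hypothetical rule file; the only intended change is the
# resource!="imagestreamimports" matcher. Syntax-checked with promtool.
cat > kube-api-latency-high.yaml <<'EOF'
groups:
- name: kube-apiserver.rules
  rules:
  - alert: KubeAPILatencyHigh
    expr: >
      cluster_quantile:apiserver_request_latencies:histogram_quantile{job="apiserver",quantile="0.99",resource!="imagestreamimports"}
      > 1
    for: 10m
    labels:
      severity: warning
EOF
promtool check rules kube-api-latency-high.yaml

In practice this rule is managed by the cluster-monitoring-operator, so the real fix was made upstream (see the linked pull requests) rather than by editing a rule file by hand.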

Comment 6 Vladislav Walek 2019-02-21 09:21:48 UTC
(In reply to minden from comment #4)
> > I0204 12:04:59.840639       1 trace.go:76] Trace[209327862]: "Create /apis/image.openshift.io/v1/namespaces/<namespace>/imagestreamimports" (started: <date> 12:04:54.523708746 +0000 UTC m=+449966.882803763) (total time: 5.31691254s):
> > Trace[209327862]: [5.31667951s] [5.316515341s] Object stored in database
> 
> I am a bit confused. This is not an Alertmanager log line. Where is this
> from?

This is from the master logs; when the alert fired, they found the corresponding log line in the master logs.
The alert is triggered because the request took more than 5s.

> 
> > The customer has a good point about the alert: it fires every time the threshold is exceeded.
> 
> How often are they updating their API server images?

It is done automatically by OpenShift. They say it happens once per 15 minutes.

(In reply to Frederic Branczyk from comment #5)
> Upfront, Prometheus generates/triggers alerts, Alertmanager just routes
> alerts.
> 
> I believe in this case we should simply exclude imagestreamimports from the
> general latency alert, as these calls are often expected to take much longer
> than 4s, which is the point at which we already page.
> 
> I can't say when we can get to this; in the meantime I recommend silencing
> the alert in Alertmanager, so you will not get notifications for it.

I agree, this can be silenced.
Thx

Comment 18 trumbaut 2019-11-20 11:14:20 UTC
For your reference: the same issue is being described at https://github.com/openshift/origin/issues/21508.

Comment 26 Junqi Zhao 2020-02-04 07:37:20 UTC
The issue is fixed with cluster-monitoring-operator:v3.11.169

Comment 28 errata-xmlrpc 2020-02-19 19:53:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0402

