Description of problem:
Alertmanager sends an alert whenever an image stream is automatically synchronized with a new version of an image. The notification looks like this:
[FIRING:1] KubeAPILatencyHigh <node> kubernetes (https apiserver default openshift-monitoring/k8s 0.99 imagestreamimports namespace warning POST)
The alert is sent every 15 minutes. How can it be disabled?
Version-Release number of selected component (if applicable):
OpenShift Container Platform
> the imagestream is automatically synchronized with new version of image
I am guessing that in this case the image stream updated the Kubernetes API Server image?
> The alert is shown every 15 minutes.
Once the image update process is done, the alert should resolve and thereby not fire anymore. For how long are you seeing the alert fire?
In general, even though I don't think it is applicable for this scenario, you can silence an alert in the Alertmanager UI for a specific period of time.
(In reply to minden from comment #1)
> > the imagestream is automatically synchronized with new version of image
> I am guessing that in this case the image stream updated the Kubernetes API Server image?
Yes, the image is updated automatically by the cluster.
> > The alert is shown every 15 minutes.
> Once the image update process is done, the alert should resolve and thereby not fire anymore. For how long are you seeing the alert fire?
> In general, even though I don't think it is applicable for this scenario, you can silence an alert in the Alertmanager UI for a specific period of time.
Will check it.
Hello, the customer has a good point about the alert: it fires every time the threshold is exceeded. In their case, the log shows:
I0204 12:04:59.840639 1 trace.go:76] Trace[209327862]: "Create /apis/image.openshift.io/v1/namespaces/<namespace>/imagestreamimports" (started: <date> 12:04:54.523708746 +0000 UTC m=+449966.882803763) (total time: 5.31691254s):
Trace[209327862]: [5.31667951s] [5.316515341s] Object stored in database
So the alert fires whenever the request takes longer than some limit; it only tells us that the call was slow. However, the duration can depend on many different variables. Is the message above a problem? Isn't the limit too strict for image imports? Thx
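To check how often such calls exceed the limit, the total time can be pulled out of trace lines like the one above. This is a minimal sketch, not a supported tool: the regular expression is written against the single sample line quoted here, the names `parse_trace` and `PAGE_THRESHOLD_SECONDS` are made up for illustration, and the 4-second value is the paging threshold quoted later in this thread.

```python
import re

# Assumption: 4s is the paging threshold mentioned later in this thread.
PAGE_THRESHOLD_SECONDS = 4.0

# Matches the quoted '"<verb> <path>"' part and the '(total time: N.NNs)' part.
# Only durations printed in seconds are handled; anything else returns None.
TRACE_RE = re.compile(
    r'"(?P<verb>\w+) (?P<path>\S+)".*?\(total time: (?P<seconds>[0-9.]+)s\)'
)

SAMPLE_LINE = (
    'I0204 12:04:59.840639 1 trace.go:76] Trace[209327862]: '
    '"Create /apis/image.openshift.io/v1/namespaces/<namespace>/imagestreamimports" '
    '(started: <date> 12:04:54.523708746 +0000 UTC m=+449966.882803763) '
    '(total time: 5.31691254s):'
)


def parse_trace(line):
    """Return (verb, path, seconds) for a trace line, or None if it does not match."""
    match = TRACE_RE.search(line)
    if match is None:
        return None
    return match.group("verb"), match.group("path"), float(match.group("seconds"))


if __name__ == "__main__":
    parsed = parse_trace(SAMPLE_LINE)
    if parsed is not None:
        verb, path, seconds = parsed
        slow = seconds > PAGE_THRESHOLD_SECONDS
        print(f"{verb} {path}: {seconds:.2f}s, over threshold: {slow}")
```

Run against the master logs, this would show whether the slow requests are limited to imagestreamimports or affect other resources as well.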
> I0204 12:04:59.840639 1 trace.go:76] Trace[209327862]: "Create /apis/image.openshift.io/v1/namespaces/<namespace>/imagestreamimports" (started: <date> 12:04:54.523708746 +0000 UTC m=+449966.882803763) (total time: 5.31691254s):
> Trace[209327862]: [5.31667951s] [5.316515341s] Object stored in database
I am a bit confused. This is not an Alertmanager log line. Where is this from?
> customer has good point about the alert. So the alert is showing all the time when the threshold is exceeded.
How often are they updating their API server images?
Upfront: Prometheus generates and triggers alerts; Alertmanager just routes them. I believe in this case we should simply exclude imagestreamimports from the general latency alert, as these calls are often expected to take much longer than 4s, at which point we already page. I can't say when we can get to this. In the meantime I recommend silencing the alert in Alertmanager; then you will not get notifications for it.
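For reference, a silence can also be prepared as a JSON document instead of being clicked together in the UI. The sketch below only builds the payload and does not send it. It rests on several assumptions: the field names (`matchers`, `startsAt`, `endsAt`, `createdBy`, `comment`) follow the Alertmanager silences API as I understand it and should be verified against the deployed version, the label name `resource` is assumed from the alert text in the description, and `build_silence` is a hypothetical helper.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(hours, created_by, comment, now=None):
    """Build a silence payload for KubeAPILatencyHigh on imagestreamimports.

    Assumption: the alert carries a `resource` label with the value
    `imagestreamimports`; check the labels of the firing alert first.
    """
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            {"name": "alertname", "value": "KubeAPILatencyHigh", "isRegex": False},
            {"name": "resource", "value": "imagestreamimports", "isRegex": False},
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


if __name__ == "__main__":
    payload = build_silence(24, "admin", "imagestreamimports latency is expected")
    print(json.dumps(payload, indent=2))
```

Matching on the resource label as well as the alert name keeps the silence narrow, so latency alerts for other API resources still reach the receiver.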
(In reply to minden from comment #4)
> > I0204 12:04:59.840639 1 trace.go:76] Trace[209327862]: "Create /apis/image.openshift.io/v1/namespaces/<namespace>/imagestreamimports" (started: <date> 12:04:54.523708746 +0000 UTC m=+449966.882803763) (total time: 5.31691254s):
> > Trace[209327862]: [5.31667951s] [5.316515341s] Object stored in database
> I am a bit confused. This is not an Alertmanager log line. Where is this from?
This is from the master logs. The alert was shown, so they looked up the matching line in the master logs. The alert is triggered because the request took more than 5s.
> > customer has good point about the alert. So the alert is showing all the time when the threshold is exceeded.
> How often are they updating their API server images?
It is done automatically by OpenShift. They say it happens once every 15 minutes.
(In reply to Frederic Branczyk from comment #5)
> Upfront, Prometheus generates/triggers alerts, Alertmanager just routes alerts.
> I believe in this case we should just ignore imagestreamimports from the general latency alert, as these calls are often expected to take much longer than 4s, at which point we already page.
> I can't say when we can get to this, in the mean time I recommend silencing the alert in Alertmanager, then you will not get notifications because for them.
I agree, this can be silenced. Thx
For your reference: the same issue is being described at https://github.com/openshift/origin/issues/21508.
Issue is fixed with cluster-monitoring-operator:v3.11.169
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0402