Bug 2172588 - Prometheus error "duplicate sample for timestamp" on disk ops alerts
Summary: Prometheus error "duplicate sample for timestamp" on disk ops alerts
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Service Telemetry Framework
Classification: Red Hat
Component: alerting
Version: 1.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z2
Target Release: 1.5 (STF)
Assignee: Nobody
QA Contact: Leonid Natapov
Docs Contact: mgeary
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-02-22 15:50 UTC by Chris Sibbitt
Modified: 2023-07-07 18:47 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-08 13:50:50 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker STF-1262 0 None None None 2023-02-22 15:52:22 UTC

Description Chris Sibbitt 2023-02-22 15:50:28 UTC
Description of problem:

We are seeing the following errors in the prometheus log:

ts=2023-02-22T15:18:22.189Z caller=manager.go:715 level=warn component="rule manager" file=/etc/prometheus/rules/prometheus-default-rulefiles-0/service-telemetry-prometheus-alarm-rules.yaml group=./openstack.rules name=job:disk:time:read:rate_5m index=18 msg="Error on ingesting results from rule evaluation with different value but same timestamp" numDropped=5

ts=2023-02-22T15:18:22.199Z caller=manager.go:715 level=warn component="rule manager" file=/etc/prometheus/rules/prometheus-default-rulefiles-0/service-telemetry-prometheus-alarm-rules.yaml group=./openstack.rules name=job:disk:time:write:rate_5m index=23 msg="Error on ingesting results from rule evaluation with different value but same timestamp" numDropped=37
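To see which recording rules are affected and how many samples each one drops, the warnings above can be tallied with a short script. This is a minimal sketch, assuming the log lines keep the key=value layout shown above; the function names parse_fields and dropped_by_rule are illustrative and not part of Prometheus or STF.

```python
import shlex


def parse_fields(line):
    """Split one key=value log line into a dict.

    shlex keeps quoted values such as component="rule manager" together
    as a single token, so values containing spaces survive intact.
    """
    return dict(tok.split("=", 1) for tok in shlex.split(line) if "=" in tok)


def dropped_by_rule(lines):
    """Sum numDropped per rule name across the given log lines."""
    totals = {}
    for line in lines:
        fields = parse_fields(line)
        # Only rule manager warnings carry both of these fields.
        if "name" in fields and "numDropped" in fields:
            name = fields["name"]
            totals[name] = totals.get(name, 0) + int(fields["numDropped"])
    return totals
```

Feeding the Prometheus log through dropped_by_rule gives one total per rule name, which makes it easy to confirm whether only the two disk time records are involved.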


Version-Release number of selected component (if applicable):
stable-1.5

How reproducible:
Always

Steps to Reproduce:
Deploy alerts.yaml from the repo and check prometheus logs or "Rules" screen in the UI

Actual results:
ERR: duplicate sample for timestamp

Expected results:
OK


Additional info:
The cause is a series of typos in the alert rules that result in duplicate recording rules:

record: 'job:disk:time:read:rate_5m'
* Correct: https://github.com/infrawatch/service-telemetry-operator/blob/37dceed7e55856820c10fb812da0ed9cd6551a3b/deploy/alerts/alerts.yaml#L45
* Should be 'job:disk:ops:read:rate_5m' https://github.com/infrawatch/service-telemetry-operator/blob/37dceed7e55856820c10fb812da0ed9cd6551a3b/deploy/alerts/alerts.yaml#L91

record: 'job:disk:time:write:rate_5m'
* Correct: https://github.com/infrawatch/service-telemetry-operator/blob/37dceed7e55856820c10fb812da0ed9cd6551a3b/deploy/alerts/alerts.yaml#L68
* Should be 'job:disk:ops:write:rate_5m' https://github.com/infrawatch/service-telemetry-operator/blob/37dceed7e55856820c10fb812da0ed9cd6551a3b/deploy/alerts/alerts.yaml#L114

Associated lines also need to be adjusted.

This bug likely leaves the "disk ops" alarms non-functional; I think they would instead behave as duplicates of the "IO" alarms.
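
A check along the following lines could catch this class of typo before the rules are deployed. It is only a sketch: it scans the rules file as plain text for repeated record: names rather than parsing the YAML, and the function name find_duplicate_records is hypothetical. A repeated name is a candidate for review, not proof of an error, since the same record name can legitimately appear twice when the expressions yield different label sets.

```python
import re
from collections import defaultdict

# Matches lines such as "- record: 'job:disk:time:read:rate_5m'" (quotes optional).
RECORD_RE = re.compile(r"""^\s*(?:-\s*)?record:\s*['"]?([^'"\s#]+)['"]?""")


def find_duplicate_records(rules_text):
    """Return {record name: [1-based line numbers]} for names defined more than once."""
    seen = defaultdict(list)
    for lineno, line in enumerate(rules_text.splitlines(), start=1):
        match = RECORD_RE.match(line)
        if match:
            seen[match.group(1)].append(lineno)
    return {name: lines for name, lines in seen.items() if len(lines) > 1}
```

Run against the contents of alerts.yaml at the commit linked above, this should report 'job:disk:time:read:rate_5m' and 'job:disk:time:write:rate_5m' with two line numbers each.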

Comment 1 Leif Madsen 2023-03-27 18:24:47 UTC
Re-targeting this for STF 1.5.2 as it slipped for STF 1.5.1 release.

