Bug 2172588

Summary: Prometheus error "duplicate sample for timestamp" on disk ops alerts
Product: Service Telemetry Framework
Reporter: Chris Sibbitt <csibbitt>
Component: alerting
Assignee: Nobody <nobody>
Status: CLOSED CURRENTRELEASE
QA Contact: Leonid Natapov <lnatapov>
Severity: medium
Docs Contact: mgeary <mgeary>
Priority: medium
Version: 1.5
CC: lmadsen, mrunge, parthee
Target Milestone: z2
Keywords: Triaged
Target Release: 1.5 (STF)
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2023-06-08 13:50:50 UTC
Type: Bug

Description Chris Sibbitt 2023-02-22 15:50:28 UTC
Description of problem:

We are seeing the following errors in the Prometheus log:

ts=2023-02-22T15:18:22.189Z caller=manager.go:715 level=warn component="rule manager" file=/etc/prometheus/rules/prometheus-default-rulefiles-0/service-telemetry-prometheus-alarm-rules.yaml group=./openstack.rules name=job:disk:time:read:rate_5m index=18 msg="Error on ingesting results from rule evaluation with different value but same timestamp" numDropped=5

ts=2023-02-22T15:18:22.199Z caller=manager.go:715 level=warn component="rule manager" file=/etc/prometheus/rules/prometheus-default-rulefiles-0/service-telemetry-prometheus-alarm-rules.yaml group=./openstack.rules name=job:disk:time:write:rate_5m index=23 msg="Error on ingesting results from rule evaluation with different value but same timestamp" numDropped=37
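
To see which recording rules are affected and how many samples each one drops, warnings like the two above can be tallied with a short script. This is a minimal sketch, assuming the log lines are in the logfmt style shown here; `parse_logfmt` and `dropped_by_rule` are illustrative names, not part of Prometheus, and the sample lines are abbreviated copies of the ones in this report.

```python
import shlex
from collections import Counter


def parse_logfmt(line):
    """Split one logfmt line into a dict of key -> value.

    shlex handles the double-quoted values (component, msg). This is a
    simplification: messages containing unbalanced quotes would need a
    real logfmt parser.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def dropped_by_rule(lines):
    """Sum numDropped per rule name for rule-manager warnings."""
    totals = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("component") == "rule manager" and "numDropped" in fields:
            totals[fields.get("name", "<unknown>")] += int(fields["numDropped"])
    return totals


# Abbreviated versions of the two log lines in this report.
sample = [
    'ts=2023-02-22T15:18:22.189Z level=warn component="rule manager" '
    'name=job:disk:time:read:rate_5m index=18 '
    'msg="Error on ingesting results from rule evaluation with different value but same timestamp" '
    'numDropped=5',
    'ts=2023-02-22T15:18:22.199Z level=warn component="rule manager" '
    'name=job:disk:time:write:rate_5m index=23 '
    'msg="Error on ingesting results from rule evaluation with different value but same timestamp" '
    'numDropped=37',
]

for rule, dropped in sorted(dropped_by_rule(sample).items()):
    print(rule, dropped)
```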


Version-Release number of selected component (if applicable):
stable-1.5

How reproducible:
Always

Steps to Reproduce:
Deploy alerts.yaml from the repo, then check the Prometheus logs or the "Rules" screen in the UI.

Actual results:
ERR: duplicate sample for timestamp

Expected results:
OK


Additional info:
The cause is a series of typos in the alert rules that result in duplicate recording rules:

record: 'job:disk:time:read:rate_5m'
* Correct: https://github.com/infrawatch/service-telemetry-operator/blob/37dceed7e55856820c10fb812da0ed9cd6551a3b/deploy/alerts/alerts.yaml#L45
* Should be 'job:disk:ops:read:rate_5m': https://github.com/infrawatch/service-telemetry-operator/blob/37dceed7e55856820c10fb812da0ed9cd6551a3b/deploy/alerts/alerts.yaml#L91

record: 'job:disk:time:write:rate_5m'
* Correct: https://github.com/infrawatch/service-telemetry-operator/blob/37dceed7e55856820c10fb812da0ed9cd6551a3b/deploy/alerts/alerts.yaml#L68
* Should be 'job:disk:ops:write:rate_5m': https://github.com/infrawatch/service-telemetry-operator/blob/37dceed7e55856820c10fb812da0ed9cd6551a3b/deploy/alerts/alerts.yaml#L114

Associated lines also need to be adjusted.
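
A check along these lines could catch this class of typo before the rules are deployed. This is a minimal sketch, assuming each recording rule is declared on its own `record:` line as in alerts.yaml; `find_duplicate_records` is a hypothetical helper, and the sample text uses placeholder expressions rather than the real ones from the repo.

```python
import re
from collections import Counter

# Matches lines such as:  - record: 'job:disk:time:read:rate_5m'
RECORD_RE = re.compile(r"""^\s*-?\s*record:\s*['"]?([^'"\s#]+)""")


def find_duplicate_records(yaml_text):
    """Return the sorted record names that appear more than once."""
    names = []
    for line in yaml_text.splitlines():
        match = RECORD_RE.match(line)
        if match:
            names.append(match.group(1))
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


# Placeholder rules (assumed shape, not copied from the repo). The third
# rule reproduces the typo described in this report.
sample = """
- record: 'job:disk:time:read:rate_5m'
  expr: placeholder_read_time_expr
- record: 'job:disk:time:write:rate_5m'
  expr: placeholder_write_time_expr
- record: 'job:disk:time:read:rate_5m'
  expr: placeholder_read_ops_expr
- record: 'job:disk:ops:write:rate_5m'
  expr: placeholder_write_ops_expr
"""

print(find_duplicate_records(sample))
```

A repeated record name is not always an error, since Prometheus accepts rules that share a name as long as the resulting series differ in their labels. A hit from this check is therefore a prompt for review rather than proof of a bug.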

This bug likely causes the "disk ops" alarms to be non-functional; I think they would instead function as duplicates of the "IO" alarms.

Comment 1 Leif Madsen 2023-03-27 18:24:47 UTC
Re-targeting this for STF 1.5.2, as it slipped from the STF 1.5.1 release.