Bug 2304076

Summary:	duplicate metrics being produced
Product:	[Red Hat Storage] Red Hat OpenShift Data Foundation	Reporter:	iwatson
Component:	Multi-Cloud Object Gateway	Assignee:	Aayush Chouhan <achouhan>
Status:	ASSIGNED ---	QA Contact:	Sagi Hirshfeld <shirshfe>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.16	CC:	aakobi, abhishku, achouhan, amark, dosypenk, ebenahar, etamir, jdelaros, jeremy.coulombe, julien.prost, kelwhite, lmauda, mmanjuna, muagarwa, nachahua, nbecker, nimrody, nthomas, odf-bz-bot, sbiradar, spasquie, ssonigra, yhuang
Target Milestone:	---
Target Release:	ODF 4.18.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	4.17.0-117	Doc Type:	Bug Fix
Doc Text:	Cause: A Prometheus client dependency upgrade caused process-level metrics (resident memory, VSS, heap, etc.) to be collected whenever custom metrics are collected. Consequence: NooBaa reported duplicate metrics, because it collected process-level metrics explicitly in addition to the custom metrics. These duplicates triggered an alert in Prometheus. Fix: Remove the explicit process-level collection, as those metrics are now collected together with the custom metrics. Result: Metrics are no longer reported in duplicate, and the Prometheus alert no longer fires.	Story Points:	---
Clone Of:
Clones:	2321231 2322896 (view as bug list)		Environment:
Last Closed:		Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2321231, 2322896

Description iwatson 2024-08-12 08:37:08 UTC

The following alert is firing: PrometheusDuplicateTimestamps
 
Turning on debug logs indicates that noobaa-mgmt-service-monitor and s3-service-monitor are the source of the issue:

ts=2024-08-09T12:35:11.506Z caller=scrape.go:1777 level=debug component="scrape manager" scrape_pool=serviceMonitor/openshift-storage/noobaa-mgmt-service-monitor/0 target=http://10.129.2.15:8080/metrics/web_server msg="Duplicate sample for timestamp" series=NooBaa_health_status


This can be confirmed by manually curling the metrics endpoint:

oc exec prometheus-k8s-1 -- curl "http://10.129.2.15:8080/metrics/web_server" > metrics.txt 
 
Searching the output for 'NooBaa_health_status 0', as an example, shows that the endpoint is indeed returning duplicate metrics.
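
One way to check the saved output for duplicates is to count how often each series (metric name plus label set) appears. The following is a minimal Python sketch, not part of NooBaa or Prometheus. The function name find_duplicate_series is hypothetical, and the sketch assumes the output is in the Prometheus text exposition format with no explicit timestamp after the sample value.

```python
from collections import Counter


def find_duplicate_series(text):
    """Return {series: count} for every series that appears more than once.

    Assumption: each sample line looks like '<name>{<labels>} <value>' and
    lines starting with '#' are HELP/TYPE comments.
    """
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Drop the trailing value; what remains identifies the series.
        series = line.rsplit(None, 1)[0]
        counts[series] += 1
    return {series: n for series, n in counts.items() if n > 1}


# Illustrative input only; real input would be the contents of metrics.txt.
sample = """\
# HELP NooBaa_health_status Health status
# TYPE NooBaa_health_status gauge
NooBaa_health_status 0
NooBaa_health_status 0
"""
duplicates = find_duplicate_series(sample)
for series, n in sorted(duplicates.items()):
    print(f"{series} appears {n} times")
```

To run it against the real scrape, read metrics.txt into a string and pass it to the function in place of the sample text.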
 
Upon investigation, this is caused by the following block:
https://github.com/noobaa/noobaa-core/blob/ad73e9cb3bd483f6f34de9a28a9f4ba3ea060eb3/src/server/analytic_services/prometheus_reporting.js#L44
 
If I call /metrics/web_server/nodejs and /metrics/web_server/core separately, they return the same results.
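
The overlap between the two sub-endpoints can be checked by saving each response and intersecting the sets of series they expose. This is a rough Python sketch under the same assumption as before (Prometheus text exposition format, no trailing timestamps). The helper names series_names and overlapping_series are hypothetical, and the sample bodies are illustrative, not real NooBaa output.

```python
def series_names(text):
    """Collect the series identifiers (name plus labels) from a scrape body."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            # Everything before the trailing value identifies the series.
            names.add(line.rsplit(None, 1)[0])
    return names


def overlapping_series(body_a, body_b):
    """Return the sorted series that appear in both scrape bodies."""
    return sorted(series_names(body_a) & series_names(body_b))


# Illustrative bodies only; real input would be the two saved curl outputs.
nodejs_body = "NooBaa_health_status 0\nexample_metric_a 1\n"
core_body = "NooBaa_health_status 0\nexample_metric_b 2\n"
print(overlapping_series(nodejs_body, core_body))
```

Any series reported in the intersection would be emitted twice if both bodies end up in a single scrape response.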
 
So the solution is either to alter the code above or to change the service monitor endpoints to append /nodejs to the end of the path.
 
Such as:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: noobaa-mgmt-service-monitor
  labels:
    app: noobaa
spec:
  endpoints:
  - port: mgmt
    path: /metrics/web_server/nodejs
  - port: mgmt
    path: /metrics/bg_workers
  - port: mgmt
    path: /metrics/hosted_agents
  namespaceSelector: {}
  selector:
    matchLabels:
      noobaa-mgmt-svc: "true"

Comment 3 Sonigra Saurab 2024-08-20 12:18:05 UTC

created KCS

Comment 6 Neyder Achahuanco Apaza 2024-08-20 19:42:56 UTC

Greetings. Due to this bug, changes to the cluster-monitoring-config configmap could not be processed, which is impacting new deployments that need to configure persistence.

Comment 12 Adebi Akobi 2024-09-23 12:59:55 UTC

Hello team,

Thank you for the information. I see the fix is approved and targeted for the 4.17 bug fix cycle. Is it possible to get a fix/patch in 4.16 while waiting for the 4.17 cycle? Please and thank you.

Adebi

Comment 13 Liran Mauda 2024-09-30 07:40:09 UTC

Hi,

Fixing the issue on an older version while it still exists on a newer version would be a regression.
We will fix it in 4.17 and consider including it in one of the 4.16.z releases.

Best Regards,
Liran.

Comment 15 Sunil Kumar Acharya 2024-10-08 13:17:11 UTC

Please update the RDT flag/text appropriately.