Bug 2304076

Summary: duplicate metrics being produced
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: iwatson
Component: Multi-Cloud Object Gateway
Assignee: Aayush Chouhan <achouhan>
Status: ASSIGNED ---
QA Contact: Sagi Hirshfeld <shirshfe>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.16
CC: aakobi, abhishku, achouhan, amark, dosypenk, ebenahar, etamir, jdelaros, jeremy.coulombe, julien.prost, kelwhite, lmauda, mmanjuna, muagarwa, nachahua, nbecker, nimrody, nthomas, odf-bz-bot, sbiradar, spasquie, ssonigra, yhuang
Target Milestone: ---   
Target Release: ODF 4.18.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.17.0-117
Doc Type: Bug Fix
Doc Text:
Cause: A Prometheus client dependency upgrade caused process-level metrics (resident memory, VSS, heap, etc.) to be collected whenever custom metrics are collected.
Consequence: NooBaa reported duplicate metrics because it collected process-level metrics separately in addition to the custom metrics. These duplicates triggered an alert in Prometheus.
Fix: Removed the separate process-level collection, since those metrics are now collected together with the custom metrics.
Result: Metrics are no longer reported in duplicate, and the Prometheus alert no longer fires.
Story Points: ---
Clone Of:
: 2321231 2322896 Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2321231, 2322896    

Description iwatson 2024-08-12 08:37:08 UTC
The following alert is firing: PrometheusDuplicateTimestamps
 
Turning on debug logs indicates that noobaa-mgmt-service-monitor and s3-service-monitor are the source:

ts=2024-08-09T12:35:11.506Z caller=scrape.go:1777 level=debug component="scrape manager" scrape_pool=serviceMonitor/openshift-storage/noobaa-mgmt-service-monitor/0 target=http://10.129.2.15:8080/metrics/web_server msg="Duplicate sample for timestamp" series=NooBaa_health_status
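
When many such debug lines are present, it can help to tally them per scrape pool and series. The following is a minimal sketch, not part of the original report; it assumes the Prometheus log is in logfmt form like the line above, and the function names are hypothetical.

```python
import shlex
from collections import Counter


def parse_logfmt(line: str) -> dict:
    """Parse one logfmt line (key=value pairs, values optionally double-quoted)."""
    fields = {}
    try:
        tokens = shlex.split(line)
    except ValueError:
        # Unbalanced quotes: treat the line as unparseable and skip it.
        return fields
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value
            fields[key] = value
    return fields


def duplicate_sample_counts(log_lines):
    """Count 'Duplicate sample for timestamp' messages per (scrape_pool, series)."""
    counts = Counter()
    for line in log_lines:
        fields = parse_logfmt(line)
        if fields.get("msg") == "Duplicate sample for timestamp":
            counts[(fields.get("scrape_pool"), fields.get("series"))] += 1
    return counts
```

Passing the lines of a saved Prometheus debug log to duplicate_sample_counts returns a Counter keyed by (scrape_pool, series), which makes it easy to see which service monitors are affected.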


This can be shown by manually curling the metrics

oc exec prometheus-k8s-1 -- curl "http://10.129.2.15:8080/metrics/web_server" > metrics.txt 
 
Searching the output for 'NooBaa_health_status 0', as an example, shows that the endpoint is indeed returning duplicate metrics.
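
Rather than searching for one metric by hand, the saved output can be checked for every repeated series. This is a rough sketch under the assumption that the file is in the Prometheus text exposition format (comment lines start with '#', one sample per line); the helper names are hypothetical.

```python
from collections import Counter


def series_key(line: str) -> str:
    """Return the metric name plus label set, i.e. everything before the sample value."""
    if "}" in line:
        # The last '}' closes the label set; only the value (and optional timestamp) follow it.
        return line[: line.rfind("}") + 1]
    return line.split()[0]


def find_duplicate_series(exposition_text: str) -> dict:
    """Map each series that appears more than once to its number of occurrences."""
    counts = Counter(
        series_key(line.strip())
        for line in exposition_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )
    return {key: n for key, n in counts.items() if n > 1}
```

For the file produced by the curl command above, calling find_duplicate_series(open("metrics.txt").read()) would list each series that the endpoint returned more than once.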
 
Upon investigation, this is caused by this block:
https://github.com/noobaa/noobaa-core/blob/ad73e9cb3bd483f6f34de9a28a9f4ba3ea060eb3/src/server/analytic_services/prometheus_reporting.js#L44
 
If I call /metrics/web_server/nodejs and /metrics/web_server/core separately, they return the same results.
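
To confirm the overlap between the two sub-paths, their outputs can be compared series by series. The sketch below is illustrative only: it assumes both responses were saved as text in the Prometheus exposition format, and the function names are hypothetical.

```python
def series_keys(exposition_text: str) -> set:
    """Collect every series (metric name plus label set) present in the text."""
    keys = set()
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            keys.add(line[: line.rfind("}") + 1])
        else:
            keys.add(line.split()[0])
    return keys


def overlapping_series(first: str, second: str) -> set:
    """Return the series that appear in both exposition texts."""
    return series_keys(first) & series_keys(second)
```

If the default /metrics/web_server path returns both reports together, as the linked block suggests, any series in this overlap would be emitted twice in a single scrape.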
 
So the solution is to either alter the code above or change the service monitor endpoints to append /nodejs to the path.

For example:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: noobaa-mgmt-service-monitor
  labels:
    app: noobaa
spec:
  endpoints:
  - port: mgmt
    path: /metrics/web_server/nodejs
  - port: mgmt
    path: /metrics/bg_workers
  - port: mgmt
    path: /metrics/hosted_agents
  namespaceSelector: {}
  selector:
    matchLabels:
      noobaa-mgmt-svc: "true"

Comment 3 Sonigra Saurab 2024-08-20 12:18:05 UTC
created KCS

Comment 6 Neyder Achahuanco Apaza 2024-08-20 19:42:56 UTC
Greetings. Due to this bug, changes to the cluster-monitoring-config ConfigMap cannot be processed, which is preventing new deployments from configuring persistence.

Comment 12 Adebi Akobi 2024-09-23 12:59:55 UTC
Hello team,

Thank you for the information. I see the fix is approved for the 4.17 bug fix cycle. Is it possible to get a fix or patch in 4.16 while waiting for the 4.17 cycle? Please and thank you.

Adebi

Comment 13 Liran Mauda 2024-09-30 07:40:09 UTC
Hi,

Fixing an issue in an older version while it is still present in a newer version would create a regression.
We will fix it in 4.17 and consider including it in one of the 4.16.z releases.

Best Regards,
Liran.

Comment 15 Sunil Kumar Acharya 2024-10-08 13:17:11 UTC
Please update the RDT flag/text appropriately.