Bug 1897352 - Prometheus unable to ingest metrics, operator logs indicate "Failed to list PodMonitor"
Summary: Prometheus unable to ingest metrics, operator logs indicate "Failed to list PodMonitor"
Keywords:
Status: CLOSED DUPLICATE of bug 1891815
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-12 20:50 UTC by Naveen Malik
Modified: 2020-11-18 22:55 UTC
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-13 08:35:32 UTC
Target Upstream Version:
Embargoed:



Description Naveen Malik 2020-11-12 20:50:51 UTC
Description of problem:
Two seemingly unrelated issues came up on OSD today where metrics that should have been in Prometheus were not showing up. We finally noticed that the prometheus operator was logging errors. We didn't capture a must-gather while the issues were happening; we will watch for a recurrence and grab one next time.

Version-Release number of selected component (if applicable):
4.5.16

How reproducible:
Unknown

Steps to Reproduce:
1. Unknown

Actual results:
Prometheus doesn't register new targets.
Prometheus doesn't update existing targets.

Expected results:
Prometheus registers new targets and updates existing targets.

Additional info:


E1112 20:40:04.648331       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:485: Failed to list *v1.ServiceMonitor: resourceVersion: Invalid value: "72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060": strconv.ParseUint: parsing "72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060/72124060": invalid syntax
E1112 20:40:37.584245       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/thanos/operator.go:322: Failed to list *v1.PrometheusRule: resourceVersion: Invalid value: "76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311": strconv.ParseUint: parsing "76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311/76179311": invalid syntax
E1112 20:40:47.349089       1 reflector.go:178] github.com/coreos/prometheus-operator/pkg/prometheus/operator.go:486: Failed to list *v1.PodMonitor: resourceVersion: Invalid value: "72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754": strconv.ParseUint: parsing "72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754/72127754": invalid syntax
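
For context, the "invalid syntax" comes from strconv.ParseUint rejecting the resourceVersion: the API machinery expects a single unsigned integer, but the value in the logs above is many copies of the same resourceVersion joined with "/". A minimal Go sketch (standard library only; an illustration, not the operator's code) reproduces the parse failure:

package main

import (
	"fmt"
	"strconv"
	"strings"
)

func main() {
	// A well-formed resourceVersion is a plain unsigned integer and parses cleanly.
	if _, err := strconv.ParseUint("72124060", 10, 64); err == nil {
		fmt.Println(`"72124060" parses as a valid resourceVersion`)
	}

	// The corrupted value from the operator logs: the same resourceVersion
	// repeated and joined with "/", which is not a valid unsigned integer.
	corrupted := strings.Repeat("72124060/", 45) + "72124060"
	if _, err := strconv.ParseUint(corrupted, 10, 64); err != nil {
		fmt.Println(err) // strconv.ParseUint: parsing "72124060/...": invalid syntax
	}
}

The value in the logs looks like it grows by one more copy of the resourceVersion on each retry, which would explain why every list call keeps failing until the operator pod is restarted.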

Comment 1 Naveen Malik 2020-11-12 20:53:49 UTC
On one cluster it was an SRE operator having problems registering a new ServiceMonitor.
On another cluster it was FluentdNodeDown with message "Prometheus could not scrape fluentd  for more than 10m."

For the fluentd cluster, the prom query `{job="fluentd"}` returned no results at all. On a healthy cluster with logging installed, that query returns 1400+ time series.

The workaround right now is to delete the prometheus-operator pod.
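
For anyone scripting that workaround, a rough client-go sketch is below. The openshift-monitoring namespace is standard for cluster monitoring, but the app.kubernetes.io/name=prometheus-operator label selector and the kubeconfig path are assumptions and may differ by version; deleting the pod by name with oc delete pod works just as well.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (assumed path: ~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Delete the prometheus-operator pod(s); the deployment recreates them and the
	// operator comes back with fresh informers. The label selector is an assumption
	// and may vary between OCP versions.
	err = client.CoreV1().Pods("openshift-monitoring").DeleteCollection(
		context.TODO(),
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "app.kubernetes.io/name=prometheus-operator"},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println("prometheus-operator pod deleted; waiting for the deployment to recreate it")
}

Once the new pod is up, the operator rebuilds its watches and should start picking up ServiceMonitor/PodMonitor changes again.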

Comment 3 Junqi Zhao 2020-11-13 01:39:06 UTC
It seems to be the same issue as bug 1891815.

Comment 4 Simon Pasquier 2020-11-13 08:35:32 UTC

*** This bug has been marked as a duplicate of bug 1891815 ***

