Bug 1805116 - Prometheus federation write is failing on a large scale cluster
Summary: Prometheus federation write is failing on a large scale cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.z
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1804574
Depends On: 1794616
Blocks:
 
Reported: 2020-02-20 09:58 UTC by Sergiusz Urbaniak
Modified: 2023-08-21 07:47 UTC (History)
8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-24 14:33:37 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 664 0 None closed Bug 1805116: Use aggregated up metrics 2021-01-10 11:59:21 UTC
Github openshift telemeter pull 309 0 None closed Bug 1805116: jsonnet/telemeter/metrics: add aggregated up metric, remove node_uname_info 2021-01-10 11:59:58 UTC
Red Hat Product Errata RHBA-2020:0858 0 None None None 2020-03-24 14:34:03 UTC

Description Sergiusz Urbaniak 2020-02-20 09:58:20 UTC
This bug was initially created as a copy of Bug #1794616

I am copying this bug because: 



Description of problem:
We observed a bunch of errors like 'caller=federate.go:184 component=web msg="federation failed" err="write tcp 127.0.0.1:9090->127.0.0.1:36894: write: broken pipe"' (full log attached) on a large-scale 4.3 cluster (800 nodes) built from the 4.3.0-rc.3 payload. Judging from the logs, increasing scrape_timeout and scrape_interval might avoid this at large node counts. We are not sure why federation is enabled by default.
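The mitigation suggested above (raising scrape_timeout and scrape_interval for the federation job) would look roughly like this in a plain Prometheus scrape config. This is a sketch only, not the configuration shipped by cluster-monitoring-operator; the job name, target, and timing values are hypothetical and illustrative:

```yaml
scrape_configs:
- job_name: federate            # hypothetical job name
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
    - '{__name__=~"count:up[01]"}'   # federate only aggregated series
  # Raising these beyond the defaults gives a large cluster more time
  # to serialize and write the federation response (values illustrative).
  scrape_interval: 2m
  scrape_timeout: 2m
  static_configs:
  - targets: ['prometheus-k8s.openshift-monitoring.svc:9090']
```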

Version-Release number of selected component (if applicable):
OpenShift build/payload version: 4.3.0-rc.3

How reproducible:
We had not seen this error before because we had not scaled a cluster to this many nodes.

Steps to Reproduce:
1. Build a cluster using 4.3.0-rc.3 payload.
2. Scale the cluster to a large node count.
3. Look at the logs of one of the Prometheus replicas.
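The log check in step 3 can be scripted. A minimal sketch: the helper below just greps for the failure message; in a live cluster you would pipe the replica logs through it (the `oc logs` invocation in the comment assumes the default `openshift-monitoring` pod and container names). Here it is demonstrated on the sample log line from this report:

```shell
# Hypothetical helper: surface federation write failures in Prometheus logs.
filter_federation_errors() {
  grep 'msg="federation failed"'
}

# Against a live cluster, something like:
#   oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus | filter_federation_errors
# Demonstrated here with the sample line from this report:
printf '%s\n' 'level=error caller=federate.go:184 component=web msg="federation failed" err="write tcp 127.0.0.1:9090->127.0.0.1:36894: write: broken pipe"' |
  filter_federation_errors
```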

Actual results:
caller=federate.go:184 component=web msg="federation failed" err="write tcp 127.0.0.1:9090->127.0.0.1:36894: write: broken pipe"

Expected results:
Metrics are scraped within the scrape_timeout interval.

Additional info:
Full logs are at http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/4.3/logs/prometheus_scrape/

Comment 1 Sergiusz Urbaniak 2020-02-24 08:42:50 UTC
*** Bug 1804574 has been marked as a duplicate of this bug. ***

Comment 4 Junqi Zhao 2020-03-10 04:21:19 UTC
Tested with 4.3.0-0.nightly-2020-03-09-200240: {__name__="up"} is removed, "count:up0" and "count:up1" are added to the telemeter-client deployment, and metrics can now be pushed to telemeter-server.
# oc -n openshift-monitoring get deploy telemeter-client -oyaml | grep match
        - --match={__name__="count:up0"}
        - --match={__name__="count:up1"}
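For reference, aggregated series like count:up0 and count:up1 are produced by Prometheus recording rules, so only two samples are shipped instead of one `up` series per target. A minimal sketch of what such rules could look like; the exact definitions live in the cluster-monitoring-operator pull request linked above and may differ (the group name here is hypothetical):

```yaml
groups:
- name: telemetry-aggregation   # hypothetical group name
  rules:
  # Count targets that are up and targets that are down,
  # instead of federating every per-target `up` series.
  - record: count:up1
    expr: count(up == 1)
  - record: count:up0
    expr: count(up == 0)
```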

Comment 6 errata-xmlrpc 2020-03-24 14:33:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0858

