Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1804574

Summary:	Prometheus federation write is failing on a large scale cluster
Product:	OpenShift Container Platform	Reporter:	Sergiusz Urbaniak <surbania>
Component:	Monitoring	Assignee:	Sergiusz Urbaniak <surbania>
Status:	CLOSED DUPLICATE	QA Contact:	Junqi Zhao <juzhao>
Severity:	low	Docs Contact:
Priority:	unspecified
Version:	4.3.0	CC:	alegrand, anpicker, erooth, kakkoyun, lcosic, mloibl, pkrupa, surbania
Target Milestone:	---
Target Release:	4.3.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-02-24 08:42:50 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1794616
Bug Blocks:

Description Sergiusz Urbaniak 2020-02-19 08:08:25 UTC

This bug was initially created as a copy of Bug #1794616

I am copying this bug because: 



Description of problem:
We observed a large number of errors of the form "caller=federate.go:184 component=web msg="federation failed" err="write tcp 127.0.0.1:9090->127.0.0.1:36894: write: broken pipe"" (full log attached) on a large-scale 4.3 cluster (800 nodes) built using the 4.3.0-rc.3 payload. Judging from the logs, increasing scrape_timeout and scrape_interval might help avoid this at large node counts. We are also not sure why federation is enabled by default.
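Prometheus rejects a configuration in which scrape_timeout is greater than scrape_interval, so the two settings would need to be raised together. The following is a minimal sketch of that check, assuming plain duration strings such as "30s" or "1m30s". The helper names are hypothetical and the values in the usage note are illustrative, not the settings of the affected cluster.

```python
import re

# Milliseconds per unit for the subset of Prometheus duration units handled here.
_UNIT_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000}
_PART = re.compile(r"(\d+)(ms|s|m|h)")
_WHOLE = re.compile(r"(?:\d+(?:ms|s|m|h))+")


def parse_duration_ms(text):
    """Convert a duration such as '30s' or '1m30s' to milliseconds."""
    if not _WHOLE.fullmatch(text):
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(n) * _UNIT_MS[unit] for n, unit in _PART.findall(text))


def timeout_fits_interval(scrape_timeout, scrape_interval):
    """Return True when scrape_timeout does not exceed scrape_interval."""
    return parse_duration_ms(scrape_timeout) <= parse_duration_ms(scrape_interval)
```

For example, timeout_fits_interval("30s", "1m") holds, while timeout_fits_interval("2m", "1m") does not, so raising only the timeout past the interval would not yield a valid configuration.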

Version-Release number of selected component (if applicable):
OpenShift build/payload version: 4.3.0-rc.3

How reproducible:
We had not seen this error before, because we had not previously scaled the cluster to a large number of nodes.

Steps to Reproduce:
1. Build a cluster using 4.3.0-rc.3 payload.
2. Scale the cluster to a large node count (800 nodes in the affected cluster).
3. Check the logs of one of the Prometheus replicas.

Actual results:
caller=federate.go:184 component=web msg="federation failed" err="write tcp 127.0.0.1:9090->127.0.0.1:36894: write: broken pipe"
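
A "broken pipe" on write generally means the peer closed the connection before the response was fully written. To gauge how often this happens in a replica's log, the entries can be tallied with a small script. This is a rough sketch, assuming the log lines are in the key=value (logfmt) form shown above; the function names are hypothetical.

```python
import re
from collections import Counter

# Matches key=value pairs where the value is either double-quoted or bare.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line):
    """Parse one logfmt line into a dict, stripping surrounding quotes."""
    fields = {}
    for key, value in _PAIR.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def count_federation_failures(lines):
    """Count 'federation failed' entries, grouped by the trailing error cause."""
    causes = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("msg") == "federation failed":
            # Keep only the text after the last ': ' so that differing
            # client ports do not split one cause into many groups.
            causes[fields.get("err", "").rsplit(": ", 1)[-1]] += 1
    return causes


sample = (
    'caller=federate.go:184 component=web msg="federation failed" '
    'err="write tcp 127.0.0.1:9090->127.0.0.1:36894: write: broken pipe"'
)
print(count_federation_failures([sample]))
```

Feeding the lines of a replica log to count_federation_failures gives a per-cause count, which makes it easier to compare the failure rate before and after any change to the scrape settings.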

Expected results:
Metrics are scraped within the scrape_timeout interval, without federation write errors.

Additional info:
Full logs are at http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/4.3/logs/prometheus_scrape/