This bug was initially created as a copy of Bug #1794616

I am copying this bug because:

Description of problem:
We observed a bunch of errors of the form "caller=federate.go:184 component=web msg="federation failed" err="write tcp 127.0.0.1:9090->127.0.0.1:36894: write: broken pipe"" (full log attached) on a large-scale 4.3 cluster (800 nodes) built using the 4.3.0-rc.3 payload. Judging from the logs, increasing scrape_timeout and scrape_interval might help avoid this at large node counts. We are not sure why federation is enabled by default.

Version-Release number of selected component (if applicable):
OpenShift build/payload version: 4.3.0-rc.3

How reproducible:
We have not seen this error before, because we had not previously scaled a cluster to a large number of nodes.

Steps to Reproduce:
1. Build a cluster using the 4.3.0-rc.3 payload.
2. Scale the cluster to a larger node count.
3. Look at the logs of one of the Prometheus replicas.

Actual results:
caller=federate.go:184 component=web msg="federation failed" err="write tcp 127.0.0.1:9090->127.0.0.1:36894: write: broken pipe"

Expected results:
Metrics are scraped within the scrape_timeout interval.

Additional info:
Full logs are at http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/4.3/logs/prometheus_scrape/
*** Bug 1804574 has been marked as a duplicate of this bug. ***
Tested with 4.3.0-0.nightly-2020-03-09-200240: {__name__="up"} has been removed, and "count:up0" and "count:up1" have been added to the telemeter-client deployment. The client can push metrics to telemeter-server.

# oc -n openshift-monitoring get deploy telemeter-client -oyaml | grep match
- --match={__name__="count:up0"}
- --match={__name__="count:up1"}
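
For anyone repeating this verification on another build, the check above can be scripted. The following is only a minimal sketch: the helper names `match_rules` and `verify` are hypothetical and not part of `oc` or telemeter-client, and it assumes the text passed in is the output of the grep command shown above, with no whitespace inside a selector.

```python
import re

# Hypothetical helper for illustration; not part of oc or telemeter-client.
# Assumes each selector contains no whitespace, as in the output above.
MATCH_RE = re.compile(r'--match=(\S+)')


def match_rules(text):
    """Return the --match selectors found in the grep output, in order."""
    return MATCH_RE.findall(text)


def verify(text):
    """True if {__name__="up"} is gone and both count:up rules are present."""
    rules = set(match_rules(text))
    removed_ok = '{__name__="up"}' not in rules
    added_ok = {'{__name__="count:up0"}', '{__name__="count:up1"}'} <= rules
    return removed_ok and added_ok


# Sample input copied from the command output in this comment.
sample = '''\
- --match={__name__="count:up0"}
- --match={__name__="count:up1"}
'''

if __name__ == "__main__":
    print(verify(sample))
```

In practice the output of the `oc ... | grep match` command would be passed to `verify` in place of `sample`.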
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0858