Description Naga Ravi Chaitanya Elluri
2020-01-24 02:50:59 UTC
Description of problem:
We observed many errors of the form caller=federate.go:184 component=web msg="federation failed" err="write tcp 127.0.0.1:9090->127.0.0.1:36894: write: broken pipe" (full log attached) on a large-scale 4.3 cluster (800 nodes) built using the 4.3.0-rc.3 payload. Judging by the logs, increasing scrape_timeout and scrape_interval might help avoid this at large node counts. We are not sure why federation is enabled by default.
Version-Release number of selected component (if applicable):
OpenShift build/payload version: 4.3.0-rc.3
How reproducible:
We have not seen this error before, but we had not previously scaled the cluster to a large number of nodes.
Steps to Reproduce:
1. Build a cluster using the 4.3.0-rc.3 payload.
2. Scale the cluster to a large node count.
3. Look at the logs of one of the Prometheus replicas.
Actual results:
caller=federate.go:184 component=web msg="federation failed" err="write tcp 127.0.0.1:9090->127.0.0.1:36894: write: broken pipe"
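To get a rough idea of how often this occurs in a replica log, something like the following could be used. It is only a minimal sketch, assuming the log lines are in the key=value (logfmt-style) form shown above; the helper names parse_logfmt and count_federation_failures are made up for illustration and are not part of Prometheus.

```python
import re
from collections import Counter

# Matches key=value and key="quoted value" pairs in a logfmt-style line.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line):
    """Return the key/value pairs of one log line as a dict (hypothetical helper)."""
    fields = {}
    for key, value in _PAIR.findall(line):
        # Strip the surrounding quotes from quoted values and unescape \" inside them.
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1].replace('\\"', '"')
        fields[key] = value
    return fields


def count_federation_failures(lines):
    """Count lines with msg="federation failed", grouped by their err text."""
    counts = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("msg") == "federation failed":
            counts[fields.get("err", "<no err field>")] += 1
    return counts


# The log line from this report, used as sample input.
SAMPLE = (
    'caller=federate.go:184 component=web msg="federation failed" '
    'err="write tcp 127.0.0.1:9090->127.0.0.1:36894: write: broken pipe"'
)

if __name__ == "__main__":
    failures = count_federation_failures([SAMPLE])
    print(sum(failures.values()), "federation failure(s)")
```

For a saved log, count_federation_failures(open("prometheus.log")) would work the same way, where the file name is only an example. Because the client port in the err text changes from one connection to the next, the total across all keys is likely more useful than the per-error counts.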
Expected results:
Metrics are scraped within the scrape_timeout interval.
Additional info:
Full logs are at http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/4.3/logs/prometheus_scrape/
Comment 1 Sergiusz Urbaniak
2020-01-27 14:41:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:0581