Bug 1794616

Summary:

Prometheus federation write is failing on a large scale cluster

Product:

OpenShift Container Platform

Reporter:

Naga Ravi Chaitanya Elluri <nelluri>

Component:

Monitoring

Assignee:

Sergiusz Urbaniak <surbania>

Status:

CLOSED ERRATA

QA Contact:

Junqi Zhao <juzhao>

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

4.3.0

CC:

alegrand, anpicker, erooth, kakkoyun, lcosic, lseelye, mloibl, pkrupa, surbania

Target Milestone:

---

Target Release:

4.4.0

Hardware:

Unspecified

OS:

Linux

Whiteboard:

aos-scalability-43

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2020-05-04 11:26:35 UTC

Type:

Bug

Regression:

---

Bug Depends On:

Bug Blocks:

1804574, 1805116

Attachments:

Description	Flags
Total time series in an 800 nodes cluster: 8301	none

Description Naga Ravi Chaitanya Elluri 2020-01-24 02:50:59 UTC

Description of problem:
We observed many errors like the following (full log attached) on a large-scale 4.3 cluster (800 nodes) built using the 4.3.0-rc.3 payload:
caller=federate.go:184 component=web msg="federation failed" err="write tcp 127.0.0.1:9090->127.0.0.1:36894: write: broken pipe"
Based on the logs, increasing scrape_timeout and scrape_interval might help avoid this at large node counts. We are also not sure why federation is enabled by default.

Version-Release number of selected component (if applicable):
OpenShift build/payload version: 4.3.0-rc.3

How reproducible:
We have not seen this error before, because we had not previously scaled a cluster to this many nodes.

Steps to Reproduce:
1. Build a cluster using the 4.3.0-rc.3 payload.
2. Scale the cluster to a large node count (800 nodes in our case).
3. Check the logs of one of the Prometheus replicas.
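
Step 3 can be scripted once a replica's log has been saved locally. The following is a minimal sketch, not part of the original report: it assumes the log lines have the same key="value" shape as the error quoted in this bug, and the function name count_federation_failures is hypothetical. It counts "federation failed" lines and groups them by the trailing error reason, so that differing client ports collapse into one bucket.

```python
import re
from collections import Counter

# Pattern derived from the log line quoted in this bug report.
FEDERATION_FAILED = re.compile(r'msg="federation failed"\s+err="(?P<err>[^"]*)"')


def count_federation_failures(lines):
    """Count 'federation failed' log lines, grouped by the trailing error reason."""
    reasons = Counter()
    for line in lines:
        match = FEDERATION_FAILED.search(line)
        if match:
            # Keep only the text after the last ': ' (e.g. 'broken pipe'),
            # so entries that differ only in the client port are grouped.
            reason = match.group("err").rsplit(": ", 1)[-1]
            reasons[reason] += 1
    return reasons


sample = [
    # The line reported in this bug.
    'caller=federate.go:184 component=web msg="federation failed" '
    'err="write tcp 127.0.0.1:9090->127.0.0.1:36894: write: broken pipe"',
    # Placeholder for any non-matching line; not real Prometheus output.
    "some other log line",
]

print(count_federation_failures(sample))
```

To run it against a saved log, pass an open file object instead of the sample list; the file name and how the log was collected are up to the reader.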

Actual results:
caller=federate.go:184 component=web msg="federation failed" err="write tcp 127.0.0.1:9090->127.0.0.1:36894: write: broken pipe"

Expected results:
Metrics are scraped within the scrape_timeout interval.

Additional info:
Full logs are at http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/4.3/logs/prometheus_scrape/

Comment 1 Sergiusz Urbaniak 2020-01-27 14:41:46 UTC

Created attachment 1655695
Total time series in an 800 nodes cluster: 8301

Comment 15 Junqi Zhao 2020-02-19 09:10:18 UTC

Tested with 4.4.0-0.nightly-2020-02-17-211020:
the "up" metric is replaced by "count:up0" and "count:up1", and "node_uname_info" is removed.
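
A check like the one above can be sketched in a few lines, assuming the federated metrics have been saved as Prometheus text exposition output. This is an illustration only, not the procedure QA used: the helper names federated_metric_names and check_fix are hypothetical, and the sample values are made up.

```python
def federated_metric_names(exposition_text):
    """Return the set of metric names found in Prometheus text exposition output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def check_fix(names):
    """True if the metric set matches the behaviour described in this comment."""
    expected_present = {"count:up0", "count:up1"}
    expected_absent = {"up", "node_uname_info"}
    return expected_present <= names and not (expected_absent & names)


# Illustrative samples; the values are invented for the example.
after_fix = "count:up0 3\ncount:up1 797\n"
before_fix = 'up{job="example"} 1\nnode_uname_info{nodename="example"} 1\n'

print(check_fix(federated_metric_names(after_fix)))
print(check_fix(federated_metric_names(before_fix)))
```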

Comment 17 errata-xmlrpc 2020-05-04 11:26:35 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581