Bug 1707681 - After the cluster is up for a few days it stops sending telemetry data
Summary: After the cluster is up for a few days it stops sending telemetry data
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.1.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1698816
Depends On: 1708648
Blocks:
 
Reported: 2019-05-08 05:17 UTC by Oved Ourfali
Modified: 2019-06-04 10:48 UTC
CC: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:48:34 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 None None None 2019-06-04 10:48:41 UTC
Github openshift telemeter pull 169 None closed Bug 1707681: pkg/forwarder: fix ever growing URI 2019-11-12 14:12:41 UTC

Description Oved Ourfali 2019-05-08 05:17:01 UTC
Description of problem:
I have a cluster that has been running for 3 days now. This morning it stopped sending telemetry data.

This isn't the first time I've heard about this issue, so I'm opening this bug to track it.
I'll keep the cluster up and add the connection details in a separate comment.

Comment 7 Junqi Zhao 2019-05-08 12:56:42 UTC
*** Bug 1698816 has been marked as a duplicate of this bug. ***

Comment 8 Ben Parees 2019-05-08 14:15:34 UTC
does restarting either the client pod or prometheus itself have any impact on this behavior?

Comment 9 Clayton Coleman 2019-05-08 14:49:56 UTC
This is a stop-ship bug.

Comment 10 Matthias Loibl 2019-05-08 15:25:35 UTC
(In reply to Ben Parees from comment #8)
> does restarting either the client pod or prometheus itself have any impact
> on this behavior?

If you restart the Telemeter client Pod, things unfortunately start working again.
Prometheus itself doesn't seem to be impacted by this.

Comment 11 Ben Parees 2019-05-08 15:28:51 UTC
> If you restart the Telemeter client Pod, things unfortunately start working again.

unfortunate?  I'd consider that fortunate... it narrows down where the issue is, and implies that the client is accumulating something over time that it should not be, which it is then passing on every request. (HTTP 431 is "Request Header Fields Too Large".)

it also gives us potential workarounds.

Comment 12 Ben Parees 2019-05-08 15:29:42 UTC
(e.g. just wrap the client start script with a supervisor script that kills the client every hour)

Comment 13 Frederic Branczyk 2019-05-08 15:55:24 UTC
We believe we have found the source of the issue. After an initial hunch prompted by the 431 error code, we started investigating the length of the federation query, which is statically configured and so should never change. However, after turning on some debug logging, we observed that the query does indeed grow over time, and eventually becomes so large that it triggers the 431 error. We have yet to locate the problem in the code, but this is a plausible explanation for the symptoms, so I'm confident we'll track it down soon. We'll keep everyone updated.

Comment 30 errata-xmlrpc 2019-06-04 10:48:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

