Bug 1707681
Summary: | After the cluster is up for a few days it stops sending telemetry data | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Oved Ourfali <oourfali>
Component: | Monitoring | Assignee: | Sergiusz Urbaniak <surbania>
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao>
Severity: | urgent | Docs Contact: |
Priority: | unspecified | |
Version: | 4.1.0 | CC: | anpicker, bparees, cben, ccoleman, eparis, erooth, fbranczy, juzhao, mloibl, nstielau, pkrupa, surbania, vlaad
Target Milestone: | --- | |
Target Release: | 4.1.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2019-06-04 10:48:34 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1708648 | |
Bug Blocks: | | |
Description Oved Ourfali 2019-05-08 05:17:01 UTC
*** Bug 1698816 has been marked as a duplicate of this bug. ***

does restarting either the client pod or prometheus itself have any impact on this behavior?

This is a stop ship bug

(In reply to Ben Parees from comment #8)
> does restarting either the client pod or prometheus itself have any impact
> on this behavior?

If you restart the Telemeter client Pod, things unfortunately start working again. Prometheus itself doesn't seem to be impacted by this.

> If you restart the Telemeter client Pod, things unfortunately start working again.
unfortunate? I'd consider that fortunate... it narrows where the issue is, and implies that the client is accumulating something over time that it should not be, which it is passing on every request. (HTTP 431 is "Request Header Fields Too Large")

it also gives us potential workarounds.
(e.g. just wrap the client start script with a supervisor script that kills the client every hour)

We believe we have found the source of the issue. After an initial hunch prompted by the 431 error code, we started investigating the length of the federation query, which is statically configured and so should never change. However, after turning on some debug logging, we observed that the query does indeed grow over time, eventually becoming large enough to trigger the 431 error. We have yet to locate the problem in the code, but this is a plausible explanation for the symptoms, so I'm confident we'll chase it down soon. We'll keep everyone updated.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
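The "supervisor script that kills the client every hour" workaround suggested in the comments above could be sketched roughly as follows. This is a minimal illustration, not the actual telemeter-client tooling; the function name, the command, and the interval are all hypothetical:

```python
import subprocess
import time

def supervise(client_cmd, interval_seconds, max_restarts=None):
    """Run client_cmd, kill it after interval_seconds, and start it again.

    Restarting discards whatever state the client has accumulated, so the
    request headers never grow large enough to trigger the 431 response.
    max_restarts is only there to make the loop bounded for testing;
    a real supervisor would loop forever (max_restarts=None).
    """
    restarts = 0
    while max_restarts is None or restarts < max_restarts:
        proc = subprocess.Popen(client_cmd)   # start the client
        time.sleep(interval_seconds)          # let it run for the interval
        proc.terminate()                      # kill it...
        proc.wait()                           # ...and reap it before looping
        restarts += 1
    return restarts

if __name__ == "__main__":
    # Hypothetical invocation: restart the client every hour.
    supervise(["/usr/bin/telemeter-client"], interval_seconds=3600)
```

In practice the same effect could be had with a liveness probe or a periodic pod restart; the point is simply that a fresh process starts with a fresh (short) query.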
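The root-cause comment above describes a statically configured federation query that nonetheless grows over time until the server answers 431. A small sketch of that failure mode (the selector, the URL shape, and the 8 KiB limit are illustrative assumptions, not values from the bug):

```python
from urllib.parse import urlencode

# Typical default limit on a request line / header block, e.g. 8 KiB.
HEADER_LIMIT = 8192

def request_length(selectors):
    """Length of a federation-style request path for the given match[] selectors."""
    query = urlencode([("match[]", s) for s in selectors])
    return len("/federate?") + len(query)

# A static configuration should produce a constant request length.
selectors = ['{__name__="up"}']
baseline = request_length(selectors)

# If a client bug re-appends the selector on every scrape cycle instead of
# reusing it, the request grows linearly until it exceeds the limit and the
# server starts rejecting it with 431 Request Header Fields Too Large.
cycles = 0
while request_length(selectors) <= HEADER_LIMIT:
    selectors.append(selectors[0])   # buggy accumulation per cycle
    cycles += 1

print("baseline length:", baseline)
print("cycles until the request exceeds the limit:", cycles)
```

This matches the observed symptoms: the client works for days (while the request is still under the limit) and then stops, and restarting the client resets the accumulated query so things work again.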