Summary: | Fluentd wedged nodes on 4.1 cluster | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Alex Krzos <akrzos> |
Component: | Logging | Assignee: | Jeff Cantrill <jcantril> |
Status: | CLOSED ERRATA | QA Contact: | Anping Li <anli> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.1.0 | CC: | aos-bugs, ewolinet, pweil, rmeggins |
Target Milestone: | --- | Flags: | akrzos: needinfo- |
Target Release: | 4.1.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | aos-scalability-41 | ||
Fixed In Version: | | Doc Type: | No Doc Update |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: |
Last Closed: | 2019-06-04 10:48:34 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Description Alex Krzos 2019-05-07 15:10:51 UTC
(In reply to Alex Krzos from comment #0)
> Description of problem:
> We ran scalability tests on 4.1 HTB4, then upgraded to a newer build. After
> the testing we scaled the cluster down from 250 nodes to 3 worker nodes. At
> this point fluentd ended up leaving many buffer files on the 3 worker nodes,
> which ended up evicting all pods and rendering the worker nodes useless even
> though they remained in Ready state.

Please provide more information here:

* Was fluentd still running on the nodes that had issues?
* Is Elasticsearch still running and functional on some 'infra' node?

Fluentd buffers files when it is unable to push to ES, or cannot push fast enough. It churns through these files and only cleans them up as it processes them.

Moved back to 4.1 as the bz merged into master.

The toleration node.kubernetes.io/disk-pressure has been added to fluentd when deployed using openshift/ose-cluster-logging-operator:v4.1.0-201905071513. But I don't think that can control the amount of space the fluentd buffer occupies; it just provides a way to prevent fluentd from wedging the node. To control the disk usage, it would be better to provide standalone storage for the fluentd buffer. (Illustrative sketches of the toleration and of fluentd's buffer limits follow the comments below.)

(In reply to Anping Li from comment #6)
> The toleration node.kubernetes.io/disk-pressure has been added to fluentd
> when deployed using openshift/ose-cluster-logging-operator:v4.1.0-201905071513.
> But I don't think that can control the amount of space the fluentd buffer
> occupies; it just provides a way to prevent fluentd from wedging the node.
> To control the disk usage, it would be better to provide standalone storage
> for the fluentd buffer.

Correct. Ideally there would be something like a "Stateful DaemonSet".

My theory is that a chain of events likely led to this issue. Cluster nodes were scaling down, which increased pod density and therefore log traffic on the remaining worker nodes. The collector was probably both unable to keep up with the traffic and unable to push to ES for whatever reason. Fluentd started to buffer logs, filling up the disk, and was eventually evicted because the disk was getting full.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
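
For context on the toleration discussed in comment #6: below is a minimal sketch of what such a toleration looks like in a collector DaemonSet. The DaemonSet name, namespace, labels, and image are placeholders for illustration, not taken from the cluster-logging-operator's actual manifest.

```yaml
# Minimal sketch only; names, labels, and image are placeholders,
# not the cluster-logging-operator's actual manifest.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: openshift-logging
spec:
  selector:
    matchLabels:
      component: fluentd
  template:
    metadata:
      labels:
        component: fluentd
    spec:
      tolerations:
      # Tolerate the NoSchedule taint the node controller applies when a node
      # reports disk pressure, so the collector pod can still be scheduled
      # on a node whose disk is already filling up.
      - key: node.kubernetes.io/disk-pressure
        operator: Exists
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluentd:placeholder  # placeholder; the operator manages the real image
```

As comment #6 notes, this only keeps the pod schedulable under disk pressure; it does nothing to bound how much disk the buffer itself consumes.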
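
On the point about controlling the buffer's disk footprint: fluentd's v1.x file buffer exposes size limits that cap on-disk usage. The sketch below assumes a generic Elasticsearch output (fluent-plugin-elasticsearch) and uses a hypothetical path and size values; none of these values come from the shipped cluster logging configuration.

```
# Illustrative fluentd buffer settings; path and sizes are assumptions.
<match **>
  @type elasticsearch
  # ... host/port/auth settings omitted ...
  <buffer>
    @type file
    path /var/lib/fluentd/buffer        # hypothetical buffer location
    chunk_limit_size 8m                 # size of each on-disk chunk file
    total_limit_size 256m               # hard cap on total buffer disk usage
    overflow_action drop_oldest_chunk   # shed old chunks instead of filling the disk
  </buffer>
</match>
```

With `overflow_action drop_oldest_chunk`, fluentd discards the oldest buffered data rather than growing until the node hits disk pressure; whether dropping logs is acceptable is a policy call, which is why the comments above lean toward standalone storage for the buffer instead.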