Bug 1780698 - DiskPressure due to 80 GB /var/lib/fluentd
Summary: DiskPressure due to 80 GB /var/lib/fluentd
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: 4.5.0
Assignee: Periklis Tsirakidis
QA Contact: Anping Li
URL:
Whiteboard:
: 1819964 (view as bug list)
Depends On:
Blocks: 1826861
TreeView+ depends on / blocked
 
Reported: 2019-12-06 16:41 UTC by Wolfgang Kulhanek
Modified: 2020-08-13 08:43 UTC (History)
14 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: On high incoming log rates Fluentd could possible flood the node's filesystem because the buffer queues were not limited. Consequence: A node under disk pressure could eventually crash the node and thus the applications would be rescheduled. Fix: The fluentd buffer queue per output is limited to a fixed amount of chunks (default 32). Result: Node disk pressure due to fluentd buffers should be omited by this fix.
Clone Of:
Environment:
Last Closed: 2020-07-13 17:12:38 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-logging-operator pull 479 None closed Bug 1780698: Fix disk pressure for fluentd buffer chunks 2020-09-16 06:05:40 UTC
Red Hat Product Errata RHBA-2020:2409 None None None 2020-07-13 17:13:05 UTC

Description Wolfgang Kulhanek 2019-12-06 16:41:17 UTC
Description of problem:
I have a cluster that was on 4.1.23 (upgraded continuously from about 4.1.4).

The upgrade to either 4.1.24 or 4.1.25 fails with a download error:

info: An upgrade is in progress. Unable to apply 4.1.25: could not download the update

Updates:

VERSION IMAGE
4.1.25  quay.io/openshift-release-dev/ocp-release@sha256:5f824fa3b3c44c6a78a5fc6a77a82edc47cf2b495bb6b2b31e3e0a4d3d77684b
4.1.24  quay.io/openshift-release-dev/ocp-release@sha256:6f87fb66dfa907db03981e69474ea3069deab66358c18d965f6331bd727ff23f

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.23    True        True          44h     Unable to apply 4.1.25: could not download the update

All Cluster Operators show on 4.1.23.

oc adm must-gather is at https://drive.google.com/open?id=18mqD6BpEwAQbApb1_5MD9j-cMBckBIA6

I can provide a kubeconfig as well to poke around there.

Comment 1 Scott Dodson 2019-12-06 17:35:51 UTC
(In reply to Wolfgang Kulhanek from comment #0)

> oc adm must-gather is at
> https://drive.google.com/open?id=18mqD6BpEwAQbApb1_5MD9j-cMBckBIA6

Can you open this up to redhat.com users, if necessary moving it to our account?

Comment 2 Wolfgang Kulhanek 2019-12-06 17:39:40 UTC
Done. I think. :-)

Comment 3 Abhinav Dahiya 2019-12-06 18:58:18 UTC
> namespaces/openshift-cluster-version/pods.yaml

all the pods that are used to download the release image are failing due to DiskPressure.

```
- apiVersion: v1
  kind: Pod
  metadata:
    creationTimestamp: 2019-12-06T15:31:03Z
    generateName: version-4.1.25-zrktd-
    labels:
      controller-uid: 3ff37e46-183d-11ea-a000-02156538c61c
      job-name: version-4.1.25-zrktd
    name: version-4.1.25-zrktd-s22mg
    namespace: openshift-cluster-version
    ownerReferences:
    - apiVersion: batch/v1
      blockOwnerDeletion: true
      controller: true
      kind: Job
      name: version-4.1.25-zrktd
      uid: 3ff37e46-183d-11ea-a000-02156538c61c
    resourceVersion: "104418852"
    selfLink: /api/v1/namespaces/openshift-cluster-version/pods/version-4.1.25-zrktd-s22mg
    uid: 69cb647a-183d-11ea-8d5e-0628dcb211ba
  spec:
    containers:
    - args:
      - -c
      - mkdir -p /etc/cvo/updatepayloads/fbbe7taUDiPpcveIOxoGqQ && mv /manifests /etc/cvo/updatepayloads/fbbe7taUDiPpcveIOxoGqQ/manifests
        && mkdir -p /etc/cvo/updatepayloads/fbbe7taUDiPpcveIOxoGqQ && mv /release-manifests
        /etc/cvo/updatepayloads/fbbe7taUDiPpcveIOxoGqQ/release-manifests
      command:
      - /bin/sh
      image: quay.io/openshift-release-dev/ocp-release@sha256:5f824fa3b3c44c6a78a5fc6a77a82edc47cf2b495bb6b2b31e3e0a4d3d77684b
      imagePullPolicy: IfNotPresent
      name: payload
      resources: {}
      securityContext:
        privileged: true
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /etc/cvo/updatepayloads
        name: payloads
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: default-token-82cr5
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    imagePullSecrets:
    - name: default-dockercfg-94rsb
    nodeName: ip-10-0-149-16.us-west-2.compute.internal
    nodeSelector:
      node-role.kubernetes.io/master: ""
    priority: 0
    restartPolicy: OnFailure
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    - key: node-role.kubernetes.io/master
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - hostPath:
        path: /etc/cvo/updatepayloads
        type: ""
      name: payloads
    - name: default-token-82cr5
      secret:
        defaultMode: 420
        secretName: default-token-82cr5
  status:
    message: 'Pod The node had condition: [DiskPressure]. '
    phase: Failed
    reason: Evicted
    startTime: 2019-12-06T15:31:03Z
```

And the node is actually under disk pressure

```
apiVersion: v1
kind: Node
metadata:
  annotations:
    machine.openshift.io/machine: openshift-machine-api/shared-rhpds-jhtvl-master-1
    machineconfiguration.openshift.io/currentConfig: rendered-master-ef6f6422b89eea806cfd0b4a3ed3312f
    machineconfiguration.openshift.io/desiredConfig: rendered-master-ef6f6422b89eea806cfd0b4a3ed3312f
    machineconfiguration.openshift.io/reason: ""
    machineconfiguration.openshift.io/state: Done
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: 2019-07-26T20:33:00Z
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: m5.2xlarge
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: us-west-2
    failure-domain.beta.kubernetes.io/zone: us-west-2b
    kubernetes.io/hostname: ip-10-0-149-16
    node-role.kubernetes.io/master: ""
    node.openshift.io/os_id: rhcos
    node.openshift.io/os_version: "4.1"
  name: ip-10-0-149-16.us-west-2.compute.internal
  resourceVersion: "104453442"
  selfLink: /api/v1/nodes/ip-10-0-149-16.us-west-2.compute.internal
  uid: 8f2b0fc8-afe4-11e9-901a-06682ae2372e
spec:
  providerID: aws:///us-west-2b/i-0a2892adb5f0216dd
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    timeAdded: 2019-12-02T16:49:29Z
```

is it possible to find out what's the taking up that space, and what's the disk size?

Comment 4 Wolfgang Kulhanek 2019-12-06 20:33:23 UTC
This is a regular IPI install on AWS. That node that these pods are running on is a master (master-1 to be exact). It's a regular m5.2xlarge instance.

ebs for root device, 120GB
Root disk shows 86% full

$ du -hc / | sort -rh | head -20

204G	total
204G	/
102G	/var
102G	/sysroot/ostree
102G	/sysroot
99G	/sysroot/ostree/deploy/rhcos/var
99G	/sysroot/ostree/deploy/rhcos
99G	/sysroot/ostree/deploy
96G	/var/lib
93G	/sysroot/ostree/deploy/rhcos/var/lib
89G	/var/lib/fluentd
89G	/sysroot/ostree/deploy/rhcos/var/lib/fluentd
64G	/var/lib/fluentd/buffer-output-es-config
64G	/sysroot/ostree/deploy/rhcos/var/lib/fluentd/buffer-output-es-config
25G	/var/lib/fluentd/es-retry
25G	/sysroot/ostree/deploy/rhcos/var/lib/fluentd/es-retry
6.3G	/var/log
6.3G	/var/lib/containers/storage/overlay
6.3G	/var/lib/containers/storage
6.3G	/var/lib/containers


Biggest offender seems to be:

/sysroot/ostree/deploy/rhcos/var/lib/fluentd

sh-4.4# du -hc . | sort -rh | head -20
89G	total
89G	.
64G	./buffer-output-es-config
25G	./es-retry

Any idea what to do here? Delete everything in that directory? Reboot the master (from the AWS Console)?

Comment 5 W. Trevor King 2019-12-07 02:17:23 UTC
Logging folks might know what to do about Fluentd disk space.

Comment 6 Jeff Cantrill 2019-12-07 16:21:03 UTC
(In reply to W. Trevor King from comment #5)
> Logging folks might know what to do about Fluentd disk space.

This is directly related to the following from fluent mounting the host and relying on it for buffering:

1. Fluent will buffer container logs to disk if it is unable to push them off the node.
2. Fluent writes its own logs to disk to avoid a feedback loop for ingestion.  These logs are supposed to be rotated by fluent's log library

The only work around at the moment is to truncate the directory from the node to remove the buffered logs

Comment 7 Scott Dodson 2019-12-09 13:32:48 UTC
Is this abnormally high levels of logging or would one expect this of all normal clusters?

Comment 9 Jeff Cantrill 2019-12-09 14:00:09 UTC
(In reply to Scott Dodson from comment #7)
> Is this abnormally high levels of logging or would one expect this of all
> normal clusters?

If ES is unable to ingest the logs or refuses and fluent is unable to push and clear it's buffers then I expect this can potentially occur.  High levels of logging (e.g. volume, DEBUG) could lead to this in any cluster.

Comment 10 Wolfgang Kulhanek 2019-12-09 14:07:01 UTC
So this is interesting. I checked this morning wanting to go and clean out the fluentd directories - and sure enough over the weekend my cluster had healed itself. The FluentD directories are now only 8k. And the upgrade actually ran through to 4.1.25.

But this still doesn't explain the 80G of FluentD data on Friday. Where did those come from? Where did they go? Is it possible the curator ran, cleaned up Elasticsearch and then fluentd was able to send things over? Although Elasticsearch hadn't shown any abnormalities last week (as in error messages). Elasticsearch has a 200Gi PVC. Although I see two of them - the newer one got created Friday and is the currently used. Time stamp is consistent with the upgrade.

This cluster isn't super busy quite yet. It's actually one of our less used clusters.

I'm at a bit of a loss as to what to tell you guys and where to go from here...

Comment 11 Jeff Cantrill 2019-12-09 16:55:35 UTC
(In reply to Wolfgang Kulhanek from comment #10)
> So this is interesting. I checked this morning wanting to go and clean out
> the fluentd directories - and sure enough over the weekend my cluster had
> healed itself. The FluentD directories are now only 8k. And the upgrade
> actually ran through to 4.1.25.

Fluent will push all its buffers if it is not receiving back pressure from ES.  Assuming the rate it pushes is faster then the node log rate it will eventually clear it's buffers.

> 
> But this still doesn't explain the 80G of FluentD data on Friday. Where did
> those come from? Where did they go? Is it possible the curator ran, cleaned
> up Elasticsearch and then fluentd was able to send things over? Although
> Elasticsearch hadn't shown any abnormalities last week (as in error
> messages). Elasticsearch has a 200Gi PVC. Although I see two of them - the
> newer one got created Friday and is the currently used. Time stamp is
> consistent with the upgrade.

We have an internal thread discussing this issue.  One point that was brought up related to fluent performance was that if ES is not properly sized it could have issues with ingestion and thus rejecting.  In regards to curation, maybe you were at the threshhold limits where ES will no longer index data thus causing backpressure.  Curator could indeed have removed indices, dropping you below the threshold and allowing ES to index more.  We are unable to know unless we had a snapshot of the environment or an understand of the state of the disk.

> 
> This cluster isn't super busy quite yet. It's actually one of our less used
> clusters.
> 
> I'm at a bit of a loss as to what to tell you guys and where to go from
> here...

Comment 12 Lukas Vlcek 2020-01-28 12:07:38 UTC
If this is standard OCP 4.x installation then a lot of internal ES cluster statistics are captured in Prometheus. We should definitely take advance of the data recorded in Prometheus to troubleshoot cases like this.

Comment 13 Jeff Cantrill 2020-01-31 17:04:41 UTC
Created https://bugzilla.redhat.com/show_bug.cgi?id=1796133 to address writing fluent's logs to disk.  This will provide partial relief of this issue.

Comment 14 Jeff Cantrill 2020-01-31 20:51:24 UTC
@Anping,

I believe by moving the container logs back to std out [1] and the correction [2] to how we calculate fluent's buffers, this issue would be resolved for 4.3 and 4.4.  For cluster logging that is only writing to our managed instance of Elasticsearch, I would expect fluent to stop reading files when it reached a total of 256M of buffers, which is the value we provide for FILE_BUFFER_LIMIT.  There is the main pipeline and the retry pipeline which equates to 2 outputs.  The buffers get chunked to 8M each so the total number between main and retry pipeline should result in 32 chunks max.  I would expect fluent to stop reading files when we get to this point.

If we can verify for these releases, we likely have to fix the calculation for 4.1 and prior where the number of outputs was not properly considered. 

I'm asking you verify we have resolved this issue in 4.4.  I think you could deploy a cluster logging instance and make sure logs flow and then remove the ES instance.  I expect once you get to the max number of chunks that fluent will stop collecting logs and you will see no additional buffers created.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1796133
[2] https://github.com/openshift/cluster-logging-operator/blob/master/files/fluentd/run.sh#L85-L130

Comment 16 Anping Li 2020-02-11 02:51:12 UTC
The buffer growed up to 7.7 G. fluentd image: openshift/ose-logging-fluentd:v4.4.0-202001310654


[anli@preserve-docker-slave 79939]$ oc get pod fluentd-z8fx7 -o yaml |grep BUFFER -A 2
    - name: BUFFER_QUEUE_LIMIT
      value: "32"
    - name: BUFFER_SIZE_LIMIT
      value: 8m
    - name: FILE_BUFFER_LIMIT
      value: 256Mi
    - name: FLUENTD_CPU_LIMIT

oc exec fluentd-z8fx7 -- du -sh /var/lib/fluentd/clo_default_output_es/
7.7G	/var/lib/fluentd/clo_default_output_es/

Comment 18 Jeff Cantrill 2020-04-02 15:28:56 UTC
*** Bug 1819964 has been marked as a duplicate of this bug. ***

Comment 19 Jeff Cantrill 2020-04-17 16:14:22 UTC
*** Bug 1824427 has been marked as a duplicate of this bug. ***

Comment 22 Anping Li 2020-05-06 02:22:55 UTC
Verified and pass using
image: registry.svc.ci.openshift.org/origin/4.5:logging-fluentd
imageID: registry.svc.ci.openshift.org/origin/4.5@sha256:2f6918a6ca84d954badd083dad704ae44f2896f5fa4321f9909cb6d8927c1223


Tue May  5 21:05:08 EDT 2020
218M	/var/lib/fluentd
206M	/var/lib/fluentd/clo_default_output_es
12M	/var/lib/fluentd/retry_clo_default_output_es
Tue May  5 21:06:10 EDT 2020
229M	/var/lib/fluentd
217M	/var/lib/fluentd/clo_default_output_es
12M	/var/lib/fluentd/retry_clo_default_output_es
Tue May  5 21:07:11 EDT 2020
262M	/var/lib/fluentd
250M	/var/lib/fluentd/clo_default_output_es
12M	/var/lib/fluentd/retry_clo_default_output_es
Tue May  5 21:08:13 EDT 2020
277M	/var/lib/fluentd
266M	/var/lib/fluentd/clo_default_output_es
12M	/var/lib/fluentd/retry_clo_default_output_es
Tue May  5 21:09:14 EDT 2020
279M	/var/lib/fluentd
268M	/var/lib/fluentd/clo_default_output_es
12M	/var/lib/fluentd/retry_clo_default_output_es
Tue May  5 21:31:50 EDT 2020
268M	/var/lib/fluentd
257M	/var/lib/fluentd/clo_default_output_es
12M	/var/lib/fluentd/retry_clo_default_output_es
Tue May  5 22:02:42 EDT 2020
268M	/var/lib/fluentd
257M	/var/lib/fluentd/clo_default_output_es
12M	/var/lib/fluentd/retry_clo_default_output_es
Tue May  5 22:16:05 EDT 2020
268M	/var/lib/fluentd
257M	/var/lib/fluentd/clo_default_output_es
12M	/var/lib/fluentd/retry_clo_default_output_es


 oc logs fluentd-ff8nn
2020-05-06 02:22:18 +0000 [warn]: [clo_default_output_es] failed to write data into buffer by buffer overflow action=:block

Comment 24 Jeff Cantrill 2020-06-24 12:05:38 UTC
(In reply to Oscar Casal Sanchez from comment #23)
> Hello,
> 
> Is going this issue to be backported to OCP 4.2? 

We will not be backporting this change to 4.2

Comment 29 errata-xmlrpc 2020-07-13 17:12:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409


Note You need to log in before you can comment on or make changes to this bug.