Description of problem:

I have a cluster that was on 4.1.23 (upgraded continuously from about 4.1.4). The upgrade to either 4.1.24 or 4.1.25 fails with a download error:

```
info: An upgrade is in progress. Unable to apply 4.1.25: could not download the update

Updates:

VERSION IMAGE
4.1.25  quay.io/openshift-release-dev/ocp-release@sha256:5f824fa3b3c44c6a78a5fc6a77a82edc47cf2b495bb6b2b31e3e0a4d3d77684b
4.1.24  quay.io/openshift-release-dev/ocp-release@sha256:6f87fb66dfa907db03981e69474ea3069deab66358c18d965f6331bd727ff23f

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.23    True        True          44h     Unable to apply 4.1.25: could not download the update
```

All Cluster Operators show 4.1.23.

oc adm must-gather output is at https://drive.google.com/open?id=18mqD6BpEwAQbApb1_5MD9j-cMBckBIA6

I can provide a kubeconfig as well to poke around there.
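For context, a minimal sketch of how one might inspect the failing download on a live cluster (the job and pod names below follow the pattern seen in the output quoted later in this bug):

```
# Sketch: the CVO runs a "version-<target>" Job in openshift-cluster-version
# to download the release payload; its pods show why the download failed.
oc adm upgrade                                    # reports "could not download the update"
oc -n openshift-cluster-version get jobs,pods
oc -n openshift-cluster-version describe pod version-4.1.25-zrktd-s22mg
```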
(In reply to Wolfgang Kulhanek from comment #0)
> oc adm must-gather is at
> https://drive.google.com/open?id=18mqD6BpEwAQbApb1_5MD9j-cMBckBIA6

Can you open this up to redhat.com users, if necessary moving it to our account?
Done. I think. :-)
> namespaces/openshift-cluster-version/pods.yaml

All the pods that are used to download the release image are failing due to DiskPressure:

```
- apiVersion: v1
  kind: Pod
  metadata:
    creationTimestamp: 2019-12-06T15:31:03Z
    generateName: version-4.1.25-zrktd-
    labels:
      controller-uid: 3ff37e46-183d-11ea-a000-02156538c61c
      job-name: version-4.1.25-zrktd
    name: version-4.1.25-zrktd-s22mg
    namespace: openshift-cluster-version
    ownerReferences:
    - apiVersion: batch/v1
      blockOwnerDeletion: true
      controller: true
      kind: Job
      name: version-4.1.25-zrktd
      uid: 3ff37e46-183d-11ea-a000-02156538c61c
    resourceVersion: "104418852"
    selfLink: /api/v1/namespaces/openshift-cluster-version/pods/version-4.1.25-zrktd-s22mg
    uid: 69cb647a-183d-11ea-8d5e-0628dcb211ba
  spec:
    containers:
    - args:
      - -c
      - mkdir -p /etc/cvo/updatepayloads/fbbe7taUDiPpcveIOxoGqQ && mv /manifests /etc/cvo/updatepayloads/fbbe7taUDiPpcveIOxoGqQ/manifests && mkdir -p /etc/cvo/updatepayloads/fbbe7taUDiPpcveIOxoGqQ && mv /release-manifests /etc/cvo/updatepayloads/fbbe7taUDiPpcveIOxoGqQ/release-manifests
      command:
      - /bin/sh
      image: quay.io/openshift-release-dev/ocp-release@sha256:5f824fa3b3c44c6a78a5fc6a77a82edc47cf2b495bb6b2b31e3e0a4d3d77684b
      imagePullPolicy: IfNotPresent
      name: payload
      resources: {}
      securityContext:
        privileged: true
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /etc/cvo/updatepayloads
        name: payloads
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: default-token-82cr5
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    imagePullSecrets:
    - name: default-dockercfg-94rsb
    nodeName: ip-10-0-149-16.us-west-2.compute.internal
    nodeSelector:
      node-role.kubernetes.io/master: ""
    priority: 0
    restartPolicy: OnFailure
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    - key: node-role.kubernetes.io/master
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - hostPath:
        path: /etc/cvo/updatepayloads
        type: ""
      name: payloads
    - name: default-token-82cr5
      secret:
        defaultMode: 420
        secretName: default-token-82cr5
  status:
    message: 'Pod The node had condition: [DiskPressure]. '
    phase: Failed
    reason: Evicted
    startTime: 2019-12-06T15:31:03Z
```

And the node is actually under disk pressure:

```
apiVersion: v1
kind: Node
metadata:
  annotations:
    machine.openshift.io/machine: openshift-machine-api/shared-rhpds-jhtvl-master-1
    machineconfiguration.openshift.io/currentConfig: rendered-master-ef6f6422b89eea806cfd0b4a3ed3312f
    machineconfiguration.openshift.io/desiredConfig: rendered-master-ef6f6422b89eea806cfd0b4a3ed3312f
    machineconfiguration.openshift.io/reason: ""
    machineconfiguration.openshift.io/state: Done
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: 2019-07-26T20:33:00Z
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: m5.2xlarge
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: us-west-2
    failure-domain.beta.kubernetes.io/zone: us-west-2b
    kubernetes.io/hostname: ip-10-0-149-16
    node-role.kubernetes.io/master: ""
    node.openshift.io/os_id: rhcos
    node.openshift.io/os_version: "4.1"
  name: ip-10-0-149-16.us-west-2.compute.internal
  resourceVersion: "104453442"
  selfLink: /api/v1/nodes/ip-10-0-149-16.us-west-2.compute.internal
  uid: 8f2b0fc8-afe4-11e9-901a-06682ae2372e
spec:
  providerID: aws:///us-west-2b/i-0a2892adb5f0216dd
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    timeAdded: 2019-12-02T16:49:29Z
```

Is it possible to find out what's taking up that space, and what's the disk size?
This is a regular IPI install on AWS. The node that these pods are running on is a master (master-1 to be exact). It's a regular m5.2xlarge instance with an EBS root device of 120GB.

The root disk shows 86% full:

```
$ du -hc / | sort -rh | head -20
204G    total
204G    /
102G    /var
102G    /sysroot/ostree
102G    /sysroot
99G     /sysroot/ostree/deploy/rhcos/var
99G     /sysroot/ostree/deploy/rhcos
99G     /sysroot/ostree/deploy
96G     /var/lib
93G     /sysroot/ostree/deploy/rhcos/var/lib
89G     /var/lib/fluentd
89G     /sysroot/ostree/deploy/rhcos/var/lib/fluentd
64G     /var/lib/fluentd/buffer-output-es-config
64G     /sysroot/ostree/deploy/rhcos/var/lib/fluentd/buffer-output-es-config
25G     /var/lib/fluentd/es-retry
25G     /sysroot/ostree/deploy/rhcos/var/lib/fluentd/es-retry
6.3G    /var/log
6.3G    /var/lib/containers/storage/overlay
6.3G    /var/lib/containers/storage
6.3G    /var/lib/containers
```

The biggest offender seems to be /sysroot/ostree/deploy/rhcos/var/lib/fluentd:

```
sh-4.4# du -hc . | sort -rh | head -20
89G     total
89G     .
64G     ./buffer-output-es-config
25G     ./es-retry
```

Any idea what to do here? Delete everything in that directory? Reboot the master (from the AWS console)?
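A side note on the numbers above: on RHCOS, /var and /sysroot/ostree/deploy/rhcos/var are the same data, so the du total double-counts, and the real consumer is the ~89G under /var/lib/fluentd on the 120GB disk. A hedged sketch of collecting the same information without SSH, via a debug pod on the node named above:

```
# Sketch: inspect the master's filesystem from a debug pod (no SSH needed).
oc debug node/ip-10-0-149-16.us-west-2.compute.internal
chroot /host
df -h /sysroot /var                                # real filesystem size and % used
du -hx --max-depth=2 /var/lib | sort -rh | head    # largest consumers under /var/lib
```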
Logging folks might know what to do about Fluentd disk space.
(In reply to W. Trevor King from comment #5)
> Logging folks might know what to do about Fluentd disk space.

This is directly related to the following, from fluent mounting the host and relying on it for buffering:

1. Fluent will buffer container logs to disk if it is unable to push them off the node.
2. Fluent writes its own logs to disk to avoid a feedback loop for ingestion. These logs are supposed to be rotated by fluent's log library.

The only workaround at the moment is to truncate the directory on the node to remove the buffered logs.
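A minimal sketch of that workaround, assuming the buffer paths seen above and that the collector on the node has been stopped first (for example by setting cluster logging to Unmanaged); removing these chunks discards any logs that have not yet been shipped to Elasticsearch:

```
# Sketch only: destructive -- buffered (unshipped) logs are lost.
# Open a shell on the affected master via a debug pod:
oc debug node/ip-10-0-149-16.us-west-2.compute.internal
chroot /host

# Truncate the fluentd buffer directories identified above:
rm -rf /var/lib/fluentd/buffer-output-es-config/* /var/lib/fluentd/es-retry/*
```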
Is this abnormally high levels of logging or would one expect this of all normal clusters?
(In reply to Scott Dodson from comment #7)
> Is this abnormally high levels of logging or would one expect this of all
> normal clusters?

If ES is unable to ingest the logs, or refuses them, and fluent is unable to push and clear its buffers, then I expect this can potentially occur. High levels of logging (e.g. volume, DEBUG) could lead to this in any cluster.
So this is interesting. I checked this morning, wanting to go and clean out the fluentd directories - and sure enough, over the weekend my cluster had healed itself. The FluentD directories are now only 8k, and the upgrade actually ran through to 4.1.25.

But this still doesn't explain the 80G of FluentD data on Friday. Where did those come from? Where did they go? Is it possible the curator ran, cleaned up Elasticsearch, and then fluentd was able to send things over? Elasticsearch hadn't shown any abnormalities last week (as in error messages). Elasticsearch has a 200Gi PVC - although I see two of them; the newer one was created Friday and is the one currently in use. The timestamp is consistent with the upgrade.

This cluster isn't super busy quite yet. It's actually one of our less used clusters.

I'm at a bit of a loss as to what to tell you guys and where to go from here...
(In reply to Wolfgang Kulhanek from comment #10)
> So this is interesting. I checked this morning, wanting to go and clean out
> the fluentd directories - and sure enough, over the weekend my cluster had
> healed itself. The FluentD directories are now only 8k, and the upgrade
> actually ran through to 4.1.25.

Fluent will push all of its buffers if it is not receiving back pressure from ES. Assuming the rate at which it pushes is faster than the node's log rate, it will eventually clear its buffers.

> But this still doesn't explain the 80G of FluentD data on Friday. Where did
> those come from? Where did they go? Is it possible the curator ran, cleaned
> up Elasticsearch, and then fluentd was able to send things over?
> Elasticsearch hadn't shown any abnormalities last week (as in error
> messages). Elasticsearch has a 200Gi PVC - although I see two of them; the
> newer one was created Friday and is the one currently in use. The timestamp
> is consistent with the upgrade.

We have an internal thread discussing this issue. One point that was brought up related to fluent performance is that if ES is not properly sized, it can have issues with ingestion and thus reject them. In regards to curation: maybe you were at the threshold at which ES will no longer index data, thus causing backpressure. Curator could indeed have removed indices, dropping you below the threshold and allowing ES to index more. We are unable to know without a snapshot of the environment or an understanding of the state of the disk.

> This cluster isn't super busy quite yet. It's actually one of our less used
> clusters.
>
> I'm at a bit of a loss as to what to tell you guys and where to go from
> here...
If this is a standard OCP 4.x installation, then a lot of internal ES cluster statistics are captured in Prometheus. We should definitely take advantage of the data recorded in Prometheus to troubleshoot cases like this.
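As an illustration, a hedged sketch of pulling the collector-side buffer statistics from the in-cluster Prometheus; it assumes the default openshift-monitoring route and that the fluent-plugin-prometheus metric fluentd_output_status_buffer_total_bytes is scraped (exact metric names may vary by release):

```
# Sketch: query the in-cluster Prometheus for fluentd's on-disk buffer size.
TOKEN=$(oc whoami -t)
PROM=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')
curl -skG -H "Authorization: Bearer $TOKEN" \
  "https://${PROM}/api/v1/query" \
  --data-urlencode 'query=fluentd_output_status_buffer_total_bytes'
```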
Created https://bugzilla.redhat.com/show_bug.cgi?id=1796133 to address writing fluent's logs to disk. This will provide partial relief of this issue.
@Anping, I believe that by moving the container logs back to stdout [1] and the correction [2] to how we calculate fluent's buffers, this issue would be resolved for 4.3 and 4.4.

For cluster logging that is only writing to our managed instance of Elasticsearch, I would expect fluent to stop reading files when it reaches a total of 256M of buffers, which is the value we provide for FILE_BUFFER_LIMIT. There is the main pipeline and the retry pipeline, which equates to 2 outputs. The buffers get chunked to 8M each, so the total between the main and retry pipelines should be 32 chunks max. I would expect fluent to stop reading files when we get to this point. If we can verify this for these releases, we likely have to fix the calculation for 4.1 and prior, where the number of outputs was not properly considered.

I'm asking you to verify that we have resolved this issue in 4.4. I think you could deploy a cluster logging instance, make sure logs flow, and then remove the ES instance. I expect that once you get to the max number of chunks, fluent will stop collecting logs and you will see no additional buffers created.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1796133
[2] https://github.com/openshift/cluster-logging-operator/blob/master/files/fluentd/run.sh#L85-L130
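A minimal sketch of that verification, using the same kind of commands as the results below; the pod name is a placeholder and the buffer directory names are taken from the later output, so they may differ by release:

```
# Sketch: confirm the configured limits (256Mi total / 8m chunks = 32 chunks max
# across the main and retry outputs), then watch the buffers with ES removed.
oc -n openshift-logging get pod <fluentd-pod> -o yaml | grep BUFFER -A 2

# The buffer directories should plateau around 256M instead of growing unbounded:
oc -n openshift-logging exec <fluentd-pod> -- du -sh \
  /var/lib/fluentd/clo_default_output_es /var/lib/fluentd/retry_clo_default_output_es
```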
The buffer grew up to 7.7G.

fluentd image: openshift/ose-logging-fluentd:v4.4.0-202001310654

```
[anli@preserve-docker-slave 79939]$ oc get pod fluentd-z8fx7 -o yaml | grep BUFFER -A 2
- name: BUFFER_QUEUE_LIMIT
  value: "32"
- name: BUFFER_SIZE_LIMIT
  value: 8m
- name: FILE_BUFFER_LIMIT
  value: 256Mi
- name: FLUENTD_CPU_LIMIT

$ oc exec fluentd-z8fx7 -- du -sh /var/lib/fluentd/clo_default_output_es/
7.7G    /var/lib/fluentd/clo_default_output_es/
```
*** Bug 1819964 has been marked as a duplicate of this bug. ***
*** Bug 1824427 has been marked as a duplicate of this bug. ***
Verified and passed using:

image: registry.svc.ci.openshift.org/origin/4.5:logging-fluentd
imageID: registry.svc.ci.openshift.org/origin/4.5@sha256:2f6918a6ca84d954badd083dad704ae44f2896f5fa4321f9909cb6d8927c1223

```
Tue May 5 21:05:08 EDT 2020
218M    /var/lib/fluentd
206M    /var/lib/fluentd/clo_default_output_es
12M     /var/lib/fluentd/retry_clo_default_output_es
Tue May 5 21:06:10 EDT 2020
229M    /var/lib/fluentd
217M    /var/lib/fluentd/clo_default_output_es
12M     /var/lib/fluentd/retry_clo_default_output_es
Tue May 5 21:07:11 EDT 2020
262M    /var/lib/fluentd
250M    /var/lib/fluentd/clo_default_output_es
12M     /var/lib/fluentd/retry_clo_default_output_es
Tue May 5 21:08:13 EDT 2020
277M    /var/lib/fluentd
266M    /var/lib/fluentd/clo_default_output_es
12M     /var/lib/fluentd/retry_clo_default_output_es
Tue May 5 21:09:14 EDT 2020
279M    /var/lib/fluentd
268M    /var/lib/fluentd/clo_default_output_es
12M     /var/lib/fluentd/retry_clo_default_output_es
Tue May 5 21:31:50 EDT 2020
268M    /var/lib/fluentd
257M    /var/lib/fluentd/clo_default_output_es
12M     /var/lib/fluentd/retry_clo_default_output_es
Tue May 5 22:02:42 EDT 2020
268M    /var/lib/fluentd
257M    /var/lib/fluentd/clo_default_output_es
12M     /var/lib/fluentd/retry_clo_default_output_es
Tue May 5 22:16:05 EDT 2020
268M    /var/lib/fluentd
257M    /var/lib/fluentd/clo_default_output_es
12M     /var/lib/fluentd/retry_clo_default_output_es

$ oc logs fluentd-ff8nn
2020-05-06 02:22:18 +0000 [warn]: [clo_default_output_es] failed to write data into buffer by buffer overflow action=:block
```
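For reference, a loop along these lines (a sketch; the pod name is a placeholder) produces the timestamped samples above:

```
# Sketch: sample the collector's on-disk buffer usage once a minute.
while true; do
  date
  oc -n openshift-logging exec <fluentd-pod> -- sh -c 'du -sh /var/lib/fluentd /var/lib/fluentd/*'
  sleep 60
done
```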
(In reply to Oscar Casal Sanchez from comment #23)
> Hello,
>
> Is this issue going to be backported to OCP 4.2?

We will not be backporting this change to 4.2.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409